# Topic modeling
Another way to compare documents is to extract the latent topics that group words within each document, and then compare the resulting topic distributions.

We'll continue on the topic of fake news with another dataset that has examples of both fake and real news articles, at a much larger scale than the previous data.

1. [Load data](#Load-data)
2. [Latent Semantic Analysis](#Latent-Semantic-Analysis)
3. [Latent Dirichlet Allocation](#Latent-Dirichlet-Allocation)
4. [Exploration](#Exploration)

### Load data

In [None]:
## data = fake news challenge
import pandas as pd
fake_news_article_data = pd.read_csv('data/fake_news_challenge/Fake.csv', sep=',', index_col=False)
real_news_article_data = pd.read_csv('data/fake_news_challenge/True.csv', sep=',', index_col=False)
# get rid of duplicate articles
fake_news_article_data.drop_duplicates('text', inplace=True)
real_news_article_data.drop_duplicates('text', inplace=True)
display(fake_news_article_data.loc[:, 'text'].head(10).values)

Before we try topic modeling, we have to convert the text to a usable format (document-term matrix, like before).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import WordPunctTokenizer
from stop_words import get_stop_words
## combine text data, keep track of fake/real news indices
# Series.append was removed in pandas 2.0, so concatenate with pd.concat instead
combined_news_text = pd.concat([fake_news_article_data.loc[:, 'text'], real_news_article_data.loc[:, 'text']])
fake_news_text_indices = list(range(fake_news_article_data.shape[0]))
real_news_text_indices = list(range(fake_news_article_data.shape[0], combined_news_text.shape[0]))
## convert text to DTM
en_stops = get_stop_words('en')
tokenizer = WordPunctTokenizer()
cv = CountVectorizer(min_df=0.001, max_df=0.75, lowercase=True, 
                     ngram_range=(1,1), stop_words=en_stops, tokenizer=tokenizer.tokenize)
combined_news_text_dtm = cv.fit_transform(combined_news_text)
print(combined_news_text_dtm.shape)

### Latent Semantic Analysis

For our first method, let's try Latent Semantic Analysis, which is a form of dimensionality reduction.

In [None]:
## LSA
from sklearn.decomposition import TruncatedSVD
num_topics = 10
num_iter = 10
lsa_model = TruncatedSVD(n_components=num_topics, n_iter=num_iter, random_state=123)
combined_news_text_lsa_topics = lsa_model.fit_transform(combined_news_text_dtm)
print(combined_news_text_lsa_topics.shape)
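
As a rough illustration of what the dimensionality reduction does, the sketch below runs `TruncatedSVD` on a small hand-made count matrix (the `toy_*` names and values are assumptions for illustration). It reduces 5 word columns to 2 latent dimensions and reports how much of the variance those dimensions retain.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# 6 invented documents x 5 invented words
toy_dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
    [1, 0, 0, 0, 2],
    [0, 1, 1, 0, 1],
], dtype=float)
toy_lsa = TruncatedSVD(n_components=2, n_iter=10, random_state=123)
# document-by-topic scores: one row per document, one column per latent dimension
toy_doc_topics = toy_lsa.fit_transform(toy_dtm)
print(toy_doc_topics.shape)
# topic-by-word loadings
print(toy_lsa.components_.shape)
# fraction of variance retained by each latent dimension
print(toy_lsa.explained_variance_ratio_)
```

The same `explained_variance_ratio_` attribute is available on `lsa_model`, which can help judge whether 10 dimensions capture enough of the data.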

LSA outputs unbounded real-valued scores in $(-\infty, +\infty)$, which we need to convert to probabilities in $[0, 1]$. We can standardize the scores, apply the softmax function along each topic dimension, and then normalize each document's row so that its topic probabilities sum to 1:

$$\text{softmax}(x_{i}) = \frac{e^{x_{i}}}{\sum_{j=1}^{K}e^{x_{j}}}$$

where $x_{i}$ is the score for one of $K$ topic dimensions.
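
As a worked instance of the formula, here is a minimal NumPy sketch (the helper `softmax_1d` and the scores are illustrative assumptions). Subtracting the maximum before exponentiating does not change the result but avoids overflow for large scores.

```python
import numpy as np

def softmax_1d(x):
    """Softmax over a 1-D array of scores."""
    # shift by the max for numerical stability; the ratio is unchanged
    shifted = x - np.max(x)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum()

# invented scores, for illustration only
toy_scores = np.array([2.0, 0.0, -1.0])
toy_probs = softmax_1d(toy_scores)
print(toy_probs)
print(toy_probs.sum())
```

Larger scores map to larger probabilities, every output lies strictly between 0 and 1, and the outputs sum to 1.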

In [None]:
from sklearn.utils.extmath import softmax
from sklearn.preprocessing import StandardScaler
import numpy as np
# standardize per-column scores to zero mean and unit variance
scaler = StandardScaler()
combined_news_text_lsa_topic_scores = scaler.fit_transform(combined_news_text_lsa_topics)
# soft-max per-column
combined_news_text_lsa_topic_probs = softmax(combined_news_text_lsa_topic_scores.T).T
# normalize per-row so that probabilities sum to 1
combined_news_text_lsa_topic_probs = combined_news_text_lsa_topic_probs / combined_news_text_lsa_topic_probs.sum(axis=1).reshape(-1,1)
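
The two normalization steps can be checked on a small random matrix. This is a sketch under assumed toy data (`toy_scores` is random, not the news data): after the per-column softmax each topic column sums to 1, and after the per-row normalization each document row sums to 1.

```python
import numpy as np
from sklearn.utils.extmath import softmax

rng = np.random.default_rng(0)
# 5 invented documents x 3 invented topics
toy_scores = rng.normal(size=(5, 3))
# sklearn's softmax works row-wise, so transpose to apply it to each topic column
toy_col_probs = softmax(toy_scores.T).T
print(toy_col_probs.sum(axis=0))
# rescale each document's row so that its topic probabilities sum to 1
toy_row_probs = toy_col_probs / toy_col_probs.sum(axis=1, keepdims=True)
print(toy_row_probs.sum(axis=1))
```

Only the final row-normalized matrix can be read as a per-document distribution over topics.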

What is the expected probability of a document being assigned to a topic?

In [None]:
combined_news_text_lsa_expected_topics = pd.Series(combined_news_text_lsa_topic_probs.mean(axis=0))
print(f'expected probability of topics = \n{combined_news_text_lsa_expected_topics}')

It looks like the data is "dominated" by 3 topics with high probability.

To figure out what "topics" the model learned, let's look at the news articles with the highest probability for each topic.

We'll take the arg-max along each topic and print the text for the corresponding articles.
We'll look at the most likely topics (0, 4, 7) as a first pass.

In [None]:
def show_articles_with_highest_prob_per_topic(doc_topic_probs, doc_text, num_topics):
    topic_ids = list(range(num_topics))
    top_articles_per_topic = 10
    text_sample_len = 200
    for topic_id_i in topic_ids:
        print(f'processing topic {topic_id_i}')
        # get indices for articles with highest topic probability
        top_article_indices_i = np.argsort(doc_topic_probs[:, topic_id_i])[-top_articles_per_topic:]
        top_article_indices_i = list(reversed(top_article_indices_i))
        for index_j in top_article_indices_i:
            topic_prob_i_j = doc_topic_probs[index_j, topic_id_i]
            print(f'\tarticle {index_j} has P(topic)={topic_prob_i_j} with text = {doc_text.iloc[index_j][:text_sample_len]}')

In [None]:
show_articles_with_highest_prob_per_topic(combined_news_text_lsa_topic_probs, combined_news_text, num_topics)

In [None]:
topic_ids = list(range(num_topics))
top_articles_per_topic = 10
for topic_id_i in topic_ids:
    print(f'processing topic {topic_id_i}')
    # get indices for articles with highest topic probability
    top_article_indices_i = np.argsort(combined_news_text_lsa_topic_probs[:, topic_id_i])[-top_articles_per_topic:]
    top_article_indices_i = list(reversed(top_article_indices_i))
    for index_j in top_article_indices_i:
        topic_prob_i_j = combined_news_text_lsa_topic_probs[index_j, topic_id_i]
        print(f'\tarticle {index_j} has P(topic)={topic_prob_i_j} with text = {combined_news_text.iloc[index_j][:200]}')

Looking at the article text qualitatively, we observe the following:

- Topic 0 includes major election issues such as U.S. president Trump's campaign and action in office.
- Topic 4 includes more subjective claims (`anti-American`, `whine`) and more extreme issues (`conspiracy`, `chaos`, `violence`).
- Topic 7 includes discussion of the 2016 election, particularly related to Clinton (`email`, `classified`).

Which topics are more prevalent in fake news versus real news?

In [None]:
fake_news_text_lsa_topic_probs = combined_news_text_lsa_topic_probs[fake_news_text_indices, :]
real_news_text_lsa_topic_probs = combined_news_text_lsa_topic_probs[real_news_text_indices, :]
fake_news_text_lsa_expected_topics = pd.Series(fake_news_text_lsa_topic_probs.mean(axis=0))
real_news_text_lsa_expected_topics = pd.Series(real_news_text_lsa_topic_probs.mean(axis=0))
print(f'expected probability of topics for fake news = \n{fake_news_text_lsa_expected_topics}')
print(f'expected probability of topics for real news = \n{real_news_text_lsa_expected_topics}')
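
Rather than comparing the two printed series by eye, one option is to subtract them and sort. The sketch below uses invented probabilities (`toy_fake` and `toy_real` are assumptions, not results from the data) to show the idea.

```python
import pandas as pd

# invented expected topic probabilities, for illustration only
toy_fake = pd.Series([0.30, 0.10, 0.60])
toy_real = pd.Series([0.45, 0.15, 0.40])
# positive values: topic is more prevalent in fake news; negative: in real news
toy_diff = (toy_fake - toy_real).sort_values(ascending=False)
print(toy_diff)
```

The same subtraction applied to `fake_news_text_lsa_expected_topics` and `real_news_text_lsa_expected_topics` would rank the learned topics by how strongly they lean toward one category.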

It looks like real news discusses topic 0 (possible criticism of Trump?) slightly more than fake news, while fake news discusses topic 4 (conspiracy theories?) slightly more than real news.

While this is a useful first pass on the data, it doesn't help us identify which words or phrases may differentiate fake news from real news. 

We'll move on to a more complicated method (Latent Dirichlet Allocation) that identifies latent topics from which words are "generated."
This will help us pull out specific words that characterize the topics.

### Latent Dirichlet Allocation

In [None]:
## LDA
from gensim.corpora import Dictionary
from gensim.models import LdaModel
# recover each document's tokens using the CountVectorizer from earlier
# (inverse_transform lists each term once per document, so within-document counts are not preserved)
combined_news_text_dtm_tokens = cv.inverse_transform(combined_news_text_dtm)
lda_dict = Dictionary(combined_news_text_dtm_tokens)
combined_news_text_corpus = [lda_dict.doc2bow(tokens) for tokens in combined_news_text_dtm_tokens]
# train model
num_topics = 10
iterations = 50
lda_model = LdaModel(corpus=combined_news_text_corpus, id2word=lda_dict, num_topics=num_topics, iterations=iterations)

Like before, let's look at the distribution of topics over all documents and get a sense of the articles that correspond to each topic.

In [None]:
def compute_lda_topic_probs(text_doc, model):
    doc_topics = model.get_document_topics(text_doc, minimum_probability=0.)
    # convert to probability array
    doc_topic_ids, doc_topic_probs = zip(*doc_topics)
    return doc_topic_probs
combined_news_text_lda_topic_probs = np.array(list(map(lambda x: compute_lda_topic_probs(x, lda_model), combined_news_text_corpus)))
combined_news_text_lda_topic_expected_prob = combined_news_text_lda_topic_probs.mean(axis=0)
print(f'expected value of LDA topics =\n{combined_news_text_lda_topic_expected_prob}')

In contrast to the LSA results, we see a more even distribution of topics. Let's see which articles were most strongly associated with each topic.

In [None]:
show_articles_with_highest_prob_per_topic(combined_news_text_lda_topic_probs, combined_news_text, num_topics)

Restricting ourselves to the top 5 most frequent topics in the data based on the probabilities above (topics 3, 8, 9, 1, 2), we see the following trends:

- Topic 1 includes U.S. election issues and general content concerning the president.
- Topic 2 includes disasters and violence, possibly fear-mongering.
- Topic 3 includes international politics.
- Topic 8 seems to include inflammatory and "alternative" news content (`hypocrites`, `trashing`).
- Topic 9 includes the politics around U.S. healthcare.

Let's also compare the distribution of topics in each text category.

In [None]:
fake_news_text_lda_topic_probs = combined_news_text_lda_topic_probs[fake_news_text_indices, :]
real_news_text_lda_topic_probs = combined_news_text_lda_topic_probs[real_news_text_indices, :]
fake_news_text_lda_expected_topics = pd.Series(fake_news_text_lda_topic_probs.mean(axis=0))
real_news_text_lda_expected_topics = pd.Series(real_news_text_lda_topic_probs.mean(axis=0))
print(f'expected probability of topics for fake news = \n{fake_news_text_lda_expected_topics}')
print(f'expected probability of topics for real news = \n{real_news_text_lda_expected_topics}')

Real news articles tend to have more representation for topics 3 and 9, while fake news articles have more representation for topics 1, 2 and 8, which makes sense given the more violent and "alternative" content included in those topics.

Now that we've established the high-level differences in topics between fake news and real news, let's look at the individual words that make up the topics.

Specifically, we're going to compute the probability of observing a word given a topic, using the parameters learned by the LDA model.
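
To picture what "probability of a word given a topic" means, here is a minimal sketch with made-up numbers (the vocabulary and weights are assumptions, not values from the trained model). Each topic is a row of non-negative word weights, and dividing by the row sum turns it into a distribution over words.

```python
import numpy as np

# invented vocabulary and topic-word weights, for illustration only
toy_vocab = ["election", "vote", "police", "protest"]
toy_topic_word_weights = np.array([
    [8.0, 6.0, 1.0, 1.0],
    [1.0, 1.0, 5.0, 9.0],
])
# normalize each topic's row so that it sums to 1: P(word | topic)
toy_word_given_topic = toy_topic_word_weights / toy_topic_word_weights.sum(axis=1, keepdims=True)
toy_top_words = []
for topic_id, word_probs in enumerate(toy_word_given_topic):
    # indices of the 2 most probable words, highest first
    top_word_ids = np.argsort(word_probs)[::-1][:2]
    top_words = [toy_vocab[word_id] for word_id in top_word_ids]
    toy_top_words.append(top_words)
    print(f'topic {topic_id} has top words: {", ".join(top_words)}')
```

Reading off the highest-probability words per row is the same operation that `get_topic_terms` performs on the trained LDA model.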

In [None]:
def show_top_words_all_topics(model, model_dict, num_topics, words_per_topic):
    topic_ids = list(range(num_topics))
    for topic_i in topic_ids:
        topic_word_id_scores_i = model.get_topic_terms(topic_i, topn=words_per_topic)
        topic_word_ids_i, topic_word_scores_i = zip(*topic_word_id_scores_i)
        # convert word ID to words
        topic_words_i = list(map(model_dict.get, topic_word_ids_i))
        print(f'topic {topic_i} has top words: \n\t{", ".join(topic_words_i)}')

In [None]:
words_per_topic = 20
show_top_words_all_topics(lda_model, lda_dict, num_topics, words_per_topic)

Looking at the top words confirms what we saw before: fake news articles tend to focus on election conflicts (topic 1), violence (topic 2), and possibly simpler or more engaging words typical of "opinion" pieces (topic 8).

What happens if we train separate topic models on real news and fake news? This could help highlight groups of words that are specific only to fake news or to real news, which may be "washed out" with the combined topic model.

In [None]:
num_topics = 10
iterations = 100
def train_lda_model_from_corpus(text_corpus, num_topics, iterations):
    """Train an LDA model on a bag-of-words corpus."""
    lda_model = LdaModel(corpus=text_corpus, num_topics=num_topics, iterations=iterations)
    return lda_model
# split the combined corpus into fake news and real news documents
fake_news_text_corpus = [combined_news_text_corpus[i] for i in fake_news_text_indices]
real_news_text_corpus = [combined_news_text_corpus[i] for i in real_news_text_indices]
## train models
fake_news_lda_model = train_lda_model_from_corpus(fake_news_text_corpus, num_topics, iterations)
real_news_lda_model = train_lda_model_from_corpus(real_news_text_corpus, num_topics, iterations)

What are the top words captured per-topic from each model?

In [None]:
words_per_topic = 20
print('real news: top words per topic')
show_top_words_all_topics(real_news_lda_model, lda_dict, num_topics, words_per_topic)
print('fake news: top words per topic')
show_top_words_all_topics(fake_news_lda_model, lda_dict, num_topics, words_per_topic)

The real news topics include concrete details and "normal" news items such as money (topic 1), immigration statistics (topic 5), and international diplomacy (topic 7).

The fake news topics include sub-discussions around Donald Trump (topic 2: Trump vs. Obama; topic 5: election results) and some topics related to social justice (topic 7: `black`, `white`; topic 8: `protesters`, `police`).

### Exploration
Now it's time for you to keep exploring what topic models can tell us about real and fake news.

Some ideas:
- We used word frequency to represent words when training the topic models, but you can try other metrics such as TF-IDF, which we saw before can up-weight rarer words. What happens if you re-train the topic model using another form of word frequency?
- You can change the number of topics learned by the model to include more or less detail that may reveal different "levels" of granularity. You may want to try using "coherence" as a metric to determine the number of topics that maximizes the similarity among words within the same topic. What broad or fine-grained differences can you find that differentiate real and fake news? 
- One way of reducing "overlap" among words within topics is to **stem** each word and convert it to a base form that is shared among different versions of the word (e.g. `dog` and `dogs` stemmed to `dog`). What happens if you stem the text before training the topic model?
- Some topics may be closer together in "space" than others, for instance topics that discuss different aspects of international relations. [This package](https://github.com/bmabey/pyLDAvis) visualizes the relationship between LDA topics by projecting the topics to a shared 2-dimensional space (by default via principal coordinate analysis, a form of multidimensional scaling). Can you find topics that are unexpectedly close, and do these topics indicate similarities or differences between real and fake news?
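
As a starting point for the TF-IDF idea, here is a minimal sketch that re-weights a small count matrix with scikit-learn's `TfidfTransformer` (the `toy_*` names and counts are illustrative assumptions). The first word appears in every document and the second in only one, so TF-IDF gives the rarer word a larger weight within the same document.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# 3 invented documents x 3 invented words; word 0 occurs in every document
toy_counts = csr_matrix(np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]))
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()
# in document 0, the rarer word (column 1) outweighs the ubiquitous word (column 0)
print(toy_tfidf)
```

The same transformer could be applied to `combined_news_text_dtm` before fitting `TruncatedSVD`. Feeding TF-IDF weights into LDA is a looser fit, since LDA is usually described in terms of word counts, so treat any such results as exploratory.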