# Topic modeling
Another way to compare documents is to extract the latent topics that group words within each document, and compare those distributions.

We'll continue on the topic of fake news with another dataset that has examples of both fake and real news articles, at a much larger scale than the previous data.

1. [Load data](#Load-data)
2. [Latent Semantic Analysis](#Latent-Semantic-Analysis)
3. [Latent Dirichlet Allocation](#Latent-Dirichlet-Allocation)
4. [Visualizing](#Visualizing-topic-models)
5. [Exploration](#Exploration)

### Load data
The data was originally released on Kaggle as a challenge to classify news articles as fake or real. The real/fake annotations are not well documented, but they seem to be based on trustworthy vs. untrustworthy sources.

In [None]:
!wget https://bitbucket.org/istewart6/core_tutorial_2020/raw/36e69f9d777319ae2cc94354cf57bd01f3e080b3/data.zip && unzip data.zip

--2020-12-10 05:15:31--  https://bitbucket.org/istewart6/core_tutorial_2020/raw/36e69f9d777319ae2cc94354cf57bd01f3e080b3/data.zip
Resolving bitbucket.org (bitbucket.org)... 104.192.141.1, 2406:da00:ff00::6b17:d1f5, 2406:da00:ff00::3403:4be7, ...
Connecting to bitbucket.org (bitbucket.org)|104.192.141.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 230909807 (220M) [application/zip]
Saving to: ‘data.zip’


2020-12-10 05:15:43 (40.0 MB/s) - ‘data.zip’ saved [230909807/230909807]

--2020-12-10 05:15:43--  http://./
Resolving . (.)... failed: No address associated with hostname.
wget: unable to resolve host address ‘.’
FINISHED --2020-12-10 05:15:43--
Total wall clock time: 12s
Downloaded: 1 files, 220M in 5.5s (40.0 MB/s)
Archive:  data.zip
   creating: data/
   creating: data/fakeNewsDatasets/
  inflating: data/fakeNewsDatasets/fake_news_small.tsv  
  inflating: data/fakeNewsDatasets/real_news_small.tsv  
   creating: data/fake_news_challenge/
  inflating: dat

In [None]:
!pip install stop_words
!pip install pyLDAvis

In [None]:
## data = fake news challenge
import pandas as pd
import numpy as np
fake_news_article_data = pd.read_csv('data/fake_news_challenge/Fake.csv', sep=',', index_col=False)
real_news_article_data = pd.read_csv('data/fake_news_challenge/True.csv', sep=',', index_col=False)
# get rid of duplicate articles
fake_news_article_data.drop_duplicates('text', inplace=True)

Before we try topic modeling, we have to convert the text to a usable format (document-term matrix, like before).
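
To make the target format concrete, here is a minimal sketch of a document-term matrix built by hand for a toy corpus. The two example sentences and the helper name `build_dtm` are invented for illustration; the real data is handled by scikit-learn's `CountVectorizer` in the next cell, which also applies stop-word and frequency filtering that this sketch skips.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # toy tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# hypothetical two-document corpus
toy_docs = ["the senate passed the bill", "the president signed the bill"]
toy_vocab, toy_dtm = build_dtm(toy_docs)
print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Each row is a document and each column is a vocabulary term, which is the same layout as the sparse matrix that `fit_transform` returns below.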

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import WordPunctTokenizer
from stop_words import get_stop_words
## combine text data, keep track of fake/real news indices
# pd.concat replaces Series.append, which was removed in pandas 2.0
combined_news_text = pd.concat([fake_news_article_data.loc[:, 'text'], real_news_article_data.loc[:, 'text']])
fake_news_text_indices = list(range(fake_news_article_data.shape[0]))
real_news_text_indices = list(range(fake_news_article_data.shape[0], combined_news_text.shape[0]))
## convert text to DTM
en_stops = get_stop_words('en')
tokenizer = WordPunctTokenizer()
cv = CountVectorizer(min_df=0.001, max_df=0.75, lowercase=True, 
                     ngram_range=(1,1), stop_words=en_stops, tokenizer=tokenizer.tokenize)
combined_news_text_dtm = cv.fit_transform(combined_news_text)
print(combined_news_text_dtm.shape)


(38872, 13559)


### Latent Semantic Analysis

For our first method, let's try Latent Semantic Analysis, which is a form of dimensionality reduction.
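
As a rough illustration of what "dimensionality reduction" means here, the sketch below runs a plain singular value decomposition on a tiny made-up matrix and keeps only the top `k` dimensions. The matrix values and the variable names are assumptions for the example; the cell below uses scikit-learn's `TruncatedSVD`, which is designed for large sparse matrices and may differ in sign and numerical detail.

```python
import numpy as np

# toy document-term matrix: 4 documents x 5 terms (made-up counts)
toy_dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

k = 2  # number of latent dimensions ("topics") to keep
U, S, Vt = np.linalg.svd(toy_dtm, full_matrices=False)
doc_topic = U[:, :k] * S[:k]             # documents in the k-dimensional topic space
topic_term = Vt[:k, :]                   # how strongly each term loads on each dimension
reconstruction = doc_topic @ topic_term  # best rank-k approximation of toy_dtm
print(doc_topic.shape, topic_term.shape)
print(np.round(reconstruction, 2))
```

The document coordinates can be negative, which is why they are not yet probabilities.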

In [None]:
## LSA
from sklearn.decomposition import TruncatedSVD
num_topics = 10
num_iter = 10
lsa_model = TruncatedSVD(n_components=num_topics, n_iter=num_iter, random_state=123)
combined_news_text_lsa_topics = lsa_model.fit_transform(combined_news_text_dtm)
print(combined_news_text_lsa_topics.shape)

(38872, 10)


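The LSA process outputs unbounded continuous values in $(-\infty, +\infty)$, which we need to convert to probabilities in $[0, 1]$. We can use the softmax function to convert the document-topic matrix to probabilities:

$$\text{softmax}(x_{i}) = \frac{e^{x_{i}}}{\sum_{j=1}^{K}e^{x_{j}}}$$

where $x_{i}$ is the $i$-th of the $K$ scores being normalized together. In the code below, the softmax is applied within each topic column (across documents), and each document's row is then rescaled so that its topic probabilities sum to 1.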
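
For a concrete feel of the formula, here is a small worked example on three made-up scores. The function name `softmax_1d` is hypothetical and the max-subtraction is a standard numerical-stability trick; this is only a sketch of the formula, not the scikit-learn routine used in the next cell.

```python
import numpy as np

def softmax_1d(scores):
    """Softmax over a 1-D array of K scores."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow; result is unchanged
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

toy_scores = np.array([2.0, -1.0, 0.5])  # arbitrary example scores
toy_probs = softmax_1d(toy_scores)
print(toy_probs, toy_probs.sum())
```

The largest score receives the largest probability, and because of the exponential, even modest gaps between scores turn into large gaps between probabilities.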

In [None]:
from sklearn.utils.extmath import softmax
from sklearn.preprocessing import StandardScaler
import numpy as np
# standardize each column to zero mean and unit variance
scaler = StandardScaler()
combined_news_text_lsa_topic_scores = scaler.fit_transform(combined_news_text_lsa_topics)
# soft-max per-column
combined_news_text_lsa_topic_probs = softmax(combined_news_text_lsa_topic_scores.T).T
# normalize per-row so that probabilities sum to 1
combined_news_text_lsa_topic_probs = combined_news_text_lsa_topic_probs / combined_news_text_lsa_topic_probs.sum(axis=1).reshape(-1,1)

What is the expected probability of a document being assigned to a topic?

In [None]:
combined_news_text_lsa_expected_topics = pd.Series(combined_news_text_lsa_topic_probs.mean(axis=0))
print(f'expected probability of topics = \n{combined_news_text_lsa_expected_topics}')

expected probability of topics = 
0    0.542500
1    0.000006
2    0.000006
3    0.000006
4    0.315447
5    0.000011
6    0.000009
7    0.133102
8    0.000013
9    0.008900
dtype: float64


It looks like the data is "dominated" by 3 topics with high probability.

To figure out what "topics" the model learned, let's look at the news articles with the highest probability for each topic.

We'll take the arg-max along each topic and print the text for the corresponding articles.
We'll look at the most likely topics (0, 4, 7) as a first pass.

In [None]:
def show_articles_with_highest_prob_per_topic(doc_topic_probs, doc_text, num_topics):
    topic_ids = list(range(num_topics))
    top_articles_per_topic = 10
    text_sample_len = 200
    for topic_id_i in topic_ids:
        print(f'processing topic {topic_id_i}')
        # get indices for articles with highest topic probability
        top_article_indices_i = np.argsort(doc_topic_probs[:, topic_id_i])[-top_articles_per_topic:]
        top_article_indices_i = list(reversed(top_article_indices_i))
        for index_j in top_article_indices_i:
            topic_prob_i_j = doc_topic_probs[index_j, topic_id_i]
            print(f'\tarticle {index_j} has P(topic)={topic_prob_i_j} with text = {doc_text.iloc[index_j][:text_sample_len]}')

In [None]:
show_articles_with_highest_prob_per_topic(combined_news_text_lsa_topic_probs, combined_news_text, num_topics)

processing topic 0
	article 9196 has P(topic)=0.9999984849945176 with text = With mainstream media and establishment politicians stacked against him from the moment he announced his run for the presidency, Donald J. Trump has been in an ongoing pitched battle to communicate hi
	article 17381 has P(topic)=0.9999967339861654 with text = Shawn Helton   21st Century WireGOP presidential frontrunner Donald Trump is a populist candidate among a bevy of warhawk rivals  yet many still wonder how the real estate mogul has marched virtually 
	article 12986 has P(topic)=0.999975002269979 with text = This is a must read for anyone who s undecided or plans on voting for a third party candidate It covers all the bases and cements for you the duty as an American to do what s best for our nation. If y
	article 16773 has P(topic)=0.9999597090174529 with text =  By ramping up US troop levels in Afghanistan, Trump is alienating many supporters. (Photo: DoD/USAF Tech Sgt Brigitte N Brantley. Source: Wikic

Looking at the article text qualitatively, we observe the following:

- Topic 0 includes major executive-branch issues such as U.S. president Trump's campaign and action in office.
- Topic 4 includes more subjective claims (`anti-American`, `whine`) and more extreme issues (`conspiracy`, `chaos`, `violence`).
- Topic 7 includes discussion of the 2016 election, particularly related to Clinton (`investigation`, `email`, `classified`).

Which topics are more prevalent in fake news versus real news?

In [None]:
fake_news_text_lsa_topic_probs = combined_news_text_lsa_topic_probs[fake_news_text_indices, :]
real_news_text_lsa_topic_probs = combined_news_text_lsa_topic_probs[real_news_text_indices, :]
fake_news_text_lsa_expected_topics = pd.Series(fake_news_text_lsa_topic_probs.mean(axis=0))
real_news_text_lsa_expected_topics = pd.Series(real_news_text_lsa_topic_probs.mean(axis=0))
print(f'expected probability of topics for fake news = \n{fake_news_text_lsa_expected_topics}')
print(f'expected probability of topics for real news = \n{real_news_text_lsa_expected_topics}')

expected probability of topics for fake news = 
0    0.502965
1    0.000014
2    0.000014
3    0.000014
4    0.364562
5    0.000024
6    0.000019
7    0.125921
8    0.000029
9    0.006437
dtype: float64
expected probability of topics for real news = 
0    5.747211e-01
1    4.791793e-50
2    1.157458e-21
3    1.558271e-08
4    2.754175e-01
5    4.933393e-16
6    4.747155e-12
7    1.389541e-01
8    6.871054e-07
9    1.090660e-02
dtype: float64


It looks like real news discusses topic 0 (possible criticism of Trump?) slightly more than fake news, while fake news discusses topic 4 (conspiracy theories?) slightly more than real news.

While this is a useful first pass on the data, it doesn't help us identify which words or phrases may differentiate fake news from real news. 

We'll move on to a more complicated method (Latent Dirichlet Allocation) that identifies latent topics from which words are "generated."
This will help us pull out specific words that characterize the topics.
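
To unpack what "generated" means, the following sketch simulates the generative story that LDA assumes: each topic is a distribution over words, each document is a mixture of topics, and each word is drawn by first picking a topic and then picking a word from it. The vocabulary, hyperparameter values, and variable names are all made up for illustration; training a real model (as in the next cell) runs this story in reverse to infer the distributions from observed text.

```python
import numpy as np

rng = np.random.default_rng(123)
vocab = ["tax", "vote", "court", "police", "trump"]  # made-up vocabulary
num_toy_topics, doc_len = 2, 8
alpha, eta = 0.5, 0.5  # Dirichlet hyperparameters (arbitrary values)

# each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet([eta] * len(vocab), size=num_toy_topics)
# the document has its own mixture over topics
doc_topic = rng.dirichlet([alpha] * num_toy_topics)

words = []
for _ in range(doc_len):
    z = rng.choice(num_toy_topics, p=doc_topic)  # pick a topic for this word slot
    w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
    words.append(vocab[w])
print(doc_topic, words)
```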

### Latent Dirichlet Allocation

In [None]:
## LDA
# get text tokens first using the CountVectorizer from earlier
combined_news_text_dtm_tokens = cv.inverse_transform(combined_news_text_dtm)
from gensim.corpora import Dictionary
lda_dict = Dictionary(combined_news_text_dtm_tokens)
combined_news_text_corpus = list(map(lambda x: lda_dict.doc2bow(x), combined_news_text_dtm_tokens))
# train model
from gensim.models import LdaModel
num_topics = 10
iterations = 50
# takes ~1-2 minutes to train on Colab
lda_model = LdaModel(corpus=combined_news_text_corpus, num_topics=num_topics, iterations=iterations, random_state=123)

Like before, let's look at the distribution of topics over all documents and get a sense of the articles that correspond to each topic.

In [None]:
def compute_lda_topic_probs(text_doc, model):
    doc_topics = model.get_document_topics(text_doc, minimum_probability=0.)
    # convert to probability array
    doc_topic_ids, doc_topic_probs = zip(*doc_topics)
    return doc_topic_probs
combined_news_text_lda_topic_probs = np.array(list(map(lambda x: compute_lda_topic_probs(x, lda_model), combined_news_text_corpus)))
combined_news_text_lda_topic_expected_prob = combined_news_text_lda_topic_probs.mean(axis=0)
print(f'expected value of LDA topics =\n{combined_news_text_lda_topic_expected_prob}')

expected value of LDA topics =
[0.07894564 0.09453259 0.09981273 0.06871721 0.0628916  0.12918827
 0.18516402 0.10444001 0.1084727  0.06783988]


In contrast to the LSA model, here we see a more even distribution of topics. Let's see which articles were most strongly associated with each topic.

In [None]:
show_articles_with_highest_prob_per_topic(combined_news_text_lda_topic_probs, combined_news_text, num_topics)

processing topic 0
	article 30244 has P(topic)=0.9859338998794556 with text = LONDON (Reuters) - Britain has made substantive changes to its proposed text for a deal with the European Union, the leader of Northern Ireland s Democratic Unionist Party said on Friday as Prime Mini
	article 36706 has P(topic)=0.9852412939071655 with text = WOLFENBUETTEL, Germany (Reuters) - German Foreign Minister Sigmar Gabriel on Saturday described British Prime Minister Theresa May s Brexit speech as  disappointing , saying it offered no concrete det
	article 31974 has P(topic)=0.9844797253608704 with text = GOTHENBURG (Reuters) - British Prime Minister Theresa May said on Friday she and fellow EU leaders agreed that Brexit divorce talks had made  good progress , but that more work was needed to allow the
	article 30057 has P(topic)=0.9844788312911987 with text = LONDON (Reuters) - Britain intends to prevent a hard border in Ireland after leaving the European Union whatever the outcome of talks with the

Restricting ourselves to five of the topics (topics 0, 3, 5, 6, 8), we see the following trends:

- Topic 0 includes international politics.
- Topic 3 includes immigration and travel.
- Topic 5 includes tax debates and financial negotiations in the U.S. legislative branch.
- Topic 6 includes opinion pieces concerning the actions of politicians (especially Donald Trump).
- Topic 8 includes election investigation cases.

Let's also compare the distribution of topics in each text category.

In [None]:
fake_news_text_lda_topic_probs = combined_news_text_lda_topic_probs[fake_news_text_indices, :]
real_news_text_lda_topic_probs = combined_news_text_lda_topic_probs[real_news_text_indices, :]
fake_news_text_lda_expected_topics = pd.Series(fake_news_text_lda_topic_probs.mean(axis=0))
real_news_text_lda_expected_topics = pd.Series(real_news_text_lda_topic_probs.mean(axis=0))
print(f'expected probability of topics for fake news = \n{fake_news_text_lda_expected_topics}')
print(f'expected probability of topics for real news = \n{real_news_text_lda_expected_topics}')

expected probability of topics for fake news = 
0    0.055831
1    0.020798
2    0.126744
3    0.050472
4    0.082015
5    0.044793
6    0.383350
7    0.072026
8    0.087591
9    0.076381
dtype: float32
expected probability of topics for real news = 
0    0.097784
1    0.154627
2    0.077864
3    0.083587
4    0.047305
5    0.197973
6    0.023637
7    0.130857
8    0.125492
9    0.060878
dtype: float32


Real news articles tend to have more representation for topics 0, 3, and 5 (more "standard" news topics concerning everyday affairs), while fake news articles have far more representation for topic 6 (Trump-centric opinion content). Topic 8 (election investigations) is, according to the values above, somewhat more prevalent in real news (0.125) than in fake news (0.088).
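
One quick way to rank these gaps is to subtract the two printed vectors and sort the result. The sketch below hard-codes the expected probabilities from the output above, so it only reflects that particular run; the variable names are made up for the example.

```python
# expected topic probabilities copied from the printed output above
fake_probs = [0.055831, 0.020798, 0.126744, 0.050472, 0.082015,
              0.044793, 0.383350, 0.072026, 0.087591, 0.076381]
real_probs = [0.097784, 0.154627, 0.077864, 0.083587, 0.047305,
              0.197973, 0.023637, 0.130857, 0.125492, 0.060878]

# positive difference = topic is more prevalent in fake news
diffs = [(topic, fake - real)
         for topic, (fake, real) in enumerate(zip(fake_probs, real_probs))]
diffs_sorted = sorted(diffs, key=lambda pair: pair[1], reverse=True)
for topic, diff in diffs_sorted:
    print(f"topic {topic}: fake - real = {diff:+.3f}")
```

Topics at the top of the list lean toward fake news and topics at the bottom lean toward real news.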

Now that we've established the high-level differences in topics between fake news and real news, let's look at the individual words that make up the topics.

Specifically, we're going to compute the probability of observing a word given a topic, using the parameters learned by the LDA model.

In [None]:
def show_top_words_all_topics(model, model_dict, num_topics, words_per_topic):
    topic_ids = list(range(num_topics))
    for topic_i in topic_ids:
        topic_word_id_scores_i = model.get_topic_terms(topic_i, topn=words_per_topic)
        topic_word_ids_i, topic_word_scores_i = zip(*topic_word_id_scores_i)
        # convert word ID to words
        topic_words_i = list(map(model_dict.get, topic_word_ids_i))
        print(f'topic {topic_i} has top words: \n\t{", ".join(topic_words_i)}')

In [None]:
words_per_topic = 20
show_top_words_all_topics(lda_model, lda_dict, num_topics, words_per_topic)

topic 0 has top words: 
	minister, party, european, will, union, eu, president, prime, country, parliament, leader, (, states, united, britain, political, government, told, also, one
topic 1 has top words: 
	reuters, (, military, u, minister, united, security, president, state, forces, told, foreign, north, region, war, will, nations, government, states, international
topic 2 has top words: 
	people, police, year, one, (, city, old, two, killed, 000, many, three, home, also, local, around, last, years, told, least
topic 3 has top words: 
	court, reuters, (, rights, government, people, authorities, law, police, ruling, case, state, last, also, two, accused, one, arrested, country, told
topic 4 has top words: 
	social, political, (, german, chancellor, angela, merkel, minister, berlin, people, germany, ankara, one, world, media, anti, country, right, will, also
topic 5 has top words: 
	reuters, (, president, will, u, percent, election, washington, “, donald, house, ’, trump, new, vote, t

- Topic 0 includes more international relations words (`european`, `union`, `parliament`).
- Topic 3 includes more legal words (`law`, `arrested`, `case`).
- Topic 5 includes more executive-branch words (`president`, `washington`).
- Topic 6 includes more president-specific words, and more subjective (?) words (`think`, `know`, `just`).
- Topic 8 includes words related to Russian affairs and the election investigation (`russia`, `investigation`).

What happens if we train separate topic models on real news and fake news? This could help highlight groups of words that are specific only to fake news or to real news, which may be "washed out" with the combined topic model.

In [None]:
num_topics = 10
iterations = 100
def train_lda_model_from_corpus(text_corpus, num_topics, iterations):
    # train an LDA model on a bag-of-words corpus (word IDs come from lda_dict)
    lda_model = LdaModel(text_corpus, num_topics=num_topics, iterations=iterations, random_state=123)
    return lda_model
# split the combined corpus into fake and real news using the stored indices
fake_news_text_corpus = list(map(lambda x: combined_news_text_corpus[x], fake_news_text_indices))
real_news_text_corpus = list(map(lambda x: combined_news_text_corpus[x], real_news_text_indices))
## train models
# this takes ~2-3 minutes on Colab
fake_news_lda_model = train_lda_model_from_corpus(fake_news_text_corpus, num_topics, iterations)
real_news_lda_model = train_lda_model_from_corpus(real_news_text_corpus, num_topics, iterations)

What are the top words captured per-topic from each model?

In [None]:
words_per_topic = 20
print('real news: top words per topic')
show_top_words_all_topics(real_news_lda_model, lda_dict, num_topics, words_per_topic)

real news: top words per topic
topic 0 has top words: 
	reuters, (, people, one, police, two, country, killed, year, 000, since, last, state, also, government, years, city, many, told, old
topic 1 has top words: 
	reuters, (, minister, government, told, will, prime, president, state, european, u, military, islamic, forces, al, also, statement, last, parliament, turkey
topic 2 has top words: 
	vote, party, election, majority, immigration, (, reuters, will, conservative, support, also, opposition, new, voters, votes, government, conservatives, democrats, country, year
topic 3 has top words: 
	reuters, (, president, told, u, government, also, leader, rodrigo, will, year, wednesday, people, former, philippines, last, first, party, reporters, one
topic 4 has top words: 
	reuters, (, beirut, u, president, trump, donald, washington, will, energy, global, house, agency, climate, oil, thursday, also, administration, “, environmental
topic 5 has top words: 
	percent, million, $, 1, year, (, bill

In [None]:
print('fake news: top words per topic')
show_top_words_all_topics(fake_news_lda_model, lda_dict, num_topics, words_per_topic)

fake news: top words per topic
topic 0 has top words: 
	:, will, t, people, can, one, via, states, america, like, just, president, now, american, country, us, republican, obama, also, support
topic 1 has top words: 
	:, (, new, also, will, million, one, 21st, t, 1, just, according, $, percent, people, news, 5, ?, can, even
topic 2 has top words: 
	:, t, one, ?, news, also, just, told, via, former, people, can, now, will, president, trump, even, (, attorney, clinton
topic 3 has top words: 
	:, t, ?, !, people, like, just, one, can, trump, don, /, re, video, via, know, america, get, us, watch
topic 4 has top words: 
	boiler, sexual, );, ;, document, connect, 1, id, script, return, (, [, 3, 0, version, =, function, net, cdata, :
topic 5 has top words: 
	youtu, moralists, evangelists, 21wire, :, philosophers, t, misguided, like, tune, even, people, will, can, just, acr, one, uninterruptible, hesher, savants
topic 6 has top words: 
	police, :, one, shooting, killed, officers, old, man, t, g

The real news articles include discussions of international relations (topic 1, topic 7); hedging/legal words to make claims "softer" (topic 8 `alleged`, `reported`); legal issues (topic 9); the energy industry (topic 4). In general, these are topics that I expect from a typical newspaper.

The fake news articles include more subjective words (topics 1 and 3 `just`, `know`, `even`); violence/fear issues (topic 6 `police`, `killed`); American identity (?) (topic 0 `america`, `people`, `republican`; topic 5 `americans`, `domestically`); discussions of left-wing politicians, presumably negative (topic 8 `obama`; topic 9 `hillary`).

### Visualizing topic models

One last step to explore: we can visualize the topics that the models have learned to understand the relationship between the topics.

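Some topics may be closer together in "space" than others, for instance topics that discuss different aspects of international relations.
[The pyLDAvis package](https://github.com/bmabey/pyLDAvis) visualizes the relationship between LDA topics by computing a distance between each pair of topics and projecting the topics to a shared 2-dimensional space. By default this projection uses principal coordinate analysis (a form of multidimensional scaling), which is why the plot axes are labeled PC1 and PC2.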
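
The idea of placing topics in a shared 2-dimensional space can be sketched in a few lines: measure how different each pair of topic-word distributions is, then find 2-D coordinates that approximately preserve those differences. The sketch below uses Jensen-Shannon divergence and classical multidimensional scaling on made-up distributions. It illustrates the general technique under those assumptions; the helper names are hypothetical and pyLDAvis's internal implementation may differ in detail.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pcoa_2d(dist):
    """Classical multidimensional scaling of a distance matrix to 2-D."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:2]                # indices of the two largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))

# made-up topic-word distributions: topics 0 and 1 are similar, topic 2 is not
toy_topics = np.array([
    [0.50, 0.30, 0.10, 0.10],
    [0.45, 0.35, 0.10, 0.10],
    [0.05, 0.05, 0.40, 0.50],
])
n_toy = toy_topics.shape[0]
toy_dist = np.array([[js_divergence(toy_topics[i], toy_topics[j])
                      for j in range(n_toy)] for i in range(n_toy)])
toy_coords = pcoa_2d(toy_dist)
print(np.round(toy_coords, 3))
```

In the resulting coordinates, the two similar topics should land close together and the dissimilar one far away, which is the pattern to look for in the interactive plots below.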

In [None]:
import pyLDAvis
pyLDAvis.enable_notebook()

In [None]:
## need to make separate dict/corpus for each corpus because of errors with viz
from gensim.corpora import Dictionary
def generate_dict_corpus_model_for_data(combined_data_tokens, data_indices, num_topics, iterations):
    data_tokens = list(map(lambda x: combined_data_tokens[x], data_indices))
    data_dict = Dictionary(data_tokens)
    data_corpus = list(map(lambda x: data_dict.doc2bow(x), data_tokens))
    lda_model = train_lda_model_from_corpus(data_corpus, num_topics, iterations)
    return data_dict, data_corpus, lda_model
num_topics = 10
iterations = 50
clean_fake_news_text_dict, clean_fake_news_text_corpus, clean_fake_news_lda_model = generate_dict_corpus_model_for_data(combined_news_text_dtm_tokens, 
                                                                                                                        fake_news_text_indices, 
                                                                                                                        num_topics, iterations)
clean_real_news_text_dict, clean_real_news_text_corpus, clean_real_news_lda_model = generate_dict_corpus_model_for_data(combined_news_text_dtm_tokens, 
                                                                                                                        real_news_text_indices, 
                                                                                                                        num_topics, iterations)


In [None]:
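## plot topics learned on real news
try:
    import pyLDAvis.gensim_models as gensimvis  # module name in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older releases
gensimvis.prepare(clean_real_news_lda_model, clean_real_news_text_corpus, clean_real_news_text_dict)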

The main dimensions for real news seem to be "domestic" vs. "international" (PC1) and "facts" vs. "subjective/controversy" (PC2).

In [None]:
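## plot topics learned on fake news
try:
    import pyLDAvis.gensim_models as gensimvis  # module name in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older releases
gensimvis.prepare(clean_fake_news_lda_model, clean_fake_news_text_corpus, clean_fake_news_text_dict)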

The main dimensions for fake news seem to be "news" vs. "opinion" (PC1) and "government" vs. "people" (PC2).

### Exploration
Now it's time for you to keep exploring what topic models can tell us about real and fake news.

Some ideas:
- We used word frequency to represent words when training the topic models, but you can try other metrics such as TF-IDF, which we saw before can up-weight rarer words. What happens if you re-train the topic model using another form of word frequency?
- You can change the number of topics learned by the model to include more or less detail that may reveal different "levels" of granularity. You may want to try using "coherence" as a metric to determine the number of topics that maximizes the similarity among words within the same topic. What broad or fine-grained differences can you find that differentiate real and fake news? 
- One way of reducing "overlap" among words within topics is to **stem** each word and convert it to a base form that is shared among different versions of the word (e.g. `dog` and `dogs` stemmed to `dog`). What happens if you stem the text before training the topic model?
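
As a starting point for the first idea, here is a minimal sketch of TF-IDF re-weighting applied to a toy count matrix. It uses the plain textbook formula $\text{idf} = \log(N / \text{df})$ on made-up counts; scikit-learn's `TfidfTransformer` uses a smoothed variant by default, so its numbers will differ. Keep in mind that LDA is formulated over word counts, so feeding it TF-IDF weights is a heuristic worth checking empirically.

```python
import numpy as np

# toy document-term counts: 3 documents x 4 terms (made-up numbers)
toy_counts = np.array([
    [3, 1, 0, 0],
    [2, 0, 2, 0],
    [4, 0, 0, 1],
], dtype=float)

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)  # number of documents containing each term
idf = np.log(n_docs / doc_freq)          # textbook IDF; a term in every document gets weight 0
toy_tfidf = toy_counts * idf             # broadcast the per-term weight across rows
print(np.round(idf, 3))
print(np.round(toy_tfidf, 3))
```

The first term appears in every document, so its weight drops to zero, while the rarer terms are up-weighted relative to it.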