## Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an unsupervised machine learning technique for discovering topics in collections of documents. It is based on the notion that documents about similar topics share groups of words. LDA finds topics by searching for groups of words that frequently occur together across many documents in the corpus. Based on how these word clusters occur in different texts, it models the probability that any given text pertains to each topic.

[LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) was first described in a [2003 paper](https://dl.acm.org/doi/10.5555/944919.944937).

LDA does not tell us what the topics are. It only generates clusters of words that are likely to occur together, and it tells us which articles have a high probability of being associated with a particular word cluster. It is up to the human researcher to determine whether these clusters are significant and what topics they represent.

The dataset for this project is from [Kaggle](https://www.kaggle.com/datasets/gauravduttakiit/npr-data?select=npr.csv). It is a dataset of nearly 12000 NPR news articles.

We use the Python tabular data library `pandas` to load the data. We can use the `.head()` method to preview the first few rows, and the built-in `len()` function to see how many rows we have. Note that it is a substantial dataset.

For this code to work, you will need an environment with `pandas` and `scikit-learn` installed.

In [98]:
import pandas as pd

In [99]:
npr = pd.read_csv('npr.csv')

In [100]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [101]:
len(npr)

11992

In [102]:
npr['Article'][4000][:500]

'The headline shocked the   world of the surface Navy: Seven sailors aboard the destroyer USS Fitzgerald were killed, and other crew members injured, when the warship collided with a cargo vessel off Japan. As the Navy family grieves, both it and the wider world are asking the same question: How did this happen? The short answer is that no one knows  —   yet. Official inquiries into what led up to the encounter could take months or more. The Navy and the U. S. Coast Guard both likely will eventua'

Now we import the `CountVectorizer` class from the scikit-learn library. `CountVectorizer` converts text data into numerical vectors. It counts the frequency of words in each document and converts the document data into a matrix in which each row is a document, each column is a word from the whole corpus, and each cell holds the number of times that word occurs in that document.

In [103]:
from sklearn.feature_extraction.text import CountVectorizer

We then create an instance of `CountVectorizer`, which we are calling `cv`, and set some optional arguments. The `max_df` value discards words that appear in more than the given proportion of documents, which removes some of the most common words. Here, we discard words that appear in more than 90% of the documents. The `min_df` value requires words to appear in at least the given number (or proportion) of documents. Here, we discard words that do not appear in at least two documents, which removes many rare terms and typos. Finally, we eliminate English stop words (the most common words in the language).

In [104]:
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

To generate the matrix, we run the `fit_transform` method. Here we are selecting just the `Article` column from the data frame. The variable name `dtm` stands for document term matrix. It's a matrix with every word in the corpus and its frequency in any given document. By looking at the `dtm` variable we can see that it is a sparse matrix.

In [105]:
dtm = cv.fit_transform(npr['Article'])

In [106]:
dtm

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3033388 stored elements and shape (11992, 54777)>

The "shape" property shows that our matrix has 11992 rows, one for each article, and 54777 columns, one for each unique word retained from the corpus.

Now, we import the `LatentDirichletAllocation` class from the scikit-learn library.

In [107]:
from sklearn.decomposition import LatentDirichletAllocation

The key parameter we can manipulate when we run an LDA algorithm is `n_components`. This represents how many topics the algorithm will generate. There is no right or wrong setting for this. It depends to an extent on the researcher's intuition regarding the source material and how fine-grained the researcher wishes the analysis to be. It is good practice to run the analysis with several different `n_components` settings to see how they affect the results.

Conceptually, LDA starts from a random assignment of words to topics and then iteratively refines the topic assignments using probability calculations until the topic representation stabilizes. (The scikit-learn implementation performs this refinement with variational Bayes inference rather than sampling.) Because the starting point is random, results can vary between runs. For reproducible results, pass a fixed seed through the `random_state` parameter.

Here, we instantiate the LatentDirichletAllocation class, creating an actual allocator object.

In [108]:
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

We can now run the `fit` method on the document term matrix to generate our allocation. This may take some time, depending on your computer's memory and processing specifics.

In [109]:
LDA.fit(dtm)
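
One rough way to compare several `n_components` settings, as suggested earlier, is to fit a model for each candidate and look at its perplexity through the `perplexity()` method of `LatentDirichletAllocation`. The sketch below does this on a tiny made-up corpus (`toy_docs` and the candidate values are assumptions for illustration). On real data you would pass the actual document term matrix and, ideally, score held-out documents.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only to keep the example fast
toy_docs = [
    "health insurance coverage costs",
    "insurance companies raise health costs",
    "voters election campaign president",
    "president campaign rally voters",
    "music album band concert",
    "band releases new album music",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

perplexities = {}
for k in (2, 3, 4):
    # A fixed random_state makes each run repeatable
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_dtm)
    perplexities[k] = model.perplexity(toy_dtm)
    print(f"n_components={k}: perplexity={perplexities[k]:.1f}")
```

Lower perplexity is nominally better, but it is only one signal. It does not replace reading the resulting word clusters and judging whether they form meaningful topics.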

Note that we can get the complete list of words out of the count vectorizer using the `get_feature_names_out()` method. When we called `fit_transform()` on our count vectorizer object, it acquired information about our data set, which we can now inspect.

In [110]:
len(cv.get_feature_names_out())

54777

Note that the list of names is not a Python list. It is a numpy array, a specialized sequence type provided by the `numpy` library. Numpy arrays are generally more efficient than normal Python lists, especially for numeric data.

In [111]:
type(cv.get_feature_names_out())

numpy.ndarray

Here is a bit of code that will retrieve a random word from the array every time the cell is run.

In [112]:
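import random
feature_names = cv.get_feature_names_out()
# randrange excludes the upper bound, so the index is always valid
random_word_id = random.randrange(len(feature_names))
feature_names[random_word_id]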

'disguise'

Now it's time to look at the topics generated by the algorithm. We can verify that the number of topics matches the number we specified. Note that the topics are stored not as sequences of words but as a numpy array of numbers. The `shape` property shows us that it is a matrix, that is, a two-dimensional array, with (in this case) seven rows, one per topic, each having 54777 columns, one per word.

In [113]:
len(LDA.components_)

7

In [114]:
type(LDA.components_)

numpy.ndarray

In [115]:
LDA.components_.shape

(7, 54777)

If we take a look at any one of these topics, we see it represented as a numpy array of numbers, one per word. These numbers are weights indicating how strongly each word is associated with the topic. They are not probabilities as they stand (note that some values are far greater than 1), but dividing a row by its sum yields the topic's probability distribution over words.

In [116]:
single_topic = LDA.components_[0]
single_topic

array([8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
       1.43006821e-01, 1.42902042e-01, 1.42861626e-01])
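
Since values such as 2380 are larger than 1, the rows of `components_` are not probabilities as printed. A minimal sketch of turning such weights into per-topic word probabilities by dividing each row by its sum, using a small made-up matrix (`toy_components` is an assumption standing in for `LDA.components_`):

```python
import numpy as np

# Hypothetical weights: 2 topics (rows) by 3 words (columns)
toy_components = np.array([
    [6.0, 3.0, 1.0],
    [1.0, 1.0, 2.0],
])

# Divide each row by its own sum so that every row adds up to 1
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs)
```

The ranking of words within a topic is the same before and after normalization, so the `argsort` approach used in this notebook works on either form.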

We can use the `.argsort()` method to get the indices of the words, ordered from the lowest to the highest weight in this topic.

In [117]:
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

The `argsort()` method returns the array of indices that would put the original array in ascending sorted order.

In [118]:
import numpy as np

For example, say we have a numpy array with the numbers 10 (at index 0), 200 (at index 1), and 1 (at index 2). In order to sort this, we would need the number at index 2 first, then the number at index 0, and finally the number at index 1. When we apply `argsort` to this array, we get the array `[2, 0, 1]`.

In [119]:
arr = np.array([10, 200, 1])

In [120]:
arr.argsort()

array([2, 0, 1])

After using `argsort`, the indices of the words with the highest weights for a given topic will be at the end of the array. We can use a negative slice to count from the end of the array. Here we get the indices of the top twenty words in the first topic, then loop over them. We use the `cv.get_feature_names_out()` method to get all the words, then use each index we got from `argsort` to look up the specific word at that index.

In [121]:
top_words = single_topic.argsort()[-20:]

In [122]:
for index in top_words:
    print(cv.get_feature_names_out()[index])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


We can see that this has something to do with health insurance, a very important topic at the time this dataset was generated, during the 2016 presidential election.
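
Because `argsort` sorts in ascending order, the list above runs from the weakest of the top twenty words to the strongest. A minimal sketch of listing words strongest-first by reversing the slice, using made-up words and weights (`toy_words` and `toy_weights` are illustrative and not taken from the model):

```python
import numpy as np

# Hypothetical vocabulary and topic weights
toy_words = np.array(['health', 'tax', 'music', 'care', 'school'])
toy_weights = np.array([9.0, 4.0, 0.5, 7.0, 1.0])

# Take the last three indices (highest weights), then reverse their order
top_indices = toy_weights.argsort()[-3:][::-1]
top_words_desc = toy_words[top_indices].tolist()
print(top_words_desc)
```

Indexing the word array with the whole index array also avoids an explicit loop.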

Here, we loop over the topics, getting the top twenty words associated with each topic.

In [123]:
for i, topic in enumerate(LDA.components_):
    print(f"THE TOP 20 WORDS FOR TOPIC # {i}")
    print([cv.get_feature_names_out()[index] for index in topic.argsort()[-20:]])
    print('\n')
    print('\n')

THE TOP 20 WORDS FOR TOPIC # 0
['president', 'state', 'tax', 'insurance', 'trump', 'companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




THE TOP 20 WORDS FOR TOPIC # 1
['white', 'according', 'attack', 'reported', 'war', 'military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




THE TOP 20 WORDS FOR TOPIC # 2
['little', 'know', 'don', 'year', 'make', 'way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




THE TOP 20 WORDS FOR TOPIC # 3
['world', 'research', 'university', 'percent', 'care', 'time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




THE TOP 20 WORDS FOR TOPIC # 4
['donald', 'political', 'states', 'law', 'just', 'voters', 'vote', 'el

Now, it might be useful to attach the topic numbers to each article in our data frame. Recall that we have our document term matrix stored in a variable called `dtm`, with 11992 rows, each having 54777 columns, each cell representing the frequency of the word (column) in that document (row). Recall also that each of the 11992 rows corresponds to a row in our original dataset, which is stored as a pandas data frame in a variable called `npr`.

In [124]:
dtm

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3033388 stored elements and shape (11992, 54777)>

In [125]:
npr

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."
...,...
11987,The number of law enforcement officers shot an...
11988,"Trump is busy these days with victory tours,..."
11989,It’s always interesting for the Goats and Soda...
11990,The election of Donald Trump was a surprise to...


Now, we can use the LDA object to run the `transform()` method on our document term matrix. This will transform the dtm into a matrix that still has 11992 rows, but now has seven columns, where each column represents a topic. The value in each cell represents the probability that the document represented by that row will pertain to any given topic. This operation may take some time depending on the computer you are using.

In [126]:
topic_results = LDA.transform(dtm)

In [127]:
topic_results.shape

(11992, 7)

If we look at any given row in this matrix, we will see a sequence of numbers, each representing the probability that the document belongs to the corresponding topic.

In [128]:
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

We want to get the index position of the topic to which the document has the highest probability of pertaining. We can get that index with the `argmax()` method.

In [129]:
topic_results[0].argmax()

np.int64(1)

Now we can create a new column, called "Topic", with the topic number. Here, `axis=1` means we want to find the highest value in each row, as the 1-axis is the horizontal axis.

In [130]:
npr['Topic'] = topic_results.argmax(axis=1)
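
The same `argmax(axis=1)` step can be tried on a small made-up matrix. The sketch below also stores the winning probability next to the topic number, which can help flag documents whose assignment is weak. The names `toy_topic_results`, `toy_df`, and the `Topic_Probability` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix: 3 documents (rows) by 3 topics (columns)
toy_topic_results = np.array([
    [0.02, 0.68, 0.30],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
])

toy_df = pd.DataFrame({'Article': ['doc a', 'doc b', 'doc c']})

# axis=1 looks across the columns of each row, giving one result per document
toy_df['Topic'] = toy_topic_results.argmax(axis=1)
toy_df['Topic_Probability'] = toy_topic_results.max(axis=1)
print(toy_df)
```

A document such as the first one, with 0.68 for one topic and 0.30 for another, is a reminder that a single label hides the fact that LDA treats documents as mixtures of topics.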

In [131]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4


## Non-negative Matrix Factorization

Another technique for topic modeling involves [non-negative matrix factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) (NMF). In this technique, we again have to decide how many topics we want to get out of the data. Then we will use the scikit-learn `NMF` class on a document term matrix. This will factor our matrix into two matrices, a document-topic matrix and a term-topic matrix. The document-topic matrix will have the topic weights for each document, while the term-topic matrix will have the word importance for each topic.

This time, instead of the `CountVectorizer`, we use the `TfidfVectorizer`, with similar parameters. LDA relied on word counts, so we needed the `CountVectorizer`. NMF, in contrast, works with coefficient values representing not the raw frequency, but the importance of a word in the text. For this reason, we use TF-IDF values instead of just TF.

In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [133]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [134]:
dtm = tfidf.fit_transform(npr['Article'])
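
To see how TF-IDF differs from raw counts, consider a made-up corpus in which one word appears in every document. In the sketch below (`toy_docs` and the other names are assumptions for illustration), "health" and "care" each occur once in the first document, so their raw counts are equal, but TF-IDF gives the rarer word a higher weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "health" is in every document, "care" in only one
toy_docs = [
    "health care costs",
    "health study results",
    "health insurance plan",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Weights for the first document, as a dense one-dimensional array
first_doc = toy_matrix[0].toarray().ravel()

# vocabulary_ maps each word to its column index
health_weight = first_doc[toy_tfidf.vocabulary_['health']]
care_weight = first_doc[toy_tfidf.vocabulary_['care']]
print(f"health: {health_weight:.3f}, care: {care_weight:.3f}")
```

The word shared by all documents is down-weighted because it does little to distinguish one document from another.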

In [135]:
dtm

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3033388 stored elements and shape (11992, 54777)>

Once we have our matrix, we can import the `NMF` class from scikit-learn.

In [136]:
from sklearn.decomposition import NMF

We instantiate our NMF model, giving it the number of topics we want to find, and optionally a random seed for reproducibility. We then run the `fit()` method on our document term matrix.

In [137]:
nmf_model = NMF(n_components=7, random_state=42)

In [138]:
nmf_model.fit(dtm)
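
The factorization described earlier can be sketched on a tiny made-up matrix. Here `V` stands in for the document term matrix, `W` for the document-topic matrix, and `H` for the matrix that scikit-learn stores in `components_` with topics as rows and terms as columns. All names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4-document by 4-term matrix with two obvious blocks
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)

# W: one row per document, one column per topic
W = toy_nmf.fit_transform(V)
# H: one row per topic, one column per term
H = toy_nmf.components_

print(W.shape, H.shape)
# The product of the two factors approximates the original matrix
print(np.round(W @ H, 1))
```

Both factors contain only non-negative values, which is what makes the weights readable as additive contributions of topics to documents and of words to topics.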

Just as was the case with our count vectorizer, when we ran the `fit_transform` method on our TFIDF object, it acquired the list of all the words in our document corpus.

In [139]:
tfidf.get_feature_names_out()[2300]

'albala'

We can use the same technique we used previously with LDA to extract the top 20 words most strongly associated with any given topic. We can also follow the same process to attach topic labels to each document.

In [140]:
for index, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 20 WORDS FOR TOPIC # {index}")
    print([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-20:]])
    print('\n')

THE TOP 20 WORDS FOR TOPIC # 0
['years', 'brain', 'researchers', 'university', 'scientists', 'new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 20 WORDS FOR TOPIC # 1
['intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 20 WORDS FOR TOPIC # 2
['insurers', 'federal', 'said', 'aca', 'repeal', 'senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 20 WORDS FOR TOPIC # 3
['killed', 'reported', 'military', 'justice', 'city', 'officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 20 WORDS FOR TOP

In [141]:
topic_results = nmf_model.transform(dtm)

In [142]:
topic_results[0].argmax()

np.int64(1)

In [143]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3])

In [144]:
npr['NMA_Topic'] = topic_results.argmax(axis=1)

In [145]:
npr

Unnamed: 0,Article,Topic,NMA_Topic
0,"In the Washington of 2016, even when the polic...",1,1
1,Donald Trump has used Twitter — his prefe...,1,1
2,Donald Trump is unabashedly praising Russian...,1,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,3
4,"From photography, illustration and video, to d...",2,6
...,...,...,...
11987,The number of law enforcement officers shot an...,1,3
11988,"Trump is busy these days with victory tours,...",4,1
11989,It’s always interesting for the Goats and Soda...,3,0
11990,The election of Donald Trump was a surprise to...,4,4


Based on our familiarity with the subject matter, we can assign our own labels to each topic, and assign these labels to the documents.

In [146]:
my_topic_dict = {
    0: 'Health Research',
    1: 'Russian Election Interference',
    2: 'Health Insurance',
    3: 'Middle East wars',
    4: 'Primary Elections',
    5: 'Books and Music',
    6: 'Education'
}

In [147]:
npr['Topic Label'] = npr['NMA_Topic'].map(my_topic_dict)

In [148]:
npr.head()

Unnamed: 0,Article,Topic,NMA_Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,1,Russian Election Interference
1,Donald Trump has used Twitter — his prefe...,1,1,Russian Election Interference
2,Donald Trump is unabashedly praising Russian...,1,1,Russian Election Interference
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,3,Middle East wars
4,"From photography, illustration and video, to d...",2,6,Education
