## Ravish Chawla
### Topic Modeling with LDA and NMF algorithms on the ABC News Headlines Dataset
#### July 31, 2017
#### Minor changes for Topic Modeling Workshop at Northwestern University, August, 2019.
#### [https://github.com/nuitrcs/topic-modeling-workshop](https://github.com/nuitrcs/topic-modeling-workshop)


# Data imports

We import pandas, NumPy, and SciPy for data structures. We use gensim for LDA and scikit-learn (sklearn) for NMF.

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn
import sys
from nltk.corpus import stopwords
import nltk
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle

# Loading the data

We are using the ABC News headlines dataset. The data contain two columns
separated by a comma:

1. *publish_date*, the publication date for the article.
2. *headline_text*, the text of the headline.

We only need the headline_text column for analysis, so we ignore the publish_date column.

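Some lines are badly formatted, so we tell read_csv to skip them. In pandas 
versions before 1.3 this is done with error_bad_lines=False; from pandas 1.3 onward 
the equivalent argument is on_bad_lines='skip', and error_bad_lines was removed in 
pandas 2.0. Without this option, the read would fail the first time a data line 
with too many fields appeared.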

The full set of headlines has 1,041,793 headlines along with an initial line
containing the column names.  This takes a while to process, so for the workshop
we'll use a 5% random sample of 52,090 headlines.

In [None]:

# on_bad_lines='skip' requires pandas 1.3 or later; with older versions,
# pass error_bad_lines=False instead.
#
# To use all the headlines, comment out the next line which loads the sample,
# and uncomment the subsequent line which loads the entire set of headlines.

data = pd.read_csv('data/abcnews-headlines-five-percent-sample.csv', on_bad_lines='skip')
#data = pd.read_csv('data/abcnews-headlines-full.csv', on_bad_lines='skip')

# Both columns of the headlines data have now been read into a dataframe.
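
To see what skipping a badly formatted line means in practice, here is a minimal sketch on a made-up sample (not the workshop data). It assumes pandas 1.3 or later, where `on_bad_lines='skip'` takes the place of `error_bad_lines=False`. The second data line contains an unquoted comma, so it has three fields instead of two and is dropped.

```python
import io
import pandas as pd

# Hypothetical sample: the second data line has an extra, unquoted comma.
sample_csv = (
    "publish_date,headline_text\n"
    "20030219,council chief executive fails to secure position\n"
    "20030219,fire, flood warning issued\n"
    "20030220,police investigate missing man\n"
)

# Lines with too many fields are skipped instead of raising a parser error.
sample = pd.read_csv(io.StringIO(sample_csv), on_bad_lines='skip')
print(len(sample))  # number of rows that survived
```

Without the `on_bad_lines` argument, the same call would stop with a parser error at the malformed line.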

**We only need the headline_text column from the data, so we grab the headline_text entries and place them in the variable data_text.**

In [None]:
data_text = data[['headline_text']]
data_text = data_text.astype('str')

**Load the stopwords.   We'll use the default set from nltk.**

In [None]:
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))    #stop_words now contains the set of stop words.

**Remove the stopwords from each headline.  We need to tokenize the headline texts to do this.**

In [None]:
def remove_stopwords(headline):

    #split the headline on blanks, then drop stopwords and empty tokens.

    return [word for word in headline.strip().split(' ')
            if word not in stop_words and len(word) > 0]

#Replace the whole column in one step.  Assigning through chained indexing
#(data_text.iloc[idx]['headline_text'] = ...) may modify a temporary copy
#and leave data_text unchanged, depending on the pandas version.

data_text['headline_text'] = data_text['headline_text'].apply(remove_stopwords)

#print log to monitor output.

sys.stdout.write('c = ' + str(len(data_text)) + ' / ' + str(len(data_text)))
sys.stdout.write('  Completed tokenization and stopword removal.')

#At this point, data_text contains the tokenized headlines with stopwords removed.
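
As a quick illustration of the tokenization rule used above, the following sketch applies it to two made-up headlines with a tiny, hypothetical stopword set (the notebook itself uses the full nltk list). Note how the double blank in the first headline yields an empty token, which the length check removes.

```python
# Hypothetical stopword subset and headlines, for illustration only.
toy_stop_words = {'to', 'the', 'of', 'in', 'for'}
toy_headlines = ['police to probe  death of man', 'calls for the ban in schools']

# Split on single blanks, then drop stopwords and empty tokens.
toy_tokens = [
    [word for word in headline.strip().split(' ')
     if word not in toy_stop_words and len(word) > 0]
    for headline in toy_headlines
]
print(toy_tokens)
```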

**Get all the words into a single array for input to the Latent Dirichlet Allocation algorithm.**

In [None]:
train_headlines = data_text['headline_text'].tolist()

**We'll extract ten topics. You may want to experiment on your own with other values for the number of topics.**

In [None]:
num_topics = 10

# Latent Dirichlet Allocation (LDA)

We will use the gensim library for LDA.  First, obtain an id-to-word dictionary. 
Second, for each headline, use the dictionary to obtain a mapping from each word id 
to the number of times that word occurs in the headline. The LDA model uses both of these mappings.

In [None]:
id_to_word = gensim.corpora.Dictionary(train_headlines)

#id_to_word now contains the id-to-word dictionary.

**Generate a bag-of-words corpus from the dictionary.**

In [None]:
corpus = [id_to_word.doc2bow(text) for text in train_headlines]
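
The bag-of-words representation can be sketched without gensim. The example below is only an approximation of the idea: it assigns ids in first-seen order, which is an assumption of this sketch and may not match the ids that gensim's Dictionary assigns. Each document becomes a sorted list of (word id, count) pairs.

```python
from collections import Counter

# Made-up tokenized headlines.
toy_docs = [['police', 'probe', 'death'], ['police', 'police', 'strike']]

# Assign each distinct word an integer id in first-seen order.
# gensim's Dictionary may assign different ids; only the idea is the same.
word_to_id = {}
for doc in toy_docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow(doc):
    """Return sorted (word id, count) pairs for one tokenized document."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(toy_corpus)
```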

**Apply Latent Dirichlet Allocation to extract the topics in the headlines.**

In [None]:
lda = ldamodel.LdaModel(corpus=corpus, id2word=id_to_word, num_topics=num_topics)

#lda has the generated LDA model results.

**Collect top words for each extracted topic.**

Iterate over the number of topics, get the top twenty words (topn) in each extracted topic, and 
add them to a dataframe.  

In [None]:
def get_lda_topics(model, num_topics):
    word_dict = {}
    for i in range(num_topics):
        #show_topic returns (word, probability) pairs; keep only the words.
        words = model.show_topic(i, topn = 20)
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

**Display the dataframe to show the topic words.**

In [None]:
get_lda_topics(lda, num_topics)

# Non-negative matrix factorization (NMF)

For NMF, we need to obtain a design matrix. To improve results, we apply a 
TfIdf transformation to the counts.

**The count vectorizer needs a list of strings, not a list of token lists, so join the tokens of each headline back into a single string with a blank between words.**

In [None]:
train_headlines_sentences = [' '.join(text) for text in train_headlines]

# train_headlines_sentences now contains one string per headline, with the 
# remaining words of that headline separated by single spaces. 

# Uncomment the next line to display the list if you're interested.
# print( train_headlines_sentences )


**Get a counts design matrix, for which we use sklearn's CountVectorizer 
class.  The transformation returns a matrix of size (Documents x Features), 
where the value of a cell is the number of times the feature (word) 
appears in that document.**

**To reduce the size of the matrix, and to speed up computation, set the maximum 
feature size to 5000. That keeps only the 5000 words that occur most often 
across all the headlines.**

In [None]:
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
x_counts = vectorizer.fit_transform(train_headlines_sentences)

#x_counts contains the feature counts for each headline.
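
The effect of max_features can be sketched in plain Python. This is a simplified stand-in for CountVectorizer, not its implementation: it splits on whitespace, keeps the most frequent terms across the whole corpus, and orders the kept terms alphabetically as matrix columns. The sentences and the small max_features value are made up for illustration.

```python
from collections import Counter

toy_sentences = ['police probe death', 'police strike', 'police probe strike action']
max_features = 3  # hypothetical small value; the notebook uses 5000

# Total occurrences of each term across all documents.
term_totals = Counter(word for s in toy_sentences for word in s.split())

# Keep the most frequent terms, then order them alphabetically as columns.
vocabulary = sorted(term for term, _ in term_totals.most_common(max_features))

# One row per document, one column per kept term.
counts = [[s.split().count(term) for term in vocabulary] for s in toy_sentences]
print(vocabulary)
print(counts)
```

The rarer words "death" and "action" fall outside the vocabulary, so they contribute nothing to the matrix.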

**Next, create a TfIdf Transformer, and transform the counts with the model.**

In [None]:
transformer = TfidfTransformer(smooth_idf=False)
x_tfidf = transformer.fit_transform(x_counts)

#x_tfidf contains the TF/IDF transformed feature counts for the words in the headlines.
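
For reference, the weighting that TfidfTransformer applies with smooth_idf=False can be reproduced by hand. The sketch below follows the formula given in the scikit-learn documentation, idf(t) = ln(n / df(t)) + 1, followed by the default L2 row normalization; the count matrix is made up, and the sketch assumes every term occurs in at least one document.

```python
import numpy as np

# Made-up counts: 2 documents (rows) by 3 terms (columns).
toy_counts = np.array([[1.0, 1.0, 0.0],
                       [1.0, 0.0, 2.0]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)   # number of documents containing each term
idf = np.log(n_docs / doc_freq) + 1.0     # unsmoothed idf, as with smooth_idf=False

tfidf = toy_counts * idf                  # weight each count by its term's idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # default L2 row normalization
print(tfidf)
```

A term that appears in every document, such as the first column here, gets an idf of exactly 1, so it is never weighted down to zero.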

**Normalize the TfIdf values so that each row has unit L1 norm, that is, the values in each row sum to one.**

In [None]:
xtfidf_norm = normalize(x_tfidf, norm='l1', axis=1)
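
L1 normalization divides each row by the sum of the absolute values in that row. A minimal sketch on made-up values, assuming no row is entirely zero:

```python
import numpy as np

toy_tfidf = np.array([[3.0, 1.0, 0.0],
                      [0.5, 0.5, 1.0]])

# Divide every row by the sum of its absolute values.
row_sums = np.abs(toy_tfidf).sum(axis=1, keepdims=True)
toy_l1 = toy_tfidf / row_sums
print(toy_l1)
```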

**Create an NMF model, which we will fit to the TfIdf matrix.**

We initialize the factorization with NNDSVD (non-negative double singular value 
decomposition) rather than with random values.  

In [None]:
model = NMF(n_components=num_topics, init='nndsvd')

**Fit the NMF model to the TFIDF transformed headlines.**

In [None]:
model.fit(xtfidf_norm)

#The model now contains the results of the NMF topic extraction.
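
To make the factorization concrete, here is a small NumPy sketch of NMF. It is not what scikit-learn runs: it uses the classic multiplicative update rules with random initialization, whereas the notebook uses NNDSVD initialization and scikit-learn's default coordinate descent solver. The function name, the toy matrix, and the parameter values are all hypothetical. The goal is the same in both cases: find non-negative W (documents x topics) and H (topics x words) such that W times H approximates X.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Approximate X by W @ H with non-negative factors (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Made-up document-term matrix with two obvious word groups.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])

W, H = nmf_multiplicative(X, n_components=2)
error = np.linalg.norm(X - W @ H)  # Frobenius norm of the residual
print(error)
```

Each row of H plays the role of a topic: its largest entries identify the words that characterize that topic, which is how the topic tables in this notebook are built from model.components_.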

**As we did for LDA, we can display the words for each extracted topic in a table.**

In [None]:
def get_nmf_topics(model, n_top_words):
    
    #The word ids obtained need to be reverse-mapped to the words so we can 
    #print the topic names.
    #get_feature_names_out requires scikit-learn 1.0 or later; older versions
    #provide get_feature_names instead.
    
    feat_names = vectorizer.get_feature_names_out()
    
    word_dict = {}

    for i, topic in enumerate(model.components_):
        
        #For each topic, obtain the n_top_words largest values, and add the 
        #words they map to into the dictionary.

        words_ids = topic.argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    
    return pd.DataFrame(word_dict)
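
The argsort slice used to pick the top words can look cryptic. The sketch below, on made-up weights and words, shows that argsort()[:-n - 1:-1] returns the indices of the n largest values, largest first.

```python
import numpy as np

# Made-up topic weights and the words they belong to.
toy_weights = np.array([0.1, 0.7, 0.3, 0.9])
toy_words = ['police', 'court', 'man', 'fire']
n = 2

# argsort gives indices in ascending order of weight; the slice walks
# backwards from the end and stops after n entries.
top_ids = toy_weights.argsort()[:-n - 1:-1]
top_words = [toy_words[i] for i in top_ids]
print(top_words)
```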

**Display the dataframe to show the topic words.**

In [None]:
get_nmf_topics(model, 20)

**The following discussion applies to the topics produced by using the 5% sample of headlines.  If you are running the full set, the results may differ.**

The two tables above, one in each section, show the topics that LDA and NMF extracted from the headlines. There is some coherence among the words in each cluster. For example, Topic #01 in LDA shows words associated with potentially violent incidents, such as “police”, “suicide”, and “dying”. Other topics show different patterns. 

On the other hand, comparing the results of LDA to those of NMF suggests that NMF performs better. Topic #01 in NMF clusters many first names into the same category, along with the word “interview”. This type of headline is very common in news articles, with wording similar to “Interview with John Smith”, or “Interview with James C. on …”. 

We also see two NMF topics related to violence. First, Topic #04 focuses on police-related terms, such as “probe”, “missing”, “investigate”, “arrest”, and “body”. Second, Topic #02 focuses on assault terms, such as “murder”, “stabbing”, “guilty”, and “killed”. This is an interesting split because, although the terms in the two topics are closely related, one focuses more on police activity and the other more on criminal activity. Together with the first cluster, which captures first names, these results suggest that NMF (using TfIdf) performs much better than LDA on this set of texts.