# Introduction to NLP: Topic Modeling

-----

When analyzing large text corpora, trends can appear. These trends can be the repeated use of common phrases or terms that are indicative of common underlying themes or topics. For example, books on programming might refer to themes such as human-computer interaction, optimization and performance, or identifying and removing error conditions. Finding these common topics can be important for a number of reasons. On the one hand, when the topics are completely unknown, they can be used to provide new insight into text documents. On the other hand, when the topics are partially or even completely known, computationally identified topics can provide deeper or more concise insight into the relationship between documents.

The process of identifying these common topics is known as topic modeling, which is generally a form of unsupervised learning. As a specific example, consider the [twenty newsgroup][tw] data that we have analyzed in scikit learn. While there are twenty different newsgroups, it turns out they can be grouped into six related categories: computers, sports, science, politics, religion, and miscellaneous. While we know these topics ahead of time (from the newsgroup titles), we can apply topic modeling to these data to identify the common words or phrases that define these common topics.

In the rest of this notebook, we explore the concept of topic modeling. First we will use the scikit learn library to perform topic modeling. We will introduce and use non-negative matrix factorization and Latent Dirichlet allocation. We apply topic modeling to a text classification problem, and also explore the terms that make up identified topics. Finally, we introduce the gensim library, which provides additional techniques for topic modeling.

-----

[tw]: http://qwone.com/~jason/20Newsgroups/

## Table of Contents

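[Non-Negative Matrix Factorization](#Non-Negative-Matrix-Factorization)

[Understanding Topic Terms](#Understanding-Topic-Terms)

[Topic-based Classification](#Topic-based-Classification)

[Latent Dirichlet allocation](#Latent-Dirichlet-allocation)

[Visualizing Topics](#Visualizing-Topics)

[Gensim](#Gensim)

[Latent Semantic Analysis](#Latent-Semantic-Analysis)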

-----

Before proceeding with the rest of this notebook, we first include the notebook setup code and we define our _home_ directory.

-----

In [1]:
# Set up Notebook
%matplotlib inline

# Standard imports
import numpy as np

In [2]:
# First we find our HOME directory
home_dir = !echo $HOME

# Define data directory
home = home_dir[0] +'/'

In [3]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We do this to ignore several specific Pandas warnings
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Load Dataset
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', 
                           subset='train', shuffle=True, random_state=23,
                           remove=('headers', 'footers', 'quotes'))

test = fetch_20newsgroups(data_home='/home/data_scientist/data/textdm', 
                          subset='test', shuffle=True, random_state=23,
                          remove=('headers', 'footers', 'quotes'))

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


In [5]:
# Use TF-IDF on newsgroup data.
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer(stop_words = 'english',
                     lowercase=True,
                     min_df=2,
                     max_features=5000)
                     
train_data = cv.fit_transform(train['data'])
test_data = cv.transform(test['data'])

-----

### Non-Negative Matrix Factorization

We can apply [non-negative matrix factorization][wnmf] (NMF) to compute
topics in a corpus. We start with a term-document matrix, which we
factor into a term-feature matrix and a feature-document matrix. The
latter matrix can be used to identify data clusters (or topics) in the
corpus. We demonstrate the use of NMF to perform topic modeling by using
the scikit learn library's [NMF implementation][sknmf].

-----

[wnmf]: https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
[sknmf]: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
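
To make the factorization idea concrete, the following minimal sketch applies the classic multiplicative update rules to a tiny non-negative matrix. It is an illustration only and not the solver used by scikit learn; the function name `nmf_multiplicative`, the toy matrix, and the iteration count are assumptions made for this example. Note that the toy matrix is laid out as documents by terms (the layout scikit learn expects), so `W` holds document-topic weights and `H` holds topic-term weights.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=23, eps=1e-10):
    """Factor a non-negative matrix V (documents x terms) into W @ H."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Random non-negative starting points for both factors
    W = rng.random((n_docs, n_components)) + eps
    H = rng.random((n_components, n_terms)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Toy corpus: two documents use the first two terms, two use the last two
V = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 4., 1.],
              [0., 0., 1., 4.]])

W, H, errors = nmf_multiplicative(V, n_components=2)
print(np.round(W @ H, 2))
print('first error: {0:.3f}, final error: {1:.3f}'.format(errors[0], errors[-1]))
```

Each row of `H` can be read as a topic (a set of term weights), and each row of `W` shows how strongly a document draws on each topic.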

In [6]:
# Compute topics by using NMF
from sklearn.decomposition import NMF

num_topics = 6
nmf = NMF(n_components = num_topics, max_iter = 1000).fit(train_data)

In [7]:
from helper_code import tpterms as tp

nmf_topics = tp.get_topics(cv, nmf)

ModuleNotFoundError: No module named 'helper_code'
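
The `helper_code` package is not available in this text, so the cell above fails. As a hedged stand-in, the sketch below shows one way a `get_topics` function could be written. The real helper's behavior is unknown; the names `top_terms` and `get_topics`, the `top_k` parameter, and the comma-separated output format are assumptions (the format is chosen because later cells call `.replace(',', '')` and `', '.join(...)` on the topics).

```python
def top_terms(components, feature_names, top_k=10):
    """Return one comma-separated string of top terms per topic."""
    topics = []
    for weights in components:
        # Indices of the largest weights, in descending order of weight
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append(', '.join(feature_names[i] for i in order[:top_k]))
    return topics

def get_topics(vectorizer, model, top_k=10):
    """Hypothetical stand-in for tp.get_topics (not the original helper)."""
    # get_feature_names_out() exists in recent scikit learn releases;
    # older releases expose get_feature_names() instead.
    names = vectorizer.get_feature_names_out()
    return top_terms(model.components_, names, top_k)

# Toy check with hand-written weights (2 topics, 4 terms)
demo = top_terms([[0.1, 0.9, 0.0, 0.5], [0.7, 0.0, 0.8, 0.1]],
                 ['disk', 'game', 'drive', 'team'], top_k=2)
print(demo)
```

With a definition like this in scope, `nmf_topics = get_topics(cv, nmf)` would take the place of the `tp.get_topics(cv, nmf)` call.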

-----

### Understanding Topic Terms

We can explore the terms that are important for each topic by creating a
DataFrame to map our topic terms to the original twenty newsgroups. We
demonstrate this below by first normalizing the transformed data to have
unit probability. We use these data to create the DataFrame and group
the resulting rows by the associated newsgroup as shown below.
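
The normalize-then-group step can be sketched on a toy array first. The values and labels below are made up for illustration, and the NumPy calls simply mimic the idea behind `normalize(..., norm='l1')` and `groupby('label').mean()` rather than reproducing those functions.

```python
import numpy as np

# Toy document-topic weights: 4 documents, 2 topics (made-up values)
td_demo = np.array([[2.0, 2.0],
                    [1.0, 3.0],
                    [0.0, 0.0],   # a document with no topic weight
                    [4.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# L1 normalization: divide each row by its sum, leaving all-zero rows at zero
row_sums = td_demo.sum(axis=1, keepdims=True)
td_demo_norm = np.divide(td_demo, row_sums,
                         out=np.zeros_like(td_demo), where=row_sums > 0)

# Average topic share per label
group_means = np.array([td_demo_norm[labels == k].mean(axis=0)
                        for k in np.unique(labels)])
print(group_means)
```

Each row of `group_means` shows how much of each topic an average document with that label contains.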

-----

In [None]:
# We transform and normalize the data by using the l1 norm,
# so that each document's topic probabilities sum to unity.

from sklearn.preprocessing import normalize

td = nmf.transform(train_data)
td_norm = normalize(td, norm='l1', axis=1)

In [None]:
# We use a DataFrame to simplify the collecting of the data for display.

df = pd.DataFrame(td_norm, columns=nmf_topics)
df.fillna(value=0, inplace=True)
df['label'] = pd.Series(train['target'])

In [None]:
# Now group and add human names for the labels
df_lbl = df.groupby('label').mean()
df_lbl['Names'] = pd.Series(train['target_names'], dtype="category")

# Now display the grouped data
df_lbl

-----
### Topic-based Classification

If documents are composed of topics, we can leverage defined topics to
classify new documents based on the topics that are assigned to each new
document. In the following code cells, we first train a Naive Bayes
classifier on the topics in the training data sample of the twenty
newsgroup data set. We compute the topics, by using the previously
created NMF model, for the test data and compute classifications from
these topic models. Finally, the resulting classification report and
confusion matrix are shown to demonstrate the quality of this
classification method.

-----

In [None]:
# Build classifier from topics.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(td, train['target'])

# Apply classifier to blind test data
ts_preds = clf.predict(nmf.transform(test_data))

from sklearn import metrics

print(metrics.classification_report(test['target'], ts_preds,
    target_names = test['target_names']))

In [None]:
# Create Confusion Plot
fig, ax = plt.subplots(figsize=(12, 10))

from helper_code import mlplots as mp
mp.confusion(test['target'], ts_preds, range(20), 20, 'Naive Bayes Model')
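
The `mp.confusion` helper comes from the same `helper_code` package, so it may not be importable. A minimal sketch of the underlying counts is given below; `confusion_counts` is a hypothetical name and the labels are toy values, not results from the newsgroup data.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Add one to cell (true, predicted) for every sample
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

cm = confusion_counts([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], n_classes=3)
print(cm)
```

A matrix like this could then be drawn with seaborn, for example `sns.heatmap(cm, annot=True, fmt='d', ax=ax)`.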

-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding cells, we introduced basic topic modeling by using the
scikit learn library and employed NMF in a text classification pipeline.
Now that you have run the Notebook, why do you think the results from
the topic model-based classification are so poor, especially when
compared to the same algorithm without topic modeling (feel free to
discuss this in the class forum)?

Try making the following changes:

1. Increase the number of topics from six to sixty. Do the results change?
2. Change the classification algorithm to a random forest. Do the results
change?
3. Try changing the TFIDF parameters to use more features and n-grams. Do
the results change?

-----

## Latent Dirichlet allocation

Perhaps the most popular topic modeling algorithm is [Latent Dirichlet
allocation][wlda] or LDA. LDA assumes that documents in a corpus result
from a mixture of a small number of topics, and that each word in a
document can be attributed to one of the topics that make up that
document. The scikit learn library has an [LDA implementation][sklda],
which can be easily applied to a data set, as demonstrated below. After
constructing an LDA model, we extract the topics (in this case we are
identifying topics for the newsgroup data set) and display the top terms
in each topic.

-----
[wlda]: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[sklda]: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
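
The generative assumption behind LDA can be sketched directly: each document draws a topic mixture, and each word is drawn from one of those topics. The vocabulary, the two topic distributions, and the `alpha` values below are invented for illustration; fitting an LDA model amounts to inferring such distributions from observed text, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(23)

vocab = np.array(['game', 'team', 'score', 'disk', 'drive', 'memory'])
# Two made-up topics: each row is a probability distribution over vocab
topic_word = np.array([[0.40, 0.35, 0.25, 0.00, 0.00, 0.00],   # "sports"
                       [0.00, 0.00, 0.00, 0.35, 0.35, 0.30]])  # "computers"

def generate_document(n_words, alpha=(0.5, 0.5)):
    # 1. Draw this document's topic mixture from a Dirichlet prior
    theta = rng.dirichlet(alpha)
    # 2. For each word slot, pick a topic, then a word from that topic
    topics = rng.choice(len(theta), size=n_words, p=theta)
    words = [rng.choice(vocab, p=topic_word[z]) for z in topics]
    return theta, topics, words

theta, topics, words = generate_document(8)
print(np.round(theta, 3))
print(list(zip(topics, words)))
```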

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Recent scikit learn releases name this parameter n_components (formerly n_topics)
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=5,
                                learning_method='online', learning_offset=5.,
                                random_state=23).fit(train_data)

In [None]:
lda_topics = tp.get_topics(cv, lda)

-----

### Visualizing Topics

We can visualize the important terms in a topic by constructing a
[wordle][ww], which is a word cloud where the size of the word indicates
its relative importance. To do this, we will use the Python [word cloud][pwn]
library as demonstrated in the following code cell, where we display a
word cloud for the first topic and a word cloud for all topics.

-----
[ww]: http://www.wordle.net
[pwn]: http://amueller.github.io/word_cloud/

In [None]:
from helper_code import wcviz as wc

wc.make_wc(lda_topics[0].replace(',', ''), 'Words from Topic 0')

text = ', '.join(lda_topics)
wc.make_wc(text.replace(',', ''), 'Words from all Topics')

-----

<font color='red' size = '5'> Student Exercise </font>


In the preceding cells, we introduced Latent Dirichlet allocation. Now
that you have run the Notebook, try to use LDA in the previous text
classification problem. Are the results better or worse? Can you explain
why?

-----

## Gensim

While NLTK is a useful library to learn the basic concepts in text
analysis and natural language processing, there are other libraries that
provide powerful NLP functionality. One of the most important libraries
in this category is the [_gensim_ library][gl], which is an open source,
Python library to create vector-space models for text data that can be
used to create topic models. In the following section, we review how to
use the gensim library to perform basic text analysis, before learning
how to use gensim to create topic models.

An important task in gensim is the creation of the vector space model
for a text document. The indices into the vector space are mapped to the
actual terms (or words) by a dictionary; thus we will need the actual
vector space model and this dictionary to use gensim for topic
modeling. These concepts are demonstrated in the following few code
cells, where we analyze the course description.

-----
[gl]: http://radimrehurek.com/gensim/
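
Before turning to gensim itself, the two pieces described above (a dictionary from terms to indices, and a sparse vector of counts) can be sketched in plain Python. The function names `build_token2id` and `to_bow` are hypothetical, and the id assignment here (order of first appearance) is an assumption that need not match the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [['data', 'science', 'course'], ['data', 'students', 'learn']]
token2id = build_token2id(docs)
bow = to_bow('data science is cool learn data'.split(), token2id)
print(token2id)
print(bow)
```

Words that are not in the dictionary (`is`, `cool`) simply drop out of the vector, which is also what happens to unseen words in the bag of words cells that follow.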

In [None]:
# Next section follows gensim tutorial

# As a text example, we use the course description for INFO490  SP16.
info_course = ['Advanced Data Science: This class is an asynchronous, online course.', 
               'This course will introduce advanced data science concepts by building on the foundational concepts presented in INFO 490: Foundations of Data Science.', 
               'Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.', 
               'Next, students will learn machine learning techniques including supervised and unsupervised learning, dimensional reduction, and cluster finding.', 
               'An emphasis will be placed on the practical application of these techniques to high-dimensional numerical data, time series data, image data, and text data.', 
               'Finally, students will learn to use relational databases and cloud computing software components such as Hadoop, Spark, and NoSQL data stores.', 
               'Students must have access to a fairly modern computer, ideally that supports hardware virtualization, on which they can install software.', 
               'This class is open to sophomores, juniors, seniors and graduate students in any discipline who have either taken a previous INFO 490 data science course or have received instructor permission.']

# Simple stop words
stop_words = set('for a of the and to in on an'.split())

# Parse text into words, make lowercase and remove stop words
txts = [[word for word in sentence.lower().split() if word not in stop_words]
        for sentence in info_course]

# Keep only those words appearing more than once
# Easy with a Counter, but need a flat list
from collections import Counter
frequency = Counter([word for txt in txts for word in txt])

# Now grab tokens that appear more than once
tokens = [[token for token in txt if frequency[token] > 1]
          for txt in txts]

# Display the tokens
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)

pp.pprint(tokens)

In [None]:
# Compute dictionary mapping for given text corpus

from gensim import corpora
dict_gensim = corpora.Dictionary(tokens)
print(dict_gensim)

In [None]:
# Display mapping between index and word in Bag of Word model.

print(dict_gensim.token2id)

In [None]:
# Display sample text string as a bag of words.

new_txt = 'data science is cool, you should take this course to learn data concepts'
new_vec = dict_gensim.doc2bow(new_txt.lower().split())
pp.pprint(new_vec)

In [None]:
# Display corpus as bag of words.

crps = [dict_gensim.doc2bow(txt) for txt in txts]
print(crps)

-----

### Topic Modeling with gensim

We can use the gensim library to perform topic modeling. We first
transform our info text document to a TFIDF model. The gensim library
requires a dictionary to map indices into the TFIDF model to the words,
which we can do with our `dict_gensim` object. In the next few code
cells, we first create our TFIDF document matrix and display a sample
text string, the bag of words model for this text string, and the TFIDF
model of this string. Next, we construct a Latent Dirichlet allocation
model of the corpus using our dictionary mapping object. Finally, we
display the topics, before identifying the top topic for each sentence
in our original corpus.
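
As a rough picture of what a TFIDF transformation does to a bag of words vector, the sketch below weights each count by a base-2 inverse document frequency and scales the result to unit length. These choices are assumptions made for the illustration (we believe they resemble the gensim defaults, but this is not the gensim implementation), and `fit_idf` and `tfidf_vector` are hypothetical names.

```python
import math

def fit_idf(bow_corpus, log_base=2.0):
    """Inverse document frequency for each term id in a bag-of-words corpus."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for term_id, _ in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    return {t: math.log(n_docs / df, log_base) for t, df in doc_freq.items()}

def tfidf_vector(bow, idf):
    """Weight counts by idf, drop zero weights, and scale to unit L2 length."""
    weighted = [(t, count * idf[t]) for t, count in bow if idf.get(t, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm > 0 else []

# Toy corpus of four documents as (term_id, count) pairs
toy_corpus = [[(0, 1), (1, 1)],
              [(0, 2), (2, 1)],
              [(0, 1), (1, 1), (3, 1)],
              [(0, 1)]]
idf = fit_idf(toy_corpus)
vec = tfidf_vector([(0, 3), (1, 1), (2, 1)], idf)
print(idf)
print(vec)
```

Term 0 appears in every toy document, so its weight is zero and it drops out of the result: a frequent count alone does not make a term informative.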

-----

In [None]:
from gensim import models

tfidf = models.TfidfModel(crps)

In [None]:
# Print sentence, bag of words model, and TFIDF representation.

print(new_txt)
print(new_vec)

pp.pprint(tfidf[new_vec])

In [None]:
# Compute LDA model for corpus

crps_tfidf = tfidf[crps]
lda_gs = models.LdaModel(corpus=crps_tfidf, id2word=dict_gensim, num_topics=3, passes=15)

In [None]:
# Display topics as functions over their top terms

lda_gs.print_topics(3)

In [None]:
# Determine primary topic for each sentence in original text
import operator

for idx, txt in enumerate(lda_gs[crps_tfidf]):
    # Each entry is a list of (topic_id, probability) pairs; keep the most probable
    top_id, top_prob = max(txt, key=operator.itemgetter(1))
    print('Sentence {0:1d} has primary topic {1:1d} with probability = {2:4.3f}'
          .format(idx, top_id, top_prob))

In [None]:
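ttps = lda_gs.top_topics(corpus=crps_tfidf, num_words=5)

# Each entry pairs a topic's top (weight, word) tuples with that topic's coherence
for idx, (lst, coherence) in enumerate(ttps):
    print('Topic {0} (coherence = {1:5.4f})'.format(idx, coherence))
    print(35*('-'))
    for weight, word in lst:
        print('    {0:20s}: {1:5.4f}'.format(word, weight))
    print(35*('-'))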

-----

### Topic Modeling of Newsgroup Data with gensim

We can use the gensim library to perform topic modeling of the twenty
newsgroup data. We first need to transform a sparse matrix (as provided
by the scikit learn library) into a gensim corpus. We also need to
construct a vocabulary dictionary, which we can do by transforming the
vocabulary of the scikit learn vectorizer (here `cv`, a
`TfidfVectorizer`) into a dictionary that maps each `id` to its `word`.
We demonstrate this transformation in the following code cell for the
newsgroup training data.
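
The two conversions described above can be sketched on toy data in plain Python. `rows_to_corpus` and `invert_vocabulary` are hypothetical names, and dense lists stand in for the sparse matrix; the sketch only mirrors the idea of the conversion, with rows as documents.

```python
def rows_to_corpus(rows):
    """Turn document rows into lists of (term_id, weight) for nonzero entries."""
    return [[(j, w) for j, w in enumerate(row) if w != 0] for row in rows]

def invert_vocabulary(vocabulary):
    """Flip a word -> id mapping into an id -> word mapping."""
    return {idx: word for word, idx in vocabulary.items()}

# Two toy documents over a three-word vocabulary (made-up weights)
rows = [[0.0, 0.8, 0.6],
        [1.0, 0.0, 0.0]]
vocabulary = {'disk': 0, 'game': 1, 'team': 2}

toy_corpus = rows_to_corpus(rows)
id2word = invert_vocabulary(vocabulary)
print(toy_corpus)
print([[(id2word[j], w) for j, w in doc] for doc in toy_corpus])
```

In the cell below, `documents_columns=False` tells gensim that each row of the scikit learn matrix is a document, which is the layout assumed in this sketch.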

-----

In [None]:
from gensim import matutils as mat
from gensim import models as md
from gensim.corpora.dictionary import Dictionary

# transform sparse matrix into gensim corpus
td_gensim = mat.Sparse2Corpus(train_data, documents_columns=False)

# Build temporary dictionary from scikit learn vectorizer
# for use with gensim
tmp_dct = dict((idv, word) for word, idv in cv.vocabulary_.items())
dct = Dictionary.from_corpus(td_gensim, id2word=tmp_dct)

-----

### Latent Semantic Analysis

We can use the gensim library to perform [Latent Semantic
Analysis][wlsa] or LSA; in gensim, however, this technique is called
[Latent Semantic Indexing][glsi] (or LSI). LSA assumes that words with
similar meanings will occur in similar documents. By leveraging this
assumption, we can build a term-document matrix and reduce it to a small
number of latent dimensions. After processing, a cosine similarity can
be used to identify words that are similar. This technique is applied in
the following code cell, where we build an LSA model with six topics
from the newsgroup text. The topics are subsequently displayed as
functions of the most important terms in each topic.

-----
[wlsa]: https://en.wikipedia.org/wiki/Latent_semantic_analysis
[glsi]: http://radimrehurek.com/gensim/models/lsimodel.html
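
The core of LSA can be sketched with a truncated singular value decomposition of a toy term-document matrix, followed by a cosine similarity between term vectors. The matrix, the terms, and the choice of two latent dimensions are invented for illustration, and this is not how gensim computes its LSI model internally.

```python
import numpy as np

terms = ['game', 'team', 'disk', 'drive']
# Toy term-document counts: rows are terms, columns are four documents
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 3.],
              [0., 0., 3., 1.]])

# Truncated SVD: keep only the k largest singular values
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = U[:, :k] * s[:k]   # each row places a term in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cosine(term_vecs[0], term_vecs[1])   # game vs. team
sim_diff = cosine(term_vecs[0], term_vecs[2])   # game vs. disk
print('game/team: {0:.3f}, game/disk: {1:.3f}'.format(sim_same, sim_diff))
```

Terms that share documents end up pointing in nearly the same latent direction, while terms that never co-occur end up close to orthogonal.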

In [None]:
# LSI

lsi = md.lsimodel.LsiModel(corpus=td_gensim, id2word=dct, num_topics=6)
lsi.print_topics()

-----

### Latent Dirichlet allocation

The gensim library also provides an implementation of [Latent
Dirichlet allocation][glda] or LDA. We demonstrate the gensim LDA
technique in the following code cell, where we once again create an LDA
model with six topics for the newsgroup text. We subsequently display
the topics as functions of the top words within each topic. Finally, we
display the top five words in each topic, along with each word's weight
within that topic. The `top_topics` method also returns a coherence
score for each topic, which measures how consistently the topic's top
words occur together in the corpus.

-----

[glda]: http://radimrehurek.com/gensim/models/ldamodel.html

In [None]:
# LDA

lda_gs = md.LdaModel(corpus=td_gensim, id2word=dct, num_topics=6, passes=2)
lda_gs.show_topics()

In [None]:
ttps = lda_gs.top_topics(corpus=td_gensim, num_words=5)

In [None]:
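# Each entry pairs a topic's top (weight, word) tuples with that topic's coherence
for idx, (lst, coherence) in enumerate(ttps):
    print('Topic {0} (coherence = {1:5.4f})'.format(idx, coherence))
    print(35*('-'))
    for weight, word in lst:
        print('    {0:20s}: {1:5.4f}'.format(word, weight))
    print(35*('-'))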

-----

<font color='red' size = '5'> Student Exercise </font>


In the preceding cells, we used the gensim library to perform topic
modeling on the twenty newsgroup data set. Now that you have run the
Notebook, try making the following changes.

1. Increase the number of topics. How do the results change?
2. Can you map the topics to the original newsgroups?

-----


**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode