# Topic Models

As you've likely realized by this point in the book, there is an abundance of information available online suited to social science research. The breadth of available data means there's more information out there than any individual researcher--or even teams of researchers--would be able to efficiently analyze without the assistance of some additional tools. 

Luckily, topic modeling allows us to quickly and systematically analyze vast quantities of unstructured text. With topic modeling, we are able to uncover some of the more abstract, underlying themes across large collections of text without putting in the hours necessary to read through the texts at a human pace. 

This lesson will go over what, exactly, topic modeling does, walk through how to run Latent Dirichlet allocation (LDA) topic models in Python, and introduce a number of fit statistics to help us better understand the topic models we'll be generating. We'll end with a discussion of several useful visualization tools for topic modeling. 

# What is Topic Modeling?


As a form of unsupervised machine learning, *topic modeling* allows for the classification of large collections of textual documents into natural groups, without the need for extensive human supervision. Employed in text mining and natural language processing, topic models can uncover the hidden, or *latent*, meanings of language patterns within texts.  


> In text mining, we often have collections of documents, such as blog posts or news articles, that we’d like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.

[Source](https://www.tidytextmining.com/topicmodeling.html)


## Setup

In addition to `matplotlib inline` and `pandas`, we'll also be importing `CountVectorizer` from [scikit-learn](https://scikit-learn.org/stable/), a machine learning library for Python. We'll discuss the role of `CountVectorizer` in topic modeling in more detail later in the chapter. 

We'll also want to use `pandas` to increase our maximum column width to 120 characters, over the default 50-character column width. This will help us more easily glance over text in the dataframe. 



In [1]:
%matplotlib inline

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


pd.set_option('display.max_colwidth', 120)

In this chapter we'll be looking at [transcriptions](https://www.kaggle.com/unitednations/un-general-debates) of United Nations General Debates, from 1970 to 2016.

In [2]:
un_df = pd.read_json('data/un-general-debates.json')
print(len(un_df))

3214


With 3,214 complete transcriptions of UN members' statements to work with, this is an ideal dataset to use as we begin playing around with topic models. 

Let's look at a small random sample of 5 texts from the dataframe to get a feel for what we're dealing with:

In [3]:
un_df.sample(5)

| | index | speech_year | country_code | speech_text |
|---|---|---|---|---|
| 355 | 810 | 1991 | LVA | Allow me to convey to the President, on behalf of the Government and people of Latvia and on my own behalf, our sin... |
| 1720 | 3601 | 1999 | AGO | Allow me, to begin, Sir,\nby congratulating you on behalf of the Government of the\nRepublic of Angola, and on my ow... |
| 2001 | 4067 | 1982 | BLZ | 203.\tThe delegation of the newly independent \nCentral American and Caribbean nation of Belize \nhas listened with ... |
| 666 | 1435 | 1996 | UGA | It is a pleasure and an\nhonour for me to address this Assembly. From this lofty\nrostrum, the nations of the world... |
| 2197 | 4263 | 1998 | GRC | I wish first to extend to the\nPresident my warm congratulations on his assumption of\nthe conduct of the current se... |


### Topic Modeling Exercise 1


Take a look at the text of the UN speeches. When delivering an address, what are the different topics that are covered? Make a list of four topics and provide three example words from each topic.




![](images/lda.jpg)

## Latent Dirichlet allocation (LDA) 

We'll narrow our focus for the time being to one particular form of topic modeling, *Latent Dirichlet allocation*, or *LDA* for short. With LDA topic modeling, we'll be able to treat every document in our corpus as a mixture of topics. Each document in our corpus can contain words associated with any number of topics in varying proportions. At the same time, each topic can be treated as a mixture of words. Any given word can be associated with any number of topics. Considering our documents and topics as these sorts of "mixtures" helps us to mimic the thematic subtleties inherent in natural language.


> Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
>
> Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.
>
> * **Every document is a mixture of topics.** We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
> * **Every topic is a mixture of words.** For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.

> LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. There are a number of existing implementations of this algorithm, and we’ll explore one of them in depth.

[Source](https://www.tidytextmining.com/topicmodeling.html#latent-dirichlet-allocation)
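To make the two principles concrete, it can help to run the "mixture" story forward: start from known topics and generate documents from them. The sketch below is only an illustration, not LDA itself. The word probabilities are made up, the document mixtures are fixed by hand using the percentages from the quote above (a full LDA model would draw them from a Dirichlet distribution), and `generate_document` is a hypothetical helper.

```python
import random

random.seed(0)

# Hypothetical two-topic model: each topic is a probability distribution over words.
# "budget" appears in both topics with equal weight.
topics = {
    "politics": {"president": 0.4, "congress": 0.3, "government": 0.2, "budget": 0.1},
    "entertainment": {"movies": 0.4, "television": 0.3, "actor": 0.2, "budget": 0.1},
}

def generate_document(topic_mixture, n_words):
    """Draw each word by first picking a topic, then picking a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = random.choices(topic_names, weights=topic_weights)[0]
        vocab = topics[topic]
        word = random.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words

# Document 1 is 90% politics and 10% entertainment; Document 2 is 30% and 70%.
doc_1 = generate_document({"politics": 0.9, "entertainment": 0.1}, n_words=20)
doc_2 = generate_document({"politics": 0.3, "entertainment": 0.7}, n_words=20)
print(doc_1)
print(doc_2)
```

Fitting an LDA model works in the opposite direction: we only observe the words in each document, and the model estimates both the topic-word distributions and the per-document mixtures that could plausibly have produced them.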


We can import `LatentDirichletAllocation` from scikit-learn to run our own LDA topic models in Python. 

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

### Converting Documents to Vectors

In order to run our topic models, we'll need to convert each document in the corpus into a fixed-length *vector*. We can accomplish this using the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=vectorizer#sklearn.feature_extraction.text.CountVectorizer) class we imported earlier in the chapter. Using `CountVectorizer`, we'll be able to transform our dataset from a collection of text documents into a matrix of token counts, with one row per document and one column per term in the vocabulary.
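As a rough picture of what a matrix of token counts looks like, here is a minimal sketch in plain Python. It is a simplification and not how `CountVectorizer` is implemented: the three short documents are made up, and tokens are split on whitespace only.

```python
from collections import Counter

docs = [
    "peace and security",
    "security council reform",
    "peace peace peace",
]

# Lowercase each document and split on whitespace (a simplified tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens; its length fixes the vector length.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Every document ends up as a vector of the same length, no matter how long the original text was, which is what the topic model needs as input.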

`CountVectorizer` can take a number of parameters; while not an exhaustive list, some important parameters are described below. 

#### `CountVectorizer` Parameters

- **lowercase**: Converts all text to lower case. By default, `lowercase` is set to True.


- **ngram_range**: This allows us to restrict the range of n-values for our n-grams. Formatted as a tuple, the first element sets the minimum n-value, and the second sets the maximum n-value. By default, `ngram_range` is set to (1,1). (If we leave it alone, we'll only be looking at unigrams.) 


- **stop_words**: Removes a given list of words from the vocabulary. This can be set to 'english' to use a pre-determined set of stopwords often found in texts written in the English language. We can also provide our own list of stopwords if we so choose. Three related parameters let us rule out terms based on how often they occur: terms that appear 1) in too many documents, 2) in too few documents, or 3) outside the most frequent terms in the corpus. 

> - **max_df**: Allows us to set a maximum threshold on document frequency for the terms incorporated into our vocabulary. By default, `max_df` is set to 1.0. 
> - **min_df**: Allows us to set a minimum threshold on document frequency for the terms incorporated into our vocabulary. By default, `min_df` is set to 1. 
> - **max_features**: Allows us to build a vocabulary exclusively from the highest-frequency terms occurring throughout our corpus. By default, `max_features` is set to None, which keeps every term. 
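To see how these thresholds change the vocabulary, we can try them on a toy corpus. The four short documents below are made up for illustration, and `fitted_vocabulary` is a hypothetical helper, not part of scikit-learn. The sketch reads the fitted vocabulary from the vectorizer's `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "peace and security in the region",
    "peace and development in africa",
    "nuclear weapons and peace",
    "trade and development",
]

def fitted_vocabulary(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    vectorizer = CountVectorizer(stop_words="english", **params)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

# Every term that is not an English stopword.
print(fitted_vocabulary())
# Only terms that appear in at least 2 documents.
print(fitted_vocabulary(min_df=2))
# Terms in at least 2 documents and in no more than 70% of documents.
print(fitted_vocabulary(min_df=2, max_df=0.7))
# Only the single most frequent term in the corpus.
print(fitted_vocabulary(max_features=1))
```

Note that an integer threshold is read as a number of documents, while a float is read as a proportion of documents. Here "peace" appears in three of four documents (75%), so it survives `min_df=2` but is dropped by `max_df=0.7`.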

For now, let's set our parameters so that we convert all text to lower case, only look at unigrams, only look at terms with a document frequency of .90 or below, use the default 'english' stopwords list, and only consider the top 1,000 terms in our corpus.  

In [5]:
vectorizer = CountVectorizer(lowercase   = True,
                             ngram_range = (1,1),
                             max_df      = .90,
                             stop_words   = 'english',
                             max_features = 1000)

After setting our parameters, we can `fit` the vectorizer to the `speech_text` key in our UN General Debate dataframe to build a vocabulary out of the raw documents.

*Note*: You'll run into an Attribute Error if the key you plan to fit the vectorizer to contains any missing values. While we don't have to worry about this with our UN dataframe, if you encounter such an error in the future, be sure to [clean your dataframe](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) before attempting to fit the vectorizer. 

In [6]:
vectorizer.fit(un_df['speech_text'])

CountVectorizer(max_df=0.9, max_features=1000, stop_words='english')

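We can use the `len` function, along with `get_feature_names`, to confirm that our vocabulary is composed of the 1,000 highest-frequency terms in our corpus. (In scikit-learn 1.0 and later, this method is called `get_feature_names_out`; the older name was removed in version 1.2.) 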

In [7]:
len(vectorizer.get_feature_names())

1000

Now we'll want to use the vectorizer to `transform` the raw documents into a document-term matrix. 

In [8]:
un_word_counts = vectorizer.transform(un_df['speech_text'])

### Running the LDA Model

Now that we've vectorized our dataset, we're just about ready to run our first LDA model in Python. Before we do, though, we'll want to set our parameters. Below is some information on the parameters we'll set. 

#### `LatentDirichletAllocation` Parameters

- **n_components**: Sets the number of topics generated. We can set this as high or as low as we like, depending on the size and character of the texts in our corpus. 


- **max_iter**: Sets the maximum number of iterations. By default, `max_iter` is set to 10.
> - ***Note***: In almost every case, we'll want to set `max_iter` above 10. It's highly unlikely our models will converge in 10 iterations or fewer. In the following example, we'll set `max_iter` to 50, a more reasonable maximum iteration threshold.

- **evaluate_every**: Lets us adjust how frequently we gauge the perplexity of our model across iterations. By default, `evaluate_every` is set to -1, which means perplexity is never evaluated during fitting (setting it to 0 has the same effect).
> - ***Note***: Leaving `evaluate_every` at its default leaves our model without any built-in goodness of fit measure during fitting. We'll discuss other measurements of model fit later in the chapter, but for now it's useful to set `evaluate_every` to some positive number. In the following example, we'll `evaluate_every` 5 iterations.

- **n_jobs**: Sets the number of concurrently running processes. When set to -1, we'll use all processors. To use all processors but one, set `n_jobs` to -2. 


- **verbose**: Lets us determine whether or not every step of the process is logged. If > 0, we'll be able to see what's going on with our LDA model through the output in real time. 

In [11]:
lda_model = LatentDirichletAllocation(n_components = 10,
                                      max_iter     = 50,
                                      evaluate_every = 5,
                                      n_jobs       = -1,
                                      verbose      = 1)

With our parameters set, we can `fit` the LDA model to our document-term matrix of the UN General Debate transcripts.

*Note*: It's going to take a while to work our way through up to 50 iterations. That's alright.

In [12]:
lda_model.fit(un_word_counts)

iteration: 1 of max_iter: 50
iteration: 2 of max_iter: 50
iteration: 3 of max_iter: 50
iteration: 4 of max_iter: 50
iteration: 5 of max_iter: 50, perplexity: 710.8393
iteration: 6 of max_iter: 50
iteration: 7 of max_iter: 50
iteration: 8 of max_iter: 50
iteration: 9 of max_iter: 50
iteration: 10 of max_iter: 50, perplexity: 703.7182
iteration: 11 of max_iter: 50
iteration: 12 of max_iter: 50
iteration: 13 of max_iter: 50
iteration: 14 of max_iter: 50
iteration: 15 of max_iter: 50, perplexity: 702.0994
iteration: 16 of max_iter: 50
iteration: 17 of max_iter: 50
iteration: 18 of max_iter: 50
iteration: 19 of max_iter: 50
iteration: 20 of max_iter: 50, perplexity: 701.4241
iteration: 21 of max_iter: 50
iteration: 22 of max_iter: 50
iteration: 23 of max_iter: 50
iteration: 24 of max_iter: 50
iteration: 25 of max_iter: 50, perplexity: 701.0145
iteration: 26 of max_iter: 50
iteration: 27 of max_iter: 50
iteration: 28 of max_iter: 50
iteration: 29 of max_iter: 50
iteration: 30 of max_iter: 50

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=5, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=50,
                          mean_change_tol=0.001, n_components=10, n_jobs=-1,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=1)

Congrats, you've run your first topic model in Python!

## Some fit statistics

While we can intuitively "eyeball" topic quality as a first step, it's hard to do so objectively.  Calculating some fit statistics can help us to evaluate our topics' quality numerically.


`LatentDirichletAllocation` includes a few handy methods for calculating fit statistics:

- **`score()`** lets us calculate the approximate log-likelihood of our data under the fitted model. The higher this number is, the better our topic fit. 
- **`perplexity()`**, a normalized transformation of the log-likelihood, measures the amount of "surprise" our model experiences when it encounters a set of documents, typically some previously unseen data. The lower this number is, the better our topic fit.

To learn more about evaluating fit for LDA topic models in Python, see [here](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).
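The relationship between the two statistics is simple arithmetic: perplexity is the exponential of the negative log-likelihood per word. The sketch below shows only that relationship on a toy case. `perplexity_from_log_likelihood` is a hypothetical helper, and scikit-learn's own `perplexity()` works from an approximate log-likelihood, so this is not a reimplementation of it.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity is exp(-1 * log-likelihood per token)."""
    return math.exp(-log_likelihood / n_tokens)

# Toy case: a model that assigns each of 100 tokens a probability of 1/8.
n_tokens = 100
log_likelihood = n_tokens * math.log(1 / 8)

print(log_likelihood)
print(perplexity_from_log_likelihood(log_likelihood, n_tokens))
```

In this toy case the perplexity comes out to roughly 8: the model is about as "surprised" by each word as if it were choosing uniformly among 8 equally likely words. This is also why the two statistics move in opposite directions, with a higher log-likelihood always giving a lower perplexity on the same data.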

In [11]:
# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(un_word_counts))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(un_word_counts))

Log Likelihood:  -17541939.72397841
Perplexity:  700.1354513027705


### Guidelines on topic fit
1. Low perplexity on test data.
    - Remember that the lower our perplexity score, the better our topics fit our data.
2. **Topical coherence**
3. Best fit in a classification task.
    - We'll discuss classification tasks in more detail in another chapter. 
4. Extract more and then bin them yourself. 
    - When all else fails, extract more topics than you need and group them by hand.
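The first guideline can be sketched as follows: hold out some documents, fit models with different numbers of topics on the rest, and compare perplexity on the held-out set. The tiny corpus here is made up to keep the example fast, and the candidate topic counts (2, 3, and 4) are arbitrary. With the UN data, we would presumably split `un_word_counts` in the same way.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny made-up corpus, only to keep the sketch fast and self-contained.
docs = [
    "nuclear weapons treaty disarmament",
    "nuclear disarmament treaty proliferation",
    "weapons proliferation treaty nuclear",
    "trade debt growth development",
    "development trade resources growth",
    "debt resources development trade",
    "peace conflict council security",
    "security council conflict peace",
]

counts = CountVectorizer().fit_transform(docs)

# Hold out a quarter of the documents as test data.
train_counts, test_counts = train_test_split(counts, test_size=0.25, random_state=0)

held_out_perplexity = {}
for n_topics in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=n_topics, max_iter=20, random_state=0)
    model.fit(train_counts)
    # Lower perplexity on unseen documents suggests a better fit.
    held_out_perplexity[n_topics] = model.perplexity(test_counts)

for n_topics, value in held_out_perplexity.items():
    print(n_topics, round(value, 1))
```

With a corpus this small the numbers themselves mean very little. The point is the workflow: the model never sees the test documents while fitting, so a low perplexity there is not just a sign of memorizing the training data.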

In [12]:
print(lda_model.get_params())

{'batch_size': 128, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.7, 'learning_method': 'batch', 'learning_offset': 10.0, 'max_doc_update_iter': 100, 'max_iter': 50, 'mean_change_tol': 0.001, 'n_components': 10, 'n_jobs': -1, 'perp_tol': 0.1, 'random_state': None, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 1}


## Visualizing Topics

In this section we'll discuss some of the ways we can visualize the topics we've created with our LDA model. 

### pdtext

First, we can import `topic_words` from the `pdtext` package in order to create an easily interpretable matrix of our topics. Below, we'll look at the 10 highest-weighted words for each of our 10 topics:

In [15]:
from pdtext.tm import topic_words

topic_words(lda_model, vectorizer).head(10)


| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic 1 | arab | israel | council | iraq | palestinian | region | resolutions | aggression | lebanon | war |
| Topic 2 | cooperation | process | social | republic | region | rights | regional | order | important | relations |
| Topic 3 | africa | african | south | situation | delegation | peoples | republic | namibia | unity | angola |
| Topic 4 | peoples | rights | america | today | central | american | freedom | order | respect | state |
| Topic 5 | nuclear | weapons | treaty | disarmament | non | council | rights | year | proliferation | global |
| Topic 6 | operation | south | problems | developing | situation | conference | solution | negotiations | problem | africa |
| Topic 7 | developing | global | small | trade | resources | south | environment | debt | developed | growth |
| Topic 8 | council | rights | conflict | war | keeping | member | year | members | need | process |
| Topic 9 | nuclear | soviet | europe | union | weapons | military | arms | war | relations | policy |
| Topic 10 | republic | south | independence | national | kampuchea | democratic | asia | region | korea | forces |
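A table like this boils down to ranking the words within each topic by weight. The sketch below shows the general idea in plain Python; it is not `pdtext`'s actual implementation, and the weights and five-word vocabulary are made up. With a fitted scikit-learn model, the rows of `lda_model.components_` and the vectorizer's vocabulary would presumably take the place of the toy inputs.

```python
def top_words(topic_word_weights, vocabulary, n=3):
    """Return the n highest-weighted words for each topic (one row of weights per topic)."""
    result = []
    for weights in topic_word_weights:
        # Pair each weight with its word, then sort from largest weight to smallest.
        ranked = sorted(zip(weights, vocabulary), reverse=True)
        result.append([word for _, word in ranked[:n]])
    return result

# Made-up weights for a 2-topic model over a 5-word vocabulary.
vocabulary = ["arab", "israel", "nuclear", "treaty", "weapons"]
topic_word_weights = [
    [9.0, 7.5, 0.2, 0.4, 1.1],  # topic 1
    [0.1, 0.3, 8.2, 5.0, 6.4],  # topic 2
]

for i, words in enumerate(top_words(topic_word_weights, vocabulary), start=1):
    print(f"Topic {i}:", ", ".join(words))
```

Notice that "weapons" shows up in both toy topics, just as words like "nuclear" and "south" appear in more than one topic in the table above. In LDA, a word is never assigned exclusively to a single topic.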


Also from `pdtext`, `topic_pred` will let us see which documents are associated with which topics.

In [16]:
from pdtext.tm import topic_pred

In [17]:
un_topics = topic_pred(lda_model, un_word_counts, vectorizer)

In [18]:
un_topics

| | arab_israel_council | cooperation_process_social | africa_african_south | peoples_rights_america | nuclear_weapons_treaty | operation_south_problems | developing_global_small | council_rights_conflict | nuclear_soviet_europe | republic_south_independence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.098997 | 0.059116 | 0.070022 | 0.000127 | 0.091401 | 0.451918 | 0.228039 | 0.000127 | 0.000127 | 0.000127 |
| 1 | 0.000135 | 0.128485 | 0.000135 | 0.000135 | 0.085147 | 0.322740 | 0.173677 | 0.265483 | 0.023930 | 0.000135 |
| 2 | 0.043744 | 0.153191 | 0.389507 | 0.114109 | 0.025540 | 0.190522 | 0.035219 | 0.000077 | 0.045517 | 0.002573 |
| 3 | 0.019855 | 0.196087 | 0.000147 | 0.412172 | 0.000147 | 0.211575 | 0.159578 | 0.000147 | 0.000147 | 0.000147 |
| 4 | 0.000089 | 0.000089 | 0.351063 | 0.060128 | 0.025106 | 0.314027 | 0.140074 | 0.075188 | 0.034147 | 0.000089 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3209 | 0.000270 | 0.218501 | 0.000270 | 0.149425 | 0.097431 | 0.080663 | 0.000270 | 0.432282 | 0.020620 | 0.000270 |
| 3210 | 0.516275 | 0.216525 | 0.000097 | 0.000097 | 0.000097 | 0.085901 | 0.000097 | 0.180713 | 0.000097 | 0.000097 |
| 3211 | 0.000204 | 0.437315 | 0.261714 | 0.000204 | 0.000204 | 0.000204 | 0.033492 | 0.187718 | 0.000204 | 0.078740 |
| 3212 | 0.000121 | 0.268204 | 0.000121 | 0.393142 | 0.078871 | 0.000121 | 0.259057 | 0.000121 | 0.000121 | 0.000121 |
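Each row is one document's mixture of topics, so the proportions in a row sum to roughly 1. A quick way to summarize a document is to pick out its largest proportion. The sketch below does this in plain Python for two rows copied from the table above (documents 0 and 3210); the variable names are our own.

```python
topic_names = [
    "arab_israel_council", "cooperation_process_social", "africa_african_south",
    "peoples_rights_america", "nuclear_weapons_treaty", "operation_south_problems",
    "developing_global_small", "council_rights_conflict", "nuclear_soviet_europe",
    "republic_south_independence",
]

# Topic proportions for two documents, copied from the table above.
doc_topics = {
    0: [0.098997, 0.059116, 0.070022, 0.000127, 0.091401,
        0.451918, 0.228039, 0.000127, 0.000127, 0.000127],
    3210: [0.516275, 0.216525, 0.000097, 0.000097, 0.000097,
           0.085901, 0.000097, 0.180713, 0.000097, 0.000097],
}

dominant_topic = {}
for doc_id, proportions in doc_topics.items():
    # Position of the largest proportion in this document's row.
    best = max(range(len(proportions)), key=proportions.__getitem__)
    dominant_topic[doc_id] = (topic_names[best], proportions[best])

for doc_id, (name, share) in dominant_topic.items():
    print(doc_id, name, share)
```

With the full `un_topics` dataframe, `un_topics.idxmax(axis=1)` should give the same kind of answer for every document at once. Keep in mind that a dominant topic can still account for less than half of a document, as it does for document 0.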


We can now use our topics as features in order to get a better handle on topic patterns across texts.

One way to do this is to generate a new key in our United Nations dataframe. For example, we can create a `post_soviet` key to divide our general debate speeches between those that occurred prior to the fall of the Soviet Union (where `post_soviet` = False), and those that occurred after the fall of the Soviet Union (where `post_soviet` = True).

In [19]:
un_df['post_soviet'] = un_df['speech_year'] > 1991

In [20]:
un_topics.groupby(un_df['post_soviet']).mean()

| post_soviet | arab_israel_council | cooperation_process_social | africa_african_south | peoples_rights_america | nuclear_weapons_treaty | operation_south_problems | developing_global_small | council_rights_conflict | nuclear_soviet_europe | republic_south_independence |
|---|---|---|---|---|---|---|---|---|---|---|
| False | 0.093063 | 0.078689 | 0.118317 | 0.152372 | 0.035757 | 0.269663 | 0.086197 | 0.039165 | 0.07952 | 0.047258 |
| True | 0.054682 | 0.252532 | 0.067273 | 0.102223 | 0.111426 | 0.017594 | 0.126371 | 0.223457 | 0.019293 | 0.025149 |


In [21]:
un_topics.groupby(un_df['post_soviet']).mean().T

| post_soviet | False | True |
|---|---|---|
| arab_israel_council | 0.093063 | 0.054682 |
| cooperation_process_social | 0.078689 | 0.252532 |
| africa_african_south | 0.118317 | 0.067273 |
| peoples_rights_america | 0.152372 | 0.102223 |
| nuclear_weapons_treaty | 0.035757 | 0.111426 |
| operation_south_problems | 0.269663 | 0.017594 |
| developing_global_small | 0.086197 | 0.126371 |
| council_rights_conflict | 0.039165 | 0.223457 |
| nuclear_soviet_europe | 0.07952 | 0.019293 |
| republic_south_independence | 0.047258 | 0.025149 |
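To read this table more easily, we can subtract the pre-1992 average from the post-1991 average for each topic and sort the results. The sketch below does the arithmetic in plain Python, using the values copied from the table above; the variable names are our own.

```python
# Average topic proportions copied from the table above.
pre_soviet_fall = {
    "arab_israel_council": 0.093063, "cooperation_process_social": 0.078689,
    "africa_african_south": 0.118317, "peoples_rights_america": 0.152372,
    "nuclear_weapons_treaty": 0.035757, "operation_south_problems": 0.269663,
    "developing_global_small": 0.086197, "council_rights_conflict": 0.039165,
    "nuclear_soviet_europe": 0.07952, "republic_south_independence": 0.047258,
}
post_soviet_fall = {
    "arab_israel_council": 0.054682, "cooperation_process_social": 0.252532,
    "africa_african_south": 0.067273, "peoples_rights_america": 0.102223,
    "nuclear_weapons_treaty": 0.111426, "operation_south_problems": 0.017594,
    "developing_global_small": 0.126371, "council_rights_conflict": 0.223457,
    "nuclear_soviet_europe": 0.019293, "republic_south_independence": 0.025149,
}

# Positive values mean the topic became more prominent after 1991.
change = {
    topic: round(post_soviet_fall[topic] - pre_soviet_fall[topic], 6)
    for topic in pre_soviet_fall
}

# Sort from the largest decline to the largest increase.
ranked = sorted(change, key=change.get)
for topic in ranked:
    print(f"{topic:30s} {change[topic]:+.6f}")
```

Sorted this way, `operation_south_problems` shows the steepest decline after 1991, while `council_rights_conflict` and `cooperation_process_social` show the largest gains.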


### pyLDAvis

The [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html) library, a port of the [LDAvis](https://github.com/cpsievert/LDAvis) package for R, provides a variety of tools for interactive topic model visualization. 

In [None]:
pip install pyldavis

`pyLDAvis` is conveniently compatible with `LatentDirichletAllocation` from scikit-learn. (In pyLDAvis 3.4 and later, the `pyLDAvis.sklearn` module is named `pyLDAvis.lda_model`.)

In [23]:
import pyLDAvis
import pyLDAvis.sklearn

We'll also import `pyplot` from matplotlib to allow for the generation of interactive plots in Python. 

In [24]:
import matplotlib.pyplot as plt

Now we'll want to use the `enable_notebook()` function to allow us to display visualizations in our notebook.

In [25]:
pyLDAvis.enable_notebook()

Finally, we can use the `prepare()` function to transform the data from our LDA model into interactive visualizations!

In [26]:
pyLDAvis.display(pyLDAvis.sklearn.prepare(lda_model, un_word_counts, vectorizer, mds='tsne'))

### Topic Modeling Exercise 2

In your group, do 1 and 2 in 10_Topic_Modeling_group



#### Text Classification and Sentiment Analysis 

We'll be using the following to classify texts and analyze sentiment within our corpus:
- [seaborn](https://seaborn.pydata.org/), a data visualization library based on `matplotlib`. `seaborn` is also discussed in a previous chapter on text classification
- The `SentimentIntensityAnalyzer` available through [vaderSentiment](https://github.com/cjhutto/vaderSentiment), a lexicon for sentiment analysis discussed in the previous chapter on Word Lists and Sentiment Analysis. With this, we'll be able to determine the intensity of positive, negative, and neutral sentiments contained within the documents in our corpus.
- [Afinn](https://pypi.org/project/afinn/), another sentiment analysis tool discussed in the previous chapter on Word Lists and Sentiment Analysis. `Afinn` will produce a single numerical sentiment score, from -5 (negative sentiment) to 5 (positive sentiment).

In [18]:
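%matplotlib inline

import pandas as pd
import seaborn as sns

# Classifier and evaluation helpers used in the cells below
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix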



In [None]:
pip install vaderSentiment



In [21]:
lr_classifier = LogisticRegression(solver = 'lbfgs', max_iter= 5000)


In [None]:
lr_classifier.fit(un_topics, un_df['post_soviet'])

In [None]:
prediction = lr_classifier.predict(un_topics)

In [None]:
print(accuracy_score(un_df['post_soviet'], prediction))



In [None]:
print(classification_report(un_df['post_soviet'], prediction))

In [None]:
import seaborn as sns

cm = confusion_matrix(un_df['post_soviet'], prediction)
sns.heatmap(cm, annot=True, cmap="Greens", fmt='g')

### Topic Modeling Exercise 3
In your group, do the rest of 10_Topic_Modeling_group






#### References:

Nielsen, Finn Årup. 2011. "A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs." In *Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big Things Come in Small Packages*, edited by Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, and Mariann Hardey, pp. 93-98. CEUR Workshop Proceedings, Volume 718.

Hutto, C.J. and Eric Gilbert. 2014. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text." In *Eighth International Conference on Weblogs and Social Media (ICWSM-14)*. Ann Arbor, MI, June 2014.

Pedregosa et al. 2011. "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research* 12, pp. 2825-2830.

Sievert, Carson and Kenneth Shirley. 2014. "LDAvis: A Method for Visualizing and Interpreting Topics." In *Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces*, pp. 63-70. 

Silge, Julia and David Robinson. 2017. *Text Mining with R: A Tidy Approach.* O'Reilly Media, Inc.