# Introduction to Topic Modeling
## Day 2: Visualization and Evaluation
---
---

## Additional learning goals

* Understand Term Frequency–Inverse Document Frequency (TF-IDF) scores and why they are useful
* Understand stemming and how to implement it
* Get practice with creating word clouds from topic words 
* Understand different methods to calculate topic prevalence
* Learn how to create some simple graphs with topic prevalence
* Learn how to visualize topics with pyLDAvis
* Learn about a few metrics for topic model evaluation


## Outline
- [Load the data](#data)
- [Vectorize and train](#train)
    - [Stemming](#stem)
- [Visualize topic words with `wordcloud`](#cloud)
- [Words aligned with each topic](#words)
- [Topic prevalence over time](#time)
- [Visualising topics with pyLDAvis](#viz)
- [Evaluating the topic model](#eval)
- [Resources and alternatives](#resources)


## Key Terms
* *coherence*:
    * The conditional likelihood of the co-occurrence of words in a topic. The higher the coherence, the more a topic model reflects human interpretation, i.e. expert annotation.
* *likelihood*:
    * A measure of the probability of the observed data, given the model — i.e. how well a topic model fits the observed data. The higher the likelihood, the better the model for the given data.
* *perplexity*:
    * A measure of how well a model predicts a sample, i.e. how much it is “perplexed” by a sample from the observed data. The lower the score, the better the model for the given data.
* *stemming*:
    * A method of text preprocessing that simplifies the language corpus and improves the cleanness of results by removing the ends of words. A common algorithm for this is the PorterStemmer. Stemming is distinct from lemmatization, which converts words into a base form and is more reliable, but is also usually slower and more computationally demanding.
* *TF-IDF Scores*: 
    * Short for term frequency–inverse document frequency: a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
* *word cloud*:
    * a simple visualization of word frequency in a text corpus, where words are scaled by frequency. These often have appealing aesthetic properties like pretty fonts or colors, but they tell us little about the data other than word frequency.

## Load data <a id='data'></a>

As always, first we load the data. We'll use the same dataset of children's literature described yesterday.

In [None]:
import pandas as pd
import numpy as np

df_lit = pd.read_csv("../assets/childrens_lit.csv.bz2", sep='\t', index_col=0, encoding = 'utf-8', compression='bz2')
df_lit = df_lit.dropna(subset=['text']) # drop where missing text

## Vectorize and train <a id='train'></a>

Yesterday, we used `scikit-learn`'s `CountVectorizer` to build a document-term matrix (DTM) in preparation for topic modeling. This used simple term counts and a bag of words approach to turn texts into numbers&mdash;also known as text vectorization. As a reference, here is the code to load the data, vectorize the texts by term frequencies, and train an LDA model:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english')

# Create sparse DTM
tf_dtm = tf_vectorizer.fit_transform(df_lit.text)

# Save vocabulary for later use (on scikit-learn < 1.0, use get_feature_names() instead)
tf_vocab = tf_vectorizer.get_feature_names_out()
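
To make the idea of a document-term matrix concrete, here is a minimal sketch that builds one by hand with the standard library. The toy sentences are made up for illustration, and the tokenization (splitting on whitespace) is a simplifying assumption: `CountVectorizer` also lowercases, tokenizes with a regular expression, and applies the stop word and `min_df`/`max_df` filters set above.

```python
from collections import Counter

# Hypothetical toy corpus (not the children's literature data)
toy_docs = ["the cat sat", "the cat saw the dog", "a dog barked"]

# Vocabulary: the sorted set of unique tokens across all documents
vocab = sorted({tok for doc in toy_docs for tok in doc.split()})

# One row per document, one column per vocabulary term, cells are raw counts
dtm = []
for doc in toy_docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row is a "bag of words": word order is discarded and only the counts remain.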

In [None]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

n_samples = 2000
n_topics = 4
n_top_words = 50

print("Fitting LDA model with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

tf_lda = LDA(n_components=n_topics, 
          max_iter=20,
          learning_method='online',
          learning_offset=80.,
          total_samples=n_samples,
          random_state=0)

#fit the model
tf_lda.fit(tf_dtm)
print("Done!")

In [None]:
def print_top_words(model, feature_names, n_top_words):
    '''Prints the top words for each topic in a pretty way.'''
    
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
# print top words
print("\nTopics in LDA model with TF features and %d topics:" % n_topics)
print_top_words(tf_lda, tf_vocab, n_top_words)

Let's build on yesterday's code by using a more nuanced text vectorization method: term frequency–inverse document frequency (TF-IDF) scores, which give a word greater weight both when it is more frequent in a text AND when it is rare across the corpus. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to find words that are unevenly distributed across the corpus, because these are the distinctive ones. TF-IDF weighting also down-weights common terms like 'the', 'of', and 'and' without our having to remove them manually during preprocessing.

Traditionally, the inverse document frequency (IDF) is calculated as:

`number_of_documents / number_of_documents_with_term`

so the TF-IDF score of a word in a document is:

`tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)`

You can, and often should, normalize the term frequency by the length of the document:

`tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)`

We can calculate this manually, but scikit-learn has a built-in class to do so. By default it uses a smoothed, log-scaled IDF and normalizes each document's vector, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), but a challenge for you: use Pandas to calculate this manually. 
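
As a worked example of the length-normalized formula above, here is a minimal sketch in plain Python. The three toy documents and the helper name `tfidf` are assumptions made for illustration. It uses the unlogged IDF exactly as written above, so its numbers will differ from what scikit-learn produces.

```python
# Hypothetical toy corpus, already tokenized
docs = {
    "doc1": "the cat sat on the mat".split(),
    "doc2": "the dog chased the cat".split(),
    "doc3": "the bird sang".split(),
}
n_docs = len(docs)

def tfidf(term, doc_name):
    """Length-normalized term frequency times unlogged inverse document frequency."""
    tokens = docs[doc_name]
    tf = tokens.count(term) / len(tokens)
    n_docs_with_term = sum(term in toks for toks in docs.values())
    idf = n_docs / n_docs_with_term
    return tf * idf

for term in ["the", "cat", "mat"]:
    print(term, round(tfidf(term, "doc1"), 3))
```

In `doc1`, "the" is the most frequent word but appears in every document, so it ends up with a lower score than "mat", which appears only once but in no other document.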

### Challenge 1

Use `sklearn`'s `TfidfVectorizer()` function ([more info here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)) to weight features with TF-IDF. Extract the vectorizer vocabulary for later use. 

In [None]:
# your code here

### Challenge 2

Train an LDA model using the TF-IDF-weighted features. Then use the `print_top_words()` function defined above to display the top words for each topic. Compare with the output we just saw from the model trained using term frequencies.

In [None]:
# your code here

### Stemming <a id='stem'></a>

As an additional improvement in our text preprocessing, we can stem each word in our input texts BEFORE vectorizing. This allows us to clean up our text data by combining different forms of the same word: for instance, "running" and "runs" both get converted into "run". 

We'll use the PorterStemmer, one of the most basic and common stemming algorithms. The PorterStemmer uses a number of rules to drop word endings&mdash;but in essence, you can think of it as chopping off word endings. For more info on how the PorterStemmer works, see [here](https://tartarus.org/martin/PorterStemmer/). Here's how to use it:

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

# Example stemming application
toy_sentence = "longer words are more long than shorter words"
for word in toy_sentence.split():
    print(ps.stem(word))

### Challenge

Implement stemming with the PorterStemmer before vectorizing the text. Then construct an LDA model and compare the results to what you saw before.

_Hint:_ Create a new DataFrame column called `text_stemmed` with the stemmed version of the book text. Use `apply()` with a list comprehension to apply the stemmer to each word in each text. The basic format for this is:

```python
df[col_new] = df[col].apply(lambda doc: ' '.join([function(word) for word in doc.split()]))
```

In [None]:
# your code here

## Visualize topic words with `wordcloud` <a id='cloud'></a>

To get a feel for the most frequent words in each topic, let's create word clouds. Word clouds are simple visualizations of the words in a group of words (like a document, corpus, or topic) with appealing aesthetic properties like pretty fonts or colors. They scale words by frequency, but otherwise they don't tell us much about the text data. Here's an example of a word cloud:

<img src="../assets/wordcloud.png" alt="Wordcloud" width="700" height="700"/>

To visualize the top words for each topic, let's use the `wordcloud` package (for more info, see [here](https://amueller.github.io/word_cloud/)) together with the topic loadings stored in the `tf_lda` model as `.components_`:

In [None]:
tf_lda.components_

To foster reproducibility and clean code, let's implement `wordcloud` as a function&mdash;with an informative docstring!

In [None]:
import wordcloud
from matplotlib import pyplot as plt

def display_wordcloud(model, feature_names, terms_count, save=False):
    '''Creates a word cloud with specified # terms for each topic in input topic model. 
    Credit for example code: Krunal on Medium: https://medium.com/@krunal18/topic-modeling-with-latent-dirichlet-allocation-lda-decomposition-scikit-learn-and-wordcloud-1ff0b8e8a8eb
    
    Args:
        model (object): topic model from LatentDirichletAllocation (like lda)
        feature_names (array): vocabulary from text vectorizer
        terms_count (int): number of terms to include for each topic's word cloud
        save (bool): if True, save each word cloud to disk as a PNG; if False, display it in the notebook
    
    Returns:
        wordcloud visualizations'''

    for idx,topic in enumerate(model.components_): # loop over topics
        print('Topic# ',idx+1)
        
        # Get the N top-weighted (word, weight) pairs for this topic
        top_idx = topic.argsort()[:-terms_count - 1:-1]
        topic_terms_sorted = [(feature_names[i], topic[i]) for i in top_idx]

        # Print top words above each wordcloud
        print(','.join(word for word, _ in topic_terms_sorted))
        print()

        # Map each top word to its topic weight, which sets its size in the cloud
        dict_word_frequency = dict(topic_terms_sorted)

        # Initialize wordcloud
        wcloud = wordcloud.WordCloud(background_color="white",mask=None, max_words=100, 
                                     max_font_size=100,min_font_size=10,prefer_horizontal=0.9,
                                     contour_width=3,contour_color='black', 
                                     min_word_length=3)

        wcloud.generate_from_frequencies(dict_word_frequency)

        plt.imshow(wcloud, interpolation='bilinear')
        plt.axis("off")

        # Visual done, now save or display
        if save:
            plt.savefig("WordCloud Topic "+str(idx+1)+".png", format="png")
        else:
            plt.show()

In [None]:
cloud_size_terms = 50

display_wordcloud(tf_lda, tf_vocab, cloud_size_terms)

### Challenge

Create a new LDA model called `tfidf_lda_stemmed` with 10 topics, TF-IDF weighting, and stemmed words. Then create a word cloud for each topic with the top 75 words.

In [None]:
# your code here

## Words aligned with each topic <a id='words'></a>

Let's calculate the total number of words aligned with each topic and compare by author gender. This is a way of measuring **topic prevalence**, or how frequent each topic is in the corpus.

First, we need to merge the topic loadings with the text into one big DataFrame, just like we did yesterday.

In [None]:
# Get topic distribution and merge with main DataFrame
topic_dist = tf_lda.transform(tf_dtm)
# Reuse df_lit's index so rows still line up after dropna() removed some texts
topic_dist_df = pd.DataFrame(topic_dist, index=df_lit.index)
df_w_topics = topic_dist_df.join(df_lit)

In [None]:
# first, create word count column
df_w_topics['word_count'] = df_w_topics['text'].apply(lambda x: len(str(x).split()))
df_w_topics['word_count']

In [None]:
# multiply topic weight by word count
df_w_topics['0_wc'] = df_w_topics[0] * df_w_topics['word_count']
df_w_topics['0_wc']

In [None]:
# create a for loop to do this for every topic
col_list = []
topic_columns = range(n_topics)  # one column per topic, instead of hard-coding 4

for num in topic_columns:
    col = "%d_wc" % num
    col_list.append(col)
    df_w_topics[col] = df_w_topics[num] * df_w_topics['word_count']
    
df_w_topics

### Challenge 1

- What is the total number of words aligned with each topic, by author gender?
- What is the proportion of total words aligned with each topic, by author gender?

In [None]:
# your code here

Question: Why might we prefer one calculation over the other, i.e. the average topic weight per document versus the number of words aligned with each topic? What are the benefits and drawbacks of each method?
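
To see that the two calculations can give different answers, here is a minimal sketch on made-up numbers. The three documents and their topic weights are hypothetical. In the notebook you would compute the same quantities from the columns of `df_w_topics`.

```python
# Hypothetical toy corpus: (word_count, weight of topic 0) for each document
toy_docs = [
    (100, 0.90),     # short book, mostly topic 0
    (100, 0.80),     # short book, mostly topic 0
    (10_000, 0.10),  # long book, very little topic 0
]

# Method 1: average topic weight per document (every document counts equally)
mean_weight = sum(weight for _, weight in toy_docs) / len(toy_docs)

# Method 2: share of all words aligned with the topic (long documents count more)
words_in_topic = sum(n_words * weight for n_words, weight in toy_docs)
word_share = words_in_topic / sum(n_words for n_words, _ in toy_docs)

print("Mean topic weight per document:", round(mean_weight, 3))
print("Share of words aligned with topic:", round(word_share, 3))
```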

### Challenge 2

- Find the most prevalent topic in the corpus.
- Find the least prevalent topic in the corpus.        

In [None]:
#your code here

In [None]:
#solution
for e in col_list:
    print(e)
    print(df_w_topics[e].sum()/df_w_topics['word_count'].sum())

## Topic prevalence over time <a id='time'></a>

We can do the same as above, but by year, to graph the prevalence of each topic over time. Let's make some pretty subplots to display these trends.

In [None]:
grouped_year = df_w_topics.groupby('year')
fig3 = plt.figure()
chrt = 0
for e in col_list:
    chrt += 1 
    ax2 = fig3.add_subplot(2,3, chrt)
    (grouped_year[e].sum()/grouped_year['word_count'].sum()).plot(
        kind='line', title=e)
    
fig3.tight_layout()
plt.show()

I interpret Topic 2 to be about battles in France. What was going on in France between 1880 and 1884 that might have made this topic increasingly popular over this period?

## Visualising topics with pyLDAvis <a id='viz'></a>

Understanding the data that underlies a topic model is vital, but fortunately we also have a slightly more human-friendly option to help us interpret the topics!

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a library for creating interactive topic model visualisations. It even has a helper function specifically for scikit-learn that we can use.

> **It will take a while to load this visualisation!**

In [None]:
import pyLDAvis.sklearn  # note: in pyLDAvis >= 3.4 this module is named pyLDAvis.lda_model

pyLDAvis.enable_notebook()

# Silence an annoying warning we cannot do anything about
import warnings
warnings.filterwarnings('ignore')

In [None]:
pyLDAvis.sklearn.prepare(tf_lda, tf_dtm, tf_vectorizer)

Here are some hints to help you interpret the visualisation:

* On the **left-hand side** is a scatterplot of some bubbles:
 * Each **bubble** represents a topic.
 * The **size of a bubble** represents how _prevalent_ or popular the topic is overall.
 * The **distance** from one bubble to another represents how similar the topics are to each other. If they overlap then the topics share significant similarity.
 
* On the **right-hand side** is a bar chart of terms (tokens):
 * Select a bubble and it shows the top-30 **most relevant terms** for that topic.
 * The **red bar** represents how frequent a term is in the topic.
 * The **blue bar** represents how frequent the term is overall in all topics. So a red bar that covers most of its blue bar indicates a term that is highly specific to that particular topic. Conversely, a short red bar on a long blue bar means the term is also present in many other topics.
 * Mousing over a particular term resizes the bubbles to show the relative frequency of that term in the various topics.
 * Adjusting the slider changes the **_relevance_ value (λ)**, which is the weight given to whether a term appears exclusively in a particular topic or is spread over topics more evenly. If λ = 1 terms are ranked according to their probabilities in the particular topic only; if λ = 0 terms are ranked higher if they are unusual terms that occur almost exclusively in that topic. Typically, the optimal value is around 0.6, but it is interesting to adjust it and observe any differences.
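
The relevance ranking described in the last hint can be sketched in a few lines. The formula is the one from the LDAvis paper by Sievert and Shirley: relevance = λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The two terms and their probabilities below are invented for illustration, and the helper names are assumptions rather than part of the pyLDAvis API.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic for a given lambda between 0 and 1."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities: "said" is common everywhere, "cannon" is rare but topic-specific
toy_terms = {
    "said":   {"p_topic": 0.050, "p_corpus": 0.0400},
    "cannon": {"p_topic": 0.010, "p_corpus": 0.0012},
}

def rank_terms(lam):
    """Return the toy terms sorted from most to least relevant."""
    return sorted(
        toy_terms,
        key=lambda t: relevance(toy_terms[t]["p_topic"], toy_terms[t]["p_corpus"], lam),
        reverse=True,
    )

for lam in (1.0, 0.6, 0.0):
    print(lam, rank_terms(lam))
```

At λ = 1 the frequent term ranks first; at λ = 0 the topic-specific term overtakes it.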

### Challenge

Use `pyLDAvis` to visualize the relationships between topics in the `tfidf_lda_stemmed` model you created earlier. What differences do you notice between this and the un-stemmed version?

In [None]:
# your code here

## Evaluating the topic model <a id='eval'></a>

How do we know if our topic models are giving sensible results? One way to find out is by human coding and interpretation (e.g., expert annotation), which is slow and resource-intensive. Luckily, there are a number of quantitative approaches to evaluating the quality of topic models. Some are built into `scikit-learn`; others require outside packages. Let's look at these metrics.

The _likelihood_ measures the probability of the observed data given the model, i.e. how well a topic model fits the observed data. Similarly, _perplexity_ measures how well a model predicts a sample, i.e. how much it is “perplexed” by a sample from the observed data (a held-out test set). The two point in opposite directions: the higher the likelihood, the better the model for the observed data, whereas the lower the perplexity, the better. These are the main statistical measures used when constructing topic models, and even [when choosing the best one](https://datascience.blog.wzb.eu/2017/11/09/topic-modeling-evaluation-in-python-with-tmtoolkit/). However, neither one correlates very strongly with human evaluation of what topics make sense, i.e. what words should be together in which topics.
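
To make the link between the two metrics concrete, here is a minimal sketch that computes perplexity as the exponential of the negative average log-likelihood per token. The token probabilities are made up, and the helper name `toy_perplexity` is an assumption. The values `scikit-learn` reports are approximations computed from the fitted model, so treat this only as an illustration of the definition.

```python
import math

def toy_perplexity(token_probs):
    """exp(-1 * average log-likelihood per token) for a list of token probabilities."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))

# Hypothetical probabilities two models assign to the same six held-out tokens
unsure_model = [0.25] * 6   # as uncertain as a uniform guess among 4 words
sharper_model = [0.50] * 6  # assigns each observed token a higher probability

print("Unsure model perplexity: ", toy_perplexity(unsure_model))
print("Sharper model perplexity:", toy_perplexity(sharper_model))
```

The model that gives the observed tokens higher probability has the higher likelihood and the lower perplexity.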

To better reflect human judgment, a lot of work has been done to develop measures of topic model _coherence_, meaning the conditional likelihood of the co-occurrence of words in a topic. The higher the coherence, the more a topic model reflects human interpretation, i.e. expert annotation. In other words, coherence measures the degree of semantic similarity between high scoring words in a given topic, reflecting how semantically interpretable the topic is&mdash;as opposed to being an artifact of statistical inference. If you want to learn more about the math, intuition, and varieties of semantic coherence measures, check out [this blog](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0) and [this canonical article](https://dl.acm.org/doi/pdf/10.5555/2145432.2145462).
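
For intuition, here is a minimal sketch of the document co-occurrence coherence from Mimno et al. (2011): for a topic's top words, ordered from most to least probable, sum log((D(v_m, v_l) + 1) / D(v_l)) over every pair in which v_l ranks above v_m, where D counts the documents containing the given word or words. The toy documents and word lists are invented, the helper names are assumptions, and this is not how any particular library implements the metric.

```python
import math

def mimno_coherence(top_words, documents):
    """Coherence of one topic; assumes each top word occurs in at least one document."""
    doc_sets = [set(doc.split()) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score

# Hypothetical toy corpus and two hypothetical topics
toy_docs = [
    "king queen castle",
    "king castle knight",
    "queen king crown",
    "ship sea storm",
    "sea ship sailor",
]
coherent_topic = ["king", "queen", "castle"]  # words that share documents
mixed_topic = ["king", "ship", "sea"]         # words from unrelated documents

print("Coherent topic:", mimno_coherence(coherent_topic, toy_docs))
print("Mixed topic:   ", mimno_coherence(mixed_topic, toy_docs))
```

The topic whose top words tend to appear in the same documents receives the higher (less negative) score.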

First, let's look at likelihood (usually logged) and perplexity via `scikit-learn`.

In [None]:
# Log Likelihood: Higher the better
print("Log Likelihood: ", tf_lda.score(tf_dtm))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", tf_lda.perplexity(tf_dtm))

To check out coherence of our topic models, we will use the `tmtoolkit` module, which can calculate a number of different quantitative metrics:

In [None]:
from tmtoolkit.topicmod import tm_sklearn 
tm_sklearn.AVAILABLE_METRICS

Let's calculate the coherence using the formula proposed by [Mimno et al. (2011)](https://dl.acm.org/doi/pdf/10.5555/2145432.2145462):

In [None]:
tf_coherence_scores = tm_sklearn.metric_coherence_mimno_2011(
    topic_word_distrib = tf_lda.components_,
    dtm = tf_dtm)

tf_coherence_scores

### Challenge

Calculate the log-likelihood, perplexity, and coherence of the `tfidf_lda_stemmed` model you created earlier. How does this model compare to the 4-topic, un-stemmed one we just examined?

In [None]:
# your code here

---
---

## Resources and alternatives <a id='resources'></a>

In addition to LDA in `scikit-learn`, there are a few other common tools for topic modeling in Python:
- [Here's a detailed example of using LDA in scikit-learn](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py), including several alternatives (like NMF) we won't explore.
- The other major topic modeling package is [Gensim](https://radimrehurek.com/gensim/) (example implementation [here](https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb) and [here](https://github.com/susanli2016/NLP-with-Python/blob/master/LDA_news_headlines.ipynb)). 
- Another option is [textacy](https://textacy.readthedocs.io/en/latest/), which is built on the powerful spaCy library for text manipulation ([example implementation](https://github.com/repmax/topic-model/blob/master/topic-modelling.ipynb)).

Another well-known tool for topic modeling is called [MALLET](http://mallet.cs.umass.edu/topics.php), which is a program (written in Java) that you download to your computer. You have to type commands to use MALLET, but it otherwise does a great deal of the work for you. 
- [Getting Started with Topic Modeling and MALLET](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet) from Programming Historian gives a step-by-step tutorial on MALLET.
- There is a graphical interface for MALLET called [Topic Modeling Tool](https://github.com/senderle/topic-modeling-tool) that is a bit easier to use. The [Quickstart Guide](https://senderle.github.io/topic-modeling-tool/documentation/2017/01/06/quickstart.html) will get you up and running.

If you are looking to use R rather than Python, then `tidytext` is a popular NLP library that will help you work with the `topicmodels` package. 
- The book _Text Mining with R_ devotes [chapter 6](https://www.tidytextmining.com/topicmodeling.html) to tidytext.

Finally, if coding isn't your thing, you can explore the topics of a few documents in a casual way with the online digital text environment [Voyant Tools](https://voyant-tools.org/), which allows you to upload or copy-and-paste texts and explore a corpus with a number of graphical tools, including topics.