<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Latent Dirichlet Allocation (LDA) Topic Modeling

**Description:**
This notebook demonstrates how to do topic modeling. The following processes are described:

* Filtering based on a pre-processed ID list
* Filtering based on a stop words list
* Cleaning the tokens in the dataset
* Creating a gensim dictionary
* Creating a gensim bag of words corpus
* Computing a topic list using gensim
* Visualizing the topic list with `pyldavis`

**Use Case:** For Researchers (Mostly code without explanation, not ideal for learners)

**Difficulty:** Intermediate

**Completion time:** 60 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics I](../Python-basics/python-basics-1.ipynb))

**Knowledge Recommended:**
* [Exploring Metadata](../Exploring-metadata/exploring-metadata.ipynb)
* [Python Intermediate 2](../Python-intermediate/python-intermediate-2.ipynb)
* [Pandas I](../Pandas-basics/pandas-basics-1.ipynb)
* [Creating a Stopwords List](../Stopwords/creating-stopwords-list.ipynb)
* A familiarity with [gensim](https://constellate.org/docs/key-terms/#gensim) is helpful but not required.

**Data Format:** JSON Lines (.jsonl)

**Libraries Used:**
* pandas to load a preprocessing list
* `csv` to load a custom stopwords list
* gensim to accomplish the topic modeling
* NLTK to create a stopwords list (if no list is supplied)
* `pyldavis` to visualize our topic model

**Research Pipeline**
1. Build a dataset
2. Create a "Pre-Processing CSV" with [Exploring Metadata](../Exploring-metadata/exploring-metadata.ipynb) (Optional)
3. Create a "Custom Stopwords List" with [Creating a Stopwords List](../Stopwords/creating-stopwords-list.ipynb) (Optional)
4. Complete the Topic Modeling analysis with this notebook
____

In [None]:
# Import modules and libraries
from pathlib import Path
import gensim
from gensim.models import CoherenceModel
import pyLDAvis.gensim  # Newer pyLDAvis releases name this module pyLDAvis.gensim_models
import gzip
import json

## What is Topic Modeling?

**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.

**Topic modeling** is an unsupervised clustering technique for text. We give the machine a series of texts, which it then attempts to cluster into a given number of topics. There is also a *supervised* technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.

**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it.

<font color='red'>Read more</font>

* ["Latent Dirichlet Allocation: Intuition, math, implementation and visualisation with pyLDAvis" Ioana](https://towardsdatascience.com/latent-dirichlet-allocation-intuition-math-implementation-and-visualisation-63ccb616e094) 2020
* ["Latent Dirichlet Allocation" Blei, Ng, Jordan](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?TB_iframe=true&width=370.8&height=658.8) 2003

## Import your dataset
<h3 style="color:red; display:inline">Note! The following code cell assumes that you have downloaded the JSONL file containing metadata, ngrams and full texts to the current working directory.</h3>


In [None]:
from pathlib import Path
import urllib.request

# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)

# Path to the gzip-compressed JSONL dataset file
dataset_file = '' # Copy and paste the path to the JSONL file

# Generator function that reads the dataset one document at a time
def dataset_reader(file_path):
    """
    Helper to read in gzip files and yield Python dictionary
    documents.
    """
    with gzip.open(file_path, "rb") as input_file:
        for row in input_file:
            yield json.loads(row)
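
To try the reader without a real dataset, the following minimal sketch writes two made-up records to a gzip-compressed JSONL file in a temporary folder and reads them back. The record contents and the file name `sample.jsonl.gz` are assumptions for illustration; only the `unigramCount` field mirrors what this notebook uses later.

```python
import gzip
import json
import tempfile
from pathlib import Path


def dataset_reader(file_path):
    """Yield one dictionary per line of a gzip-compressed JSONL file."""
    with gzip.open(file_path, "rb") as input_file:
        for row in input_file:
            yield json.loads(row)


# Hypothetical records, invented for illustration only
sample_records = [
    {"id": "doc-1", "unigramCount": {"whale": 3, "ship": 1}},
    {"id": "doc-2", "unigramCount": {"garden": 2}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl.gz"

    # Write one JSON object per line, compressed with gzip
    with gzip.open(sample_path, "wt", encoding="utf-8") as output_file:
        for record in sample_records:
            output_file.write(json.dumps(record) + "\n")

    # The reader is a generator, so list() is used to read every record
    loaded_records = list(dataset_reader(sample_path))

print(loaded_records[0]["id"])
```

Because the reader yields one document at a time, a large dataset does not need to fit in memory all at once.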

## Load Stopwords List

If you have created a stopwords list in the stopwords notebook, we will import it here. (You can always modify the CSV file to add or remove words and then reload the list.) Otherwise, we'll load the NLTK stopwords list automatically.

We recommend storing your stopwords in a CSV file as shown in the [Creating Stopwords List](../Stopwords/creating-stopwords-list.ipynb) notebook.

In [None]:
# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English
import csv
# Create an empty Python list to hold the stopwords
stop_words = []

# The filename of the custom data/stop_words.csv file
stopwords_path = Path.cwd() / 'data' / 'stop_words.csv'

if stopwords_path.exists():
    with stopwords_path.open() as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    # (if the corpus is missing, run nltk.download('stopwords') once)
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    
    # Create a CSV file to store an initial set of stopwords
    with stopwords_path.open('w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(stop_words)
    print('NLTK stopwords list loaded and saved to data/stop_words.csv.')
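
The CSV layout used here stores every stopword in a single row, which is why the loading code reads only the first row. A minimal sketch of that round trip, using a made-up three-word list and a temporary folder (both assumptions for illustration):

```python
import csv
import tempfile
from pathlib import Path

# A hypothetical, very short stopwords list
example_stop_words = ["the", "and", "of"]

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "stop_words.csv"

    # Write all stopwords as a single CSV row
    with example_path.open("w", newline="") as f:
        csv.writer(f).writerow(example_stop_words)

    # Read the file back; the first (and only) row holds the stopwords
    with example_path.open(newline="") as f:
        reloaded_stop_words = list(csv.reader(f))[0]

print(reloaded_stop_words)
```

If you edit the CSV by hand, keep all of the words on the first row so they are picked up when the list is reloaded.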

In [None]:
# Preview stop words
print(stop_words)

## Define a Function to Process Tokens
Next, we create a short function to clean up our tokens.

In [None]:
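def process_token(token):
    """Return a cleaned token, or None if the token should be discarded."""
    # Lowercase so that 'The' and 'the' are treated as the same token
    token = token.lower()
    # Discard stopwords
    if token in stop_words:
        return None
    # Discard tokens shorter than 4 characters
    if len(token) < 4:
        return None
    # Discard tokens containing digits, punctuation, or other non-letters
    if not token.isalpha():
        return None
    return token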
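
To see which tokens survive the cleaning rules, here is a small sketch that applies the same function to a handful of sample tokens. The stopwords set and the sample tokens are made up for illustration. A `set` is used instead of a list because membership checks are faster, which is an optional change rather than something this notebook requires.

```python
# A small, hypothetical stopwords set for illustration
stop_words = {"this", "that", "with"}


def process_token(token):
    """Return a cleaned token, or None if the token should be discarded."""
    token = token.lower()
    if token in stop_words:
        return None
    if len(token) < 4:
        return None
    if not token.isalpha():
        return None
    return token


sample_tokens = ["Whales", "This", "sea", "1851", "ship's", "voyage"]
cleaned = [process_token(token) for token in sample_tokens]

# Keep only the tokens that were not discarded
kept = [token for token in cleaned if token is not None]
print(kept)
```

In this sketch, "This" is dropped as a stopword, "sea" is too short, and "1851" and "ship's" contain non-alphabetic characters.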

In [None]:
%%time
# Limit to n documents. Set to None to use all documents.

limit = 5000

n = 0
documents = []
for document in dataset_reader(dataset_file):
    processed_document = []
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document += [clean_gram] * count # Add the unigram as many times as it was counted
    if len(processed_document) > 0:
        documents.append(processed_document)
    if n % 1000 == 0:
        print(f'Unigrams collected for {n} documents...')
    n += 1
    if (limit is not None) and (n >= limit):
        break
print(f'All unigrams collected for {n} documents.')
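
The loop above rebuilds each document from its unigram counts rather than from the full text. A minimal sketch of that step, with made-up counts for a single document:

```python
# Hypothetical unigram counts for one document (already cleaned)
unigram_counts = {"whale": 3, "ship": 1, "ocean": 2}

processed_document = []
for gram, count in unigram_counts.items():
    # Repeat each token as many times as it was counted
    processed_document += [gram] * count

print(processed_document)
```

The original word order is lost, which is acceptable here because the bag-of-words representation used for LDA only relies on how often each word occurs in a document.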

Build a gensim dictionary and a bag-of-words corpus, then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html).

In [None]:
# Build the gensim dictionary
dictionary = gensim.corpora.Dictionary(documents)

In [None]:
doc_count = len(documents)
num_topics = 7 # Change the number of topics
passes = 5 # The number of passes used to train the model
# Remove terms that appear in fewer than 50 documents and terms that occur in more than 90% of documents.
dictionary.filter_extremes(no_below=50, no_above=0.90)

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [None]:
bow_corpus[0]

In [None]:
%%time
# Train the LDA model
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=passes
)

## Perplexity

During training, gensim can report a "perplexity" score, derived from the held-out log-likelihood, when logging is enabled. Perplexity is a measure of how "surprised" the machine is to see certain data. In other words, perplexity measures how well a trained topic model predicts new data, and lower values are better. A model may be trained many times with different parameters to find the lowest possible perplexity.

In general, the perplexity score should trend downward as the machine "learns" what to expect from the data. While a low perplexity score may signal that the machine has learned the documents' patterns, it does not guarantee that the resulting topics will be the most coherent to a human reader. (See ["Reading Tea Leaves: How Humans Interpret Topic Models" Chang, et al. 2009](https://papers.nips.cc/paper/2009/hash/f92586a25bb3145facd64ab20fd554ff-Abstract.html).)
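
As a rough illustration of the idea, perplexity can be written as the exponential of the negative average log-probability that a model assigns to held-out tokens. The sketch below uses made-up probabilities for two hypothetical models. It is a generic definition of perplexity, not gensim's internal computation.

```python
import math


def perplexity(token_probabilities):
    """Perplexity = exp(-mean log-probability of the observed tokens)."""
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / len(token_probabilities))


# Hypothetical probabilities that two models assign to the same
# four held-out tokens
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

The model that assigns higher probabilities to the tokens it actually encounters is less "surprised" and therefore has the lower perplexity. In gensim, `model.log_perplexity(bow_corpus)` returns a per-word likelihood bound rather than the perplexity itself.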



## Topic Coherence

Because low perplexity scores do not consistently correspond to "good" topics, researchers have developed new measures of "topic coherence". Here we demonstrate one of these measures (UMass) with Gensim, but there are additional methods available. Ideally, a researcher would run many topic models to discover the optimum settings for topic coherence.

Ultimately, however, the best judge of topic coherence is a disciplinary expert, particularly someone familiar with the materials in question.

<font color='red'>Read more</font>

* ["Optimizing Semantic Coherence in Topic Models" Mimno, et al. 2011](http://dirichlet.net/pdf/mimno11optimizing.pdf)
* ["Automatic Evaluation of Topic Coherence" Newman, et al. 2010](https://mimno.infosci.cornell.edu/info6150/readings/N10-1012.pdf)


In [None]:
# Set up the coherence model using UMass
# u_mass scores are typically negative; values closer to 0 suggest more coherent topics
coherence_model_lda = CoherenceModel(
    model=model,
    corpus=bow_corpus,
    dictionary=dictionary, 
    coherence='u_mass'
)

# Retrieve the coherence score for the model
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
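
To give a feel for what a UMass-style score measures, here is a minimal hand-rolled sketch based on the formulation in Mimno, et al. 2011: for each pair of top words in a topic, take the log of the number of documents containing both words (plus 1) divided by the number of documents containing the earlier word. The documents and topics are made up, and the sketch assumes every top word appears in at least one document. Gensim's implementation differs in its details (for example, in how pair scores are aggregated), so these numbers are not comparable to the score printed above.

```python
import math

# Hypothetical documents, each reduced to its set of unique tokens
toy_docs = [
    {"whale", "ship", "ocean"},
    {"whale", "ship"},
    {"whale", "garden"},
    {"garden", "flower"},
]


def doc_frequency(*words):
    """Count the documents that contain every word in `words`."""
    return sum(1 for doc in toy_docs if all(word in doc for word in words))


def umass_score(top_words):
    """Sum log((D(later, earlier) + 1) / D(earlier)) over pairs of top words."""
    score = 0.0
    for later_index in range(1, len(top_words)):
        for earlier_index in range(later_index):
            later = top_words[later_index]
            earlier = top_words[earlier_index]
            co_occurrences = doc_frequency(later, earlier)
            score += math.log((co_occurrences + 1) / doc_frequency(earlier))
    return score


coherent_topic = ["whale", "ship"]   # these words share documents
mixed_topic = ["whale", "flower"]    # these words never share a document

print(umass_score(coherent_topic))
print(umass_score(mixed_topic))
```

The topic whose words appear together in the same documents scores closer to zero than the topic whose words never co-occur.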

## Display a List of Topics
Print the most significant terms, as determined by the model, for each topic.

In [None]:
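for topic_num in range(num_topics):
    # get_topic_terms returns (word_id, probability) pairs for the topic
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        # Look up the word by its id; dictionary[wid] builds the
        # id-to-token mapping on demand if it has not been created yet
        word = dictionary[wid]
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))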

## Visualize the Topic Distances

Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). The visualization will be saved as an HTML file in the data folder. (Right-click on the HTML file and choose "Open in New Browser Tab.")

Try choosing a topic and adjusting the λ slider. As λ approaches 0, the listed terms are those that occur almost exclusively in the selected topic. As λ approaches 1, the terms are ranked by how frequent they are within the topic, so the list may include words that are also common in other topics.
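
The effect of the λ slider can be sketched with the relevance measure described by Sievert and Shirley, the authors of LDAvis: relevance = λ · log p(word | topic) + (1 − λ) · log(p(word | topic) / p(word)). The probabilities below are made up for two hypothetical terms, and the helper names are assumptions for illustration rather than part of the `pyLDAvis` API.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities: (p(word | topic), p(word) across the corpus)
terms = {
    "study": (0.05, 0.04),     # frequent in the topic, but frequent everywhere
    "harpoon": (0.01, 0.001),  # rarer, but almost exclusive to the topic
}


def rank_terms(lam):
    """Return the terms ordered from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda term: relevance(*terms[term], lam), reverse=True)


for lam in (1.0, 0.0):
    print(f"lambda={lam}: {rank_terms(lam)}")
```

With λ = 1 the more frequent term ranks first, while with λ = 0 the term that is nearly exclusive to the topic moves to the top.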

In [None]:
# Export this visualization as an HTML file
# An internet connection is still required to view the HTML
# Newer pyLDAvis releases name this module pyLDAvis.gensim_models
p = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)
pyLDAvis.save_html(p, './data/my_visualization.html')