Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
**For questions/comments/improvements, email nathan.kelber@ithaka.org.**<br />
![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)
___

# Latent Dirichlet Allocation (LDA) Topic Modeling

This notebook demonstrates topic modeling with latent Dirichlet allocation (LDA). It covers the following steps:

* Importing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset)
* Checking the import was successful with `len()` and `query()`
* Importing libraries including `os`, `warnings`, `gensim`, `nltk`, and `pyLDAvis`
* Writing a helper function to clean up a single [token](https://docs.tdm-pilot.org/key-terms/#token)
* Building a gensim dictionary and training the model
* Computing a topic list
* Visualizing the topic list

This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library to build the topic model. Familiarity with gensim is helpful but not required.
____

In [None]:
import json
import os

import warnings
warnings.filterwarnings('ignore')

In [None]:
import gensim
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import pyLDAvis.gensim

pyLDAvis.enable_notebook()

stop_words = set(stopwords.words('english'))

Initialize a dataset object. 

In [None]:
import tdm_client
tdm_client.get_dataset("f6ae29d4-3a70-36ee-d601-20a8c0311273", "shakespeareQuarterly")

Define a function that processes a single token from the extracted features for volumes in the curated dataset. The function returns `None` for any token that should be discarded. It:

* lowercases the token
* discards tokens shorter than 4 characters
* discards non-alphabetic tokens, e.g. `--9`
* discards stop words using NLTK's stop word list
* lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

In [None]:
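# Create the lemmatizer once instead of on every call.
lemmatizer = WordNetLemmatizer()

def process_token(token):
    """Return a cleaned, lemmatized token, or None if it should be discarded."""
    token = token.lower()
    if len(token) < 4:  # too short
        return None
    if not token.isalpha():  # contains digits or punctuation
        return None
    if token in stop_words:  # common English stop word
        return None
    return lemmatizer.lemmatize(token)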
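
To see what these filtering rules do to individual tokens, here is a minimal standalone sketch. It assumes a tiny, hypothetical stop word set (`DEMO_STOP_WORDS`) and skips lemmatization so that it runs without downloading any NLTK data. The name `filter_token` is illustrative only and is not part of this notebook.

```python
# Hypothetical stand-in for NLTK's English stop word list (assumption).
DEMO_STOP_WORDS = {"this", "that", "with", "from", "have"}


def filter_token(token, stop_words=DEMO_STOP_WORDS):
    """Apply the same filters as process_token, minus lemmatization."""
    token = token.lower()
    if len(token) < 4:          # too short
        return None
    if not token.isalpha():     # contains digits or punctuation
        return None
    if token in stop_words:     # stop word
        return None
    return token


samples = ["Hamlet", "the", "--9", "With", "1603", "tragedy"]
kept = [t for t in (filter_token(s) for s in samples) if t is not None]
print(kept)  # ['hamlet', 'tragedy']
```

Only `Hamlet` and `tragedy` survive: the other samples are too short, not alphabetic, or in the stop word set.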

Loop through the documents in the dataset and build a gensim dictionary from the tokens in each document. For demonstration purposes, we limit processing to the first 25 documents in the dataset.

In [None]:
dictionary = gensim.corpora.Dictionary()
documents = []  # cleaned token lists, reused below to build the corpus
doc_count = 0
# Limit the number of documents; set to None to process all of them.
limit_to = 25

with open("./datasets/shakespeareQuarterly.jsonl") as input_file:
    for line in input_file:
        doc = json.loads(line)
        unigram_count = doc["unigramCount"]
        document_tokens = []
        for token, count in unigram_count.items():
            clean_token = process_token(token)
            if clean_token is None:
                continue
            # Repeat each token once per occurrence in the document.
            document_tokens += [clean_token] * count
        documents.append(document_tokens)
        dictionary.add_documents([document_tokens])
        doc_count += 1
        if (limit_to is not None) and (doc_count >= limit_to):
            break
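
The loop above turns each document's unigram counts back into a flat list of tokens. The following sketch shows that expansion step in isolation, using only the standard library. The two records and their `id` values are invented for illustration; only the `unigramCount` field mirrors the dataset file, and an in-memory `io.StringIO` stands in for the JSON Lines file.

```python
import io
import json

# Two invented records in JSON Lines form (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"id": "doc-1", "unigramCount": {"hamlet": 2, "ghost": 1}}\n'
    '{"id": "doc-2", "unigramCount": {"ghost": 3}}\n'
)

expanded_docs = []
for line in sample_jsonl:
    record = json.loads(line)
    tokens = []
    for token, count in record["unigramCount"].items():
        # Repeat each token as many times as it was counted.
        tokens += [token] * count
    expanded_docs.append(tokens)

print(expanded_docs)
# [['hamlet', 'hamlet', 'ghost'], ['ghost', 'ghost', 'ghost']]
```

Word order within a document is lost in this representation, which is acceptable here because a bag-of-words model only uses term counts.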
                

Build a gensim corpus and then train the model.

In [None]:
num_topics = 10

# Remove terms that appear in fewer than 10% of documents or in more than 75% of documents.
dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)

bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train the LDA model.
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=15
)
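
To make the two filtering thresholds concrete, here is a rough pure-Python sketch of document-frequency filtering on a toy corpus. It only approximates what `filter_extremes` does: gensim's own implementation also applies a `keep_n` cap and may round the upper threshold differently. The documents and variable names are invented for illustration.

```python
from collections import Counter

# Toy corpus: each inner list is one document's tokens (invented data).
docs = [
    ["king", "ghost", "play"],
    ["king", "queen", "play"],
    ["king", "sword", "play"],
    ["king", "ghost", "crown"],
]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

no_below = 2      # keep terms found in at least 2 documents
no_above = 0.75   # drop terms found in more than 75% of documents

kept = sorted(
    term for term, df in doc_freq.items()
    if no_below <= df <= no_above * len(docs)
)
print(kept)  # ['ghost', 'play']
```

Here `king` is dropped because it occurs in every document, while `queen`, `sword`, and `crown` are dropped because each occurs in only one.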


Print the most significant terms, as determined by the model, for each topic.

In [None]:
for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        word = dictionary[wid]  # safe lookup; id2token is only populated lazily
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))
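
The loop above relies on `get_topic_terms` returning a list of `(word_id, weight)` pairs for a topic. The sketch below mimics that shape with invented values and a hypothetical `id_to_word` mapping in place of the gensim dictionary, to show how the ids are turned into a printable line of words.

```python
# Invented (word_id, weight) pairs in the shape get_topic_terms returns.
topic_terms = [(3, 0.042), (0, 0.031), (7, 0.027)]

# Hypothetical id-to-word mapping standing in for the gensim dictionary.
id_to_word = {0: "ghost", 3: "king", 7: "crown"}

# Keep the model's ordering (highest weight first) and drop the weights.
words = [id_to_word[word_id] for word_id, _weight in topic_terms]
line = "Topic {} {}".format(str(0).ljust(5), " ".join(words))
print(line)
```

The weights are discarded here, as in the cell above; keeping them would let you print each term's probability alongside the word.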

Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). Depending on the size of your dataset, this visualization can take from several minutes to an hour to generate. To run it, remove the `#` at the start of the line below and run the cell.

In [None]:
# pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)