# Latent Dirichlet Allocation (LDA) Topic Modeling

This notebook demonstrates topic modeling with latent Dirichlet allocation (LDA). It covers the following steps:

* Importing your [dataset](./key-terms.ipynb#dataset)
* Checking that the import succeeded with `len()` and `query()`
* Importing libraries including `os`, `warnings`, `gensim`, `nltk`, and `pyLDAvis`
* Writing a helper function to clean up a single [token](./key-terms.ipynb#token)
* Building a gensim dictionary and training the model
* Computing a topic list
* Visualizing the topic list

This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library to build the topic model. Familiarity with gensim is helpful but not required.
____

In [1]:
import os

import warnings
#warnings.filterwarnings('ignore')

In [2]:
import gensim
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Note: newer pyLDAvis releases rename this module to `pyLDAvis.gensim_models`.
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

stop_words = set(stopwords.words('english'))

Initialize a dataset object using the dataset's ID.

In [3]:
from tdm_client import Dataset

dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

Print the text of the query that built this dataset.

In [4]:
dset.query_text()

'All documents from JSTOR published in Shakespeare Quarterly from 1700 - 2019'

Find the total number of documents in the dataset using the `len()` function.

In [5]:
len(dset)

6687

Print the raw, URL-encoded query string for this dataset.

In [6]:
dset.query()

'q=*%3A*&fq=yearPublished%3A%5B1700%20TO%202019%5D&fq=-provider%3Aportico&fq=isPartOf%3A(%22Shakespeare%20Quarterly%22)'

Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:

* lowercases the token
* discards tokens shorter than 4 characters
* discards non-alphabetic tokens (e.g. `--9`)
* discards stopwords, using NLTK's stopword list
* lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

In [7]:
def process_token(token):
    token = token.lower()
    if len(token) < 4:
        return
    if not(token.isalpha()):
        return
    if token in stop_words:
        return
    return WordNetLemmatizer().lemmatize(token)
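
To see which tokens survive the filtering rules above, here is a minimal, dependency-free sketch. It assumes a tiny hand-written stopword set (`DEMO_STOP_WORDS`) in place of NLTK's list and skips lemmatization, so it illustrates only the filtering steps. `filter_token` and the sample tokens are hypothetical and not part of the notebook.

```python
# Hypothetical stand-in for NLTK's English stopword list (assumption).
DEMO_STOP_WORDS = {"that", "this", "with", "from", "have"}


def filter_token(token):
    """Apply the same filters as process_token, minus lemmatization."""
    token = token.lower()
    if len(token) < 4:           # too short
        return None
    if not token.isalpha():      # contains digits or punctuation
        return None
    if token in DEMO_STOP_WORDS:  # stopword
        return None
    return token


# Made-up sample tokens, one for each rule.
samples = ["Shylock", "the", "--9", "That", "Venice", "act2"]
kept = [t for t in map(filter_token, samples) if t is not None]
print(kept)
```

Only the tokens that pass every check are kept; each rejected token returns `None`, which is how the loop in the next cell knows to skip it.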

Loop through the volumes in the dataset, build a list of tokens for each volume, and append it to the list of documents. This example stops after 25 documents to keep the run time short during demonstrations.

In [8]:
documents = []

for n, unigram_count in enumerate(dset.get_features()):
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    documents.append(this_doc)
    if n >= 24:
        break
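
The expression `[clean_token] * count` in the loop above turns a unigram count back into repeated tokens. A small sketch with made-up counts (the `unigram_count` values here are hypothetical, not taken from the dataset) shows the effect:

```python
# Hypothetical unigram counts for one document (made-up data).
unigram_count = {"lear": 3, "king": 2, "edgar": 1}

this_doc = []
for token, count in unigram_count.items():
    # [token] * count repeats the token once per occurrence.
    this_doc += [token] * count

print(this_doc)
```

The original word order is not recovered, which is acceptable here because the bag-of-words representation built in the next step only uses token counts.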
                

Build a gensim dictionary and corpus and then train the model.

In [9]:
num_topics = 10

dictionary = gensim.corpora.Dictionary(documents)

dictionary.filter_extremes(no_below=len(documents) * .10, no_above=0.5)

bow_corpus = [dictionary.doc2bow(doc) for doc in documents]


# train model, this might take some time
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=15
)
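
The `filter_extremes` call above prunes tokens by document frequency: `no_below` is an absolute number of documents and `no_above` is a fraction of the corpus. The following pure-Python sketch approximates that idea on toy data. It is not gensim's implementation (its exact rounding and its `keep_n` cap are not reproduced), and `filter_by_document_frequency` and `toy_documents` are hypothetical names.

```python
from collections import Counter


def filter_by_document_frequency(documents, no_below, no_above):
    """Return the tokens whose document frequency passes both thresholds.

    no_below is an absolute document count; no_above is a fraction of
    the corpus. This approximates the idea only, not gensim's code.
    """
    num_docs = len(documents)
    # Count each token once per document it appears in.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if no_below <= df <= no_above * num_docs
    }


# Made-up corpus of four tiny documents.
toy_documents = [
    ["king", "lear", "edgar"],
    ["king", "lear"],
    ["king", "macbeth"],
    ["king", "lady", "macbeth"],
]
kept = filter_by_document_frequency(toy_documents, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus, `king` appears in every document and is dropped as too common, while `edgar` and `lady` appear only once and are dropped as too rare. Under the same reading, the notebook's settings for 25 documents (`no_below=2.5`, `no_above=0.5`) would keep tokens found in at least 3 and at most 12 documents.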


Print the most significant terms, as determined by the model, for each topic.

In [10]:
for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        # Index the dictionary directly; `id2token` is filled lazily and may be empty.
        word = dictionary[wid]
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))

Topic 0     bassanio shylock merchant money venice shall exchange portia christian love
Topic 1     lear king portrait edgar sexual seemed production male head poor
Topic 2     critical woman male view female question student london comedy dryden
Topic 3     date booth paper manuscript macbeth tragedy effect experience idea find
Topic 4     portrait head state folio picture body painted image drawing copy
Topic 5     henry richard production presented royal principal running used repertory france
Topic 6     lear king edgar form report sense saying nothing meaning worst
Topic 7     troilus folger public quarto version private robert secret back folio
Topic 8     macbeth production line lady appeared seemed almost used problem came
Topic 9     sexual language gloss literary seem although frequently reference already claim


Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). Generating this visualization can take from several minutes to an hour, depending on the size of your dataset.

In [12]:
pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)

