# LDA topic modeling with a curated dataset

This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library for building the topic model.


In [20]:
import os

import warnings
warnings.filterwarnings('ignore')

In [21]:
import gensim
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import pyLDAvis.gensim

pyLDAvis.enable_notebook()

stop_words = set(stopwords.words('english'))

Initialize a dataset object with the ID of the curated dataset.

In [22]:
from tdm_client import Dataset

dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

In [23]:
len(dset)

6687

In [24]:
dset.query()

'q=*%3A*&fq=yearPublished%3A%5B1700%20TO%202019%5D&fq=-provider%3Aportico&fq=isPartOf%3A(%22Shakespeare%20Quarterly%22)'
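The query string above is percent-encoded, which makes it hard to read. As a small aside, the standard library can decode it; this sketch only assumes the string is copied verbatim from the output above.

```python
from urllib.parse import parse_qsl

# The encoded string returned by dset.query() above.
raw_query = (
    "q=*%3A*&fq=yearPublished%3A%5B1700%20TO%202019%5D"
    "&fq=-provider%3Aportico"
    "&fq=isPartOf%3A(%22Shakespeare%20Quarterly%22)"
)

# parse_qsl splits the string into key/value pairs and percent-decodes them.
decoded = parse_qsl(raw_query)
for key, value in decoded:
    print(key, "=", value)
```

The decoded pairs read like a match-everything query (`*:*`) with three filters: a publication year range of 1700 to 2019, an exclusion of the `portico` provider, and the journal "Shakespeare Quarterly".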

Define a function that cleans each token from the extracted features of the volumes in the curated dataset. The function:

* lowercases the token
* discards tokens shorter than 4 characters
* discards non-alphabetic tokens, e.g. `--9`
* discards stop words found in NLTK's English stop word list
* lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

In [25]:
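# Reuse one lemmatizer rather than constructing a new one for every token.
lemmatizer = WordNetLemmatizer()

def process_token(token):
    """Return the cleaned, lemmatized token, or None if it should be discarded."""
    token = token.lower()
    if len(token) < 4:
        return None
    if not token.isalpha():
        return None
    if token in stop_words:
        return None
    return lemmatizer.lemmatize(token)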
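To see which tokens survive the filtering rules in `process_token`, here is a dependency-free sketch. It is not the function above: `TOY_STOP_WORDS` and `keep_token` are hypothetical stand-ins, the stop list is a tiny sample rather than NLTK's, and the lemmatization step is left out.

```python
# Hypothetical miniature stop list, standing in for NLTK's English list.
TOY_STOP_WORDS = {"this", "that", "with", "from"}

def keep_token(token):
    """Apply only the lowercase, length, alphabetic, and stop word rules."""
    token = token.lower()
    if len(token) < 4 or not token.isalpha() or token in TOY_STOP_WORDS:
        return None
    return token

samples = ["Hamlet", "the", "--9", "With", "sonnets", "1603"]
kept = [t for t in map(keep_token, samples) if t is not None]
print(kept)
```

Only `hamlet` and `sonnets` pass. The real `process_token` would additionally lemmatize what remains, which typically reduces a plural such as `sonnets` to its singular form.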

Loop through the volumes in the dataset, build a list of cleaned tokens for each volume, and append that list to a list of documents. The loop stops after 25 documents to limit the run time during demonstrations.

In [26]:
documents = []

for n, unigram_count in enumerate(dset.get_features()):
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    documents.append(this_doc)
    if n >= 24:
        break
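The loop above does two things: it expands each `{token: count}` mapping into a flat token list, and it caps the number of volumes read. A minimal sketch of both ideas on toy data follows. `toy_features` is a hypothetical stand-in for `dset.get_features()`, and `itertools.islice` is shown as one alternative to the `enumerate` and `break` pattern, not as what the notebook uses.

```python
from collections import Counter
from itertools import islice

def toy_features():
    """Hypothetical stand-in: yields one {token: count} dict per volume."""
    yield {"hamlet": 2, "ghost": 1}
    yield {"sonnet": 3}
    yield {"juliet": 1, "romeo": 1}

MAX_DOCS = 2  # plays the role of the 25-document cap
toy_documents = []
for unigram_count in islice(toy_features(), MAX_DOCS):
    this_doc = []
    for token, count in unigram_count.items():
        # Repeat the token once per occurrence so term frequencies are preserved.
        this_doc += [token] * count
    toy_documents.append(this_doc)

print(toy_documents)
# Counting the expanded tokens recovers the original unigram counts.
print(Counter(toy_documents[0]))
```

Word order within a volume is lost in this representation, which is fine for a bag-of-words model such as LDA.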
                

Build a gensim dictionary and a bag-of-words corpus, then train the LDA model.

In [27]:
num_topics = 10

dictionary = gensim.corpora.Dictionary(documents)

dictionary.filter_extremes(no_below=len(documents) * .10, no_above=0.5)

bow_corpus = [dictionary.doc2bow(doc) for doc in documents]


# train model, this might take some time
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=15
)
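With 25 documents, `no_below=len(documents) * .10` evaluates to 2.5, so a token has to occur in at least 3 documents to stay in the dictionary, and `no_above=0.5` drops tokens found in more than half of them. The sketch below imitates that filtering and the bag-of-words conversion using only the standard library. It assumes a token is kept when its document frequency is at least `no_below` and at most `no_above` times the number of documents. The toy documents, thresholds, and the `to_bow` helper are illustrative, and gensim assigns token ids differently.

```python
from collections import Counter

toy_documents = [
    ["hamlet", "ghost", "stage"],
    ["hamlet", "sonnet", "stage"],
    ["hamlet", "romeo", "stage", "stage"],
    ["hamlet", "romeo", "sonnet"],
]
no_below = 2     # keep tokens found in at least 2 documents
no_above = 0.75  # and in at most 75% of the documents

# Document frequency: the number of documents that contain each token.
doc_freq = Counter(token for doc in toy_documents for token in set(doc))
n_docs = len(toy_documents)
vocab = sorted(
    t for t, df in doc_freq.items()
    if df >= no_below and df / n_docs <= no_above
)
token2id = {t: i for i, t in enumerate(vocab)}  # simplified id assignment

def to_bow(doc):
    """Return sorted (token_id, count) pairs for tokens still in the vocabulary."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())

print(vocab)
print([to_bow(doc) for doc in toy_documents])
```

Here `hamlet` is removed for being in every document and `ghost` for being in only one, so neither contributes to any bag-of-words vector.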


Print the most significant terms, as determined by the model, for each topic.

In [28]:
for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        # dictionary[wid] builds the id-to-token mapping on demand; id2token alone may be empty.
        word = dictionary[wid]
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))

Topic 0     periodical theatrical national magazine popular victorian theater chapter debate published
Topic 1     professor online currently hopkins author edited issue editor modern four
Topic 2     sonnet juliet romeo midsummer film author jackson festival dream theatrical
Topic 3     hamlet russian festival romeo sonnet periodical theater chapter national theatrical
Topic 4     festival hamlet russian memphis would king twelfth gave mask several
Topic 5     romeo midsummer juliet dream scene fact space packer tina comedy
Topic 6     film jackson romeo dream vision another midsummer would juliet better
Topic 7     sonnet poem author poet possibility edition stephen especially open rather
Topic 8     midsummer author festival sonnet juliet romeo professor online dream four
Topic 9     film festival jackson romeo hamlet juliet midsummer periodical national russian
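Several of the topics printed above repeat the same terms. One rough way to quantify that is to count the terms two topics have in common. This sketch copies the top terms of three topics from the output above; `top_terms` and `shared_terms` are hypothetical names, not part of gensim.

```python
from itertools import combinations

# Top terms copied from the printed output above (three topics only).
top_terms = {
    2: "sonnet juliet romeo midsummer film author jackson festival dream theatrical".split(),
    7: "sonnet poem author poet possibility edition stephen especially open rather".split(),
    8: "midsummer author festival sonnet juliet romeo professor online dream four".split(),
}

def shared_terms(a, b):
    """Return the sorted terms that appear in both topics' top-term lists."""
    return sorted(set(top_terms[a]) & set(top_terms[b]))

for a, b in combinations(sorted(top_terms), 2):
    common = shared_terms(a, b)
    print("Topics {} and {} share {} terms: {}".format(a, b, len(common), " ".join(common)))
```

Topics 2 and 8 share seven of their ten top terms, while topic 7 shares only two terms with either of them.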


Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). Generating this visualization can take from several minutes to an hour, depending on the size of your dataset. The cell below runs it only when `do_viz` is `True`.

In [31]:
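do_viz = True
if do_viz:
    # An expression nested in an `if` block is not rendered automatically,
    # so pass the prepared visualization to IPython's display() explicitly.
    from IPython.display import display
    display(pyLDAvis.gensim.prepare(model, bow_corpus, dictionary))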