# LDA topic modeling with a curated dataset

This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library for building the topic model.

Depending on the size of your collection, this example can take anywhere from 10 minutes to more than an hour to run.

In [1]:
import os

import warnings
warnings.filterwarnings('ignore')

In [2]:
import gensim
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

stop_words = set(stopwords.words('english'))

Initialize a dataset object. 

In [3]:
from tdm_core.client import Dataset

dset = Dataset('bb3d938b-bc61-4c2c-a21c-9a4f102035c8')

In [4]:
len(dset)

61

In [5]:
dset.query()

'q=%22walt%20whitman%22%20brooklyn&fq=yearPublished%3A%5B1700%20TO%202019%5D&fq=outputFormat%3Aunigrams'

Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:

* lowercases all tokens
* discards tokens shorter than 4 characters
* discards non-alphabetic tokens, e.g. `--9`
* removes stopwords using NLTK's stopword list
* lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

In [6]:
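# Create the lemmatizer once rather than on every call.
lemmatizer = WordNetLemmatizer()

def process_token(token):
    """Return the normalized token, or None if it should be discarded."""
    token = token.lower()
    if len(token) < 4:
        return None
    if not token.isalpha():
        return None
    if token in stop_words:
        return None
    return lemmatizer.lemmatize(token)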
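
As a rough illustration of which tokens survive the filtering rules listed above, the sketch below applies the same checks to a few sample tokens. To stay runnable without NLTK data, it uses a tiny hand-picked stopword set (`DEMO_STOP_WORDS`) and skips lemmatization; both are assumptions made for this illustration, not part of the notebook.

```python
# Hypothetical stand-in for NLTK's English stopword list.
DEMO_STOP_WORDS = {"with", "from", "that", "this"}

def keep_token(token):
    """Apply the lowercase, length, alphabetic, and stopword checks only."""
    token = token.lower()
    if len(token) < 4 or not token.isalpha() or token in DEMO_STOP_WORDS:
        return None
    return token

samples = ["Brooklyn", "the", "--9", "With", "leaves", "1855"]
kept = [t for t in (keep_token(s) for s in samples) if t is not None]
print(kept)
```

Only `brooklyn` and `leaves` pass all four checks; the other samples are too short, non-alphabetic, or in the stand-in stopword set.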

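Loop through the volumes in the dataset, build a list of tokens for each volume, and append that list to a list of all documents in the dataset. Each token is repeated once per occurrence, so the token counts are preserved. The loop stops once volume index 25 has been processed, so at most 26 volumes are used.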

In [7]:
dset_items = [d for d in dset]
documents = []    

In [9]:
for doc_n, volume in enumerate(dset_items):
    this_doc = []
    try:
        pages = volume['features']['pages']
    except (KeyError, TypeError):
        # Skip volumes that have no extracted features.
        continue
    for page in pages:
        body = page.get('body')
        if body is None:
            continue
        for token, pos_count in body.get('tokenPosCount', {}).items():
            t = process_token(token)
            if t is None:
                continue
            # Sum the counts over all part-of-speech tags and
            # repeat the token that many times.
            this_doc.extend([t] * sum(pos_count.values()))
    documents.append(this_doc)
    # Stop after volume index 25 (at most 26 volumes).
    if doc_n >= 25:
        break
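
The loop above relies on the nested layout of each volume's extracted features. The sketch below runs the same expansion on a small hand-made volume. The dictionary contents are invented for illustration; only the key names (`features`, `pages`, `body`, `tokenPosCount`) come from the code above, and token filtering is omitted. `expand_tokens` is a hypothetical helper.

```python
# Invented sample volume that mirrors the keys accessed in the loop above.
sample_volume = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"leaves": {"NNS": 2}, "grass": {"NN": 1}}}},
            {"body": None},  # a page with no body section is skipped
            {"body": {"tokenPosCount": {"grass": {"NN": 1, "VB": 1}}}},
        ]
    }
}

def expand_tokens(volume):
    """Repeat each token once per occurrence, summed over POS tags."""
    tokens = []
    for page in volume["features"]["pages"]:
        body = page.get("body")
        if body is None:
            continue
        for token, pos_count in body.get("tokenPosCount", {}).items():
            tokens.extend([token] * sum(pos_count.values()))
    return tokens

doc = expand_tokens(sample_volume)
print(doc)
```

Each token appears in the resulting list as many times as it occurs in the volume, which is the form the dictionary and bag-of-words steps expect.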
                

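Build a gensim dictionary, drop tokens that appear in fewer than 10 documents or in more than 50% of them, convert each document to a bag-of-words vector, and then train a 5-topic LDA model.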

In [10]:
dictionary = gensim.corpora.Dictionary(documents)

In [11]:
dictionary.filter_extremes(no_below=10, no_above=0.5)

bow_corpus = [dictionary.doc2bow(doc) for doc in documents]


# train model, this might take some time
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=5,
    passes=15
)
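
To make the dictionary filtering and bag-of-words steps more concrete, here is a pure-Python sketch of the idea on a toy corpus. It does not call gensim and is not gensim's implementation; `filter_by_doc_freq` and `to_bow` are hypothetical helpers, and the thresholds are scaled down to fit three documents.

```python
from collections import Counter

# Invented toy corpus: one token list per document.
toy_docs = [
    ["river", "ferry", "ferry", "crowd"],
    ["ferry", "crowd", "poem"],
    ["crowd", "poem", "grass"],
]

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more
    than a `no_above` fraction of all documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

def to_bow(doc, vocab):
    """Count occurrences of each vocabulary token as (token_id, count) pairs."""
    ids = {token: i for i, token in enumerate(vocab)}
    counts = Counter(ids[t] for t in doc if t in ids)
    return sorted(counts.items())

vocab = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.7)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

In this toy corpus `crowd` is dropped because it appears in every document, while `river` and `grass` are dropped because they each appear in only one. The remaining tokens are what each document's bag-of-words vector counts.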


Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/).

In [12]:
pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)