# LAB 4: Topic modeling

Use topic models to explore hotel reviews

Objectives:
* tokenize with MWEs using spaCy
* estimate LDA topic models with tomotopy
* visualize and evaluate topic models
* apply topic models to interpretation of hotel reviews

## Build topic model

In [None]:
import pandas as pd
import numpy as np
from cytoolz import *
from tqdm.auto import tqdm
tqdm.pandas()

### Read in hotel review data and tokenize it

In [None]:
df = pd.read_parquet('hotels.parquet')

In [None]:
from tokenizer import MWETokenizer

tokenizer = MWETokenizer(open('hotel-terms.txt'))
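
The internals of `MWETokenizer` live in the course's `tokenizer` module and are not shown here. As a rough illustration of what multi-word expression (MWE) tokenization does, the sketch below greedily joins known phrases into single tokens. The function name `merge_mwes`, the `_` joiner, and the example phrase list are all hypothetical and are not taken from that module.

```python
def merge_mwes(tokens, mwes, joiner="_"):
    """Greedily merge the longest known multi-word expression at each position."""
    max_len = max((len(m) for m in mwes), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two-word phrases
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwes:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical phrase list; the real one would come from hotel-terms.txt
mwes = {("front", "desk"), ("room", "service"), ("air", "conditioning")}
merged = merge_mwes("the front desk sent room service".split(), mwes)
print(merged)
```

Treating phrases such as "front desk" as one token keeps them together in the topic model instead of splitting their counts across unrelated words.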

Select a sample of reviews to work with (replace x's below with the sample size; you should use at least 50,000 reviews)

In [None]:
subdf = df.sample(xxxxx)

In [None]:
subdf['tokens'] = pd.Series(subdf['text'].progress_apply(tokenizer.tokenize))

### Estimate model

In [None]:
import tomotopy as tp
import time

These are the model **hyperparameters**: aspects of the model that aren't estimated from the data but have to be set in advance by the analyst. There are no "right" values for these; try out different values until you find settings that give you a model you can interpret:

* *k* = number of topics
* *min_df* = minimum number of reviews that a term has to occur in to be included in the model
* *rm_top* = number of most frequent terms to remove from the model
* *tw* = term weighting strategy (described [here](https://bab2min.github.io/tomotopy/v0.10.1/en/#tomotopy.TermWeight))
* *alpha*, *eta* = priors for document-topic and topic-word distributions
* *tol* = convergence tolerance
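
To get a feel for what the *alpha* prior does, the sketch below draws document-topic distributions from a symmetric Dirichlet and measures how much of each document its largest topic claims. It is only an illustration of the prior, not of tomotopy's sampler; the helper names and the comparison value `10.0` are assumptions chosen for the demonstration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=20, n_docs=2000, seed=0):
    """Average share of the single largest topic across simulated documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs


sparse = mean_largest_share(0.1)   # small alpha: a few topics dominate each document
smooth = mean_largest_share(10.0)  # large alpha: topics are spread almost evenly
print(f'alpha=0.1  -> {sparse:.2f}')
print(f'alpha=10.0 -> {smooth:.2f}')
```

Small values of *alpha* push each review toward a handful of topics, which is usually what you want for short texts. *eta* plays the same role for the words within a topic.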


In [None]:
k = 20
min_df = 100
rm_top = 75
tw = tp.TermWeight.ONE
alpha = 0.1
eta = 0.01
tol = 1e-3
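
As a rough picture of how `min_df` and `rm_top` shape the vocabulary, the sketch below applies both filters to a toy corpus. It is a simplified stand-in: tomotopy's exact ordering and tie-breaking may differ, and `filter_vocab` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def filter_vocab(docs, min_df, rm_top):
    """Return (removed_top_terms, kept_terms) after document-frequency filtering."""
    term_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()   # number of documents each term appears in
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    # Keep terms that occur in at least min_df documents
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Sort by descending frequency, then drop the rm_top most frequent terms
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:rm_top], kept[rm_top:]


toy_docs = [
    ['hotel', 'clean', 'room'],
    ['hotel', 'noisy', 'room'],
    ['hotel', 'pool'],
]
removed, vocab = filter_vocab(toy_docs, min_df=2, rm_top=1)
print('removed:', removed)
print('vocab:  ', vocab)
```

Rare terms add noise and very frequent terms appear in every topic, so trimming both ends tends to give topics that are easier to read.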

Here's where we do the inference. The documentation for `LDAModel` is [here](https://bab2min.github.io/tomotopy/v0.10.1/en/#tomotopy.LDAModel). You might also consider trying out one of the other model types (e.g., `HDPModel`).

In [None]:
%%time

mdl = tp.LDAModel(k=k, min_df=min_df, rm_top=rm_top, tw=tw, alpha=alpha, eta=eta)

for doc in subdf['tokens']:
    if doc:
        mdl.add_doc(doc)

last = -np.inf  # np.NINF was removed in NumPy 2.0
for i in range(0, 5000, 50):
    mdl.train(50)
    ll = mdl.ll_per_word
    print(f'{i:5d} LL = {ll:7.4f}', flush=True)
    # Stop once the per-word log-likelihood improves by less than tol
    if ll - last < tol:
        break
    last = ll

print('Done!')
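
The stopping rule in the training loop can be tried out on its own. The sketch below applies the same test to a made-up sequence of log-likelihood values; `first_converged` and the numbers are illustrative only.

```python
def first_converged(ll_values, tol):
    """Return the index of the first value whose gain over the previous one is below tol."""
    last = float('-inf')
    for step, ll in enumerate(ll_values):
        if ll - last < tol:
            return step
        last = ll
    return None  # never converged within the values given


# Made-up per-word log-likelihoods, one per batch of training iterations
history = [-9.0, -8.5, -8.4, -8.3995]
stop_at = first_converged(history, tol=1e-3)
print(stop_at)
```

Note that the rule also stops when the log-likelihood goes down, because a negative gain is smaller than `tol`. A larger `tol` ends training sooner, while a smaller one trains longer for diminishing improvements.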

### Evaluate the model

What terms are associated with each topic?

In [None]:
# Use a separate loop variable so the hyperparameter k is not overwritten
for topic in range(mdl.k):
    print(f'{topic:3d} ', ', '.join(s for s,_ in mdl.get_topic_words(topic)))

Which terms were removed due to `rm_top`?

In [None]:
', '.join(mdl.removed_top_words)

Visualize topic model with LDAvis

In [None]:
import pyLDAvis

topic_term_dists = np.stack([mdl.get_topic_word_dist(k)
                             for k in range(mdl.k)])
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq
prepared_data = pyLDAvis.prepare(topic_term_dists,
                                 doc_topic_dists,
                                 doc_lengths,
                                 vocab,
                                 term_frequency, 
                                 mds='tsne', 
                                 sort_topics=False
                                 )
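
pyLDAvis checks that every row of the two distribution matrices sums to 1. If `prepare` complains about this because of floating-point rounding, renormalizing the rows first is one possible workaround. The sketch below shows the idea on a toy matrix; `normalize_rows` is a hypothetical helper.

```python
import numpy as np


def normalize_rows(dists):
    """Rescale each row so that it sums to exactly 1."""
    dists = np.asarray(dists, dtype=float)
    return dists / dists.sum(axis=1, keepdims=True)


# Toy matrix: the first row is slightly off, the second is unnormalized counts
raw = np.array([[0.2, 0.3, 0.5001],
                [2.0, 1.0, 1.0]])
fixed = normalize_rows(raw)
print(fixed.sum(axis=1))
```

In the notebook this would be applied to `doc_topic_dists` (and, if needed, `topic_term_dists`) before calling `pyLDAvis.prepare`.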

In [None]:
pyLDAvis.display(prepared_data)

Find documents that best represent each topic

In [None]:
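# mdl.docs holds only the non-empty reviews added above, so index into that same subset
texts = subdf['text'][subdf['tokens'].map(bool)]
for i,d in enumerate(np.argmax(doc_topic_dists, axis=0)):
    print(i, ', '.join(map(first, mdl.get_topic_words(i))))
    print(texts.iloc[d])
    print()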
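
The `axis` argument decides which question `argmax` answers on a documents-by-topics matrix. The toy example below, with made-up numbers, contrasts the two directions.

```python
import numpy as np

# Rows are documents, columns are topics (toy values)
doc_topic = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
    [0.1, 0.9],
])

# axis=0 scans down each column: which document scores highest for each topic?
best_doc_per_topic = np.argmax(doc_topic, axis=0)

# axis=1 scans across each row: which topic dominates each document?
dominant_topic_per_doc = np.argmax(doc_topic, axis=1)

print(best_doc_per_topic)
print(dominant_topic_per_doc)
```

The cell above uses `axis=0`, so it prints one review per topic: the review in which that topic has its largest share.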

Generate word clouds for topics

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(15,15))
topic_id = 0  # choose any topic number in range(mdl.k)
freqs = dict(mdl.get_topic_words(topic_id, top_n=200))
wc = WordCloud(width=1000,height=1000,background_color='white').generate_from_frequencies(freqs)
plt.axis('off')
plt.imshow(wc, interpolation='bilinear')
plt.show()

### Save the final, best model

In [None]:
mdl.save('hotel-topics.bin')