# LAB 4: Topic modeling

Use topic models to explore hotel reviews

Objectives:
* tokenize with MWEs using spaCy
* estimate LDA topic models with tomotopy
* visualize and evaluate topic models
* apply topic models to interpretation of hotel reviews

## Analyze reviews

In [None]:
import pandas as pd
import numpy as np
from cytoolz import *
from tqdm.auto import tqdm
tqdm.pandas()

### Read in hotel review data and tokenize it

In [None]:
df = pd.read_parquet('hotels.parquet')

In [None]:
import tomotopy as tp

mdl = tp.LDAModel.load('hotel-topics.bin')

In [None]:
df[df['overall']==1]['name'].value_counts().head(20)

Pick a hotel with a lot of 1-star ratings (other than the Paradise Point Resort & Spa) and pull out all of its reviews.

In [None]:
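# Replace 'xxxxx' with the name of the hotel you picked.
# .copy() gives an independent DataFrame, so adding columns below
# does not trigger pandas' SettingWithCopyWarning.
subdf = df[df['name']=='xxxxx'].copy()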

In [None]:
subdf['overall'].value_counts()

Tokenize

In [None]:
from tokenizer import MWETokenizer

tokenizer = MWETokenizer(open('hotel-terms.txt'))

In [None]:
subdf['tokens'] = subdf['text'].progress_apply(tokenizer.tokenize)
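
To make the idea of multi-word expression (MWE) tokenization concrete, here is a minimal sketch that greedily joins known word sequences into single tokens. It is not the course's `MWETokenizer`: the function `merge_mwes`, the underscore joiner, and the sample terms are all assumptions chosen for illustration.

```python
def merge_mwes(tokens, mwes, max_len=3):
    """Greedily join the longest known multi-word expression at each position."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word expressions.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            cand = tuple(tokens[i:i + n])
            if cand in mwes:
                out.append('_'.join(cand))  # hypothetical joiner
                i += n
                break
        else:
            # No expression starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical term list and review fragment.
mwes = {('front', 'desk'), ('air', 'conditioning')}
tokens = 'the front desk fixed the air conditioning'.split()
merged = merge_mwes(tokens, mwes)
print(merged)
```

Treating expressions such as "front desk" as one token keeps them together in the topic model instead of splitting them into unrelated words.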

### Apply topic model

In [None]:
subdf['docs'] = [mdl.make_doc(words=toks) for toks in subdf['tokens']]

In [None]:
topic_dist, ll = mdl.infer(subdf['docs'])

### Interpret model

What topics are associated with a review?

In [None]:
subdf['text'].iloc[0]

In [None]:
subdf.iloc[0]

In [None]:
subdf['docs'].iloc[0].get_topics(top_n=5)
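
`get_topics` returns (topic id, probability) pairs for the strongest topics in a document. The sketch below shows the same ranking on a plain list of probabilities, assuming each entry of `topic_dist` from `mdl.infer` holds one probability per topic. The helper `top_topics` and the six-topic `example_dist` are hypothetical stand-ins, not part of tomotopy.

```python
def top_topics(dist, top_n=5):
    """Return the top_n (topic_id, probability) pairs, highest probability first."""
    ranked = sorted(enumerate(dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical 6-topic distribution for one review (values sum to 1).
example_dist = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
top3 = top_topics(example_dist, top_n=3)
print(top3)
```

The topic ids in the result are the numbers to pass to `mdl.get_topic_words` when inspecting what each topic is about.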

In [None]:
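# The topic ids in this and the following cells come from the get_topics
# output above; substitute the ids returned for your own review.
mdl.get_topic_words(76)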

In [None]:
mdl.get_topic_words(23)

In [None]:
mdl.get_topic_words(16)

In [None]:
mdl.get_topic_words(88)

What are the most common topics?

In [None]:
subdf['topics'] = [list(map(first, d.get_topics(3))) for d in subdf['docs']]

In [None]:
subdf['topics']

In [None]:
from collections import Counter

In [None]:
topic_freq = Counter(concat(subdf['topics']))
print('Most frequent topics (topic id, count, top words)')
for t, c in topic_freq.most_common(20):
    print(f'{t:3d} {c:4d}', ', '.join(map(first, mdl.get_topic_words(t))))

Most common topics in 1 star reviews?

In [None]:
topic_freq = Counter(concat(subdf[subdf['overall']==1]['topics']))
print('Most frequent topics in 1-star reviews (topic id, count, top words)')
for t, c in topic_freq.most_common(20):
    print(f'{t:3d} {c:4d}', ', '.join(map(first, mdl.get_topic_words(t))))

Most common topics in 5 star reviews?

In [None]:
topic_freq = Counter(concat(subdf[subdf['overall']==5]['topics']))
print('Most frequent topics in 5-star reviews (topic id, count, top words)')
for t, c in topic_freq.most_common(20):
    print(f'{t:3d} {c:4d}', ', '.join(map(first, mdl.get_topic_words(t))))
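
Raw counts are hard to compare when a hotel has many more reviews at one rating than the other. One possible approach, sketched here under that assumption, is to convert counts into the share of reviews mentioning each topic and then rank topics by the gap between the two groups. The helpers `topic_rates` and `rate_gaps` and the sample topic lists are hypothetical.

```python
from collections import Counter

def topic_rates(topic_lists):
    """Fraction of reviews in which each topic appears among the top topics."""
    n = len(topic_lists)
    if n == 0:
        return {}
    # set() ensures a topic is counted at most once per review.
    counts = Counter(t for topics in topic_lists for t in set(topics))
    return {t: c / n for t, c in counts.items()}

def rate_gaps(low_lists, high_lists):
    """Topics sorted by (rate in low-rated reviews) - (rate in high-rated reviews)."""
    low, high = topic_rates(low_lists), topic_rates(high_lists)
    gaps = {t: low.get(t, 0.0) - high.get(t, 0.0) for t in set(low) | set(high)}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

# Made-up top-3 topic lists: four 1-star reviews and two 5-star reviews.
one_star = [[7, 2, 5], [7, 5, 1], [7, 2, 9], [2, 7, 3]]
five_star = [[1, 3, 5], [1, 3, 9]]
gaps = rate_gaps(one_star, five_star)
for topic, gap in gaps:
    print(f'{topic:3d} {gap:+.2f}')
```

In the notebook, the two arguments could be `subdf[subdf['overall']==1]['topics']` and `subdf[subdf['overall']==5]['topics']`. Topics with a large positive gap are concentrated in complaints, while those with a large negative gap are concentrated in praise.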

### Report

Finish this notebook by writing a brief report to the hotel managers describing what you've found in the reviews of their hotel, along with some actionable advice. Use whatever data, charts, word clouds, or other evidence you think will help make your case.