# LAB 4: Topic modeling

Use topic models to explore hotel reviews

Objectives:
* tokenize with MWEs using spacy
* estimate LDA topic models with tomotopy
* visualize and evaluate topic models
* apply topic models to interpretation of hotel reviews

## Prepare data for analysis

In [None]:
import numpy as np
import pandas as pd
from cytoolz import *
from tqdm.auto import tqdm

tqdm.pandas()

---

### Prepare reviews

This section is the same as what we did back in Lab 1:

* Load the JSON file for the hotel reviews
* Guess the language of each review
* Select only the ones that are in English
* Add rating categories (Overall, etc.) as new columns

In [None]:
import pycld2

df = pd.read_json('s3://ling583/review.json.gz', lines=True,
                  storage_options={'anon': True})


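def guess_lang(text):
    """Return pycld2's language name for text, or NaN if detection is unreliable or fails."""
    try:
        reliable, _, langs = pycld2.detect(
            text, isPlainText=True, hintLanguage='en')
        if reliable:
            # langs lists the top guesses; each entry starts with the language name
            return langs[0][0]
    except pycld2.error:
        pass
    return np.nan  # np.NaN was removed in NumPy 2.0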


df['lang'] = df['text'].progress_apply(guess_lang)
df = df[df['lang'] == 'ENGLISH'].reset_index(drop=True)
df = pd.concat([df, pd.json_normalize(df['ratings'])], axis=1)
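
To see why the `reset_index(drop=True)` step comes before `pd.concat`, here is a minimal sketch on made-up data (the three toy reviews and their rating values are invented for illustration). `pd.concat(..., axis=1)` lines rows up by index label, so the filtered table needs a fresh 0..n-1 index to match the newly built ratings table. The sketch passes a plain list to `pd.json_normalize` rather than the column itself, just to keep the resulting index unambiguous.

```python
import pandas as pd

# Toy stand-in for the review table; all values are invented for illustration.
toy = pd.DataFrame({
    'text': ['Great stay', 'Très bien', 'Noisy room'],
    'lang': ['ENGLISH', 'FRENCH', 'ENGLISH'],
    'ratings': [{'overall': 5.0, 'service': 4.0},
                {'overall': 4.0},
                {'overall': 2.0, 'service': 1.0}],
})

# Keep the English rows and renumber them 0..n-1.
toy = toy[toy['lang'] == 'ENGLISH'].reset_index(drop=True)

# One column per rating key; a list input gives the result its own 0..n-1 index.
rating_cols = pd.json_normalize(toy['ratings'].tolist())

# axis=1 concatenation aligns on index labels, which now match row for row.
toy = pd.concat([toy, rating_cols], axis=1)
print(toy[['text', 'overall', 'service']])
```

Without the index reset, the surviving rows would keep labels 0 and 2, and the second row would be paired with the wrong (or no) ratings.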

Next we'll get rid of columns that we're not going to need. That makes the tables easier to read and saves a little bit of memory.

In [None]:
df = df.drop(['ratings', 'author', 'num_helpful_votes', 'via_mobile', 'lang', 'id', 'check_in_front_desk',
              'business_service_(e_g_internet_access)'], axis=1)
df.head()

---

### Add hotel info

The data we looked at in the first lab included the text of the reviews but almost nothing about which hotels were being reviewed. In `review.json.gz`, the column `offering_id` is a code number identifying the hotel, and it points to the hotel records in `offering.json.gz`. So first we'll load that file:

In [None]:
offering = pd.read_json('s3://ling583/offering.json.gz',
                        lines=True, storage_options={'anon': True})

And move some address info into its own columns:

In [None]:
offering = pd.concat([offering, pd.json_normalize(offering['address'])], axis=1)

Next we'll combine the review info from `df` with the corresponding hotel info from `offering` (in SQL terms, this is an [inner join](https://en.wikipedia.org/wiki/Join_(SQL)#Inner_join) operation).

In [None]:
df = df.merge(offering[['locality', 'id', 'name']],
              left_on='offering_id', right_on='id')
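
As a quick illustration of what the inner join does, here is a small sketch with two toy tables (the IDs, hotel names, and cities are invented). Rows whose key has no partner in the other table are silently dropped, and both key columns end up in the result.

```python
import pandas as pd

# Toy tables; hotel IDs, names, and cities are invented for illustration.
reviews = pd.DataFrame({
    'offering_id': [10, 10, 20, 99],
    'title': ['Loved it', 'Fine', 'Too loud', 'Orphan review'],
})
hotels = pd.DataFrame({
    'id': [10, 20, 30],
    'name': ['Hotel A', 'Hotel B', 'Hotel C'],
    'locality': ['Boston', 'Chicago', 'Denver'],
})

# merge defaults to how='inner': only keys found in both tables survive.
joined = reviews.merge(hotels[['locality', 'id', 'name']],
                       left_on='offering_id', right_on='id')
print(joined)
```

The review with `offering_id` 99 and the hotel with `id` 30 do not appear in `joined`. Comparing the row count before and after a merge is a cheap way to check how many reviews had no matching hotel.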

Drop columns we don't need:

In [None]:
df = df.drop(['id', 'offering_id'], axis=1)
df.head()

And finally save the result to a local datafile:

In [None]:
df.to_parquet('hotels.parquet', index=False)

---

### Find domain-specific terms

The next thing we need to do is find domain-specific terminology (multiword expressions, or MWEs) that is relevant for hotel reviews (like "front desk" and "room key"). We'll follow the same procedure as we did in Lab 3 with one difference: the tag sequence we're using for matching allows both common nouns (e.g., "coffee pot") and proper nouns (e.g., "New York"). Multi-word proper names aren't usually technical terms in scientific literature, but they are important for hotel reviews.

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm', exclude=[
                 'parser', 'ner', 'lemmatizer', 'attribute_ruler'])

matcher = Matcher(nlp.vocab)
matcher.add('Term', [[{'TAG': {'IN': ['JJ', 'NN', 'NNP']}},
                      {'TAG': {'IN': ['JJ', 'NN', 'IN',
                                      'HYPH', 'NNP']}, 'OP': '*'},
                      {'TAG': {'IN': ['NN', 'NNP']}}]])


def get_candidates(text):
    doc = nlp(text)
    spans = matcher(doc, as_spans=True)
    return [tuple(tok.norm_ for tok in span) for span in spans]
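
To make the tag pattern concrete, here is a pure-Python sketch that enumerates the spans the pattern describes, without spaCy. The tags in the toy sentence are assigned by hand, and `candidate_spans` is a hypothetical helper, not part of spaCy. It assumes that every matching span is reported, including ones nested inside longer matches, which is what the C-value step later relies on. The order in which spaCy returns matches may differ.

```python
# Tag sets copied from the three positions of the matcher pattern above.
FIRST = {'JJ', 'NN', 'NNP'}
MIDDLE = {'JJ', 'NN', 'IN', 'HYPH', 'NNP'}
LAST = {'NN', 'NNP'}


def candidate_spans(tagged):
    """Return every token span whose tags fit FIRST MIDDLE* LAST."""
    spans = []
    n = len(tagged)
    for i in range(n):
        if tagged[i][1] not in FIRST:
            continue
        for j in range(i + 1, n):
            # Tokens between i and j have already passed the MIDDLE check.
            if tagged[j][1] in LAST:
                spans.append(tuple(tok for tok, _ in tagged[i:j + 1]))
            if tagged[j][1] not in MIDDLE:
                break
    return spans


# Hand-tagged toy sentence (tags are assumptions, not tagger output).
tagged = [('the', 'DT'), ('front', 'JJ'), ('desk', 'NN'),
          ('staff', 'NN'), ('was', 'VBD'), ('rude', 'JJ')]
print(candidate_spans(tagged))
```

Here the sketch finds "front desk", "front desk staff", and "desk staff". Single tokens never match, because the pattern needs at least a first and a last token.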

We have a lot of reviews to get through, so we'll parallelize things using dask. In order to run this notebook you'll need to **start a dask cluster** and **replace the port number of the client** (xxxxx below) with the right value for your cluster.

In [None]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:xxxxx")
client

To save processing time, we'll just use a sub-sample of 100,000 reviews to find MWEs. (Feel free to increase that if you want, but you'll need to adjust `theta` below accordingly.)

In [None]:
import dask.bag as db
import dask.dataframe as dd

texts = dd.from_pandas(df['text'].sample(
    100000, random_state=19), npartitions=50).to_bag()

graph = texts.map(get_candidates).flatten().frequencies()
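
The `map` → `flatten` → `frequencies` pipeline is lazy: nothing runs until `compute()` is called. As a rough single-machine picture of what it will compute, here is a sketch using only the standard library. `toy_candidates` is a hypothetical stand-in for `get_candidates` that simply returns adjacent word pairs.

```python
from collections import Counter
from itertools import chain


def toy_candidates(text):
    # Hypothetical stand-in for get_candidates: adjacent word pairs.
    words = text.lower().split()
    return list(zip(words, words[1:]))


toy_texts = ['front desk staff', 'friendly front desk']

# map: one list of candidates per text
# flatten: chain the per-text lists into one stream
# frequencies: count how often each candidate occurs
counts = Counter(chain.from_iterable(map(toy_candidates, toy_texts)))

# A list of (candidate, count) pairs, the same shape the loops below expect.
pairs = list(counts.items())
print(pairs)
```

The dask version does the same counting, but splits the texts into partitions that workers process in parallel before the partial counts are combined.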

In [None]:
%%time

candidates = graph.compute()

In [None]:
from nltk import ngrams


def get_subterms(term):
    k = len(term)
    for m in range(k-1, 1, -1):
        yield from ngrams(term, m)
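
For a feel of what `get_subterms` produces, here is a small sketch that swaps in a standard-library replacement for `nltk.ngrams` (`ngrams_sketch` and `subterms_sketch` are hypothetical names). It yields every contiguous sub-sequence of length 2 up to one less than the term's length, longest first.

```python
def ngrams_sketch(seq, n):
    """All contiguous length-n slices of seq, as tuples (stand-in for nltk.ngrams)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def subterms_sketch(term):
    """Same loop as get_subterms: lengths k-1 down to 2."""
    k = len(term)
    for m in range(k - 1, 1, -1):
        yield from ngrams_sketch(term, m)


term = ('front', 'desk', 'staff', 'member')
result = list(subterms_sketch(term))
for sub in result:
    print(' '.join(sub))
```

A four-word term gives two trigrams followed by three bigrams. A two-word term has no subterms at all, since single words are never candidates.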

In [None]:
from collections import Counter, defaultdict
from math import log2

freqs = defaultdict(Counter)
for c, f in candidates:
    freqs[len(c)][c] += f


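def c_value(F, theta):
    """Score candidate terms by C-value, keeping those that score above theta."""
    termhood = Counter()
    # For each subterm, the frequencies of accepted longer terms that contain it
    longer = defaultdict(list)

    # Process the longest candidates first so that discounts are ready
    # by the time their subterms are scored
    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                # Mean frequency of the accepted longer terms containing this one
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood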

In [None]:
terms = c_value(freqs, theta=200)
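
To see how the score and the threshold interact, here is a worked example of the arithmetic for a single nested term. The frequencies are invented, and `c_value_score` is a hypothetical helper that repeats the one-line formula from `c_value`.

```python
from math import log2


def c_value_score(k, freq, longer_freqs):
    """log2(k) * (freq - mean frequency of accepted longer terms containing the term)."""
    discount = sum(longer_freqs) / len(longer_freqs) if longer_freqs else 0
    return log2(k) * (freq - discount)


theta = 200

# Invented counts: two three-word candidates that both contain "front desk".
staff = c_value_score(3, 300, [])   # "front desk staff", seen 300 times
clerk = c_value_score(3, 100, [])   # "front desk clerk", seen 100 times
print(f'{staff:.2f} {clerk:.2f}')

# Only candidates scoring above theta are accepted, and only accepted
# terms discount their subterms. Here that is "front desk staff" alone.
accepted = [f for f, score in [(300, staff), (100, clerk)] if score > theta]
front_desk = c_value_score(2, 1000, accepted)   # "front desk", seen 1000 times
print(front_desk)
```

With these numbers "front desk clerk" falls below the threshold, so it neither becomes a term nor reduces the score of "front desk". Lowering `theta` would let it through and change the discount.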

In [None]:
for t, c in terms.most_common(20):
    print(f'{c:8.2f} {freqs[len(t)][t]:5d} {" ".join(t)}')

In [None]:
for t, c in tail(20, terms.most_common()):
    print(f'{c:8.2f} {freqs[len(t)][t]:5d} {" ".join(t)}')

Last of all, we'll save the MWEs that we've found for use in the next step:

In [None]:
with open('hotel-terms.txt', 'w') as f:
    for t in terms:
        print(' '.join(t), file=f)
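
The file holds one term per line, with tokens separated by single spaces. How the next step loads it is not shown here, but as a sketch, one way to read the terms back as tuples is to split each line on whitespace. The two toy terms and the temporary path are stand-ins for illustration.

```python
import os
import tempfile

# Toy stand-in for the `terms` counter; the scores are invented.
toy_terms = {('front', 'desk'): 700.0, ('room', 'key'): 250.0}

# Write to a temporary location so the real hotel-terms.txt is untouched.
path = os.path.join(tempfile.mkdtemp(), 'hotel-terms.txt')
with open(path, 'w') as f:
    for t in toy_terms:
        print(' '.join(t), file=f)

# Read back: one tuple of tokens per line.
with open(path) as f:
    loaded = [tuple(line.split()) for line in f]
print(loaded)
```

Note that only the terms are saved, not their C-value scores or frequencies.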