In [1]:
import numpy as np
import pandas as pd
from cytoolz import *
from tqdm.auto import tqdm

tqdm.pandas()

---

### Prepare reviews

This section is the same as what we did back in Lab 1:

* Load the JSON file for the hotel reviews
* Guess the language of each review
* Select only the ones that are in English
* Add rating categories (Overall, etc) as new columns

In [2]:
import pycld2

df = pd.read_json('s3://ling583/review.json.gz', lines=True,
                  storage_options={'anon': True})


def guess_lang(text):
    try:
        reliable, _, langs = pycld2.detect(
            text, isPlainText=True, hintLanguage='en')
        if reliable:
            return langs[0][0]
    except pycld2.error:
        pass
    return np.nan


df['lang'] = df['text'].progress_apply(guess_lang)
df = df[df['lang'] == 'ENGLISH'].reset_index(drop=True)
df = pd.concat([df, pd.json_normalize(df['ratings'])], axis=1)

  0%|          | 0/878561 [00:00<?, ?it/s]
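
As a small illustration of the last step, here is a sketch of how `pd.json_normalize` turns a column of rating dictionaries into separate columns. The two-row table is made-up toy data, not taken from `review.json.gz`. Note that the new columns are lined up with the original rows by index, which is why the real code calls `reset_index(drop=True)` after filtering.

```python
import pandas as pd

# Toy data (hypothetical): each review carries a dict of ratings
toy = pd.DataFrame({
    "text": ["Great stay", "Too noisy"],
    "ratings": [{"overall": 5.0, "service": 4.0}, {"overall": 2.0}],
})

# One column per rating key; keys missing from a review become NaN
ratings = pd.json_normalize(toy["ratings"].tolist())

# Both tables have the index 0..n-1, so rows line up when concatenated
wide = pd.concat([toy, ratings], axis=1)
print(wide)
```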

Next we'll get rid of columns that we're not going to need. That makes the tables easier to read and saves a little bit of memory.

In [3]:
df = df.drop(['ratings', 'author', 'num_helpful_votes', 'via_mobile', 'lang', 'id', 'check_in_front_desk',
              'business_service_(e_g_internet_access)'], axis=1)
df.head()

Unnamed: 0,title,text,date_stayed,offering_id,date,service,cleanliness,overall,value,location,sleep_quality,rooms
0,"“Truly is ""Jewel of the Upper Wets Side""”",Stayed in a king suite for 11 nights and yes i...,December 2012,93338,2012-12-17,5.0,5.0,5.0,5.0,5.0,5.0,5.0
1,“My home away from home!”,"On every visit to NYC, the Hotel Beacon is the...",December 2012,93338,2012-12-17,5.0,5.0,5.0,5.0,5.0,5.0,5.0
2,“Great Stay”,This is a great property in Midtown. We two di...,December 2012,1762573,2012-12-18,4.0,5.0,4.0,4.0,5.0,4.0,4.0
3,“Modern Convenience”,The Andaz is a nice hotel in a central locatio...,August 2012,1762573,2012-12-17,5.0,5.0,4.0,5.0,5.0,5.0,5.0
4,“Its the best of the Andaz Brand in the US....”,I have stayed at each of the US Andaz properti...,December 2012,1762573,2012-12-17,4.0,5.0,4.0,3.0,5.0,5.0,5.0


---

### Add hotel info

The data we looked at in the first lab included the text of the reviews but almost nothing about which hotels were being reviewed. In `review.json.gz`, the column `offering_id` is a code number identifying the hotel, and it serves as a key into the data in `offering.json.gz`. So first we'll load that file:

In [4]:
offering = pd.read_json('s3://ling583/offering.json.gz',
                        lines=True, storage_options={'anon': True})

And move some address info into its own columns:

In [5]:
offering = pd.concat([offering, pd.json_normalize(offering['address'])], axis=1)

Next we'll combine the review info from `df` with the corresponding hotel info from `offering` (in SQL terms, this is an [inner join](https://en.wikipedia.org/wiki/Join_(SQL)#Inner_join) operation).

In [6]:
df = df.merge(offering[['locality', 'id', 'name']],
              left_on='offering_id', right_on='id')
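
To see what the inner join does, here is a sketch on toy tables (the rows and hotel names are invented for illustration). A review whose `offering_id` has no match in the hotel table is dropped, and so is a hotel with no reviews.

```python
import pandas as pd

# Toy tables (hypothetical values)
reviews = pd.DataFrame({
    "title": ["A", "B", "C"],
    "offering_id": [1, 2, 99],   # 99 has no matching hotel
})
hotels = pd.DataFrame({
    "id": [1, 2, 3],             # hotel 3 has no reviews
    "name": ["Hotel One", "Hotel Two", "Hotel Three"],
})

# merge() defaults to how='inner': only keys present on both sides survive
joined = reviews.merge(hotels, left_on="offering_id", right_on="id")
print(joined)
```

Both key columns (`offering_id` and `id`) are kept in the result, which is why the next cell drops them.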

Drop columns we don't need:

In [7]:
df = df.drop(['id', 'offering_id'], axis=1)
df.head()

Unnamed: 0,title,text,date_stayed,date,service,cleanliness,overall,value,location,sleep_quality,rooms,locality,name
0,"“Truly is ""Jewel of the Upper Wets Side""”",Stayed in a king suite for 11 nights and yes i...,December 2012,2012-12-17,5.0,5.0,5.0,5.0,5.0,5.0,5.0,New York City,Hotel Beacon
1,“My home away from home!”,"On every visit to NYC, the Hotel Beacon is the...",December 2012,2012-12-17,5.0,5.0,5.0,5.0,5.0,5.0,5.0,New York City,Hotel Beacon
2,“Excellent location”,Loved the hotel. Great location - only 2 block...,December 2012,2012-12-17,5.0,5.0,5.0,5.0,5.0,5.0,5.0,New York City,Hotel Beacon
3,“All-round fantastic NYC hotel”,Our first stay on the upper west side and can'...,December 2012,2012-12-17,5.0,5.0,5.0,4.0,5.0,5.0,5.0,New York City,Hotel Beacon
4,“Great hotel in nice area”,"Great room, very big with huge bed! Great loca...",December 2012,2012-12-17,5.0,5.0,5.0,4.0,5.0,5.0,5.0,New York City,Hotel Beacon


And finally save the result to a local datafile:

In [8]:
df.to_parquet('hotels.parquet', index=False)

----

### Find domain-specific terms

The next thing we need to do is find domain-specific terminology (multiword expressions, or MWEs) that is relevant for hotel reviews (like "front desk" and "room key"). We'll follow the same procedure as we did in Lab 3 with one difference: the tag sequence we're using for matching allows both common nouns (e.g., "coffee pot") and proper nouns (e.g., "New York"). Multi-word proper names aren't usually technical terms in scientific literature, but they are important for hotel reviews.

In [9]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm', exclude=[
                 'parser', 'ner', 'lemmatizer', 'attribute_ruler'])

matcher = Matcher(nlp.vocab)
matcher.add('Term', [[{'TAG': {'IN': ['JJ', 'NN', 'NNP']}},
                      {'TAG': {'IN': ['JJ', 'NN', 'IN',
                                      'HYPH', 'NNP']}, 'OP': '*'},
                      {'TAG': {'IN': ['NN', 'NNP']}}]])


def get_candidates(text):
    doc = nlp(text)
    spans = matcher(doc, as_spans=True)
    return [tuple(tok.norm_ for tok in span) for span in spans]
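
To make the tag pattern concrete, here is a plain-Python approximation of what it accepts. This is not how spaCy's `Matcher` is implemented, and the part-of-speech tags are assigned by hand for illustration rather than produced by a tagger. The helper name `toy_candidates` is made up for this sketch.

```python
FIRST = {"JJ", "NN", "NNP"}                  # allowed tags for the first token
MIDDLE = {"JJ", "NN", "IN", "HYPH", "NNP"}   # zero or more tokens in between
LAST = {"NN", "NNP"}                         # allowed tags for the final token


def toy_candidates(words, tags):
    """Return every span whose tag sequence is FIRST MIDDLE* LAST."""
    found = []
    for i in range(len(words)):
        if tags[i] not in FIRST:
            continue
        for j in range(i + 1, len(words)):
            if tags[j] in LAST and all(t in MIDDLE for t in tags[i + 1:j]):
                found.append(tuple(words[i:j + 1]))
    return found


# Hand-tagged examples (tags are assumptions, not tagger output)
spans = toy_candidates(["the", "front", "desk", "staff"],
                       ["DT", "NN", "NN", "NN"])
print(spans)
print(toy_candidates(["cup", "of", "coffee"], ["NN", "IN", "NN"]))
```

Overlapping spans are all reported, so "front desk", "front desk staff", and "desk staff" each count as a candidate. Sorting out which of those nested candidates are real terms is the job of the C-value step later on.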

We have a lot of reviews to get through, so we'll parallelize things using dask. In order to run this notebook you'll need to **start a dask cluster** and **replace the port number of the client** (44385 in the cell below) with the right value for your cluster.

In [11]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:44385")
client

Client: Scheduler: tcp://127.0.0.1:44385, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 16.62 GB


To save processing time, we'll just use a sub-sample of 100,000 reviews to find MWEs. (Feel free to increase that if you want, but you'll need to adjust `theta` below accordingly.)

In [12]:
import dask.bag as db
import dask.dataframe as dd

texts = dd.from_pandas(df['text'].sample(
    100000, random_state=19), npartitions=50).to_bag()

graph = texts.map(get_candidates).flatten().frequencies()
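
For intuition, the `map(...).flatten().frequencies()` pipeline computes the same thing as the serial code sketched below, just spread across workers. The `toy_candidates` function is a hypothetical stand-in that returns word bigrams; it takes the place of `get_candidates` so the sketch runs without spaCy.

```python
from collections import Counter
from itertools import chain


def toy_candidates(text):
    # Hypothetical stand-in for get_candidates: all adjacent word pairs
    words = text.lower().split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]


texts = ["front desk staff", "friendly front desk"]

# map -> one list of candidates per text
# flatten -> one long stream of candidates
# frequencies -> count how often each candidate occurs
counts = Counter(chain.from_iterable(map(toy_candidates, texts)))

# Like the computed dask result, a list of (candidate, count) pairs
candidates = list(counts.items())
print(candidates)
```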

In [13]:
%%time

candidates = graph.compute()

CPU times: user 6.29 s, sys: 831 ms, total: 7.12 s
Wall time: 7min 19s


In [14]:
from nltk import ngrams


def get_subterms(term):
    k = len(term)
    for m in range(k-1, 1, -1):
        yield from ngrams(term, m)
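
Here is a sketch of what `get_subterms` produces, written with plain slicing instead of `nltk.ngrams` so that it runs on its own (the name `subterms` is used to keep it distinct from the function above). It yields every contiguous sub-sequence of the term that is shorter than the term itself and at least two words long, longest first.

```python
def subterms(term):
    """Slicing-based equivalent of get_subterms, for illustration."""
    k = len(term)
    for m in range(k - 1, 1, -1):        # lengths k-1 down to 2
        for i in range(k - m + 1):       # every start position
            yield term[i:i + m]


print(list(subterms(("complimentary", "continental", "breakfast"))))
print(list(subterms(("a", "b", "c", "d"))))
print(list(subterms(("front", "desk"))))   # two-word terms have no subterms
```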

In [15]:
from collections import Counter, defaultdict
from math import log2

freqs = defaultdict(Counter)
for c, f in candidates:
    freqs[len(c)][c] += f


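def c_value(F, theta):
    # F maps term length -> Counter of candidate frequencies;
    # theta is the minimum score a candidate needs to be kept.

    termhood = Counter()
    # For each candidate, frequencies of accepted longer terms containing it
    longer = defaultdict(list)

    # Score the longest candidates first so their counts can discount subterms
    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                # Mean frequency of the accepted longer terms that contain this one
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood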

In [16]:
terms = c_value(freqs, theta=200)

In [17]:
for t, c in terms.most_common(20):
    print(f'{c:8.2f} {freqs[len(t)][t]:5d} {" ".join(t)}')

18737.83 19525 front desk
 8601.12  8898 new york
 7272.00  7272 great location
 6159.80  6321 times square
 5562.75  5743 room service
 5295.36  3341 front desk staff
 4128.83  2772 check - in
 3633.00  3781 san francisco
 3250.50  3411 san diego
 3211.00  3211 hotel staff
 3166.00  3374 next door
 2874.00  3009 union square
 2757.50  2948 customer service
 2713.00  2816 central park
 2712.00  2712 next time
 2537.00  2537 friendly staff
 2459.00  2459 first time
 2433.00  2523 minute walk
 2371.00  2371 great place
 2355.50  2567 continental breakfast


In [18]:
for t, c in tail(20, terms.most_common()):
    print(f'{c:8.2f} {freqs[len(t)][t]:5d} {" ".join(t)}')

  206.00   206 first rate
  206.00   206 shuttle driver
  206.00   206 balboa park
  205.00   205 reasonable rate
  204.00   204 small kitchen
  204.00   204 separate bedroom
  204.00   204 hudson river
  204.00   204 small refrigerator
  204.00   204 nice bar
  203.00   203 park hotel
  202.88   128 roll - away
  202.00   202 average size
  202.00   202 first place
  202.00   202 comfortable stay
  201.29   127 complimentary continental breakfast
  201.00   201 storage space
  201.00   201 security guard
  201.00   201 city hall
  201.00   201 bart station
  201.00   201 freedom trail


Last of all, we'll save the MWEs that we've found for use in the next step.

In [19]:
with open('hotel-terms.txt', 'w') as f:
    for t in terms:
        print(' '.join(t), file=f)