# LAB 4: Topic modeling

Use topic models to explore hotel reviews

Objectives:
* tokenize with MWEs using spaCy
* estimate LDA topic models with tomotopy
* visualize and evaluate topic models
* apply topic models to interpretation of hotel reviews

## Analyze reviews

In [1]:
import pandas as pd
import numpy as np
from cytoolz import concat, first
from tqdm.auto import tqdm
tqdm.pandas()

### Read in hotel review data and tokenize it

In [2]:
df = pd.read_parquet('hotels.parquet')

In [3]:
import tomotopy as tp

mdl = tp.LDAModel.load('hotel-topics.bin')

In [4]:
df[df['overall']==1]['name'].value_counts().head(20)

Hotel Pennsylvania New York                      1358
Hotel Carter                                      685
Hudson New York                                   486
Park Central                                      330
The Boston Park Plaza Hotel & Towers              238
W New York                                        233
The Roosevelt Hotel                               206
Edison Hotel Times Square                         195
Wellington Hotel                                  178
Paradise Point Resort & Spa                       169
Waldorf Astoria New York                          168
Town and Country Resort Hotel                     157
Milford Plaza Hotel                               157
Manhattan Broadway Hotel                          147
Grand Hyatt New York                              145
Doubletree Hotel Metropolitan - New York City     145
Le Parker Meridien                                144
Fort Rapids Indoor Waterpark Resort               143
New York Inn                

Pick a hotel with a lot of 1-star ratings (other than the Paradise Point Resort & Spa) and pull out all of its reviews.

In [5]:
# Copy the filtered rows so that new columns can be added safely
subdf = df[df['name']=='Hotel Pennsylvania New York'].copy()

In [6]:
subdf['overall'].value_counts()

1.0    1358
3.0     772
2.0     555
4.0     437
5.0     116
Name: overall, dtype: int64

Tokenize

In [7]:
from tokenizer import MWETokenizer

tokenizer = MWETokenizer(open('hotel-terms.txt'))
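
The `MWETokenizer` class comes from a local `tokenizer` module that is not shown here, so its internals are unknown. As a rough sketch of the idea only, assuming that multiword expressions (MWEs) are merged into single underscore-joined tokens such as `front_desk` (as the token output later in this notebook suggests), a greedy longest-match merge over already-split words could look like this. The function name `merge_mwes` and the sample term list are hypothetical.

```python
def merge_mwes(words, mwes, max_len=4):
    """Greedily merge known multiword expressions into single tokens."""
    # Store each MWE as a tuple of lowercase words for fast lookup
    mwe_set = {tuple(m.lower().split()) for m in mwes}
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate span first, down to two words
        for n in range(min(max_len, len(words) - i), 1, -1):
            cand = tuple(w.lower() for w in words[i:i + n])
            if cand in mwe_set:
                out.append('_'.join(cand))
                i += n
                break
        else:
            # No MWE starts at this position; keep the single word
            out.append(words[i].lower())
            i += 1
    return out


# Hypothetical term list; the real one lives in hotel-terms.txt
terms = ['front desk', 'credit card', 'empire state building']
tokens = merge_mwes('The front desk girl was extremely rude'.split(), terms)
print(tokens)
```

Keeping `front_desk` as one token lets the topic model treat it as a single vocabulary item instead of two unrelated words.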

In [8]:
subdf['tokens'] = subdf['text'].progress_apply(tokenizer.tokenize)


### Apply topic model

In [9]:
subdf['docs'] = [mdl.make_doc(words=toks) for toks in subdf['tokens']]


In [10]:
topic_dist, ll = mdl.infer(subdf['docs'])
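
For a list of documents, `infer` returns one topic distribution per document together with the corresponding log-likelihoods. As a minimal sketch of one way to summarize such output, assuming `topic_dist` behaves like a list of equal-length probability vectors, the average weight of each topic across reviews can be computed with the standard library. The toy values below are made up for illustration and are not taken from the model.

```python
from statistics import fmean

# Toy stand-in for topic_dist: 3 reviews x 4 topics (illustrative values only)
toy_topic_dist = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
]

# zip(*rows) transposes the matrix so each tuple holds one topic's weights
mean_weight = [fmean(col) for col in zip(*toy_topic_dist)]

# Rank topics by average weight, highest first
ranked = sorted(enumerate(mean_weight), key=lambda tw: tw[1], reverse=True)
print(ranked)
```

Averaging weights uses the whole distribution, whereas counting each review's top three topics (as done further down) ignores how strongly a topic is present.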

### Interpret model

What topics are associated with a review?

In [11]:
subdf['text'].iloc[0]

"The front desk girl was extremely rude and wouldn't let us check into the room. We were stranded in manhattan and really needed a place to stay although since we made our reservations online she said we couldn't check in. I used to work at a hotel and was extremely confused by the rude service and not disappointed that even tho the hotel had 500 empty rooms they couldn't let us check into ours. Terrible service and I will make sure to inform my friends to never stay here and I will never return. Worst customer service ever!!"

In [12]:
subdf.iloc[0]

title                                         “Worst service ever”
text             The front desk girl was extremely rude and wou...
date_stayed                                          December 2012
date                                           2012-12-19 00:00:00
service                                                        1.0
cleanliness                                                    1.0
overall                                                        1.0
value                                                          1.0
location                                                       3.0
sleep_quality                                                  NaN
rooms                                                          1.0
locality                                             New York City
name                                   Hotel Pennsylvania New York
tokens           [the, front_desk, girl, was, extremely, rude, ...
docs             (the, front_desk, girl, was, extremely, rude,

In [13]:
subdf['docs'].iloc[0].get_topics(top_n=5)

[(67, 0.22349637746810913),
 (46, 0.16731783747673035),
 (2, 0.13108664751052856),
 (58, 0.09505050629377365),
 (9, 0.0940554141998291)]
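
`get_topics(top_n=5)` returns `(topic_id, probability)` pairs for the most heavily weighted topics of one document. Conceptually, and not as a description of tomotopy's internals, this is a top-k selection over that document's topic distribution. A small sketch with a made-up distribution (the helper name `top_topics` is hypothetical):

```python
import heapq


def top_topics(dist, top_n=5):
    """Return (topic_id, probability) pairs for the top_n largest entries."""
    return heapq.nlargest(top_n, enumerate(dist), key=lambda pair: pair[1])


toy_dist = [0.05, 0.40, 0.10, 0.30, 0.15]  # illustrative values only
top3 = top_topics(toy_dist, top_n=3)
print(top3)
```

Note that these probabilities describe a single review. For the review above, the five listed topics add up to about 0.71, so all remaining topics share the other 0.29 of that one document.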

In [25]:
mdl.get_topic_words(67)

[('never', 0.017958614975214005),
 ('even', 0.01770029403269291),
 ('place', 0.016894331201910973),
 ('ever', 0.014745095744729042),
 ('bad', 0.013660145923495293),
 ('rude', 0.012978176586329937),
 ('worst', 0.012957511469721794),
 ('what', 0.01122159045189619),
 ('their', 0.00972332525998354),
 ('money', 0.009227348491549492)]

In [26]:
mdl.get_topic_words(46)

[('she', 0.137332484126091),
 ('her', 0.07034335285425186),
 ('asked', 0.024063313379883766),
 ('said', 0.023601744323968887),
 ('front_desk', 0.019278375431895256),
 ('who', 0.017816739156842232),
 ('told', 0.01757056824862957),
 ('lady', 0.015124250203371048),
 ('checked', 0.011200908571481705),
 ('what', 0.011154751293361187)]

In [27]:
mdl.get_topic_words(2)

[('upon', 0.023397130891680717),
 ('check', 0.021930620074272156),
 ('arrived', 0.018616974353790283),
 ('check-in', 0.018213964998722076),
 ('arrival', 0.01524735614657402),
 ('checked', 0.013400223106145859),
 ('greeted', 0.012885265052318573),
 ('front_desk', 0.012504643760621548),
 ('after', 0.012247164733707905),
 ('time', 0.011105299927294254)]

In [28]:
mdl.get_topic_words(58)

[('their', 0.021092552691698074),
 ('service', 0.018815666437149048),
 ('front_desk', 0.01835695281624794),
 ('helpful', 0.017989981919527054),
 ('way', 0.015404507517814636),
 ('make', 0.014553803019225597),
 ('help', 0.013978326693177223),
 ('friendly', 0.013060900382697582),
 ('always', 0.013044219464063644),
 ('went', 0.012718950398266315)]

In [29]:
mdl.get_topic_words(9)

[('check', 0.03354354575276375),
 ('arrived', 0.02969188429415226),
 ('p.m.', 0.028241846710443497),
 ('early', 0.02476402372121811),
 ('ready', 0.02423158846795559),
 ('until', 0.024220259860157967),
 ('after', 0.021637381985783577),
 ('before', 0.016981404274702072),
 ('a.m.', 0.016856791451573372),
 ('around', 0.016494283452630043)]

What are the most common topics?

In [19]:
subdf['topics'] = [list(map(first, d.get_topics(3))) for d in subdf['docs']]


In [20]:
subdf['topics']

65186     [67, 46, 2]
65187     [7, 27, 61]
65188    [92, 85, 59]
65189     [66, 2, 37]
65190     [25, 44, 1]
             ...     
68419    [77, 45, 18]
68420     [66, 0, 14]
68421    [67, 91, 17]
68422    [66, 60, 45]
68423     [14, 37, 0]
Name: topics, Length: 3238, dtype: object

In [21]:
from collections import Counter

In [22]:
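# Count how often each topic appears among a review's top three topics
topic_freq = Counter(concat(subdf['topics']))
print('Top Freq Words')
for t, c in topic_freq.most_common(20):
    # Label each topic with its highest-probability words
    print(f'{t:3d} {c:4d}', ', '.join(map(first, mdl.get_topic_words(t))))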

Top Freq Words
 85 1104 dirty, bathroom, carpet, bed, been, floor, old, looked, shower, sheets
 17  683 reception, new_york, after, day, lovely, really, booked, too, been, nights
 67  668 never, even, place, ever, bad, rude, worst, what, their, money
 27  417 your, want, go, its, place, need, make, back, then, what
  7  409 reviews, after, some, read, surprised, reading, other, little, pleasantly, negative
 16  392 back, got, went, down, into, go, after, then, came, off
 60  380 old, need, some, bathroom, worn, carpet, furniture, dated, lobby, needs
  0  356 am, what, their, people, know, really, who, them, how, some
 25  306 new_york, nyc, affinia, empire_state_building, manhattan, subway, penn_station, times_square, macy, square
 66  291 place, small, price, want, bathroom, spend, your, looking, budget, nyc
 35  290 really, pretty, little, too, much, bit, got, though, some, because
 81  251 more, than, wanted, because, place, time, what, staying, after, before
 37  217 however, some,
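
The counting step flattens the per-review topic lists and tallies them, so each count is the number of reviews in which a topic is among the top three. The same pattern can be sketched with only the standard library, using four of the rows shown earlier; here `itertools.chain.from_iterable` plays the role of `cytoolz.concat`.

```python
from collections import Counter
from itertools import chain

# Four top-3 topic lists copied from the subdf['topics'] output above
sample_topics = [
    [67, 46, 2],
    [66, 2, 37],
    [67, 91, 17],
    [66, 60, 45],
]

# Flatten the lists into one stream of topic ids, then count them
freq = Counter(chain.from_iterable(sample_topics))

for topic_id, n in freq.most_common(3):
    print(f'{topic_id:3d} {n:4d}')
```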

Most common topics in 1 star reviews?

In [23]:
# Restrict the count to 1-star reviews
topic_freq = Counter(concat(subdf[subdf['overall']==1]['topics']))
print('Top Freq Words')
for t, c in topic_freq.most_common(20):
    print(f'{t:3d} {c:4d}', ', '.join(map(first, mdl.get_topic_words(t))))

Top Freq Words
 85  809 dirty, bathroom, carpet, bed, been, floor, old, looked, shower, sheets
 67  512 never, even, place, ever, bad, rude, worst, what, their, money
 16  214 back, got, went, down, into, go, after, then, came, off
 17  207 reception, new_york, after, day, lovely, really, booked, too, been, nights
  0  174 am, what, their, people, know, really, who, them, how, some
 60  139 old, need, some, bathroom, worn, carpet, furniture, dated, lobby, needs
 27  127 your, want, go, its, place, need, make, back, then, what
 59  110 after, left, told, them, been, called, credit_card, charged, never, manager
 81  106 more, than, wanted, because, place, time, what, staying, after, before
 20   90 reservation, booked, told, available, called, after, made, arrived, another, asked
 62   85 minutes, called, after, told, wait, front_desk, call, back, service, waited
 92   83 door, front_desk, after, called, went, left, back, told, am, morning
 35   71 really, pretty, little, too, much, bit,

Most common topics in 5 star reviews?

In [24]:
# Restrict the count to 5-star reviews
topic_freq = Counter(concat(subdf[subdf['overall']==5]['topics']))
print('Top Freq Words')
for t, c in topic_freq.most_common(20):
    print(f'{t:3d} {c:4d}', ', '.join(map(first, mdl.get_topic_words(t))))

Top Freq Words
 17   34 reception, new_york, after, day, lovely, really, booked, too, been, nights
  7   32 reviews, after, some, read, surprised, reading, other, little, pleasantly, negative
 27   23 your, want, go, its, place, need, make, back, then, what
 25   18 new_york, nyc, affinia, empire_state_building, manhattan, subway, penn_station, times_square, macy, square
 14   16 back, loved, friendly, really, helpful, go, definitely, comfortable, perfect, everything
  0   13 am, what, their, people, know, really, who, them, how, some
 38   13 breakfast, coffee, helpful, nights, really, facilities, friendly, bathroom, walk, tea
 66   12 place, small, price, want, bathroom, spend, your, looking, budget, nyc
 28   11 recommend, highly, friendly, helpful, excellent, definitely, comfortable, anyone, extremely, staying
 81   10 more, than, wanted, because, place, time, what, staying, after, before
 37    9 however, some, other, quite, service, bit, than, overall, more, though
 45    8 times
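
Raw counts are hard to compare across ratings because the groups differ so much in size (1,358 one-star reviews versus 116 five-star reviews). One option, sketched below with counts copied from the outputs above, is to divide each count by the number of reviews in its rating group. The result is the share of reviews in which a topic is among the top three. The dictionary layout and variable names are illustrative.

```python
# Number of reviews per rating, from subdf['overall'].value_counts()
n_reviews = {1: 1358, 5: 116}

# Top-3 topic counts copied from the outputs above (topic id -> count)
counts = {
    1: {85: 809, 67: 512, 17: 207},
    5: {17: 34, 7: 32, 25: 18},
}

# Share of reviews in each rating group where the topic is in the top three
shares = {
    rating: {t: c / n_reviews[rating] for t, c in topic_counts.items()}
    for rating, topic_counts in counts.items()
}

for rating, topic_shares in shares.items():
    for t, s in topic_shares.items():
        print(f'{rating} star  topic {t:2d}  {s:6.1%}')
```

On these numbers, topic 17 is among the top three topics in roughly 29 percent of five-star reviews but only about 15 percent of one-star reviews, while topic 85 reaches about 60 percent of one-star reviews.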

### Report

Finish this notebook by writing a brief report to the hotel managers describing what you've found in the reviews of their hotel, along with some actionable advice. Use whatever data, charts, word clouds, etc. that you think will help you make your case. 

Report for Hotel Pennsylvania, New York 

Across the 1,358 one-star reviews of the hotel, the front desk and check-in staff stand out as a primary problem. Topic 67, whose top words include "worst" and "rude" alongside less descriptive terms, is among the top three topics in 512 of those reviews (about 38 percent). Several other frequent one-star topics describe disputes with staff: topics 59, 20, 62, and 92 feature words such as "told", "called", "front_desk", "reservation", "waited", and "manager", and each appears in 83 to 110 one-star reviews. The first review in the data illustrates the pattern: about 22 percent of its text is assigned to topic 67, about 17 percent to topic 46 (a "she" or "lady" at the "front_desk" who "asked", "said", or "told"), and about 13 percent to topic 2 ("check", "arrived", "check-in", "greeted", "front_desk"). Together these point to the reception area as a primary area for improvement, since a large share of your negative reviews originate there.

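Another area for improvement, and the most frequent complaint in the data, is the cleanliness and condition of the building. Topic 85 ("dirty", "bathroom", "carpet", "bed", "floor", "old") is among the top three topics in 809 of the 1,358 one-star reviews (about 60 percent), and topic 60 ("worn", "dated", "furniture", "needs") appears in 139. Deep cleaning and some renovation of rooms and bathrooms appear to be in order.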

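It is worth noting that your positive reviews often reference the location and local sights rather than the hotel itself: topic 25 ("new_york", "empire_state_building", "penn_station", "times_square") is among the top three topics in 18 of the 116 five-star reviews (about 16 percent). This suggests that location is one of the hotel's main draws, and that satisfied guests may be spending much of their time outside their rooms.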