# LAB 4: Topic modeling

Use topic models to explore hotel reviews

Objectives:

- tokenize with multi-word expressions (MWEs) using spaCy
- estimate LDA topic models with tomotopy
- visualize and evaluate topic models
- apply topic models to interpretation of hotel reviews

In [2]:
from collections import Counter

import numpy as np
import pandas as pd
import tomotopy as tp
from cytoolz import concat, first
from tqdm.auto import tqdm

tqdm.pandas()

## Prepare data

In [3]:
df = pd.read_pickle("/data/hotels_id.pkl")
mdl = tp.LDAModel.load("hotel-topics.bin")
labels = list(pd.read_csv("labels.csv")["label"])

In [4]:
df[df["overall"] == 1]["offering_id"].value_counts().head(20)

Pick a hotel with a lot of 1-star ratings (other than #93520) and pull out all of its reviews.

In [None]:
hotel = df.query("offering_id==93520").copy()
hotel["overall"].value_counts()

4.0    826
5.0    575
3.0    448
1.0    329
2.0    313
Name: overall, dtype: int64
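
The counts above are enough to check the number of reviews and the mean rating by hand. A minimal sketch, with the counts copied from the output above into a plain dictionary (`rating_counts` is an illustrative name, not part of the notebook):

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5.0: 575, 4.0: 826, 3.0: 448, 2.0: 313, 1.0: 329}

# Total number of reviews for this hotel
n_reviews = sum(rating_counts.values())

# Mean rating = sum(rating * count) / number of reviews
avg_rating = sum(r * c for r, c in rating_counts.items()) / n_reviews

print(n_reviews, round(avg_rating, 2))  # 2491 3.4
```

This should agree with `hotel["overall"].mean()` when the notebook is run against the same data.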

In [None]:
from tokenizer import MWETokenizer

tokenizer = MWETokenizer(open("terms.txt"))

In [None]:
hotel["tokens"] = (hotel["title"] + " " + hotel["text"]).progress_apply(
    tokenizer.tokenize
)

  0%|          | 0/2491 [00:00<?, ?it/s]
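
The `tokenizer` module is local to this lab and its code is not shown here. As a rough illustration of what multi-word-expression tokenization does, here is a minimal pure-Python sketch. `merge_mwes` is a hypothetical helper, not the lab's `MWETokenizer`, and it assumes the known terms are tuples of lowercase words that get joined with underscores (which would be consistent with labels such as `GREAT_LOCATION`).

```python
def merge_mwes(tokens, terms, max_len=3):
    """Greedily join known multi-word expressions with underscores.

    tokens:  list of word strings
    terms:   set of word tuples to treat as single tokens (assumed format)
    max_len: longest expression, in words, to look for
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two words
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            cand = tuple(tokens[i : i + n])
            if cand in terms:
                out.append("_".join(cand))
                i += n
                break
        else:
            # No expression starts here; keep the single word
            out.append(tokens[i])
            i += 1
    return out


terms = {("great", "location"), ("walking", "distance")}
tokens = "great location within walking distance of the park".split()
merged = merge_mwes(tokens, terms)
print(merged)
# ['great_location', 'within', 'walking_distance', 'of', 'the', 'park']
```

Treating an expression as one token lets the topic model count "great location" as a single vocabulary item rather than two unrelated words.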

## Apply topic model

In [None]:
hotel["doc"] = [mdl.make_doc(words=toks) for toks in hotel["tokens"]]
topic_dist, ll = mdl.infer(hotel["doc"])

## Interpret model

What topics are associated with a review?

In [None]:
hotel["text"].iloc[0]

'Bedbugs!!!! No acknowledgement, no bill adjustment, just fill out a form for Security. I showed the manager a bite, and I am still itching like crazy! Only where my body was in contact with the bed did I have bites.'

In [None]:
hotel["doc"].iloc[0].get_topics(top_n=5)

[(33, 0.2037794142961502),
 (29, 0.08272796869277954),
 (32, 0.06548836082220078),
 (40, 0.06293294578790665),
 (22, 0.054054565727710724)]
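
Each pair above is a topic id and that topic's estimated share of the review, listed from highest to lowest. The same ranking can be reproduced from any plain probability vector, such as one row of `topic_dist`. A minimal sketch using a made-up five-topic distribution (`top_topics` is a hypothetical helper, not a tomotopy function):

```python
import heapq


def top_topics(dist, top_n=5):
    """Return the top_n (topic_id, probability) pairs, highest first."""
    return heapq.nlargest(top_n, enumerate(dist), key=lambda pair: pair[1])


dist = [0.05, 0.40, 0.10, 0.30, 0.15]  # made-up distribution over 5 topics
top3 = top_topics(dist, top_n=3)
print(top3)  # [(1, 0.4), (3, 0.3), (4, 0.15)]
```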

In [None]:
mdl.get_topic_words(17)

[('hilton', 0.08599624037742615),
 ('property', 0.06887347251176834),
 ('marriott', 0.06563917547464371),
 ('westin', 0.06506841629743576),
 ('w', 0.042618561536073685),
 ('hotels', 0.037101227790117264),
 ('hyatt', 0.03272540867328644),
 ('am', 0.031583890318870544),
 ('sheraton', 0.023783521726727486),
 ('properties', 0.023593269288539886)]

In [None]:
mdl.get_topic_words(31)

[('its', 0.04651343449950218),
 ('because', 0.026509461924433708),
 ('too', 0.025600189343094826),
 ('bit', 0.020774057134985924),
 ('better', 0.020494280382990837),
 ('hotels', 0.020004672929644585),
 ('much', 0.020004672929644585),
 ('think', 0.017346801236271858),
 ('probably', 0.01706702448427677),
 ('does', 0.016437530517578125)]

In [None]:
[(labels[x], y) for x, y in hotel["doc"].iloc[0].get_topics(top_n=5)]

[('CHARGE', 0.2037794142961502),
 ('GO', 0.08272796869277954),
 ('BED', 0.06548836082220078),
 ('OTHER', 0.06293294578790665),
 ('SHE', 0.054054565727710724)]

What are the most common topics?

In [None]:
hotel["topics"] = [
    [labels[t] for t in map(first, d.get_topics(3))] for d in hotel["doc"]
]

In [None]:
hotel["topics"]

49680                     [CHARGE, GO, BED]
49681             [THEIR, ALWAYS, ELEVATOR]
49705               [SEATTLE, MADE, ALWAYS]
49706     [GREAT_LOCATION, RECOMMEND, WALK]
49707                        [NYC, OLD, GO]
                        ...                
119799                     [LITTLE, 3, NYC]
120641              [UPON, NIGHTS, MINUTES]
122697                 [BOOKED, GO, LITTLE]
123934                [TOO, ITS, RECOMMEND]
128080             [ELEVATOR, CHARGE, DOWN]
Name: topics, Length: 2491, dtype: object

In [None]:
topic_freq = Counter(concat(hotel["topics"]))
topic_freq.most_common()

What are the most common topics in 1-star reviews?

In [None]:
topic_freq = Counter(concat(hotel.query("overall==1")["topics"]))
topic_freq.most_common()

[('DOWN', 194),
 ('DIRTY', 118),
 ('GO', 81),
 ('MINUTES', 40),
 ('THEIR', 39),
 ('UPON', 39),
 ('HE', 39),
 ('SEE', 37),
 ('ELEVATOR', 36),
 ('ITS', 31),
 ('NOISE', 27),
 ('3', 23),
 ('ALWAYS', 22),
 ('CHECK', 19),
 ('LITTLE', 18),
 ('PARKING', 17),
 ('TOO', 16),
 ('BOOKED', 15),
 ('OTHER', 15),
 ('FOUND', 15),
 ('SHE', 14),
 ('NIGHTS', 13),
 ('NYC', 13),
 ('CHARGE', 12),
 ('4', 12),
 ('OLD', 11),
 ('BED', 11),
 ('MONEY', 10),
 ('BEST', 10),
 ('MADE', 8),
 ('RECOMMEND', 7),
 ('REVIEWS', 4),
 ('RESTAURANT', 3),
 ('SEATTLE', 3),
 ('WALK', 3),
 ('VIEW', 2),
 ('HILTON', 2),
 ('CONFERENCE', 2),
 ('GREAT_LOCATION', 2),
 ('AWAY', 1),
 ('SAN_DIEGO', 1),
 ('STREET', 1),
 ('COFFEE', 1)]

What are the most common topics in 5-star reviews?

In [None]:
topic_freq = Counter(concat(hotel.query("overall==5")["topics"]))
topic_freq.most_common()

[('NYC', 210),
 ('GO', 178),
 ('RECOMMEND', 110),
 ('ALWAYS', 101),
 ('GREAT_LOCATION', 84),
 ('MADE', 84),
 ('LITTLE', 83),
 ('DOWN', 52),
 ('BEST', 50),
 ('ITS', 45),
 ('WALK', 42),
 ('UPON', 40),
 ('CHECK', 39),
 ('SEE', 36),
 ('ELEVATOR', 35),
 ('REVIEWS', 32),
 ('WALKING_DISTANCE', 31),
 ('SHE', 27),
 ('AWAY', 25),
 ('STREET', 25),
 ('MONEY', 25),
 ('FOUND', 22),
 ('BED', 21),
 ('3', 21),
 ('HE', 21),
 ('NIGHTS', 21),
 ('CLOSE', 21),
 ('THEIR', 21),
 ('OTHER', 21),
 ('MINUTES', 20),
 ('SEATTLE', 20),
 ('BOOKED', 20),
 ('4', 19),
 ('RESTAURANT', 19),
 ('TOO', 14),
 ('VIEW', 12),
 ('NOISE', 12),
 ('PARKING', 11),
 ('OLD', 9),
 ('AIRPORT', 8),
 ('FREE', 8),
 ('DIRTY', 8),
 ('CONFERENCE', 7),
 ('HILTON', 5),
 ('POOL', 3),
 ('COFFEE', 3),
 ('SAN_FRANCISCO', 2),
 ('CHARGE', 2)]
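
Raw counts are hard to compare across ratings because this hotel has 329 1-star reviews but 575 5-star reviews. One way to put the two lists on the same scale is to divide each count by the number of reviews with that rating. A minimal sketch, using a handful of counts copied from the outputs above and assuming each label appears at most once among a review's top three topics:

```python
# Number of reviews per rating, from the value_counts() output earlier
n_reviews = {1: 329, 5: 575}

# Selected topic counts copied from the two most_common() outputs above
counts = {
    "DOWN": {1: 194, 5: 52},
    "DIRTY": {1: 118, 5: 8},
    "NOISE": {1: 27, 5: 12},
    "NYC": {1: 13, 5: 210},
    "RECOMMEND": {1: 7, 5: 110},
}

# Share of reviews at each rating whose top-3 topics include the label
shares = {
    label: {stars: c / n_reviews[stars] for stars, c in by_stars.items()}
    for label, by_stars in counts.items()
}

for label, s in shares.items():
    print(f"{label:<10} 1-star: {s[1]:6.1%}   5-star: {s[5]:6.1%}")
```

On these numbers, `DIRTY` is among the top three topics in roughly 36% of 1-star reviews but only about 1% of 5-star reviews, a much larger gap than the raw counts alone suggest.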

## Report

Finish this notebook by writing a brief report to the hotel managers describing what you've found in the reviews of their hotel, along with some actionable advice. Use whatever data, charts, word clouds, etc. that you think will help you make your case.

In [1]:
avg_rating = hotel["overall"].mean()
avg_rating

I found that the Hudson Hotel has an average rating of about 3.4 out of 5 stars across 2,491 reviews. This rating suggests that the hotel is mediocre and does not stand out against the other hotels that prospective guests could choose from.

The most common topic in these reviews is the bar. After reading through several reviews, it seems that the bar gets quite noisy, so guests tend to either love it or hate it. Those who wrote positively about it most likely enjoy drinking and going to bars; those who did not were probably less interested in the bar and complained about the noise. One reviewer went so far as to say that they "had to wear earplugs even when reading."

Since the bar is one of the most common topics in 1-star reviews, the hotel should consider changes that would make these guests happier. For example, the bar could close earlier, or quiet hours could be set, so that guests are not disturbed by noise late at night when they are trying to sleep. However, the bar was also the most common topic in 5-star reviews, where many guests had good things to say about it. This makes it difficult to gauge how guests would react to reduced bar hours. There are surely other ways the hotel could improve its services to achieve a higher overall rating.