<a href="https://colab.research.google.com/github/ilektram/EducationDataUS/blob/master/hotels_nlp_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of Hotel Reviews
*written in google colab* 

In [1]:
# Load dependencies & define settings and environment variables
!pip install pyLDAvis

import os
import re
import pickle 
from google.colab import files
import tqdm
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import spacy
import gensim
import pyLDAvis
import pyLDAvis.gensim

pd.set_option('display.max_columns', None)

nlp = spacy.load("en")

ENTITIES = ['@GPE', '@LOC', '@LANGUAGE', '@DATE', '@TIME', '@PERCENT', '@MONEY', '@QUANTITY']

# check list of stopwords & remove negations
nlp.Defaults.stop_words.remove("no")
nlp.Defaults.stop_words.remove("n't")
nlp.Defaults.stop_words.remove("not")
nlp.Defaults.stop_words.remove("n’t")
nlp.Defaults.stop_words.remove("n‘t")

nlp.Defaults.stop_words



{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

## Load the data

We will initially load the data and explore it. We assume that any new files that contain refreshed data would have the same format as the current one. We do data cleaning and and apply any necessary transformations to obtain a cleaned up version of the dataset.

In [2]:

if os.path.exists("/content/Hotel_Reviews.csv"):
  hotels_df = pd.read_csv("/content/Hotel_Reviews.csv")
else:
  uploaded = files.upload()
  csv_hotels = next(iter(uploaded.keys()))
  print("Uploaded file {name} with length {length} bytes.".format(name=csv_hotels, length=len(uploaded[csv_hotels])))

  hotels_df = pd.read_csv(csv_hotels)
hotels_df.head()

Saving Hotel_Reviews.csv to Hotel_Reviews.csv
Uploaded file Hotel_Reviews.csv with length 124452060 bytes.


Unnamed: 0,id,dateAdded,dateUpdated,address,categories,primaryCategories,city,country,keys,latitude,longitude,name,postalCode,province,reviews.date,reviews.dateAdded,reviews.dateSeen,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sourceURLs,websites
0,AWE2FvX5RxPSIh2RscTK,2018-01-18T18:43:12Z,2019-05-20T23:55:47Z,5620 Calle Real,"Hotels,Hotels and motels,Hotel and motel mgmt....",Accommodation & Food Services,Goleta,US,us/ca/goleta/5620callereal/-1127060008,34.44178,-119.81979,Best Western Plus South Coast Inn,93117,CA,2018-01-01T00:00:00.000Z,,2018-01-03T00:00:00Z,3,https://www.tripadvisor.com/Hotel_Review-g3243...,"This hotel was nice and quiet. Did not know, t...",Best Western Plus Hotel,San Jose,UnitedStates,tatsurok2018,https://www.tripadvisor.com/Hotel_Review-g3243...,https://www.bestwestern.com/en_US/book/hotel-r...
1,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-04-02T00:00:00Z,,2016-10-09T00:00:00Z,4,https://www.tripadvisor.com/Hotel_Review-g3217...,We stayed in the king suite with the separatio...,Clean rooms at solid rates in the heart of Carmel,San Francisco,CA,STEPHEN N,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com
2,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-01-06T00:00:00Z,,2016-10-09T00:00:00Z,3,https://www.tripadvisor.com/Hotel_Review-g3217...,"Parking was horrible, somebody ran into my ren...",Business,Prescott Valley,AZ,15Deborah,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com
3,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-08-22T00:00:00Z,,2016-10-31T00:00:00Z,5,https://www.tripadvisor.com/Hotel_Review-g3217...,Not cheap but excellent location. Price is som...,Very good,Guaynabo,PR,Wilfredo M,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com
4,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-03-21T00:00:00Z,,"2016-10-09T00:00:00Z,2016-03-27T00:00:00Z",2,https://www.tripadvisor.com/Hotel_Review-g3217...,If you get the room that they advertised on th...,Low chance to come back here,Reno,NV,Luc D,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com


In [3]:
# Replace empty strings with missing values & get count of missing values by field
hotels_df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
hotels_df.isnull().sum()

id                          0
dateAdded                   0
dateUpdated                 0
address                     0
categories                  0
primaryCategories           0
city                        0
country                     0
keys                        0
latitude                    0
longitude                   0
name                        0
postalCode                  0
province                    0
reviews.date                0
reviews.dateAdded       10179
reviews.dateSeen            0
reviews.rating              0
reviews.sourceURLs          0
reviews.text                0
reviews.title               1
reviews.userCity            0
reviews.userProvince        2
reviews.username            0
sourceURLs                  0
websites                    1
dtype: int64

We can see that we have a review with a missing title. We will remove this from the dataset, since it is a single instance, as our primary focus is on NLP analysis and therefore the title and text of the review are essential. We will also drop reviews that are the same in text body and title if they refer to the same property. We will not spend time cleaning the rest of the columns but rather only focus on the text based fields.

In [4]:
hotels_df.dropna(subset=['reviews.title'], axis=0, inplace=True)
hotels_df.drop_duplicates(subset=['reviews.text', 'name'], inplace=True)
hotels_df.isnull().sum()

id                         0
dateAdded                  0
dateUpdated                0
address                    0
categories                 0
primaryCategories          0
city                       0
country                    0
keys                       0
latitude                   0
longitude                  0
name                       0
postalCode                 0
province                   0
reviews.date               0
reviews.dateAdded       9780
reviews.dateSeen           0
reviews.rating             0
reviews.sourceURLs         0
reviews.text               0
reviews.title              0
reviews.userCity           0
reviews.userProvince       2
reviews.username           0
sourceURLs                 0
websites                   1
dtype: int64

For users that have multiple reviews, normalise their ratings to a 0-1 scale with min-max scaling. For users that have a single review use the existing range of 1-5 to denote the min and max values for normalisation.

In [5]:
# Do min max normalisation of the scores per user
grouper = hotels_df.groupby('reviews.username')['reviews.rating']                                                                             
maxes = grouper.transform('max')                                                                                   
mins = grouper.transform('min')                                                                                    
hotels_df['normed_rating'] = (hotels_df['reviews.rating'] - mins) / (maxes - mins)
# for users that only have a single score, standardise it to a 0, 1 interval
hotels_df['normed_rating'] = np.where(
    hotels_df['normed_rating'].isna(), (hotels_df['reviews.rating'] - 1) / (5 - 1), hotels_df['normed_rating']
)                                                   
hotels_df['normed_rating'].min(), hotels_df['normed_rating'].max()

(0.0, 1.0)

## Preprocessing

We shall now create a column of preprocessed text for both the body and title of the reviews. This will contain the following:

*   Case normalisation
*   Stopword removal (*it is important to note that stopwords that contain negation will be replaced with the word "NO" as negation may hold significant meaning within the given task and thus will not be treated as a regular stopword*)
*   Lemmatization
*   Entity replacement (if any entities such as amounts, brands or locations are recognised they will be standardised against a palceholder for denoising purposes)
*   **Punctuation (as punctuation may in this case be indicative of the sentiment of the review, it will not be discarded as is normally done during preprocessing of text, but will rather be kept inside the review -> actually this was tested and then revised as it did not work well, so punctuation was removed in the end)**

Finally, we will do part-of-speech (pos) tagging on the words in the title & reviews in the case that it is useful to us in later stages of the analysis.


In [6]:
def get_entity_placeholders(s: str) -> dict:
  entity_d = {}
  doc = nlp(s)
  for token in doc.ents:
    if "@" + token.label_ in ENTITIES:
      ent = "@" + token.label_
    else:
      ent = token.text
    entity_d[token.text] = ent
  return entity_d


def preprocess_text(s: str, 
  remove_stop=True, 
  remove_punct=True,
  to_lower=True,
  lemmatize=True,
  replace_entities=True) -> (str, dict):
  no_spaces = re.sub(r'\s+',' ', s.strip())
  tokens = []
  pos_d = {}
  if replace_entities:
    entity_d = get_entity_placeholders(no_spaces)
    for k, v in entity_d.items():
      no_spaces = no_spaces.replace(k, v)
  doc = nlp(no_spaces)
  for token in doc:
    pos_d[token.text] = token.pos_
    if token.text in ENTITIES:
      tokens.append(token.text)
      continue
    if remove_punct and token.is_punct:
      continue
    if remove_stop and token.is_stop:
      continue
    if lemmatize:
      t = token.lemma_
    else:
      t = token.text
    tokens.append(t.lower() if to_lower else t)
  return (tokens, pos_d)


x = "Apple is looking at buying U.K. startup for $1 billion"
preprocess_text(x)

(['apple', 'look', 'buy', '@GPE', 'startup', '@MONEY'],
 {'@GPE': 'ADJ',
  '@MONEY': 'PROPN',
  'Apple': 'PROPN',
  'at': 'ADP',
  'buying': 'VERB',
  'for': 'ADP',
  'is': 'AUX',
  'looking': 'VERB',
  'startup': 'NOUN'})

In [7]:
# Apply preprocessing to data (title & text fields)
hotels_df['preprocessed_title'] = hotels_df['reviews.title'].apply(preprocess_text)
hotels_df['preprocessed_text'] = hotels_df['reviews.text'].apply(preprocess_text)
hotels_df['pos_title'] = hotels_df['preprocessed_title'].apply(lambda x: x[1])
hotels_df['pos_text'] = hotels_df['preprocessed_text'].apply(lambda x: x[1])
hotels_df['preprocessed_title'] = hotels_df['preprocessed_title'].apply(lambda x: x[0])
hotels_df['preprocessed_text'] = hotels_df['preprocessed_text'].apply(lambda x: x[0])
hotels_df.head()

Unnamed: 0,id,dateAdded,dateUpdated,address,categories,primaryCategories,city,country,keys,latitude,longitude,name,postalCode,province,reviews.date,reviews.dateAdded,reviews.dateSeen,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sourceURLs,websites,normed_rating,preprocessed_title,preprocessed_text,pos_title,pos_text
0,AWE2FvX5RxPSIh2RscTK,2018-01-18T18:43:12Z,2019-05-20T23:55:47Z,5620 Calle Real,"Hotels,Hotels and motels,Hotel and motel mgmt....",Accommodation & Food Services,Goleta,US,us/ca/goleta/5620callereal/-1127060008,34.44178,-119.81979,Best Western Plus South Coast Inn,93117,CA,2018-01-01T00:00:00.000Z,,2018-01-03T00:00:00Z,3,https://www.tripadvisor.com/Hotel_Review-g3243...,"This hotel was nice and quiet. Did not know, t...",Best Western Plus Hotel,San Jose,UnitedStates,tatsurok2018,https://www.tripadvisor.com/Hotel_Review-g3243...,https://www.bestwestern.com/en_US/book/hotel-r...,0.5,"[good, western, plus, hotel]","[hotel, nice, quiet, know, train, track, near,...","{'Best': 'ADJ', 'Western': 'PROPN', 'Plus': 'N...","{'This': 'DET', 'hotel': 'NOUN', 'was': 'AUX',..."
1,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-04-02T00:00:00Z,,2016-10-09T00:00:00Z,4,https://www.tripadvisor.com/Hotel_Review-g3217...,We stayed in the king suite with the separatio...,Clean rooms at solid rates in the heart of Carmel,San Francisco,CA,STEPHEN N,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com,0.75,"[clean, room, solid, rate, heart, carmel]","[stay, king, suite, separation, bedroom, live,...","{'Clean': 'ADJ', 'rooms': 'NOUN', 'at': 'ADP',...","{'We': 'PRON', 'stayed': 'VERB', 'in': 'ADP', ..."
2,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-01-06T00:00:00Z,,2016-10-09T00:00:00Z,3,https://www.tripadvisor.com/Hotel_Review-g3217...,"Parking was horrible, somebody ran into my ren...",Business,Prescott Valley,AZ,15Deborah,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com,0.5,[business],"[parking, horrible, somebody, run, rental, car...",{'Business': 'NOUN'},"{'Parking': 'NOUN', 'was': 'AUX', 'horrible': ..."
3,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-08-22T00:00:00Z,,2016-10-31T00:00:00Z,5,https://www.tripadvisor.com/Hotel_Review-g3217...,Not cheap but excellent location. Price is som...,Very good,Guaynabo,PR,Wilfredo M,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com,1.0,[good],"[cheap, excellent, location, price, somewhat, ...","{'Very': 'ADV', 'good': 'ADJ'}","{'Not': 'PART', 'cheap': 'ADJ', 'but': 'CCONJ'..."
4,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/...,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-03-21T00:00:00Z,,"2016-10-09T00:00:00Z,2016-03-27T00:00:00Z",2,https://www.tripadvisor.com/Hotel_Review-g3217...,If you get the room that they advertised on th...,Low chance to come back here,Reno,NV,Luc D,http://www.tripadvisor.com/Hotel_Review-g32172...,http://www.bestwestern.com,0.25,"[low, chance, come]","[room, advertise, website, pay, lucky, stay, d...","{'Low': 'ADJ', 'chance': 'NOUN', 'to': 'PART',...","{'If': 'SCONJ', 'you': 'PRON', 'get': 'AUX', '..."


Finally have a look at general column statistics such as the minimum and maximum ratings, length of text etc.

In [8]:
hotels_df['title_length'] = hotels_df['reviews.title'].apply(lambda x: len(x))
hotels_df['text_length'] = hotels_df['reviews.text'].apply(lambda x: len(x))
hotels_df['preprocessed_title_length'] = hotels_df['preprocessed_title'].apply(lambda x: len(x))
hotels_df['preprocessed_text_length'] = hotels_df['preprocessed_text'].apply(lambda x: len(x))
hotels_df.describe()

Unnamed: 0,latitude,longitude,reviews.dateAdded,reviews.rating,normed_rating,title_length,text_length,preprocessed_title_length,preprocessed_text_length
count,9780.0,9780.0,0.0,9780.0,9780.0,9780.0,9780.0,9780.0,9780.0
mean,34.983743,-101.642603,,4.082618,0.757405,25.688446,651.988446,2.990286,53.195808
std,6.363141,20.213107,,1.151889,0.311207,14.529143,590.254956,1.547095,47.043439
min,19.438604,-159.4803,,1.0,0.0,2.0,8.0,0.0,1.0
25%,29.9577,-117.888954,,4.0,0.75,15.0,303.0,2.0,24.0
50%,33.804844,-95.619149,,4.0,0.75,23.0,464.0,3.0,38.0
75%,38.9464,-84.371578,,5.0,1.0,33.0,813.0,4.0,67.0
max,64.84359,-71.07334,,5.0,1.0,122.0,14254.0,14.0,1169.0


## Analyse Titles

Create gensim dictionary & corpus from preprocessed title tokens.

In [9]:
# Create Dictionary
title_id2word = gensim.corpora.Dictionary(hotels_df['preprocessed_title'])
# Create Corpus
title_texts = hotels_df['preprocessed_title']
# Term Document Frequency
title_corpus = [title_id2word.doc2bow(text) for text in title_texts]
title_corpus[:5]

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
 [(10, 1)],
 [(0, 1)],
 [(11, 1), (12, 1), (13, 1)]]

Train an LDA topic model and look at the most important terms in each topic.

In [10]:
# Build LDA model
title_lda_model = gensim.models.LdaMulticore(corpus=title_corpus,
                                       id2word=title_id2word,
                                       num_topics=10, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)

# Print the Keyword in the 10 topics & their contributing factors
title_doc_lda = title_lda_model[title_corpus]
title_lda_model.print_topics()

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt

[(0,
  '0.183*"@GPE" + 0.088*"hotel" + 0.078*"family" + 0.048*"vacation" + 0.043*"awesome" + 0.038*"time" + 0.034*"fantastic" + 0.033*"downtown" + 0.031*"beautiful" + 0.018*"luxury"'),
 (1,
  '0.341*"stay" + 0.145*"place" + 0.141*"great" + 0.034*"visit" + 0.034*"@TIME" + 0.021*"property" + 0.018*"average" + 0.017*"park" + 0.011*"not" + 0.011*"super"'),
 (2,
  '0.209*"nice" + 0.109*"price" + 0.083*"perfect" + 0.053*"convenient" + 0.032*"worth" + 0.026*"ok" + 0.024*"getaway" + 0.018*"way" + 0.016*"town" + 0.015*"quick"'),
 (3,
  '0.256*"great" + 0.209*"location" + 0.082*"service" + 0.066*"room" + 0.044*"value" + 0.039*"staff" + 0.033*"new" + 0.025*"customer" + 0.013*"view" + 0.010*"need"'),
 (4,
  '0.082*"close" + 0.082*"experience" + 0.062*"amazing" + 0.060*"wonderful" + 0.032*"money" + 0.029*"bed" + 0.027*"no" + 0.022*"well" + 0.019*"marriott" + 0.017*"poor"'),
 (5,
  '0.320*"good" + 0.073*"trip" + 0.065*"disney" + 0.050*"value" + 0.032*"weekend" + 0.025*"quiet" + 0.020*"disneyland" + 

To evaluate how good the topic separation is compute the topic coherence score. We clearly need to tune our model parameters to optimise the generated topic clusters. For this we will only look at the number of topics due to time restrictions and select the number of topics that generates the best coherence score.

In [11]:
# Compute Coherence Score
title_coherence_model_lda = gensim.models.CoherenceModel(
    model=title_lda_model, 
    texts=hotels_df['preprocessed_title'], 
    dictionary=title_id2word, 
    coherence='c_v'
)
title_coherence_lda = title_coherence_model_lda.get_coherence()
print('Coherence Score for title lda model: ', title_coherence_lda)

Coherence Score for title lda model:  0.6361201316537488


In [12]:
# supporting function
def compute_coherence_values(corpus, id2word, texts, k, a, b):
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b,
                                           per_word_topics=True)
    
    coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
    return coherence_model_lda.get_coherence()


def tune_topic_num_lda(corpus, id2word, texts, step_size=1, min_topics=5, max_topics=25):
  topics_range = range(min_topics, max_topics, step_size)
  grid = {}
  grid['Validation_Set'] = {}

  alpha=[0.01]
  beta=[.9]
  num_of_docs = len(corpus)

  model_results = {'Validation_Set': [],
                  'Topics': [],
                  'Alpha': [],
                  'Beta': [],
                  'Coherence': []
                  }
  pbar = tqdm.tqdm(total=len(alpha) * len(beta) * len(topics_range))
  for a in alpha:
      for k in topics_range:
          for b in beta:
            # get the coherence score for the given parameters
            cv = compute_coherence_values(corpus=corpus, id2word=id2word, texts=texts, k=k, a=a, b=b)
            model_results['Validation_Set'].append('title_corpus_100%')
            model_results['Topics'].append(k)
            model_results['Alpha'].append(a)
            model_results['Beta'].append(b)
            model_results['Coherence'].append(cv)  
            pbar.update(1)
  pbar.close()                    
  return pd.DataFrame(model_results)


tune_topic_num_lda(title_corpus, title_id2word, hotels_df['preprocessed_title'])

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad 

Unnamed: 0,Validation_Set,Topics,Alpha,Beta,Coherence
0,title_corpus_100%,5,0.01,0.9,0.56459
1,title_corpus_100%,6,0.01,0.9,0.582941
2,title_corpus_100%,7,0.01,0.9,0.581613
3,title_corpus_100%,8,0.01,0.9,0.578417
4,title_corpus_100%,9,0.01,0.9,0.577455
5,title_corpus_100%,10,0.01,0.9,0.577031
6,title_corpus_100%,11,0.01,0.9,0.576227
7,title_corpus_100%,12,0.01,0.9,0.578549
8,title_corpus_100%,13,0.01,0.9,0.578451
9,title_corpus_100%,14,0.01,0.9,0.578284


We will select 20 as the number of clusters for our topic modelling as that value yielded one of the higher scores and results in more separation which will help us detect more differences in our corpus. We retrain our model with the winning parameter and visualise the results.

In [37]:
title_lda_model = gensim.models.LdaMulticore(corpus=title_corpus,
                                           id2word=title_id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.01,
                                           eta=0.9)

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt

In [39]:
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_title = pyLDAvis.gensim.prepare(title_lda_model, title_corpus, title_id2word)
LDAvis_title

### LDA title clusters:
We can see that in general the reviews are positive (or rather positive reviews are more prominent and noticeable) and focus on a few primary factors such as location, staff & service quality, cleanliness & room quality, value for money, friendliness of staff as well as a need for family friendly accomodations and finally comfort.

This is in line with general expectations. However digging deeper into each cluster should reveal more specific information, per topic number (see the first 5 here or explore more through the interactive graph above):

1.   This topic is mainly focused on positive reviews, where users enjoyed their stay and considered it a good deal in terms of value.
2. This topic is focused on short (perhaps weekend) breaks (such as a trip to disney land) where the stay revolves a lot around activities.
3. This topic revolves around central city area accomodations of perhaps boutique nature where the location and level of service and comfort that the accomodation provided are explicitly praised (perhaps more high-end venues).
4. This topic revolves around the friendliness of the staff, breakfast quality and the way certain accomodations are able to make one feel like home.
5. This topic focuses on rather average reviews or worse, about small, noisy or unclean rooms but also the breakfast menu options.



## Analyse Text

Next we perform a similar analysis on the actual body of text of the reviews...

In [15]:
# Create Dictionary
text_id2word = gensim.corpora.Dictionary(hotels_df['preprocessed_text'])
# Create Corpus
texts = hotels_df['preprocessed_text']
# Term Document Frequency
text_corpus = [text_id2word.doc2bow(text) for text in texts]
text_corpus[:5]

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 2),
  (14, 1)],
 [(3, 1),
  (4, 1),
  (11, 2),
  (15, 1),
  (16, 1),
  (17, 2),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1)],
 [(11, 1),
  (28, 1),
  (35, 1),
  (36, 1),
  (37, 2),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1)],
 [(3, 1),
  (7, 1),
  (9, 1),
  (19, 1),
  (28, 2),
  (36, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 2),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1)],
 [(3, 1),
  (6, 1),
  (11, 3),
  (15, 1),
  (28, 4),
  (34, 1),
 

In [16]:
tune_topic_num_lda(text_corpus, text_id2word, hotels_df['preprocessed_text'])

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad 

Unnamed: 0,Validation_Set,Topics,Alpha,Beta,Coherence
0,title_corpus_100%,5,0.01,0.9,0.413654
1,title_corpus_100%,6,0.01,0.9,0.44742
2,title_corpus_100%,7,0.01,0.9,0.41339
3,title_corpus_100%,8,0.01,0.9,0.416779
4,title_corpus_100%,9,0.01,0.9,0.415337
5,title_corpus_100%,10,0.01,0.9,0.379595
6,title_corpus_100%,11,0.01,0.9,0.418471
7,title_corpus_100%,12,0.01,0.9,0.41339
8,title_corpus_100%,13,0.01,0.9,0.415337
9,title_corpus_100%,14,0.01,0.9,0.415337


In the case of text we can see that the best coherence score is the result of the next to last number of topics.

In [17]:
# Build LDA model
text_lda_model = gensim.models.LdaMulticore(corpus=text_corpus,
                                       id2word=text_id2word,
                                       num_topics=19, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)

# Print the Keyword in the 10 topics
text_doc_lda = text_lda_model[text_corpus]
text_lda_model.print_topics()

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
  score += np.sum(cnt

[(0,
  '0.055*"stay" + 0.049*"thank" + 0.029*"time" + 0.023*"review" + 0.022*"hotel" + 0.022*"@GPE" + 0.022*"hope" + 0.021*"staff" + 0.019*"take" + 0.019*"guest"'),
 (1,
  '0.045*"group" + 0.031*"party" + 0.022*"f." + 0.021*"reserve" + 0.020*"long" + 0.018*"aware" + 0.017*"excited" + 0.015*"gym" + 0.015*"apple" + 0.014*"grab"'),
 (2,
  '0.073*"room" + 0.035*"bed" + 0.026*"clean" + 0.022*"nice" + 0.019*"bathroom" + 0.016*"no" + 0.016*"good" + 0.015*"small" + 0.014*"stay" + 0.013*"shower"'),
 (3,
  '0.045*"chicago" + 0.021*"beach" + 0.019*"despite" + 0.018*"drink" + 0.016*"cheese" + 0.015*"village" + 0.015*"tasty" + 0.015*"cook" + 0.014*"100" + 0.013*"lamp"'),
 (4,
  '0.054*"suite" + 0.052*"disney" + 0.034*"kid" + 0.030*"kitchen" + 0.030*"bedroom" + 0.028*"family" + 0.025*"suites" + 0.023*"area" + 0.021*"2" + 0.020*"homewood"'),
 (5,
  '0.035*"machine" + 0.030*"regards" + 0.022*"lake" + 0.019*"exceed" + 0.014*"croissant" + 0.013*"honors" + 0.013*"taxi" + 0.013*"detail" + 0.013*"weather" 

In [18]:
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_title = pyLDAvis.gensim.prepare(text_lda_model, text_corpus, text_id2word)
LDAvis_title

Similarly we can explore the LDA results of the reviews' text by interrogating the graph above. Interestingly we can see that topic 2 is focused on accomodations that have pool and other luxury facilities, while topic 6 stresses the importance of the hotel location and how people pay attention to accessibility such as parking and commute to and from the nearby airports etc.

## TFIDF significant terms

Next we will generate the 20 most significant terms per review (we shall combine the title and text fields to a single string for this) by computing and sorting by their tf-idf scores for keyword identification. Then, for different segments of the data we will examine frequently co-occuring significant terms to obtain some insight.

In [19]:
def display_scores(vectorizer, tfidf_result, n=25):
    scores_d = {}
    scores = zip(vectorizer.get_feature_names(),
                 np.asarray(tfidf_result.sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    for item in sorted_scores[:n]:
      scores_d[item[0]] = item[1]
    return scores_d
        
        
# First combine a column of the preprocessed title and text so as to have them joined and run the analysis only once. 
hotels_df['title_and_text'] = hotels_df['preprocessed_title'] + hotels_df['preprocessed_text']

tfidf_model = TfidfVectorizer(ngram_range=(1, 1), max_df=.9, min_df=10,  lowercase=False)
tfidf_matrix = tfidf_model.fit_transform(hotels_df['title_and_text'].apply(lambda x: " ".join(x)))
tfidf_model.vocabulary_

{'good': 1295,
 'western': 3199,
 'plus': 2194,
 'hotel': 1448,
 'nice': 1958,
 'quiet': 2315,
 'know': 1632,
 'train': 2988,
 'track': 2983,
 'near': 1938,
 'pass': 2106,
 'stay': 2769,
 'change': 528,
 'category': 505,
 'clean': 573,
 'room': 2487,
 'solid': 2696,
 'rate': 2336,
 'heart': 1377,
 'king': 1626,
 'suite': 2833,
 'bedroom': 341,
 'live': 1701,
 'space': 2717,
 'sofa': 2692,
 'bed': 339,
 'DATE': 71,
 'leave': 1666,
 'TIME': 77,
 'comfortable': 620,
 'locate': 1710,
 'walk': 3155,
 'distance': 899,
 'place': 2171,
 'want': 3164,
 'business': 453,
 'parking': 2096,
 'horrible': 1438,
 'somebody': 2700,
 'run': 2502,
 'rental': 2418,
 'car': 487,
 'try': 3023,
 'breakfast': 417,
 'restaurant': 2449,
 'open': 2023,
 'late': 1654,
 'world': 3244,
 'enjoy': 1016,
 'ask': 247,
 'coffee': 602,
 'item': 1571,
 'vend': 3115,
 'machine': 1755,
 'stale': 2755,
 'cheap': 538,
 'excellent': 1059,
 'location': 1712,
 'price': 2249,
 'somewhat': 2701,
 'standard': 2758,
 'reservation': 

In [22]:
def get_top_n_terms(tokens, n=25, model=tfidf_model):
    text = " ".join(tokens)
    i = 0
    tfidf_matrix = model.transform(pd.Series([text]))
    feature_names = model.get_feature_names()
    feature_index = tfidf_matrix[i, :].nonzero()[1]
    tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
    results = [(feature_names[i], s) for (i, s) in sorted(tfidf_scores, reverse=True, key=lambda x: x[1])]
    return results[:n]


def get_top_n_words(corpus, n=25):
    vec = CountVectorizer(lowercase=False).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in     vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

# Get most significant terms per review record
hotels_df['top_terms'] = hotels_df['title_and_text'].map(get_top_n_terms)
# Get most frequent significant terms overall
get_top_n_words(hotels_df['top_terms'].apply(lambda x: " ".join([y[0] for y in x])))

[('hotel', 3259),
 ('room', 3044),
 ('stay', 2991),
 ('great', 2262),
 ('staff', 1846),
 ('GPE', 1804),
 ('clean', 1681),
 ('good', 1679),
 ('nice', 1542),
 ('DATE', 1515),
 ('location', 1431),
 ('TIME', 1422),
 ('breakfast', 1338),
 ('place', 1192),
 ('friendly', 1144),
 ('time', 967),
 ('service', 946),
 ('comfortable', 907),
 ('area', 860),
 ('bed', 835),
 ('helpful', 750),
 ('walk', 738),
 ('restaurant', 732),
 ('need', 718),
 ('no', 712)]

Now we can look at the most significant terms for specific slices of the data.

In [23]:
# Significant terms in bad reviews (where 0 <= rating <= 0.4)
subset = hotels_df[hotels_df['normed_rating'] <= 0.4]['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
get_top_n_words(subset)

[('room', 472),
 ('hotel', 344),
 ('stay', 333),
 ('TIME', 259),
 ('no', 214),
 ('DATE', 201),
 ('bad', 164),
 ('place', 150),
 ('check', 139),
 ('experience', 132),
 ('desk', 130),
 ('good', 128),
 ('bed', 122),
 ('clean', 121),
 ('like', 119),
 ('guest', 117),
 ('service', 116),
 ('work', 116),
 ('book', 115),
 ('need', 113),
 ('door', 111),
 ('dirty', 111),
 ('nice', 110),
 ('time', 109),
 ('tell', 108)]

We can see that for poor reviews the most significant terms are related to dirty hotel rooms, uncomfortable beds, service quality and the ability to work comfortably.

In [24]:
pd.set_option('display.max_colwidth', -1)

hotels_df[hotels_df['normed_rating'] <= 0.4].head(5)

  """Entry point for launching an IPython kernel.


Unnamed: 0,id,dateAdded,dateUpdated,address,categories,primaryCategories,city,country,keys,latitude,longitude,name,postalCode,province,reviews.date,reviews.dateAdded,reviews.dateSeen,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sourceURLs,websites,normed_rating,preprocessed_title,preprocessed_text,pos_title,pos_text,title_length,text_length,preprocessed_title_length,preprocessed_text_length,title_and_text,top_terms
4,AVwcj_OhkufWRAb5wi9T,2016-11-06T20:21:05Z,2019-05-20T23:31:56Z,5th And San Carlos PO Box 3574,"Hotels,Lodging,Hotel",Accommodation & Food Services,Carmel by the Sea,US,us/ca/carmelbythesea/5thandsancarlospobox3574/50035798,36.55722,-121.92194,Best Western Carmel's Town House Lodge,93921,CA,2016-03-21T00:00:00Z,,"2016-10-09T00:00:00Z,2016-03-27T00:00:00Z",2,"https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or30-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html%252523REVIEWS,http://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html","If you get the room that they advertised on the website and for what you paid, you may be lucky.If you stay many days , they will give you the not so good rooms.Nobody wants to stay in these rooms: low light/dark rooms, near pool, noisy, smelly bathrooms, or difficult access. If you stay one-two days you will get probably... More",Low chance to come back here,Reno,NV,Luc D,"http://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or30-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html%252523REVIEWS,https://www.yellowpages.com/carmel-ca/mip/best-western-carmels-town-house-lodge-496678069,http://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or60-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html%252523REVIEWS,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or50-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html%252523REVIEWS,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or40-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html%252523REVIEWS,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or30-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html%252523REVIEWS,http://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or40-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html%252523REVIEWS,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or10-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html%252523REVIEWS,http://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or10-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html%252523REVIEWS,https://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or20-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_County_California.html%252523REVIEWS,http://www.tripadvisor.com/Hotel_Review-g32172-d76386-Reviews-or20-BEST_WESTERN_Carmel_s_Town_House_Lodge-Carmel_Monterey_Peninsula_California.html%252523REVIEWS",http://www.bestwestern.com,0.25,"[low, chance, come]","[room, advertise, website, pay, lucky, stay, day, good, room, want, stay, room, low, light, dark, room, near, pool, noisy, smelly, bathroom, difficult, access, stay, @DATE, probably]","{'Low': 'ADJ', 'chance': 'NOUN', 'to': 'PART', 'come': 'VERB', 'back': 'ADV', 'here': 'ADV'}","{'If': 'SCONJ', 'you': 'PRON', 'get': 'AUX', 'the': 'DET', 'room': 'NOUN', 'that': 'DET', 'they': 'PRON', 'advertised': 'VERB', 'on': 'ADP', 'website': 'NOUN', 'and': 'CCONJ', 'for': 'ADP', 'what': 'PRON', 'paid': 'VERB', ',': 'PUNCT', 'may': 'VERB', 'be': 'AUX', 'lucky': 'ADJ', '.': 'PUNCT', 'stay': 'VERB', 'many': 'ADJ', 'days': 'NOUN', 'will': 'VERB', 'give': 'VERB', 'not': 'PART', 'so': 'ADV', 'good': 'ADJ', 'rooms': 'NOUN', 'Nobody': 'PRON', 'wants': 'VERB', 'to': 'PART', 'in': 'ADP', 'these': 'DET', ':': 'PUNCT', 'low': 'ADJ', 'light': 'ADJ', '/': 'SYM', 'dark': 'ADJ', 'near': 'SCONJ', 'pool': 'NOUN', 'noisy': 'ADJ', 'smelly': 'ADV', 'bathrooms': 'NOUN', 'or': 'CCONJ', 'difficult': 'ADJ', 'access': 'NOUN', '@DATE': 'PROPN', 'probably': 'ADV', '...': 'PUNCT', 'More': 'ADJ'}",28,331,3,26,"[low, chance, come, room, advertise, website, pay, lucky, stay, day, good, room, want, stay, room, low, light, dark, room, near, pool, noisy, smelly, bathroom, difficult, access, stay, @DATE, probably]","[(low, 0.4195346203216349), (smelly, 0.2835623059884709), (advertise, 0.2616781610302884), (lucky, 0.2603253034861215), (room, 0.23510537658191688), (difficult, 0.23488124246532427), (dark, 0.23167326263488555), (website, 0.22998869528276414), (probably, 0.20794225643409187), (chance, 0.20014933406086), (noisy, 0.19804520796384154), (light, 0.18999518983413716), (access, 0.1790434974320303), (stay, 0.1759687472768677), (day, 0.17140656349884945), (pay, 0.16733515141773536), (near, 0.15676825937051025), (want, 0.14018102066385418), (bathroom, 0.1385919209990519), (come, 0.12995261158162277), (pool, 0.12834440590094198), (DATE, 0.08784826366238013), (good, 0.08241782433619685)]"
6,AVweLARAByjofQCxv5vX,2016-05-16T22:39:30Z,2019-05-20T23:28:44Z,167 W Main St,"Hotels,Hotels and motels,Hotel,Restaurants",Accommodation & Food Services,Lexington,US,us/ky/lexington/167wmainst/-1165617038,38.047014,-84.497742,21c Museum Hotel Lexington,40507,KY,2016-04-18T00:00:00Z,,2016-05-10T00:00:00Z,1,https://www.tripadvisor.com/Hotel_Review-g39588-d8623360-Reviews-21c_Museum_Hotel_Lexington-Lexington_Kentucky.html,We recently stayed at this hotel on a trip to Lexington with other friends. Our group shared the feeling that we would not be back. We routinely waited more than 10 minutes for elevators. The workout room is tiny with 2 treadmills and 2 cheap elliptical machines that rocked off the floor when in use. Everything about the hotel seemed... More,Does not live up to its reputation,Charlotte,NC,GGTravels2016,"https://www.tripadvisor.com/Hotel_Review-g39588-d8623360-Reviews-21c_Museum_Hotel_Lexington-Lexington_Kentucky.html,https://www.yellowbook.com/profile/21c-museum-hotel-lexington_1899659177.html,https://www.hotels.com/ho392800800/21c-museum-hotel-lexington-lexington-united-states-of-america/","http://www.firstnational.com/,https://www.21cmuseumhotels.com,https://www.21cmuseumhotels.com/",0.0,"[live, reputation]","[recently, stay, hotel, trip, @GPE, friend, group, share, feeling, routinely, wait, @TIME, elevator, workout, room, tiny, 2, treadmill, 2, cheap, elliptical, machine, rock, floor, use, hotel]","{'Does': 'AUX', 'not': 'PART', 'live': 'VERB', 'up': 'ADP', 'to': 'ADP', 'its': 'DET', 'reputation': 'NOUN'}","{'We': 'PRON', 'recently': 'ADV', 'stayed': 'VERB', 'at': 'ADP', 'this': 'DET', 'hotel': 'NOUN', 'on': 'ADP', 'a': 'DET', 'trip': 'NOUN', 'to': 'PART', '@GPE': 'VERB', 'with': 'ADP', 'other': 'ADJ', 'friends': 'NOUN', '.': 'PUNCT', 'Our': 'DET', 'group': 'NOUN', 'shared': 'VERB', 'the': 'DET', 'feeling': 'NOUN', 'that': 'DET', 'we': 'PRON', 'would': 'VERB', 'not': 'PART', 'be': 'AUX', 'back': 'ADV', 'routinely': 'ADV', 'waited': 'VERB', '@TIME': 'PROPN', 'for': 'ADP', 'elevators': 'NOUN', 'The': 'DET', 'workout': 'NOUN', 'room': 'NOUN', 'is': 'AUX', 'tiny': 'ADJ', '2': 'NUM', 'treadmills': 'NOUN', 'and': 'CCONJ', 'cheap': 'ADJ', 'elliptical': 'ADJ', 'machines': 'NOUN', 'rocked': 'VERB', 'off': 'ADP', 'floor': 'NOUN', 'when': 'ADV', 'in': 'ADP', 'use': 'NOUN', 'Everything': 'PRON', 'about': 'ADP', 'seemed': 'VERB', '...': 'PUNCT', 'More': 'ADJ'}",34,343,2,26,"[live, reputation, recently, stay, hotel, trip, @GPE, friend, group, share, feeling, routinely, wait, @TIME, elevator, workout, room, tiny, 2, treadmill, 2, cheap, elliptical, machine, rock, floor, use, hotel]","[(elliptical, 0.31686723779408993), (reputation, 0.3107099631645143), (treadmill, 0.29616101854024524), (rock, 0.27310908848101734), (workout, 0.27150689571676906), (feeling, 0.2592280526155571), (tiny, 0.23648590597057856), (machine, 0.2183867372211048), (live, 0.21391025563614352), (group, 0.2120352728643532), (cheap, 0.20648517653047346), (recently, 0.19283315251030764), (elevator, 0.18533169002931454), (friend, 0.18434163835401252), (wait, 0.1784452581073733), (use, 0.16276033946508192), (floor, 0.14762381467837132), (share, 0.1429207023160703), (trip, 0.1358693249592505), (hotel, 0.12453771229268369), (TIME, 0.09315446255763532), (GPE, 0.08373480114623279), (room, 0.05981849292835095), (stay, 0.059696268499553645)]"
7,AV1thAoL3-Khe5l_Ott5,2017-07-23T03:35:56Z,2019-05-20T23:28:32Z,115 W Steve Wariner Dr,"Hotels and motels,Hotel",Accommodation & Food Services,Russell Springs,US,us/ky/russellsprings/115wstevewarinerdr/-411694349,37.065296,-85.07358,Springs Motel LLC,42642,KY,2015-08-13T00:00:00.000Z,,"2017-12-17T00:00:00Z,2017-07-13T00:00:00Z",1,"https://www.tripadvisor.com/Hotel_Review-g39813-d4943963-Reviews-The_Springs_Motel-Russell_Springs_Kentucky.html,http://tripadvisor.com/Hotel_Review-g39813-d4943963-Reviews-The_Springs_Motel-Russell_Springs_Kentucky.html","I reserved a room a week in advance, knowing a motel is usually not great accommodations but we were just passing thru during the longest yard sale. I was quoted 50 over the phone and given our room numbers as a confirmation number and she...More",worst customer service ever,Hanceville,Alabama,madaramapquest,"https://www.tripadvisor.com/Hotel_Review-g39813-d4943963-Reviews-The_Springs_Motel-Russell_Springs_Kentucky.html,https://www.yellowbook.com/profile/springs-motel-llc_1889659990.html,http://tripadvisor.com/Hotel_Review-g39813-d4943963-Reviews-The_Springs_Motel-Russell_Springs_Kentucky.html","http://www.springsmotelky.com/,http://www.springsmotelky.com",0.0,"[bad, customer, service]","[reserve, room, week, advance, know, motel, usually, great, accommodation, pass, long, yard, sale, quote, 50, phone, give, room, number, confirmation, number]","{'worst': 'ADJ', 'customer': 'NOUN', 'service': 'NOUN', 'ever': 'ADV'}","{'I': 'PRON', 'reserved': 'VERB', 'a': 'DET', 'room': 'NOUN', 'week': 'NOUN', 'in': 'ADP', 'advance': 'NOUN', ',': 'PUNCT', 'knowing': 'VERB', 'motel': 'NOUN', 'is': 'AUX', 'usually': 'ADV', 'not': 'PART', 'great': 'ADJ', 'accommodations': 'NOUN', 'but': 'CCONJ', 'we': 'PRON', 'were': 'AUX', 'just': 'ADV', 'passing': 'VERB', 'thru': 'ADP', 'during': 'ADP', 'the': 'DET', 'longest': 'ADJ', 'yard': 'NOUN', 'sale': 'NOUN', '.': 'PUNCT', 'was': 'AUX', 'quoted': 'VERB', '50': 'NUM', 'over': 'ADP', 'phone': 'NOUN', 'and': 'CCONJ', 'given': 'VERB', 'our': 'DET', 'numbers': 'NOUN', 'as': 'SCONJ', 'confirmation': 'NOUN', 'number': 'NOUN', 'she': 'PRON', '...': 'PUNCT', 'More': 'ADJ'}",27,246,3,21,"[bad, customer, service, reserve, room, week, advance, know, motel, usually, great, accommodation, pass, long, yard, sale, quote, 50, phone, give, room, number, confirmation, number]","[(number, 0.39582360924561666), (quote, 0.2953704121662651), (yard, 0.2872485150252962), (confirmation, 0.27208326307483405), (sale, 0.24466638710320668), (week, 0.23870360226896825), (advance, 0.23815293040191587), (50, 0.23654448996223776), (reserve, 0.2284704058898844), (usually, 0.20669108892827617), (phone, 0.19750907734106984), (pass, 0.18531821561607317), (accommodation, 0.17572727144806605), (motel, 0.17046923918165705), (long, 0.16915536887687008), (customer, 0.1596297195917629), (bad, 0.15020720768139031), (give, 0.14601142375471735), (know, 0.13056646261999588), (room, 0.10991055357917015), (service, 0.0986755415887369), (great, 0.07052551571589939)]"
9,AVwdo6WHByjofQCxrGaj,2016-11-02T17:23:39Z,2019-05-20T23:26:47Z,1107 N Main St,"Hotels,Bed Breakfasts,Bed & Breakfasts,Hotels and motels,Lodging,Hotels Motels,travel,Motels,Hotels & Motels,Hotel",Accommodation & Food Services,Hopkinsville,US,us/ky/hopkinsville/1107nmainst/-1877262391,36.889,-87.4813,American Inn,42240,KY,2014-07-15T00:00:00Z,,"2016-03-19T00:00:00Z,2016-05-10T00:00:00Z,2016-07-18T00:00:00Z",1,https://www.tripadvisor.com/Hotel_Review-g39517-d6902839-Reviews-American_Inn-Hopkinsville_Kentucky.html,"Hello, I have traveled a lot and abroad and by far this is the worst place i have ever booked. i paid got the key and walked in. OMG the place is HORRIBLE. this place is one of those hotels they show in horror movies where they kill people. I kid you not! this place is horrible. i only stayed... More",The worst place i've booked,Lawton,OK,johnnytuba,"http://www.citysearch.com/profile/4378740/hopkinsville_ky/american_inn.html,https://www.yellowbook.com/profile/american-inn_1530194539.html,http://www.yellowpages.com/hopkinsville-ky/mip/american-inn-8583386",http://www.americinn.com/,0.0,"[bad, place, book]","[hello, travel, lot, abroad, far, bad, place, book, pay, get, key, walk, omg, place, horrible, place, hotel, horror, movie, kill, people, kid, place, horrible, stay]","{'The': 'DET', 'worst': 'ADJ', 'place': 'NOUN', 'i': 'PRON', ''ve': 'AUX', 'booked': 'VERB'}","{'Hello': 'INTJ', ',': 'PUNCT', 'I': 'PRON', 'have': 'AUX', 'traveled': 'VERB', 'a': 'DET', 'lot': 'NOUN', 'and': 'CCONJ', 'abroad': 'ADV', 'by': 'ADP', 'far': 'ADV', 'this': 'DET', 'is': 'AUX', 'the': 'DET', 'worst': 'ADJ', 'place': 'NOUN', 'i': 'PRON', 'ever': 'ADV', 'booked': 'VERB', '.': 'PUNCT', 'paid': 'VERB', 'got': 'VERB', 'key': 'NOUN', 'walked': 'VERB', 'in': 'ADP', 'OMG': 'VERB', 'HORRIBLE': 'PROPN', 'one': 'NUM', 'of': 'ADP', 'those': 'DET', 'hotels': 'NOUN', 'they': 'PRON', 'show': 'VERB', 'horror': 'NOUN', 'movies': 'NOUN', 'where': 'ADV', 'kill': 'VERB', 'people': 'NOUN', 'kid': 'VERB', 'you': 'PRON', 'not': 'PART', '!': 'PUNCT', 'horrible': 'ADJ', 'only': 'ADV', 'stayed': 'VERB', '...': 'PUNCT', 'More': 'ADJ'}",27,301,3,25,"[bad, place, book, hello, travel, lot, abroad, far, bad, place, book, pay, get, key, walk, omg, place, horrible, place, hotel, horror, movie, kill, people, kid, place, horrible, stay]","[(place, 0.4646290889527978), (horrible, 0.4144421374322214), (omg, 0.2940124045050572), (kill, 0.2913253751340715), (bad, 0.2890782600262412), (book, 0.2648634725847731), (movie, 0.23421671033008187), (hello, 0.19564416871304816), (key, 0.18022542710213954), (kid, 0.17129583542808616), (far, 0.16182693196683823), (pay, 0.15055274367417598), (people, 0.13669648734934844), (travel, 0.13199511200771055), (lot, 0.12563952960243296), (get, 0.12527245357421815), (walk, 0.10055239219715474), (hotel, 0.055047738781852366), (stay, 0.05277348578383681)]"
11,AWB2mcqARxPSIh2RpdHz,2017-12-21T00:00:53Z,2019-05-20T23:10:15Z,4200 Via Real,"Hotels,Lodging,Hotels Motels,Motels,Hotel",Accommodation & Food Services,Carpinteria,US,us/ca/carpinteria/4200viareal/1997906078,34.40507,-119.53119,Motel 6 Santa Barbara - Carpinteria North,93013,CA,2017-11-11T00:00:00.000Z,,"2018-01-03T00:00:00Z,2017-12-17T00:00:00Z",3,https://www.tripadvisor.com/Hotel_Review-g32176-d240039-Reviews-Motel_6_Santa_Barbara_Carpinteria_North-Carpinteria_California.html,"I stayed here for three nights while I explored nearby Santa Barbara. This place is ok for a night or two, if you can live the excess noise coming from the freeway, then three nights is just about doable",Good for location,London,UnitedKingdom,Jonathan C,"http://www.citysearch.com/profile/695299/carpinteria_ca/motel_6.html,https://www.tripadvisor.com/Hotel_Review-g32176-d240039-Reviews-Motel_6_Santa_Barbara_Carpinteria_North-Carpinteria_California.html,https://www.yellowpages.com/carpinteria-ca/mip/motel-6-santa-barbara-carpinteria-north-2504536","https://www.motel6.com/en/motels.ca.carpinteria.346.html?lid=X_PMG_NaturalSearch_Local_Yext_346&utm_source=local&utm_medium=local&utm_campaign=yextlocal-346&travelAgentNumber=TA001305&corporatePlusNumber=CP792N5W,https://www.motel6.com/en/motels.ca.carpinteria.346.html?lid=X_PMG_NaturalSearch_Local_Yext_346andutm_source=localandutm_medium=localandutm_campaign=yextlocal-346andtravelAgentNumber=TA001305andcorporatePlusNumber=CP792N5W",0.0,"[good, location]","[stay, @DATE, explore, nearby, @GPE, place, ok, night, live, excess, noise, come, freeway, @DATE, doable]","{'Good': 'ADJ', 'for': 'ADP', 'location': 'NOUN'}","{'I': 'PRON', 'stayed': 'VERB', 'here': 'ADV', 'for': 'ADP', '@DATE': 'PROPN', 'while': 'SCONJ', 'explored': 'VERB', 'nearby': 'ADJ', '@GPE': 'NOUN', '.': 'PUNCT', 'This': 'DET', 'place': 'NOUN', 'is': 'AUX', 'ok': 'ADJ', 'a': 'DET', 'night': 'NOUN', 'or': 'CCONJ', 'two': 'NUM', ',': 'PUNCT', 'if': 'SCONJ', 'you': 'PRON', 'can': 'VERB', 'live': 'VERB', 'the': 'DET', 'excess': 'ADJ', 'noise': 'NOUN', 'coming': 'VERB', 'from': 'ADP', 'freeway': 'NOUN', 'then': 'ADV', 'just': 'ADV', 'about': 'ADV', 'doable': 'ADJ'}",17,203,2,15,"[good, location, stay, @DATE, explore, nearby, @GPE, place, ok, night, live, excess, noise, come, freeway, @DATE, doable]","[(explore, 0.3878208860026493), (freeway, 0.3731959443373486), (live, 0.3513240434681954), (nearby, 0.3195797546270638), (ok, 0.3162752434546666), (DATE, 0.2936786268769257), (night, 0.2785427113826536), (noise, 0.2727707937559321), (come, 0.21721718186164224), (place, 0.17264104714987494), (location, 0.14832477591117307), (good, 0.13776227595266152), (GPE, 0.1375251917221735), (stay, 0.09804454848065065)]"


In [25]:
# Significant terms in good reviews (where 0.6 <= rating <= 1.0)
subset = hotels_df[hotels_df['normed_rating'] >= 0.6]['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
get_top_n_words(subset)

[('hotel', 2594),
 ('stay', 2428),
 ('room', 2140),
 ('great', 2094),
 ('staff', 1626),
 ('GPE', 1578),
 ('clean', 1394),
 ('good', 1342),
 ('nice', 1277),
 ('location', 1185),
 ('DATE', 1147),
 ('breakfast', 1119),
 ('friendly', 1017),
 ('TIME', 989),
 ('place', 928),
 ('comfortable', 808),
 ('time', 787),
 ('service', 768),
 ('area', 720),
 ('helpful', 680),
 ('restaurant', 631),
 ('walk', 628),
 ('excellent', 614),
 ('bed', 585),
 ('thank', 577)]

On the other hand, hotels are commonly praised for their nice rooms, friendly and accomodating staff, cleanliness, central/convenient locations, breakfast options and quality, restaurants, beds and general comfort of stay as well as the ability to reach points of interest on foot.

In [26]:
# Significant terms in New York reviews vs New Orleans reviews 
# (unfortunately all reviews are in the US so we can't compare different countries)
subset = hotels_df[hotels_df['city'] == 'New York']['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
print("New York \n", get_top_n_words(subset))

subset = hotels_df[hotels_df['city'] == 'New Orleans']['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
get_top_n_words(subset)
print("New Orleans \n", get_top_n_words(subset))

New York 
 [('hotel', 37), ('times', 33), ('nyc', 32), ('square', 25), ('location', 24), ('express', 23), ('stay', 20), ('room', 20), ('great', 18), ('ritz', 18), ('good', 18), ('holiday', 17), ('DATE', 15), ('staff', 13), ('view', 12), ('park', 12), ('service', 11), ('subway', 10), ('GPE', 10), ('time', 10), ('inn', 10), ('wonderful', 9), ('small', 9), ('carlton', 9), ('elevator', 9)]
New Orleans 
 [('hotel', 269), ('french', 249), ('quarter', 217), ('room', 185), ('great', 182), ('stay', 169), ('location', 136), ('bourbon', 128), ('GPE', 119), ('street', 119), ('DATE', 114), ('staff', 111), ('TIME', 103), ('nice', 102), ('time', 97), ('st', 97), ('courtyard', 97), ('good', 92), ('breakfast', 88), ('walk', 81), ('market', 74), ('nola', 71), ('place', 70), ('friendly', 68), ('parking', 64)]


In order to dig a little deeper and exhibit the potential of this technique we look at comparing the hotels in two different US cities. However, we could similarly compare different hotels or hotel brands or even the same type of hotels for stays in different time periods.

We can see that travellers seek from New York accomodations thigs such as central loactions, near points of interest or train/tube stations, service quality and perhaps a good city views, while it would not come as a surpirse that many of these hotel rooms are not very spatious.

In New Orleans on the other hand a theme that is prominent among other more standard guest expectations are courtyeards which guests seem to enjoy or pay attention to, as well as a well stocked hotel bar and the quality of breakfast provided. People staying in New Orleans seem to be more concerned about parking availability whereas travellers in New York seem to be using public transport more and not their own vehicles.

In [27]:
# Look at what the positive comments are in the two places 
subset = hotels_df[(hotels_df['city'] == 'New York') & (hotels_df['normed_rating'] >= 0.7)]['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
print("New York \n", get_top_n_words(subset))

subset = hotels_df[(hotels_df['city'] == 'New Orleans') & (hotels_df['normed_rating'] >= 0.7)]['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
get_top_n_words(subset)
print("New Orleans \n", get_top_n_words(subset))

New York 
 [('hotel', 30), ('nyc', 29), ('times', 29), ('square', 22), ('location', 20), ('express', 20), ('stay', 17), ('great', 17), ('ritz', 15), ('holiday', 14), ('room', 13), ('good', 13), ('view', 12), ('staff', 12), ('park', 12), ('DATE', 11), ('service', 10), ('time', 10), ('wonderful', 9), ('inn', 9), ('subway', 8), ('GPE', 8), ('carlton', 8), ('breakfast', 8), ('floor', 8)]
New Orleans 
 [('hotel', 237), ('french', 233), ('quarter', 203), ('great', 172), ('stay', 156), ('room', 148), ('location', 121), ('bourbon', 116), ('GPE', 109), ('street', 108), ('staff', 102), ('DATE', 93), ('nice', 92), ('courtyard', 92), ('time', 91), ('st', 85), ('good', 82), ('TIME', 79), ('breakfast', 75), ('walk', 75), ('market', 71), ('nola', 64), ('friendly', 64), ('place', 64), ('love', 61)]


For more detail we can examine the most prominent terms in each location for positive and negative reviews separately. It is clear how certain landmarks impress the guests if they are easy to reach from the hotel (perhaps the hotel staff can recommend a visit if it makes the experience better for the customers?)...

In [28]:
# Look at what the negative comments are in the two places 
subset = hotels_df[(hotels_df['city'] == 'New York') & (hotels_df['normed_rating'] <= 0.4)]['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
print("New York \n", get_top_n_words(subset))

subset = hotels_df[(hotels_df['city'] == 'New Orleans') & (hotels_df['normed_rating'] <= 0.4)]['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
get_top_n_words(subset)
print("New Orleans \n", get_top_n_words(subset))

New York 
 [('room', 2), ('desk', 2), ('terrible', 2), ('subway', 2), ('LANGUAGE', 1), ('one', 1), ('cleaning', 1), ('guy', 1), ('maintenance', 1), ('staff', 1), ('poor', 1), ('speak', 1), ('corner', 1), ('complaint', 1), ('shop', 1), ('clog', 1), ('expect', 1), ('tone', 1), ('preferred', 1), ('good', 1), ('earth', 1), ('blood', 1), ('drip', 1), ('definite', 1), ('sadly', 1)]
New Orleans 
 [('hotel', 18), ('no', 15), ('room', 15), ('TIME', 13), ('DATE', 12), ('sorry', 9), ('experience', 9), ('desk', 9), ('car', 8), ('tell', 7), ('guest', 7), ('service', 7), ('quarter', 7), ('not', 7), ('loud', 7), ('parking', 7), ('bourbon', 7), ('callaisgeneral', 6), ('casey', 6), ('expectation', 6), ('valet', 6), ('charge', 6), ('french', 6), ('stay', 6), ('speak', 6)]


On the other hand we can see that unlucky travelers can end up in poorly mainained and not properly cleaned accomodation (clogged toilets even!), while in New Orleans a lot of pain is caused for lack of available parking, bourbon related troubles (?) and extra charges.

Similarly we can check how travellers' expectations may differ depending on where they are from (see below). Such user-focused analyses can help better target different audiences with different experiences or promotions to make their stay more memorable!

New Yorkers seem to be travelling a lot for business and therefore look for a good location, a comfortable and clean room, friendly and informative/attentive staff as well as access to good meals. People from San Jose are more interested in the nearby sightseeing and leisure travel and are harder to impress, but also value a central location and polite staff.

In [29]:
# Look at what the expectations of people from San Jose compared to people from NY are
subset = hotels_df[hotels_df['reviews.userCity'] == 'New York']['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
print("New York \n", get_top_n_words(subset))
print("Ratings' stats by New Yorkers \n", hotels_df[hotels_df['reviews.userCity'] == 'New York']['normed_rating'].describe())

subset = hotels_df[hotels_df['reviews.userCity'] == 'San Jose']['top_terms'].apply(lambda x: " ".join([y[0] for y in x]))
get_top_n_words(subset)
print("San Jose \n", get_top_n_words(subset))
print("Ratings' stats by people from San Jose \n", hotels_df[hotels_df['reviews.userCity'] == 'San Jose']['normed_rating'].describe())

New York 
 [('room', 6), ('great', 5), ('hotel', 5), ('friendly', 3), ('breakfast', 3), ('TIME', 3), ('staff', 3), ('GPE', 3), ('information', 2), ('DATE', 2), ('bit', 2), ('beautiful', 2), ('wonderful', 2), ('downtown', 2), ('restaurant', 2), ('event', 2), ('attend', 2), ('conference', 2), ('business', 2), ('easy', 2), ('location', 2), ('stay', 2), ('helpful', 2), ('estancia', 2), ('lovely', 2)]
Ratings' stats by New Yorkers 
 count    9.000000
mean     0.722222
std      0.440959
min      0.000000
25%      0.500000
50%      1.000000
75%      1.000000
max      1.000000
Name: normed_rating, dtype: float64
San Jose 
 [('hotel', 13), ('stay', 13), ('location', 12), ('good', 9), ('room', 8), ('GPE', 7), ('nice', 6), ('friendly', 6), ('great', 6), ('TIME', 6), ('western', 5), ('street', 5), ('staff', 5), ('DATE', 5), ('suite', 4), ('distance', 4), ('value', 4), ('place', 4), ('walk', 4), ('near', 3), ('team', 3), ('little', 3), ('appear', 3), ('area', 3), ('old', 3)]
Ratings' stats by peopl

Now let's see what the nouns that stand out in the text of good and bad reviews are.

In [0]:
hotels_df['text_nouns'] = hotels_df['pos_text'].map(lambda x: [k for k, v in x.items() if v == 'NOUN'])
hotels_df['text_adjectives'] = hotels_df['pos_text'].map(lambda x: [k for k, v in x.items() if v == 'ADJ'])
hotels_df['text_verbs'] = hotels_df['pos_text'].map(lambda x: [k for k, v in x.items() if v == 'VERB'])

In [0]:
verb_tfidf_model = TfidfVectorizer(ngram_range=(1, 1), max_df=.9, min_df=10,  lowercase=False) # specify parameters here
verb_tfidf_matrix = verb_tfidf_model.fit_transform(hotels_df['text_verbs'].apply(lambda x: " ".join(x)))
noun_tfidf_model = TfidfVectorizer(ngram_range=(1, 1), max_df=.9, min_df=10,  lowercase=False) # specify parameters here
noun_tfidf_matrix = noun_tfidf_model.fit_transform(hotels_df['text_nouns'].apply(lambda x: " ".join(x)))
adj_tfidf_model = TfidfVectorizer(ngram_range=(1, 1), max_df=.9, min_df=10,  lowercase=False) # specify parameters here
adj_tfidf_matrix = adj_tfidf_model.fit_transform(hotels_df['text_adjectives'].apply(lambda x: " ".join(x)))


In [32]:
hotels_df['top_nouns'] = hotels_df['text_nouns'].map(get_top_n_terms)
# Get most frequent significant nouns overall
get_top_n_words(hotels_df['top_nouns'].apply(lambda x: " ".join([y[0] for y in x])))

[('hotel', 4622),
 ('room', 3876),
 ('staff', 3601),
 ('rooms', 2539),
 ('location', 2345),
 ('time', 2228),
 ('breakfast', 2168),
 ('stay', 2060),
 ('service', 1559),
 ('area', 1458),
 ('place', 1334),
 ('experience', 1291),
 ('review', 1235),
 ('GPE', 1216),
 ('desk', 1209),
 ('pool', 1037),
 ('feedback', 1030),
 ('bed', 915),
 ('parking', 857),
 ('hotels', 820),
 ('price', 783),
 ('family', 761),
 ('trip', 751),
 ('food', 718),
 ('property', 716)]

In [33]:
hotels_df['top_nouns'] = hotels_df['text_nouns'].map(get_top_n_terms)
# Get most frequent significant nouns overall
get_top_n_words(hotels_df['top_nouns'].apply(lambda x: " ".join([y[0] for y in x])))

[('hotel', 4622),
 ('room', 3876),
 ('staff', 3601),
 ('rooms', 2539),
 ('location', 2345),
 ('time', 2228),
 ('breakfast', 2168),
 ('stay', 2060),
 ('service', 1559),
 ('area', 1458),
 ('place', 1334),
 ('experience', 1291),
 ('review', 1235),
 ('GPE', 1216),
 ('desk', 1209),
 ('pool', 1037),
 ('feedback', 1030),
 ('bed', 915),
 ('parking', 857),
 ('hotels', 820),
 ('price', 783),
 ('family', 761),
 ('trip', 751),
 ('food', 718),
 ('property', 716)]

In [34]:
hotels_df['top_adjectives'] = hotels_df['text_adjectives'].map(get_top_n_terms)
# Get most frequent significant nouns overall
get_top_n_words(hotels_df['top_adjectives'].apply(lambda x: " ".join([y[0] for y in x])))

[('clean', 3277),
 ('great', 3252),
 ('nice', 2497),
 ('friendly', 2294),
 ('good', 2243),
 ('comfortable', 1949),
 ('helpful', 1659),
 ('free', 1068),
 ('small', 1067),
 ('GPE', 1051),
 ('glad', 976),
 ('happy', 946),
 ('wonderful', 906),
 ('little', 904),
 ('best', 904),
 ('quiet', 841),
 ('excellent', 834),
 ('close', 803),
 ('able', 799),
 ('sure', 718),
 ('recent', 697),
 ('perfect', 687),
 ('hot', 680),
 ('easy', 664),
 ('large', 636)]

In [35]:
hotels_df['top_verbs'] = hotels_df['text_verbs'].map(get_top_n_terms)
# Get most frequent significant nouns overall
get_top_n_words(hotels_df['top_verbs'].apply(lambda x: " ".join([y[0] for y in x])))

[('would', 2473),
 ('will', 2239),
 ('stay', 2188),
 ('enjoyed', 1596),
 ('can', 1567),
 ('hope', 1514),
 ('see', 1299),
 ('hear', 1227),
 ('staying', 1195),
 ('look', 1077),
 ('make', 971),
 ('appreciate', 964),
 ('recommend', 916),
 ('walking', 895),
 ('go', 871),
 ('know', 827),
 ('got', 740),
 ('located', 683),
 ('come', 647),
 ('want', 643),
 ('take', 624),
 ('feel', 593),
 ('find', 590),
 ('need', 558),
 ('use', 551)]

So let's see what features make up for a good stay, what the factors that make up these features are and what experiences people enjoy during their stay.

In [36]:
# Significant nouns in bad reviews (where 0 <= rating <= 0.4)
subset = hotels_df[hotels_df['normed_rating'] <= 0.4][['top_nouns', 'top_adjectives', 'top_verbs']]
print("\n Poor reviews are due to...")
print(get_top_n_words(subset['top_nouns'].apply(lambda x: " ".join([y[0] for y in x]))))
print("\n Their characteristics are some of the following...")
print(get_top_n_words(subset['top_adjectives'].apply(lambda x: " ".join([y[0] for y in x]))))
print("\n Users worse experiences/activities were...")
print(get_top_n_words(subset['top_verbs'].apply(lambda x: " ".join([y[0] for y in x]))))

subset = hotels_df[hotels_df['normed_rating'] >= 0.6][['top_nouns', 'top_adjectives', 'top_verbs']]
print("\n Good reviews are thanks to...")
print(get_top_n_words(subset['top_nouns'].apply(lambda x: " ".join([y[0] for y in x]))))
print("\n Their characteristics are some of the following...")
print(get_top_n_words(subset['top_adjectives'].apply(lambda x: " ".join([y[0] for y in x]))))
print("\n Users best experiences/activities were...")
print(get_top_n_words(subset['top_verbs'].apply(lambda x: " ".join([y[0] for y in x]))))


 Poor reviews are due to...
[('room', 583), ('hotel', 513), ('rooms', 325), ('staff', 253), ('time', 234), ('desk', 220), ('experience', 203), ('stay', 201), ('place', 186), ('service', 182), ('feedback', 169), ('breakfast', 166), ('location', 162), ('door', 143), ('hotels', 138), ('bathroom', 134), ('bed', 132), ('area', 126), ('floor', 118), ('review', 110), ('people', 108), ('times', 104), ('guest', 104), ('property', 102), ('shower', 101)]

 Their characteristics are some of the following...
[('good', 244), ('clean', 221), ('nice', 216), ('sorry', 197), ('great', 189), ('bad', 148), ('small', 144), ('old', 122), ('friendly', 122), ('better', 120), ('dirty', 118), ('free', 117), ('recent', 116), ('sure', 113), ('comfortable', 111), ('hot', 102), ('GPE', 93), ('available', 93), ('able', 91), ('new', 87), ('entire', 86), ('best', 84), ('helpful', 78), ('little', 74), ('happy', 71)]

 Users worse experiences/activities were...
[('would', 481), ('will', 382), ('can', 309), ('stay', 272

So according to the above, negative drivers can be the staff, small or dirty rooms, room temperature etc It seems like it is more the lack of positive factors rather than specific negative factors (other than very dirty rooms or very rude staff) that leads to such reviews in the end.

On the opposite side, positive reviews have to do with nice/clean/spatious rooms, good location, polite staff, good breakfast, facilities such as swimming pools and available parking, access to a nice meal, family friendly settings, comfortable beds and a fair price. 

It is interesting how people that enjoy their experiences are keen to give details about what they liked so much and are also keen to share their find with others, thus promoting the hotel.


# How to automate this analysis & next steps

### Presenting our results to non-technical stakeholders at a regular basis and in an automated way :)
For presenting to a non-technical stakeholder, it is easy to take this report, only keep the output results, charts and descriptive text and hide the code snippets. The notebook can thus be exported as a pdf or html page (the latter allows for interraction with dynamic graphs) and emailed to the stakeholders whenever a new report is generated (an aws lamda function or data pipeline object could be setup to trigger this process relatively simply).

Unit tests for the different functions defined in our analysis would need to be implemented as well as checking the data every time it is loaded to make sure the format is what is expected and the distributions look healthy for regular deployment of the notebook. This could be done by separating our function definitions into a utils.py file and leading the data through a defined class that performs all the checks upon initialisation. 

**Repeating the analysis on new data may need tweeking of our methods...**
The analysis can be easily repeated by setting up the notebook to rerun whenever the data is overwritten in a particular location (at the same time we can version the dataset so that we do not miss out on the old data). However the clusters would change every time, so the commentary would become irrelevant. Therefore, we would have to make sure all our outputs and comments get updated in a way beffiting of what is happening in the data and in our results (so basically change our output text comments based on the results each time in a rule based fashion).

At the same time data clusters may not be the same and may shift in time. For a more robust separation we could use the main categories of factors that make up for a pleasant or unlpleasant stay and use them as classification labels. We could use the review text and important factors in order to train a multi-class classifier following the regular cross-validation and hyperparameter tuning ml workflow. Then whenever we have new data, instead of performing topic clustering we could use our pre-trained model, which will be more consistent in its expected behaviour. Looking for salient terms may still be useful in uncovering new trends and factors that are not at present accounted for in our classifier. LDA can then be run taking the labels into account as well in order to plot the nice visual representations.

In this sense we can keep parts of this exploratory analysis, however much of it could be tweaked to make it easier for users to interact with. For instance comparing hotels in different locations could be implemented in a way that allows the user to select the location from a list before viewing the factors that most salient for their selection.

Finally, after discussion with the stakeholders looking at specific slices of the data, such as only 5-star accomodations and what the negative and positive comments against those are may be of use. These additional queries can be easily added to the analysis. 

### What if my data becomes too much?
The report can be migrated to spark code and run on an EMR cluster on a schedule. This will allow to use a cluster of machines whose size can be set to dynamically change based on the amount of data that needs processing each time.


### What more could be done to make our analysis better?

We have mentioned answering additional questions such as what the reviews look like for particular hotel brands or levels (stars). 

In terms of visualising statistics, plots could accompany the dataframe statistical descriptions of ratings and length of reviews for a more appealing view (and for those that cannot comprehend the stats boards). We could also use bokeh to visualise all our different hotels on the US map and show their average ratings with a color coding scheme (and perhaps more metadata about the hotels as bokeh allows for nice interactive visualisations that show information only when a user hovers over a point of interest).

During the keyword extraction step we could have tested YAKE as a competitor to tf-idf, as it benefits from no need for training (purely based on computing statistics from the text) and thus is much faster to yield results.  

Finally apart from LDA topic clustering we could have tried different ways of clustering the data to achieve a much better separation w.r.t. what we are seeking. For this we could have converted the preprocessed text into embeddings of different types (tf-idf, YAKE, word2vec and even BERT embeddings) and then used different clustering algorithms such as K-means, hierarchical clustering, GMMs) to check what methods result in the best separation.

### What did not work
Initially an assumption was made that punctuation would be useful and should not be removed during preprocessing as it would capture enthousiasm and other sentiment states withing the reviews.

This was not the case as it ended up adding more noise than isnight and was later abandoned. 

Although LDA has a nice way of visualising topic clusters and their driving factors, it is really not a favorite as it does not scale nicely and the results are not always detailed enough. I would therefore have probably swapped the approach with one of the options detailed above had they been shown to yield better results.