# Sentiment Analysis for Hotel Reviews

## Overview
This project is analyzing sentiment in hotel reviews using **Mutual Information** and **Vader Sentiment Analyzer**. We crawl hotel review data from Trip Advisor website by using a python script which can be downloaded at [here](https://github.com/aesuli/trip-advisor-crawler).

We use **Vader sentiment analyzer, Mutual Information and Point-wise Mutual Information** to discover the relation between those metrics and ground-truth rating scores.

## What we are going to do

1. Crawl hotel review data from Trip Advisor by using a python script ([click](https://github.com/aesuli/trip-advisor-crawler) to see how to use the script)

2. Read the data (.csv format) into **pandas dataframe**

3. Use **Vader sentiment analyzer****bag-of-words model** (unigram)

4. Calculate **word frequency**, **mutual information** and **pointwise mutual information** for the unigrams to see how they relate with the review scores

5. **Visualize** the distribution of the ground-truth scores and Vader scores
6. Discuss

### Install and import the packages

In [1]:
#Download required ntlk packages and lib
import nltk
nltk.download("vader_lexicon")
nltk.download("stopwords")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/Joonil/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Joonil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from collections import Counter
import re
import math
import html
import sklearn
import sklearn.metrics as metrics
from sklearn.metrics import mutual_info_score
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline



### Test the Vader analyzer

In [3]:
#Sentences to try with vader
sentences = ["VADER is smart, handsome, and funny.",
             "Data Scientists are sexy!",
             "The room was dirty and small",
             "They had excellent facilities!",
             "This hotel is the worst hotel in the city"]

In [4]:
#Instantiate an instance to access SentimentIntensityAnalyzer class
sid = SentimentIntensityAnalyzer()

In [5]:
#Vader output
for sentence in sentences:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
         print('{0}: {1}, '.format(k, ss[k]), end='')
    print('\n')

VADER is smart, handsome, and funny.
compound: 0.8316, neg: 0.0, neu: 0.254, pos: 0.746, 

Data Scientists are sexy!
compound: 0.5707, neg: 0.0, neu: 0.448, pos: 0.552, 

The room was dirty and small
compound: -0.4404, neg: 0.367, neu: 0.633, pos: 0.0, 

They had excellent facilities!
compound: 0.6114, neg: 0.0, neu: 0.429, pos: 0.571, 

This hotel is the worst hotel in the city
compound: -0.6249, neg: 0.339, neu: 0.661, pos: 0.0, 



It can be seen from the previous examples that **Vader sentiment analyzer** does good job in identifying polarity of the sentences and give us **compound** score which is in range of [-1, 1]. As compound is close to +1.0, the sentence has positive words and mood.

## Let's read the .csv data we scrapped from Trip Advisor

In [6]:
# Read in from pandas
hotelDf = pd.read_csv('barrie.csv')
hotelDf.columns=['id','filePath','hotelName','review','ratingScore','groundTruth']

In [7]:
hotelDf.head()

Unnamed: 0,id,filePath,hotelName,review,ratingScore,groundTruth
0,99145909,barrie/ca/154980/1238832/112386669.html,Hampton Inn &amp; Suites by Hilton Barrie,"This is what hotels should be like. Clean, lar...",4,positive
1,97597588,barrie/ca/154980/1238832/112386669.html,Hampton Inn &amp; Suites by Hilton Barrie,"The room was beautiful and spacious, but the f...",3,negative
2,96239840,barrie/ca/154980/1238832/112386669.html,Hampton Inn &amp; Suites by Hilton Barrie,I have been staying here off and on most weeks...,5,positive
3,96025643,barrie/ca/154980/1238832/112386669.html,Hampton Inn &amp; Suites by Hilton Barrie,"Dear M, I just wanted to write a note of thank...",5,positive
4,95527737,barrie/ca/154980/1238832/112386669.html,Hampton Inn &amp; Suites by Hilton Barrie,Stayed here for two nights in Feb 2011. Great ...,5,positive


In [8]:
# There are unparsed html tags in the hotelnames. We can changed the html tags to ascii equivalents by using the following code.
for i in range(len(hotelDf)):
    hotelname = hotelDf.at[i, 'hotelName']
    hotelname = hotelname.encode("utf-8")
    hotelname = hotelname.decode("ascii", "ignore")
    hotelname = html.unescape(hotelname)
    hotelDf.at[i, 'hotelName'] = hotelname

One thing we can notice here is the groundtruth is categorical variable ('positive', 'negative'). If the rating score is 4 or 5, we say the ground truth is positive, otherwise we say it is negative.

In [9]:
# Instantiate the sentiment Analyzer
sid = SentimentIntensityAnalyzer()

In [10]:
# We will ignore all stopwords in the review column
stop = set(stopwords.words('english'))

#Add possible Stop Words for Hotel Reviews
stop.add('hotel')
stop.add('room')
stop.add('rooms')
stop.add('stay')
stop.add('staff')
stop.add('ontario')
stop.add('hampton')

In [11]:
#Count the frequency of words
counter = Counter()
for review in hotelDf['review']:
        counter.update([word.lower() for word in re.findall(r'\w+', review) if word.lower() not in stop and len(word) > 2])

In [12]:
#Top k word counted by frequency
k = 500
topk = counter.most_common(k)

In [13]:
vaderScores = []
#Assign Vader score to individual review using Vader compound score
for rownum, review in enumerate(hotelDf['review']):
    scores = sid.polarity_scores(review)
    vaderScores.append(scores['compound'])
    if (rownum % 1000 == 0):
            print("processed %d reviews" % (rownum+1))
print("completed")

processed 1 reviews
processed 1001 reviews
processed 2001 reviews
processed 3001 reviews
processed 4001 reviews
completed


In [14]:
# Assign vader scores in the original df
hotelDf = hotelDf.assign(vaderScore = vaderScores)
hotelDf.head()

Unnamed: 0,id,filePath,hotelName,review,ratingScore,groundTruth,vaderScore
0,99145909,barrie/ca/154980/1238832/112386669.html,Hampton Inn & Suites by Hilton Barrie,"This is what hotels should be like. Clean, lar...",4,positive,0.9534
1,97597588,barrie/ca/154980/1238832/112386669.html,Hampton Inn & Suites by Hilton Barrie,"The room was beautiful and spacious, but the f...",3,negative,0.9522
2,96239840,barrie/ca/154980/1238832/112386669.html,Hampton Inn & Suites by Hilton Barrie,I have been staying here off and on most weeks...,5,positive,0.9516
3,96025643,barrie/ca/154980/1238832/112386669.html,Hampton Inn & Suites by Hilton Barrie,"Dear M, I just wanted to write a note of thank...",5,positive,0.9932
4,95527737,barrie/ca/154980/1238832/112386669.html,Hampton Inn & Suites by Hilton Barrie,Stayed here for two nights in Feb 2011. Great ...,5,positive,0.9543


In [15]:
#Find out if a particular review has the word from topk list
freqReview = []
for i in range(len(hotelDf)):
    tempCounter = Counter([word for word in re.findall(r'\w+', hotelDf['review'][i])])
    topkinReview = [1 if tempCounter[word] > 0 else 0 for (word, wordCount) in topk]
    freqReview.append(topkinReview)
    
#Prepare freqReviewDf
freqReviewDf = pd.DataFrame(freqReview)
dfName = []
for c in topk:
    dfName.append(c[0])
freqReviewDf.columns = dfName

### This is how the bag-of-words model looks like
We used binary feature here.


1: the word exists in the review


0: the word does not exist in the review

In [16]:
freqReviewDf.head()

Unnamed: 0,great,clean,would,breakfast,good,one,pool,nice,stayed,resort,...,options,slow,hwy,adults,certainly,terrible,unfortunately,carpet,trails,standard
0,0,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,1,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0


We extract *hotel name*, *rating score*, *ground truth*, *vader score* and *top-k words* from the hotelDf to start the analysis.

In [17]:
finaldf = hotelDf[['hotelName','ratingScore','groundTruth', 'vaderScore']].join(freqReviewDf)
finaldf.head()

Unnamed: 0,hotelName,ratingScore,groundTruth,vaderScore,great,clean,would,breakfast,good,one,...,options,slow,hwy,adults,certainly,terrible,unfortunately,carpet,trails,standard
0,Hampton Inn & Suites by Hilton Barrie,4,positive,0.9534,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Hampton Inn & Suites by Hilton Barrie,3,negative,0.9522,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hampton Inn & Suites by Hilton Barrie,5,positive,0.9516,1,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,Hampton Inn & Suites by Hilton Barrie,5,positive,0.9932,1,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Hampton Inn & Suites by Hilton Barrie,5,positive,0.9543,1,1,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0


## Can we learn something about the true ratings from Vader scores?
In other words, can we build infer true review ratings by using vader score? and why even would we be intereted in that?

Usually, we have many other options for the regression problem. *Linear Regression*, *Decision Tree* and etc.

But using Vader analyzer is **simple and fast**. We can use it as an initial tool before building a heavy machine learning model to figure out the trend in the data.

#### Top 5 reviews hotel by *ratings*

In [18]:
ratingByHotel = finaldf.groupby(['hotelName']).mean()['ratingScore'].reset_index()
vaderByHotel = finaldf.groupby(['hotelName']).mean()['vaderScore'].reset_index()

In [19]:
ratingByHotel = ratingByHotel.sort_values('ratingScore', ascending=False)
ratingByHotel.head(10)

Unnamed: 0,hotelName,ratingScore
7,Fairfield Inn & Suites Barrie,4.589286
9,Hampton Inn & Suites by Hilton Barrie,4.430233
4,Carriage Ridge Resort,4.169753
10,Holiday Inn Barrie Hotel & Conference Centre,4.104839
17,Super 8 Barrie,4.095156
0,BEST WESTERN Royal Oak Inn,3.909348
11,Holiday Inn Express Barrie,3.906667
3,Carriage Hills Resort,3.900302
8,Four Points By Sheraton Barrie,3.853659
14,Monte Carlo Inn - Barrie Suites,3.710462


#### Top 5 reviews hotel by *vader score*

In [20]:
vaderByHotel = vaderByHotel.sort_values('vaderScore', ascending=False)
vaderByHotel.head(10)

Unnamed: 0,hotelName,vaderScore
7,Fairfield Inn & Suites Barrie,0.884062
9,Hampton Inn & Suites by Hilton Barrie,0.826441
8,Four Points By Sheraton Barrie,0.822049
10,Holiday Inn Barrie Hotel & Conference Centre,0.793993
4,Carriage Ridge Resort,0.781669
11,Holiday Inn Express Barrie,0.768908
3,Carriage Hills Resort,0.737405
0,BEST WESTERN Royal Oak Inn,0.735878
17,Super 8 Barrie,0.722113
6,Comfort Inn - Barrie / Hart Dr.,0.715257


### The two lists are very similar!
Vader scores are highly correlated with the true rating scores.

## Which words were most sentiment-bearing in the reviews?
#### To explore this question, we will calculate the following 3 factors.
1. Word Frequency
2. Mutual Information
3. Pointwise Mutual Information (PMI)

### 1. Word Frequency
Can word frequency in review data tell us how the customers felt about the hotels?


Let's get the most frequently observed words from the *positive reviews* and *negative reviews*, respectively, to see the difference, if any.

In [21]:
#Add possible Stop Words for Hotel Reviews
stop.add('hotel')
stop.add('room')
stop.add('rooms')
stop.add('stay')
stop.add('staff')
stop.add('ontario')
stop.add('hampton')

In [22]:
#To find out the most frequent word in review when the ground truth is positive
counter = Counter()
for review in hotelDf.loc[hotelDf['groundTruth']=='positive']['review']:
        counter.update([word.lower() for word in re.findall(r'\w+', review) if word.lower() not in stop and len(word) > 2])

counter = Counter()   
#To find out the most frequent word in review when the ground truth is negative
for review in hotelDf.loc[hotelDf['groundTruth']=='negative']['review']:
        counter.update([word.lower() for word in re.findall(r'\w+', review) if word.lower() not in stop and len(word) > 2])

#### Top 10 Words with High Frequency in *Positive* and *Negative* reviews.

In [23]:
from pprint import pprint
k = 10
topkPos = counter.most_common(k)
topkNeg = counter.most_common(k)
print("The most frequently occured top 10 words in positive reviews")
pprint(topkPos)
print("\nThe most frequently occured top 10 words in negative reviews")
pprint(topkNeg)

The most frequently occured top 10 words in positive reviews
[('would', 1122),
 ('one', 876),
 ('night', 744),
 ('breakfast', 701),
 ('front', 694),
 ('get', 687),
 ('desk', 678),
 ('good', 664),
 ('pool', 638),
 ('time', 600)]

The most frequently occured top 10 words in negative reviews
[('would', 1122),
 ('one', 876),
 ('night', 744),
 ('breakfast', 701),
 ('front', 694),
 ('get', 687),
 ('desk', 678),
 ('good', 664),
 ('pool', 638),
 ('time', 600)]


### Uh.. almost same?
It seems like the term frequency doesn't tell us anything about the text. We can observe that there is no difference between the top-k word list for both positive reviews and negative reviews.

If we think about it, this result seems obvious. If a customer was really satistied with breakfast, they would mention the word, 'breakfast', in their review. Even if a customer didn't like their breakfast, they also would mention the word, 'breakfast', in their review (with some bad words).

### 2. Mutual Information

**Mutual information tells you how much you learn about X from knowing the value of Y (on average over the choice of Y).** 


Since we found the word frequency is not a good indicator for the sentiment analysis, we will examine *mutual information*  for an alternative metric.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html

In [24]:
# positive = 1 / negative = 0
gtScore = []
for i in range(len(finaldf)):
    if finaldf['groundTruth'][i] == 'positive':
        gtScore.append(1)
    else:
        gtScore.append(0)

In [25]:
#Calculate muual information score using scikit-learn package
wordList = [word[0] for word in topk]
miScoreDf = pd.DataFrame(data = {'word': wordList,
             'MI Score': [mutual_info_score(gtScore, finaldf[word].as_matrix()) for word in wordList]})
miScoredf.head(10)

NameError: name 'miScoredf' is not defined

### What does it mean?
If we observe words having high Mutual Information scores in a review, we would learn a lot about the sentiment of review, (positive or negative).

###  3. Pointwise Mutual Information

Similar to MI, PMI is measuring for sigle event where MI is the average of all possible event.

The events P(x,y) = P(0,1) means the event of the review is negative but the specific word is existing in that review

#### Let's see how Pointwise Mutual Information calculated. The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence.
#### To study more about Pointwise Mutual Information, see [Wikipedia](https://en.wikipedia.org/wiki/Pointwise_mutual_information) 

![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/ff54cfce726857db855d4dd0a9dee2c6a5e7be99)

In [None]:
# Define functions to calculate pmi
def getpmiDf(df, x):
    pmilist=[]
    
    for i in ['positive','negative']:
        for j in [0,1]:
            px = sum(finaldf['groundTruth'] == i) / len(df)
            py = sum(finaldf[x] == j) / len(df)
            pxy = len(finaldf[(finaldf['groundTruth']==i) & (finaldf[x]==j)])/len(df)
            
            if pxy == 0:  # value
                pmi = math.log10((pxy+0.0001) / (px*py))
                
            else:
                pmi = math.log10(pxy / (px * py))
                
            pmilist.append([i] + [j] + [px] + [py] + [pxy] + [pmi])
            
    pmidf = pd.DataFrame(pmilist)
    pmidf.columns = ['x','y','px','py','pxy','pmi']
    return pmidf

def getPMI(df, x, gt):
    '''
    - Input
      df: pandas dataframe
      x: hotel name. a string
      gt: target label. 1 or 0 integer values 
    
    - Output
      pointwise mutual information. float
    '''
    px = sum(finaldf['groundTruth'] == gt) / len(df)
    py = sum(finaldf[x] == 1) / len(df)
    pxy = len(finaldf[(finaldf['groundTruth'] == gt) & (finaldf[x] == 1)]) / len(df)
    
    if pxy == 0:
        pmi = math.log10((pxy + 0.0001) / (px * py))
    else:
        pmi = math.log10(pxy / (px * py))
        
    return pmi

In [None]:
getpmiDf(finaldf, 'dirty')

From the table above, we can see the word, 'dirty', has negatively correlated with 'positive' label because the pmi value for 'positive' label is the smallest value (-0.69) among the four values.

In [None]:
pmiPosDf = pd.DataFrame(data = {'word': wordList,
                             'pmiPositive': [getPMI(finaldf, word, 'positive') for word in wordList]})
pmiNegDf = pd.DataFrame(data = {'word': wordList,
                             'pmiNegative': [getPMI(finaldf, word, 'negative') for word in wordList]})

In [None]:
#Sorted top pmi words for positive reviews
pmiPosDf = pmiPosDf.sort_values('pmiPositive',ascending=0)
pmiPosDf.head(10)

In [None]:
#Sorted top pmi words for negative reviews
pmiNegDf = pmiNegDf.sort_values('pmiNegative',ascending=0)
pmiNegDf.head(10)

In [None]:
pmiPosDf.head(10).plot.bar(x='word', rot=40, color='b', title='Top 10 words in Positive Reviews based on PMI scores')
pmiNegDf.head(10).plot.bar(x='word', rot=40, color='r', title='Top 10 words in Negative Reviews based on PMI scores')
plt.show()

### Pointwise Mutual Information seems like a good metric to summarize the reviews in n-gram tokens!

# Visualization
Sometimes, we can learn a lot about the data by visualizing.

## Historgrams

In [None]:
plt.xlabel('Rating Score')
finaldf['ratingScore'].plot(kind='hist', title='Histogram - Rating Scores',
                            bins=np.arange(1,7)-0.5)
plt.show()

In [None]:
plt.xlabel('Vader Sentiment Score')
finaldf['vaderScore'].plot(kind='hist', title='Histogram - Vader Scores', 
                           xticks=[-1.0, -0.5, 0.0, 0.5, 1.0])
plt.show()

In [None]:
#Overlayed Histogram for GT rating and VD score
#Just for demonstrating, I am dividing the rating score by 5
x = [finaldf['ratingScore'].as_matrix() / 5]
y = [(finaldf['vaderScore'].as_matrix() + 1 )/ 2]
bins = np.linspace(0, 1, 100)
plt.hist(x, bins, label='Rescaled True Ratings')
plt.hist(y, bins, label='Rescaled Vader Scores')
plt.legend(loc='upper left')
plt.show()

## Boxplots

In [None]:
#Plot top 5 side-by-side boxplot for top 5 ground truth rated hotel
tp5gthotel = ratingByHotel.sort_values('ratingScore', ascending=False).head(5).hotelName.as_matrix()

tempdf = finaldf[(finaldf.hotelName == tp5gthotel[0]) | (finaldf.hotelName == tp5gthotel[1]) | 
         (finaldf.hotelName == tp5gthotel[2]) | (finaldf.hotelName == tp5gthotel[3]) | 
         (finaldf.hotelName == tp5gthotel[4])]

In [None]:
g = sns.factorplot(kind='box',        # Boxplot
               y='ratingScore',       # Y-axis - values for boxplot
               x='hotelName',        # X-axis - first factor
               data=tempdf,        # Dataframe 
               size=6,            # Figure size (x100px)      
               aspect=1.5,        # Width = size * aspect 
               legend_out=False)  # Make legend inside the plot

for ax in g.axes.flat:
    labels = ax.get_xticklabels() # get x labels
    ax.set_xticklabels(labels, rotation=30) # set new labels
    
plt.show()

In [None]:
g = sns.factorplot(kind='box',        # Boxplot
               y='vaderScore',       # Y-axis - values for boxplot
               x='hotelName',        # X-axis - first factor
               data=tempdf,        # Dataframe 
               size=6,            # Figure size (x100px)      
               aspect=1.5,        # Width = size * aspect 
               legend_out=False)  # Make legend inside the plot

for ax in g.axes.flat:
    labels = ax.get_xticklabels() # get x labels
    ax.set_xticklabels(labels, rotation=30) # set new labels
    
plt.show()

## Scatterplots

In [None]:
y = finaldf['ratingScore'].as_matrix()
x = finaldf['vaderScore'].as_matrix()
plt.title('Vader score vs. True Ratings')
plt.xlabel('Vader Scores')
plt.ylabel('True Ratings')
plt.xticks([-1, -0.5, 0, 0.5, 1])
plt.yticks([1,2,3,4,5])
plt.plot(x, y, "o", ms=3, color='b')
plt.show()