# Hotel Review Analysis

This notebook contains the code used to analyze hotel reviews. The goal is to obtain a *directed sentiment score* towards specific hotel *aspects*, such as the location, room quality, or hospitality.

## My Approach
1. Obtain reviews for hotels
2. Use the Stanford CoreNLP library to obtain sentiment scores for each sentence in a review
3. Treating a sentence as a binary tree, obtain sentiment scores for each phrase (subtree) in a sentence
4. Classify the *aspect* of each phrase
5. Aggregate results for all *aspects*, for all hotels

This approach lets me obtain directed sentiment scores for each hotel. It also captures sentiment towards several aspects separately, even if they appear in the same sentence or have opposite sentiment.

For example, in the sentence:
```
The food was great, but the staff was very rude.
```
The phrase "the food was great" has a positive sentiment towards the **food** aspect, while "staff was very rude" has a negative sentiment towards the **staff** aspect.

According to CoreNLP, the overall sentiment of the entire sentence is negative. However, examining subphrases allows us to accurately measure sentiment towards each aspect individually. 
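
A toy sketch of the idea, assuming the sentence has already been split into phrases and each phrase already carries a sentiment label. The `PHRASES` list, its sentiment values, and the `ASPECT_KEYWORDS` table are made up for illustration; in the notebook, the phrases and scores come from CoreNLP and the aspects from the regex classifier.

```python
import re

# Hypothetical phrase/sentiment pairs for the example sentence.
# The numbers are invented; real values would come from CoreNLP.
PHRASES = [
    ("The food was great", 1),
    ("the staff was very rude", -1),
]

# Illustrative keyword table; "food" is not one of the notebook's aspects.
ASPECT_KEYWORDS = {"food": r"\bfood\b", "staff": r"\bstaff\b"}

def aspect_of(text):
    """Return the single aspect mentioned in text, or None if zero or several match."""
    hits = [name for name, pat in ASPECT_KEYWORDS.items()
            if re.search(pat, text, re.IGNORECASE)]
    return hits[0] if len(hits) == 1 else None

# Each phrase contributes its sentiment to exactly one aspect.
directed = {aspect_of(text): score for text, score in PHRASES}
print(directed)
```

The full sentence mentions both aspects, so `aspect_of` returns `None` for it; only the subphrases yield a usable, single-aspect score.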


## Future Work (TODO)

1. Investigate weighting by probability
2. Investigate filtering by probability
3. Investigate using neutral sentiment reviews (currently discarded in `Sentiment.java`)
4. Increase number of aspects
5. Increase regex rule coverage for aspects
6. Replace regex classifier with a neural net trained against the regex rules.
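
As a starting point for items 1 and 2, here is a rough sketch of what weighting and filtering by probability could look like. It assumes each phrase is reduced to a `(sentiment, probability)` pair; `MIN_PROB` and the sample values are placeholders, not tuned or measured numbers.

```python
# Hypothetical (sentiment, probability) pairs for one aspect of one hotel.
samples = [(2, 0.9), (-1, 0.4), (1, 0.6)]

MIN_PROB = 0.5  # placeholder threshold, not a tuned value

def weighted_mean(pairs):
    """Average sentiment where each phrase counts in proportion to its probability."""
    total_weight = sum(p for _, p in pairs)
    if total_weight == 0:
        return 0.0
    return sum(s * p for s, p in pairs) / total_weight

def filter_by_probability(pairs, min_prob=MIN_PROB):
    """Keep only phrases whose probability is at least min_prob."""
    return [(s, p) for s, p in pairs if p >= min_prob]

print(weighted_mean(samples))
print(filter_by_probability(samples))
```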

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

%load_ext autoreload
%autoreload 2

%matplotlib inline

## The Dataset


In [2]:
reviews = []
urls = []
BASE_FILENAME = 'reviews-nlp/ReviewSentiment/datasets/reviews7'
with open(BASE_FILENAME + '.jl', 'r') as f:
    for line in f:
        data = json.loads(line)
        reviews.append(data['text'])
        urls.append(data['url'])
# df = pd.read_csv(BASE_FILENAME+'.csv', index_col=None)
# df.rename(columns={'comments' : 'text', 'listing_id' : 'urls'}, inplace=True)
# df.head()

Some reviews are blank (only a title and a star rating), and others are excessively long. To combat this, I filter out blank reviews and those that consist of more than 1500 characters. Experiments with all the datasets showed that 1500 characters is roughly the 95th percentile for review length.

In [3]:
import csv
df = pd.DataFrame()
df['text'] = reviews
df['urls'] = urls
print(len(df))
print(df.text.str.len().quantile(0.95))
df = df[(df.text.str.len() > 0) & (df.text.str.len() < 1500)]
print(len(df))
urls = list(df.urls.values)
reviews = list(df.text.values)
with open(BASE_FILENAME + '.clean.v02.txt', 'w') as f:
    for review in reviews:
        f.write(review + '\n')

18068
1455.0
14094


To easily hold the data, I define a simple data wrapper class for each phrase. 

In [4]:
class Phrase:
    def __init__(self, sentiment, tokens, probability, phrase):
        self.sentiment = sentiment
        self.tokens = tokens
        self.probability = probability
        self.phrase = phrase
        self.cls = None
        self.url = None
        self.docid = -1
        
    def has_class(self):
        return self.cls is not None and self.cls != -1
    
    def __repr__(self):
        return 'Class={} Sentiment={} Probability={} Phrase="{}"'.format(self.cls,
                                                                       self.sentiment,
                                                                       self.probability,
                                                                       self.phrase)

## Classifying

To classify the aspect of a phrase, I used simple regex rules. These will be replaced by a neural network trained against a more comprehensive set of rules, but regex rules are enough to obtain results.

Note that the "classifier" below will only report one class. If multiple classes are detected, it will return -1. This is because we want to isolate sentiment scores towards a single aspect.

In [5]:
import re
CLASS_NAMES = ['location', 'hospitality', 'room_quality']
CLASSES = np.array([i for i, _ in enumerate(CLASS_NAMES)])
def get_class(phrase_obj):
    text = phrase_obj.phrase
    
    #location patterns
    location = any([re.search(r'\blocation\b', text, re.IGNORECASE),
                    re.search(r'\bneighborhood\b', text, re.IGNORECASE)])
    
    #hospitality
    hospitality = any([re.search(r'\bstaff\b', text, re.IGNORECASE), 
                       re.search(r'\bhousekeeping\b', text, re.IGNORECASE),
                       re.search(r'\bhost\b', text, re.IGNORECASE),
                       re.search(r'\bfront[- ]?desk\b', text, re.IGNORECASE),
                       re.search(r'\bservice\b', text, re.IGNORECASE)])
    #room quality
    room = any([re.search(r'\bclean\b', text, re.IGNORECASE),
                re.search(r'\bdirty\b', text, re.IGNORECASE),
                re.search(r'\brooms?\b', text, re.IGNORECASE),
                re.search(r'\bbeds?\b', text, re.IGNORECASE),
                re.search(r'\bwi-?fi\b', text, re.IGNORECASE),
                re.search(r'\binternet\b', text, re.IGNORECASE)])

    
    matches = np.array([location, hospitality, room])
    if sum(matches) == 1:
        return CLASSES[matches][0]
    else:
        return -1
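
A quick way to sanity-check the single-aspect rule is to run it on a few strings. The sketch below is a condensed stand-in for `get_class` that uses only a subset of the patterns above and takes a plain string rather than a `Phrase`; `PATTERNS` and `classify` are names introduced here for illustration.

```python
import re

# Subset of the notebook's patterns, indexed by class.
PATTERNS = [
    r'\b(location|neighborhood)\b',      # 0: location
    r'\b(staff|host|service)\b',         # 1: hospitality
    r'\b(clean|dirty|rooms?|beds?)\b',   # 2: room_quality
]

def classify(text):
    """Return the index of the only matching class, or -1 for zero or multiple matches."""
    matched = [i for i, pat in enumerate(PATTERNS)
               if re.search(pat, text, re.IGNORECASE)]
    return matched[0] if len(matched) == 1 else -1

print(classify("The staff was very rude"))        # one aspect: hospitality
print(classify("Great location but dirty room"))  # two aspects: rejected
print(classify("We arrived late at night"))       # no aspect: rejected
```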

When processing phrases, I want to prefer the longest phrase involving a single aspect. My other experiments have shown that the sentiment scores for longer phrases are more accurate (and have higher probability) than those for shorter phrases, especially in the presence of negating words like "not". 

The function below takes a list of phrase objects ordered from longest to shortest, and inspects their classifications. It will greedily take the longest phrase for each aspect, and ignore the rest.

In [6]:
def greedy_class_filter(ordered_phrases):
    found = [False for cls in CLASS_NAMES]
    filtered = []
    for phrase in ordered_phrases:
        if found[phrase.cls]:
            continue
        found[phrase.cls] = True
        filtered.append(phrase)

    return filtered
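
To see the greedy behaviour on a small input, the sketch below reimplements the filter against a minimal stand-in for `Phrase`. The `P` tuple, `greedy_filter`, and the sample phrases are all illustrative assumptions.

```python
from collections import namedtuple

# Minimal stand-in for the Phrase class: only the fields used here.
P = namedtuple('P', ['cls', 'tokens', 'phrase'])

def greedy_filter(ordered_phrases, n_classes=3):
    """Keep the first (longest) phrase seen for each class index."""
    found = [False] * n_classes
    kept = []
    for phrase in ordered_phrases:
        if not found[phrase.cls]:
            found[phrase.cls] = True
            kept.append(phrase)
    return kept

# Already sorted longest-first, as the filter expects.
phrases = [
    P(1, 6, "the staff was not very helpful"),
    P(2, 4, "the room was clean"),
    P(1, 2, "the staff"),
]

for p in greedy_filter(phrases):
    print(p.cls, p.phrase)
```

The shorter "the staff" phrase is dropped because a longer hospitality phrase was already kept. Note that phrases with class -1 have to be removed before calling the filter, since an index of -1 would silently refer to the last class.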

In order to obtain the sentiment scores for each phrase, see the ReviewSentiment Java source. There, I use CoreNLP to obtain sentiment scores for each sentence and phrase in the sentence. These phrases are obtained by running `Sentiment.java` on a cleaned version of the dataframe created above. The phrases are written out to the `reviews.tagged.txt` file, with document and sentence boundaries. 

The cell below iterates over these phrases, runs the classifier, and records the scores for each aspect.

In [7]:
from collections import defaultdict

#map from url to list of sentiment scores
hotels = defaultdict(list)

hits = 0
totals = 0
with open(BASE_FILENAME + '.out.txt', 'r') as f:
    docid = 0
    buffer = []
    for line in f:
        if line.startswith('===<d'):
            docid += 1
            if docid % 1000 == 0:
                print(docid, len(hotels))
        elif line.startswith('=<s'):
            #handle each sentence
            if len(buffer) != 0:
                #construct phrase wrappers
                phrases = []
                try:
                    for tokens, sentiment, prob, phrase in (p.split('|') for p in buffer):
                        phrases.append(Phrase(int(sentiment), int(tokens), float(prob), phrase))
                except ValueError:
                    #a line had the wrong number of fields or a non-numeric value
                    print("Failed to parse")
                #sort by reverse length
                phrases = sorted(phrases, key = lambda x : x.tokens, reverse=True)
                
                #tag using the simple classifier
                for phrase in phrases:
                    phrase.cls = get_class(phrase)
                
                #filter phrases with no class 
                phrases = [p for p in phrases if p.has_class()]
                
                if len(phrases) > 0:
                    #record the longest phrase for each aspect in this sentence
                    filtered_phrases = greedy_class_filter(phrases)
                    for phrase in filtered_phrases:
                        phrase.docid = docid
                    hotels[urls[docid]].extend(filtered_phrases)
                    hits += 1
                    
            buffer = []
            totals += 1
        else:
            buffer.append(line)

1000 21
2000 22
3000 25
4000 27
5000 29
6000 32
7000 34
8000 34
9000 34
10000 34
11000 34
12000 34
13000 34
14000 34
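
Judging from the parsing loop above, each phrase line appears to hold four `|`-separated fields: token count, sentiment, probability, and the phrase text. The sketch below parses one such line under that assumption; the sample line is invented, and `parse_phrase_line` is a hypothetical helper, not part of the notebook or of CoreNLP.

```python
def parse_phrase_line(line):
    """Split 'tokens|sentiment|prob|phrase' into typed fields.

    maxsplit=3 keeps any '|' inside the phrase text intact.
    Raises ValueError if the line has too few fields or a bad number.
    """
    tokens, sentiment, prob, phrase = line.rstrip('\n').split('|', 3)
    return int(tokens), int(sentiment), float(prob), phrase

# Invented sample line in the assumed layout.
print(parse_phrase_line("4|-1|0.72|staff was very rude\n"))
```

Limiting the split to three separators is a design choice that differs from the loop above, which rejects any line whose phrase text contains a `|`.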


In [8]:
print(docid)
print(len(reviews))
print(len(urls))

14094
14094
14094


Each hotel's score for an aspect is the average sentiment of the phrases assigned to that aspect. Sentiment scores range from -2 (most negative) to 2 (most positive). The cells below store the sum and count per aspect, from which the average follows. The reviews are heavily biased towards positive sentiment, so each hotel's scores must be considered relative to those of other hotels.

In [9]:
print(totals, hits)
#95802 18350

91320 23905


In [10]:
hotel_scores = {}
for hotel in hotels:
    scores = [0 for _ in CLASS_NAMES]
    counts = [0 for _ in CLASS_NAMES]
    for phrase in hotels[hotel]:
        scores[phrase.cls] += phrase.sentiment
        counts[phrase.cls] += 1
    hotel_scores[str(hotel)] = {name : {'sum_sentiment' : score, 'count' : count} for name, score, count in zip(CLASS_NAMES, scores, counts)}

## Results

Finally, the scores! The scores are mostly interpretable. Note that airport-associated hotels seem to have extremely low scores in most aspects.

In [11]:
scores_df = pd.DataFrame.from_dict(hotel_scores, orient='index')
with open(BASE_FILENAME + '_scores.json', 'w') as f:
    json.dump(hotel_scores, f, indent = 2)
scores_df

hotel,location,hospitality,room_quality
/Hotel_Review-g47289-d99254-Reviews-Best_Western_Queens_Gold_Coast-Bayside_Queens_New_York.html,"{'sum_sentiment': 33, 'count': 36}","{'sum_sentiment': 49, 'count': 98}","{'sum_sentiment': -27, 'count': 262}"
/Hotel_Review-g47369-d12441510-Reviews-Comfort_Inn_Suites_near_Stadium-Bronx_New_York.html,"{'sum_sentiment': -1, 'count': 5}","{'sum_sentiment': 4, 'count': 16}","{'sum_sentiment': 5, 'count': 14}"
/Hotel_Review-g47369-d12548946-Reviews-Holiday_Inn_Express_Bronx_NYC_Stadium_Area-Bronx_New_York.html,"{'sum_sentiment': 1, 'count': 1}","{'sum_sentiment': 0, 'count': 2}","{'sum_sentiment': -5, 'count': 9}"
/Hotel_Review-g47369-d4447795-Reviews-Rodeway_Inn_Bronx_Zoo-Bronx_New_York.html,"{'sum_sentiment': -1, 'count': 9}","{'sum_sentiment': 10, 'count': 15}","{'sum_sentiment': -7, 'count': 76}"
/Hotel_Review-g47369-d4702280-Reviews-Opera_House_Hotel-Bronx_New_York.html,"{'sum_sentiment': -20, 'count': 141}","{'sum_sentiment': 155, 'count': 299}","{'sum_sentiment': 150, 'count': 483}"
/Hotel_Review-g47369-d99257-Reviews-Ramada_Bronx-Bronx_New_York.html,"{'sum_sentiment': 7, 'count': 15}","{'sum_sentiment': 67, 'count': 151}","{'sum_sentiment': 15, 'count': 282}"
/Hotel_Review-g47626-d12246206-Reviews-Aloft_New_York_LaGuardia_Airport-East_Elmhurst_Queens_New_York.html,"{'sum_sentiment': 1, 'count': 1}","{'sum_sentiment': -1, 'count': 9}","{'sum_sentiment': 4, 'count': 22}"
/Hotel_Review-g47626-d1485917-Reviews-Comfort_Inn_LaGuardia_Airport_83rd_St-East_Elmhurst_Queens_New_York.html,"{'sum_sentiment': 0, 'count': 16}","{'sum_sentiment': -17, 'count': 78}","{'sum_sentiment': -34, 'count': 202}"
/Hotel_Review-g47729-d3411847-Reviews-Asiatic_Hotel_Flushing-Flushing_Queens_New_York.html,"{'sum_sentiment': 14, 'count': 39}","{'sum_sentiment': 2, 'count': 69}","{'sum_sentiment': 10, 'count': 217}"
/Hotel_Review-g47729-d6563501-Reviews-Ramada_Flushing_Queens-Flushing_Queens_New_York.html,"{'sum_sentiment': 6, 'count': 19}","{'sum_sentiment': 0, 'count': 59}","{'sum_sentiment': -4, 'count': 144}"
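
Because the table stores sums and counts rather than averages, a small conversion step makes hotels easier to compare. The sketch below computes the per-aspect mean for two rows copied from the output above; `mean_sentiment` is a helper introduced here, and returning `None` for an aspect with no phrases is an arbitrary choice.

```python
def mean_sentiment(cell):
    """Average sentiment for one aspect cell, or None when there are no phrases."""
    if cell['count'] == 0:
        return None
    return cell['sum_sentiment'] / cell['count']

# Two rows taken from the table above (hospitality and room_quality only).
rows = {
    'Opera_House_Hotel': {
        'hospitality': {'sum_sentiment': 155, 'count': 299},
        'room_quality': {'sum_sentiment': 150, 'count': 483},
    },
    'Comfort_Inn_LaGuardia_Airport_83rd_St': {
        'hospitality': {'sum_sentiment': -17, 'count': 78},
        'room_quality': {'sum_sentiment': -34, 'count': 202},
    },
}

for hotel, aspects in rows.items():
    means = {name: round(mean_sentiment(cell), 3) for name, cell in aspects.items()}
    print(hotel, means)
```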
