In this lab we're going to classify hotels by quality, using sentiment features drawn from their reviews. This is the same case study we saw in the lecture. The question is: what can we learn about the quality of hotels from aggregated user reviews?

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


Let's load the reviews that we need for each hotel. We're looking at hotels below 3 stars ("low") and above 4 stars ("high"), each of which has at least 10 reviews.

In [2]:
file = "economic.hotels_as_reviews.gz"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file)
print(df)

                                       Hotel Rating  \
0                 11th Avenue Hotel & Hostel    low   
1                                3 West Club   high   
2                                  414 Hotel   high   
3     70 park avenue hotel - a Kimpton Hotel   high   
4       A Victory Inn & Suites Phoenix North    low   
...                                      ...    ...   
5294             easyHotel Paddington London    low   
5295                           iQ Hotel Roma   high   
5296                      misc eatdrinksleep   high   
5297                             nhow Berlin   high   
5298              theWit, a Doubletree Hotel   high   

                                                   Text  
0     This hostel is in a very good location, close ...  
1     We had 5 nights here and were unsure as to wha...  
2     This is a small boutique hotel with a nice int...  
3     I stayed at 70 Park Ave Hotel the night before...  
4     I made a reservation. Cancelled 2 hours lat

We're going to use sentiment words as features: words with a positive meaning or a negative meaning. These word lists are pre-defined, so we don't need to fit a model before we use them. First, let's take a look at how many high and low hotels we have here.
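To make the idea of sentiment-word features concrete, here is a minimal sketch. The two tiny word lists and the `sentiment_features` helper are invented for illustration; they are not the lexicon or the code that *text_analytics* uses, which this lab doesn't show.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only (not the package's word lists)
POSITIVE = {"good", "nice", "clean", "friendly", "great"}
NEGATIVE = {"bad", "dirty", "rude", "noisy", "broken"}

def sentiment_features(text):
    """Count how often each lexicon word appears in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in POSITIVE or t in NEGATIVE)
    # One feature per lexicon word, in a fixed (sorted) order
    vocab = sorted(POSITIVE | NEGATIVE)
    return {word: counts[word] for word in vocab}

features = sentiment_features("Nice room, friendly staff, but a noisy street. Really nice!")
# Show only the words that actually occurred
print({word: count for word, count in features.items() if count > 0})
```

Because the word lists are fixed in advance, every review maps to the same set of features without any training step. Back to the data: the next cell counts the hotels in each class.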

In [3]:
freq = ai.print_labels(df, "Rating")
print(freq)

{'low': 1678, 'high': 3621}
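The counts above also give us a useful reference point: how accurate would we be if we always guessed the most common class? The sketch below works that out from the two printed counts. It is plain arithmetic and not part of the *text_analytics* package.

```python
# Class counts copied from the output above
freq = {"low": 1678, "high": 3621}

total = sum(freq.values())
shares = {label: count / total for label, count in freq.items()}

# Accuracy of a "classifier" that always predicts the most common class
majority_label = max(freq, key=freq.get)
baseline_accuracy = freq[majority_label] / total

print(total)                        # 5299
print(majority_label)               # high
print(round(baseline_accuracy, 3))  # 0.683
```

Any classifier we train should beat this baseline of roughly 68% to be worth using. With the DataFrame itself, `df["Rating"].value_counts(normalize=True)` should give the same class shares.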


The data skews high: there are more than twice as many high-rated hotels as low-rated ones (3,621 vs. 1,678). We still have a good number of hotels in total, over 5k, spread across lots of cities and types of guests, so it might be difficult to make generalizations that tell us what a good hotel is.

So now we have (1) our data from hotel reviews and (2) our sentiment vectorizer (positive and negative words). We're going to classify these hotels by quality. The basic code is below; it just calls our *text_analytics* package. That package splits the data into training and testing data, learns a classifier, and then evaluates the classifier. We're telling the package to use "Rating" as the ground-truth class, with sentiment features.

In [4]:
results = ai.shallow_classification(df, labels = "Rating", features = "sentiment", classifier = "lm")
print(results)

              precision    recall  f1-score   support

        high       0.99      1.00      1.00       369
         low       1.00      0.98      0.99       161

    accuracy                           0.99       530
   macro avg       1.00      0.99      0.99       530
weighted avg       0.99      0.99      0.99       530



**Be patient**

And there we go! The report shows precision, recall, and F1 for each class, plus the overall accuracy on the 530 held-out hotels (0.99 here).
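Each number in a report like this comes from simple counts of right and wrong predictions. As a sketch of the arithmetic, the hypothetical helper `class_scores` below computes precision, recall, and F1 for one class. The toy labels are invented for illustration and are not the lab's predictions.

```python
def class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed cases of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, invented for illustration
y_true = ["high", "high", "high", "high", "low", "low"]
y_pred = ["high", "high", "high", "low", "low", "low"]

for label in ("high", "low"):
    p, r, f = class_scores(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))

# Accuracy is the share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy", round(accuracy, 2))
```

Precision asks "of the hotels we called high, how many really were?", while recall asks "of the hotels that really were high, how many did we find?". In the lab's report, the same quantities are computed over the hotels in the test set.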

These results may differ a bit from the lecture, because we're using random train/test splits. That means the classifier sees different data each time. If you want more advanced examples of how to solve this hotel classification problem, take a look at the *text_analytics.shallow_classification()* function.
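To see why random splits change the results, here is a small sketch of a random hold-out split. The `random_split` helper is hypothetical, and how *text_analytics* actually splits the data is not shown in this lab. The 10% test fraction is an inference from the report above, where the support column adds up to 530 of the 5,299 hotels.

```python
import random

def random_split(items, test_fraction=0.1, seed=None):
    """Shuffle a copy of items and hold out test_fraction of them as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

hotel_ids = range(5299)  # one id per hotel, matching the row count of the data

train_a, test_a = random_split(hotel_ids, seed=1)
train_b, test_b = random_split(hotel_ids, seed=2)

print(len(train_a), len(test_a))
# Number of hotels that landed in both test sets
print(len(set(test_a) & set(test_b)))
# Reusing a seed reproduces the same split exactly
print(random_split(hotel_ids, seed=1)[1] == test_a)
```

Two different seeds put mostly different hotels in the test set, so the scores shift slightly from run to run. Fixing the seed is the usual way to make a result repeatable.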