# Predicting Hotel Scores Based on Reviews

### About the Dataset:
##### https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

##### This dataset contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe. The geographical location of each hotel is also provided for further analysis.

The CSV file contains 17 fields, described below:

    Hotel_Address: Address of the hotel.
    Review_Date: Date when the reviewer posted the corresponding review.
    Average_Score: Average score of the hotel, calculated based on the latest comment in the last year.
    Hotel_Name: Name of the hotel.
    Reviewer_Nationality: Nationality of the reviewer.
    Negative_Review: Negative review the reviewer gave to the hotel. If the reviewer did not give a negative review, the value is 'No Negative'.
    Review_Total_Negative_Word_Counts: Total number of words in the negative review.
    Positive_Review: Positive review the reviewer gave to the hotel. If the reviewer did not give a positive review, the value is 'No Positive'.
    Review_Total_Positive_Word_Counts: Total number of words in the positive review.
    Reviewer_Score: Score the reviewer gave to the hotel, based on his/her experience.
    Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.
    Total_Number_of_Reviews: Total number of valid reviews the hotel has.
    Tags: Tags the reviewer gave the hotel.
    days_since_review: Duration between the review date and the scrape date.
    Additional_Number_of_Scoring: Some guests submitted only a score rather than a review. This number indicates how many valid scores without a review the hotel has.
    lat: Latitude of the hotel.
    lng: Longitude of the hotel.


### Guiding Questions:
- Can a hotel score be predicted from the reviews left by customers?

In [63]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import nltk
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [2]:
reviews = pd.read_csv('Hotel_Reviews.csv')
reviews.head()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,3.8,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,6.7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968


### First Step: Positive and Negative Word Counts

Does the number of positive/negative words correlate with the given score? One would expect more positive words to go with a higher score, and more negative words with a lower one.
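One quick way to probe this question is a Pearson correlation between each word-count column and the score. The sketch below uses only the five rows shown in the `head()` output above, so the numbers it prints say nothing about the full dataset; the short column names (`pos_words`, `neg_words`, `diff`, `score`) are illustrative stand-ins, not columns from the CSV.

```python
import pandas as pd

# The five rows from the head() output above, with shortened (hypothetical) column names
head = pd.DataFrame({
    "pos_words": [11, 105, 21, 26, 8],
    "neg_words": [397, 0, 42, 210, 140],
    "score": [2.9, 7.5, 7.1, 3.8, 6.7],
})

# Difference between positive and negative word counts
head["diff"] = head["pos_words"] - head["neg_words"]

# Pearson correlation of each candidate feature with the reviewer score
corr_with_score = head[["pos_words", "neg_words", "diff"]].corrwith(head["score"])
print(corr_with_score)
```

On the real data the same call would be applied to `reviews` (or a sample of it); five rows are far too few to draw conclusions from.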

In [59]:
# These are the only columns we care about for prediction purposes
data = reviews[['Review_Total_Positive_Word_Counts','Review_Total_Negative_Word_Counts', 'Reviewer_Score']]

# Creating new column for the difference between positive and negative word counts
data = data.assign(Pos_Neg_Counts=data.Review_Total_Positive_Word_Counts - data.Review_Total_Negative_Word_Counts)
data = data.drop(['Review_Total_Positive_Word_Counts', 'Review_Total_Negative_Word_Counts'], axis=1)
data.head()

Unnamed: 0,Reviewer_Score,Pos_Neg_Counts
0,2.9,-386
1,7.5,105
2,7.1,-21
3,3.8,-184
4,6.7,-132


In [60]:
fig = px.histogram(data, x="Pos_Neg_Counts", nbins=100)
fig.show()

As we can see, the difference between positive and negative word counts is fairly normally distributed.

Normalize the positive/negative word-count difference to the range [0, 1] with the following min-max formula:

$z_{i} = \frac{x_{i} - \min(x)}{\max(x) - \min(x)}$
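As a small worked example of the formula, the sketch below scales the five differences shown in the earlier `head()` output. The helper name `min_max_scale` is hypothetical, and because the minimum and maximum here come from only five values, the results differ from those of the full column.

```python
import pandas as pd

def min_max_scale(x: pd.Series) -> pd.Series:
    """Rescale x linearly so its minimum maps to 0 and its maximum to 1."""
    span = x.max() - x.min()
    if span == 0:
        # All values identical: the formula would divide by zero, so return zeros
        return pd.Series(0.0, index=x.index)
    return (x - x.min()) / span

# Pos_Neg_Counts values from the first five rows
diffs = pd.Series([-386, 105, -21, -184, -132])
scaled = min_max_scale(diffs)
print(scaled)
```

Within this sample the smallest value (-386) maps to 0 and the largest (105) maps to 1; every other value lands proportionally in between.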

In [61]:
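# Observed extremes of the column, used as the bounds for min-max scaling
word_min = data['Pos_Neg_Counts'].min()
word_max = data['Pos_Neg_Counts'].max()

# Vectorized arithmetic gives the same result as a row-wise apply
data['Pos_Neg_Counts'] = (data['Pos_Neg_Counts'] - word_min) / (word_max - word_min)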
data.head()

Unnamed: 0,Reviewer_Score,Pos_Neg_Counts
0,2.9,0.020356
1,7.5,0.645038
2,7.1,0.484733
3,3.8,0.277354
4,6.7,0.343511


In [62]:
temp = data.groupby('Reviewer_Score').mean()
fig = px.scatter(temp, y=temp.index, x='Pos_Neg_Counts')
fig.update_layout(title='Hotel Score vs. Average Normalized Pos/Neg Word Count Difference')
fig.show()

### Simple Linear Regression with sklearn

In [67]:
# sklearn expects a 2-D feature matrix, hence reshape(-1, 1)
X = data['Pos_Neg_Counts'].values.reshape(-1,1)
y = data['Reviewer_Score'].values.reshape(-1,1)

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [71]:
model = LinearRegression()  
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [73]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 1.131415012889884
Mean Squared Error: 2.1027897505649795
Root Mean Squared Error: 1.4500999105458146
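To put these error values in context, it can help to compute the same three metrics for a naive baseline that always predicts the mean score. The sketch below is an assumption-laden illustration rather than part of the original analysis: the helper `regression_errors` is hypothetical, and it is run on the five `Reviewer_Score` values from the `head()` output only, so its numbers are not comparable with the results above.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length arrays of scores."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    return {
        "mae": np.mean(np.abs(residuals)),  # mean absolute error
        "mse": mse,                         # mean squared error
        "rmse": np.sqrt(mse),               # root mean squared error
    }

# Reviewer_Score values from the first five rows
y_true = np.array([2.9, 7.5, 7.1, 3.8, 6.7])

# Naive baseline: predict the mean score for every review
baseline_pred = np.full_like(y_true, y_true.mean())
errors = regression_errors(y_true, baseline_pred)
print(errors)
```

In the notebook itself, the baseline would use the mean of `y_train` and be evaluated against `y_test`; if the regression's errors are not clearly lower than the baseline's, the word-count feature adds little predictive value.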
