<b>Inspiration</b>

The dataset is large and informative, and I believe you can have a lot of fun with it! Here are some ideas to further inspire Kagglers:

- Fit a regression model on reviews and scores to see which words are most indicative of a higher or lower score.
- Perform a sentiment analysis on the reviews.
- Find the correlation between a reviewer's nationality and scores.
- Build beautiful and informative visualizations of the dataset.
- Cluster hotels based on reviews.
- Build a simple recommendation engine for guests who are fond of a particular hotel characteristic.

In [38]:
# Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os

# NLP Packages
import nltk 
from textblob import TextBlob 
from textblob import Word
import re
import string

# WordCloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Sklearn Packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text 
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, precision_score, f1_score, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Pandas Settings
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 100)

In [51]:
# Import csv file
df = pd.read_csv('csv/Hotel_Reviews_US.csv',encoding = 'unicode_escape')

# Data Cleaning and EDA

## Understand Dataset

In [52]:
# Taking a look at the dataset
df.head(2)

Unnamed: 0,address,categories,city,country,latitude,longitude,name,postalCode,province,reviews_date,reviews_dateAdded,reviews_doRecommend,reviews_id,reviews_rating,reviews_text,reviews_title,reviews_userCity,reviews_username,reviews_userProvince
0,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-09-22T00:00:00Z,2016-10-24T00:00:25Z,,,4.0,Pleasant 10 min walk along the sea front to th...,Good location away from the crouds,,Russ (kent),
1,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-04-03T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Really lovely hotel. Stayed on the very top fl...,Great hotel with Jacuzzi bath!,,A Traveler,


In [53]:
# Checking the shape of the dataframe
df.shape

(35912, 19)

In [54]:
# Checking null values
df.isna().sum()

address                     0
categories                  0
city                        0
country                     0
latitude                   86
longitude                  86
name                        0
postalCode                 55
province                    0
reviews_date              259
reviews_dateAdded           0
reviews_doRecommend     35912
reviews_id              35912
reviews_rating            862
reviews_text               22
reviews_title            1622
reviews_userCity        19649
reviews_username           43
reviews_userProvince    18394
dtype: int64
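
To put the null counts above in proportion, here is a minimal sketch that divides a few of them by the 35,912 rows reported by `df.shape`. The counts are copied from the output above; the names `null_counts` and `null_share` are only illustrative.

```python
# Null counts copied from the df.isna().sum() output above (subset of columns)
n_rows = 35912
null_counts = {
    "reviews_doRecommend": 35912,
    "reviews_id": 35912,
    "reviews_userCity": 19649,
    "reviews_userProvince": 18394,
    "reviews_title": 1622,
    "reviews_rating": 862,
}

# Share of missing values per column
null_share = {col: count / n_rows for col, count in null_counts.items()}

# Print the columns from most to least incomplete
for col, share in sorted(null_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<22} {share:6.1%}")
```

With the DataFrame loaded, `df.isna().mean()` should give the same shares for every column at once.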

In [55]:
# Checking how many hotels are in this dataset
len(df.name.unique())

879

In [56]:
# Checking the hotel with the highest number of reviews
df.pivot_table(index=['name'], aggfunc='size').nlargest()

name
The Alexandrian, Autograph Collection    1185
Howard Johnson Inn - Newburgh             714
Americas Best Value Inn                   567
Fiesta Inn and Suites                     546
Ip Casino Resort Spa                      392
dtype: int64
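
The two checks above (number of distinct hotels, hotels with the most reviews) boil down to counting how often each name appears. A small sketch in plain Python, using a made-up list of names rather than the real column:

```python
from collections import Counter

# Toy list of hotel names, one entry per review (made-up sample)
names = ["Hotel A", "Hotel B", "Hotel A", "Hotel C", "Hotel A", "Hotel B"]

reviews_per_hotel = Counter(names)
n_hotels = len(reviews_per_hotel)            # number of distinct hotels
top_two = reviews_per_hotel.most_common(2)   # hotels with the most reviews
print(n_hotels, top_two)
```

On the real data, `df['name'].nunique()` and `df['name'].value_counts().head()` are the pandas counterparts of these two steps.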

In [57]:
# Double-checking whether the number matches the column Total_Number_of_Reviews
df[df['name'] == 'The Alexandrian, Autograph Collection'].head(2)

Unnamed: 0,address,categories,city,country,latitude,longitude,name,postalCode,province,reviews_date,reviews_dateAdded,reviews_doRecommend,reviews_id,reviews_rating,reviews_text,reviews_title,reviews_userCity,reviews_username,reviews_userProvince
4744,480 King St,"Hotels,Hotel",Alexandria,US,38.80474,-77.04455,"The Alexandrian, Autograph Collection",22314,VA,2013-08-25T00:00:00Z,2017-04-20T01:34:00Z,,,5.0,The hotel was great. Staff went above and beyo...,Wonderful,,A verified traveler,
4745,480 King St,"Hotels,Hotel",Alexandria,US,38.80474,-77.04455,"The Alexandrian, Autograph Collection",22314,VA,2010-08-26T00:00:00Z,2017-04-20T01:34:00Z,,,4.0,A wonderful hotel - would definitely stay ther...,Beautiful Hotel with wonderful service.,,A verified traveler,


### Findings:

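- There are reviews from 879 hotels.
- The data is fairly clean. Apart from `reviews_doRecommend` and `reviews_id`, which are entirely null, and the reviewer city and province columns, it doesn't have many null values.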
- It's missing the cities where the hotels are located.
- There are reviews without latitude and longitude (86 rows).
- The actual number of reviews per hotel does not match the reported number.

## Data Cleaning

In [58]:
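# Dropping the columns that are entirely null.
# Note: drop() returns a new DataFrame. Without assigning the result back
# (df = df.drop(...)), df is unchanged, as the output below shows.
df.drop(columns=['reviews_doRecommend', 'reviews_id'])
df.head()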

Unnamed: 0,address,categories,city,country,latitude,longitude,name,postalCode,province,reviews_date,reviews_dateAdded,reviews_doRecommend,reviews_id,reviews_rating,reviews_text,reviews_title,reviews_userCity,reviews_username,reviews_userProvince
0,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-09-22T00:00:00Z,2016-10-24T00:00:25Z,,,4.0,Pleasant 10 min walk along the sea front to th...,Good location away from the crouds,,Russ (kent),
1,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-04-03T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Really lovely hotel. Stayed on the very top fl...,Great hotel with Jacuzzi bath!,,A Traveler,
2,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2014-05-13T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Ett mycket bra hotell. Det som drog ner betyge...,Lugnt lï¿½ï¿½ge,,Maud,
3,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-10-27T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,We stayed here for four nights in October. The...,Good location on the Lido.,,Julie,
4,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-03-05T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,We stayed here for four nights in October. The...,ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½...,,sungchul,


### Remove Punctuation

In [59]:
df.columns

Index(['address', 'categories', 'city', 'country', 'latitude', 'longitude',
       'name', 'postalCode', 'province', 'reviews_date', 'reviews_dateAdded',
       'reviews_doRecommend', 'reviews_id', 'reviews_rating', 'reviews_text',
       'reviews_title', 'reviews_userCity', 'reviews_username',
       'reviews_userProvince'],
      dtype='object')

In [71]:
df['reviews_title']

0                       Good location away from the crouds
1                           Great hotel with Jacuzzi bath!
2                                          Lugnt lï¿½ï¿½ge
3                               Good location on the Lido.
4        ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½...
                               ...                        
35907                          Amazing time (both times!!)
35908                          Amazing time (both times!!)
35909                          Amazing time (both times!!)
35910                                                  NaN
35911                                                  NaN
Name: reviews_title, Length: 35912, dtype: object

In [75]:
def clean_text_round1(text):
    '''Make text lowercase, remove URLs, words containing numbers, punctuation and non-letter characters.'''
    text = str(text).lower()
    text = re.sub(r'https?://\S+', ' ', text)  # remove URLs before punctuation is stripped
    text = re.sub(r'\w*\d\w*', ' ', text)  # remove words containing numbers
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # remove punctuation
    text = re.sub(r'[^a-z\s]', ' ', text)  # keep only letters and whitespace

    return text

round1 = lambda x: clean_text_round1(x)

In [76]:
# Applying clean_text_round1 function (missing titles are treated as empty strings)
df['Review_Clean'] = df['reviews_title'].fillna('').apply(round1)
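
A minimal, self-contained sketch of the cleaning steps, run on three titles from the outputs above plus a missing value. The function name `clean_title` and the choice to return an empty string for missing titles are my assumptions, not part of the notebook.

```python
import re
import string

def clean_title(text):
    """Lowercase, drop URLs, words containing digits, punctuation and non-letters."""
    if not isinstance(text, str):      # e.g. NaN for a missing title
        return ""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs first, while ':' and '/' still exist
    text = re.sub(r"\w*\d\w*", " ", text)       # words containing digits
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)       # anything that is not a-z or whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated spaces

samples = [
    "Great hotel with Jacuzzi bath!",
    "Amazing time (both times!!)",
    "Pleasant 10 min walk",
    float("nan"),
]
cleaned = [clean_title(s) for s in samples]
print(cleaned)
```

Removing URLs before punctuation matters: once `:` and `/` have been replaced by spaces, a URL pattern can no longer match.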

#### Findings and Takeaways:

- There are 17 hotels without latitude and longitude. I'll work on it as a stretch goal

## NEED TO WORK ON THAT - Fix Spelling

- To do a spell check, I will choose a random review and check whether it contains any misspellings
- I will create a function that will use TextBlob to fix misspellings
- Check the result

In [None]:
# Checking a random review
df['Negative_Review'][4]

It seems that there are a few misspellings, such as the words `theough` and `extreamly`. I'll use TextBlob to fix them.

In [None]:
# Create a function to fix misspellings with TextBlob.
# Note: Word.spellcheck() works on single words and has to be called with ();
# for a full review, TextBlob(...).correct() is used instead.
def spellcheck(text):
    return str(TextBlob(text).correct())

In [None]:
# Checking if function works
spellcheck(df['Negative_Review'][4])

In [None]:
# w = Word(df['Negative_Review'][4])
# w.spellcheck()

In [13]:
df['Negative_Review_SC'] = df['Negative_Review'].apply(spellcheck)

In [14]:
df['Negative_Review_SC'][4]

' You When I booked with your company on line you showed me pictures of a room I thought I was getting and paying for and then when we arrived that s room was booked and the staff told me we could only book the villa suite through them directly Which was completely false advertising After being there we realised that you have grouped lots of rooms on the photo together leaving me the consumer confused and extremely disgruntled especially as its my my wife s with birthday present Please make your webster more clear through pricking and photo as again I didn t really know what I was paying for and how much it had ended up being Your photo told me I was getting something I wasn t Not happy and won t be using you again '

In [64]:
df.Negative_Review[4]

' You When I booked with your company on line you showed me pictures of a room I thought I was getting and paying for and then when we arrived that s room was booked and the staff told me we could only book the villa suite theough them directly Which was completely false advertising After being there we realised that you have grouped lots of rooms on the photos together leaving me the consumer confused and extreamly disgruntled especially as its my my wife s 40th birthday present Please make your website more clear through pricing and photos as again I didn t really know what I was paying for and how much it had wnded up being Your photos told me I was getting something I wasn t Not happy and won t be using you again '

In [90]:
blob = TextBlob(df.Negative_Review[4])
blob.correct()

TextBlob(" You When I booked with your company on line you showed me pictures of a room I thought I was getting and paying for and then when we arrived that s room was booked and the staff told me we could only book the villa suite through them directly Which was completely false advertising After being there we realised that you have grouped lots of rooms on the photo together leaving me the consumer confused and extremely disgruntled especially as its my my wife s with birthday present Please make your webster more clear through pricking and photo as again I didn t really know what I was paying for and how much it had ended up being Your photo told me I was getting something I wasn t Not happy and won t be using you again ")
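
To see exactly which words the correction changed, the sketch below compares a few excerpts of the original review with the matching excerpts of the TextBlob output, word by word. The excerpts are stitched together from the two outputs above; the position-by-position comparison only works because both strings have the same number of tokens.

```python
# Excerpts copied from the original review and from the TextBlob output above
original = ("book the villa suite theough them directly extreamly disgruntled "
            "wife s 40th birthday website more clear through pricing")
corrected = ("book the villa suite through them directly extremely disgruntled "
             "wife s with birthday webster more clear through pricking")

# Both excerpts have the same number of tokens, so compare them position by position
changes = [(o, c) for o, c in zip(original.split(), corrected.split()) if o != c]
for before, after in changes:
    print(f"{before} -> {after}")
```

Two of the changes fix real misspellings, while the other three replace words that were already correct.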

### Findings and Takeaways:
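- While checking a random review, TextBlob fixed `theough` and `extreamly`, but it also changed words that were correct (`40th` became `with`, `website` became `webster`, `pricing` became `pricking`), so its corrections cannot be applied blindly.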

# Data Engineering

## Create a function for Sentiment Analysis

In this step, I will generate a sentiment analysis. Normally, this would be a step that I'd run after data cleaning for NLP. However, previous tests showed me that data cleaning does not affect the sentiment analysis using TextBlob.

Running sentiment analysis takes a lot of time because I have more than 515K observations. For this reason, once the sentiment analysis is created, I will save the DataFrame and load it again, so the analysis does not need to run again.

In [17]:
# Create a function to get subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get polarity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

<b>NOTE:</b>

Each of the two following cells takes around 10 minutes to run. For this reason, I will save the DataFrame into a csv file and load it again.

In [18]:
# # Create new columns with the polarity of Negative and Positive Reviews
# df['Polarity_Net'] = df['Negative_Review'].apply(getPolarity)
# df['Polarity_Pos'] = df['Positive_Review'].apply(getPolarity)

In [19]:
# # Saving csv with sentiment analysis
# df.to_csv("csv/df_sentiment_analysis.csv")
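
The "compute once, save, reload" pattern used here can be sketched as a small helper. This is only an illustration: `load_or_compute` and `slow_step` are hypothetical names, and I use `pickle` with a temporary file so the example is self-contained, whereas the notebook saves a csv.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the cached result at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_step():
    # Stand-in for the slow sentiment analysis
    calls.append(1)
    return {"Polarity_Net": [0.028671], "Polarity_Pos": [0.283333]}

cache_path = os.path.join(tempfile.mkdtemp(), "sentiment_cache.pkl")
first = load_or_compute(cache_path, slow_step)    # computes and writes the file
second = load_or_compute(cache_path, slow_step)   # reads the file, slow_step is not called again
print(first == second, len(calls))
```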

## Importing the Updated DataFrame

Now let's import the DataFrame again with the sentiment analysis and check whether the results make sense.

In [10]:
# Importing DataFrame with new Polarity column
df = pd.read_csv('csv/df_sentiment_analysis.csv', index_col=0)

In [11]:
# Checking columns
df.columns

Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng', 'Negative_Review_Clean',
       'Positive_Review_Clean', 'Polarity_Net', 'Polarity_Pos'],
      dtype='object')

In [12]:
# Classifying polarity into three classes: 0 = negative (x < 0), 1 = neutral (0 <= x < 0.1), 2 = positive (x >= 0.1)
df['Sent_Analysis_Neg'] = df['Polarity_Net'].apply(lambda x: 0 if x < 0 else 1 if x > -0.1 and x < 0.1 else 2)
df['Sent_Analysis_Pos'] = df['Polarity_Pos'].apply(lambda x: 0 if x < 0 else 1 if x > -0.1 and x < 0.1 else 2)
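
Written out as a named function, the branching of the lambda above is easier to check. This sketch keeps the same conditions in the same order; `classify_polarity` is just an illustrative name, and the sample values are chosen to hit each branch.

```python
def classify_polarity(x):
    """Same branching as the lambda above: 0 = negative, 1 = near neutral, 2 = positive."""
    if x < 0:
        return 0
    if -0.1 < x < 0.1:
        return 1
    return 2

examples = [-0.5, -0.05, 0.0, 0.028671, 0.1, 0.283333]
labels = [classify_polarity(x) for x in examples]
print(list(zip(examples, labels)))
```

Note that -0.05 lands in class 0 because `x < 0` is checked first, so the lower bound of -0.1 never comes into play. If a symmetric neutral band around zero is what is wanted, the neutral test would need to come before the negative one.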

In [13]:
# Creating a csv file with the sentiment analysis
sentiment_analysis = df[['Hotel_Name','Negative_Review','Positive_Review','Reviewer_Score','Sent_Analysis_Neg','Sent_Analysis_Pos']]

# Uncomment the line below to export the file
# sentiment_analysis.to_csv('sentiment_analysis.csv')

### Findings and Takeaways:

- Polarity features were created using sentiment analysis for the Negative and Positive Reviews.
- Polarity ranges between -1 and 1, where -1 means that the review was very negative and 1 means that it was very positive.
- Sentiment analysis seems to do a good job identifying positive reviews, but the results for negative reviews could be improved.

## Target Variable

In this section, I will create a target variable and use it to train my models. I will turn the Reviewer Score feature into three classes:

- <b>Bad:</b> Scores below 5
- <b>Regular:</b> Scores from 5 up to, but not including, 7
- <b>Good:</b> Scores of 7 and above

In [14]:
# Checking dataframe
df.head(1)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng,Negative_Review_Clean,Positive_Review_Clean,Polarity_Net,Polarity_Pos,Sent_Analysis_Neg,Sent_Analysis_Pos
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968,i am so angry that i made this post available...,i am so angry that i made this post available...,0.028671,0.283333,1,2


In [15]:
# Create function that turns the Reviewer Score into a classification target with 3 values
df['Score'] = df['Reviewer_Score'].apply(lambda x: 0 if x < 5 else 1 if x >= 5 and x < 7 else 2)

In [16]:
# Checking if function worked
df[['Reviewer_Score', 'Score']].head(5)

Unnamed: 0,Reviewer_Score,Score
0,2.9,0
1,7.5,2
2,7.1,2
3,3.8,0
4,6.7,1
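
The class boundaries can be checked with a small stand-alone function that mirrors the lambda above. `score_class` is an illustrative name; the first five scores are the ones shown in the output, and the last two probe the edges.

```python
def score_class(score):
    """0 = Bad (< 5), 1 = Regular (5 <= score < 7), 2 = Good (>= 7)."""
    if score < 5:
        return 0
    if score < 7:
        return 1
    return 2

reviewer_scores = [2.9, 7.5, 7.1, 3.8, 6.7]   # first five rows shown above
boundary_scores = [5.0, 7.0]                  # edge cases
row_classes = [score_class(s) for s in reviewer_scores]
edge_classes = [score_class(s) for s in boundary_scores]
print(row_classes, edge_classes)
```

A score of exactly 5 counts as Regular and a score of exactly 7 counts as Good. On a whole column, `pd.cut` with `right=False` is one way to express the same binning.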


In [17]:
# Checking if there will be class imbalance
df.Score.value_counts()

2    428887
1     64570
0     22281
Name: Score, dtype: int64

### Findings and Takeaways:

- There is class imbalance in the target variable. Since the dataset is very large, it should not be a problem to use downsampling or upsampling.
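
To make the imbalance concrete, the sketch below computes the class shares from the counts above and then shows downsampling on a toy label list. It uses plain Python rather than the `resample` helper imported earlier; the toy labels and the seed are made up.

```python
import random

# Class counts copied from the value_counts() output above
counts = {2: 428887, 1: 64570, 0: 22281}
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
print({label: round(share, 3) for label, share in shares.items()})

# Downsampling on a toy label list: keep as many rows per class as the rarest class has
labels = [2] * 12 + [1] * 5 + [0] * 3
rng = random.Random(42)
smallest = min(labels.count(c) for c in set(labels))
kept = []
for c in sorted(set(labels)):
    idx = [i for i, y in enumerate(labels) if y == c]
    kept.extend(rng.sample(idx, smallest))   # random subset of each class
balanced = [labels[i] for i in sorted(kept)]
print(balanced)
```

Downsampling to the rarest class would still leave 22,281 rows per class here, which is why the size of the dataset makes this option affordable.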

# Vanilla Model

In [18]:
# Evaluation function

def evaluation(y_true, y_pred):
    # Print Accuracy and F1 Score metrics.
    print('Evaluation Metrics:')
    print('Accuracy: ' + str(metrics.accuracy_score(y_true, y_pred)))
    print('F1 Score: ' + str(metrics.f1_score(y_true, y_pred, average="micro")))
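
One thing worth knowing about this choice of metrics: when every sample has exactly one label, micro-averaged F1 is equal to accuracy, so the two printed numbers will match. A hand-rolled sketch with made-up labels and predictions shows why (each wrong prediction counts once as a false positive and once as a false negative):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes before computing F1."""
    classes = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for c in classes for t, p in pairs if t == c and p == c)
    fp = sum(1 for c in classes for t, p in pairs if t != c and p == c)
    fn = sum(1 for c in classes for t, p in pairs if t == c and p != c)
    return 2 * tp / (2 * tp + fp + fn)

y_true = [2, 2, 2, 1, 0, 2, 1, 2]   # made-up labels
y_pred = [2, 2, 1, 1, 2, 2, 2, 2]   # made-up predictions
acc = accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(acc, f1)
```

With imbalanced classes, `average="macro"` or `average="weighted"` in `f1_score` would give a number that differs from accuracy and says more about the minority classes.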

## Vectorizing Dataset

In [19]:
# Instantiate Stop Words
# add_stop_words = ['mention', 'link', 'rt', 'quot', 'amp', 'sxsw']
# Converted to a list, the form CountVectorizer accepts for custom stop words
stop_words = list(text.ENGLISH_STOP_WORDS)

In [20]:
df.columns

Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng', 'Negative_Review_Clean',
       'Positive_Review_Clean', 'Polarity_Net', 'Polarity_Pos',
       'Sent_Analysis_Neg', 'Sent_Analysis_Pos', 'Score'],
      dtype='object')

In [21]:
df.Negative_Review_Clean

0          i am so angry that i made this post available...
1                                               no negative
2          rooms are nice but for elderly a bit difficul...
3          my room was dirty and i was afraid to walk ba...
4          you when i booked with your company on line y...
                                ...                        
515733     no trolly or staff to help you take the lugga...
515734             the hotel looks like   but surely not   
515735     the ac was useless it was a hot week in vienn...
515736                                          no negative
515737           i was in   floor it didn t work free wife 
Name: Negative_Review_Clean, Length: 515738, dtype: object

In [22]:
# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words)

# Fit and transform the cleaned negative reviews
df_cv = cv.fit_transform(df.Negative_Review_Clean)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_tk = pd.DataFrame(df_cv.toarray(), columns=cv.get_feature_names_out())
df_tk.index = df.index
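
What the vectorizing step produces is a table with one row per review and one column per word. The sketch below builds such a table by hand for two shortened reviews from the output above. It is a simplification of what `CountVectorizer` does, and the tiny stop-word set is made up for the example.

```python
from collections import Counter

# Two cleaned reviews taken from the Negative_Review_Clean output above, shortened
docs = ["my room was dirty", "the ac was useless it was a hot week"]
stop = {"my", "the", "was", "it", "a"}      # tiny illustrative stop-word list

# Vocabulary: sorted unique tokens that are not stop words
vocab = sorted({tok for doc in docs for tok in doc.split() if tok not in stop})

# One row of counts per document, one column per vocabulary word
rows = []
for doc in docs:
    counts = Counter(tok for tok in doc.split() if tok not in stop)
    rows.append([counts[word] for word in vocab])

print(vocab)
print(rows)
```

Most cells are zero even in this toy case, which is why the vectorizer returns a sparse matrix.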

In [23]:
y = df.Score
X = df_tk

In [24]:
X.shape

(515738, 51632)
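
A quick back-of-the-envelope estimate of what a dense matrix of this shape costs in memory. The 8 bytes per cell are an assumption (64-bit integer counts, which as far as I know is the `CountVectorizer` default).

```python
# Shape copied from the X.shape output above
n_rows, n_cols = 515738, 51632
bytes_per_cell = 8                      # assumption: 64-bit integer counts

dense_bytes = n_rows * n_cols * bytes_per_cell
dense_gb = dense_bytes / 1e9
print(f"dense matrix: about {dense_gb:,.0f} GB")
```

That is far more than a typical machine has available, so keeping the sparse output of `fit_transform`, or limiting the vocabulary, may be necessary before modeling.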

## Train Test Split

In [2]:
# Running Train Test Split
# (requires the imports cell and the vectorizing cells above to have been run in this session)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [1]:
X_train
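
Conceptually, the split shuffles the row indices and cuts them into two parts. A minimal sketch in plain Python, where `simple_split` is a hypothetical helper and not how scikit-learn implements it:

```python
import random

def simple_split(n_samples, test_size=0.25, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(20, test_size=0.25, seed=0)
print(len(train_idx), len(test_idx))
```

In `train_test_split`, passing `random_state` makes the split reproducible, and `stratify=y` keeps the class proportions the same in both parts, which may matter given the imbalance found earlier.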

## Vanilla Models

### First Model

In [None]:
logreg_base = LogisticRegression(max_iter=100)
logreg_base.fit(X_train, y_train)




# Pickle DataFrame

In [64]:
# Pickle DataFrame
pd.to_pickle(df, "./dummy.pkl")

# Ideas

- Check whether reviews get worse the longer the guest waits to write them
- Check the country and nationalities
- Time of the year with more complaints

# Stretch Goals

- Get latitude and longitude for hotels that are missing this information
- People might base their review on an isolated bad experience

In [6]:
''' Getting hotels' latitude and longitude '''

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# 0 - define the geocoder that was missing (the user_agent string is a placeholder)
locator = Nominatim(user_agent='hotel_reviews_notebook')
# 1 - convenient function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# 2 - create location column ('ADDRESS' must match the address column name in df)
df['location'] = df['ADDRESS'].apply(geocode)
# 3 - create longitude, latitude and altitude from location column (returns tuple)
df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split point column into latitude, longitude and altitude columns
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['point'].tolist(), index=df.index)