# NLP With Hotel Review Part 1

In [None]:
# Please run the imports below in order to set up the environment first.

# The usual packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline

import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# To make our sets
from sklearn.model_selection import train_test_split 

# Scalers
from sklearn.preprocessing import StandardScaler, MinMaxScaler 

# The classifiers 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.feature_extraction.text import CountVectorizer

Welcome to NLP with Hotel Review Part 1. Here we will dive into the Exploratory Data Analysis as well as the Data Wrangling of the Hotel Review dataset.

## 1. Exploratory Data Analysis

In [None]:
# Please run the code below to read in the following CSV data file.

# Contains various Hotel Reviews and other Hotel information.
hotelreviews_df = pd.read_csv('data/Hotel_Reviews.csv')

Before we begin, let us take a look at the Hotel Review dataset that we will be working with:

In [None]:
# To take a peek at the data we are working with.
hotelreviews_df.head()

At first glance, we can see that the Hotel Review dataset contains a mix of numeric and non-numeric data. We have hotel information such as the hotel address, hotel name, and even the longitude and latitude of the hotel. We have a review date and different review scores along with both positive and negative written text reviews. Our focus for this EDA will be `Reviewer_Score`, which will be our target.

In [4]:
# To see the shape of the dataset.
hotelreviews_df.shape

(515738, 17)

In [5]:
# To get further insights into the dataset.
hotelreviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   Hotel_Address                               515738 non-null  object 
 1   Additional_Number_of_Scoring                515738 non-null  int64  
 2   Review_Date                                 515738 non-null  object 
 3   Average_Score                               515738 non-null  float64
 4   Hotel_Name                                  515738 non-null  object 
 5   Reviewer_Nationality                        515738 non-null  object 
 6   Negative_Review                             515738 non-null  object 
 7   Review_Total_Negative_Word_Counts           515738 non-null  int64  
 8   Total_Number_of_Reviews                     515738 non-null  int64  
 9   Positive_Review                             515738 non-null  object 
 

In [6]:
hotelreviews_df.isnull().sum()

Hotel_Address                                    0
Additional_Number_of_Scoring                     0
Review_Date                                      0
Average_Score                                    0
Hotel_Name                                       0
Reviewer_Nationality                             0
Negative_Review                                  0
Review_Total_Negative_Word_Counts                0
Total_Number_of_Reviews                          0
Positive_Review                                  0
Review_Total_Positive_Word_Counts                0
Total_Number_of_Reviews_Reviewer_Has_Given       0
Reviewer_Score                                   0
Tags                                             0
days_since_review                                0
lat                                           3268
lng                                           3268
dtype: int64

After reading the dataset into `hotelreviews_df` and reviewing the output above, we can see that the Hotel Reviews dataset has 515,738 rows and 17 columns. The data comprises three datatypes: four `float64`, five `int64`, and eight `object` columns. The DataFrame uses 66.9+ MB of memory. We also checked for null values: the `lat` and `lng` columns are each missing 3,268 values, and no other column has missing data.
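The same summary can also be computed rather than read off the printed output. The following is a minimal sketch on a small stand-in DataFrame; `toy_df` and its values are illustrative assumptions, not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hotel review data (values are illustrative only).
toy_df = pd.DataFrame({
    "Hotel_Name": ["Hotel Arena", "Hotel Arena", "Hotel B"],
    "Average_Score": [7.7, 7.7, 8.4],
    "Total_Number_of_Reviews": [1403, 1403, 250],
    "lat": [52.360576, np.nan, 48.2],
})

# Number of columns per dtype, instead of counting them by hand from .info().
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Keep only the columns that actually contain missing values.
null_counts = toy_df.isnull().sum()
print(null_counts[null_counts > 0])
```

Running the same two expressions on `hotelreviews_df` should reproduce the dtype and null counts described above.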

Looking at our data, we can see that the `Reviewer_Score` provided are all given as decimal values (float). In order to convert them into integers (int) from 1 to 10, we will take the data within the `Reviewer_Score` column and round it to the nearest ones place (or whole number) using `.round(0)` and then convert the datatype from float to integer using `.astype()` as seen below:

In [7]:
# Rounding 'Reviewer_Score' and converting values from the current datatype float into an integer. 
hotelreviews_df['Reviewer_Score'] = hotelreviews_df['Reviewer_Score'].round(0).astype(int)

Now if we take a look at our data, we should see that all of the values within the `Reviewer_Score` column are whole numbers and that they are all of integer datatype.

In [8]:
# To peek at the changes made above.
hotelreviews_df.head()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,3,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,8,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968
3,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...,210,1403,Great location in nice surroundings the bar a...,26,1,4,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",3 days,52.360576,4.915968
4,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/24/2017,7.7,Hotel Arena,New Zealand,You When I booked with your company on line y...,140,1403,Amazing location and building Romantic setting,8,3,7,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",10 days,52.360576,4.915968


In [None]:
# To verify that 'Reviewer_Score' is now datatype 'int'.
hotelreviews_df.info()
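One detail of the rounding step is worth knowing: pandas follows NumPy in rounding exact halves to the nearest even integer. The sketch below uses a made-up `toy_scores` Series (an assumption for illustration) to show the effect.

```python
import pandas as pd

# Toy scores; 2.5 and 8.5 sit exactly halfway between two integers.
toy_scores = pd.Series([2.5, 7.1, 8.5, 9.6, 10.0])

# Same operation as in the notebook: round, then cast to int.
rounded = toy_scores.round(0).astype(int)
print(rounded.tolist())
```

Here 2.5 becomes 2 and 8.5 becomes 8, so a score that sits exactly on a half is not always rounded up.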

Given the large amount of data we have, I would assume a fairly even, balanced distribution of the data among the different reviewer scores ranging from 0 to 10. Let us confirm this by visualizing the data.

So what is the actual distribution of the reviews...

The best way to view this data is to visualize it by creating a histogram and a boxplot which can be seen below:

In [None]:
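# Set up a 1x2 subplot grid and keep the axes so each plot is drawn on its own panel.
fig, axes = plt.subplots(1, 2, figsize=(20, 5))

# Plot out the histogram with one bin per integer score (bin edges centred on 0 to 10).
axes[0].hist(hotelreviews_df['Reviewer_Score'], bins=np.arange(-0.5, 11.5, 1))
axes[0].set_title('Reviewer Score Distribution')
axes[0].set_xlabel('Reviewer Score')
axes[0].set_ylabel('Frequency')

# Plot the boxplot. We can use the seaborn boxplot code for this.
# Passing the column as x= keeps the call valid in recent seaborn versions.
sns.boxplot(x=hotelreviews_df['Reviewer_Score'], color='orange', ax=axes[1])
axes[1].set_title('Reviewer Score Distribution')

plt.show()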

A potential problem with this distribution is that it is heavily imbalanced and skewed toward the high scores. Over 175,000 reviews have a `Reviewer_Score` of 10, compared with about 110,000 reviews with a score of 8 and about 100,000 with a score of 9, and every other score has fewer reviews still. In order to balance out the distribution, we may need to group certain values together.
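The imbalance can also be expressed as proportions rather than read off a plot. This is a small sketch on made-up scores (`toy_scores` is an assumption, not the real column).

```python
import pandas as pd

# Toy scores that are dominated by the top value, as in the real data.
toy_scores = pd.Series([10, 10, 10, 10, 9, 9, 8, 8, 8, 5])

# Share of reviews per score, ordered by score.
shares = toy_scores.value_counts(normalize=True).sort_index()
print(shares)
```

Applied to `hotelreviews_df['Reviewer_Score']`, the same call would give the share of reviews at each score.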

This dataset has a good mix of numeric and non-numeric columns, so we can identify which columns are numeric, which are non-numeric, and which non-numeric columns can be converted to numeric. Let's take a look at the data again:

In [None]:
# Take a peek at the data.
hotelreviews_df.head()

In [None]:
# Get further insight into our dataset.
hotelreviews_df.info()

Above, we can see that the following columns listed below are non-numeric: 

- Hotel_Address
- Review_Date
- Hotel_Name
- Reviewer_Nationality
- Negative_Review
- Positive_Review
- Tags
- days_since_review

While the following columns listed below are numeric: 

- Additional_Number_of_Scoring
- Average_Score
- Review_Total_Negative_Word_Counts
- Total_Number_of_Reviews
- Review_Total_Positive_Word_Counts
- Total_Number_of_Reviews_Reviewer_Has_Given
- Reviewer_Score
- lat
- lng

Based on this, we can deduce that two of the non-numeric columns are candidates for conversion to numeric. The `days_since_review` column is currently an object and can be turned into an integer. The `Tags` column is also an object, but it could possibly be separated into several different columns, which could then be converted from non-numeric to numeric.
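As a rough illustration of how `Tags` could be separated into numeric columns, the sketch below parses the list-like strings and builds one indicator column per tag. The `toy_tags` values are assumptions modelled on the rows shown earlier, and this is only one possible approach, not a step this notebook performs.

```python
import ast
import pandas as pd

# Toy values shaped like the 'Tags' strings in the preview above.
toy_tags = pd.Series([
    "[' Leisure trip ', ' Couple ', ' Suite ']",
    "[' Leisure trip ', ' Solo traveler ']",
])

# Parse each string into a Python list and trim the padding spaces.
parsed = toy_tags.apply(lambda s: [t.strip() for t in ast.literal_eval(s)])

# Build one 0/1 indicator column per distinct tag.
tag_dummies = parsed.str.join("|").str.get_dummies(sep="|")
print(tag_dummies)
```

On the full dataset this would likely create a very large number of columns, so some filtering of rare tags would probably be needed.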

## 2. Data Wrangling

#### 1. Converting the `Reviewer_Score` Column into Binary Column

Review scores can in principle take any value between 0 and 10. Taking a look at the `Reviewer_Score` column, we can see that our dataset currently does not have any review scores of 0 or 1; the values range from 2 to 10, as shown below:

In [None]:
# Unique values within the 'Reviewer_Score' column.
reviewvalues = hotelreviews_df['Reviewer_Score'].unique()
sorted(reviewvalues)

In the plot below, we can see that the distribution across the different review scores is unbalanced: over 175,000 reviews have a `Reviewer_Score` of 10, compared with about 110,000 reviews with a score of 8 and about 100,000 with a score of 9.

In [None]:
# Visualise the total counts across the different Reviewer Scores.
sns.countplot(x=hotelreviews_df['Reviewer_Score'])
plt.show()

Given this information, we can proceed with converting the `Reviewer_Score` column into a binary column in the following way:

- Reviews that are below 9 should be encoded as 0 ('not good'). 
    
    `Reviewer_Score` < 9

- Reviews with scores 9 and 10 as 1 ('good').

    `Reviewer_Score` >= 9

In [None]:
# Converting 'Reviewer_Score' column into a binary column: 1 ('good') for scores of 9 or 10, 0 ('not good') otherwise.
# A single vectorized comparison avoids writing into `.values`, which may not reliably update the DataFrame.
hotelreviews_df['Reviewer_Score'] = (hotelreviews_df['Reviewer_Score'] >= 9).astype(int)
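The encoding rule can be checked on a handful of made-up scores before trusting it on the full column. `toy_scores` below is an assumption used purely for illustration.

```python
import pandas as pd

# Toy integer scores spanning both sides of the threshold.
toy_scores = pd.Series([2, 5, 8, 9, 10])

# Scores of 9 and 10 map to 1 ('good'); everything below 9 maps to 0 ('not good').
toy_binary = (toy_scores >= 9).astype(int)
print(toy_binary.tolist())
```

A score of 8 falls in the 'not good' class and 9 in the 'good' class, matching the thresholds listed above.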

In [None]:
# View the dataset with new binary 'Reviewer_Score' column.
hotelreviews_df.head()

In [None]:
# Visualise the total counts across the different Reviewer Scores again, this time with the new binary values.
sns.countplot(x=hotelreviews_df['Reviewer_Score'])
plt.show()

As per the above visualization, we can see that we now have a more even, balanced distribution within our data.

#### 2. Converting Non-Numeric Columns to Numeric

In [None]:
# Take another look at our data.
hotelreviews_df.info()

Looking at our data now, we can see that we still have several non-numeric columns, as shown below (excluding `Negative_Review` and `Positive_Review`):
    
- Hotel_Address
- Review_Date 
- Hotel_Name 
- Reviewer_Nationality  
- Tags
- days_since_review

Now we can proceed with converting the `days_since_review` column from its current object datatype into a numeric column. Each value is a string made up of a number followed by the word 'days' (for example, `3 days`), so we can remove the text portion, which is not needed, and convert the remaining number of days to an integer.

In [None]:
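# Using str.strip to remove any leading/trailing ' ', 'd', 'a', 'y' and 's' characters (the argument is treated as a set of characters), then converting to integer with astype.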
hotelreviews_df['days_since_review'] = hotelreviews_df['days_since_review'].str.strip(' days').astype(int)

In [None]:
# Confirming datatype of 'days_since_review' column.
hotelreviews_df['days_since_review'].dtype
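An alternative that does not depend on which characters surround the number is to extract the digits with a regular expression. The sketch below uses a made-up `toy_days` Series; the singular `1 day` entry is an assumption to show that the pattern handles it.

```python
import pandas as pd

# Toy values shaped like the 'days_since_review' strings.
toy_days = pd.Series(["0 days", "3 days", "1 day", "10 days"])

# Capture the run of digits and cast it to an integer.
days_numeric = toy_days.str.extract(r"(\d+)", expand=False).astype(int)
print(days_numeric.tolist())
```

Both approaches give the same result on values of this shape; the regular expression simply states the intent (keep the number) more directly.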

Moving on to our other non-numeric columns, the `Hotel_Address` column will be dropped along with `Hotel_Name`, as these cannot sensibly be converted into numerical values given that there are far too many different hotel addresses and hotel names.

Although we would normally be able to convert `Review_Date` into a numeric column by converting it to a quarterly representation, our data spans multiple years, so a quarter alone would not give us an accurate representation of the review date. For example, we can convert a date into a value from 1 to 4 representing Q1 to Q4 within a given year; however, this would mean we no longer know which year that quarter belongs to. Therefore we will also be dropping the `Review_Date` column.
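The ambiguity of a quarter-only encoding can be seen on a few made-up dates. In this sketch `toy_dates` is an assumption, written in the month/day/year format that appears in the preview of the data.

```python
import pandas as pd

# Toy dates from different years, in m/d/Y format.
toy_dates = pd.Series(["8/3/2017", "8/3/2016", "12/31/2015"])
parsed = pd.to_datetime(toy_dates, format="%m/%d/%Y")

# Quarter alone: the 2017 and 2016 dates become indistinguishable.
quarter_only = parsed.dt.quarter
print(quarter_only.tolist())

# The year is what would be needed alongside the quarter to tell them apart.
print(parsed.dt.year.tolist())
```

The first two dates both map to quarter 3 even though they are a year apart, which is the loss of information described above.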

The `Reviewer_Nationality` column could be encoded by assigning a unique number to each country. However, there are a lot of different countries, which would produce a long list of codes that could not easily be traced back to the countries they represent. Hence, we will also drop this column.
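For reference, this is roughly what assigning a number to each country would look like. The sketch uses `pd.factorize` on a made-up `toy_nationality` Series; both the values and the choice of `factorize` are assumptions for illustration.

```python
import pandas as pd

# Toy nationalities, padded with spaces to mimic raw text values.
toy_nationality = pd.Series([" Russia ", " Ireland ", " Russia ", " Australia "])

# factorize returns one integer code per row plus the lookup of unique values.
codes, countries = pd.factorize(toy_nationality.str.strip())
print(list(codes))
print(list(countries))
```

The codes are arbitrary labels in order of first appearance, so a model could read a ranking into them that does not exist, which is a further reason to be cautious with this encoding.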

`Tags` can be very useful data; however, it is very messy with regard to order and organization. Although we can split up each string within a given `Tags` row, there is quite a bit of missing data, and a lot of organization and cleaning would need to take place in order to arrange the data appropriately. For this reason we will also drop this column.

To summarize, the following non-numeric columns will be dropped seeing as we cannot convert them into numerical values:

- Hotel_Address
- Review_Date 
- Hotel_Name 
- Reviewer_Nationality  
- Tags

In [None]:
# Dropping the remaining non-numeric columns.
hotelreviews_df = hotelreviews_df.drop(['Hotel_Address', 'Review_Date', 'Hotel_Name', 'Reviewer_Nationality', 'Tags'], axis=1)
hotelreviews_df

#### 3. Splitting the Data

Now that we have cleaned and organized our data, we can proceed with splitting it into our Train and Test sets.

In [None]:
# Assigning our features to X.
X = hotelreviews_df.drop(['Reviewer_Score'], axis=1)

# Assigning our target to y. 
y = hotelreviews_df['Reviewer_Score']

# Check 
display(X)
print(y)

In [None]:
# Creating our training and test sets, 20% test size, random state of 5.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5, stratify=y)

# Check 
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
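The `stratify=y` argument keeps the class proportions the same in both sets. The sketch below illustrates this on synthetic data; `X_toy` and `y_toy` are assumptions, not the hotel review features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60% class 0 and 40% class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# Same settings as in the notebook: 20% test size, random state of 5, stratified on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5, stratify=y_toy
)

# Share of class 1 in each split.
print(y_tr.mean(), y_te.mean())
```

Both shares should sit at or very near 0.4, mirroring the full synthetic target.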

#### 4. Using CountVectorizer to combine Positive_Review and Negative_Review with the numeric data

First we must instantiate, fit, and transform our X_train data using the CountVectorizer, using the min_df parameter to ignore words that occur too rarely to be considered meaningful. Let's start with the `Positive_Review` column:

In [None]:
# 1. Instantiate 
bagofwords = CountVectorizer(min_df=500)

# 2. Fit 
bagofwords.fit(X_train["Positive_Review"])

# 3. Transform
X_train_positive_transformed = bagofwords.transform(X_train["Positive_Review"])
X_train_positive_transformed

We now have a sparse matrix for `Positive_Review` with a total of 412,590 rows and 999 columns. This means that there are 999 unique terms or tokens.

Next we need to convert this sparse matrix into a numpy array in order to later combine the sparse matrix with the numeric data.

In [None]:
# converting the sparse matrix into a numpy array
X_train_positive_transformed = X_train_positive_transformed.toarray()

In [None]:
X_train_positive_transformed

The same has to be done with the `Negative_Review` column as was done with the `Positive_Review` column:

In [None]:
# 1. Instantiate 
bagofwords = CountVectorizer(min_df=500)

# 2. Fit 
bagofwords.fit(X_train["Negative_Review"])

# 3. Transform
X_train_negative_transformed = bagofwords.transform(X_train["Negative_Review"])
X_train_negative_transformed

We now have a sparse matrix for `Negative_Review` with a total of 412,590 rows and 1199 columns. This means that there are 1199 unique terms or tokens.

Next we need to convert this sparse matrix into a numpy array in order to later combine the sparse matrix with the numeric data.

In [None]:
# converting the sparse matrix into a numpy array
X_train_negative_transformed = X_train_negative_transformed.toarray()

In [None]:
X_train_negative_transformed

We must now also do the same to our Test data. This is because when we train a model on some training data and want to test the same model, the testing data has to be in the exact same format as the training data. This means that the train and test data have to contain the same features.
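To illustrate why the vectorizer fitted on the training text is reused rather than re-fitted, here is a toy sketch. The `train_text` and `test_text` lists are made-up assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy review snippets; 'pool' only ever appears in the test text.
train_text = ["great location", "great staff", "friendly staff"]
test_text = ["great pool", "friendly staff"]

# Fit on the training text only, then reuse the fitted vocabulary for the test text.
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_text)
test_matrix = vectorizer.transform(test_text)

print(train_matrix.shape, test_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Both matrices have the same number of columns, and the unseen word 'pool' is simply ignored in the test set instead of creating a new feature.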

In [None]:
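# `bagofwords` was re-fitted on 'Negative_Review' above, so fit a separate vectorizer on the training 'Positive_Review' text.
bagofwords_positive = CountVectorizer(min_df=500).fit(X_train["Positive_Review"])
# Transform only the test 'Positive_Review' text (not the whole DataFrame) and convert to a numpy array.
X_test_positive_transformed = bagofwords_positive.transform(X_test["Positive_Review"]).toarray()
X_test_positive_transformed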

In [None]:
# Use the bag-of-words vectorizer fitted to the training 'Negative_Review' text to transform the test 'Negative_Review' text, then convert to a numpy array.
X_test_negative_transformed = bagofwords.transform(X_test["Negative_Review"]).toarray()
X_test_negative_transformed

Lastly, we need to combine the three matrices (numeric data, positive matrix, and negative matrix):

In [None]:
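# Drop the raw text columns so that only the numeric features are combined with the two count arrays.
X_train_combined = np.concatenate((X_train.drop(['Positive_Review', 'Negative_Review'], axis=1), X_train_positive_transformed, X_train_negative_transformed), axis=1)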

Let us do the same again for the Test data.

In [None]:
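# Same for the test set; the transformed test matrices must be dense arrays here (call .toarray() on them first if they are still sparse).
X_test_combined = np.concatenate((X_test.drop(['Positive_Review', 'Negative_Review'], axis=1), X_test_positive_transformed, X_test_negative_transformed), axis=1)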
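Converting count matrices of this size to dense arrays can use a lot of memory (several hundred thousand rows by roughly a thousand columns each). As a hedged alternative, the pieces can be kept sparse and stacked with `scipy.sparse.hstack`. The sketch below uses tiny made-up arrays; `numeric`, `positive_counts` and `negative_counts` are assumptions standing in for the real matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy numeric features (2 rows, 2 columns).
numeric = np.array([[7.7, 194.0], [8.4, 250.0]])

# Toy bag-of-words counts for the positive and negative reviews.
positive_counts = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
negative_counts = csr_matrix(np.array([[0, 1], [3, 0]]))

# Stack everything column-wise while staying in sparse format.
combined = hstack([csr_matrix(numeric), positive_counts, negative_counts]).tocsr()
print(combined.shape)
```

Many scikit-learn estimators, including `LogisticRegression`, accept sparse input directly, so the combined matrix would not need to be densified for them.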

#### 5. What does the min_df parameter do?

As seen above, adding a min_df parameter to our count vectorizer allows us to exclude any token that occurs in fewer than min_df documents (when min_df is an integer; a float is interpreted as a proportion of documents). It is used to ignore words that occur too rarely to be considered meaningful. For example, in the `Positive_Review` or `Negative_Review` columns we may have a word that appears in only one or two rows. Depending on how large our dataset is, we may treat such a word as noise and eliminate it from further analysis by using the min_df parameter. I used min_df=500 above; the higher the value of min_df, the fewer columns there will be in the resulting sparse matrix.
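The effect of min_df can be seen on a tiny made-up corpus. In this sketch `corpus` is an assumption; only the word 'great' appears in more than one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: 'great' occurs in three documents, every other word in exactly one.
corpus = ["great location", "great staff", "great breakfast", "noisy room"]

# min_df=1 keeps every token; min_df=2 keeps only tokens found in at least two documents.
vocab_all = sorted(CountVectorizer(min_df=1).fit(corpus).vocabulary_)
vocab_min2 = sorted(CountVectorizer(min_df=2).fit(corpus).vocabulary_)

print(vocab_all)
print(vocab_min2)
```

Raising min_df from 1 to 2 shrinks the vocabulary from six tokens to one, which is the same mechanism that limits the number of columns in the matrices built above.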