# Natural Language Processing using Multinomal Naive Bayes Algorithm
<b><i> Text Classification on TripAdvisor's Online Hotel Reviews

### INTRODUCTION

**Description:** There are two parts to this project. Both parts of the project will use text classification algorithms to predict a desired result by training datasets and using machine learning algorithms. 
 - Train a ML model to determine if a review of a hotel is positive or negative. The data used for training comes from <i>Trip Advisor's Online Hotel</i> Reviews. 
 - The dataset consists of 25,000 hotel reviews and their ratings from 1, most negative, to 5, most positive, reviews. 

**Sources**:
* <b>Part A:</b> Alam, M. H., Ryu, W.-J., Lee  -- ([source](https://zenodo.org/record/1219899#.Y9Y_N9JBwUE))

#### IMPORTING LIBRARIES & DATASET

In [63]:
# import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import random
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import nltk

# modules for preprocessing: tokenizing, stemmer, stopwords
from nltk.tokenize import RegexpTokenizer
# separate text into words
from nltk.tokenize import word_tokenize  
# stopwords = an, am, the 
from nltk.corpus import stopwords
# stem words to get only root. 
from nltk.stem import PorterStemmer

# countVectorizer to change X dataset to vector of numbers
from sklearn.feature_extraction.text import CountVectorizer

<b><u>The Dataset from Kaggle contains extra columns and is unsorted.<br> We will do a preliminary reading of the dataset and clean the dataset to be usable for training.</u></b>

In [64]:
hotel_review_prelim = pd.read_csv('hotel_review_kaggle.csv', encoding = 'ISO-8859-1')
print()
print("Trip Advisor's Online Hotel Reviews:")
hotel_review_prelim.head(5)


Trip Advisor's Online Hotel Reviews:


Unnamed: 0,S.No.,Review,Rating
0,1,nice hotel expensive parking got good deal sta...,4
1,2,ok nothing special charge diamond member hilto...,2
2,3,nice rooms not 4* experience hotel monaco seat...,3
3,4,unique \tgreat stay \twonderful time hotel mon...,5
4,5,great stay great stay \twent seahawk game awes...,5


<b>Clean Up:

In [65]:
# remove unnecessary columns
hotel_review_prelim = hotel_review_prelim.drop(columns = 'S.No.')
hotel_review_prelim.columns = hotel_review_prelim.columns.str.lower()
print("Hotel Review DataFrame Shape:", hotel_review_prelim.shape)

# remove any NaN's
hotel_review_DataFrame = hotel_review_prelim.dropna()
print("Trip Advisor's Online Hotel Reviews [cleaned]:")

hotel_review_DataFrame

Hotel Review DataFrame Shape: (20491, 2)
Trip Advisor's Online Hotel Reviews [cleaned]:


Unnamed: 0,review,rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,unique \tgreat stay \twonderful time hotel mon...,5
4,great stay great stay \twent seahawk game awes...,5
...,...,...
20486,best kept secret 3rd time staying charm \tnot ...,5
20487,great location price view hotel great quick pl...,4
20488,ok just looks nice modern outside \tdesk staff...,2
20489,hotel theft ruined vacation hotel opened sept ...,1


<b><u>There are total of 20,491 Hotel Reviews with their rating. <br>Let's inspect the number of occurences of each rating to see the distribution of the dataset. </u></b>

In [66]:
rating_DF = pd.DataFrame(hotel_review_df.rating.value_counts())
rating_DF.columns = ['Total Count']
display(rating_DF)
print(rating_DF.sum())

Unnamed: 0,Total Count
5,9054
4,6039
3,2184
2,1793
1,1421


Total Count    20491
dtype: int64


<b><u>Our objective in creating a training dataset is to differentiate between positive and negative reviews. Therefore we have to differentiate between the high reviews and low reviews. <br><br>As seen above, in this specific dataset the Total Count for 5 star rating is much higher than the count for 1 star ratings. In order to create a quality and equal training dataset, we can use 1 star or 2 stars to differentiate between a Positive review and a Negative review.</u></b><br>

In [72]:
# boolean indexing on rating column
hotel_rating_DataFrame = hotel_review_DataFrame[(hotel_review_DataFrame['rating'] == 1) | (hotel_review_DataFrame['rating'] == 2)| (hotel_review_DataFrame['rating'] == 5)]

# reset index to be consecutive
hotel_rating_DataFrame = hotel_rating_DataFrame.reset_index(drop=True)
hotel_rating_DataFrame

Unnamed: 0,review,rating
0,ok nothing special charge diamond member hilto...,2
1,unique \tgreat stay \twonderful time hotel mon...,5
2,great stay great stay \twent seahawk game awes...,5
3,love monaco staff husband stayed hotel crazy w...,5
4,cozy stay rainy city \thusband spent 7 nights ...,5
...,...,...
12263,not impressed unfriendly staff checked asked h...,2
12264,best kept secret 3rd time staying charm \tnot ...,5
12265,ok just looks nice modern outside \tdesk staff...,2
12266,hotel theft ruined vacation hotel opened sept ...,1


<b><u>After filtering out just the positive and negative reviews, there are only 12,268 reviews that fit those two categories. We can change the 2 star ratings to 1 star ratings as they both mean negative review.

In [73]:
change_stars = {2:1}
hotel_rating_DataFrame.rating = hotel_rating_DataFrame.rating.replace(change_stars)
print("Unique Rating and its number of occurrences (UPDATED):")
print(hotel_rating_DataFrame.rating.value_counts())

Unique Rating and its number of occurrences (UPDATED):
5    9054
1    3214
Name: rating, dtype: int64


Now we have the dataset ready to be pre-processed before it is used for training set. <br><u>We can create X and Y datasets to pre-process and get tokens from each text. </u>

In [76]:
# X dataset = review column , y = rating column
X = hotel_reviews_df.review
y = hotel_reviews_df.rating
y = hotel_reviews_df.rating
print("Shape of X dataset:", X.shape)
print("Shape of y dataset:", y.shape)

Shape of X dataset: (12268,)
Shape of y dataset: (12268,)


<u><b>Preprocess X: tokenize, filter stop words, and stem words, to get tokens from the strings.<br> This allow more accuracy as some texts might be misspelled or have errors. This improves performance of the NLP algorithm.<br>

In [79]:
# Preprocess X to get tokens from the strings

# tokenize words containing letters only : '[a-zA-Z]+' form tokens with only alpha sequences
tokenizer = RegexpTokenizer('[a-zA-Z]+') 
# stop words: a, the, is, are .. etc.
stop_words= set(stopwords.words("english"))
# root of differently same words
stemmer = PorterStemmer()

def preprocess(s) :
    # tokenizer regex '[a-zA-Z]+' by letters only
    w = tokenizer.tokenize(s.lower()) 
    # after its been tokenized, if w is not in stop_words, make a list of words(that are not stop_words)
    w = [word for word in w if word not in stop_words] 
    # find stem of each token
    w = [stemmer.stem(word) for word in w] 
    # .join() tokens back into a string
    return ' '.join(w)         

# with each row, preprocess the text string, and store all preprocessed rows in the X_processed list. 
# make a dataFrame out of that list
X_processed = pd.DataFrame([preprocess(X.loc[i]) for i in range(len(X))])
print("Processed Hotel Review DataFrame:")
X_processed.head(7)   

Processed Hotel Review DataFrame:


Unnamed: 0,0
0,ok noth special charg diamond member hilton de...
1,uniqu great stay wonder time hotel monaco loca...
2,great stay great stay went seahawk game awesom...
3,love monaco staff husband stay hotel crazi wee...
4,cozi stay raini citi husband spent night monac...
5,hotel stay hotel monaco cruis room gener decor...
6,excel stay hotel monaco past w e delight recep...


In [80]:
# convert X_processed dataset to numbers to use on model
# use countVectorizer 
# CountVectorizer counts the frequency of each word, 
# and returns a vector of numbers (features) that can be used by a machine learning model

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
# learns from X_processed dataset to get all the words(vocabulary) out 
vect.fit(X_processed[0])
# transforms X_processed[0] words(vocabulary) into vector of numbers which can be used to do data transformation
X_vectors = vect.transform(X_processed[0])
print('Shape of new X dataset:', X_vectors.shape)

Shape of new X dataset: (12268, 27552)


We can now use MultinomalNaive Bayes Model to train and test our training and testing datasets. 

In [83]:
# create training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X_vectors,y,test_size=0.2)

# using MultinomialNB() model
classifier = MultinomialNB()
# train 
classifier.fit(X_train, y_train)
# test the model
y_pred = classifier.predict(X_test)

#  find the accuracy of the MultinomialNB model 
print('Accuracy testing for the MultinomialNB model:')
print('Accuracy score:' ,metrics.accuracy_score(y_test, y_pred))
print('Confusion Matrix: \n', metrics.confusion_matrix(y_test, y_pred, labels=[0,1]))

Accuracy testing for the MultinomialNB model:
Accuracy score: 0.9474327628361858
Confusion Matrix: 
 [[  0   0]
 [  0 589]]


The Accuracy Score for the MutlinomalNB Model testing came out excellent. <br>We can test this model by creating our own reviews and testing the reviews on the model. 

In [94]:
# create DataFrame with positive and negative reviews

review = ['We had such a great time at this hotel. The room was kept very clean every night and the bed was extremely comfy!',
           'great time at this hotel!! the room was so clean and everything was comfy!! what a unique time at this place. we would definitely come back!',
           'great bed, great sheets. great staff, great food. great views, great people. everything amazing!',
           'We hated everything here. The check-in time was late. People next door were loud. The bed was dirty and disgusting!', 
           'I loved my weekend at this hotel. The lobby was beautiful and the staff were all friend.', 
           'I would recommended this place to my friends and family. The rooms were clean and the the price was reasonable.',
           'I am never coming back to this hotel.The staff were rude and and were not at the front desk half of the time.',
           'The place in Italy seemd okay but we quickly realized that they had a pest problem. We spotted mutliple cockroaches in the hallway. Will not be returning']

# expected rating
rating = [5,5,5,2,5,5,1,1]

# create dataframe with the 4 reviews
review_test = pd.DataFrame({'review' : review , 'expected rating' : rating})
review_test[['review']]


Unnamed: 0,review
0,We had such a great time at this hotel. The ro...
1,great time at this hotel!! the room was so cle...
2,"great bed, great sheets. great staff, great fo..."
3,We hated everything here. The check-in time wa...
4,I loved my weekend at this hotel. The lobby wa...
5,I would recommended this place to my friends a...
6,I am never coming back to this hotel.The staff...
7,The place in Italy seemd okay but we quickly r...


<b><u>We can see how well the model did and test it with the newly created Review_test DataFrame to show the predicted outcome.

In [95]:
# test model with new review_test dataFrame, show predicted outcome

# X_test = testing data: strings from reviews_test['review'] transform to vector of num
X_test = vect.transform(review_test['review'])
# test  
y_pred = classifier.predict(X_test)

# create new column in dataFrame of predicted value from testing the model with X_test
review_test['predicted outcome'] = y_pred
review_test

Unnamed: 0,review,expected rating,predicted outcome
0,We had such a great time at this hotel. The ro...,5,5
1,great time at this hotel!! the room was so cle...,5,5
2,"great bed, great sheets. great staff, great fo...",5,5
3,We hated everything here. The check-in time wa...,2,1
4,I loved my weekend at this hotel. The lobby wa...,5,5
5,I would recommended this place to my friends a...,5,5
6,I am never coming back to this hotel.The staff...,1,1
7,The place in Italy seemd okay but we quickly r...,1,1


<b><u>The model accurately predicted the positive or negative ratings for all new test reviews. 