## Trip Advisor Hotel Reviews

### About this dataset
- Hotels play a crucial role in traveling, and with increased access to information, new ways of selecting the best ones have emerged.
- With this dataset of 20k reviews crawled from Tripadvisor, you can explore what makes a great hotel and maybe even use this model in your travels!


### Endeavour

- 1- Analyze and explore the data
- 2- Clean the text
- 3- Build a Machine Learning model / classification



In [None]:
# Importing the basic libraries for analysis

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("ggplot")  #using style ggplot

%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import plotly.graph_objects as go
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import re

In [None]:
#Importing the dataset
df =pd.read_csv("../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv")


# look at the first rows of the dataset
df.head()

In [None]:
# check the shape of the dataset
df.shape

- We have 20K Reviews

In [None]:
#Checking the dtypes of all the columns

df.info()

In [None]:
# checking for null values
df.isna().sum()

- There are no missing values

In [None]:
# summary statistics of the Rating column
df["Rating"].describe().round(2)

- The average rating is around 4, which is good

In [None]:
# check for duplicate rows
df.duplicated().sum()

- There are no duplicate rows

In [None]:
# number of reviews per rating (recent seaborn versions need the column passed as x=)

sns.countplot(x=df['Rating'])
plt.show()

### Percentage of each rating
- 5 = 44%
- 4 = 30%
- 3 = 10%
- 2 = 9%
- 1 = 7% 
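
The count plot shows raw counts, so the percentages have to be computed separately. A minimal sketch of one way to do that with `value_counts(normalize=True)`; the `ratings` series below is made-up toy data standing in for `df["Rating"]`, not the real dataset:

```python
import pandas as pd

# Toy ratings standing in for df["Rating"] (made-up values, not the real data).
ratings = pd.Series([5, 5, 5, 4, 4, 3, 2, 1, 5, 4], name="Rating")

# normalize=True returns fractions instead of counts; multiply by 100 for percent.
rating_pct = (
    ratings.value_counts(normalize=True)
    .mul(100)
    .sort_index(ascending=False)
    .round(1)
)
print(rating_pct)
```

On the notebook's data, replacing `ratings` with `df["Rating"]` should give the percentages listed above.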

In [None]:
# Length of each review (in characters, since len() is applied to the string)
df['Length'] = df['Review'].apply(len)
df.head()

In [None]:
# summary statistics of the Length column
df["Length"].describe().round(2)

In [None]:
# plot the relationship between Rating and Length

plt.figure(figsize=(15,7))
sns.lineplot(data=df,x="Rating", y="Length")

In [None]:
px.scatter(df,x="Rating",y="Length", color="Rating")

- Review length appears to be related to the rating
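
A plot alone makes this relationship hard to quantify. One possible check, sketched here on made-up numbers (the `toy` frame is hypothetical and does not reflect the real reviews), is to summarize length per rating and look at the correlation:

```python
import pandas as pd

# Made-up lengths per rating, only to show the mechanics.
toy = pd.DataFrame({
    "Rating": [1, 1, 2, 3, 4, 5, 5],
    "Length": [900, 700, 650, 600, 500, 400, 300],
})

# Mean, median and count of review length for each rating.
length_by_rating = toy.groupby("Rating")["Length"].agg(["mean", "median", "count"])
print(length_by_rating)

# Pearson correlation between rating and length (negative here by construction).
corr = toy["Rating"].corr(toy["Length"])
print(round(corr, 2))
```

On the notebook's data the same two calls can be run on `df` directly; the sign and size of the correlation there is something to check, not something this sketch asserts.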

In [None]:
df_5=df[df["Rating"]==5]
df_5

In [None]:
# the most frequent words in reviews with rating 5

plt.figure(figsize=(15,15))
wc1 = WordCloud(max_words=1000, min_font_size=10, 
                height=800,width=800,background_color="orange").generate(' '.join(df_5['Review']))

plt.imshow(wc1)

### Words that stand out in satisfied reviews (rating 5)
- In general: hotel - room - night - beach - restaurant, food and drink - bed - pool - location.


In [None]:
df_4=df[df["Rating"]==4]
df_4

In [None]:
# the most frequent words in reviews with rating 4

plt.figure(figsize=(15,15))
wc2 = WordCloud(max_words=1000, min_font_size=10, 
                height=800,width=800,background_color="violet").generate(' '.join(df_4['Review']))

plt.imshow(wc2)

### Satisfied reviews (rating 4) use the same words as rating 5, plus ...
- beautiful hotel - staff friendly - service - street.


In [None]:
df_3=df[df["Rating"]==3]
df_3

In [None]:
# the most frequent words in reviews with rating 3

plt.figure(figsize=(15,15))
wc3 = WordCloud(max_words=1000, min_font_size=10, 
                height=800,width=800,background_color="lime").generate(' '.join(df_3['Review']))

plt.imshow(wc3)

In [None]:
df_2=df[df["Rating"]==2]
df_2

In [None]:
# the most frequent words in reviews with rating 2

plt.figure(figsize=(15,15))
wc4 = WordCloud(max_words=1000, min_font_size=10, 
                height=800,width=800,background_color="black").generate(' '.join(df_2['Review']))

plt.imshow(wc4)

### Unsatisfied reviews (rating 2) use the same words as ratings 5 & 4, plus ...
- hotel - staff - beach - service - desk - stay - shower

In [None]:
df_1=df[df["Rating"]==1]
df_1

In [None]:
# the most frequent words in reviews with rating 1

plt.figure(figsize=(15,15))
wc5 = WordCloud(max_words=1000, min_font_size=10, 
                height=800,width=800,background_color="gray").generate(' '.join(df_1['Review']))

plt.imshow(wc5)

### Unsatisfied reviews (rating 1) use the same words as ratings 5 & 4, plus ...
- room - hotel - place - staff - door - check in - sleep - toilet - resort - water.
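
Word clouds are hard to compare across ratings because font size is only a rough guide to frequency. As a complement, exact counts can be tabulated with `collections.Counter`. This is a minimal sketch on invented reviews; `toy_reviews` is a hypothetical stand-in for something like `df_1['Review']`:

```python
from collections import Counter

# Invented reviews standing in for the reviews of one rating group.
toy_reviews = [
    "great hotel great location friendly staff",
    "nice hotel clean room",
    "room small hotel noisy",
]

# Split on whitespace and count every word across all reviews.
word_counts = Counter(" ".join(toy_reviews).split())
top_words = word_counts.most_common(3)
print(top_words)
```

Running the same count per rating group gives tables that can be compared side by side, which makes it easier to see which words are really specific to low or high ratings.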


# Analysis

#### The dataset has 20K hotel reviews, with ratings from 1 to 5

##### Percentage of each rating in the dataset
- 5 = 44%
- 4 = 30%
- 3 = 10%
- 2 = 9%
- 1 = 7%


### Review length appears to be related to the rating.

## Rating 5 = 44% -> satisfied
#### Words that stand out in satisfied reviews (rating 5)
- In general: hotel - room - night - beach - restaurant, food and drink - bed - pool - location.

## Rating 4 = 30% -> satisfied

#### Satisfied reviews (rating 4) use the same words as rating 5, plus ...
- beautiful hotel - staff friendly - service - street.


## Rating 2 = 9% -> unsatisfied
#### Unsatisfied reviews (rating 2) use the same words as ratings 5 & 4, plus ...
- hotel - staff - beach - service - desk - stay - shower

## Rating 1 = 7% -> unsatisfied
#### Unsatisfied reviews (rating 1) use the same words as ratings 5 & 4, plus ...
- room - hotel - place - staff - door - check in - sleep - toilet - resort - water.


## Around 74% of reviewers are satisfied (ratings 4 and 5: 44% + 30%)



# 2- Cleaning the text for ML


In [None]:
df.head()

In [None]:
# first review
a=df.iloc[0,0]
a

- First I clean a single review step by step; then I wrap the steps in a function and apply it to every review

In [None]:
# import library for Natural Language Toolkit

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


In [None]:
# replace any symbol with a space, convert letters to lowercase, and split into words

a=re.sub('[^a-zA-Z0-9]',' ',a)
a=a.lower().split()
a

In [None]:
# download the stopwords from the NLTK library
nltk.download('stopwords')


In [None]:
# show the English stopwords

sw=set(stopwords.words('english'))
print(sw)

In [None]:
# keep only the words of the review that are not stopwords

clean_word=[i for i in a if i not in sw]
clean_word

In [None]:
# now join the list of words back into a sentence
sen=' '.join(clean_word)
sen
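
The steps so far (strip symbols, lowercase, split, drop stopwords, rejoin) can be tried end to end on a made-up sentence without downloading anything. In this sketch `toy_stopwords` is a tiny hand-picked set used as a stand-in for NLTK's English list, and `clean_review` is a hypothetical helper name, not part of the notebook:

```python
import re

# Tiny hand-picked stop-word set; an assumption standing in for NLTK's English list.
toy_stopwords = {"the", "was", "and", "a", "but", "is", "to", "it"}

def clean_review(text, stop_words):
    """Keep letters and digits, lowercase, drop stop words, rejoin with spaces."""
    tokens = re.sub(r"[^a-zA-Z0-9]", " ", text).lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

sample = "The room was CLEAN, and the staff was friendly!"
cleaned = clean_review(sample, toy_stopwords)
print(cleaned)  # room clean staff friendly
```

With NLTK's full list the result on real reviews will differ, since that list contains many more words than this toy set.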

In [None]:
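# define a function to clean all reviews in the dataset


def text_preprocessing(a):
  # keep letters only (digits are dropped here, unlike in the single-review cell)
  a=re.sub('[^a-zA-Z]',' ',a)
  a=a.lower().split()
  # drop stopwords and reduce each remaining word to its stem
  ps=PorterStemmer()
  clean_word=[ps.stem(i) for i in a if i not in sw]
  sen=' '.join(clean_word)
  return sen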

In [None]:
# add a new column with the cleaned reviews

df['clean_word']=df["Review"].apply(text_preprocessing)
df.head()


In [None]:
# Length of each cleaned review (in characters)
df['Length 2'] = df['clean_word'].apply(len)
df.head()

In [None]:
df.describe().round(2)

- The cleaned reviews are shorter than the originals (compare Length 2 with Length)

# 3- Building a Machine Learning Model / classification


In [None]:
# Importing the basic libraries for building the classification models

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score,r2_score


from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import  KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import  MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer 

from sklearn.preprocessing import LabelEncoder,StandardScaler


In [None]:
# convert the cleaned text into numeric arrays of word counts using CountVectorizer

cv=CountVectorizer()
X=cv.fit_transform(df["clean_word"]).toarray()
y=df["Rating"]
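
To make the output of `CountVectorizer` concrete, here is a small sketch on two invented documents (`toy_docs` is made up for illustration). Each column is one vocabulary word and each row holds the word counts of one document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents, only to show the shape of the result.
toy_docs = ["good room", "bad room room"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_docs)  # sparse matrix of word counts

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)            # ['bad', 'good', 'room']
print(toy_X.toarray())  # [[0 1 1]
                        #  [1 0 2]]
```

On 20K reviews the vocabulary has many thousands of columns, so the dense array from `.toarray()` can use a lot of memory. Many scikit-learn classifiers, including `LogisticRegression` and `MultinomialNB`, also accept the sparse matrix directly.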

In [None]:
# the shape
print("X shape: ", X.shape)
print("y shape: ", y.shape)

In [None]:
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("X Train : ", X_train.shape)
print("X Test  : ", X_test.shape)
print("Y Train : ", y_train.shape)
print("Y Test  : ", y_test.shape)
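
In the cells above the vectorizer learns its vocabulary from all reviews before the split, so the test reviews influence the features. An alternative, shown here only as a sketch on invented data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from the training part only. The names `toy_texts`, `toy_labels` and `model` are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, already-cleaned reviews with made-up ratings.
toy_texts = [
    "great room friendli staff", "love hotel clean room",
    "nice pool great food", "perfect locat great staff",
    "dirti room rude staff", "bad food noisi room",
    "terribl servic broken shower", "rude staff dirti toilet",
]
toy_labels = [5, 5, 5, 5, 1, 1, 1, 1]

# Split the text itself, keeping the class balance in both parts.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits CountVectorizer on the training text only,
# then reuses that vocabulary when predicting on the test text.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(txt_train, lab_train)
toy_pred = model.predict(txt_test)
print(toy_pred)
```

The pipeline also keeps the count matrix sparse, which avoids the memory cost of a dense array.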

In [None]:
# create an instance of each classifier

LR = LogisticRegression()
DTR = DecisionTreeClassifier()
RFR = RandomForestClassifier()
KNR = KNeighborsClassifier()
NB = MultinomialNB()

In [None]:
# fit each classifier and record its test accuracy (in percent)

li = [LR,DTR,RFR,KNR,NB]
d = {}
for i in li:
    i.fit(X_train,y_train)
    ypred = i.predict(X_test)
    acc = accuracy_score(y_test,ypred)*100  # same value as i.score(X_test,y_test)*100
    print(i,":",acc)
    d[str(i)] = acc

In [None]:
# plot the accuracy of each algorithm

plt.figure(figsize=(30, 6))
plt.title("Algorithm vs Accuracy")
plt.xlabel("Algorithm")
plt.ylabel("Accuracy")
# convert the dict views to lists so matplotlib receives plain sequences
plt.plot(list(d.keys()),list(d.values()),marker='o',color='blue')
plt.show()
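
Because ratings 4 and 5 make up most of the data, accuracy alone can hide poor performance on the rare low ratings. A confusion matrix is one possible complement. The sketch below uses invented labels (`toy_true` and `toy_pred` are made up), not predictions from the models above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted ratings: the "model" predicts 5 almost everywhere.
toy_true = [5, 5, 5, 5, 4, 4, 3, 1]
toy_pred = [5, 5, 5, 5, 5, 4, 5, 5]

toy_acc = accuracy_score(toy_true, toy_pred)
print(toy_acc)  # 0.625

# Rows are true ratings, columns are predicted ratings, in the order of `labels`.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[1, 3, 4, 5])
print(toy_cm)
```

In this toy case the accuracy looks acceptable although ratings 1 and 3 are never predicted correctly, which the matrix makes visible at a glance.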