## Zomato Rating.
DESCRIPTION

Using NLP and machine learning, make a model to predict the rating in a review based on the content of the text review. This will help identify cases with a mismatch.

### Problem Statement: 
Zomato is India’s largest platform for discovering restaurants and ordering food. It operates in India as well as a few cities internationally. Bangalore is one of the biggest customers and restaurant bases for Zomato with 4 to 5 million users using the platform each month.

Users on the platform can also post reviews of restaurants and provide a rating accompanying the review. The content in the reviews should ideally reflect the rating provided by the customer. In many cases, there is a mismatch, owing to multiple reasons, where the rating does not match the customer review. The reviews and rating match is very important as it builds customer trust on the platform and helps the user get an accurate picture of the restaurant. 

You, as a data scientist, need to enable the identification and cleanup of such cases to ensure the ratings reflect the reviews and that the reviews seem trustworthy to the customer. You will need to use NLP techniques in conjunction with machine learning models to predict the rating from the review text. 

In [67]:
import numpy as np
import pandas as pd
import matplotlib as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, RegexpStemmer,SnowballStemmer,WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.corpus import brown
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
import nltk
from sklearn.metrics.pairwise import cosine_similarity
import spacy
import string as str
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from nltk.corpus import wordnet as wn
from sklearn.linear_model import LogisticRegression
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.corpus import words
from nltk.probability import ConditionalFreqDist
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from nltk.stem import PorterStemmer,WordNetLemmatizer
from spellchecker import SpellChecker
PS = PorterStemmer()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error,mean_squared_log_error,r2_score,classification_report,confusion_matrix
import seaborn as sns
from collections import  Counter


import warnings
warnings.filterwarnings('ignore')

### Task <br>
![image.png](attachment:image.png)

In [68]:
Zomato_reviews = pd.read_csv('C:/Working Files/Mac ka folder/Simplilearn/NLP/Online Classes/Day7/Zomato_reviews.csv')

In [69]:
Zomato_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27762 entries, 0 to 27761
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rating       27762 non-null  float64
 1   review_text  27748 non-null  object 
dtypes: float64(1), object(1)
memory usage: 433.9+ KB


In [70]:
Zomato_reviews.isnull().sum()

rating          0
review_text    14
dtype: int64

In [71]:
Zomato_reviews.dropna(inplace=True)

In [72]:
Zomato_reviews.isnull().sum()

rating         0
review_text    0
dtype: int64

### Set of EDA steps.. 
1. decontracted
2. textPreprocessing
3. Spell_correct

In [73]:
def decontracted(phrase):
    phrase = re.sub(r"ain't", "am not", phrase)
    phrase = re.sub(r"aren't", "are not", phrase)
    phrase = re.sub(r"can't", "cannot", phrase)
    phrase = re.sub(r"can't've", "cannot have", phrase)
    phrase = re.sub(r"'cause", "because", phrase)
    phrase = re.sub(r"could've", "could have", phrase)
    phrase = re.sub(r"couldn't", "could not", phrase)
    phrase = re.sub(r"couldn't've", "could not have", phrase)
    phrase = re.sub(r"didn't", "did not", phrase)
    phrase = re.sub(r"doesn't", "does not", phrase)
    phrase = re.sub(r"don't", "do not", phrase)
    phrase = re.sub(r"hadn't", "had not", phrase)
    phrase = re.sub(r"hadn't've", "had not have", phrase)
    phrase = re.sub(r"hasn't", "has not", phrase)
    phrase = re.sub(r"haven't", "have not", phrase)
    phrase = re.sub(r"he'd", "he had", phrase)
    phrase = re.sub(r"he'd've", "he would have", phrase)
    phrase = re.sub(r"he'll", "he will", phrase)
    phrase = re.sub(r"he'll've", "he will have", phrase)
    phrase = re.sub(r"he's", "he is", phrase)
    phrase = re.sub(r"how'd", "how did", phrase)
    phrase = re.sub(r"how'd'y", "how do you", phrase)
    phrase = re.sub(r"how'll", "how will", phrase)
    phrase = re.sub(r"how's", "how is", phrase)
    phrase = re.sub(r"I'd", "I had", phrase)
    phrase = re.sub(r"I'd've", "I would have", phrase)
    phrase = re.sub(r"I'll", "I will", phrase)
    phrase = re.sub(r"I'll've", "I will have", phrase)
    phrase = re.sub(r"I'm", "I am", phrase)
    phrase = re.sub(r"I've", "I have", phrase)
    phrase = re.sub(r"isn't", "is not", phrase)
    phrase = re.sub(r"it'd", "it had", phrase)
    phrase = re.sub(r"it'd've", "it would have", phrase)
    phrase = re.sub(r"it'll", "it will", phrase)
    phrase = re.sub(r"it'll've", "iit will have", phrase)
    phrase = re.sub(r"it's", "it is", phrase)
    phrase = re.sub(r"let's", "let us", phrase)
    phrase = re.sub(r"ma'am", "madam", phrase)
    phrase = re.sub(r"mayn't", "may not", phrase)
    phrase = re.sub(r"might've", "might have", phrase)
    phrase = re.sub(r"mightn't", "might not", phrase)
    phrase = re.sub(r"mightn't've", "might not have", phrase)
    phrase = re.sub(r"must've", "must have", phrase)
    phrase = re.sub(r"mustn't", "must not", phrase)
    phrase = re.sub(r"mustn't've", "must not have", phrase)
    phrase = re.sub(r"needn't", "need not", phrase)
    phrase = re.sub(r"needn't've", "need not have", phrase)
    phrase = re.sub(r"o'clock", "of the clock", phrase)
    phrase = re.sub(r"oughtn't", "ought not", phrase)
    phrase = re.sub(r"oughtn't've", "ought not have", phrase)
    phrase = re.sub(r"shan't", "shall not", phrase)
    phrase = re.sub(r"sha'n't", "shall not", phrase)
    phrase = re.sub(r"shan't've", "shall not have", phrase)
    phrase = re.sub(r"she'd", "she had", phrase)
    phrase = re.sub(r"she'd've", "she would have", phrase)
    phrase = re.sub(r"she'll", "she will", phrase)
    phrase = re.sub(r"she'll've", "she will have", phrase)
    phrase = re.sub(r"she's", "she is", phrase)
    phrase = re.sub(r"should've", "should have", phrase)
    phrase = re.sub(r"shouldn't", "should not", phrase)
    phrase = re.sub(r"shouldn't've", "should not have", phrase)
    phrase = re.sub(r"so've", "so have", phrase)
    phrase = re.sub(r"so's", "so is", phrase)
    phrase = re.sub(r"that'd", "that had", phrase)
    phrase = re.sub(r"that'd've", "that would have", phrase)
    phrase = re.sub(r"that's", "that is", phrase)
    phrase = re.sub(r"there'd", "there had", phrase)
    phrase = re.sub(r"there'd've", "there would have", phrase)
    phrase = re.sub(r"there's", "there is", phrase)
    phrase = re.sub(r"they'd", "they had", phrase)
    phrase = re.sub(r"they'd've", "they would have", phrase)
    phrase = re.sub(r"they'll", "they will", phrase)
    phrase = re.sub(r"they'll've", "they will have", phrase)
    phrase = re.sub(r"they're", "they are", phrase)
    phrase = re.sub(r"they've", "they have", phrase)
    phrase = re.sub(r"to've", "to have", phrase)
    phrase = re.sub(r"wasn't", "was not", phrase)
    phrase = re.sub(r"we'd", "we had", phrase)
    phrase = re.sub(r"we'd've", "we would have", phrase)
    phrase = re.sub(r"we'll", "we will", phrase)
    phrase = re.sub(r"we'll've", "we will have", phrase)
    phrase = re.sub(r"we're", "we are", phrase)
    phrase = re.sub(r"we've", "we have", phrase)
    phrase = re.sub(r"weren't", "were not", phrase)
    phrase = re.sub(r"what'll", "what will", phrase)
    phrase = re.sub(r"what'll've", "what will have", phrase)
    phrase = re.sub(r"what're", "what are", phrase)
    phrase = re.sub(r"what's", "what is", phrase)
    phrase = re.sub(r"what've", "what have", phrase)
    phrase = re.sub(r"when's", "when is", phrase)
    phrase = re.sub(r"when've", "when have", phrase)
    phrase = re.sub(r"where'd", "where did", phrase)
    phrase = re.sub(r"where's", "where is", phrase)
    phrase = re.sub(r"where've", "where have", phrase)
    phrase = re.sub(r"who'll", "who will", phrase)
    phrase = re.sub(r"who'll've", "who will have", phrase)
    phrase = re.sub(r"who's", "who is", phrase)
    phrase = re.sub(r"who've", "who have", phrase)
    phrase = re.sub(r"why's", "why is", phrase)
    phrase = re.sub(r"why've", "why have", phrase)
    phrase = re.sub(r"will've", "will have", phrase)
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"won't've", "will not have", phrase)
    phrase = re.sub(r"would've", "would have", phrase)
    phrase = re.sub(r"wouldn't", "would not", phrase)
    phrase = re.sub(r"wouldn't've", "would not have", phrase)
    phrase = re.sub(r"y'all", "you all", phrase)
    phrase = re.sub(r"y'all'd", "you all would", phrase)
    phrase = re.sub(r"y'all'd've", "you all would have", phrase)
    phrase = re.sub(r"y'all're", "you all are", phrase)
    phrase = re.sub(r"y'all've", "you all have", phrase)
    phrase = re.sub(r"you'd", "you had", phrase)
    phrase = re.sub(r"you'd've", "you would have", phrase)
    phrase = re.sub(r"you'll", "you will", phrase)
    phrase = re.sub(r"you'll've", "you will have", phrase)
    phrase = re.sub(r"you're", "you are", phrase)
    phrase = re.sub(r"you've", "you have", phrase)
    
    return phrase

In [74]:
my_stopwords = stopwords.words("english")

In [75]:
def textPreprocessing(document):
    document = decontracted(document)
    #1. Remove Punctuations
    sentWithoutPunct = ''.join([char for char in document  if char not in str.punctuation])
    #2. Extract words out of the sentences
    words = sentWithoutPunct.split()
    #3. Normalize the data (lowercase)
    wordNormalized1 = [word.lower() for word in words]
    wordNormalized2 = [word for word in wordNormalized1 if word.isalpha()]
    wordNormalized3 = [word for word in wordNormalized2 if len(word) > 3]
    # 4. Remove Stopwords
    vocabulary = [word for word in wordNormalized3 if word not in my_stopwords]
    sent = ' '.join(vocabulary)
    return sent

In [76]:
Zomato_reviews["review_text"].iloc[1]

"really appreciate their quality and timing . I have tried the thattil kutti dosa I've been addicted to the dosa really and the chutney... really good and money worth much better than a thattukada must try it"

In [77]:
textPreprocessing(Zomato_reviews["review_text"].iloc[1])

'really appreciate quality timing tried thattil kutti dosa addicted dosa really chutney really good money worth much better thattukada must'

In [78]:
spell = SpellChecker(distance=1)
def Spell_correct(x):
    return spell.correction(x)

In [79]:
Spell_correct(textPreprocessing(Zomato_reviews["review_text"].iloc[0]))

'service worst pricing menu different bill give bill increased pricing even serving watermenu order need call times even busy'

In [80]:
Zomato_reviews["clean_text"] = Zomato_reviews["review_text"].apply(textPreprocessing)

In [81]:
Zomato_reviews["clean_text"].iloc[1]

'really appreciate quality timing tried thattil kutti dosa addicted dosa really chutney really good money worth much better thattukada must'

In [82]:
Zomato_reviews["final_clean_text"] = Zomato_reviews["clean_text"].apply(Spell_correct)

In [83]:
Zomato_reviews

Unnamed: 0,rating,review_text,clean_text,final_clean_text
0,1.0,"Their service is worst, pricing in menu is dif...",service worst pricing menu different bill give...,service worst pricing menu different bill give...
1,5.0,really appreciate their quality and timing . I...,really appreciate quality timing tried thattil...,really appreciate quality timing tried thattil...
2,4.0,"Went there on a Friday night, the place was su...",went friday night place surprisingly empty int...,went friday night place surprisingly empty int...
3,4.0,A very decent place serving good food.\r\nOrde...,decent place serving good food ordered chilli ...,decent place serving good food ordered chilli ...
4,5.0,One of the BEST places for steaks in the city....,best places steaks city tried beef steak chili...,best places steaks city tried beef steak chili...
...,...,...,...,...
27757,4.0,Food quality 4.5/5\r\nHospitality 4/5\r\nManag...,food quality hospitality management response c...,food quality hospitality management response c...
27758,4.0,Taste of the food is good and the ambience as ...,taste food good ambience well need reduce pric...,taste food good ambience well need reduce pric...
27759,5.0,Pizza is really thin crust and made from fresh...,pizza really thin crust made freshly prepared ...,pizza really thin crust made freshly prepared ...
27760,5.0,"Visited last Saturday with my kids ,\r\nIt was...",visited last saturday kids superb crowd good v...,visited last saturday kids superb crowd good v...


### Task <br>
![image-2.png](attachment:image-2.png)

In [207]:
Features = Zomato_reviews["final_clean_text"]

In [208]:
Label = Zomato_reviews["rating"]

In [209]:
vectorizer =  TfidfVectorizer(max_features=5000)

In [210]:
X_train,X_test,y_train,y_test = train_test_split(Features,Label,test_size=0.30,random_state=20)

In [211]:
tf_idf_train = vectorizer.fit_transform(X_train).toarray()

In [212]:
tf_idf_test = vectorizer.fit_transform(X_test).toarray()

In [213]:
print("Train >>", tf_idf_train.shape, "Test >>", tf_idf_test.shape)

Train >> (19423, 5000) Test >> (8325, 5000)


In [214]:
rgr_model = RandomForestRegressor(n_estimators = 50, min_samples_split = 20)

In [215]:
rgr_model.fit(tf_idf_train,y_train)

RandomForestRegressor(min_samples_split=20, n_estimators=50)

In [216]:
prediction = rgr_model.predict(tf_idf_test)
print("mean_squared_error ",mean_squared_error(y_test,prediction))
print("root_mean_squared_error ",np.sqrt(mean_squared_error(y_test,prediction)))
print("r2_score", r2_score(y_test,prediction))

mean_squared_error  1.9171117661950778
root_mean_squared_error  1.3845980522140993
r2_score -0.13897299816723496


In [102]:
rgr_model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 20,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### Task
![image.png](attachment:image.png)

In [94]:
from sklearn.model_selection import GridSearchCV

In [117]:
rgr_model2 = RandomForestRegressor(n_estimators = 50, min_samples_split = 20)

In [118]:
params = {'max_features':["auto",'sqrt','log2'],'max_depth':[10,15,20,25]}

In [119]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer

In [120]:
mse = make_scorer(mean_squared_error,greater_is_better=False)
rsq = make_scorer(r2_score,greater_is_better=True)

In [125]:
model2 = GridSearchCV(rgr_model2,params,cv=5,scoring=mse)

In [126]:
model2.fit(tf_idf_train,y_train)

GridSearchCV(cv=5,
             estimator=RandomForestRegressor(min_samples_split=20,
                                             n_estimators=50),
             param_grid={'max_depth': [10, 15, 20, 25],
                         'max_features': ['auto', 'sqrt', 'log2']},
             scoring=make_scorer(mean_squared_error, greater_is_better=False))

In [127]:
prediction = model2.predict(tf_idf_test)
print("mean_squared_error ",mean_squared_error(y_test,prediction))
print("root_mean_squared_error ",np.sqrt(mean_squared_error(y_test,prediction)))
print("r2_score", r2_score(y_test,prediction))

mean_squared_error  3.0027835543962476
root_mean_squared_error  1.7328541642031645
r2_score -0.7839801769021879


#### r2_score -0.7839801769021879.. very poor.. 

### Use more features for analysis.. 
1. high_frequency_words present in final_clean_text
2. count_chars
3. count_words
4. count_sent
5. count_unique_words

In [128]:
def top_non_stopwords(text):
    stop=set(stopwords.words('english'))
    
    new= text.str.split()
    new=new.values.tolist()
    corpus=[word for i in new for word in i]

    counter=Counter(corpus)
    most=counter.most_common()
    x, y=[], []
    for word,count in most[:300]:
        if (word not in stop):
            x.append(word)
            y.append(count)
            
    return pd.DataFrame(x,y,columns=["word"])

In [129]:
top_300_words = top_non_stopwords(Zomato_reviews["final_clean_text"])

In [130]:
top_300_words.iloc[1:10,0:1]

Unnamed: 0,word
20360,food
20183,good
9463,chicken
8331,service
6626,ordered
6291,ambience
6025,great
5578,taste
5058,really


In [131]:
def high_frequency(text):
    new= text.split()
    c = []
    for count in range(len(new)):
        if new[count] in top_300_words["word"].values:
            c.append(count)
    return len(c)

In [132]:
high_frequency(Zomato_reviews["final_clean_text"].iloc[0])

13

In [133]:
Zomato_reviews["high_frequency_words"] = Zomato_reviews["final_clean_text"].apply(high_frequency)

In [152]:
def count_chars(text):
    return len(text)
def count_words(text):
    return len(text.split())
def count_sent(text):
    return len(nltk.sent_tokenize(text))
def count_unique_words(text):
    return len(set(text.split()))
Zomato_reviews["count_chars"] = Zomato_reviews["final_clean_text"].apply(count_chars)
Zomato_reviews["count_words"] = Zomato_reviews["final_clean_text"].apply(count_words)
Zomato_reviews["count_sent"] = Zomato_reviews["final_clean_text"].apply(count_sent)
Zomato_reviews["count_unique_words"] = Zomato_reviews["final_clean_text"].apply(count_unique_words)

In [153]:
Zomato_reviews.shape

(27748, 9)

In [154]:
Zomato_reviews.columns

Index(['rating', 'review_text', 'clean_text', 'final_clean_text',
       'high_frequency_words', 'count_chars', 'count_words', 'count_sent',
       'count_unique_words'],
      dtype='object')

In [155]:
tf_idf_feartures = vectorizer.fit_transform(Zomato_reviews["final_clean_text"]).toarray()

In [156]:
tf_idf_feartures.shape

(27748, 5000)

In [157]:
Features_DF = pd.DataFrame(tf_idf_feartures)

In [158]:
Features_DF.shape

(27748, 5000)

In [159]:
DF_Merge = pd.merge(Features_DF,Zomato_reviews[['high_frequency_words', 'count_chars', 'count_words', 'count_sent','count_unique_words','rating']],left_index=True, right_index=True)

In [160]:
DF_Merge.shape

(27734, 5006)

In [161]:
Features = DF_Merge.drop(["rating"],axis=1)

In [162]:
Label = DF_Merge["rating"]

In [163]:
X_train1,X_test1,y_train1,y_test1 = train_test_split(Features,Label,test_size=0.30,random_state=20)

In [164]:
rgr_model1 = RandomForestRegressor(n_estimators = 50, min_samples_split = 20)

In [165]:
rgr_model1.fit(X_train1,y_train1)

RandomForestRegressor(min_samples_split=20, n_estimators=50)

In [217]:
prediction1 = rgr_model1.predict(X_test1)
print("mean_squared_error ",mean_squared_error(y_test1,prediction1))
print("root_mean_squared_error ",np.sqrt(mean_squared_error(y_test1,prediction1)))
print("r2_score", r2_score(y_test1,prediction1))

mean_squared_error  0.6790806799868792
root_mean_squared_error  0.8240635169614531
r2_score 0.5899214429585472


## Deployment Phase

In [254]:
textInput = input("Enter Feedback: ")
textInput_clean = textPreprocessing(textInput)
textInput_cleaned = Spell_correct(textInput_clean)
vectorizer =  TfidfVectorizer(max_features=5000)
tf_idf_textInput = vectorizer.fit_transform([textInput_cleaned]).toarray()
num_rows, num_cols = tf_idf_textInput.shape
missing_cols = 5000 - num_cols
tf_idf_textInput_revised = np.append(tf_idf_textInput,np.zeros((1,missing_cols)),axis = 1)
high_frequency_words = high_frequency(textInput_cleaned)
count_chars1 = len(textInput_cleaned)
count_words1 = len(textInput_cleaned.split())
count_sent1 = len(nltk.sent_tokenize(textInput_cleaned))
count_unique_words1 = len(set(textInput_cleaned.split()))

Features_DF1 = pd.DataFrame(tf_idf_textInput_revised)
Feature_list = [[high_frequency_words,count_chars1,count_words1,count_sent1,count_unique_words1]]
Features_DF2 = pd.DataFrame(Feature_list)
featureSet = pd.merge(Features_DF1,Features_DF2,left_index=True, right_index=True)

print("Predicted Feedback is ",rgr_model1.predict(featureSet))

Enter Feedback: 'Their service is worst, pricing in menu is different from bill. They can give you a bill with increased pricing. Even for serving water,menu, order you need to call them 3-4 times even on a non busy day.'
Predicted Feedback is  [3.60773738]


In [253]:
Zomato_reviews["review_text"].iloc[0]

'Their service is worst, pricing in menu is different from bill. They can give you a bill with increased pricing. Even for serving water,menu, order you need to call them 3-4 times even on a non busy day.'

In [244]:
Zomato_reviews.head(5)

Unnamed: 0,rating,review_text,clean_text,final_clean_text,high_frequency_words,count_chars,count_words,count_sent,count_unique_words
0,1.0,"Their service is worst, pricing in menu is dif...",service worst pricing menu different bill give...,service worst pricing menu different bill give...,13,124,19,1,16
1,5.0,really appreciate their quality and timing . I...,really appreciate quality timing tried thattil...,really appreciate quality timing tried thattil...,11,138,20,1,17
2,4.0,"Went there on a Friday night, the place was su...",went friday night place surprisingly empty int...,went friday night place surprisingly empty int...,19,276,40,1,31
3,4.0,A very decent place serving good food.\r\nOrde...,decent place serving good food ordered chilli ...,decent place serving good food ordered chilli ...,19,150,22,1,20
4,5.0,One of the BEST places for steaks in the city....,best places steaks city tried beef steak chili...,best places steaks city tried beef steak chili...,23,286,42,1,41
