**Task**: Help Zomato predict Rating from the Review.



**Objective**: Using NLP and machine learning, build a model that predicts the rating of a review from the contents of its text. This will help identify cases where the rating and the review text do not match.




**Problem Statement:** Zomato is India's largest platform for discovering restaurants and ordering food. It operates in India as well as in cities internationally. Bangalore is one of the biggest customer and restaurant bases for Zomato, with 4 to 5 million users on the platform each month.



Users on the platform can also post reviews of restaurants and provide a rating accompanying each review. The content of the review should ideally reflect the rating provided by the customer. In many cases, for multiple reasons, the rating does not match the review text. Keeping reviews and ratings consistent is very important because it builds customer trust in the platform and helps users get an accurate picture of the restaurant.



You, as a data scientist, need to enable the identification and cleanup of such cases, to ensure that the ratings reflect the reviews and that the reviews seem trustworthy to the customer. You will need to use NLP techniques in conjunction with machine learning models to predict the rating from the review text.




**Analysis to be done:** Perform specific data cleanup and build a prediction model using Random Forest techniques and NLP.

**Date:** 13-11-2022

# Importing the required Libraries

In [152]:
import pandas as pd            # To work with data frames
import numpy as np             # advanced math library
import re                      # Library to work with regular expression
from google.colab import files # To import files to google Colab from the local Machine

# Importing the CSV file

In [153]:
uploaded = files.upload()

Saving Zomato_reviews.csv to Zomato_reviews (2).csv


In [154]:
# The Python-specific codec unicode_escape reads every byte of the file as Latin-1 and
# turns escape sequences found in the text (\xXX, \uXXXX, \n, ...) into the characters they stand for,
# so the file loads even if it is not valid UTF-8.
reviews0 = pd.read_csv("Zomato_reviews.csv", encoding= 'unicode_escape')
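To make the effect of this encoding concrete, here is a minimal sketch using only the standard library. The byte strings are made-up examples, not rows from `Zomato_reviews.csv`. It also shows a side effect worth keeping in mind: a genuine backslash in a review is interpreted as an escape sequence.

```python
# Made-up bytes, as they might sit in a CSV file: ASCII text with literal backslash escapes.
raw = b"caf\\xe9 costs \\u20b9100"
decoded = raw.decode("unicode_escape")
print(decoded)  # café costs ₹100

# Bytes above 127 are read as Latin-1, so the decoder does not fail on them.
latin1_decoded = b"caf\xe9".decode("unicode_escape")
print(latin1_decoded)  # café

# Side effect: a real backslash followed by "n" becomes a newline character.
side_effect = b"rice\\noodles".decode("unicode_escape")
print(repr(side_effect))  # 'rice\noodles'
```

If the file is known to be UTF-8 or Latin-1, passing that encoding to `pd.read_csv` directly avoids this side effect.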

# Exploring the data

In [155]:
reviews0.head()

Unnamed: 0,rating,review_text
0,1.0,"Their service is worst, pricing in menu is dif..."
1,5.0,really appreciate their quality and timing . I...
2,4.0,"Went there on a Friday night, the place was su..."
3,4.0,A very decent place serving good food.\r\nOrde...
4,5.0,One of the BEST places for steaks in the city....


In [156]:
# There are null values in the review_text column
reviews0.describe(include="all")

Unnamed: 0,rating,review_text
count,27762.0,27748
unique,,10548
top,,good
freq,,278
mean,3.665784,
std,1.284573,
min,1.0,
25%,3.0,
50%,4.0,
75%,5.0,


In [157]:
# rows having null values
reviews0[reviews0.review_text.isnull()]

Unnamed: 0,rating,review_text
6527,5.0,
6532,5.0,
21963,5.0,
21964,5.0,
26645,5.0,
26650,5.0,
26655,5.0,
27398,4.0,
27399,3.5,
27400,5.0,


In [158]:
# Deleting the rows with null values
reviews1 = reviews0[~reviews0.review_text.isnull()].copy()

In [159]:
# After deleting the rows, the index has gaps (6527 is missing below), so it also needs to be reset.
reviews1[6525:6529]

Unnamed: 0,rating,review_text
6525,2.0,Delivery was delay and food was cold
6526,1.0,Tasteless and not at all fresh or hygienic.
6528,1.0,Even though the food is not to complain the de...
6529,1.0,order was very late


In [160]:
# Resetting the index
reviews1.reset_index(inplace=True, drop=True)

In [161]:
# Shape of old and new dataframe
reviews0.shape, reviews1.shape

((27762, 2), (27748, 2))

## **Converting to an array for easy manipulation**


In [162]:
reviews_list = reviews1.review_text.values

In [163]:
reviews_list

array(['Their service is worst, pricing in menu is different from bill. They can give you a bill with increased pricing. Even for serving water,menu, order you need to call them 3-4 times even on a non busy day.',
       "really appreciate their quality and timing . I have tried the thattil kutti dosa I've been addicted to the dosa really and the chutney... really good and money worth much better than a thattukada must try it",
       'Went there on a Friday night, the place was surprisingly empty. Interesting menu which is almost fully made of dosas. I had bullseye dosa and cheese masala dosa. The bullseye Dosa was really good, with the egg perfectly cooked to a half boiled state. The masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. The chutney was good, the sambar was average. The dishes are reasonably priced.',
       ...,
       'Pizza is really thin crust and made from freshly prepared dough. Unlimited pizza plan is really good if you are giv

In [164]:
len(reviews_list)

27748

# **Step 1:** Converting each review to lower case. **(Normalizing the case)**

In [165]:
reviews_lower = [txt.lower() for txt in reviews_list]

In [166]:
reviews_lower[2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food.\r\nordered chilli fish, chicken & pork sizzler.\r\neverything tasted good but pork could have been slightly better cooked.\r\ntried 2 beverages, both were very sweet.']

# **Step 2:** Removing the \n (new line) and \r (carriage return) characters that are present in the original review text.

In [167]:
# To understand the code below, see the next cell
# split() with no arguments splits on any run of whitespace (including \n and \r);
# join then glues the words back into one string with a single ' ' between them
reviews_lower = [" ".join(txt.split()) for txt in reviews_lower]

In [168]:
# For understanding the code
a = ['a\n\rb\n\rc','a\n\rb\n\rc']
for x in a:
  print(x.split())

['a', 'b', 'c']
['a', 'b', 'c']


In [169]:
reviews_lower[2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food. ordered chilli fish, chicken & pork sizzler. everything tasted good but pork could have been slightly better cooked. tried 2 beverages, both were very sweet.']

# **Step 3:** Performing Tokenization
Tokenization breaks each review into its parts: individual words and punctuation marks. Working with tokens lets the program treat each word on its own as well as in the context of the larger text. This is especially important for larger amounts of text, because it allows the machine to count how often certain words occur and where they frequently appear. The later steps (stop-word removal, lemmatization and vectorization) all operate on these tokens.

In [170]:
import nltk                             # NLTK is a standard python library that provides a set of diverse algorithms for NLP. 
                                        # It is one of the most used libraries for NLP and Computational Linguistics.
nltk.download('punkt')                  # NLTK has several tokenizers; word_tokenize requires the Punkt sentence tokenization models to be installed
from nltk.tokenize import word_tokenize # This module is used to split the sentences to words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [171]:
# This is how tokenization works, each word in the sentence is considered as a different entity
print(word_tokenize(reviews_lower[0]))

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


In [172]:
# applying it to the whole dataset
reviews_tokens = [word_tokenize(sent) for sent in reviews_lower]
print(reviews_tokens[0])

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']
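Once a review is a list of tokens, counting word frequencies is straightforward. The sketch below uses a small regular-expression tokenizer, `simple_tokenize`, as a hypothetical stand-in for `word_tokenize` so that it runs without NLTK; it is not the tokenizer used in this notebook and will differ from it on contractions and other edge cases.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Toy tokenizer: words (optionally hyphenated, e.g. '3-4') or single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

review = "their service is worst, pricing in menu is different from bill."
tokens = simple_tokenize(review)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))  # [('is', 2)]
```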


In [173]:
reviews_tokens[0:2] # It's a list within a list

[['their',
  'service',
  'is',
  'worst',
  ',',
  'pricing',
  'in',
  'menu',
  'is',
  'different',
  'from',
  'bill',
  '.',
  'they',
  'can',
  'give',
  'you',
  'a',
  'bill',
  'with',
  'increased',
  'pricing',
  '.',
  'even',
  'for',
  'serving',
  'water',
  ',',
  'menu',
  ',',
  'order',
  'you',
  'need',
  'to',
  'call',
  'them',
  '3-4',
  'times',
  'even',
  'on',
  'a',
  'non',
  'busy',
  'day',
  '.'],
 ['really',
  'appreciate',
  'their',
  'quality',
  'and',
  'timing',
  '.',
  'i',
  'have',
  'tried',
  'the',
  'thattil',
  'kutti',
  'dosa',
  'i',
  "'ve",
  'been',
  'addicted',
  'to',
  'the',
  'dosa',
  'really',
  'and',
  'the',
  'chutney',
  '...',
  'really',
  'good',
  'and',
  'money',
  'worth',
  'much',
  'better',
  'than',
  'a',
  'thattukada',
  'must',
  'try',
  'it']]

# **Step 4: Remove stop words and punctuations**
Stop words such as "the", "is" and "and" occur in abundance in any human language but carry little information on their own. By removing them, we strip the low-information words from our text and give more focus to the important ones.

In [174]:
from nltk.corpus import stopwords  # To import the list of all stopwords in english language
from string import punctuation     # To import the list of all punctuation characters
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [175]:
stop_nltk = stopwords.words("english") # Saving the stop words in the list
stop_punct = list(punctuation)         # Saving the punctuations in the list

All the stop words in NLTK's English list:

In [176]:
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [177]:
len(stop_nltk)

179

All the ASCII punctuation characters in `string.punctuation`:

In [178]:
print(stop_punct)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [179]:
len(stop_punct)

32

Negation words such as "no" and "not" carry meaning that matters when predicting a rating, so we take them out of the stop-word list and thereby keep them in our data.

In [180]:
stop_nltk.remove("no")
stop_nltk.remove("not")
stop_nltk.remove("don")
stop_nltk.remove("won")

In [181]:
"no" in stop_nltk

False
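The effect of keeping negations can be seen on a tiny example. This is a minimal sketch with a hand-written stop-word set (`toy_stop_words` is an assumption for illustration, not NLTK's list); a set is used because membership tests on a set are faster than on a list.

```python
# Toy stop-word set; the notebook uses NLTK's English list instead.
toy_stop_words = {"the", "was", "is", "a", "not", "no"}
negations = {"not", "no"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "food", "was", "not", "good"]

without_negation = remove_stop_words(tokens, toy_stop_words)
with_negation = remove_stop_words(tokens, toy_stop_words - negations)

print(without_negation)  # ['food', 'good']
print(with_negation)     # ['food', 'not', 'good']
```

With "not" removed, a negative review collapses to the same tokens as a positive one, which is exactly the information a rating model needs to keep.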

Adding a few extra tokens to the list (punctuation sequences produced by the tokenizer, plus the word "must"), because we want these to be removed from our list **'reviews_tokens'** as well.

In [182]:
# This contains all the tokens that we want to be removed from our list 'reviews_tokens'
stop_final = stop_nltk + stop_punct + ["...", "``","''", "====", "must"]

In [183]:
# Now we write a function which would remove all the stop words and punctuations
# Here sent must be a list
# This function would generate a new list (Important)
def del_stop(sent):
    return [term for term in sent if term not in stop_final]

In [184]:
# An example of how this function will be applied to the list
del_stop(reviews_tokens[1])

['really',
 'appreciate',
 'quality',
 'timing',
 'tried',
 'thattil',
 'kutti',
 'dosa',
 "'ve",
 'addicted',
 'dosa',
 'really',
 'chutney',
 'really',
 'good',
 'money',
 'worth',
 'much',
 'better',
 'thattukada',
 'try']

In [185]:
# Applying it to the whole dataset
reviews_clean = [del_stop(sent) for sent in reviews_tokens]

# **Step 6:** Applying Lemmatization

Lemmatization is a text normalization technique used in Natural Language Processing (NLP) that reduces a word to its base (dictionary) form, the lemma. It groups the different inflected forms of a word, which share the same meaning, under that single form.


In [186]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [187]:
lemmatizer = WordNetLemmatizer()

In [188]:
def apply_lemmetization(sent):
    return [lemmatizer.lemmatize(term) for term in sent]

In [189]:
reviews_clean[2]

['went',
 'friday',
 'night',
 'place',
 'surprisingly',
 'empty',
 'interesting',
 'menu',
 'almost',
 'fully',
 'made',
 'dosas',
 'bullseye',
 'dosa',
 'cheese',
 'masala',
 'dosa',
 'bullseye',
 'dosa',
 'really',
 'good',
 'egg',
 'perfectly',
 'cooked',
 'half',
 'boiled',
 'state',
 'masala',
 'cheese',
 'masala',
 'good',
 'cheese',
 'bit',
 'chewy',
 'liking',
 'chutney',
 'good',
 'sambar',
 'average',
 'dishes',
 'reasonably',
 'priced']

In [190]:
apply_lemmetization(reviews_clean[2])

['went',
 'friday',
 'night',
 'place',
 'surprisingly',
 'empty',
 'interesting',
 'menu',
 'almost',
 'fully',
 'made',
 'dosas',
 'bullseye',
 'dosa',
 'cheese',
 'masala',
 'dosa',
 'bullseye',
 'dosa',
 'really',
 'good',
 'egg',
 'perfectly',
 'cooked',
 'half',
 'boiled',
 'state',
 'masala',
 'cheese',
 'masala',
 'good',
 'cheese',
 'bit',
 'chewy',
 'liking',
 'chutney',
 'good',
 'sambar',
 'average',
 'dish',
 'reasonably',
 'priced']
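In the output above only 'dishes' changed (to 'dish'), while verb forms such as 'went' and 'cooked' were left alone. That is consistent with `WordNetLemmatizer.lemmatize` treating every word as a noun unless a `pos` argument is passed. The sketch below illustrates the idea with a hypothetical lookup table, `TOY_LEMMAS`, standing in for WordNet; it is not NLTK's implementation.

```python
# Hypothetical lookup table keyed by (word, part of speech); WordNet plays this role in NLTK.
TOY_LEMMAS = {
    ("dishes", "n"): "dish",
    ("went", "v"): "go",
    ("cooked", "v"): "cook",
}

def toy_lemmatize(word, pos="n"):
    """Return the base form if the (word, pos) pair is known, otherwise the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

words = ["went", "cooked", "dishes"]

as_nouns = [toy_lemmatize(w) for w in words]
as_verbs = [toy_lemmatize(w, pos="v") for w in words]

print(as_nouns)  # ['went', 'cooked', 'dish']
print(as_verbs)  # ['go', 'cook', 'dishes']
```

Lemmatizing verbs as well would require a part-of-speech tag for each token, which this notebook does not compute.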

In [191]:
# Applying it to the whole dataset
reviews_clean = [apply_lemmetization(sent) for sent in reviews_clean]

# **Step 7:** Converting the tokenized words back into a sentence

In [192]:

# Initially it was a list within a list; now it is a simple list in which each element is a review stored as a string (with no stop words or punctuation)
reviews_clean = [" ".join(sent) for sent in reviews_clean]
reviews_clean[2]

'went friday night place surprisingly empty interesting menu almost fully made dosas bullseye dosa cheese masala dosa bullseye dosa really good egg perfectly cooked half boiled state masala cheese masala good cheese bit chewy liking chutney good sambar average dish reasonably priced'

In [193]:
len(reviews_clean)

27748

## **Separate X and Y and perform train test split, 70-30**
Now that the data is cleaned, we divide it into train and test sets.

In [194]:
X = reviews_clean
y = reviews1.rating

In [195]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)
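The sizes of the two sets follow from simple arithmetic: 30% of 27,748 reviews is 8,324.4, which is rounded up to 8,325 test reviews, leaving 19,423 for training. The plain-Python sketch below reproduces those sizes; `split_70_30` is a hypothetical helper, and its shuffle will not select the same rows as scikit-learn's `train_test_split`, even with the same seed.

```python
import math
import random

def split_70_30(items, test_size=0.30, seed=42):
    """Shuffle a copy of items and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # test size is rounded up
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_70_30(range(27748))
print(len(train), len(test))  # 19423 8325
```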


## **Document term matrix using TfIdf**
To feed the textual data into our model, we first convert it to numerical data using vectorization. There are various vectorization techniques, such as:

- One-hot Encoding (OHE)
- Count Vectorizer
- Bag-of-Words (BOW)
- N-grams
- Term Frequency-Inverse Document Frequency (TF-IDF)

Here we use TF-IDF, which is the product of two measures: the term frequency of term $t$ in document $d$, and the inverse document frequency of $t$ across the corpus $D$:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
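The product can be worked through by hand on a tiny corpus. The sketch below uses one common textbook definition, tf as the relative frequency of the term in the document and idf as the natural log of (number of documents / number of documents containing the term); the three documents are made up. `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Made-up, already tokenized documents.
docs = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "dosa"],
]

def tf(term, doc):
    """Relative frequency of term in one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / documents containing term); term must occur somewhere."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(round(tfidf("good", docs[0], docs), 4))     # 0.2027
print(round(tfidf("service", docs[0], docs), 4))  # 0.2747
```

Although "good" appears twice in the first document and "service" only once, "service" scores higher because it occurs in fewer documents.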

In [196]:
from sklearn.feature_extraction.text import TfidfVectorizer # Using the tfidf for vectorization

In [197]:
vectorizer = TfidfVectorizer(max_features = 8000)  # Rather than using all the features (unique words), keep only the 8000 most frequent terms in the training corpus

In [198]:
len(X_train), len(X_test)

(19423, 8325)

In [199]:
# Vectorizing the X_train data
X_train_bow = vectorizer.fit_transform(X_train) # Applying fit and transform

In [200]:
# Vectorizing the X_test data 
X_test_bow = vectorizer.transform(X_test)  # Applying transform only

In [201]:
# Shape of the data
X_train_bow.shape, X_test_bow.shape

((19423, 8000), (8325, 8000))
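The reason for calling `fit_transform` on the training data but only `transform` on the test data is that the vocabulary must be learned from the training reviews alone. The sketch below shows the idea with raw counts instead of TF-IDF weights; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features):
    """Map the max_features most frequent training terms to column indices."""
    counts = Counter(term for doc in train_docs for term in doc.split())
    return {term: i for i, (term, _) in enumerate(counts.most_common(max_features))}

def transform(docs, vocabulary):
    """Count vocabulary terms per document; terms outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in doc.split():
            if term in vocabulary:
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

train_docs = ["good food good service", "bad food"]
test_docs = ["good biryani"]  # "biryani" never appears in the training data

vocab = fit_vocabulary(train_docs, max_features=3)
test_matrix = transform(test_docs, vocab)

print(vocab)        # {'good': 0, 'food': 1, 'service': 2}
print(test_matrix)  # [[1, 0, 0]]
```

Both matrices end up with the same number of columns, and a word seen only in the test set contributes nothing, which mirrors the matching second dimension of 8000 in the shapes above.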

## **Model building**
Now that we have numeric features, we can fit our model.

Model 1: Creating the model with default parameters

In [202]:
from sklearn.ensemble import RandomForestRegressor # importing the required models 

In [53]:
learner_rf = RandomForestRegressor(random_state=42)  # Creating the instance

In [54]:
learner_rf.fit(X_train_bow, y_train)   # Fitting the model

RandomForestRegressor(random_state=42)

In [55]:
y_train_preds = learner_rf.predict(X_train_bow) # Predicting on the train data with the fitted model

In [56]:
y_test_preds = learner_rf.predict(X_test_bow) # Predicting on the test data with the fitted model

In [57]:
from sklearn.metrics import mean_squared_error,r2_score # Importing the regression metrics (mean squared error and R²)

In [58]:
mean_squared_error(y_train, y_train_preds)**0.5 # Obtaining the Root Mean Squared Error of train data

0.2310402715388965

In [59]:
r2_score(y_train, y_train_preds) # Finding the variation explained by model.

0.9674784294947738

In [60]:
mean_squared_error(y_test, y_test_preds)**0.5 # Obtaining the Root Mean Squared Error of test data

0.4927333982213405
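For reference, the two metrics used here can be written out in a few lines. This is a minimal sketch of the standard definitions (RMSE as the square root of the mean squared error, R² as one minus the ratio of residual to total sum of squares); the five rating pairs are made up for illustration and are not model output.

```python
def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in rating points."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

def r2(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up ratings and predictions.
y_true = [4.0, 4.0, 5.0, 4.5, 2.0]
y_pred = [4.0, 3.5, 4.5, 4.5, 4.0]

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

A train RMSE of about 0.23 against a test RMSE of about 0.49, as reported above, means the typical error roughly doubles on unseen reviews, which suggests the forest fits the training data more closely than it generalizes.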

## **Changing the number of trees**
Model 2: Setting the number of trees to 50 (fewer than the default of 100 in recent scikit-learn versions)

In [61]:
learner_rf = RandomForestRegressor(random_state=42, n_estimators=50)  # Creating an instance

In [62]:
%%time
learner_rf.fit(X_train_bow, y_train) # Fitting the model

CPU times: user 3min 16s, sys: 193 ms, total: 3min 16s
Wall time: 3min 17s


RandomForestRegressor(n_estimators=50, random_state=42)

In [63]:
y_train_preds = learner_rf.predict(X_train_bow) # Predicting using the fitted Model

In [64]:
y_test_preds = learner_rf.predict(X_test_bow)

In [65]:
mean_squared_error(y_train, y_train_preds)**0.5  # Finding the root mean squared Error

0.23397149185878208

In [66]:
r2_score(y_train, y_train_preds) # Finding the variation explained by model.

0.9666479889600823

In [67]:
mean_squared_error(y_test, y_test_preds)**0.5 # Obtaining the Root Mean Squared Error of test data

0.49791459756602646

## **Hyper-parameter tuning**

Model 3: Using Grid Search to find the optimum parameters

In [203]:
from sklearn.model_selection import GridSearchCV 

In [204]:
learner_rf = RandomForestRegressor(random_state=42) # Creating an instance

In [205]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'n_estimators': [700],
    'max_features': [500, 800],
    'max_depth': [120, 140],
}

In [207]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = learner_rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 1, scoring = "neg_mean_squared_error" )


In [208]:
grid_search.fit(X_train_bow, y_train) # Fitting the model

Fitting 3 folds for each of 4 candidates, totalling 12 fits


GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [120, 140], 'max_features': [500, 800],
                         'n_estimators': [700]},
             scoring='neg_mean_squared_error', verbose=1)
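The "4 candidates, totalling 12 fits" message follows directly from the grid: every combination of the listed values is one candidate, and each candidate is trained once per cross-validation fold. A minimal sketch of that enumeration with `itertools.product` (this mimics the counting, not GridSearchCV's internals):

```python
from itertools import product

param_grid = {
    "n_estimators": [700],
    "max_features": [500, 800],
    "max_depth": [120, 140],
}
cv_folds = 3

# Build every combination of parameter values, iterating names in sorted order.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for candidate in candidates:
    print(candidate)

total_fits = len(candidates) * cv_folds
print(total_fits, "fits")  # 12 fits
```

With each fit taking several minutes, adding even one more value to a parameter list increases the run time noticeably, so a small grid is a reasonable choice here.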

In [209]:
grid_search.cv_results_  # To observe the performance of all the combinations

{'mean_fit_time': array([274.26310102, 351.44539587, 278.77989268, 344.1306808 ]),
 'std_fit_time': array([ 0.77502513,  3.34587482,  3.13928382, 14.16239297]),
 'mean_score_time': array([1.73970779, 1.66284887, 1.64814496, 1.49754111]),
 'std_score_time': array([0.0654036 , 0.16636329, 0.00869175, 0.11073012]),
 'param_max_depth': masked_array(data=[120, 120, 140, 140],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_max_features': masked_array(data=[500, 800, 500, 800],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[700, 700, 700, 700],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 120, 'max_features': 500, 'n_estimators': 700},
  {'max_depth': 120, 'max_features': 800, 'n_estimators': 700},
  {'max_depth': 140, 'max_features': 500, 'n_estimators': 7

In [210]:
grid_search.best_estimator_ # The value of the parameters which provides the best results

RandomForestRegressor(max_depth=140, max_features=500, n_estimators=700,
                      random_state=42)

In [211]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow) # Predicting on the train data using the best parameters

In [212]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)  # Predicting on the test data using the best parameters

In [213]:
mean_squared_error(y_train, y_train_pred)**0.5 # Finding the Root Mean square Error of the train data

0.2296234214284136

In [214]:
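# Use y_train_pred (the tuned model's predictions); y_train_preds still holds Model 2's
# predictions, and scoring it just reproduces Model 2's R² of 0.9666479889600823.
r2_score(y_train, y_train_pred) # Finding the variation explained by the tuned model.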

In [215]:
mean_squared_error(y_test, y_test_pred)**0.5  # Finding the Root Mean square Error of the test data

0.46509064310873977

**Tuning hyperparameters such as the number of trees in the Random Forest, the maximum depth of each tree and the number of random features considered at each split improved model performance: the test RMSE dropped from about 0.493 (Model 1) to about 0.465.**

## Creating a dataframe with our features, True Y and predicted Y

In [None]:
res_df = pd.DataFrame({'review':X_test, 'rating':y_test, 'rating_pred':y_test_pred})

In [216]:
res_df.head()

Unnamed: 0,review,rating,rating_pred
1227,saw place koramangala last time since wish lis...,4.0,4.002444
14689,good quality not best 4 star quantity packing,4.0,3.265577
27308,food really top class dosa match taste bangalo...,5.0,4.481612
16996,real hyderabadi biriyani lover restaurant went...,4.5,4.42439
14092,ambiance place located ameoba church street re...,2.0,4.110818


**Now observing those reviews for which there is a significant difference in the prediction of the ratings and the original rating.**

In [217]:
res_df[abs(res_df.rating - res_df.rating_pred)>=2].shape

(83, 3)

In [219]:
res_df[abs(res_df.rating - res_df.rating_pred)>=2]

Unnamed: 0,review,rating,rating_pred
14092,ambiance place located ameoba church street re...,2.0,4.110818
14654,nothing grand place tucked away alley shivaji ...,1.0,3.009471
3971,think huge amount chilli stock restaurant .......,1.0,3.242127
7277,life saviour serving excellent food worst time...,5.0,2.491940
7112,good kebab,2.0,4.403283
...,...,...,...
20196,ok ok,1.0,3.216208
21946,looking forward try food here.but closed frida...,1.0,4.099945
4572,food salty,1.0,3.032254
3078,quantity le amount pay,1.0,3.137177
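The filtering rule is simply "absolute difference of at least 2 rating points". The sketch below applies it in plain Python to the five rows shown by `res_df.head()` above; the signed gap is an addition for illustration, where a positive value means the text reads more positively than the rating the customer gave.

```python
# (index, rating, rating_pred) taken from the res_df.head() output above.
rows = [
    (1227, 4.0, 4.002444),
    (14689, 4.0, 3.265577),
    (27308, 5.0, 4.481612),
    (16996, 4.5, 4.42439),
    (14092, 2.0, 4.110818),
]

THRESHOLD = 2.0  # minimum gap, in rating points, to flag a review

flagged = [
    (idx, rating, pred, round(pred - rating, 2))
    for idx, rating, pred in rows
    if abs(rating - pred) >= THRESHOLD
]

print(flagged)  # [(14092, 2.0, 4.110818, 2.11)]
```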


# **Conclusion:**
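There are 83 reviews in the test set (out of 8,325) for which the predicted rating differs from the observed rating by 2 or more points. These are the candidates for a mismatch between rating and review text.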