Using NLP and machine learning, make a model to predict the rating in a review based on the content of the text review. This will help identify cases with a mismatch.

Problem Statement:  

Zomato is India’s largest platform for discovering restaurants and ordering food. It operates in India as well as a few cities internationally. Bangalore is one of the biggest customers and restaurant bases for Zomato with 4 to 5 million users using the platform each month.

Users on the platform can also post reviews of restaurants and provide a rating accompanying the review. The content in the reviews should ideally reflect the rating provided by the customer. In many cases, there is a mismatch, owing to multiple reasons, where the rating does not match the customer review. The reviews and rating match is very important as it builds customer trust on the platform and helps the user get an accurate picture of the restaurant. 

You, as a data scientist, need to enable the identification and cleanup of such cases to ensure the ratings reflect the reviews and that the reviews seem trustworthy to the customer. You will need to use NLP techniques in conjunction with machine learning models to predict the rating from the review text. 

Domain: Hospitality and internet

Analysis to be done: Perform specific data cleanup, build a rating prediction model using the Random Forest technique and NLP. 

Content: 

rating: the rating given by the customer

review_text: the text in the review

Steps to perform:

Perform clean up on the data; tweak the stop words (negative terms are important). Follow up with a Random Forest Regressor to predict the star rating given by the customers.


Load the data using read_csv function from pandas package
    Null values in the review text? 
    Remove the records where the review text is null

In [2]:
# importing the required Libraries
import pandas as pd, numpy as np
import re

In [3]:
rvw = pd.read_csv('C:\\dataset\\Zomato_reviews.csv')

In [4]:
rvw.head(10)

Unnamed: 0,rating,review_text
0,1.0,"Their service is worst, pricing in menu is dif..."
1,5.0,really appreciate their quality and timing . I...
2,4.0,"Went there on a Friday night, the place was su..."
3,4.0,A very decent place serving good food.\r\nOrde...
4,5.0,One of the BEST places for steaks in the city....
5,5.0,Really lovely place for steaks and sizzlers. T...
6,5.0,This place ia for ultimate steak lovers!\r\nBo...
7,5.0,It's a shame if you haven't tried Once Upon a ...
8,4.0,We visited this place after we were tired and ...
9,5.0,Went there for yesterday dinner. Surprisingly ...


In [5]:
rvw.describe(include='all')

Unnamed: 0,rating,review_text
count,27762.0,27748
unique,,10548
top,,good
freq,,278
mean,3.665784,
std,1.284573,
min,1.0,
25%,3.0,
50%,4.0,
75%,5.0,


In [6]:
rvw.isnull().sum()

rating          0
review_text    14
dtype: int64

In [7]:
# we can see that we have 14 missing values, so we need to get rid of them
reviews = rvw[~rvw.review_text.isnull()].copy()
reviews.reset_index(inplace=True,drop=True)

In [8]:
# check the changes in size made
rvw.shape,reviews.shape

((27762, 2), (27748, 2))

**Converting to list for easy manipulation**

In [9]:
review_list = reviews.review_text.values

In [10]:
len(review_list)

27748

Text clean up

    Normalize the case
    Remove stop words
    remove "not", "no" from the stop word list
    Remove punctuations

Normalizing case


In [11]:
reviews_lower = [txt.lower() for txt in review_list]

In [12]:
reviews_lower [2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food.\r\nordered chilli fish, chicken & pork sizzler.\r\neverything tasted good but pork could have been slightly better cooked.\r\ntried 2 beverages, both were very sweet.']

**Remove extra line breaks**

In [13]:
reviews_lower = [' '.join(txt.split())for txt in reviews_lower]

In [14]:
reviews_lower [2:4]

['went there on a friday night, the place was surprisingly empty. interesting menu which is almost fully made of dosas. i had bullseye dosa and cheese masala dosa. the bullseye dosa was really good, with the egg perfectly cooked to a half boiled state. the masala in the cheese masala was good, but the cheese was a bit too chewy for my liking. the chutney was good, the sambar was average. the dishes are reasonably priced.',
 'a very decent place serving good food. ordered chilli fish, chicken & pork sizzler. everything tasted good but pork could have been slightly better cooked. tried 2 beverages, both were very sweet.']

**Tokenize**

In [15]:
from nltk.tokenize import word_tokenize

In [16]:
print(word_tokenize(reviews_lower[0]))

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


In [17]:
reviews_token = [word_tokenize(sent) for sent in reviews_lower]

In [18]:
print(reviews_token[0])

['their', 'service', 'is', 'worst', ',', 'pricing', 'in', 'menu', 'is', 'different', 'from', 'bill', '.', 'they', 'can', 'give', 'you', 'a', 'bill', 'with', 'increased', 'pricing', '.', 'even', 'for', 'serving', 'water', ',', 'menu', ',', 'order', 'you', 'need', 'to', 'call', 'them', '3-4', 'times', 'even', 'on', 'a', 'non', 'busy', 'day', '.']


**Remove Stopwords and Punctuations**

In [19]:
from nltk.corpus import stopwords
from string import punctuation

In [20]:
stop_nltk = stopwords.words('english')
stop_punct = list(punctuation)

In [21]:
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [22]:
stop_nltk.remove("no")
stop_nltk.remove("not")
stop_nltk.remove("don")
stop_nltk.remove("won")

In [23]:
"no" in stop_nltk #checking

False

In [24]:
stop_final = stop_nltk + stop_punct + ["..." "``","''","====", "must"]

In [25]:
def del_stop (sent):
    return [term for term in sent if term not in stop_final]

In [26]:
del_stop(reviews_token[1])

['really',
 'appreciate',
 'quality',
 'timing',
 'tried',
 'thattil',
 'kutti',
 'dosa',
 "'ve",
 'addicted',
 'dosa',
 'really',
 'chutney',
 '...',
 'really',
 'good',
 'money',
 'worth',
 'much',
 'better',
 'thattukada',
 'try']

In [27]:
reviews_clean = [del_stop(sent)for sent in reviews_token]

In [28]:
reviews_clean = [" ".join(sent) for sent in reviews_clean]
reviews_clean[:2]

['service worst pricing menu different bill give bill increased pricing even serving water menu order need call 3-4 times even non busy day',
 "really appreciate quality timing tried thattil kutti dosa 've addicted dosa really chutney ... really good money worth much better thattukada try"]

### Separate X and Y and perform train test split, 70-30

In [29]:
len(reviews_clean)

27748

In [30]:
X = reviews_clean
y = reviews.rating

**Train Split Test**

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)

### Document Term Matrix

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 5000)

In [33]:
len(X_train),len(X_test)

(19423, 8325)

In [34]:
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.fit_transform(X_test)

In [35]:
X_train_bow.shape,X_test_bow.shape

((19423, 5000), (8325, 5000))

### Model Building

In [36]:
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
?RandomForestRegressor

In [37]:
learner_rf =RandomForestRegressor(random_state =42)

In [38]:
%%time
learner_rf.fit(X_train_bow,y_train)

Wall time: 41min 44s


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [41]:
y_train_preds = learner_rf.predict(X_train_bow)

In [42]:
from sklearn.metrics import mean_squared_error

In [43]:
mean_squared_error(y_train, y_train_preds)**0.5

0.2373941606400989

### Increase the number of Trees

In [44]:
learner_rf = RandomForestRegressor(random_state=42, n_estimators=20)


In [46]:
%%time
learner_rf.fit(X_train_bow, y_train)

Wall time: 7min


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=20, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [47]:
y_train_preds =learner_rf.predict(X_train_bow)

In [48]:
mean_squared_error(y_train, y_train_preds)**0.5

0.25058964549391705

**Hyper-parameter tuning**

"class_weights" was one of the many hyperparameters to tune for the SVM.

Let's find the best hyper-parameters for the SVM classifier


In [50]:
from sklearn.model_selection import GridSearchCV

In [51]:
?RandomForestRegressor

Instantiate the learner with a random state

In [53]:
learner_rf = RandomForestRegressor(random_state = 42)

In [54]:
# create the parameter grid based on the results of the random search
param_grid = {
    'max_features':[500,"sqrt","log2","auto"],
    'max_depth':[10,15,20,25]
}

In [58]:
# instantiate the grid search model
grid_search = GridSearchCV(estimator=learner_rf,param_grid =param_grid,
                          cv = 5, n_jobs = -1,scoring= "neg_mean_squared_error")

In [59]:
grid_search.fit(X_train_bow, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=42,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs

In [75]:
grid_search.cv_results_

{'mean_fit_time': array([ 37.07236056,   6.74747071,   4.00827436, 228.57270694,
         58.44549866,   9.62248921,   4.2603394 , 395.91354375,
         89.27541137,  14.44212952,   5.60900841, 564.3533114 ,
        138.09019222,  30.69729834,   8.36399117, 706.30278912]),
 'std_fit_time': array([2.98560762e+00, 8.40883983e-01, 4.36096438e-01, 7.68520732e+00,
        3.18194891e+00, 1.35634405e-01, 1.63885210e-01, 7.27190027e+00,
        8.94719273e-01, 2.43631788e-01, 7.34954741e-02, 1.18547158e+01,
        2.75796845e+01, 6.73438173e+00, 6.90742928e-01, 7.60240893e+01]),
 'mean_score_time': array([0.55898051, 0.46547151, 0.56463108, 0.38365803, 0.41605802,
        0.41518121, 0.41287847, 0.45065603, 0.4473959 , 0.45915704,
        0.41586781, 0.51879048, 0.53820357, 0.60745482, 0.54310079,
        0.82014461]),
 'std_score_time': array([0.16762087, 0.03572358, 0.08233651, 0.05731684, 0.0417932 ,
        0.0178746 , 0.02596757, 0.05728332, 0.0285697 , 0.05432699,
        0.01114991, 

In [65]:
grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=25, max_features=500, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

#### Using the best estimator to make predictions on the test set

In [77]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow)

In [79]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)

In [83]:
mean_squared_error(y_train,y_train_pred)**0.5

0.5856532943121

In [84]:
mean_squared_error(y_test,y_test_pred)**0.5

1.389190296697332

### Identifying mismatch cases

In [85]:
res_df = pd.DataFrame({'review':X_test, 'rating':y_test, 'rating_pred':y_test_pred})

In [86]:
res_df[(res_df.rating - res_df.rating_pred)>=2].shape

(380, 3)

In [87]:
res_df[(res_df.rating - res_df.rating_pred)>2]

Unnamed: 0,review,rating,rating_pred
21935,food so..ooooooo tasty excellent flavour order...,5.0,2.885323
26563,place cozy nice ambience decor cafe really coo...,5.0,2.440255
27478,food one kind best quality taste good dining e...,5.0,2.718563
1427,100 feet road 's new bar one places homes turn...,4.0,1.918229
15350,food awesome like everything tried chole bhatu...,4.5,2.369793
...,...,...,...
5593,excellent north indian food dall makhani reall...,4.5,1.836567
15714,small place nice food decent price try paneer ...,4.0,1.776567
7001,amazing fresh hot tasty worth money,5.0,2.879295
5614,okay start ... loved food love service ... yes...,5.0,2.106383
