# Building a model to predict restaurants' overall rating
In this part of the project, we study whether a Machine Learning model can predict the overall rating of a restaurant. This is going to be a tough task, as we only have 40 different restaurants to predict for. Maybe we can work around this by upsampling our data. Let's see.

## 1. Data Preparation

In [167]:
import numpy as np
import pandas as pd
import json
import pymongo
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [170]:
reviews_df = pd.read_csv('reviews_df.csv')

In [171]:
reviews_df.head()

Unnamed: 0,review,text,processedText,polarity
0,O ambiente é maravilhoso! E o atendimento do g...,The setting is wonderful!And the service of th...,the setting is wonderfuland the service of the...,0.535833
1,Muito bem atendidos pelo Ângelo Atendimento ág...,Very well attended by the Ângelo service agile...,very well attended by the ângelo service agile...,0.31875
2,"Comida fantástica, muito bem servida e ambient...","Fantastic food, very well served and extremely...",fantastic food very well served and extremely ...,0.413333
3,Culinária de frutos do mar impecável. Camarão ...,Good seafood cuisine.Shrimp to the delicious s...,good seafood cuisineshrimp to the delicious se...,0.523333
4,"Nota 10, atendente muito simpática e prestativ...","Note 10, very friendly and helpful attendant, ...",note very friendly and helpful attendant call...,0.346167


One of the first things to do is to find out which restaurant each review belongs to. Let's open the MongoDB Atlas connection again and extract some more information.

In [172]:
# Opening the MongoDB Atlas connection
# The connection string is stored on the third line of a local file
with open('/media/michel/dados/Projects/emails.txt', 'r') as f:
    my_mongo_url = f.read().splitlines()[2]

# Creating a Client
client = pymongo.MongoClient(my_mongo_url, serverSelectionTimeoutMS=5000)
db = client.restaurant_reviews

# Testing connection
db.reviews_I.find_one()

{'_id': ObjectId('6122570ffbd547706864808a'),
 'restaurant_name': 'Camarada Camarão - Shopping Recife',
 'rating': 5.0,
 'number_of_ratings': 3792,
 'review_title': 'Encontro amigos!',
 'review_date': 'Publicada ontem',
 'reviewer_Name': 'karolrevoredoo',
 'review': 'O ambiente é maravilhoso! E o atendimento do garçom Edvaldo é incrível, muito atencioso e solicito! O Gerente Amaro Rocha é muito cordial!',
 'reviews_scores': {'Excellent': 3073,
  'Very Good': 556,
  'Good': 63,
  'Bad': 19,
  'Terrible': 15}}

Let's get as much information as we can so that we can study the correlations. We have the target variable "rating" and the independent variables "number_of_ratings" and "reviews_scores".

In [173]:
# Cursor.count() is deprecated in PyMongo; count_documents() is its replacement
db.reviews_I.count_documents({})

30746

In [174]:
# `reviews` is a PyMongo cursor, not a data frame, so we check the data frame's length instead
len(reviews_df)

30746

As both data structures have the same number of elements, we can just add the new data as new columns.

In [175]:
# Creating some lists to store the data
ratings = []
numbers_of_ratings = []
excelent = []
very_good = []
good = []
bad = []
terrible = []

# Querying the data
reviews = db.reviews_I.find({})

In [176]:
# Appending the data to the lists
for review in db.reviews_I.find({}):
    try:
        ratings.append(review['rating'])
        numbers_of_ratings.append(review['number_of_ratings'])
        excelent.append(review['reviews_scores']['Excellent'])
        very_good.append(review['reviews_scores']['Very Good'])
        good.append(review['reviews_scores']['Good'])
        bad.append(review['reviews_scores']['Bad'])
        terrible.append(review['reviews_scores']['Terrible'])
    except (KeyError, TypeError):
        # Skip documents that lack a field. Lists filled before the missing
        # field was reached still receive a value for this document.
        pass

In [177]:
len(good)

20660

Some of the documents lack the "reviews_scores" information: only 20660 of the 30746 documents have it. Later we will see whether it is better to drop those records or to replace the missing values. Now let's insert this data into our data frame using a join.
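Because the loop above appends field by field, a document without "reviews_scores" can leave the lists with different lengths. One way to keep everything aligned is to build one row per document and fill missing fields with NaN. The sketch below works on plain dictionaries shaped like the document shown earlier; `flatten_review`, `SCORE_KEYS` and the sample `docs` are hypothetical, made up for illustration.

```python
import numpy as np
import pandas as pd

SCORE_KEYS = ['Excellent', 'Very Good', 'Good', 'Bad', 'Terrible']

def flatten_review(doc):
    """Return one flat row per document, with NaN for any missing field."""
    scores = doc.get('reviews_scores') or {}
    row = {
        'number_of_ratings': doc.get('number_of_ratings', np.nan),
        'rating': doc.get('rating', np.nan),
    }
    for key in SCORE_KEYS:
        # 'Very Good' becomes 'very_good', and so on
        row[key.lower().replace(' ', '_')] = scores.get(key, np.nan)
    return row

# Hypothetical sample: the second document lacks "reviews_scores"
docs = [
    {'rating': 5.0, 'number_of_ratings': 3792,
     'reviews_scores': {'Excellent': 3073, 'Very Good': 556,
                        'Good': 63, 'Bad': 19, 'Terrible': 15}},
    {'rating': 4.5, 'number_of_ratings': 120},
]

extra_info_aligned = pd.DataFrame([flatten_review(d) for d in docs])
print(extra_info_aligned)
```

With this layout every document keeps its own row, so the decision to drop or fill the incomplete records can be made later with `dropna` or `fillna`.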

In [178]:
# `columns` takes a flat list; a nested list would create MultiIndex columns
extra_info = pd.DataFrame(data=list(zip(numbers_of_ratings, excelent, very_good, good, bad, terrible, ratings)),
                         columns=['number_of_ratings', 'excelent', 'very_good', 'good', 'bad', 'terrible', 'rating'])

In [179]:
extra_info.head()

Unnamed: 0,number_of_ratings,excelent,very_good,good,bad,terrible,rating
0,3792,3073,556,63,19,15,5.0
1,3792,3073,556,63,19,15,5.0
2,3792,3073,556,63,19,15,5.0
3,3792,3073,556,63,19,15,5.0
4,3792,3073,556,63,19,15,5.0


Now let's join it with the reviews data frame.

In [180]:
reviews_df = reviews_df.join(other=extra_info)

In [181]:
reviews_df = reviews_df.drop(['review', 'text', 'processedText'], axis=1)

In [182]:
reviews_df.columns = ['polarity','number_of_ratings', 'excelent', 'very_good', 'good', 'bad', 'terrible', 'rating']

In [183]:
reviews_df['rating']

0        5.0
1        5.0
2        5.0
3        5.0
4        5.0
        ... 
30741    NaN
30742    NaN
30743    NaN
30744    NaN
30745    NaN
Name: rating, Length: 30746, dtype: float64
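The trailing NaN values follow from how `DataFrame.join` works: by default it is a left join on the index, so rows of `reviews_df` whose index has no counterpart in the shorter `extra_info` get NaN. A toy illustration, where `left` and `right` are made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'polarity': [0.5, 0.3, 0.4]})   # index 0, 1, 2
right = pd.DataFrame({'rating': [5.0, 4.5]})         # index 0, 1

# Left join on the index: row 2 has no match in `right`
joined = left.join(right)
print(joined)

# Rows without a match can be dropped explicitly if they are not wanted
complete = joined.dropna(subset=['rating'])
print(len(complete))
```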

In [184]:
reviews_df.corr().sort_values('rating')

Unnamed: 0,polarity,number_of_ratings,excelent,very_good,good,bad,terrible,rating
good,-0.119903,0.437261,0.252856,0.911264,1.0,0.864641,0.424079,-0.220718
very_good,-0.191867,0.374534,0.175714,1.0,0.911264,0.714537,0.216611,-0.192676
bad,0.079315,0.644453,0.515883,0.714537,0.864641,1.0,0.681376,0.054789
polarity,1.0,0.25563,0.313744,-0.191867,-0.119903,0.079315,0.264954,0.318941
terrible,0.264954,0.90802,0.906891,0.216611,0.424079,0.681376,1.0,0.483392
number_of_ratings,0.25563,1.0,0.978317,0.374534,0.437261,0.644453,0.90802,0.663502
excelent,0.313744,0.978317,1.0,0.175714,0.252856,0.515883,0.906891,0.755893
rating,0.318941,0.663502,0.755893,-0.192676,-0.220718,0.054789,0.483392,1.0


In our small sample, the "rating" variable shows a moderate to strong correlation with some of the independent variables: "excelent" (0.76), "number_of_ratings" (0.66) and "terrible" (0.48). The "polarity" variable reaches a correlation of only 0.32 with the rating, so the sentiment analysis does not seem to be very effective as a predictor.
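The raw score counts grow with the number of ratings (the table shows "excelent" and "number_of_ratings" correlated at 0.98), which may explain why even "terrible" correlates positively with the rating. One option worth trying is to express each count as a share of the total. This is only a sketch of that idea, not something the analysis above does; `add_score_shares` is a hypothetical helper and the second sample row is made up.

```python
import pandas as pd

SCORE_COLUMNS = ['excelent', 'very_good', 'good', 'bad', 'terrible']

def add_score_shares(df):
    """Add one '<column>_share' column per score count (count / row total)."""
    out = df.copy()
    total = out[SCORE_COLUMNS].sum(axis=1)
    for col in SCORE_COLUMNS:
        out[col + '_share'] = out[col] / total
    return out

sample = pd.DataFrame({
    'excelent': [3073, 10],
    'very_good': [556, 20],
    'good': [63, 40],
    'bad': [19, 20],
    'terrible': [15, 10],
})

shares = add_score_shares(sample)
print(shares[['excelent_share', 'terrible_share']].round(3))
```

Rows whose counts sum to zero would produce NaN shares and would need separate handling.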

To build a classifier on this dataset, we first one-hot encode the "rating" target variable, as its values (4.5 and 5.0) are not zeros and ones.

In [144]:
# Cast the rating to string so that get_dummies treats it as categorical
reviews_df["rating"] = reviews_df["rating"].astype(str)
reviews_classifier = pd.get_dummies(reviews_df)
reviews_classifier.head()

Unnamed: 0,polarity,number_of_ratings,excelent,very_good,good,bad,terrible,rating_4.5,rating_5.0
0,0.535833,3792,3073,556,63,19,15,0,1
1,0.31875,3792,3073,556,63,19,15,0,1
2,0.413333,3792,3073,556,63,19,15,0,1
3,0.523333,3792,3073,556,63,19,15,0,1
4,0.346167,3792,3073,556,63,19,15,0,1
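Since only two rating values appear in the output above, the two dummy columns carry the same information (each is the complement of the other). A single 0/1 target is a simpler alternative. The sketch below uses a made-up frame, and the column name `is_five_star` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'polarity': [0.54, 0.32, 0.41],
    'rating': [5.0, 4.5, 5.0],
})

# 1 when the rating is 5.0, 0 otherwise
toy['is_five_star'] = (toy['rating'] == 5.0).astype(int)

X_toy = toy.drop(columns=['rating', 'is_five_star'])
y_toy = toy['is_five_star']
print(y_toy.tolist())
```

A one-dimensional target like `y_toy` also keeps the classification report in the usual single-label format.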


## 2. Building a Decision Tree Classifier
To start the Machine Learning part of this project, let's create a Decision Tree with the data in its raw form (not normalized) to see if we can get some result.

#### 2.1 Setting up the training and testing sets

In [162]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc, classification_report, roc_curve

#### 2.2 Fitting the DTC

In [148]:
# Setting up the X and y variables and splitting the dataset into training and testing sets
X = reviews_classifier.drop(["rating_4.5", "rating_5.0"], axis=1)
y = reviews_classifier[["rating_4.5", "rating_5.0"]]

xtr, xte, ytr, yte = train_test_split(X, y, test_size = 0.2, random_state=1)

In [149]:
dtc = DecisionTreeClassifier(criterion = 'entropy')

In [151]:
dtc.fit(xtr, ytr)

DecisionTreeClassifier(criterion='entropy')

#### 2.3 Evaluating the DTC

In [157]:
dtc_predictions = dtc.predict(xte)

In [161]:
print(classification_report(yte, dtc_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1939
           1       1.00      1.00      1.00      4211

   micro avg       1.00      1.00      1.00      6150
   macro avg       1.00      1.00      1.00      6150
weighted avg       1.00      1.00      1.00      6150
 samples avg       1.00      1.00      1.00      6150
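`confusion_matrix` is imported above but never used. With the two-column indicator target used here, it generally needs each row collapsed back to a single class index first, which `argmax` can do. The toy arrays below stand in for `yte` and `dtc_predictions`; they are made up and do not reproduce the model's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Columns play the role of rating_4.5 and rating_5.0
y_true = np.array([[0, 1], [1, 0], [0, 1], [0, 1]])
y_pred = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])

# argmax turns each one-hot row into a class index (0 or 1)
cm = confusion_matrix(y_true.argmax(axis=1), y_pred.argmax(axis=1))
print(cm)
```

Rows of the matrix are the true classes and columns are the predicted ones.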



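### Seems like the model is perfect

That is probably not a sign of a good model. We only have 40 different restaurants, and every review of a given restaurant carries the same restaurant-level features and the same rating, so the tree can simply memorize each restaurant. To correct this we need more data!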
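One way to check whether the perfect score survives when no restaurant appears in both the training and the testing set is a group-aware split. The sketch below uses scikit-learn's `GroupShuffleSplit` on synthetic data; the `restaurant` column is hypothetical, since the data frame built above does not keep the restaurant name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 10 restaurants with 20 reviews each
rng = np.random.default_rng(0)
n_restaurants, reviews_each = 10, 20
toy = pd.DataFrame({
    'restaurant': np.repeat(np.arange(n_restaurants), reviews_each),
    'polarity': rng.uniform(-1, 1, n_restaurants * reviews_each),
})

# Split by restaurant, not by review
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, test_idx = next(splitter.split(toy, groups=toy['restaurant']))

train_groups = set(toy['restaurant'].iloc[train_idx])
test_groups = set(toy['restaurant'].iloc[test_idx])
print(len(train_groups), len(test_groups))
print(train_groups.isdisjoint(test_groups))
```

A model evaluated on such a split has to predict the rating of restaurants it has never seen, which is closer to the stated goal.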

1.0