<br>

# Predicting Business Ratings with Yelp Dataset

### General Assembly - Data Science Course - March 2018
#### Lucien Rey


 
<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Yelp_Logo.svg" style="float: left; height: 106px">

 
<img src="yelp_mainpage.png" style="align:bottom;left:30;height:500px">

# Table of Contents
---
1. Project Idea
2. Dataset
3. EDA
4. Initial Models
5. NLP
6. Conclusion


# 1. Project Idea
---

- Predict ratings for businesses listed on Yelp
- Ratings of 1-3 are labelled "BAD" and ratings of 4-5 are labelled "GOOD".
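
A minimal sketch of this target definition, assuming the business table has a `stars` column holding the average rating. The toy data and the `target` column name are illustrative, and the slides do not say how half-star ratings such as 3.5 are assigned, so this sketch treats everything below 4 as BAD.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the business table; `stars` is an assumed column name.
businesses = pd.DataFrame({"stars": [1.0, 2.5, 3.0, 3.5, 4.0, 5.0]})

# Ratings of 4 and above are labelled GOOD, everything below is BAD.
businesses["target"] = np.where(businesses["stars"] >= 4, "GOOD", "BAD")

print(businesses)
```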

## Plan of action
1. Focus on the dataset
2. Add more variables, such as specific categories, open during the weekend, tips, etc.
3. Add sentiment analysis on reviews and tips
4. Compare different models

# 2. The Dataset

# 2.1 The Dataset
---

- All available on the Yelp website and on Kaggle
    - https://www.yelp.com/dataset
    - https://www.kaggle.com/yelp-dataset/yelp-dataset
- Large dataset available in JSON, SQL and CSV formats
- Used 5 different .csv files:
    - yelp_business.csv: id, name, postcode, latitude, review count and categories.
    - yelp_business_attributes.csv: attributes of each business (dogs allowed, Wi-Fi available, etc.)
    - yelp_tip.csv: all the tips, which are short reviews left by customers.
    - yelp_business_hours.csv: the opening hours of each business.
    - yelp_review.csv: all 5 million reviews written by users.


<img src="yelp_reviews.png" style="align:bottom;left:30;height:106px">
        


In [54]:
import pandas as pd

# Folder holding the Yelp CSV files
DATA_DIR = '/Users/lucienrey/Documents/2018/General Assembly/Final Project - TEST/yelp-dataset/'
yelp_business = pd.read_csv(DATA_DIR + 'yelp_business.csv')
# 'Na' marks missing attribute values; low_memory=False reads the file in one pass so column dtypes are inferred consistently
yelp_business_attributes = pd.read_csv(DATA_DIR + 'yelp_business_attributes.csv', na_values='Na', low_memory=False)
yelp_tip = pd.read_csv(DATA_DIR + 'yelp_tip.csv')
yelp_business_hours = pd.read_csv(DATA_DIR + 'yelp_business_hours.csv')
yelp_review = pd.read_csv(DATA_DIR + 'yelp_review.csv')

# Data Shape
---


In [57]:
print("The shape of yelp_business.csv:", yelp_business.shape)
print("The shape of yelp_business_attributes.csv:", yelp_business_attributes.shape)
print("The shape of yelp_tip.csv:", yelp_tip.shape)
print("The shape of yelp_business_hours.csv:", yelp_business_hours.shape)
print("The shape of yelp_review.csv:", yelp_review.shape)

The shape of yelp_business.csv: (174567, 13)
The shape of yelp_business_attributes.csv: (152041, 82)
The shape of yelp_tip.csv: (1098324, 5)
The shape of yelp_business_hours.csv: (174567, 8)
The shape of yelp_review.csv: (5261668, 9)


# 2.2 Data Cleanup
---

- Kept only businesses that are open
- Added the attributes (82 columns), reduced using an NA threshold of 5000
- Grouped the different categories (54600) using a threshold of 1000 and added them back to the dataset
- Added the number of tips for each business
- Added a dummy variable indicating whether the business is open on weekends
- Transformed the target to Good/Bad

Dataset Shape = (146702, 165)
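
The cleanup steps above can be sketched on toy data as follows. This is not the project's actual code: the column names (`is_open`, `saturday`, `sunday`), the `;` category separator, the `"None"` marker for closed days, and the reading of "NA threshold" as a minimum number of non-missing values per column are all assumptions, and the thresholds are scaled down to fit four rows.

```python
import pandas as pd

# Toy stand-ins for the raw tables (column names and formats are assumptions).
business = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "is_open": [1, 1, 0, 1],
    "categories": ["Restaurants;Bars", "Restaurants", "Shopping", "Bars;Nightlife"],
})
attributes = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "WiFi": ["free", None, "no", "free"],
    "DogsAllowed": [None, None, None, "True"],
})
tips = pd.DataFrame({"business_id": ["a", "a", "d"], "text": ["Great", "Busy", "Loud"]})
hours = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "saturday": ["10:0-22:0", "None", "None", "18:0-2:0"],
    "sunday": ["None", "None", "9:0-17:0", "None"],
})

NA_THRESHOLD = 2        # the slides use 5000 on the full data
CATEGORY_THRESHOLD = 2  # the slides use 1000 on the full data

# 1. Keep only open businesses.
df = business[business["is_open"] == 1].copy()

# 2. Keep attribute columns with at least NA_THRESHOLD non-missing values, then merge.
kept_attributes = attributes.dropna(axis=1, thresh=NA_THRESHOLD)
df = df.merge(kept_attributes, on="business_id", how="left")

# 3. One dummy column per category that appears in enough businesses.
dummies = df["categories"].str.get_dummies(sep=";")
dummies = dummies.loc[:, dummies.sum() >= CATEGORY_THRESHOLD]
dummies.columns = dummies.columns.str.lower()
df = df.join(dummies.add_prefix("categories"))

# 4. Number of tips per business (0 when a business has none).
tip_counts = tips.groupby("business_id").size().rename("tip_count").reset_index()
df = df.merge(tip_counts, on="business_id", how="left")
df["tip_count"] = df["tip_count"].fillna(0).astype(int)

# 5. Dummy: open on Saturday or Sunday.
hours["open_weekend"] = (hours[["saturday", "sunday"]] != "None").any(axis=1).astype(int)
df = df.merge(hours[["business_id", "open_weekend"]], on="business_id", how="left")

print(df)
```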

In [58]:
# import all packages
import numpy as np
import pandas as pd
import numba.cuda
from numba import jit
import cython
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="ticks", color_codes=True)

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from itertools import combinations
from collections import defaultdict
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')
# Display up to 500 columns
pd.set_option('display.max_columns', 500)

del yelp_business 
del yelp_business_attributes 
del yelp_tip 
del yelp_business_hours 
del yelp_review 

yelp_business = pd.read_csv('/Users/lucienrey/Documents/2018/General Assembly/Final Project - TEST/yelp-dataset/data_before_EDA.csv')

# 3. EDA

# 3.1 EDA - Star Ratings
---

<img src="stars.png" style="align:bottom;left:30;height:400px">

# 3.2 EDA - Reviews and Star Ratings
---

<img src="distri.png" style="align:bottom;left:30;height:400px">

# 3.3 EDA - Categories and Star Ratings
---

<img src="categories.png" style="align:bottom;left:30;height:400px">

# 4. Initial Models


# 4.1 Initial Models
---

<img src="model1.png" style="align:bottom;left:30;height:400px">
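
A model comparison of this kind can be sketched with cross-validated accuracy. The classifiers below are the ones imported earlier in the notebook, but which models and settings produced the figure is not stated, so the hyperparameters and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix and the GOOD/BAD target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scale-sensitive models get a StandardScaler in front; tree models do not need it.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```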

# 5. NLP


# 5.1 NLP
---

- Analysis of the reviews (5m reviews) and tips (1m tips)
- Used the nltk package with its VADER-based SentimentIntensityAnalyzer
    - It analyses each text and provides a score between 0 and 1 for each of the Negative, Neutral and Positive impressions.
    - The Compound score is a normalized sum of the word-level valence scores, ranging from -1 (most negative) to +1 (most positive).
- The review file also provides the number of useful / funny / cool votes received by each review. Added these variables to the main dataset.

Dataset Shape = (146702, 176)
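
One way to turn per-text sentiment into business-level features is sketched below. The output column names follow the features listed later in the importance table (`pos_review`, `comp_review`, ...), but averaging the scores per business is an assumption, and both helper functions are hypothetical rather than taken from the project.

```python
import pandas as pd


def vader_scorer():
    """Return NLTK's VADER scoring function (needs nltk and its 'vader_lexicon' data)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores


def add_sentiment(businesses, texts, scorer, suffix="review"):
    """Score each text, average the scores per business, and merge them in."""
    scores = pd.DataFrame([scorer(text) for text in texts["text"]], index=texts.index)
    scores = scores.rename(columns={
        "neg": f"neg_{suffix}",
        "neu": f"neu_{suffix}",
        "pos": f"pos_{suffix}",
        "compound": f"comp_{suffix}",
    })
    scores["business_id"] = texts["business_id"]
    per_business = scores.groupby("business_id").mean().reset_index()
    return businesses.merge(per_business, on="business_id", how="left")


# Usage with VADER (the lexicon is downloaded once):
#   import nltk; nltk.download("vader_lexicon")
#   enriched = add_sentiment(yelp_business, yelp_review, vader_scorer())
#   enriched = add_sentiment(enriched, yelp_tip, vader_scorer(), suffix="tip")
```

Passing the scorer in as an argument keeps the aggregation independent of NLTK, so it can be tried on a small sample with any function that returns the same four keys.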

# 5.2 Examples
---

“VADER is VERY SMART, handsome, and FUNNY.”<br>
-> {'neg': 0.0, 'neu': 0.246, 'pos': **0.754**, 'compound': 0.9227}


“VADER is not smart, handsome, nor funny.”<br>
-> {'neg': **0.646**, 'neu': 0.354, 'pos': 0.0, 'compound': -0.7424}

“Make sure you :) or :D today!”<br>
-> {'neg': 0.0, 'neu': 0.294, 'pos': **0.706**, 'compound': 0.8633}
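
The compound values above fall between -1 and +1 because VADER squashes the summed word valences with x / sqrt(x² + α), using α = 15 in its reference implementation. The sketch below re-implements only that normalization step; the input sums are made up for illustration.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded valence sum into [-1, 1], as VADER does for `compound`."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


for valence_sum in (-4.0, 0.0, 1.0, 4.0):
    print(valence_sum, round(normalize(valence_sum), 4))
```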

# 5.3 Final Accuracy Results
---

<img src="Untitled.gif" style="align:bottom;left:30;height:400px">

# 6. Conclusion


# 6.1 Confusion Matrix
---


<img src="confusion_1.png" style="float: left; height: 300px">
<img src="confusion_2.png" style="float: left; height: 300px">
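
For readers who want to reproduce this kind of plot, a minimal sketch of computing a confusion matrix for the BAD/GOOD target with scikit-learn follows. The labels and predictions are made-up toy values, not results from the project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["BAD", "GOOD"]
y_true = ["GOOD", "GOOD", "BAD", "BAD", "GOOD", "BAD"]
y_pred = ["GOOD", "BAD", "BAD", "GOOD", "GOOD", "BAD"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = accuracy_score(y_true, y_pred)

print(matrix)
print(f"accuracy = {accuracy:.3f}")
```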

# 6.2 Train Test 
---

<img src="accu_1.png" style="float: left; height: 80px">   <br><br>
of accuracy in predicting the rating of a Yelp business on the **train** dataset.


<br><br>



<img src="accu_2.png" style="float: left; height: 80px"> <br><br>
of accuracy in predicting the rating of a Yelp business on the **test** dataset.

# 6.3 Importance Score
---

Below are the features that are the most important for our model:

**Features**|**Importance Score**
:-----:|:-----:
pos\_review|0.298164
comp\_review|0.228329
neg\_review|0.146903
neu\_review|0.090697
comp\_tip|0.019001
longitude|0.016482
latitude|0.016118
categoriesrestaurants|0.015643
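
A ranking like the table above can be produced from a fitted tree ensemble. The slides do not name the model behind these scores, so the random forest here is an assumption; the data is synthetic and the feature names are borrowed from the table for readability only, so the printed ranking will not match it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; names are reused from the table purely as labels.
feature_names = ["pos_review", "comp_review", "neg_review", "neu_review", "comp_tip", "longitude"]
X, y = make_classification(
    n_samples=300, n_features=len(feature_names), n_informative=3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```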


# 6.4 Conclusion
---


- Adding NLP improved our model's accuracy from 0.67 to 0.82, so review sentiment has a strong influence on the predictions.
- GridSearch and Lasso each improve the model slightly, but all models reach around 80% accuracy with NLP.
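
The tuning step can be sketched as a grid search over an L1-penalised ("Lasso"-style) logistic regression. Reading "Lasso" as an L1 penalty on a classifier is an assumption, as are the `C` grid and the synthetic data; the project's actual search space is not given in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the final feature matrix and GOOD/BAD target.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling inside the pipeline keeps each cross-validation fold free of leakage.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best C:", search.best_params_["logisticregression__C"])
print("train accuracy:", round(search.score(X_train, y_train), 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Comparing the train and test scores of the tuned model is a quick check for overfitting before reporting a final accuracy.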

# Thank You!
<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Yelp_Logo.svg" style="height: 200px">

<br><br>

Lucien REY <br>
https://github.com/lucienrey/Yelp
