### AIM
Yelp is a company that connects people with great local businesses. On the Yelp website you can find recommendations for restaurants, home services, auto services and many other local businesses near you.

The aim of this project is to find the factors that affect a restaurant's Yelp rating, since a restaurant's success depends on its reputation. The project uses Multiple Linear Regression to investigate these factors.

Finally, I would love to own a restaurant in Berlin, Germany someday, so the model will also be used to predict that restaurant's Yelp rating before it even opens.

### DATA

The data used in this project comes from the Yelp dataset at https://www.yelp.com/dataset. It is provided in six files:

- `yelp_business.json`: Contains business data including location data, attributes, and categories.
- `yelp_review.json`: Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
- `yelp_user.json`: Contains user data including the user's friend mapping and all the metadata associated with the user.
- `yelp_checkin.json`: Checkins on a business.
- `yelp_tip.json`: Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
- `yelp_photo.json`: Contains photo data including the caption and classification (one of "food", "drink", "menu", "inside" or "outside").


#### Ready, Set, Go....

In [1]:
#importing pandas 
import pandas as pd

In [None]:
#Loading the files into dataframes
businesses = pd.read_json('data/yelp_business.json', lines = True)
reviews = pd.read_json('data/yelp_review.json', lines = True)
users = pd.read_json('data/yelp_user.json', lines = True)
checkins = pd.read_json('data/yelp_checkin.json', lines = True)
tips = pd.read_json('data/yelp_tip.json', lines = True)
photos = pd.read_json('data/yelp_photo.json', lines = True)

In [None]:
pd.options.display.max_columns = 60
pd.options.display.max_colwidth = 500

#### Inspecting the businesses dataframe

In [None]:
#First 3 rows of the business dataframe
businesses.head(3)

In [None]:
#Business dataframe information
businesses.info()

#### Inspecting the reviews dataframe

In [None]:
reviews.head()

In [None]:
reviews.info()

#### Inspecting the users dataframe

In [None]:
users.head()

In [None]:
users.info()

#### Inspecting the checkins dataframe

In [None]:
checkins.head()

In [None]:
checkins.info()

#### Inspecting the tips dataframe 


In [None]:
tips.head()

In [None]:
tips.info()

#### Inspecting the photos dataframe

In [None]:
photos.head()

How many different businesses are in the dataset? What are the different features in the review DataFrame?

In [None]:
print('Number of different businesses in the dataset:', businesses['business_id'].nunique())

In [None]:
print('Features of the review dataframe:\n', list(reviews.columns))

#### Merging the dataframes on the business_id column present in all six dataframes

In [None]:
#Performing the first merge
df = pd.merge(businesses, reviews, how = 'left', on = 'business_id')
df.head()

In [None]:
#performing the rest of merges
df = pd.merge(df, users, how = 'left', on = 'business_id')
df = pd.merge(df, checkins, how = 'left', on = 'business_id')
df = pd.merge(df, tips, how = 'left', on = 'business_id')
df = pd.merge(df, photos, how = 'left', on = 'business_id')
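
A left merge keeps every row of the left dataframe and fills the right-hand columns with NaN wherever no matching `business_id` exists. The sketch below illustrates this on two tiny made-up frames (`businesses_demo` and `photos_demo` are hypothetical, not part of the Yelp data). It also passes `validate='one_to_one'`, which assumes each table holds at most one row per business and raises an error otherwise.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real dataframes
businesses_demo = pd.DataFrame({'business_id': ['a', 'b', 'c'],
                                'stars': [4.0, 3.5, 5.0]})
photos_demo = pd.DataFrame({'business_id': ['a', 'c'],
                            'number_pics': [12, 3]})

# Left merge: all three businesses survive; 'b' has no photo row
merged_demo = pd.merge(businesses_demo, photos_demo,
                       how='left', on='business_id',
                       validate='one_to_one')
print(merged_demo)
```

Business `b` ends up with a missing `number_pics` value, which is the kind of gap the data cleaning step has to deal with later.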

In [None]:
print(df.columns)

In [None]:
df.info()

#### Data Cleaning

When building the regression model, preference is given to numerical features that may have an effect on the target variable (the Yelp rating). Regression models, and most machine learning models for that matter, expect numerical inputs. So all features that do not satisfy this condition are removed here.

In [None]:
#Filtering out the columns that do not satisfy our condition

cols_to_delete = []
for col in df.columns:
    if df[col].dtype == 'object':
        cols_to_delete.append(col)

In [None]:
cols_to_delete
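
As a possible alternative to the loop above, pandas can select columns by dtype directly. The sketch below uses a made-up frame (`demo` is hypothetical) and `select_dtypes(exclude='number')`, which is slightly broader than the `object` check: it also flags boolean or datetime columns, so the resulting list should be reviewed before anything is dropped.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
demo = pd.DataFrame({'name': ['Cafe A', 'Bar B'],
                     'stars': [4.5, 3.0],
                     'review_count': [120, 45]})

# Names of all columns that are not numeric
non_numeric_cols = demo.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)

# Dropping them leaves only the numeric features
numeric_only = demo.drop(columns=non_numeric_cols)
print(list(numeric_only.columns))
```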

There are other columns that must be deleted as well. Those will be added manually to the list.

In [None]:
cols_to_delete = cols_to_delete + ['is_open', 'latitude', 'longitude']

In [None]:
cols_to_delete

In [None]:
df.drop(cols_to_delete, axis = 1, inplace = True)

In [None]:
df.info()

In [None]:
#Checking for missing values
df.isna().sum()

There are a few columns with missing values. Since our dataset has no information recorded for some businesses in these columns, we will assume the Yelp pages did not display these features. For example, if there is a NaN value for number_pics, it means that the associated business did not have any pictures posted on its Yelp page. Thus we can replace all of our NaNs with 0s.

In [None]:
#Replacing the NaN values with zeros
df.fillna({'weekday_checkins': 0,
          'weekend_checkins': 0,
          'average_tip_length': 0,
          'number_tips': 0,
          'average_caption_length': 0,
          'number_pics': 0}, inplace = True)

In [None]:
df.isna().sum()

#### Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

In [None]:
df.corr()

The features that correlate most strongly with Yelp rating are `average_review_sentiment`, `average_review_length`, and `average_review_age`. `average_review_sentiment` is the average sentiment score for all reviews on a business' Yelp page. The sentiment score for a review was calculated using the sentiment analysis tool VADER. VADER uses a labeled set of positive and negative words, along with codified rules of grammar, to estimate how positive or negative a statement is. Scores range from -1, most negative, to +1, most positive, with a score of 0 indicating a neutral statement. While not perfect, VADER does a good job at guessing the sentiment of text data.

In [None]:
#Correlation between average_review_sentiment and stars
corr = pearsonr(df['average_review_sentiment'], df['stars'])
print(corr[0])
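
For reference, the first element returned by `pearsonr` is the Pearson correlation coefficient. For paired observations $x_i$ (a feature) and $y_i$ (the Yelp rating) with sample means $\bar{x}$ and $\bar{y}$, it is defined as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.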

In [None]:
#Plotting average_review_sentiment against stars
plt.figure(figsize = (10,7))
plt.scatter(df['average_review_sentiment'], df['stars'], alpha = 0.1)
plt.title('average_review_sentiment vs. stars (correlation:' + str(corr[0]) + ')')
plt.xlabel('average review sentiment')
plt.ylabel('Yelp Rating')
plt.show()

In [None]:
#Correlation between average_review_length and stars
corr = pearsonr(df['average_review_length'], df['stars'])
print(corr[0])

In [None]:
#Plotting average_review_length against stars
plt.figure(figsize = (10,7))
plt.scatter(df['average_review_length'], df['stars'], alpha = 0.1)
plt.title('average_review_length vs. stars (correlation:' + str(corr[0]) + ')')
plt.xlabel('average review length')
plt.ylabel('Yelp Rating')
plt.show()

In [None]:
#Correlation between average_review_age and stars
corr = pearsonr(df['average_review_age'], df['stars'])
print(corr[0])

In [None]:
#Plotting average_review_age against stars
plt.figure(figsize = (10,7))
plt.scatter(df['average_review_age'], df['stars'], alpha = 0.1)
plt.title('average_review_age vs. stars (correlation:' + str(corr[0]) + ')')
plt.xlabel('average review age')
plt.ylabel('Yelp Rating')
plt.show()

In [None]:
#Correlation between number_funny_votes and stars
corr = pearsonr(df['number_funny_votes'], df['stars'])
print(corr[0])

In [None]:
#Plotting number_funny_votes against stars
plt.figure(figsize = (10,7))
plt.scatter(df['number_funny_votes'], df['stars'], alpha = 0.1)
plt.title('number_funny_votes vs. stars (correlation:' + str(corr[0]) + ')')
plt.xlabel('number funny votes')
plt.ylabel('Yelp Rating')
plt.show()

### Data Selection & Model Building
In order to put our data into a Linear Regression model, we need to separate the features we want to model on from the Yelp ratings. From our correlation analysis we saw that the three features with the strongest correlations to Yelp rating are `average_review_sentiment`, `average_review_length`, and `average_review_age`. Since we want to dig a little deeper than `average_review_sentiment`, which understandably has a very high correlation with Yelp rating, let's create our first model with `average_review_length` and `average_review_age` as features.

In [None]:
features = df[['average_review_length', 'average_review_age']]
ratings = df['stars']

In [None]:
#Splitting the data into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)

In [None]:
#Fitting the model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

#### Evaluating and understanding the model.

The model can be evaluated using the `.score()` method, which returns the R^2 value for our model. R^2 is the coefficient of determination: a measure of how much of the variance in the dependent variable (the Yelp rating) is explained by the independent variables (the features of the data). R^2 values typically range from `0` to `1`, with `0` indicating that the model explains none of the variance and `1` indicating that the model fits the data perfectly.
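
To make the definition concrete, here is a minimal sketch that computes R^2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the toy ratings are hypothetical and only serve as illustration; the notebook itself relies on `.score()`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot

# Toy ratings, made up for illustration
y_true = [3.0, 4.0, 5.0, 2.0]

perfect = r_squared(y_true, y_true)                    # predictions match exactly
mean_only = r_squared(y_true, [3.5] * 4)               # always predict the mean
worse = r_squared(y_true, [5.0, 2.0, 2.0, 5.0])        # worse than the mean
print(perfect, mean_only, worse)
```

Perfect predictions give 1 and always predicting the mean gives 0. Predictions that are worse than the mean give a negative value, which can happen when a model is scored on data it was not trained on.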

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
#Printing the coefficients of the features
sorted(list(zip(['average_review_length','average_review_age'],model.coef_)),key = lambda x: abs(x[1]),reverse=True)

Lastly we can calculate the predicted Yelp ratings for our testing data and compare them to their actual Yelp ratings. 

In [None]:
y_predicted = model.predict(X_test)

In [None]:
plt.figure(figsize = (10,7))
plt.scatter(y_test,y_predicted)
plt.xlabel('Yelp Rating')
plt.ylabel('Predicted Yelp Rating')
plt.ylim(1,5)
plt.show()

#### Define different Subsets of Data

After evaluating the first model, you can see that average_review_length and average_review_age alone are not the best predictors for Yelp rating. Let's go do some more modeling with different subsets of features and see if we can achieve a more accurate model.

In [None]:
# subset of only average review sentiment
sentiment = ['average_review_sentiment']

# subset of all features that have a response range [0,1]
binary_features = ['alcohol?','has_bike_parking','takes_credit_cards','good_for_kids','take_reservations','has_wifi']

# subset of all features that vary on a greater range than [0,1]
numeric_features = ['review_count','price_range','average_caption_length','number_pics','average_review_age','average_review_length','average_review_sentiment','number_funny_votes','number_cool_votes','number_useful_votes','average_tip_length','number_tips','average_number_friends','average_days_on_yelp','average_number_fans','average_review_count','average_number_years_elite','weekday_checkins','weekend_checkins']

# all features
all_features = binary_features + numeric_features

In [None]:
import numpy as np

# take a list of features to model as a parameter
def model(feature_list):
    
    # define ratings and features, with the features limited to our chosen subset of data
    ratings = df.loc[:,'stars']
    features = df.loc[:,feature_list]
    
    # perform train, test, split on the data
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
    
    
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1,1)
        X_test = np.array(X_test).reshape(-1,1)
    
    # create and fit the model to the training data
    model = LinearRegression()
    model.fit(X_train,y_train)
    
    # print the train and test scores
    print('Train Score:', model.score(X_train,y_train))
    print('Test Score:', model.score(X_test,y_test))
    
    # print the model features and their coefficients, sorted by absolute coefficient size (largest first)
    print(sorted(list(zip(feature_list,model.coef_)),key = lambda x: abs(x[1]),reverse=True))
    
    # calculate the predicted Yelp ratings from the test data
    y_predicted = model.predict(X_test)
    
    # plot the actual Yelp Ratings vs the predicted Yelp ratings for the test data
    plt.scatter(y_test,y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1,5)
    plt.show()
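
One caveat when reading the sorted coefficient list: a coefficient's size depends on the units of its feature, so ranking by raw coefficients can be misleading when features live on very different scales. The sketch below uses synthetic data (the variables `length` and `sentiment` and all numbers are made up) and plain NumPy least squares in place of `LinearRegression`, purely to stay self-contained. It shows that the ranking can flip once the features are standardised.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales (hypothetical stand-ins)
length = rng.normal(600, 200, n)     # e.g. characters per review
sentiment = rng.normal(0.5, 0.2, n)  # e.g. a score between -1 and 1
y = 3 + 0.002 * length + 1.0 * sentiment + rng.normal(0, 0.1, n)

def fit_coefs(X, y):
    """Ordinary least squares with an intercept; returns the slopes only."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X = np.column_stack([length, sentiment])
raw_coefs = fit_coefs(X, y)

# Standardise each feature to zero mean and unit variance, then refit
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_coefs = fit_coefs(X_scaled, y)

print('raw coefficients:   ', raw_coefs)
print('scaled coefficients:', scaled_coefs)
```

On the raw scale the `sentiment` coefficient is far larger, but after standardising, `length` has the larger coefficient because it varies much more in its own units. Comparing coefficients is therefore more meaningful when the features share a common scale.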

In [None]:
#creating model on only sentiment

model(sentiment)

In [None]:
#creating a model with binary features
model(binary_features)

In [None]:
#creating a model with numeric features
model(numeric_features)

In [None]:
#creating a model with all features
model(all_features)

#### Own Restaurant
Should we open our own restaurant, what would its rating be on Yelp? The regression model that used all the features produced the best results, so this is the model that will be used here.

In [None]:
features = df.loc[:, all_features]
ratings = df.loc[:, 'stars']
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
#Using a new name so the model() helper defined above is not overwritten
final_model = LinearRegression()
final_model.fit(X_train, y_train)

In [None]:
#Example features for the restaurant, in the same order as all_features
new_restaurant = np.array([0,1,1,1,1,1,10,2,3,10,10,1200,0.9,3,6,5,50,3,50,1800,12,123,0.5,0,0]).reshape(1,-1)

In [None]:
#Predicted rating
final_model.predict(new_restaurant)
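
The prediction above passes the feature values as a bare positional array, so each number must sit in exactly the same position as its column in `all_features`. A less error-prone option, sketched here with a shortened, hypothetical feature list, is to write the values by name and let pandas put them in the training order.

```python
import pandas as pd

# Hypothetical, shortened feature list; the notebook's all_features has 25 entries
feature_order = ['has_wifi', 'price_range', 'review_count']

# Values keyed by feature name, written in any order
new_restaurant_values = {'review_count': 10, 'has_wifi': 1, 'price_range': 2}

# One-row dataframe with columns forced into the training order
new_restaurant_df = pd.DataFrame([new_restaurant_values])[feature_order]
print(new_restaurant_df)
```

Selecting with `[feature_order]` raises a `KeyError` if a feature was forgotten, which catches mistakes that a positional array would silently accept. The resulting one-row dataframe can be passed to `.predict()` in place of the array.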

Using arbitrary values for features of the new restaurant, we got a rating of 4!