# Project: Yelp Rating Regression Predictor

The restaurant industry is tougher than ever, with restaurant reviews blazing across the Internet from day one of a restaurant's opening. But as a lover of food, you and your friend decide to break into the industry and open up your own restaurant, Danielle's Delicious Delicacies. Since a restaurant's success is highly correlated with its reputation, you want to make sure Danielle's Delicious Delicacies has the best reviews on the most queried restaurant review site: Yelp! While you know your food will be delicious, you think there are other factors that play into a Yelp rating and will ultimately determine your business's success. With a dataset of different restaurant features and their Yelp ratings, you decide to use a Multiple Linear Regression model to investigate what factors most affect a restaurant's Yelp rating and predict the Yelp rating for your restaurant!

In this project we'll be working with a real dataset provided by Yelp. We have provided six files, listed below with a brief description:
* `yelp_business.json`: establishment data regarding location and attributes for all businesses in the dataset
* `yelp_review.json`: Yelp review metadata by business
* `yelp_user.json`: user profile metadata by business
* `yelp_checkin.json`: online checkin metadata by business
* `yelp_tip.json`: tip metadata by business
* `yelp_photo.json`: photo metadata by business

For a more detailed explanation of the features in each `.json` file, see the accompanying [explanatory feature document](https://docs.google.com/document/d/1V6FjJpKspVBOOBs4E7fBfp_yzHn0--XJkC2uUtWuRgM/edit).

Let's get started by exploring the data in each of these files to see what we are working with.

# Load the Data:

In [1]:
# import pandas library
import pandas as pd

In [2]:
#load yelp_business.json into dataframe
businesses = pd.read_json('/Users/jonathanmatsen/Documents/3_Projects/yelp_regression_project/yelp_business.json', lines = True)

#load yelp_review.json into dataframe
reviews = pd.read_json('/Users/jonathanmatsen/Documents/3_Projects/yelp_regression_project/yelp_review.json', lines=True)

#load yelp_user.json into dataframe
users = pd.read_json('/Users/jonathanmatsen/Documents/3_Projects/yelp_regression_project/yelp_user.json', lines=True)

#load yelp_checkin.json into dataframe
checkins = pd.read_json('/Users/jonathanmatsen/Documents/3_Projects/yelp_regression_project/yelp_checkin.json', lines=True)

#load yelp_tip.json into dataframe
tips = pd.read_json('/Users/jonathanmatsen/Documents/3_Projects/yelp_regression_project/yelp_tip.json', lines=True)

#load yelp_photo.json into dataframe
photos = pd.read_json('/Users/jonathanmatsen/Documents/3_Projects/yelp_regression_project/yelp_photo.json', lines=True)

ValueError: Expected object or value
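
The `ValueError: Expected object or value` above is the kind of error `pd.read_json` can raise when it does not find readable JSON at the given location, for example when an absolute path copied from another machine does not exist locally. Below is a minimal sketch of a more defensive loader. The helper name `load_json_lines` and the two-row sample file are hypothetical and only stand in for the real Yelp files.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_json_lines(path):
    """Load a JSON Lines file, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return pd.read_json(path, lines=True)


# Write a tiny stand-in file (one JSON object per line) and load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_business.json"
    sample_path.write_text(
        '{"business_id": "a1", "stars": 4.0}\n{"business_id": "b2", "stars": 3.5}\n',
        encoding="utf-8",
    )
    sample = load_json_lines(sample_path)

print(sample.shape)
```

Checking the path first turns a confusing parser error into an explicit "file not found" message.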

# Explore the data:

### Set dataframe parameters

In [None]:
# adjust max columns to display as 60 
pd.options.display.max_columns = 60

# adjust max column width to 500 characters 
pd.options.display.max_colwidth = 500

### Getting familiar 

In [None]:
# overview of businesses dataframe 
businesses.head()

In [None]:
# overview of reviews dataframe 
reviews.head()

In [None]:
# overview of users dataframe 
users.head()

In [None]:
# overview of checkins dataframe 
checkins.head()

In [None]:
# overview of tips dataframe 
tips.head()

In [None]:
# overview of photos dataframe 
photos.head()

### How many different businesses are in the dataset? What are the different features in the review DataFrame?

There are 188,593 businesses. 
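
One way to arrive at a count like this is to compare the number of rows with the number of unique `business_id` values. The sketch below uses a small made-up frame (`toy_businesses` is hypothetical) rather than the real data:

```python
import pandas as pd

# Made-up stand-in for the businesses DataFrame
toy_businesses = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "name": ["Cafe A", "Diner B", "Bistro C"],
})

n_rows = len(toy_businesses)                          # number of rows
n_unique = toy_businesses["business_id"].nunique()    # number of distinct businesses
print(n_rows, n_unique)

# The column names list the features available in a DataFrame
print(list(toy_businesses.columns))
```

If the two numbers match, each row describes a distinct business.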

### Diving Deeper

In [None]:
# dive deeper into structure of data
businesses.info()

#### What is the range of values for the features in the user DataFrame?

* **Number Friends:** _1 - 4219_
* **Days on Yelp:** _76 - 4860_
* **Number Fans:** _0 - 1174.66_
* **Review Count:** _0.66 - 6335_
* **Number of Year Elite:** _0 - 10.66_ 

_All numbers are averages._ 

### Summary Statistics of Dataframes

In [None]:
# review summary statistics of users  

users.describe()

#### What feature, or column, do the DataFrames have in common?

* Business ID

# Merge the Data:

In [None]:
# merge businesses with reviews on business_id into new dataframe 'df'
df = pd.merge(businesses, reviews, how='left', on='business_id')
print(len(df))

Merge each of the other 4 DataFrames into our new DataFrame `df` to combine all the data together. Make sure that `df` is the left DataFrame in each merge and that `how='left'`, since not every DataFrame includes every business in the dataset (this way we won't lose any data during the merges). Once combined, print out the columns of `df`. What features are in this new DataFrame?

In [None]:
# merge users with df and print length of df to ensure accurate number of records during merge
df = pd.merge(df, users, how='left', on='business_id')
print(len(df))

In [None]:
# merge checkins with df and print length of df to ensure accurate number of records during merge
df = pd.merge(df, checkins, how='left', on='business_id')
print(len(df))

In [None]:
# merge tips with df and print length of df to ensure accurate number of records during merge 
df = pd.merge(df, tips, how='left', on='business_id')
print(len(df))

In [None]:
# merge photos with df and print length of df to ensure accurate number of records during merge
df = pd.merge(df, photos, how='left', on='business_id')
print(len(df))
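
To see why `how='left'` keeps every business, here is a sketch on two made-up frames (the names `toy_df` and `toy_checkins` and their values are illustrative only). It also passes the `validate` argument of `pd.merge`, which raises an error if `business_id` turns out to be duplicated on either side:

```python
import pandas as pd

toy_df = pd.DataFrame({"business_id": ["a1", "b2", "c3"], "stars": [4.0, 3.5, 5.0]})
# "b2" has no checkin record
toy_checkins = pd.DataFrame({"business_id": ["a1", "c3"], "weekday_checkins": [12, 7]})

merged = pd.merge(
    toy_df, toy_checkins, how="left", on="business_id", validate="one_to_one"
)

print(len(merged))                               # row count of the left frame is preserved
print(merged["weekday_checkins"].isna().sum())   # unmatched businesses get NaN
```

The unmatched rows are the reason the cleaning step later has to deal with missing values.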

### Quality Check:

In [None]:
# verify number of columns in df = 40 
len(df.columns)

In [None]:
# verify all expected column names present in final merged df 
df.columns

### Explore merged df

In [None]:
# overview of df 
df.head()

# Clean the Data:

In [None]:
# create list of non-continuous and non-binary features to remove
features_to_remove = ['address','attributes','business_id','categories','city','hours','is_open','latitude','longitude','name','neighborhood','postal_code','state','time'] 

# remove unwanted attributes 
df.drop(features_to_remove, axis=1, inplace=True)

In [None]:
#check for missing values
df.isna().any()

In [None]:
# replace missing values by replacing NaNs w/ 0s 
df.fillna({'weekday_checkins': 0, 
           'weekend_checkins': 0, 
           'average_tip_length': 0, 
           'number_tips': 0, 
           'average_caption_length': 0,
           'number_pics': 0},
          inplace=True)

In [None]:
# double check for any remaining missing values 
df.isna().any()
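
As a sketch of what the fill-and-recheck step does, the example below applies the same pattern to a made-up frame (`toy` and its values are illustrative). Using `.sum()` instead of `.any()` reports how many values are missing per column rather than only whether any are:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "stars": [4.0, 3.5, 5.0],
    "number_tips": [2.0, np.nan, 1.0],
    "number_pics": [np.nan, np.nan, 3.0],
})

missing_before = toy.isna().sum()                        # NaN count per column
toy = toy.fillna({"number_tips": 0, "number_pics": 0})   # fill only the named columns
missing_after = toy.isna().sum()

print(missing_before)
print(missing_after)
```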

# Exploratory Analysis:



In [None]:
df.corr()

* alcohol: -0.043332
* good_for_kids: -0.030382
* has_bike_parking: 0.068084
* has_wifi: -0.039857
* price_range: -0.052565
* review_count: 0.032413
* **average_review_age: -0.125645**
* **average_review_length: -0.277081**
* **average_review_sentiment: 0.782187**
* average_days_on_yelp: -0.038061
* average_number_fans: -0.031141
* average_review_count: -0.066572
* average_years_elite: -0.064419
* average_tip_length: -0.052899
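
A list like the one above can be pulled out of the correlation matrix by selecting the `stars` column and sorting it by strength. This sketch uses synthetic data (the frame `toy` and the relationship between its columns are invented for illustration), so the numbers it prints are unrelated to the Yelp values listed above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
sentiment = rng.uniform(-1, 1, n)

toy = pd.DataFrame({
    "average_review_sentiment": sentiment,
    "average_review_length": rng.uniform(100, 1000, n),    # unrelated to stars by construction
    "stars": 3 + 1.5 * sentiment + rng.normal(0, 0.3, n),  # driven by sentiment plus noise
})

# Correlation of every feature with stars, excluding stars itself
stars_corr = toy.corr()["stars"].drop("stars")

# Order by absolute value so strong negative correlations are not buried
ranked = stars_corr.reindex(stars_corr.abs().sort_values(ascending=False).index)
print(ranked)
```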


To further visualize these relationships, we can plot certain features against our dependent variable, the Yelp rating. In the cell below we have provided the code to import Matplotlib. We can use Matplotlib's `.scatter()` method with the below syntax to plot what these correlations look like:

```python
plt.scatter(x_values_to_plot, y_values_to_plot, alpha=blending_val)
```

* `x_values_to_plot` are the values to be plotted along the x-axis
* `y_values_to_plot` are the values to be plotted along the y-axis
* `alpha=blending_val` is the blending value, or how transparent (0) or opaque (1) a plotted point is. This will help us distinguish areas of the plot with high point densities and low point densities

Plot the three features that correlate most with Yelp rating (`average_review_sentiment`, `average_review_length`, `average_review_age`) against `stars`, our Yelp rating. Then plot a weakly correlated feature, such as `number_funny_votes`, against `stars`.

>What is `average_review_sentiment`, you ask? `average_review_sentiment` is the average sentiment score for all reviews on a business' Yelp page. The sentiment score for a review was calculated using the sentiment analysis tool [VADER](https://github.com/cjhutto/vaderSentiment). VADER uses a labeled set of positive and negative words, along with codified rules of grammar, to estimate how positive or negative a statement is. Scores range from `-1`, most negative, to `+1`, most positive, with a score of `0` indicating a neutral statement. While not perfect, VADER does a good job at guessing the sentiment of text data!

What kind of relationships do you see from the plots? Do you think these variables are good or bad features for our Yelp rating prediction model?

In [None]:
from matplotlib import pyplot as plt

# plot average_review_sentiment against stars here
plt.scatter(df.average_review_sentiment, df.stars, alpha=0.1)
plt.show()

In [None]:
# plot average_review_length against stars here
plt.scatter(df.average_review_length, df.stars, alpha=0.1)
plt.show()

In [None]:
# plot average_review_age against stars here
plt.scatter(df.average_review_age, df.stars, alpha=0.1)

In [None]:
# plot number_funny_votes against stars here
plt.scatter(df.number_funny_votes, df.stars, alpha=0.1)

####  Why do you think `average_review_sentiment` correlates so well with Yelp rating?

* People are emotional beings and feed off of what other people previously felt. They trust their opinions. 

## Data Selection

In order to put our data into a Linear Regression model, we need to **separate out our features to model on and the Yelp ratings**. 

From our correlation analysis we saw that the three features with the strongest correlations to Yelp rating are `average_review_sentiment`, `average_review_length`, and `average_review_age`. 

Since we want to dig a little deeper than `average_review_sentiment`, which understandably has a very high correlation with Yelp rating, let's choose to create our first model with `average_review_length` and `average_review_age` as features.

Pandas lets us select one column of a DataFrame with the following syntax:

```python
subset_of_data = df['feature_to_select']
```
Pandas also lets us select multiple columns from a DataFrame with this syntax:

```python
subset_of_data = df[list_of_features_to_select]
```
Create a new DataFrame `features` that contains the columns we want to model on: `average_review_length` and `average_review_age`. Then create another DataFrame `ratings` that stores the value we want to predict, Yelp rating, or `stars` in `df`.

In [None]:
features = df[['average_review_length', 'average_review_age']]

ratings = df['stars']


## Split the Data into Training and Testing Sets

We are just about ready to model! But first, we need to **break our data into a training set and a test set so we can evaluate how well our model performs**. 

We'll use scikit-learn's `train_test_split` function to do this split, which is provided in the cell below. This function takes **two required parameters: the data, or our features, followed by our dependent variable, in our case the Yelp rating**. 

Set the optional parameter `test_size` to be `0.2`. Finally, set the optional parameter `random_state` to `1`. This will make it so your data is split in the same way as the data in our solution code. 

Remember, **this function returns 4 items in this order**:
1. The training data (features), which we can assign to `X_train`
2. The testing data (features), which we can assign to `X_test`
3. The training dependent variable (Yelp rating), which we can assign to `y_train`
4. The testing dependent variable (Yelp rating), which we can assign to `y_test`
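
The return order can be checked on a tiny made-up dataset (`toy_features` and `toy_ratings` are illustrative). With 10 rows and `test_size=0.2`, 8 rows should land in the training set and 2 in the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_features = pd.DataFrame({
    "average_review_length": range(100, 1100, 100),
    "average_review_age": range(10, 110, 10),
})
toy_ratings = pd.Series(
    [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.0], name="stars"
)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_ratings, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Features and ratings are shuffled together, so their indexes stay aligned
print((X_train.index == y_train.index).all())
```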

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, ratings, test_size=0.2, random_state=1)


## Create and Train the Model

Now that our data is split into training and testing sets, we can finally model! In the cell below we have provided the code to import `LinearRegression` from scikit-learn's `linear_model` module. 

**Create a new `LinearRegression` object named `model`.** 

_The `.fit()` method will fit our Linear Regression model to our training data and calculate the coefficients for our features._ 

Call the `.fit()` method on `model` with `X_train` and `y_train` as parameters. Just like that our model has now been trained on our training data!

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)

## Evaluate and Understand the Model

Now we can evaluate our model in a variety of ways. 

The first way will be by using the `.score()` method, which provides the R^2 value for our model. 
* _Remember, R^2 is the coefficient of determination, or a measure of how much of the variance in our dependent variable, the Yelp rating, is explained by our independent variables, our feature data._
* R^2 values typically range from `0` to `1`, with `0` indicating that the model explains none of the variance in the ratings and `1` indicating that the model fits the data perfectly. On held-out test data, `.score()` can even return a negative value when the model does worse than always predicting the mean rating.
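
For reference, the coefficient of determination is commonly defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy (the helper `r_squared` and the sample ratings are made up) and shows that the value can fall below 0 when predictions are worse than always guessing the mean:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1 - ss_res / ss_tot


y_true = [2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, y_true)                # predictions equal the actual ratings
mean_only = r_squared(y_true, [3.5] * 4)           # always predict the mean rating
poor = r_squared(y_true, [5.0, 4.0, 3.0, 2.0])     # predictions in reverse order

print(perfect, mean_only, poor)
```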

**Call `.score()` on our model with `X_train` and `y_train` as parameters to calculate our training R^2 score. Then call `.score()` again on model with `X_test` and `y_test` as parameters to calculate R^2 for our testing data**. 

What do these R^2 values say about our model? Do you think these features alone are able to effectively predict Yelp ratings?

In [None]:
train_R2 = model.score(x_train, y_train)
test_R2 = model.score(x_test, y_test)

print(train_R2)
print(test_R2) 

After all that hard work, we can finally take a look at the coefficients on our different features! The model has an **attribute `.coef_` which is an array of the feature coefficients determined by fitting our model to the training data**. 

To make it easier for you to see which feature corresponds to which coefficient, we have provided some code in the cell that `zip`s together a list of our features with the coefficients and sorts them in descending order of absolute coefficient value. Treat this ordering as a rough guide to which features are most predictive, since the features are measured on different scales.

In [None]:
sorted(list(zip(['average_review_length','average_review_age'],model.coef_)),key = lambda x: abs(x[1]),reverse=True)

Lastly we can **calculate the predicted Yelp ratings for our testing data and compare them to their actual Yelp ratings!** 

Our model has a `.predict()` method which uses the model's coefficients to calculate the predicted Yelp rating. Call `.predict()` on `X_test` and assign the values to `y_predicted`. Use Matplotlib to plot `y_test` vs `y_predicted`. 

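For a perfect linear regression model we would expect to see the data plotted along the line `y = x`. The spread of the points around that line tells us about the residuals: if the spread is roughly constant across the range of predictions, the model is _homoscedastic_; if it widens or narrows, the model is _heteroscedastic_. Is the data plotted along `y = x` here? If not, why not? Would you call this model _heteroscedastic_ or _homoscedastic_?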
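
A rough numeric companion to the scatter plot is to compare the spread of the residuals (actual minus predicted) for low and high predictions. The sketch below uses synthetic data and a hypothetical helper, `spread_ratio`; a ratio near 1 suggests a roughly constant spread, while a ratio far from 1 suggests the spread changes with the prediction. It is an informal check, not a formal statistical test.

```python
import numpy as np


def spread_ratio(y_pred, y_actual):
    """Residual std in the upper half of predictions divided by that in the lower half."""
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = np.asarray(y_actual, dtype=float) - y_pred
    order = np.argsort(y_pred)            # positions sorted by predicted value
    half = len(order) // 2
    lower = residuals[order[:half]]
    upper = residuals[order[half:]]
    return upper.std() / lower.std()


rng = np.random.default_rng(1)
predicted = np.linspace(1, 5, 400)

# Noise with the same spread everywhere
constant_noise = predicted + rng.normal(0, 0.3, 400)
# Noise whose spread grows with the predicted value
growing_noise = predicted + rng.normal(0, 1, 400) * (0.1 + 0.2 * (predicted - 1))

ratio_constant = spread_ratio(predicted, constant_noise)
ratio_growing = spread_ratio(predicted, growing_noise)
print(ratio_constant, ratio_growing)
```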

In [None]:
y_predicted = model.predict(x_test)

In [None]:
plt.scatter(y_predicted, y_test, alpha=0.1)

## Define Different Subsets of Data

After evaluating the first model, you can see that `average_review_length` and `average_review_age` alone are not the best predictors for Yelp rating. Let's go do some more modeling with different subsets of features and see if we can achieve a more accurate model! In the cells below we have provided different lists of subsets of features that we will model with and evaluate. What other subsets of features would you want to test? Why do you think those feature sets are more predictive of Yelp rating than others? Create at least one more subset of features that you want to predict Yelp ratings from.

In [None]:
# subset of only average review sentiment
sentiment = ['average_review_sentiment']

In [None]:
# subset of all features that have a response range [0,1]
binary_features = ['alcohol?','has_bike_parking','takes_credit_cards','good_for_kids','take_reservations','has_wifi']

In [None]:
# subset of all features that vary on a greater range than [0,1]
numeric_features = ['review_count','price_range','average_caption_length','number_pics','average_review_age','average_review_length','average_review_sentiment','number_funny_votes','number_cool_votes','number_useful_votes','average_tip_length','number_tips','average_number_friends','average_days_on_yelp','average_number_fans','average_review_count','average_number_years_elite','weekday_checkins','weekend_checkins']

In [None]:
# all features
all_features = binary_features + numeric_features

In [None]:
# add your own feature subset here
feature_subset = ['alcohol?', 'has_bike_parking']

## Further Modeling

Now that we have lists of different feature subsets, we can create new models from them. In order to more easily compare the performance of these new models, we have created a function for you below called `model_these_features()`. This function replicates the model building process you just completed with our first model! Take some time to review how the function works, analyzing it line by line. Fill in the empty comments with an explanation of the task the code beneath it is performing.

In [None]:
import numpy as np

# take a list of features to model as a parameter
def model_these_features(feature_list):
    
    # df.loc -- Access a group of rows and columns by label(s) or a boolean array
    # gather ratings (y, the dependent variable) and features (X, the independent variables)
    ratings = df.loc[:,'stars']
    features = df.loc[:,feature_list]
    
    #splitting variable data into train/test for fitting then testing the model 
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
    
    # don't worry too much about these lines, just know that they allow the model to work when
    # we model on just one feature instead of multiple features. Trust us on this one :)
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1,1)
        X_test = np.array(X_test).reshape(-1,1)
    
    # create model object and fit the model to relevant data 
    model = LinearRegression()
    model.fit(X_train,y_train)
    
    # compute R^2 (the coefficient of determination) for the train and test sets; it tells us how well the independent variables explain the dependent variable
    # values closer to 1 are better; 0.7 and up is a common rule of thumb for a good fit
    print('Train Score:', model.score(X_train,y_train))
    print('Test Score:', model.score(X_test,y_test))
    
    # print the model features and their corresponding coefficients, from most predictive to least predictive
    print(sorted(list(zip(feature_list,model.coef_)),key = lambda x: abs(x[1]),reverse=True))
    
    # see how well the model predicts the actual y_test variables 
    y_predicted = model.predict(X_test)
    
    # visualize the model's accuracy of predicted vs actual 
    plt.scatter(y_test,y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1,5)
    plt.show()
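
The `reshape(-1, 1)` branch in the function guards against one-dimensional input, since scikit-learn estimators expect a 2D feature array. The sketch below (with a made-up frame `toy`) shows the difference: selecting with a list of column names already returns a 2D DataFrame, while selecting with a single string returns a 1D Series that needs reshaping.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "average_review_sentiment": [0.1, 0.5, 0.9],
    "stars": [2.0, 3.5, 5.0],
})

as_frame = toy.loc[:, ["average_review_sentiment"]]  # list of labels -> 2D DataFrame
as_series = toy.loc[:, "average_review_sentiment"]   # single label -> 1D Series
print(as_frame.shape, as_series.shape)

# Turn the 1D values into a single-column 2D array
reshaped = np.array(as_series).reshape(-1, 1)
print(reshaped.shape)
```

Because the feature subsets in this project are all lists, the branch is a safeguard rather than something that normally runs.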

Once you feel comfortable with the steps of the function, run models on the following subsets of data using `model_these_features()`:
* `sentiment`: only `average_review_sentiment`
* `binary_features`: all features that have a response range [0,1]
* `numeric_features`: all features that vary on a greater range than [0,1]
* `all_features`: all features
* `feature_subset`: your own feature subset

How does changing the feature sets affect the model's R^2 value? Which features are most important to predicting Yelp rating in the different models? Which models appear more or less homoscedastic?

In [None]:
# create a model on sentiment here
sentiment_model = model_these_features(sentiment)

In [None]:
# create a model on all binary features here
binary_model = model_these_features(binary_features)

In [None]:
# create a model on all numeric features here
numeric_model = model_these_features(numeric_features)

In [None]:
# create a model on all features here
all_model = model_these_features(all_features)

In [None]:
# create a model on your feature subset here
subset_model = model_these_features(feature_subset)
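
Note that `model_these_features()` prints its results but has no `return` statement, so variables such as `sentiment_model` end up holding `None`. If you want to compare feature sets programmatically, one option is a variant that returns the scores. The sketch below is only an illustration: the helper `score_feature_set`, the synthetic frame `toy`, and the relationships between its columns are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def score_feature_set(data, feature_list, target="stars"):
    """Fit a linear regression on feature_list and return its scores and coefficients."""
    X_train, X_test, y_train, y_test = train_test_split(
        data.loc[:, feature_list], data.loc[:, target], test_size=0.2, random_state=1
    )
    model = LinearRegression().fit(X_train, y_train)
    return {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        "coefficients": dict(zip(feature_list, model.coef_)),
    }


# Synthetic data: stars depend on sentiment, not on has_wifi
rng = np.random.default_rng(2)
n = 300
sentiment = rng.uniform(-1, 1, n)
toy = pd.DataFrame({
    "average_review_sentiment": sentiment,
    "has_wifi": rng.integers(0, 2, n),
    "stars": 3 + 1.5 * sentiment + rng.normal(0, 0.3, n),
})

results = {
    "sentiment": score_feature_set(toy, ["average_review_sentiment"]),
    "wifi": score_feature_set(toy, ["has_wifi"]),
}
for name, scores in results.items():
    print(name, round(scores["train_r2"], 3), round(scores["test_r2"], 3))
```

Returning a dictionary keeps the scores available for a side-by-side table instead of leaving them only in the printed output.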

## Danielle's Delicious Delicacies' Debut

You've loaded the data, cleaned it, modeled it, and evaluated it. You're tired, but glowing with pride after all the hard work. You close your eyes and can clearly see opening day of Danielle's Delicious Delicacies with a line out the door. But what will your Yelp rating be? Let's use our model to make a prediction.

Our best model was the model using all features, so we'll work with this model again. In the cell below print `all_features` to get a reminder of what features we are working with.

In [None]:
print(all_features)

Run the cell below to grab all the features and retrain our model on them.

In [None]:
features = df.loc[:,all_features]
ratings = df.loc[:,'stars']
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
model = LinearRegression()
model.fit(X_train,y_train)

To give you some perspective on the restaurants already out there, we have provided the mean, minimum, and maximum values for each feature below. Will Danielle's Delicious Delicacies be just another average restaurant, or will it be a 5 star behemoth amongst the masses?

In [None]:
# compute the summary statistics once, keep mean/min/max, and show one row per feature
summary = features.describe().loc[['mean','min','max']].T.reset_index()
summary.columns = ['Feature','Mean','Min','Max']
summary

Based on your plans for the restaurant, how you expect your customers to post on your Yelp page, and the values above, fill in the blanks in the NumPy array below with your desired values. The first blank corresponds with the feature at `index=0` in the DataFrame above, `alcohol?`, and the last blank corresponds to the feature at `index=24`, `weekend_checkins`. Make sure to enter either `0` or `1` for all binary features, and if you aren't sure of what value to put for a feature, select the mean from the DataFrame above. After you enter the values, run the prediction cell below to receive your Yelp rating! How is Danielle's Delicious Delicacies' debut going to be?

In [None]:
danielles_delicious_delicacies = np.array([1,1,1,0,0,1,2,2,5,10,10,596,0.554935,15.61,18.49,50,45,3,200,2,10,20,0,50,60]).reshape(1,-1)

In [None]:
model.predict(danielles_delicious_delicacies)
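
Because the array is positional, a value entered in the wrong slot silently changes the prediction. One way to reduce that risk, sketched below with a shortened, made-up feature list (`toy_features` and `planned_values` are hypothetical), is to write the values as a dictionary keyed by feature name and let pandas put them in the expected column order:

```python
import pandas as pd

# Shortened stand-in for all_features
toy_features = ["alcohol?", "has_wifi", "price_range", "average_review_sentiment"]

# Values can be written in any order because they are keyed by name
planned_values = {
    "average_review_sentiment": 0.55,
    "price_range": 2,
    "has_wifi": 1,
    "alcohol?": 1,
}

# Fail loudly if a feature was forgotten
missing = set(toy_features) - set(planned_values)
assert not missing, f"No value supplied for: {sorted(missing)}"

# One-row DataFrame with columns in the same order as the feature list
new_restaurant = pd.DataFrame([planned_values])[toy_features]
print(new_restaurant)
```

With the real model, the same pattern would use `all_features` in place of `toy_features`, and the resulting one-row DataFrame could be passed to `model.predict()`.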

## Next Steps

You have successfully built a linear regression model that predicts a restaurant's Yelp rating! As you have seen, it can be pretty hard to predict a rating like this even when we have a plethora of data. 

**What other questions come to your mind when you see the data we have**? 

**What insights do you think could come from a different kind of analysis? Here are some ideas to ponder**:

* Can we predict the cuisine of a restaurant based on the users that review it?
* What restaurants are similar to each other in ways besides cuisine?
* Are there different restaurant vibes, and what kind of restaurants fit these conceptions?
* How does social media status affect a restaurant's credibility and visibility?

As you progress further into the field of data science, you will be able to create models that address these questions and many more! But in the meantime, get back to working on that burgeoning restaurant business plan.