# Predicting Yelp Rating from Health Inspection Scores by Restaurant

The first analysis point that the team agreed upon was to see if there is a correlation between the Boulder County Health Inspection Scores and Yelp or Google Ratings for restaurants in Boulder county, Colorado.

The Boulder County Health Inspections Scores were obtained. The features selected for the first analysis were the Health Inspection Score, Facility Type and Facility Category. These features were used to train the model in trying to predict the Yelp Rating per facility. 

The first step in engineering the features for the machine learning model used the filtered dataset to:
* Eliminate all location data so as not to overburden the model
* Average the inspection scores for all routine and regular health inspections by facility
    * This was difficult to eliminate the duplicate rows without losing details (pivot table and merge)
* Bin the averaged health inspection scores to match the Health Department ratings
* Create randomized Yelp Ratings to test the model
* Use Random Forest model as it is fast, simple and flexible 
    * Easy to use during the initial model development process, to see how it performs
    * Provides a good indicator of the importance it assigns to features
    * Limitations include: fast to train, but quite slow to create predictions once they are trained
    
* May need to switch to a neural network, for the second phase which has a lot of different feature types

In [59]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import tensorflow as tf

# Import our input datasets
test_df = pd.read_csv('average_scores_analysis_test3.csv')
scores_df=pd.read_csv('facility_ave_insp_score.csv')
ave_score_df=test_df.set_index('facilityId').join(scores_df.set_index('facilityId'))
ave_score_df.head()

Unnamed: 0_level_0,typeOfFacility,categoryOfFacility,averageInspectionScore
facilityId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FA0003323,RESTAURANT 0 TO 100 SEATS,FULL MENU LIMITED SERVICE,14
FA0000616,RESTAURANT MORE THAN 200 SEATS,FULL SERVICE FULL MENU,48
FA0004494,RESTAURANT 0 TO 100 SEATS,FULL MENU LIMITED SERVICE,30
FA0003893,SPECIAL EVENT,SPECIAL EVENT,4
FA0003472,LIMITED FOOD SERVICE CONVENIENCE OTHER,FAST FOOD LIMITED MENU,15


In [60]:
# Bin the average scores from 1-5 according to Boulder County Health Inspection site "5"= Excellent score 0-19, "4" = Good score 20-39, "3" = Fair score 40-69, "2" = Marginal score 70-99, "1" = Unacceptable score >100.
bins=[0, 20, 40, 70, 100, 1000]
health_scores = ["5", "4", "3", "2", "1"]
ave_score_df["healthScore"]=pd.cut(ave_score_df["averageInspectionScore"], bins, labels=health_scores)
ave_score_df. drop("averageInspectionScore", axis=1, inplace=True)

# Add fake Yelp ratings
ave_score_df["yelpRating"]=np.random.randint(1,6, size=len(ave_score_df))
ave_score_df.head()

Unnamed: 0_level_0,typeOfFacility,categoryOfFacility,healthScore,yelpRating
facilityId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
FA0003323,RESTAURANT 0 TO 100 SEATS,FULL MENU LIMITED SERVICE,5,3
FA0000616,RESTAURANT MORE THAN 200 SEATS,FULL SERVICE FULL MENU,3,4
FA0004494,RESTAURANT 0 TO 100 SEATS,FULL MENU LIMITED SERVICE,4,1
FA0003893,SPECIAL EVENT,SPECIAL EVENT,5,3
FA0003472,LIMITED FOOD SERVICE CONVENIENCE OTHER,FAST FOOD LIMITED MENU,5,5


In [61]:
#ave_score_df.to_csv('average_scores.csv')

# Machine Learning Model

In [62]:
# Preparing our data for ML
# Generate our categorical variable list
fac_cat = ave_score_df.dtypes[ave_score_df.dtypes == "object"].index.tolist()

# Check the number of unique values in each column
ave_score_df[fac_cat].nunique()

typeOfFacility        13
categoryOfFacility    13
dtype: int64

In [63]:
# Create a OneHotEncoder instance
enc = OneHotEncoder(sparse=False)

# Fit and transform the OneHotEncoder using the categorical variable list
encode_df = pd.DataFrame(enc.fit_transform(ave_score_df[fac_cat]))

# Add the encoded variable names to the DataFrame
encode_df.columns = enc.get_feature_names(fac_cat)
encode_df.head()

Unnamed: 0,typeOfFacility_GROCERY STORE MORE THAN 15000 SQ FT,typeOfFacility_GROCERY STORE 0 TO 15000 SQ FT,typeOfFacility_GROCERY STORE W DELI 0 TO 15000 SQ FT,typeOfFacility_GROCERY STORE W DELI MORE THAN 15000 SQ FT,typeOfFacility_LIMITED FOOD SERVICE CONVENIENCE OTHER,typeOfFacility_MOBILE UNIT FULL FOOD SERVICE,typeOfFacility_MOBILE UNIT PREPACKAGED,typeOfFacility_NO FEE LICENSE K12 SCHOOLS NON PROFIT,typeOfFacility_RESTAURANT 0 TO 100 SEATS,typeOfFacility_RESTAURANT 101 TO 200 SEATS,...,categoryOfFacility_FOOD BANK,categoryOfFacility_FULL MENU LIMITED SERVICE,categoryOfFacility_FULL SERVICE FULL MENU,categoryOfFacility_GROCERY FINISHED FOODS,categoryOfFacility_MOBILE UNITS,categoryOfFacility_PRE PACKAGED,categoryOfFacility_RESIDENTIAL FACILITIES,categoryOfFacility_RETAIL COMMISSARY,categoryOfFacility_SPECIAL EVENT,categoryOfFacility_TEMPORARY EVENTS
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [69]:
# Merge one-hot encoded features and drop the originals
ave_score_df = ave_score_df.merge(encode_df,left_index=True, right_index=True)
#ave_score_df = ave_score_df.drop(fac_cat,1)
ave_score_df.head()

Unnamed: 0,typeOfFacility,categoryOfFacility,healthScore,yelpRating,typeOfFacility_GROCERY STORE MORE THAN 15000 SQ FT_x,typeOfFacility_GROCERY STORE 0 TO 15000 SQ FT_x,typeOfFacility_GROCERY STORE W DELI 0 TO 15000 SQ FT_x,typeOfFacility_GROCERY STORE W DELI MORE THAN 15000 SQ FT_x,typeOfFacility_LIMITED FOOD SERVICE CONVENIENCE OTHER_x,typeOfFacility_MOBILE UNIT FULL FOOD SERVICE_x,...,categoryOfFacility_FOOD BANK_y,categoryOfFacility_FULL MENU LIMITED SERVICE_y,categoryOfFacility_FULL SERVICE FULL MENU_y,categoryOfFacility_GROCERY FINISHED FOODS_y,categoryOfFacility_MOBILE UNITS_y,categoryOfFacility_PRE PACKAGED_y,categoryOfFacility_RESIDENTIAL FACILITIES_y,categoryOfFacility_RETAIL COMMISSARY_y,categoryOfFacility_SPECIAL EVENT_y,categoryOfFacility_TEMPORARY EVENTS_y


In [70]:
# Remove Yelp outcome target from features data
y = ave_score_df.yelpRating
X = ave_score_df.drop(columns="yelpRating")

# Split training/test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

ValueError: With n_samples=0, test_size=0.25 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

In [46]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

ValueError: could not convert string to float: 'RESTAURANT 0 TO 100 SEATS'

In [None]:
# Create a random forest classifier.
rf_model = RandomForestClassifier(n_estimators=128, random_state=78)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Evaluate the model
y_pred = rf_model.predict(X_test_scaled)
print(f" Random forest predictive accuracy: {accuracy_score(y_test,y_pred):.3f}")

In [None]:
# Add the prediction to the dataframe
ave_score_df["yelpPrediction"]= y_pred
ave_score_df.head()

In [None]:
# # Define the logistic regression model
# log_classifier = LogisticRegression(solver="lbfgs",max_iter=200)

# # Train the model
# log_classifier.fit(X_train,y_train)

# # Evaluate the model
# y_pred = log_classifier.predict(X_test)
# print(f" Logistic regression model accuracy: {accuracy_score(y_test,y_pred):.3f}")