### Predicting restaurant ratings

In [None]:
import pandas as pd
import numpy as np
import requests
import random
import time
import re
import os, glob

import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import scikitplot as skplt

pd.options.display.max_colwidth = 300
pd.options.display.max_columns = 100
plt.style.use('ggplot')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [None]:
# load data
df = pd.read_csv('/Users/katjakrempel/Desktop/capstone/fsa_yelp_final.csv')
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
# drop duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
df.shape

#### Data Dictionary

In [None]:
var_names = df.columns
var_names

In [None]:
data = {'FHRSID': {'type': 'integer', 'source': 'Food Standards Agency', 'description': 'Food Hygiene Rating Scheme (FHRS) ID of business'},
        'BusinessName': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'business name'},
        'BusinessType': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'relevant business types in dataset: Restaurant/Cafe/Canteen, Pub/bar/nightclub, Takeaway/sandwich shop'},
        'BusinessTypeID': {'type': 'integer', 'source': 'Food Standards Agency', 'description': 'numerical ID of business types listed above'},
        'AddressLine1': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'street address of business'},
        'AddressLine2':{'type': 'string', 'source': 'Food Standards Agency', 'description': 'street address of business'},
        'AddressLine3': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'street address of business'},
        'AddressLine4': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'street address of business'}, 
        'PostCode': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'postcode'}, 
        'RatingValue': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'overall rating score: 5 - very good, 4 - good, 3 - generally satisfactory, 2 - some improvement necessary, 1 - major improvement necessary, 0 - urgent improvement necessary'}, 
        'RatingKey': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'combination of rating scheme and score'},
        'RatingDate': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'rating date (YYYY-MM-DD)'},
        'LocalAuthorityCode': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'numerical ID of local authority responsible for enforcing food hygiene standards'},
        'LocalAuthorityName': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'name of local authority responsible for enforcing food hygiene standards'},
        'Hygiene': {'type': 'integer', 'source': 'Food Standards Agency', 'description': 'component score: compliance with food hygiene and safety procedures (0 - very good, 5 - good, 10 - generally satisfactory, 15 - improvement necessary, 20 - major improvement necessary, 25 - urgent improvement necessary)'},
        'Structural': {'type': 'integer', 'source': 'Food Standards Agency', 'description': 'component score: compliance with structural requirements (0 - very good, 5 - good, 10 - generally satisfactory, 15 - improvement necessary, 20 - major improvement necessary, 25 - urgent improvement necessary)'},
        'ConfidenceInManagement': {'type': 'integer', 'source': 'Food Standards Agency', 'description': 'component score: confidence in management/control procedures (0 - very good, 5 - good, 10 - generally satisfactory, 20 - major improvement necessary, 30 - urgent improvement necessary)'},
        'longitude_x': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'geolocation - longitude'},
        'latitude_x': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'geolocation - latitude'}, 
        'RightToReply': {'type': 'string', 'source': 'Food Standards Agency', 'description': 'food business operators have a right to reply; local authorities must publish this with the rating'}, 
        'NewRatingPending': {'type': 'boolean', 'source': 'Food Standards Agency', 'description': 'new rating pending during appeal and notification period, previous rating continues to be published'},
        'name_postc': {'type': 'string', 'source': '', 'description': 'unique identifier created to match businesses in both data sets'},
        'id': {'type': 'string', 'source': 'Yelp', 'description': 'unique Yelp ID of business'},
        'alias': {'type': 'string', 'source': 'Yelp', 'description': 'unique Yelp alias of business'},
        'name': {'type': 'string', 'source': 'Yelp', 'description': 'business name'},
        'review_count': {'type': 'integer', 'source': 'Yelp', 'description': 'number of reviews of business; Yelp API only returns businesses with at least one review'},
        'cat_1': {'type': 'string', 'source': 'Yelp', 'description': 'category associated with business, indicates cuisine or type of business (e.g. Mexican, Gastropub...); multiple categories can be selected'}, 
        'cat_2': {'type': 'string', 'source': 'Yelp', 'description': 'category associated with business, indicates cuisine or type of business (e.g. Mexican, Gastropub...); multiple categories can be selected'},
        'cat_3': {'type': 'string', 'source': 'Yelp', 'description': 'category associated with business, indicates cuisine or type of business (e.g. Mexican, Gastropub...); multiple categories can be selected'},
        'cat_4': {'type': 'string', 'source': 'Yelp', 'description': 'category associated with business, indicates cuisine or type of business (e.g. Mexican, Gastropub...); multiple categories can be selected'}, 
        'cat_5': {'type': 'string', 'source': 'Yelp', 'description': 'category associated with business, indicates cuisine or type of business (e.g. Mexican, Gastropub...); multiple categories can be selected'},
        'rating': {'type': 'float', 'source': 'Yelp', 'description': 'Yelp rating of business (value range from 1, 1.5, ..., 4.5, 5)'},
        'longitude_y': {'type': 'float', 'source': 'Yelp', 'description': 'geolocation - longitude'},
        'latitude_y':{'type': 'float', 'source': 'Yelp', 'description': 'geolocation - latitude'},
        'transaction_1':{'type': 'string', 'source': 'Yelp', 'description': 'Yelp transactions the business is registered for: pickup, delivery, restaurant_reservation'},
        'transaction_2':{'type': 'string', 'source': 'Yelp', 'description': 'Yelp transactions the business is registered for: pickup, delivery, restaurant_reservation'},
        'transaction_3':{'type': 'string', 'source': 'Yelp', 'description': 'Yelp transactions the business is registered for: pickup, delivery, restaurant_reservation'},
        'price': {'type': 'string', 'source': 'Yelp', 'description': 'price level of business: £, ££, £££, ££££'},
        'address1': {'type': 'string', 'source': 'Yelp', 'description': 'street address of business'},
        'address2':{'type': 'string', 'source': 'Yelp', 'description': 'street address of business'},
        'address3':{'type': 'string', 'source': 'Yelp', 'description': 'street address of business'},
        'city':{'type': 'string', 'source': 'Yelp', 'description': 'city'},
        'zip_code': {'type': 'string', 'source': 'Yelp', 'description': 'postcode'},
        'country': {'type': 'string', 'source': 'Yelp', 'description': 'country code of business'} }


In [None]:
data = pd.DataFrame(data, columns=var_names).T
data

#### Data Cleaning and EDA

In [None]:
# check number of missing values per column
df.isnull().sum()

In [None]:
# drop columns without values 
df.drop(columns=['transaction_1', 'transaction_2', 'transaction_3'], inplace=True)

In [None]:
# drop redundant name and address columns
df.drop(columns=['BusinessName', 'AddressLine1', 'AddressLine2', 'AddressLine3', 'AddressLine4',
                 'PostCode', 'longitude_x', 'latitude_x'], inplace=True)

In [None]:
df.info()

In [None]:
df['BusinessType'].value_counts()

In [None]:
df['RatingValue'].value_counts()

In [None]:
# check exempt businesses
df[['BusinessType', 'name', 'cat_1', 'cat_2', 'address1', 'city']][df['RatingValue']=='Exempt']

In [None]:
# define function to convert strings to floats
def convert_to_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

In [None]:
# convert food hygiene ratings to float
df['RatingValue'] = df['RatingValue'].apply(convert_to_float)

In [None]:
df['RatingValue'].value_counts(normalize=True)

In [None]:
0.024155 + 0.013383 + 0.001697

In [None]:
# drop rows with missing ratings (exempt, awaiting inspection, awaiting publication)
df.dropna(subset = ['RatingValue'], inplace=True) 

In [None]:
df.info()

In [None]:
# distribution of food hygiene ratings
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='RatingValue', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Food Hygiene Ratings distribution\n', fontsize=14)
plt.show()

In [None]:
# distribution of food hygiene ratings by business type
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='RatingValue', hue='BusinessType', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Food Hygiene Ratings distribution by Business Type\n', fontsize=14)
plt.show()

In [None]:
# convert RatingDate to datetime
df['RatingDate'] = pd.to_datetime(df['RatingDate'])

In [None]:
# extract year from RatingDate
df['rating_year'] = df['RatingDate'].dt.year

In [None]:
df['rating_year'].value_counts()

In [None]:
# check businesses with most recent rating before 2010 (launch of Food Hygiene Rating Scheme)
df[['BusinessType', 'name', 'RatingValue', 'RatingDate', 'cat_1', 'cat_2', 'address1', 'city']][df['rating_year']<2010]

In [None]:
# remove businesses with most recent rating before 2010
df = df[df['rating_year']>=2010].copy()

In [None]:
df.info()

In [None]:
# component score - Hygiene
df['Hygiene'].value_counts(normalize=True)

In [None]:
# distribution of food hygiene ratings - 'Hygiene' score
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='Hygiene', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Food Hygiene Ratings distribution ("Hygiene")\n', fontsize=14)
plt.show()

In [None]:
# component score - Structural
df['Structural'].value_counts(normalize=True)

In [None]:
# distribution of food hygiene ratings - 'Structural' score
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='Structural', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Food Hygiene Ratings distribution ("Structural")\n', fontsize=14)
plt.show()

In [None]:
# component score - Confidence in Management
df['ConfidenceInManagement'].value_counts(normalize=True)

In [None]:
# distribution of food hygiene ratings - 'Confidence in Management' score
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='ConfidenceInManagement', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Food Hygiene Ratings distribution ("Confidence in Management")\n', fontsize=14)
plt.show()

In [None]:
df['RightToReply'].unique()

In [None]:
# most common business names
df['name'].value_counts()[:20]

In [None]:
# identify chains based on 250 most common names in the data set
chains = ['subway', 'mcdonalds', 'nandos', 'kfc', 'costa coffee', 'dominos pizza', 
          'pret a manger', 'starbucks', 'pizza express', 'burger king', 'caffe nero',
          'pizza hut', 'wagamama', 'zizzi', 'five guys', 'prezzo', 'toby carvery',
          'franco manca', 'greggs', 'papa johns', 'gourmet burger kitchen', 'all bar one', 
          'slug and lettuce', 'tgi fridays', 'leon', 'chicken cottage', 'carluccios',
          'pho', 'frankie and bennys', 'wasabi', 'ask italian', 'le pain quotidien',
          'itsu', 'las iguanas', 'miller and carter', 'black sheep coffee', 'botanist',
          'pure', 'patisserie valerie', 'bills', 'tops pizza', 'pitcher and piano',
          'comptoir libanais', 'breakfast club', 'byron', 'pizza go go', 'wahaca', 
          'burger and lobster', 'german doner kebab']

In [None]:
len(chains)

In [None]:
# define function to identify chains based on name
def is_chain(x):
    if any(chain in x.lower() for chain in chains):
        return 1
    else:
        return 0

In [None]:
df['is_chain'] = df['name'].apply(is_chain)

In [None]:
df[['name', 'is_chain']][:15]

In [None]:
df['is_chain'].value_counts(normalize=True)

In [None]:
# distribution of chain vs non-chain restaurants
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='is_chain', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Chain vs non-chain distribution\n', fontsize=14)
plt.show()

In [None]:
# define function to extract number of businesses from alias column
# def extract_count(x):
#     try:
#         return int(x.split('-')[-1])
#     except:
#         return np.nan

In [None]:
# extract number of businesses from alias column
# df['count'] = df['alias'].apply(extract_count)

In [None]:
# df[['alias', 'name', 'count', 'city']].sort_values(by='count', ascending=False)[:20]

In [None]:
# descriptive statistics for review count variable
df.review_count.describe()

In [None]:
# top 30 restaurants by number of reviews
df.sort_values(by='review_count', ascending=False)[:30]

In [None]:
# Yelp review count distribution
fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(data=df, x="review_count", binwidth=5, color='r')
ax.set_title('Yelp review count distribution\n', fontsize=14)
plt.show()

In [None]:
# Yelp review count vs Yelp ratings
ax = sns.catplot(data=df, x='rating', y='review_count', height=6, aspect=1, palette='YlOrRd')
plt.show()

In [None]:
# Yelp review count vs food hygiene ratings
sns.catplot(data=df, x='RatingValue', y='review_count', height=6, aspect=1, palette='YlOrRd')
plt.show()

In [None]:
# categories

In [None]:
# drop rows with missing values in main category (cat_1)
df.dropna(subset = ['cat_1'], inplace=True)

In [None]:
df.info()

In [None]:
# top 20 values in main category
df['cat_1'].value_counts()[:20]

In [None]:
df['cat_1'].value_counts()[:20].plot(figsize=(8,6), kind='bar')
plt.show()

In [None]:
sorted(df['cat_1'].unique())

In [None]:
# compile list of non-food related categories
non_food = ['Accessories', 'Antiques', 'Arcades', 'Art Galleries', 'Arts & Crafts',
            'Arts & Entertainment', 'B&Bs', 'Bikes', 'Boating', 'Bookshops', 
            'Bowling', 'Bridal', 'Cabarets', 'Camping & Campsites', 'Casinos',
            'Cinemas', 'Christmas Trees', 'Comedy Clubs', 
            'Corner Shops', 'Crazy Golf', 'Cultural Centres', 'Dance Schools',
            'DJs', 'Estate Agents', 'Fashion', 'Flowers & Gifts', 'Gas Stations',
            'Gardening Centres', 'Gift Shops', 'Golf', 'Guest Houses', 
            'Hairdressers', 'Hotels', 'Hotel & Travel', 'Internet Cafes',
            'Kitchen & Bath', 'Language Schools', 'Laundry Services', 
            'Landmarks & Historic Buildings', 'Men\'s Clothing', 'MOT Test Centres',
            'Museums', 'Music Venues', 'Nail Salons', 'Off Licence', 'Organic Shops', 
            'Parking', 'Pet Shops', 'Pet Groomers', 'Pet Sitting', 'Pool & Billiards',
            'Pool & Snooker Hall', 'Post Offices', 'Shared Office Spaces',
            'Shoe Shops', 'Social Clubs', 'Souvenir Shops', 'Sports Clubs', 
            'Stables & Horse Riding', 'Supermarkets', 'Sweet Shops', 
            'Swimming Pools', 'Tabletop Games', 'Taxi & Minicabs', 'Theatres', 
            'Toy Shops', 'Used Bookstores', 'Venues & Event Spaces',
            'Vintage & Second Hand', 'Vinyl Records', 'Wedding Planners', 
            'Women\'s Clothing', 'Zoos']

In [None]:
df[df['cat_1']=='Taxi & Minicabs']

In [None]:
# businesses with non-food related main category
df[df['cat_1'].isin(non_food)]

In [None]:
# top 20 values in secondary category
df['cat_2'].value_counts()[:20]

In [None]:
df['cat_2'].unique()

In [None]:
# remove rows with non-food related main category (cat_1)
df = df[~df['cat_1'].isin(non_food)].copy()

In [None]:
df.info()

In [None]:
# Yelp rating
df['rating'].value_counts()

In [None]:
# Yelp rating
df['rating'].value_counts(normalize=True)

In [None]:
# descriptive statistics for Yelp rating variable
df['rating'].describe()

In [None]:
# distribution of Yelp ratings
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='rating', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Yelp ratings distribution\n', fontsize=14)
plt.show()

In [None]:
# distribution of Yelp ratings by chain vs non-chain
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='rating', hue='is_chain', palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Yelp ratings distribution by chain vs non-chain\n', fontsize=14)
plt.show()

In [None]:
# price level
df['price'].value_counts()

In [None]:
# distribution of price levels
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(data=df, x='price', order=['£', '££', '£££', '££££'], palette='YlOrRd')
ax.set_ylabel('Counts\n', fontsize=12)
ax.set_title('Price levels distribution\n', fontsize=14)
plt.show()

In [None]:
# top 20 cities in data set
df['city'].value_counts()[:20]

In [None]:
df['city'].value_counts(normalize=True)

In [None]:
# top 15 cities in data set
fig, ax = plt.subplots(figsize=(8, 6))
df['city'].value_counts()[:15].plot(kind='bar')
ax.set_title('Top 15 locations by city\n', fontsize=12)
plt.show()


In [None]:
df['city'].unique()

In [None]:
len(df['city'].unique())

In [None]:
sorted(df['city'].unique())

In [None]:
# clean city names to account for upper case/lower case spelling and hyphens
df['city'] = df['city'].apply(lambda x: x.replace("-", " ").title())

In [None]:
len(df['city'].unique())

In [None]:
# top 15 cities in data set (after cleaning)
fig, ax = plt.subplots(figsize=(8, 6))
df['city'].value_counts()[:15].plot(kind='bar')
ax.set_title('Top 15 locations by city\n', fontsize=12)
plt.show()

In [None]:
sorted(df['city'].unique())

In [None]:
# define function to extract postcode area from postcode
def postcode_area(x):
    if x[1].isnumeric():
        return x[0]
    else:
        return x[:2]

In [None]:
df['postc_area'] = df['zip_code'].apply(postcode_area)

In [None]:
df['postc_area'].nunique()

In [None]:
df['postc_area'].unique()

In [None]:
# top 15 locations in data set (by postcode area)
fig, ax = plt.subplots(figsize=(8, 6))
df['postc_area'].value_counts()[:15].plot(kind='bar')
ax.set_title('Top 15 locations by postcode area\n', fontsize=12)
plt.show()


In [None]:
# subset of numerical and binary variables
subset = df[['RatingValue', 'Hygiene', 'Structural', 'ConfidenceInManagement', 'review_count', 'rating', 'is_chain']]

In [None]:
# heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(subset.corr(), annot=True)
plt.show()

Data Cleaning
- 407 businesses had no numerical food hygiene rating; their status was 'awaiting inspection', 'awaiting publication' or 'exempt'. These observations were removed from the data set.
- The Food Hygiene Rating Scheme was launched in 2010. The 16 businesses whose most recent rating was dated before 2010 were removed from the data set.
- The main category (cat_1) of 174 businesses was not related to food preparation (for example, zoos or swimming pools). These rows were dropped, as were 2 rows with a missing main category.
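
The three cleaning rules above can be sketched on a toy frame. This is a condensed illustration, not the notebook's cells: the rows are made up, `clean_demo` and `non_food_demo` are hypothetical names, and the filters are applied as one combined mask instead of step by step.

```python
import pandas as pd

# toy stand-in for the merged FSA/Yelp data (all values are made up)
toy = pd.DataFrame({
    'RatingValue': ['5', '4', 'Exempt', 'AwaitingInspection', '3', '5', '2'],
    'RatingDate': ['2019-03-01', '2008-06-15', '2018-01-10', '2020-02-02',
                   '2017-11-30', '2016-05-05', '2021-07-07'],
    'cat_1': ['Italian', 'Pubs', 'Cafes', 'Indian', 'Zoos', None, 'Pizza'],
})

# hypothetical short list; the notebook's non_food list is much longer
non_food_demo = ['Zoos', 'Swimming Pools']


def clean_demo(df, non_food):
    """Apply the three cleaning rules and return a new, re-indexed frame."""
    out = df.copy()
    # statuses such as 'Exempt' cannot be parsed and become NaN
    out['RatingValue'] = pd.to_numeric(out['RatingValue'], errors='coerce')
    out['rating_year'] = pd.to_datetime(out['RatingDate']).dt.year
    keep = (
        out['RatingValue'].notna()          # numerical rating present
        & (out['rating_year'] >= 2010)      # rated after the scheme's launch
        & out['cat_1'].notna()              # main category present
        & ~out['cat_1'].isin(non_food)      # main category is food related
    )
    return out[keep].reset_index(drop=True)


cleaned = clean_demo(toy, non_food_demo)
print(cleaned)
```

Only the 'Italian' and 'Pizza' rows survive: the others fail on rating status, rating year, or main category.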

Feature Engineering

- 49 restaurant chains were identified by reviewing the 250 most common business names in the data set. The binary feature 'is_chain' was created to flag whether a restaurant belonged to a chain or not. 12% of the businesses in the data set could be identified as part of a chain. This most likely underestimates the true share, because chains outside the 250 most common names are not on the list.

- The postcode area ('postc_area') was extracted from the postcode as a proxy for a restaurant's location. 
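
As a hedged variation on the two engineered features, the sketch below matches chain names on word boundaries after stripping apostrophes, and extracts the postcode area with a regular expression. It is not the notebook's implementation: `demo_chains`, `is_chain_strict` and `postcode_area_regex` are hypothetical names, and unlike the notebook's `postcode_area` the regex version upper-cases its result and returns `None` for non-string input.

```python
import re
import pandas as pd

# hypothetical short chain list; the notebook's list has 49 entries
demo_chains = ['mcdonalds', 'leon', 'pho']

# one pattern with word boundaries, so 'leon' does not match 'napoleon'
chain_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(c) for c in demo_chains) + r')\b'
)


def normalise_name(name):
    """Lower-case a name and drop apostrophes ("McDonald's" -> 'mcdonalds')."""
    return re.sub(r"['’]", '', str(name).lower())


def is_chain_strict(name):
    """Return 1 if a whole-word chain name occurs in the business name, else 0."""
    return int(bool(chain_pattern.search(normalise_name(name))))


def postcode_area_regex(postcode):
    """Return the one or two leading letters of a UK postcode, upper-cased."""
    if not isinstance(postcode, str):
        return None
    match = re.match(r'[A-Za-z]{1,2}', postcode.strip())
    return match.group(0).upper() if match else None


names = pd.Series(["McDonald's", 'Napoleon Bistro', 'Leon', 'Pho Cafe', 'Photo Bar'])
flags = names.apply(is_chain_strict).tolist()
areas = [postcode_area_regex(p) for p in ['SW1A 1AA', 'M1 1AE', 'e1 6an']]
print(flags, areas)
```

Word-boundary matching trades some recall for fewer false positives on short chain names such as 'leon' or 'pho'; whether that trade is worthwhile depends on how the names in the data are spelled.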




In [None]:
# clean csv file for Tableau
# df.to_csv('fsa_yelp_clean_tableau.csv', index=False)

#### Classification Models

In [None]:
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
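# note: plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# on newer versions use RocCurveDisplay and PrecisionRecallDisplay instead
from sklearn.metrics import roc_auc_score, average_precision_score, plot_roc_curve, plot_precision_recall_curve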
from sklearn.tree import DecisionTreeClassifier

#### A) Binary Classification
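
The "baseline accuracy" reported for each model in this section is the share of the majority class, i.e. the accuracy of always predicting the most frequent label. A minimal sketch on made-up ratings, using the same thresholds as the notebook (hygiene rating of 5, Yelp rating of 4 or above); `baseline_accuracy` is a hypothetical helper:

```python
import pandas as pd

# made-up ratings, for illustration only
toy = pd.DataFrame({
    'RatingValue': [5, 5, 4, 3, 5, 2, 5, 5],
    'rating': [4.5, 3.0, 4.0, 2.5, 5.0, 3.5, 4.0, 3.0],
})

# binary targets with the same cut-offs as the notebook
toy['fsa_high'] = (toy['RatingValue'] == 5).astype(int)
toy['yelp_high'] = (toy['rating'] >= 4).astype(int)


def baseline_accuracy(target):
    """Accuracy of always predicting the most frequent class."""
    return target.value_counts(normalize=True).max()


fsa_baseline = baseline_accuracy(toy['fsa_high'])    # 5 of 8 rows are high
yelp_baseline = baseline_accuracy(toy['yelp_high'])  # 4 of 8 rows are high
print(fsa_baseline, yelp_baseline)
```

A classifier is only informative if its test accuracy clearly exceeds this value.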

In [None]:
# add column to indicate whether food hygiene rating is high (high=5, low=4 or less)
df['fsa_high'] = df['RatingValue'].apply(lambda x: 1 if x==5 else 0)

In [None]:
# add column to indicate whether Yelp rating is high (high=4 and above)
df['yelp_high'] = df['rating'].apply(lambda x: 1 if x>=4 else 0)

In [None]:
df.head()

In [None]:
# create subset of data dropping missing values for price
sub_price = df.dropna(subset = ['price'])

In [None]:
sub_price.info()

#### Model 1: High vs low hygiene rating - logistic regression - full data set
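
The cells in this section fit the scaler once on the training set and then run the grid search on the scaled data. An alternative sketch, not the approach used in this notebook, wraps the scaler and the model in a scikit-learn `Pipeline` so that each cross-validation fold fits its own scaler. The data here is synthetic, the names `pipe`, `gs_pipe`, `X_demo` and `y_demo` are placeholders, and only `C` is tuned to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (not the restaurant data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=1)

# scaler and model are refit together inside every CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(solver='liblinear', random_state=1)),
])

# the 'logreg__' prefix routes the parameter to the logistic regression step
param_grid = {'logreg__C': np.logspace(-3, 3, 10)}

gs_pipe = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
gs_pipe.fit(X_tr, y_tr)

cv_score = gs_pipe.best_score_          # mean accuracy on the held-out folds
test_score = gs_pipe.score(X_te, y_te)  # accuracy on the untouched test set
print(gs_pipe.best_params_, cv_score, test_score)
```

With this layout the held-out fold never influences the scaling statistics, which keeps `best_score_` a cleaner estimate of out-of-sample accuracy.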

In [None]:
# baseline accuracy
df['fsa_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = df['fsa_high']

In [None]:
# define predictor variables
X = df[['is_chain', 'postc_area', 'cat_1', 'review_count']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of logistic regression model
logreg_1 = LogisticRegression(random_state=1)

# check for parameters of the model
list(logreg_1.get_params().keys())


In [None]:
# set up grid search 

params = {'C': np.logspace(-3, 3, 10),
          'penalty': ['l1', 'l2'],
          'solver': ['liblinear'],
          'fit_intercept': [True, False]
         }

gs_logreg_1 = GridSearchCV(estimator=logreg_1,
                         param_grid=params,
                         cv=5,
                         scoring='accuracy',
                         return_train_score=True)


In [None]:
# fit the model and extract grid search results

gs_logreg_1.fit(X_train, y_train)

print('Best Parameters:')
print(gs_logreg_1.best_params_)
print('Best estimator C:')
print(gs_logreg_1.best_estimator_.C)
print('Best estimator mean cross-validated score:')
print(gs_logreg_1.best_score_)
print('Best estimator score on the full training set:')
print(gs_logreg_1.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_logreg_1.score(X_test, y_test))
print('Best estimator coefficients:')
print(gs_logreg_1.best_estimator_.coef_)


In [None]:
# get predictions
predictions_train = gs_logreg_1.predict(X_train)
predictions_test = gs_logreg_1.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 2: High vs low hygiene rating - logistic regression - subset (including price)

In [None]:
# baseline accuracy
sub_price['fsa_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['fsa_high']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'review_count', 'price']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'price'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of logistic regression model
logreg_2 = LogisticRegression(random_state=1)

# check for parameters of the model
# list(logreg_2.get_params().keys())

In [None]:
# set up grid search 

params = {'C': np.logspace(-3, 3, 10),
          'penalty': ['l1', 'l2'],
          'solver': ['liblinear'],
          'fit_intercept': [True, False]
         }

gs_logreg_2 = GridSearchCV(estimator=logreg_2,
                         param_grid=params,
                         cv=5,
                         scoring='accuracy',
                         return_train_score=True)


In [None]:
# fit the model and extract grid search results

gs_logreg_2.fit(X_train, y_train)

print('Best Parameters:')
print(gs_logreg_2.best_params_)
print('Best estimator C:')
print(gs_logreg_2.best_estimator_.C)
print('Best estimator mean cross-validated score:')
print(gs_logreg_2.best_score_)
print('Best estimator score on the full training set:')
print(gs_logreg_2.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_logreg_2.score(X_test, y_test))
print('Best estimator coefficients:')
print(gs_logreg_2.best_estimator_.coef_)


In [None]:
# get predictions
predictions_train = gs_logreg_2.predict(X_train)
predictions_test = gs_logreg_2.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 3: High vs low hygiene rating - decision tree - full data set

In [None]:
# baseline accuracy
df['fsa_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = df['fsa_high']

In [None]:
# define predictor variables
X = df[['is_chain', 'postc_area', 'cat_1', 'review_count']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of Decision Tree Classifier model
dtc_3 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
list(dtc_3.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_3 = GridSearchCV(estimator=dtc_3,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_3.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_3.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_3.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_3.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_3.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_3.predict(X_train)
predictions_test = gs_dtc_3.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

In [None]:
dtc_best = gs_dtc_3.best_estimator_

In [None]:
# feature importance
feat_imp = pd.DataFrame({'feature': X_train.columns, 'importance': dtc_best.feature_importances_})

feat_imp.sort_values('importance', ascending=False, inplace=True)
feat_imp

In [None]:
# plot of top 15 features
fig, ax = plt.subplots(figsize=(12, 10))
ax = sns.barplot(x='importance', y='feature', data=feat_imp[:15], palette='YlOrRd_r')
plt.show()

#### Model 4: High vs low hygiene rating - decision tree - subset (including price)

In [None]:
# baseline accuracy
sub_price['fsa_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['fsa_high']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'review_count', 'price']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'price'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of Decision Tree Classifier model
dtc_4 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
# list(dtc_4.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_4 = GridSearchCV(estimator=dtc_4,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_4.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_4.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_4.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_4.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_4.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_4.predict(X_train)
predictions_test = gs_dtc_4.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 5: High vs low Yelp rating - logistic regression - full data set

In [None]:
# baseline accuracy
df['yelp_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = df['yelp_high']

In [None]:
# define predictor variables
X = df[['is_chain', 'postc_area', 'cat_1', 'RatingValue', 'review_count']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'RatingValue'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of logistic regression model
logreg_5 = LogisticRegression(random_state=1)

# check for parameters of the model
# list(logreg_5.get_params().keys())


In [None]:
# set up grid search 

params = {'C': np.logspace(-3, 3, 10),
          'penalty': ['l1', 'l2'],
          'solver': ['liblinear'],
          'fit_intercept': [True, False]
         }

gs_logreg_5 = GridSearchCV(estimator=logreg_5,
                         param_grid=params,
                         cv=5,
                         scoring='accuracy',
                         return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_logreg_5.fit(X_train, y_train)

print('Best Parameters:')
print(gs_logreg_5.best_params_)
print('Best estimator C:')
print(gs_logreg_5.best_estimator_.C)
print('Best estimator mean cross-validated score:')
print(gs_logreg_5.best_score_)
print('Best estimator score on the full training set:')
print(gs_logreg_5.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_logreg_5.score(X_test, y_test))
print('Best estimator coefficients:')
print(gs_logreg_5.best_estimator_.coef_)


In [None]:
# get predictions
predictions_train = gs_logreg_5.predict(X_train)
predictions_test = gs_logreg_5.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 6: High vs low Yelp rating - logistic regression - subset (including price)

(best model)

In [None]:
# baseline accuracy
sub_price['yelp_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['yelp_high']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'RatingValue', 'review_count', 'price']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'RatingValue', 'price'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of logistic regression model
logreg_6 = LogisticRegression(random_state=1)

# check for parameters of the model
# list(logreg_6.get_params().keys())


In [None]:
# set up grid search 

params = {'C': np.logspace(-3, 3, 10),
          'penalty': ['l1', 'l2'],
          'solver': ['liblinear'],
          'fit_intercept': [True, False]
         }

gs_logreg_6 = GridSearchCV(estimator=logreg_6,
                         param_grid=params,
                         cv=5,
                         scoring='accuracy',
                         return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_logreg_6.fit(X_train, y_train)

print('Best Parameters:')
print(gs_logreg_6.best_params_)
print('Best estimator C:')
print(gs_logreg_6.best_estimator_.C)
print('Best estimator mean cross-validated score:')
print(gs_logreg_6.best_score_)
print('Best estimator score on the full training set:')
print(gs_logreg_6.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_logreg_6.score(X_test, y_test))
print('Best estimator coefficients:')
print(gs_logreg_6.best_estimator_.coef_)

In [None]:
# get predictions
predictions_train = gs_logreg_6.predict(X_train)
predictions_test = gs_logreg_6.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

In [None]:
probabilities_train = gs_logreg_6.predict_proba(X_train)
probabilities_test = gs_logreg_6.predict_proba(X_test)

In [None]:
# ROC curve for training set
skplt.metrics.plot_roc(y_train, probabilities_train, figsize=(8,6))
plt.show()

In [None]:
# ROC curve for test set
skplt.metrics.plot_roc(y_test, probabilities_test, figsize=(8,6))
plt.show()

In [None]:
# Precision-recall curve for training set
skplt.metrics.plot_precision_recall(y_train, probabilities_train, figsize=(8,6))
plt.show()

In [None]:
# Precision-recall curve for test set
skplt.metrics.plot_precision_recall(y_test, probabilities_test, figsize=(8,6))
plt.show()

In [None]:
# top 10 positive coefficients (increase in the probability of score being 'high')
coefs_pos = pd.DataFrame(list(zip(X_train.columns, gs_logreg_6.best_estimator_.coef_[0])), 
                         columns=['feature', 'coef']).sort_values(by='coef', ascending=False)[:10] 
coefs_pos

In [None]:
# top 10 negative coefficients (decrease in the probability of score being 'high')
coefs_neg = pd.DataFrame(list(zip(X_train.columns, gs_logreg_6.best_estimator_.coef_[0])), 
                         columns=['feature', 'coef']).sort_values(by='coef')[:10] 
coefs_neg

In [None]:
# combine highest and lowest coefficients in dataframe
coefs_neg_sorted = coefs_neg.sort_values(by='coef', ascending=False)

coefs_dfs = [coefs_pos, coefs_neg_sorted]
coefs_combined = pd.concat(coefs_dfs)

In [None]:
# plot of top positive and negative coefficients
fig, ax = plt.subplots(figsize=(18, 16))
ax = sns.barplot(x='coef', y='feature', data=coefs_combined, palette='YlOrRd_r')
plt.show()

#### Model 7: High vs low Yelp rating - decision tree - full data set

In [None]:
# baseline accuracy
df['yelp_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = df['yelp_high']

In [None]:
# define predictor variables
X = df[['is_chain', 'postc_area', 'cat_1', 'RatingValue', 'review_count']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'RatingValue'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of Decision Tree Classifier model
dtc_7 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
# list(dtc_7.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_7 = GridSearchCV(estimator=dtc_7,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_7.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_7.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_7.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_7.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_7.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_7.predict(X_train)
predictions_test = gs_dtc_7.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 8: High vs low Yelp rating - decision tree - subset (including price)

In [None]:
# baseline accuracy
sub_price['yelp_high'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['yelp_high']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'RatingValue', 'review_count', 'price']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'RatingValue', 'price'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of Decision Tree Classifier model
dtc_8 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
# list(dtc_8.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_8 = GridSearchCV(estimator=dtc_8,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_8.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_8.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_8.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_8.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_8.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_8.predict(X_train)
predictions_test = gs_dtc_8.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[1, 0], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[1, 0], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

In [None]:
probabilities_train = gs_dtc_8.predict_proba(X_train)
probabilities_test = gs_dtc_8.predict_proba(X_test)

In [None]:
# ROC curve for training set
skplt.metrics.plot_roc(y_train, probabilities_train, figsize=(8,6))
plt.show()

In [None]:
# ROC curve for test set
skplt.metrics.plot_roc(y_test, probabilities_test, figsize=(8,6))
plt.show()

In [None]:
# Precision-recall curve for training set
skplt.metrics.plot_precision_recall(y_train, probabilities_train, figsize=(8,6))
plt.show()

In [None]:
# Precision-recall curve for test set
skplt.metrics.plot_precision_recall(y_test, probabilities_test, figsize=(8,6))
plt.show()

In [None]:
dtc_best = gs_dtc_8.best_estimator_

In [None]:
# feature importance
feat_imp = pd.DataFrame({'feature': X_train.columns, 'importance': dtc_best.feature_importances_})

feat_imp.sort_values('importance', ascending=False, inplace=True)
feat_imp

In [None]:
# plot of top 15 features
fig, ax = plt.subplots(figsize=(12, 10))
ax = sns.barplot(x='importance', y='feature', data=feat_imp[:15], palette='YlOrRd_r')
plt.show()

#### B) Multiclass Classification

In [None]:
# add column for food hygiene label (3 classes: 5, 4, 3 and lower)
df['fsa_label'] = df['RatingValue'].apply(lambda x: '5' if x==5 else '4' if x==4 else '3_lower')

In [None]:
# add column for Yelp rating label
df['yelp_label'] = df['rating'].apply(lambda x: '5' if x==5 else '4' if x==4 or x==4.5
                                      else '3' if x==3 or x==3.5
                                      else '2' if x==2 or x==2.5
                                      else '1')

In [None]:
df.head()

In [None]:
# create subset of data dropping missing values for price
sub_price = df.dropna(subset = ['price']) 

In [None]:
sub_price.info()

#### Model 9: Hygiene ratings (0-5) - decision tree - subset (including price)

In [None]:
# baseline accuracy
sub_price['RatingValue'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['RatingValue']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'review_count', 'price', 'BusinessType']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'price', 'BusinessType'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

In [None]:
# create instance of Decision Tree Classifier model
dtc_9 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
# list(dtc_9.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_9 = GridSearchCV(estimator=dtc_9,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_9.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_9.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_9.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_9.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_9.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_9.predict(X_train)
predictions_test = gs_dtc_9.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[0, 1, 2, 3, 4, 5], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[0, 1, 2, 3, 4, 5], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 10: Hygiene ratings (5, 4, 3 and lower) - decision tree - subset (including price)

In [None]:
# baseline accuracy
sub_price['fsa_label'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['fsa_label']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'review_count', 'price', 'BusinessType']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'price', 'BusinessType'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

In [None]:
# create instance of Decision Tree Classifier model
dtc_10 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
# list(dtc_10.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_10 = GridSearchCV(estimator=dtc_10,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_10.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_10.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_10.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_10.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_10.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_10.predict(X_train)
predictions_test = gs_dtc_10.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=['5', '4', '3_lower'], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=['5', '4', '3_lower'], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 11: Yelp ratings - decision tree - subset (including price)

In [None]:
# baseline accuracy
sub_price['yelp_label'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['yelp_label']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'RatingValue', 'review_count', 'price', 'BusinessType']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'RatingValue', 'price', 'BusinessType'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# rescale features
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)


In [None]:
# create instance of Decision Tree Classifier model
dtc_11 = DecisionTreeClassifier(random_state=1)

# check for parameters of the model
# list(dtc_11.get_params().keys())


In [None]:
# set up grid search
params_dtc = {'criterion': ['gini', 'entropy'], 
              'max_depth': list(range(1,30)),
              'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_11 = GridSearchCV(estimator=dtc_11,
                      param_grid=params_dtc,
                      cv=5,
                      verbose=1,
                      return_train_score=True)

In [None]:
# fit the model and extract grid search results

gs_dtc_11.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_11.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_11.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_11.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_11.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_11.predict(X_train)
predictions_test = gs_dtc_11.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=['1', '2', '3', '4', '5'], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=['1', '2', '3', '4', '5'], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### C) Oversampling / SMOTE

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

In [None]:
# no synthetic data in test set!

#### Model 12: Hygiene ratings (5, 4, 3 and lower) - decision tree - subset/including price - SMOTE - multiclass

In [None]:
# baseline accuracy
sub_price['fsa_label'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['fsa_label']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'review_count', 'price', 'BusinessType']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'price', 'BusinessType'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# https://towardsdatascience.com/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7
# https://imbalanced-learn.org/stable/references/generated/imblearn.pipeline.Pipeline.html

In [None]:
# use SMOTE as part of pipeline to avoid having synthetic validation data

pipeline = Pipeline(steps = [['smote', SMOTE(random_state=1)],
                             ['scaler', StandardScaler()],
                             ['classifier', DecisionTreeClassifier(random_state=1)]])


In [None]:
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

In [None]:
# pipeline.get_params().keys()

In [None]:
# set up grid search
params_dtc = {'classifier__criterion': ['gini', 'entropy'], 
              'classifier__max_depth': list(range(1,30)),
              'classifier__max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_12 = GridSearchCV(estimator=pipeline,
                      param_grid=params_dtc,
                      cv=stratified_kfold,
                      verbose=1,
                      return_train_score=True)

In [None]:
gs_dtc_12.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_12.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_12.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_12.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_12.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_12.predict(X_train)
predictions_test = gs_dtc_12.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=['5', '4', '3_lower'], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=['5', '4', '3_lower'], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 13: Hygiene ratings (0-5) - decision tree - subset/including price - SMOTE - multiclass

In [None]:
# baseline accuracy
sub_price['RatingValue'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['RatingValue']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'review_count', 'price', 'BusinessType']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'price', 'BusinessType'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
pipeline = Pipeline(steps = [['smote', SMOTE(random_state=1)],
                             ['scaler', StandardScaler()],
                             ['classifier', DecisionTreeClassifier(random_state=1)]])


In [None]:
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

In [None]:
# set up grid search
params_dtc = {'classifier__criterion': ['gini', 'entropy'], 
              'classifier__max_depth': list(range(1,30)),
              'classifier__max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_13 = GridSearchCV(estimator=pipeline,
                      param_grid=params_dtc,
                      cv=stratified_kfold,
                      verbose=1,
                      return_train_score=True)

In [None]:
gs_dtc_13.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_13.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_13.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_13.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_13.score(X_test, y_test))

In [None]:
# get predictions
predictions_train = gs_dtc_13.predict(X_train)
predictions_test = gs_dtc_13.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=[0, 1, 2, 3, 4, 5], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=[0, 1, 2, 3, 4, 5], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model 14: Yelp ratings - decision tree - subset/including price  - SMOTE  - multiclass

In [None]:
# baseline accuracy
sub_price['yelp_label'].value_counts(normalize=True)

In [None]:
# define target variable
y = sub_price['yelp_label']

In [None]:
# define predictor variables
X = sub_price[['is_chain', 'postc_area', 'cat_1', 'RatingValue', 'review_count', 'price', 'BusinessType']]

In [None]:
# dummify predictor variables
X_dum = pd.get_dummies(X, columns=['postc_area', 'cat_1', 'RatingValue', 'price', 'BusinessType'], drop_first=True)

In [None]:
X_dum.head()

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
pipeline = Pipeline(steps = [['smote', SMOTE(random_state=1)],
                             ['scaler', StandardScaler()],
                             ['classifier', DecisionTreeClassifier(random_state=1)]])


In [None]:
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

In [None]:
# set up grid search
params_dtc = {'classifier__criterion': ['gini', 'entropy'], 
              'classifier__max_depth': list(range(1,30)),
              'classifier__max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}

gs_dtc_14 = GridSearchCV(estimator=pipeline,
                      param_grid=params_dtc,
                      cv=stratified_kfold,
                      verbose=1,
                      return_train_score=True)


In [None]:
gs_dtc_14.fit(X_train, y_train)

print('Best Parameters:')
print(gs_dtc_14.best_params_)
print('Best estimator mean cross-validated score:')
print(gs_dtc_14.best_score_)
print('Best estimator score on the full training set:')
print(gs_dtc_14.score(X_train, y_train))
print('Best estimator score on the test set:')
print(gs_dtc_14.score(X_test, y_test))


In [None]:
# get predictions
predictions_train = gs_dtc_14.predict(X_train)
predictions_test = gs_dtc_14.predict(X_test)

In [None]:
# confusion matrix for training set
skplt.metrics.plot_confusion_matrix(y_train, predictions_train, labels=['1', '2', '3', '4', '5'], figsize=(6,6))
plt.show()

In [None]:
# confusion matrix for test set
skplt.metrics.plot_confusion_matrix(y_test, predictions_test, labels=['1', '2', '3', '4', '5'], figsize=(6, 6))
plt.show()

In [None]:
# classification report for training set
print(classification_report(y_train, predictions_train))

In [None]:
# classification report for test set
print(classification_report(y_test, predictions_test))

#### Model Summary

Several binary classification models were run to predict whether a restaurant's food hygiene rating or Yelp rating was high. The cut-off point used was 5 for a high food hygiene rating and 4 for a high Yelp rating.
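The binarisation behind these targets can be sketched as follows. This is a minimal illustration, assuming that "high" means at or above the cut-off; the helper `is_high` and the sample ratings are hypothetical and not taken from the notebook's data.

```python
# Hypothetical helper illustrating the cut-offs described above.
# Assumption: a rating counts as 'high' when it is at or above the cut-off.
HYGIENE_CUTOFF = 5   # food hygiene ratings run from 0 to 5
YELP_CUTOFF = 4      # Yelp ratings run from 1 to 5 in steps of 0.5


def is_high(rating, cutoff):
    """Return 1 if the rating reaches the cut-off, otherwise 0."""
    return int(rating >= cutoff)


# Made-up example ratings, for illustration only
hygiene_ratings = [5, 4, 3, 5, 0]
yelp_ratings = [4.5, 3.5, 4.0, 2.0, 5.0]

hygiene_high = [is_high(r, HYGIENE_CUTOFF) for r in hygiene_ratings]
yelp_high = [is_high(r, YELP_CUTOFF) for r in yelp_ratings]

print(hygiene_high)
print(yelp_high)
```

With a cut-off at the top of the scale, only a hygiene rating of exactly 5 counts as high, whereas Yelp ratings of 4, 4.5 and 5 all fall into the high class.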

Both a Logistic Regression and a Decision Tree Classifier were applied to the full dataset and to a subset of the data for which price information was available.

The best-performing model was a Logistic Regression applied to the data subset, which predicted a high Yelp rating. The accuracy of the model (the proportion of correctly classified observations regardless of class) was 0.625814 (mean cross-validated score). This result is 4.1 percentage points above the baseline of 0.584716. Recall was only 0.42 for the low Yelp rating class, i.e. only 42% of truly low Yelp ratings were correctly predicted as such. Precision was 0.66 for the high Yelp rating class and 0.61 for the low Yelp rating class, i.e. 66% of observations predicted as high and 61% of observations predicted as low were correctly labelled. Overall, the model has only limited predictive power.
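To make the metrics concrete, the sketch below recomputes the gain over the baseline from the two reported scores and shows how precision and recall follow from confusion-matrix counts. The counts are hypothetical: they were chosen only so that the resulting rates match the reported 0.42 recall and 0.61 precision for the low class, and they are not the model's actual confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    return precision, recall


# Scores reported in the summary above
baseline = 0.584716
cv_accuracy = 0.625814
gain_pp = (cv_accuracy - baseline) * 100   # gain in percentage points

# Hypothetical counts for the 'low' class (illustration only):
# 100 truly low restaurants, 42 of them predicted low, plus 27 false alarms
tp, fp, fn = 42, 27, 58
precision, recall = precision_recall(tp, fp, fn)

print(round(gain_pp, 1))
print(round(precision, 2), round(recall, 2))
```

In the notebook itself these per-class figures come from `classification_report`; the function here only spells out the underlying ratios.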

Review count had the highest positive coefficient, i.e. all else being equal, a higher number of reviews increased the probability of the Yelp rating being classed as high the most. The strongest decrease in the probability of a high Yelp rating resulted from the restaurant being a pub or part of a chain. Because the features were standardised before fitting, the coefficients are on a comparable scale.
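A logistic regression coefficient is a change in log-odds, so exponentiating it gives the factor by which the odds of a high rating change for a one-unit increase in the feature (one standard deviation here, since the features were scaled). The sketch below shows the conversion; the coefficient values are hypothetical placeholders, not the fitted values from `gs_logreg_6`.

```python
import math


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coef)


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients, for illustration only
coefs = {'review_count': 0.30, 'is_chain': -0.20}

for name, coef in coefs.items():
    print(name, round(odds_ratio(coef), 3))

# Starting from even odds (log-odds of 0), a one-unit increase in
# review_count moves the predicted probability above 0.5
p_after = sigmoid(0.0 + coefs['review_count'])
print(round(p_after, 3))
```

A positive coefficient yields an odds ratio above 1 and a negative one yields a ratio below 1, which is why the sign alone indicates the direction of the effect.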

Multiclass classification models using a Decision Tree Classifier were applied to the data subset. The aim was to predict specific food hygiene and Yelp ratings rather than a high or low rating as in the binary model.

The accuracy was very close to the baseline because the models predicted the majority class (food hygiene rating of 5 and Yelp rating of 4) but failed to classify the minority classes correctly. Specifically, recall was very low for the minority classes and in some cases close to or equal to 0. Overall, the multiclass classification models had poor predictive power.
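The link between majority-class predictions, baseline accuracy and zero minority recall can be illustrated with a toy example. The label distribution below is made up for the illustration and does not reflect the notebook's data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Proportion of observations with the given true label that were predicted as such."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# Hypothetical imbalanced labels: 70% '5', 20% '4', 10% '3_lower'
y_true = ['5'] * 7 + ['4'] * 2 + ['3_lower'] * 1

# Baseline accuracy is the share of the most frequent class
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)

# A model that always predicts the majority class
y_pred = [majority_label] * len(y_true)

print(baseline, accuracy(y_true, y_pred))
print(recall(y_true, y_pred, '5'), recall(y_true, y_pred, '4'))
```

Such a model matches the baseline accuracy exactly while finding none of the minority observations, which is the pattern described above.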

The SMOTE algorithm was applied to address the underlying problem of class imbalance. SMOTE is an oversampling technique where synthetic data points are generated for the minority classes. This was applied within a pipeline to avoid using synthetic data for cross-validating or testing the model. 
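The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified, dependency-free illustration and not the implementation in `imblearn`; the function name `smote_sketch`, its parameters and the sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=1):
    """Create n_new synthetic points, each on the line segment between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up minority-class samples with two numeric features
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=5)

for point in new_points:
    print(point)
```

Because every synthetic point is derived from real training observations, resampling must happen only on the training folds. Placing SMOTE inside the pipeline achieves this, as the resampling step is skipped when the pipeline predicts on validation or test data.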

Using SMOTE improved recall in the multiclass classification, but mainly on the training set. For the model predicting a specific food hygiene rating, recall of the minority classes was between 0.75 and 0.84 for the training set. For the test set, these numbers dropped to between 0.09 and 0.26. This indicates overfitting: the model does not generalise well to unseen data.

With regard to recall, the model predicting a specific Yelp rating class had a low score for a rating of 5 and considerable differences between training and test data for ratings of 1 and 2. The overall accuracy of the model was only 0.39.  


#### Limitations

- Class imbalance: The main limitation of the models stems from an underlying class imbalance in the data set, which could not be fully resolved by applying the SMOTE algorithm.

- Predictors: The Yelp API exposes only a few restaurant attributes, which restricted the number of predictors available for modelling. The feature engineering used to identify restaurants that belong to chains might also introduce bias (possibly underestimating the number of chain restaurants).

- Data set: The data set resulted from matching restaurants in two separate data sets (Yelp and Food Standards Agency data). This reduced the number of observations and might have introduced bias, as the resulting subset might not be representative of the overall restaurant population.