![title](images/Graduate_Project_Julie_Doherty.001.jpeg)

![title](images/Graduate_Project_Julie_Doherty.002.jpeg)

![title](images/Graduate_Project_Julie_Doherty.003.jpeg)

![title](images/Graduate_Project_Julie_Doherty.004.jpeg)

### Imports and Reading in Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go

In [2]:
# Read in json file and store in dataframe
business = pd.read_json('yelp_dataset/business.json', lines = True)
business.shape

(174567, 15)

In [3]:
business.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,"4855 E Warner Rd, Ste B9","{'AcceptsInsurance': True, 'ByAppointmentOnly'...",FYWN1wneV18bWNgQjJ2GNg,"[Dentists, General Dentistry, Health & Medical...",Ahwatukee,"{'Friday': '7:30-17:00', 'Tuesday': '7:30-17:0...",1,33.33069,-111.978599,Dental by Design,,85044,22,4.0,AZ
1,3101 Washington Rd,"{'BusinessParking': {'garage': False, 'street'...",He-G7vWjzVUysIKrfNbPUQ,"[Hair Stylists, Hair Salons, Men's Hair Salons...",McMurray,"{'Monday': '9:00-20:00', 'Tuesday': '9:00-20:0...",1,40.291685,-80.1049,Stephen Szabo Salon,,15317,11,3.0,PA
2,"6025 N 27th Ave, Ste 1",{},KQPW8lFf1y5BT2MxiSZ3QA,"[Departments of Motor Vehicles, Public Service...",Phoenix,{},1,33.524903,-112.11531,Western Motor Vehicle,,85017,18,1.5,AZ
3,"5000 Arizona Mills Cr, Ste 435","{'BusinessAcceptsCreditCards': True, 'Restaura...",8DShNS-LuFqpEWIp0HxijA,"[Sporting Goods, Shopping]",Tempe,"{'Monday': '10:00-21:00', 'Tuesday': '10:00-21...",0,33.383147,-111.964725,Sports Authority,,85282,9,3.0,AZ
4,581 Howe Ave,"{'Alcohol': 'full_bar', 'HasTV': True, 'NoiseL...",PfOCPjBrlQAnz__NXj9h_w,"[American (New), Nightlife, Bars, Sandwiches, ...",Cuyahoga Falls,"{'Monday': '11:00-1:00', 'Tuesday': '11:00-1:0...",1,41.119535,-81.47569,Brick House Tavern + Tap,,44221,116,3.5,OH


The dataset consists of almost 175,000 businesses from a variety of locations, and includes information about their attributes, the categories of business into which they fall, and their hours.

### Data Cleaning

My goal is to analyze the star ratings of restaurants, so I'll use the "categories" column to filter the businesses in the dataset. Using a set containing categories that pertain to any restaurant-type business, I'll check whether each row in the dataframe contains any of the values in the set and create a new dataframe with just those that return true.

In [4]:
# Categories encompassing restaurant-related businesses
restaurant_cats = set(['Restaurants', 'Food', 'Bars', 'Sandwiches', 'Italian', 'Coffee & Tea', 
                   'Ice Cream & Frozen Yogurt', 'Breakfast & Brunch', 'Bakeries', 'Bagels', 'French', 'American (New)',
                   'American (Traditional)', 'Mexican', 'Soup', 'Salad', 'Tea Rooms', 'Pubs', 'Comfort Food', 
                   'Desserts', 'Seafood', 'Fast Food', 'Burgers'])

In [5]:
def in_list(business_cats):
    """
    Return True if any of the business's categories are found in restaurant_cats; otherwise return False.
    
    @param business_cats: list of categories
    @return boolean value 
    """
    # An empty intersection is falsy, so bool() yields the True/False flag directly
    return bool(restaurant_cats.intersection(business_cats))

In [6]:
# Add 'restaurant' column to business dataframe that contains true/false values
business['restaurant'] = business['categories'].apply(in_list)
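
The set-intersection filter can be checked on a small hand-made frame. The sketch below is illustrative only: the toy rows, the abbreviated category set, and the helper name `has_restaurant_category` are made up, and the guard against non-list entries is an assumption about how missing categories might appear in other releases of the dataset, not something observed in this notebook.

```python
import pandas as pd

# Hypothetical, abbreviated category set for illustration
restaurant_cats = {'Restaurants', 'Food', 'Bars'}

def has_restaurant_category(business_cats):
    """Return True if any category is in restaurant_cats; False for non-list input."""
    if not isinstance(business_cats, (list, tuple, set)):
        return False
    return not restaurant_cats.isdisjoint(business_cats)

# Invented rows: one non-restaurant, one bar, one with no categories at all
toy = pd.DataFrame({
    'name': ['Dentist', 'Tavern', 'Unknown'],
    'categories': [['Dentists', 'Health & Medical'], ['Bars', 'Nightlife'], None],
})

toy['restaurant'] = toy['categories'].apply(has_restaurant_category)
restaurant_names = toy.loc[toy['restaurant'], 'name'].tolist()
print(restaurant_names)
```

Only the row whose categories overlap the set is kept; the row with no category list is treated as a non-restaurant rather than raising an error.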

In [7]:
# Create new dataframe containing only restaurants
restaurants = business[business['restaurant']].copy()
restaurants.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state,restaurant
4,581 Howe Ave,"{'Alcohol': 'full_bar', 'HasTV': True, 'NoiseL...",PfOCPjBrlQAnz__NXj9h_w,"[American (New), Nightlife, Bars, Sandwiches, ...",Cuyahoga Falls,"{'Monday': '11:00-1:00', 'Tuesday': '11:00-1:0...",1,41.119535,-81.47569,Brick House Tavern + Tap,,44221,116,3.5,OH,True
5,Richterstr. 11,"{'GoodForMeal': {'dessert': False, 'latenight'...",o9eMRCWt5PkpLDE0gOPtcQ,"[Italian, Restaurants]",Stuttgart,"{'Monday': '18:00-0:00', 'Tuesday': '18:00-0:0...",1,48.7272,9.14795,Messina,,70567,5,4.0,BW,True
8,2612 Brandt School Rd,"{'BusinessParking': {'garage': False, 'street'...",EsMcGiZaQuG1OOvL9iUFug,"[Coffee & Tea, Ice Cream & Frozen Yogurt, Food]",Wexford,{},1,40.615102,-80.091349,Any Given Sundae,,15090,15,5.0,PA,True
10,737 West Pike St,"{'RestaurantsTableService': True, 'GoodForMeal...",XOSRcvtaKc_Q5H1SAzN20A,"[Breakfast & Brunch, Gluten-Free, Coffee & Tea...",Houston,{},0,40.241548,-80.212815,East Coast Coffee,,15342,3,4.5,PA,True
12,35 Main Street N,"{'BusinessParking': {'garage': False, 'street'...",xcgFnd-MwkZeO5G2HQ0gAQ,"[Bakeries, Bagels, Food]",Markham,{},1,43.875177,-79.260153,T & T Bakery and Cafe,Markham Village,L3P 1X3,38,4.0,ON,True


Next I'll remove any businesses that are not currently open, using the "is_open" column, so that open status does not introduce discrepancies. I'll also drop the columns that will not be used in the analysis (mainly those pertaining to location) and reset the index.

In [8]:
# Filter out restaurants that are not currently open
restaurants = restaurants[restaurants['is_open'] == 1]

# Drop unnecessary columns
restaurants.drop(['address', 'city', 'is_open', 'latitude', 'longitude', 'name', 'neighborhood', 'postal_code', 
                  'state', 'restaurant', 'business_id', 'categories', 'hours'], axis = 1, inplace = True)

# Reset index
restaurants.reset_index(drop = True, inplace = True)

The "attributes" column contains the majority of the information that will be used for the analysis. However, this data is contained in a dictionary in a single column. I'll create a new dataframe in which to split out each value of the dictionaries, and concatenate that to the restaurants dataframe.

In [9]:
# Create a new dataframe in which to split out attributes
attributes = restaurants['attributes'].apply(pd.Series)
attributes.head()

Unnamed: 0,AcceptsInsurance,AgesAllowed,Alcohol,Ambience,BYOB,BYOBCorkage,BestNights,BikeParking,BusinessAcceptsBitcoin,BusinessAcceptsCreditCards,...,RestaurantsCounterService,RestaurantsDelivery,RestaurantsGoodForGroups,RestaurantsPriceRange2,RestaurantsReservations,RestaurantsTableService,RestaurantsTakeOut,Smoking,WheelchairAccessible,WiFi
0,,,full_bar,"{'romantic': False, 'intimate': False, 'classy...",,,"{'monday': False, 'tuesday': False, 'friday': ...",True,,True,...,,False,True,2.0,False,True,True,outdoor,,free
1,,,beer_and_wine,"{'romantic': False, 'intimate': False, 'classy...",,,,,,True,...,,False,True,3.0,True,,False,,False,
2,,,,,,,,False,,True,...,,,,1.0,,,True,,,free
3,,,,,,,,True,,False,...,,False,,1.0,,,True,,,
4,,,,"{'romantic': False, 'intimate': False, 'classy...",,,,True,,True,...,,False,True,1.0,False,False,True,,,free
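
Calling `.apply(pd.Series)` builds one Series per row, which can be slow on a frame of this size. As a sketch of one possible alternative (not what this notebook runs), the list of dictionaries can be passed to the `DataFrame` constructor in a single step. The toy frame below is invented for illustration.

```python
import pandas as pd

# Invented stand-in for the restaurants dataframe
toy = pd.DataFrame({
    'stars': [3.5, 4.0, 1.5],
    'attributes': [
        {'Alcohol': 'full_bar', 'WiFi': 'free'},
        {'WiFi': 'no'},
        {},
    ],
})

# Build the wide frame from the list of dicts in one step; keys missing from a
# row become NaN, and the original index is reused so a later concat lines up.
expanded = pd.DataFrame(toy['attributes'].tolist(), index=toy.index)

combined = pd.concat([toy.drop(columns='attributes'), expanded], axis=1)
print(combined)
```

Nested dictionaries such as "BusinessParking" would still need a second expansion step, as in the notebook.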


From the full list of attributes, I'll narrow down the columns in the dataframe to those that I expect to have an impact on restaurants' star ratings. I will then split out the "BusinessParking" attribute, which contains a dictionary in each row.

In [10]:
# View all attributes
attributes.columns

Index(['AcceptsInsurance', 'AgesAllowed', 'Alcohol', 'Ambience', 'BYOB',
       'BYOBCorkage', 'BestNights', 'BikeParking', 'BusinessAcceptsBitcoin',
       'BusinessAcceptsCreditCards', 'BusinessParking', 'ByAppointmentOnly',
       'Caters', 'CoatCheck', 'Corkage', 'DietaryRestrictions', 'DogsAllowed',
       'DriveThru', 'GoodForDancing', 'GoodForKids', 'GoodForMeal',
       'HairSpecializesIn', 'HappyHour', 'HasTV', 'Music', 'NoiseLevel',
       'Open24Hours', 'OutdoorSeating', 'RestaurantsAttire',
       'RestaurantsCounterService', 'RestaurantsDelivery',
       'RestaurantsGoodForGroups', 'RestaurantsPriceRange2',
       'RestaurantsReservations', 'RestaurantsTableService',
       'RestaurantsTakeOut', 'Smoking', 'WheelchairAccessible', 'WiFi'],
      dtype='object')

In [11]:
# Filter down to relevant attributes
attributes_short = attributes[['Alcohol', 'BikeParking', 'BusinessAcceptsCreditCards', 'BusinessParking', 
                               'RestaurantsPriceRange2', 'RestaurantsReservations', 'WiFi']]
attributes_short.head()

Unnamed: 0,Alcohol,BikeParking,BusinessAcceptsCreditCards,BusinessParking,RestaurantsPriceRange2,RestaurantsReservations,WiFi
0,full_bar,True,True,"{'garage': False, 'street': False, 'validated'...",2.0,False,free
1,beer_and_wine,,True,"{'garage': False, 'street': False, 'validated'...",3.0,True,
2,,False,True,"{'garage': False, 'street': False, 'validated'...",1.0,,free
3,,True,False,"{'garage': False, 'street': True, 'validated':...",1.0,,
4,,True,True,,1.0,False,free


In [12]:
# Split out business parking attribute
parking = attributes_short['BusinessParking'].apply(pd.Series)
parking.drop(0, axis = 1, inplace = True)
parking.head()


'<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects





Unnamed: 0,garage,lot,street,valet,validated
0,False,True,False,False,False
1,False,False,False,False,False
2,False,True,False,False,False
3,False,True,True,False,False
4,,,,,


Now that all of the attributes have been split out into individual columns, I'll recombine the attributes and parking dataframes with the original restaurants dataframe and remove the columns containing dictionaries.

In [13]:
# Concatenate all three dataframes together
rest_attributes_short = pd.concat([restaurants, attributes_short, parking], axis = 1)

# Drop the columns containing dictionaries
rest_attributes_short.drop(['attributes', 'BusinessParking'], axis = 1, inplace = True)

Because the dataframe still has plenty of missing data, I'll drop all rows that contain any NaN values and once again reset the index. I've chosen to drop missing data because a NaN value for an attribute does not necessarily mean the restaurant lacks that attribute, and leaving these values intact might skew the analysis.

In [14]:
# Drop rows with any missing data
rest = rest_attributes_short.dropna(axis = 0, how = 'any')

# Reset index
rest.reset_index(drop = True, inplace = True)

# Print shape of the resultant dataframe
print(rest.shape)

rest.head()

(23732, 13)


Unnamed: 0,review_count,stars,Alcohol,BikeParking,BusinessAcceptsCreditCards,RestaurantsPriceRange2,RestaurantsReservations,WiFi,garage,lot,street,valet,validated
0,116,3.5,full_bar,True,True,2.0,False,free,False,True,False,False,False
1,34,3.5,full_bar,True,True,2.0,True,no,False,True,False,False,False
2,78,3.5,full_bar,True,True,2.0,False,free,False,True,True,False,False
3,10,1.0,none,True,True,1.0,False,free,False,True,False,False,False
4,232,3.0,full_bar,True,True,2.0,False,free,False,True,False,False,False
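
Dropping every row with any missing value is a large cut, so it can help to see which columns drive it. The following is a minimal sketch on a made-up frame standing in for `rest_attributes_short`; the values are invented and do not reflect the actual dataset.

```python
import numpy as np
import pandas as pd

# Invented stand-in for rest_attributes_short
toy = pd.DataFrame({
    'stars': [3.5, 4.0, 1.5, 5.0],
    'Alcohol': ['full_bar', np.nan, 'none', np.nan],
    'WiFi': ['free', 'no', np.nan, np.nan],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Fraction of rows that survive dropping any row with a missing value
complete = toy.dropna(axis=0, how='any')
kept_fraction = len(complete) / len(toy)
print(kept_fraction)
```

If one or two sparse columns account for most of the loss, leaving them out of the attribute list before calling `dropna` would retain more restaurants.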


### Feature Engineering

Now that I have a dataframe containing all the attributes I wish to use and the star ratings for each restaurant, I need to prepare the attributes for use in a model. "review_count" and "RestaurantsPriceRange2" are numeric and can be used as-is, but the remainder of the attributes contain either categories or booleans. In order to transform these attributes into a format that can be used by a classifier, I'll use one-hot encoding to split each of them into multiple columns.

In [15]:
# Use one-hot encoding to transform data
rest_info = pd.get_dummies(rest)

rest_info.head()

Unnamed: 0,review_count,stars,RestaurantsPriceRange2,Alcohol_beer_and_wine,Alcohol_full_bar,Alcohol_none,BikeParking_False,BikeParking_True,BusinessAcceptsCreditCards_False,BusinessAcceptsCreditCards_True,...,garage_False,garage_True,lot_False,lot_True,street_False,street_True,valet_False,valet_True,validated_False,validated_True
0,116,3.5,2.0,0,1,0,0,1,0,1,...,1,0,0,1,1,0,1,0,1,0
1,34,3.5,2.0,0,1,0,0,1,0,1,...,1,0,0,1,1,0,1,0,1,0
2,78,3.5,2.0,0,1,0,0,1,0,1,...,1,0,0,1,0,1,1,0,1,0
3,10,1.0,1.0,0,0,1,0,1,0,1,...,1,0,0,1,1,0,1,0,1,0
4,232,3.0,2.0,0,1,0,0,1,0,1,...,1,0,0,1,1,0,1,0,1,0
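
For a two-valued attribute, the `_False` and `_True` columns carry the same information. As a sketch of one possible variation (not what this notebook does), `drop_first=True` keeps a single column per binary attribute. The toy frame is invented; `columns=` is passed explicitly because `get_dummies` by default only encodes object and category columns, so a true `bool` column would otherwise be left untouched.

```python
import pandas as pd

# Invented stand-in for the cleaned restaurant dataframe
toy = pd.DataFrame({
    'review_count': [116, 10, 34],
    'Alcohol': ['full_bar', 'none', 'beer_and_wine'],
    'BikeParking': [True, False, True],
})

# Encode the named columns and drop the first level of each,
# so binary attributes end up as one 0/1 column
encoded = pd.get_dummies(
    toy,
    columns=['Alcohol', 'BikeParking'],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

Whether the redundant columns matter depends on the model; tree-based classifiers are largely indifferent to them, while distance-based ones count the same attribute twice.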


In [16]:
rest_info.columns

Index(['review_count', 'stars', 'RestaurantsPriceRange2',
       'Alcohol_beer_and_wine', 'Alcohol_full_bar', 'Alcohol_none',
       'BikeParking_False', 'BikeParking_True',
       'BusinessAcceptsCreditCards_False', 'BusinessAcceptsCreditCards_True',
       'RestaurantsReservations_False', 'RestaurantsReservations_True',
       'WiFi_free', 'WiFi_no', 'WiFi_paid', 'garage_False', 'garage_True',
       'lot_False', 'lot_True', 'street_False', 'street_True', 'valet_False',
       'valet_True', 'validated_False', 'validated_True'],
      dtype='object')

### Star Rating by Review Count

In addition to looking at the attributes of restaurants, I'll look into whether the number of reviews a restaurant has is related to its star rating. To do so, I'll isolate the review count and stars data, create a pivot table, and visualize the data with a bar chart.

In [17]:
# Isolate the review count and star rating information in a new dataframe
rest_reviews = rest_info[['review_count', 'stars']]

In [18]:
# View overall value counts of star ratings
rest_reviews['stars'].value_counts()

4.0    7009
3.5    6434
3.0    4031
4.5    2882
2.5    1946
2.0     857
5.0     272
1.5     266
1.0      35
Name: stars, dtype: int64

In [19]:
# Generate a pivot table grouped by star ratings that displays the average number of reviews
pt = rest_reviews.pivot_table(index = 'stars', values = 'review_count')
pt

stars,review_count
1.0,7.857143
1.5,24.830827
2.0,33.11902
2.5,52.787256
3.0,69.25552
3.5,103.646876
4.0,149.90769
4.5,145.842817
5.0,48.106618
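
`pivot_table` aggregates with the mean by default, which a few heavily reviewed restaurants can pull upward. A sketch of an alternative view, on invented numbers, that puts the median and the group size next to the mean:

```python
import pandas as pd

# Invented review counts; one 4-star restaurant is far more reviewed than the rest
toy = pd.DataFrame({
    'stars': [4.0, 4.0, 4.0, 5.0, 5.0],
    'review_count': [10, 20, 300, 3, 5],
})

# Mean, median, and number of restaurants per star rating
summary = toy.groupby('stars')['review_count'].agg(['mean', 'median', 'count'])
print(summary)
```

In this toy data the 4-star mean (110) is well above the median (20), which is the kind of gap that would suggest a skewed distribution within a rating group.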


In [20]:
# Plot the review count against the star rating as a bar chart
data = [go.Bar(
            x = pt.index,
            y = pt['review_count']
    )]

plotly.offline.iplot(data, filename='reviews')

The average review count varies substantially with star rating: it rises from about 8 reviews at 1 star to about 150 at 4 stars, then falls to about 48 at 5 stars. I'll need to take this into account when creating a model.

### K Neighbors Classifier

In [21]:
# Split out features and targets
restaurant_features = rest_info.drop(['stars', 'review_count'], axis = 1)
restaurant_targets = (rest_info['stars']*10).astype('int')

# Train test split
X_train, X_test, y_train, y_test = train_test_split(restaurant_features, restaurant_targets, test_size = 0.3, 
                                                    random_state = 42, stratify = restaurant_targets)

# Model instantiation
knn = KNeighborsClassifier(n_neighbors = 10)

# Train model on training set
knn.fit(X_train, y_train)

# Predict labels for test set
predict = knn.predict(X_test)

# Accuracy score
knn.score(X_test, y_test)

0.27907303370786518

In [22]:
# Display classification report
print(classification_report(y_test, predict))

             precision    recall  f1-score   support

         10       0.00      0.00      0.00        10
         15       0.00      0.00      0.00        80
         20       0.16      0.05      0.07       257
         25       0.19      0.05      0.09       584
         30       0.19      0.18      0.19      1209
         35       0.31      0.36      0.33      1930
         40       0.31      0.45      0.37      2103
         45       0.25      0.10      0.14       865
         50       0.03      0.02      0.03        82

avg / total       0.26      0.28      0.26      7120




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Overall, the K Neighbors classifier predicts the star ratings of restaurants in the test sample poorly, with an accuracy of about 28%. Based on the classification report, precision, recall, and F1-score are highest for restaurants with 3.5 and 4 stars, the two most common ratings, and fall off sharply for the rarer ratings at either extreme.
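
K Neighbors is distance-based, so a feature with a wider numeric range (here "RestaurantsPriceRange2", compared with the 0/1 dummy columns) can weigh more heavily in the distance. The sketch below shows one way to standardize features and compare a few values of `n_neighbors` with cross-validation. It runs on synthetic data from `make_classification`, not the Yelp features, and the candidate values of k are arbitrary, so it says nothing about whether scaling would improve the result above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for restaurant_features / restaurant_targets
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scaling lives inside the pipeline so it is fit on training folds only
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Compare a few neighborhood sizes with 3-fold cross-validation
search = GridSearchCV(pipeline, {'knn__n_neighbors': [5, 10, 20]}, cv=3)
search.fit(X_train, y_train)

best_k = search.best_params_['knn__n_neighbors']
test_accuracy = search.score(X_test, y_test)
print(best_k, test_accuracy)
```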

### Decision Tree Classifier

In [23]:
# Split out features and targets
restaurant_features1 = rest_info.drop(['stars'], axis = 1)
restaurant_targets1 = (rest_info['stars']*10).astype('int')

# Train test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(restaurant_features1, restaurant_targets1, test_size = 0.3, 
                                                    random_state = 42, stratify = restaurant_targets1)

# Model initialization
tree = DecisionTreeClassifier(max_depth = 3)

# Train model on training set
tree.fit(X_train1, y_train1, sample_weight = X_train1['review_count'])

# Predict labels for test set
predict1 = tree.predict(X_test1)

# Accuracy score
tree.score(X_test1, y_test1)

0.31643258426966292
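
A tree with `max_depth = 3` has at most 2^3 = 8 leaves, and each leaf predicts a single class, so a shallow tree cannot cover all nine star ratings. To see which splits a fitted tree actually uses, its rules can be printed with `export_text`. The sketch below uses a tiny invented dataset in place of `X_train1` and `y_train1`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented stand-in for X_train1 / y_train1
X = pd.DataFrame({
    'review_count': [5, 8, 150, 200, 12, 180],
    'WiFi_free':    [0, 1, 1, 0, 0, 1],
})
y = [30, 30, 40, 40, 30, 40]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Human-readable listing of the split conditions and leaf classes
rules = export_text(tree, feature_names=list(X.columns))
print(rules)

# Number of leaves bounds how many distinct classes the tree can output
n_leaves = tree.get_n_leaves()
print(n_leaves)
```

In this toy data a single split on `review_count` separates the two classes, so the tree stops at two leaves.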

In [24]:
# Display classification report
print(classification_report(y_test1, predict1))

             precision    recall  f1-score   support

         10       0.00      0.00      0.00        10
         15       0.00      0.00      0.00        80
         20       0.00      0.00      0.00       257
         25       0.00      0.00      0.00       584
         30       0.00      0.00      0.00      1209
         35       0.33      0.34      0.34      1930
         40       0.31      0.76      0.44      2103
         45       0.00      0.00      0.00       865
         50       0.00      0.00      0.00        82

avg / total       0.18      0.32      0.22      7120




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



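The Decision Tree classifier reaches a slightly higher accuracy than the K Neighbors classifier (about 32% versus 28%), but it also predicts star ratings poorly, even when samples are weighted by the review count feature. The classification report shows that it only ever predicts 3.5 or 4 stars, so every other rating has zero precision and recall, and its weighted-average F1-score (0.22) is lower than that of the K Neighbors classifier (0.26).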
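
A useful reference point for both accuracy scores is the majority-class baseline: the accuracy obtained by always predicting the most common rating. The sketch below computes it from the test-set support counts printed in the classification reports; only the variable names are new.

```python
# Test-set support per class, copied from the classification reports
support = {10: 10, 15: 80, 20: 257, 25: 584, 30: 1209,
           35: 1930, 40: 2103, 45: 865, 50: 82}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the most common rating
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))  # 40 0.2954
```

Always predicting 4 stars would be right about 29.5% of the time, so the K Neighbors accuracy (about 27.9%) falls just below this baseline and the Decision Tree accuracy (about 31.6%) sits only slightly above it.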

### Summary

Using the attributes provided in the Yelp dataset, neither a K Neighbors classifier nor a Decision Tree classifier was able to predict the star rating of a restaurant with a high degree of accuracy. While it's possible that features like parking, WiFi, and reservations affect a patron's opinion of a restaurant, the star rating likely depends mainly on factors these attributes do not capture, such as the quality of the food.

In the future, I would like to further analyze the dataset by delving into the different categories of restaurant-type businesses to evaluate whether attributes have varying impacts on the ratings of different categories of businesses. For example, I would predict that the rating of a coffee shop is more strongly affected by whether or not it has WiFi than that of a sit-down restaurant.

![title](images/Graduate_Project_Julie_Doherty.005.jpeg)