# Star Prediction

The dataset yelp_business_official_test_empty.csv contains 8 new businesses that are not
found in the dataset provided by Yelp. These 8 businesses are missing their star ratings. Using what
we’ve learned in this course, you are to “predict” the stars for these 8 new businesses.

For example, if a new business was a fast food restaurant in Austin, TX, you might “predict” the
star rating by assigning the average rating for similar restaurants in Austin, TX.

In [1]:
import pandas as pd
import numpy as np

# Introduction
The dataset I used to predict the star ratings for the new businesses is the 'yelp_academic_dataset_business.csv' file.

# 1. Load the Datasets

In [2]:
# Import the 'yelp_dataset/yelp_business_official_test_empty.csv' file and set the 'business_id' as its index.
business_test_empty = pd.read_csv('yelp_dataset/yelp_business_official_test_empty.csv', 
                                  index_col = 'business_id')
business_test_empty

business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
1,"""Zingerman's Delicatessen""",,"""422 Detroit St""",Ann Arbor,MI,48104,42.284682,-83.745071,,1754,1,Delis;Breakfast & Brunch;Sandwiches;Restaurants
2,"""A & R Auto Care""",,"""1202 N Cannon Blvd""",Kannapolis,NC,28083,35.510807,-80.608472,,1,1,Automotive;Towing
3,"""Starbucks""",,"""1135 Washington Blvd""",Ogden,UT,84404,41.245215,-111.970461,,21,1,Food;Coffee & Tea
4,"""Starbucks""",,"""5210 S Cicero Ave""",Chicago,IL,60638,41.798023,-87.743579,,2,1,Food;Coffee & Tea
5,"""Starbucks""",,"""4200 Conroy Rd""",Orlando,FL,32839,28.485466,-81.432003,,30,1,Food;Coffee & Tea
6,"""The Tin Fox""",,"""2616 Monroe St""",Madison,WI,53711,43.057715,-89.42837,,7,1,American (New);Restaurants;Coffee & Tea;Food;N...
7,"""Working Draft Beer Company""",,"""1129 E Wilson St""",Madison,WI,53703,43.083359,-89.365438,,24,1,Food;Breweries
8,"""Il Covo""",,"""585 College Street""",Toronto,ON,M6G 1B2,43.655166,-79.413312,,21,1,Restaurants;Italian
9,"""Hawaii Nails & Spa""",,"""1642 Bloor Street W""",Toronto,ON,M6P 1A7,43.655774,-79.456633,,4,1,Beauty & Spas;Day Spas;Nail Salons
10,"""Radiant Acupuncture""",,"""572 Bloor Street W""",Toronto,ON,M6G 1K1,43.665242,-79.412033,,1,1,Day Spas;Beauty & Spas;Health & Medical;Acupun...


In [3]:
# Import the 'yelp_dataset/yelp_academic_dataset_business.csv' file and name it business
business = pd.read_csv('yelp_dataset/yelp_academic_dataset_business.csv')
business.head()

Unnamed: 0,address,attributes,attributes.AcceptsInsurance,attributes.AgesAllowed,attributes.Alcohol,attributes.Ambience,attributes.BYOB,attributes.BYOBCorkage,attributes.BestNights,attributes.BikeParking,...,hours.Wednesday,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,1314 44 Avenue NE,,,,,,,,,False,...,11:0-21:0,1,51.091813,-114.031675,Minhas Micro Brewery,,T2E 6L6,24,4.0,AB
1,,,,,none,,,,,False,...,,0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV
2,1335 rue Beaubien E,,,,beer_and_wine,"{'romantic': False, 'intimate': False, 'classy...",,,,True,...,10:0-22:0,0,45.540503,-73.5993,La Bastringue,Rosemont-La Petite-Patrie,H2G 1K7,5,4.0,QC
3,211 W Monroe St,,,,,,,,,,...,,1,33.449999,-112.076979,Geico Insurance,,85003,8,1.5,AZ
4,2005 Alyth Place SE,,,,,,,,,,...,8:0-17:0,1,51.035591,-114.027366,Action Engine,,T2H 0N5,4,2.0,AB


# 2. Prediction Process
First, I predict the star rating of each new business as the average star rating of the businesses in the <b>same city</b>.

A close look at the 'state' column of the 'business_test_empty' dataframe shows 7 businesses in the United States and 3 in Canada. For a better prediction, I keep only the rows with valid US state abbreviations, plus 'ON' (Ontario, Canada), where the Toronto businesses are located.

In [4]:
# Create a list of state abbreviations of the United States, as well as 'ON'(Canada)
states = ['AL', 'AK', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 
          'LA', 'ME', 'MD', 'MA', 'MI', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 
          'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'ON']
# Transform the list into a dataframe
df_state = pd.DataFrame({'state': states})
df_state.head()

Unnamed: 0,state
0,AL
1,AK
2,AR
3,AZ
4,CA


In [5]:
# I don't want to change the original dataset, so copy the original business dataset and assign it to 'df_business'
df_business = business.copy()

In [6]:
# Use 'pd.merge' to get a new dataframe that only contains the 
# information about the business with right state abbreviations that I need.
# Assign the new dataframe to 'new_df'
new_df = pd.merge(df_business, df_state, on = 'state')
new_df.head()

Unnamed: 0,address,attributes,attributes.AcceptsInsurance,attributes.AgesAllowed,attributes.Alcohol,attributes.Ambience,attributes.BYOB,attributes.BYOBCorkage,attributes.BestNights,attributes.BikeParking,...,hours.Wednesday,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,,,,,none,,,,,False,...,,0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV
1,703 N Rancho Dr,,,,,,,,,,...,,1,36.178348,-115.176916,Citi Trends,,89106,4,4.0,NV
2,1549 N Rancho Dr,,,,,,,,,,...,10:0-18:0,1,36.188386,-115.186124,Nevada Title And Payday Loans,,89106,4,1.0,NV
3,"3940 Martin Luther King Blvd, Ste 101",,,,,,,,,True,...,11:0-18:0,0,36.192284,-115.159272,CakesbyToi,,89106,3,1.5,NV
4,,,,,,,,,,,...,,1,36.260816,-115.17113,Park Stone Pavers,,89031,20,5.0,NV
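The same row filter can also be sketched without a merge by using `Series.isin`. The frame and the shortened state list below are toy stand-ins (assumptions for illustration), not the real Yelp data. One difference worth noting: `isin` keeps the original index, whereas the inner merge returns a fresh 0..n-1 index.

```python
import pandas as pd

# Toy stand-in for the business data (values are illustrative only).
toy_business = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'state': ['NV', 'ON', 'QC', 'AB'],
    'stars': [4.5, 3.0, 4.0, 2.0],
})
wanted_states = ['NV', 'ON']  # shortened list for the example

# Keep only the rows whose 'state' is in the wanted list.
filtered = toy_business[toy_business['state'].isin(wanted_states)]

# The inner merge selects the same rows.
merged = pd.merge(toy_business, pd.DataFrame({'state': wanted_states}), on='state')

print(filtered['name'].tolist())
print(merged['name'].tolist())
```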


In [7]:
# Group 'state' and 'city' together and get the average star rating of the businesses in a city.
df_groupby_state_city = new_df.groupby(['state', 'city'])['stars'].agg(['mean']).rename(
    columns = {'mean': 'average star rating'})

df_groupby_state_city[:10]

state,city,average star rating
AL,Chandler,3.0
AL,Henderson Nevada,5.0
AR,Mesa,5.0
AR,Phoenix,5.0
AZ,Ahwahtukee,5.0
AZ,Ahwatukee,3.705882
AZ,Ahwatukee Foothills Village,5.0
AZ,Anthem,3.557692
AZ,Arrowhead,3.5
AZ,Avondale,3.496983


In [8]:
# Use pd.merge to merge 'business_test_empty' with 'df_groupby_state_city', so that the
# resulting dataframe has an 'average star rating' column. Assign the new dataframe
# to 'predict_star'.
predict_star = pd.merge(business_test_empty, df_groupby_state_city, 
                         left_on = ['state', 'city'], right_index = True, 
                         how = 'left')
predict_star

business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,average star rating
1,"""Zingerman's Delicatessen""",,"""422 Detroit St""",Ann Arbor,MI,48104,42.284682,-83.745071,,1754,1,Delis;Breakfast & Brunch;Sandwiches;Restaurants,
2,"""A & R Auto Care""",,"""1202 N Cannon Blvd""",Kannapolis,NC,28083,35.510807,-80.608472,,1,1,Automotive;Towing,3.416185
3,"""Starbucks""",,"""1135 Washington Blvd""",Ogden,UT,84404,41.245215,-111.970461,,21,1,Food;Coffee & Tea,
4,"""Starbucks""",,"""5210 S Cicero Ave""",Chicago,IL,60638,41.798023,-87.743579,,2,1,Food;Coffee & Tea,
5,"""Starbucks""",,"""4200 Conroy Rd""",Orlando,FL,32839,28.485466,-81.432003,,30,1,Food;Coffee & Tea,
6,"""The Tin Fox""",,"""2616 Monroe St""",Madison,WI,53711,43.057715,-89.42837,,7,1,American (New);Restaurants;Coffee & Tea;Food;N...,3.654173
7,"""Working Draft Beer Company""",,"""1129 E Wilson St""",Madison,WI,53703,43.083359,-89.365438,,24,1,Food;Breweries,3.654173
8,"""Il Covo""",,"""585 College Street""",Toronto,ON,M6G 1B2,43.655166,-79.413312,,21,1,Restaurants;Italian,3.490099
9,"""Hawaii Nails & Spa""",,"""1642 Bloor Street W""",Toronto,ON,M6P 1A7,43.655774,-79.456633,,4,1,Beauty & Spas;Day Spas;Nail Salons,3.490099
10,"""Radiant Acupuncture""",,"""572 Bloor Street W""",Toronto,ON,M6G 1K1,43.665242,-79.412033,,1,1,Day Spas;Beauty & Spas;Health & Medical;Acupun...,3.490099


After the first operation, 6 businesses received a predicted star rating while the other 4 did not. The reason is that the <b>cities</b> (or <b>states</b>) where these 4 businesses are located have no match in the 'new_df' dataset.

So, let's do some more exploration.
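One way to see which rows failed to match is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking each row as `both` or `left_only`. This is a minimal sketch on toy frames; the names `toy_test` and `toy_city_avg` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the test set and the per-city averages.
toy_test = pd.DataFrame({
    'name': ['X', 'Y', 'Z'],
    'state': ['WI', 'IL', 'MI'],
    'city': ['Madison', 'Chicago', 'Ann Arbor'],
})
toy_city_avg = pd.DataFrame({
    'state': ['WI', 'IL'],
    'city': ['Madison', 'Champaign'],
    'average star rating': [3.65, 3.51],
})

# A left merge with indicator=True records where each row came from.
checked = pd.merge(toy_test, toy_city_avg, on=['state', 'city'],
                   how='left', indicator=True)

# Rows marked 'left_only' found no (state, city) match.
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['name', 'state', 'city']])
```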

In [9]:
missing_rating_states = ['MI', 'UT', 'IL', 'FL'] # The 4 states where the businesses did not receive a predicted rating
states = ['AL', 'AK', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 
          'LA', 'ME', 'MD', 'MA', 'MI', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 
          'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'ON']

All four abbreviations are in the 'states' list, but only 'IL' and 'FL' actually occur in the 'new_df' dataset; 'MI' and 'UT' do not.
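Whether a state occurs in the filtered data can be checked directly against its 'state' column. A minimal sketch, assuming a toy frame `toy_new_df` in place of 'new_df':

```python
import pandas as pd

# Toy stand-in for 'new_df' (values are illustrative only).
toy_new_df = pd.DataFrame({
    'state': ['IL', 'FL', 'NV', 'IL'],
    'stars': [3.5, 4.0, 4.5, 3.0],
})
missing_rating_states = ['MI', 'UT', 'IL', 'FL']

# Collect the distinct states that occur in the data, then test each one.
present = set(toy_new_df['state'].unique())
state_found = {s: s in present for s in missing_rating_states}
print(state_found)
```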

What about cities?

In [10]:
# `in` on a Series tests the index labels, so compare against the values instead.
'Chicago' in new_df[new_df['state'] == 'IL']['city'].values

False

In [11]:
# `in` on a Series tests the index labels, so compare against the values instead.
'Orlando' in new_df[new_df['state'] == 'FL']['city'].values

False

Neither city appears in the 'new_df' dataset, so I decided to predict the star rating for the businesses in these two cities by the average rating of their corresponding states.
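A caveat about membership tests like these: Python's `in` applied to a pandas Series looks at the index labels, not the values, so a value test should go through `.values` or an element-wise comparison. A small sketch with a toy Series (city names are illustrative):

```python
import pandas as pd

# Toy Series of city names with a non-default index.
cities = pd.Series(['Champaign', 'Urbana'], index=[10, 11])

in_index_by_value = 'Champaign' in cities        # tests index labels 10 and 11
in_index_by_label = 10 in cities                 # 10 is an index label
in_values = 'Champaign' in cities.values         # tests the stored values
any_equal = bool(cities.eq('Champaign').any())   # element-wise comparison

print(in_index_by_value, in_index_by_label, in_values, any_equal)
```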

Let's start the second operation!

In [12]:
# Group new_df by 'state' and get the average star rating of the businesses in each state.
df_groupby_state_IL_FL = new_df.groupby('state').agg(
    {'stars': 'mean'}).loc[['IL', 'FL']].rename(columns = {'stars': 'average star rating'})

df_groupby_state_IL_FL

state,average star rating
IL,3.510842
FL,4.0


In [13]:
# Fill the results obtained above into 'predict_star'
predict_star.iloc[3, 12] = 3.510842
predict_star.iloc[4, 12] = 4.000000
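Instead of typing the averages in by hand, the gaps could also be filled by mapping each row's state to its state average and using the result only where the city-level value is missing. This is a sketch on toy data; `toy_predict` and `toy_state_avg` are hypothetical names, and the numbers are rounded placeholders.

```python
import pandas as pd

# Toy stand-in for 'predict_star': two rows still lack a city-level average.
toy_predict = pd.DataFrame({
    'state': ['IL', 'FL', 'WI'],
    'average star rating': [float('nan'), float('nan'), 3.654],
}, index=pd.Index([4, 5, 6], name='business_id'))

# Toy per-state averages, indexed by state abbreviation.
toy_state_avg = pd.Series({'IL': 3.51, 'FL': 4.0})

# Look up the state average for every row, then fill only the missing entries.
fallback = toy_predict['state'].map(toy_state_avg)
toy_predict['average star rating'] = toy_predict['average star rating'].fillna(fallback)
print(toy_predict)
```

Rows that already have a city-level value keep it, because `fillna` only touches missing entries.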

In [14]:
predict_star

business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,average star rating
1,"""Zingerman's Delicatessen""",,"""422 Detroit St""",Ann Arbor,MI,48104,42.284682,-83.745071,,1754,1,Delis;Breakfast & Brunch;Sandwiches;Restaurants,
2,"""A & R Auto Care""",,"""1202 N Cannon Blvd""",Kannapolis,NC,28083,35.510807,-80.608472,,1,1,Automotive;Towing,3.416185
3,"""Starbucks""",,"""1135 Washington Blvd""",Ogden,UT,84404,41.245215,-111.970461,,21,1,Food;Coffee & Tea,
4,"""Starbucks""",,"""5210 S Cicero Ave""",Chicago,IL,60638,41.798023,-87.743579,,2,1,Food;Coffee & Tea,3.510842
5,"""Starbucks""",,"""4200 Conroy Rd""",Orlando,FL,32839,28.485466,-81.432003,,30,1,Food;Coffee & Tea,4.0
6,"""The Tin Fox""",,"""2616 Monroe St""",Madison,WI,53711,43.057715,-89.42837,,7,1,American (New);Restaurants;Coffee & Tea;Food;N...,3.654173
7,"""Working Draft Beer Company""",,"""1129 E Wilson St""",Madison,WI,53703,43.083359,-89.365438,,24,1,Food;Breweries,3.654173
8,"""Il Covo""",,"""585 College Street""",Toronto,ON,M6G 1B2,43.655166,-79.413312,,21,1,Restaurants;Italian,3.490099
9,"""Hawaii Nails & Spa""",,"""1642 Bloor Street W""",Toronto,ON,M6P 1A7,43.655774,-79.456633,,4,1,Beauty & Spas;Day Spas;Nail Salons,3.490099
10,"""Radiant Acupuncture""",,"""572 Bloor Street W""",Toronto,ON,M6G 1K1,43.665242,-79.412033,,1,1,Day Spas;Beauty & Spas;Health & Medical;Acupun...,3.490099


Great! Now only 2 businesses are left without a predicted star rating. Since 'new_df' contains no information about their states, I need another way to make the prediction.

The method I decided to use is to first classify the business categories in the 'new_df' dataset, then group the dataset by the classified categories and calculate the average star rating for each group. Finally, I use these averages to predict the ratings of the remaining 2 businesses.

Now, let's start the third operation!

In [15]:
# Before digging into the 'categories', remove all the null values, and select only the columns that I need.
new_df_sub = new_df[new_df['categories'].notnull()][['categories', 'stars']]
new_df_sub.head()

Unnamed: 0,categories,stars
0,"Chicken Wings, Burgers, Caterers, Street Vendo...",4.5
1,"Shopping, Fashion, Department Stores",4.0
2,"Financial Services, Check Cashing/Pay-day Loan...",1.0
3,"American (Traditional), Food, Bakeries, Restau...",1.5
4,"Home Services, Masonry/Concrete, Professional ...",5.0


To classify the categories, I need the keywords from the categories of the 2 remaining businesses. These keywords are used to pick out the businesses in 'new_df' that are similar to each of them.

In [16]:
# Get the key words of the "Zingerman's Delicatessen" category
predict_star[predict_star['state'] == 'MI']['categories']

business_id
1    Delis;Breakfast & Brunch;Sandwiches;Restaurants
Name: categories, dtype: object

In [17]:
# Get the key words of the "Starbucks" category
predict_star[predict_star['state'] == 'UT']['categories']

business_id
3    Food;Coffee & Tea
Name: categories, dtype: object

Now, create a function to classify the categories.

In [18]:
# This function classifies each 'categories' string in 'new_df_sub' into one of 3 groups.
def classify(s):
    s = s.lower()
    category = ''
    if 'delis' in s or 'breakfast' in s or 'brunch' in s or 'sandwiche' in s or 'restaurant' in s:
        category = 'Zingerman_like'
    elif 'food' in s or 'coffee' in s or 'tea' in s:
        category = 'Starbucks_like'
    else:
        category = 'Else'
    return category
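Substring tests are loose: 'tea', for example, also occurs inside words such as 'team' or 'steam'. A stricter variant, shown here only as a sketch and not as the method used in this notebook, splits the string into whole category labels first. The function name `classify_tokens` and the exact label sets are assumptions, and applying it would generally give different group averages from the ones reported below.

```python
import re

def classify_tokens(s):
    """Classify a category string by whole labels instead of substrings."""
    # Split on ';' or ',' because the two datasets use different separators.
    tokens = {t.strip().lower() for t in re.split(r'[;,]', s)}
    zingerman_labels = {'delis', 'breakfast & brunch', 'sandwiches', 'restaurants'}
    starbucks_labels = {'food', 'coffee & tea'}
    if tokens & zingerman_labels:
        return 'Zingerman_like'
    if tokens & starbucks_labels:
        return 'Starbucks_like'
    return 'Else'

print(classify_tokens('Food;Coffee & Tea'))
print(classify_tokens('Delis, Sandwiches'))
print(classify_tokens('Team Sports'))  # hypothetical label containing 'tea'
```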

In [19]:
# Apply the function to 'categories' of 'new_df_sub' and store the result in a new column called 'classified'.
# After this step, the 'categories' of 'new_df_sub' are classified into 3 types.
new_df_sub['classified'] = new_df_sub.categories.apply(classify)
new_df_sub.head()

Unnamed: 0,categories,stars,classified
0,"Chicken Wings, Burgers, Caterers, Street Vendo...",4.5,Zingerman_like
1,"Shopping, Fashion, Department Stores",4.0,Else
2,"Financial Services, Check Cashing/Pay-day Loan...",1.0,Else
3,"American (Traditional), Food, Bakeries, Restau...",1.5,Zingerman_like
4,"Home Services, Masonry/Concrete, Professional ...",5.0,Else


In [20]:
# Group new_df_sub by 'classified' and calculate the average star rating of each group.
new_df_sub.groupby('classified').agg({'stars': 'mean'}).rename(
    columns = {'stars': 'average star rating'})

classified,average star rating
Else,3.735955
Starbucks_like,3.666807
Zingerman_like,3.420675


In [21]:
# Fill the results obtained above into 'predict_star'
predict_star.iloc[0, 12] = 3.420675
predict_star.iloc[2, 12] = 3.666807

In [22]:
# Round the predicted star ratings to 3 decimals.
predict_star['average star rating'] = np.round(predict_star['average star rating'], 3)

In [23]:
predict_star

business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,average star rating
1,"""Zingerman's Delicatessen""",,"""422 Detroit St""",Ann Arbor,MI,48104,42.284682,-83.745071,,1754,1,Delis;Breakfast & Brunch;Sandwiches;Restaurants,3.421
2,"""A & R Auto Care""",,"""1202 N Cannon Blvd""",Kannapolis,NC,28083,35.510807,-80.608472,,1,1,Automotive;Towing,3.416
3,"""Starbucks""",,"""1135 Washington Blvd""",Ogden,UT,84404,41.245215,-111.970461,,21,1,Food;Coffee & Tea,3.667
4,"""Starbucks""",,"""5210 S Cicero Ave""",Chicago,IL,60638,41.798023,-87.743579,,2,1,Food;Coffee & Tea,3.511
5,"""Starbucks""",,"""4200 Conroy Rd""",Orlando,FL,32839,28.485466,-81.432003,,30,1,Food;Coffee & Tea,4.0
6,"""The Tin Fox""",,"""2616 Monroe St""",Madison,WI,53711,43.057715,-89.42837,,7,1,American (New);Restaurants;Coffee & Tea;Food;N...,3.654
7,"""Working Draft Beer Company""",,"""1129 E Wilson St""",Madison,WI,53703,43.083359,-89.365438,,24,1,Food;Breweries,3.654
8,"""Il Covo""",,"""585 College Street""",Toronto,ON,M6G 1B2,43.655166,-79.413312,,21,1,Restaurants;Italian,3.49
9,"""Hawaii Nails & Spa""",,"""1642 Bloor Street W""",Toronto,ON,M6P 1A7,43.655774,-79.456633,,4,1,Beauty & Spas;Day Spas;Nail Salons,3.49
10,"""Radiant Acupuncture""",,"""572 Bloor Street W""",Toronto,ON,M6G 1K1,43.665242,-79.412033,,1,1,Day Spas;Beauty & Spas;Health & Medical;Acupun...,3.49


The star predictions for all the new businesses are now complete.