## Creating Cleaned Dataset of Restaurants from Yelp Businesses JSON

This notebook contains code for final cleaning of the Yelp Businesses Dataset for use in our ML model.


### Data Source
- Data: https://www.yelp.com/dataset/download
- Documentation: https://www.yelp.com/dataset/documentation/main


### Filters:
- open businesses
- categories not null
- zip codes not null
- US zip codes
- "Restaurant" in categories


### Cleaning and Transformation Steps:
- All unique categories listed
- Unique categories filtered for only food related categories (including alcohol) by hand
- Columns added for all food categories for each business
- Rest of preprocessing in SQL?? grouping by zip code (Ave stars, total count reviews, total count different types of each category)

In [1]:
# Dependencies
import os
import pandas as pd
import numpy as np
import re

In [2]:
# Load file
file_path = os.path.join("..", "Raw_Data", "yelp_dataset", "yelp_academic_dataset_business.json")
businesses_df = pd.read_json(file_path, lines=True)

In [3]:
businesses_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


In [4]:
businesses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB


In [5]:
# Drop ["attributes", "hours", "address", "latitude", "longitude"]
businesses_df.drop(["attributes", "hours", "address", "latitude", "longitude"], axis=1, inplace=True)

In [6]:
businesses_df.columns

Index(['business_id', 'name', 'city', 'state', 'postal_code', 'stars',
       'review_count', 'is_open', 'categories'],
      dtype='object')

In [7]:
# Filter for open businesses
open_businesses_df = businesses_df[businesses_df['is_open']==1]
len(open_businesses_df)

119698

In [8]:
# Delete rows with null values in categories column
open_businesses_df = open_businesses_df[open_businesses_df['categories'].notnull()]
len(open_businesses_df)

119603

In [9]:
# Ryan's code for getting rid of Canadian zips
# non-int postal codes = Canada
def check_int(value):
    try:
        int(value)
        return np.NaN
    except ValueError:
        return value

# Keeps rows marked as "NaN" by function above
open_businesses_df = open_businesses_df[open_businesses_df.postal_code.apply(check_int).isna()]

In [10]:
len(open_businesses_df)

115234

In [11]:
# Delete "is_open" column
open_businesses_df.drop(["is_open"],axis=1, inplace=True)

In [12]:
# Reset index to that we can loop through df
open_businesses_df = open_businesses_df.reset_index()

In [13]:
open_businesses_df.head()

Unnamed: 0,index,business_id,name,city,state,postal_code,stars,review_count,categories
0,1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,Affton,MO,63123,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
1,3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
2,4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,Green Lane,PA,18054,4.5,13,"Brewpubs, Breweries, Food"
3,5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea..."
4,6,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,Brentwood,MO,63144,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."


In [14]:
# Drop old index column
open_businesses_df.drop(['index'], axis=1, inplace=True)

In [15]:
open_businesses_df.head()

Unnamed: 0,business_id,name,city,state,postal_code,stars,review_count,categories
0,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,Affton,MO,63123,3.0,15,"Shipping Centers, Local Services, Notaries, Ma..."
1,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
2,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,Green Lane,PA,18054,4.5,13,"Brewpubs, Breweries, Food"
3,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea..."
4,n_0UpQx1hsNbnPUSlodU8w,Famous Footwear,Brentwood,MO,63144,2.5,13,"Sporting Goods, Fashion, Shoe Stores, Shopping..."



### Filter for Restaurants


In [16]:
open_businesses_df.columns

Index(['business_id', 'name', 'city', 'state', 'postal_code', 'stars',
       'review_count', 'categories'],
      dtype='object')

In [17]:
# Make new empty df for restaurants
restaurants_df = pd.DataFrame(columns=['business_id', 'name', 'city', 'state', 'postal_code', 'stars',
       'review_count', 'categories'])

In [18]:
# Loop throuph open_businesses_df to find restaurants and append to restaurants_df
for i in range(len(open_businesses_df)):
    if "Restaurants" in open_businesses_df['categories'][i]:
        restaurants_df = restaurants_df.append(open_businesses_df.loc[i], ignore_index=True)

In [19]:
len(restaurants_df)

33250

In [20]:
# Change postal_code to integer
restaurants_df['postal_code']=restaurants_df['postal_code'].astype(int)

In [21]:
restaurants_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33250 entries, 0 to 33249
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   33250 non-null  object 
 1   name          33250 non-null  object 
 2   city          33250 non-null  object 
 3   state         33250 non-null  object 
 4   postal_code   33250 non-null  int64  
 5   stars         33250 non-null  float64
 6   review_count  33250 non-null  object 
 7   categories    33250 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 2.0+ MB


### Parse out different restaurant categories

In [22]:
# Explore what the categories look like
restaurants_df['categories'].sample(20)

17964    Barbeque, Breweries, Bars, Restaurants, Nightl...
24868                                   Restaurants, Pizza
23549                        Restaurants, Tex-Mex, Mexican
18699    Seafood, Burgers, Restaurants, Chicken Wings, ...
33094    Restaurants, American (Traditional), Diners, B...
25775    Restaurants, Fast Food, Coffee & Tea, Burgers,...
9671                                  Restaurants, Italian
17603    Coffee Roasteries, Coffee & Tea, Vegan, Restau...
17269                          American (New), Restaurants
19608    American (New), Cocktail Bars, Restaurants, Ho...
16994    American (Traditional), Breakfast & Brunch, Bu...
21269    Middle Eastern, American (Traditional), Food, ...
3816                                  Mexican, Restaurants
22532    American (Traditional), Breakfast & Brunch, Re...
13289       Food, Restaurants, Mexican, Food Trucks, Tacos
11605    Event Planning & Services, Arts & Entertainmen...
2648     Restaurants, Chicken Wings, American (Traditio.

In [23]:
restaurants_df['categories'].explode().value_counts()

Restaurants, Pizza                                                                                                              637
Pizza, Restaurants                                                                                                              547
Restaurants, Chinese                                                                                                            513
Chinese, Restaurants                                                                                                            492
Restaurants, Mexican                                                                                                            443
                                                                                                                               ... 
Food, Food Delivery Services, Fast Food, Delis, Restaurants, Sandwiches                                                           1
Coffee Roasteries, Juice Bars & Smoothies, Breakfast & Brunch, Coffee & Tea,

In [24]:
raw_categories_list = restaurants_df['categories'].to_list()

In [25]:
raw_categories_list

['Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
 'Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants',
 'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
 'Vietnamese, Food, Restaurants, Food Trucks',
 'American (Traditional), Restaurants, Diners, Breakfast & Brunch',
 'Sushi Bars, Restaurants, Japanese',
 'Korean, Restaurants',
 'Steakhouses, Asian Fusion, Restaurants',
 'Restaurants, Italian',
 'Pizza, Chicken Wings, Sandwiches, Restaurants',
 'Pizza, Restaurants',
 'Eatertainment, Arts & Entertainment, Brewpubs, American (Traditional), Bakeries, Breweries, Food, Restaurants',
 'Restaurants, Specialty Food, Steakhouses, Food, Italian, Pizza, Pasta Shops',
 'Restaurants, Italian',
 'Sports Bars, American (New), American (Traditional), Nightlife, Bars, Restaurants',
 'American (Traditional), Bars, Nightlife, Sports Bars, Restaurants',
 'American (Traditional), Sports Bars, Restaurants, Bars, Nightlife, Steakhouses, Salad, Beer Bar',

In [26]:
# Find all unique categories and count occurances
unique_categories = {}

for i in range(len(restaurants_df)):
    types_list = restaurants_df["categories"][i].replace(" ","").split(",")
    for category in types_list:
        if category not in unique_categories:
            unique_categories[category]=1
        elif category in unique_categories:
            unique_categories[category]+=1

In [27]:
unique_categories

{'Restaurants': 33250,
 'Food': 10366,
 'BubbleTea': 197,
 'Coffee&Tea': 2814,
 'Bakeries': 1271,
 'Burgers': 4090,
 'FastFood': 5241,
 'Sandwiches': 5894,
 'IceCream&FrozenYogurt': 772,
 'Vietnamese': 410,
 'FoodTrucks': 742,
 'American(Traditional)': 5393,
 'Diners': 932,
 'Breakfast&Brunch': 4245,
 'SushiBars': 1071,
 'Japanese': 1086,
 'Korean': 258,
 'Steakhouses': 964,
 'AsianFusion': 897,
 'Italian': 2827,
 'Pizza': 4847,
 'ChickenWings': 2269,
 'Eatertainment': 16,
 'Arts&Entertainment': 705,
 'Brewpubs': 94,
 'Breweries': 272,
 'SpecialtyFood': 1290,
 'PastaShops': 152,
 'SportsBars': 1212,
 'American(New)': 3566,
 'Nightlife': 5543,
 'Bars': 5343,
 'Salad': 2213,
 'BeerBar': 485,
 'Lounges': 371,
 'Wraps': 236,
 'Automotive': 267,
 'Delis': 1690,
 'GasStations': 246,
 'ConvenienceStores': 480,
 'Pubs': 839,
 'EventPlanning&Services': 2219,
 'WineBars': 609,
 'Gastropubs': 308,
 'Venues&EventSpaces': 581,
 'JuiceBars&Smoothies': 704,
 'Fruits&Veggies': 85,
 'SportingGoods': 19

In [28]:
len(unique_categories)

680

In [29]:
# Cast all categories and counts into a dataframe
all_unique_categories_df = pd.DataFrame.from_dict(unique_categories, orient='index')
all_unique_categories_df.head()

Unnamed: 0,0
Restaurants,33250
Food,10366
BubbleTea,197
Coffee&Tea,2814
Bakeries,1271


In [30]:
len(all_unique_categories_df)

680

In [31]:
all_unique_categories_df.rename(columns={0:"Count"}, inplace=True)
all_unique_categories_df.head()

Unnamed: 0,Count
Restaurants,33250
Food,10366
BubbleTea,197
Coffee&Tea,2814
Bakeries,1271


In [32]:
# Sort
all_unique_categories_df.sort_values(by=["Count"], ascending=False, inplace=True)
all_unique_categories_df.head(20)

Unnamed: 0,Count
Restaurants,33250
Food,10366
Sandwiches,5894
Nightlife,5543
American(Traditional),5393
Bars,5343
FastFood,5241
Pizza,4847
Breakfast&Brunch,4245
Burgers,4090


In [33]:
# Create list of all unique categories
unique_categories_list = unique_categories.keys()
unique_categories_list

dict_keys(['Restaurants', 'Food', 'BubbleTea', 'Coffee&Tea', 'Bakeries', 'Burgers', 'FastFood', 'Sandwiches', 'IceCream&FrozenYogurt', 'Vietnamese', 'FoodTrucks', 'American(Traditional)', 'Diners', 'Breakfast&Brunch', 'SushiBars', 'Japanese', 'Korean', 'Steakhouses', 'AsianFusion', 'Italian', 'Pizza', 'ChickenWings', 'Eatertainment', 'Arts&Entertainment', 'Brewpubs', 'Breweries', 'SpecialtyFood', 'PastaShops', 'SportsBars', 'American(New)', 'Nightlife', 'Bars', 'Salad', 'BeerBar', 'Lounges', 'Wraps', 'Automotive', 'Delis', 'GasStations', 'ConvenienceStores', 'Pubs', 'EventPlanning&Services', 'WineBars', 'Gastropubs', 'Venues&EventSpaces', 'JuiceBars&Smoothies', 'Fruits&Veggies', 'SportingGoods', 'SportsWear', 'Fashion', 'Shopping', 'Seafood', 'Cajun/Creole', 'Mexican', 'French', 'Moroccan', 'Mediterranean', 'Chinese', 'Live/RawFood', 'Beer', 'Wine&Spirits', 'Barbeque', 'PerformingArts', 'Hotels&Travel', 'Beauty&Spas', 'Museums', 'Hotels', 'Cinema', 'Resorts', 'DaySpas', 'ChickenShop', 

In [34]:
# Create restaurant categories list by manually selecting food and alcohol related categories. 
# "Restaurant", "Food", "Groceries" not included

restaurant_categories_list = ['BubbleTea', 'Coffee&Tea', 'Bakeries', 'Burgers', 'FastFood', 'Sandwiches',\
                              'IceCream&FrozenYogurt', 'Vietnamese', 'FoodTrucks', 'American(Traditional)',\
                              'Diners', 'Breakfast&Brunch', 'SushiBars', 'Japanese', 'Korean', 'Steakhouses',\
                              'AsianFusion', 'Italian', 'Pizza', 'ChickenWings', 'Brewpubs', 'Breweries',\
                              'SportsBars', 'American(New)', 'Bars', 'Salad', 'BeerBar', 'Lounges', 'Wraps',\
                              'Delis', 'Pubs', 'WineBars', 'Gastropubs', 'JuiceBars&Smoothies',\
                              'Seafood', 'Cajun/Creole', 'Mexican', 'French', 'Moroccan', 'Mediterranean',\
                              'Chinese', 'Live/RawFood', 'Beer', 'Wine&Spirits', 'Barbeque',\
                              'Thai', 'Bagels', 'Southern', 'Irish', 'Vegan', 'CocktailBars', 'Tapas/SmallPlates',\
                              'IrishPub', 'CoffeeRoasteries', 'Cupcakes', 'Caribbean', 'Trinidadian', 'Cafes',\
                              'ComfortFood', 'Donuts', 'AcaiBowls', 'Vegetarian', 'Pakistani', 'Indian',\
                              'Soup', 'Halal', 'StreetVendors', 'Greek', 'FoodStands', 'HotDogs', 'Gluten-Free',\
                              'Empanadas', 'Desserts', 'WhiskeyBars', 'LatinAmerican', 'Honduran', 'Noodles',\
                              'Spanish', 'Cheesesteaks', 'African', 'Kebab', 'Turkish','MiddleEastern', 'Lebanese',\
                              'Creperies', 'Gelato', 'Poke', 'Falafel', 'Pretzels', 'Wineries', 'LocalFlavor',\
                              'Tex-Mex', 'DiveBars', 'Peruvian', 'Tacos', 'BeerGardens', 'SoulFood', 'Ramen',\
                              'Malaysian', 'Burmese', 'Hawaiian', 'EthnicFood','Do-It-YourselfFood', 'Sicilian',\
                              'Filipino', 'ThemedCafes','Fish&Chips', 'Sardinian', 'Laotian', 'Teppanyaki', 'Szechuan',\
                              'ShavedIce','Persian/Iranian', 'HongKongStyleCafe', 'Taiwanese', 'PanAsian', 'NewMexicanCuisine',\
                              'Oriental', 'Dominican', 'InternetCafes','Cuban', 'PuertoRican','Portuguese', 'DimSum',\
                              'TapasBars','Cantonese', 'Arabic', 'CandyStores', 'Buffets', 'Brasseries', 'Distilleries',\
                              'Ethiopian',  'Salvadoran', 'Karaoke', 'Mongolian', 'British', 'German', 'Syrian',\
                              'Armenian', 'Waffles','ModernEuropean', 'Colombian', 'Haitian', 'Czech', 'Pop-UpRestaurants',\
                              'TikiBars', 'Polish', 'Hainan', 'TeaRooms', 'Russian', 'Cafeteria', 'Afghan',\
                              'Somali', 'Argentine', 'Brazilian', 'PianoBars', 'Senegalese', 'Tuscan', 'Smokehouse',\
                              'Cambodian', 'Patisserie/CakeShop', 'Venezuelan', 'Shanghainese', 'Indonesian', 'GayBars',\
                              'Kombucha', 'Calabrian', 'Australian', 'Iberian', 'JapaneseCurry', 'Izakaya', 'Nicaraguan',\
                              'HotPot', 'Kosher', 'Pancakes','Egyptian', 'SriLankan', 'Uzbek', 'Scandinavian', 'Himalayan/Nepalese',\
                              'ChampagneBars', 'Delicatessen', 'Israeli','ShavedSnow', 'Macarons','Georgian', 'Belgian',\
                              'Fuzhou', 'Basque', 'Ukrainian','Fondue', 'Singaporean', 'SouthAfrican','Bangladeshi',\
                              'Hungarian', 'Bistros', 'Scottish','Guamanian','Tonkatsu', 'Donburi', 'Pita', 'Austrian',\
                              'EasternEuropean', 'Cucinacampana', 'ConveyorBeltSushi','Poutineries','Coffeeshops','SerboCroatian']

In [35]:
len(restaurant_categories_list)

208

In [36]:
restaurants_df.head()

Unnamed: 0,business_id,name,city,state,postal_code,stars,review_count,categories
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
1,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea..."
2,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,..."
3,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks"
4,il_Ro8jwPlHresjw9EGmBg,Denny's,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B..."


In [60]:
# Make new columns for each categories list and star rating for category
# Make a new dataframe
expanded_restaurants_df = restaurants_df.copy()
expanded_restaurants_df[restaurant_categories_list]="0"

In [61]:
expanded_restaurants_df.head()

Unnamed: 0,business_id,name,city,state,postal_code,stars,review_count,categories,BubbleTea,Coffee&Tea,...,Tonkatsu,Donburi,Pita,Austrian,EasternEuropean,Cucinacampana,ConveyorBeltSushi,Poutineries,Coffeeshops,SerboCroatian
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",0,0,...,0,0,0,0,0,0,0,0,0,0
1,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea...",0,0,...,0,0,0,0,0,0,0,0,0,0
2,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,...",0,0,...,0,0,0,0,0,0,0,0,0,0
3,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks",0,0,...,0,0,0,0,0,0,0,0,0,0
4,il_Ro8jwPlHresjw9EGmBg,Denny's,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B...",0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
expanded_restaurants_df.columns

Index(['business_id', 'name', 'city', 'state', 'postal_code', 'stars',
       'review_count', 'categories', 'BubbleTea', 'Coffee&Tea',
       ...
       'Tonkatsu', 'Donburi', 'Pita', 'Austrian', 'EasternEuropean',
       'Cucinacampana', 'ConveyorBeltSushi', 'Poutineries', 'Coffeeshops',
       'SerboCroatian'],
      dtype='object', length=216)

In [63]:
# Add 1 to columns if a restaurant category is in each business's categories
for i in range(len(expanded_restaurants_df)):
    types_list = expanded_restaurants_df["categories"][i].replace(" ","").split(",")
    for category in restaurant_categories_list:
        if category in types_list:
            expanded_restaurants_df[category][i]=1
#         else:
#             restaurants_df[category][i]=0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  expanded_restaurants_df[category][i]=1


In [64]:
expanded_restaurants_df.head()

Unnamed: 0,business_id,name,city,state,postal_code,stars,review_count,categories,BubbleTea,Coffee&Tea,...,Tonkatsu,Donburi,Pita,Austrian,EasternEuropean,Cucinacampana,ConveyorBeltSushi,Poutineries,Coffeeshops,SerboCroatian
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",1,1,...,0,0,0,0,0,0,0,0,0,0
1,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea...",0,0,...,0,0,0,0,0,0,0,0,0,0
2,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,...",0,0,...,0,0,0,0,0,0,0,0,0,0
3,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks",0,0,...,0,0,0,0,0,0,0,0,0,0
4,il_Ro8jwPlHresjw9EGmBg,Denny's,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B...",0,0,...,0,0,0,0,0,0,0,0,0,0


## Export Data

In [49]:
# Export to csv
expanded_restaurants_df.to_csv("../Processed_Data/expanded_restaurants.csv", index=False)

In [46]:
# Export to local SQL

# Dependencies
import psycopg2
import sqlalchemy as sqla
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Export
from config import pSQL
# DB password in config.py file

db_path = f'postgresql://postgres:{pSQL}@127.0.0.1:5432/Yelp' # last item is name of db in the server group
# Create database engine
engine = create_engine(db_path)


# Change database address when loading to shared AWS database

In [47]:
# Load
expanded_restaurants_df.to_sql(name='Expanded_Restaurants',con=engine)
# Load confirmed in Postgres

In [48]:
# Close connection
connection = engine.connect()
connection.close()

## Add Columns for stars for each restaurant category

In [65]:
stars_expanded_restaurants_df = expanded_restaurants_df.copy()

In [94]:
# Add an additional column for each restaurant category
for category in restaurant_categories_list:
    stars_expanded_restaurants_df[f'{category}_stars']=0

In [95]:
len(stars_expanded_restaurants_df.columns)

424

In [96]:
stars_expanded_restaurants_df.columns

Index(['business_id', 'name', 'city', 'state', 'postal_code', 'stars',
       'review_count', 'categories', 'BubbleTea', 'Coffee&Tea',
       ...
       'Tonkatsu_stars', 'Donburi_stars', 'Pita_stars', 'Austrian_stars',
       'EasternEuropean_stars', 'Cucinacampana_stars',
       'ConveyorBeltSushi_stars', 'Poutineries_stars', 'Coffeeshops_stars',
       'SerboCroatian_stars'],
      dtype='object', length=424)

In [97]:
# Fill in stars for each restaurant type column if "1" in restaurant types
for i in range(len(stars_expanded_restaurants_df)):
    types_list = stars_expanded_restaurants_df["categories"][i].replace(" ","").split(",")
    for category in restaurant_categories_list:
        if category in types_list:
            stars_expanded_restaurants_df[f'{category}_stars'][i]=stars_expanded_restaurants_df['stars'][i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stars_expanded_restaurants_df[f'{category}_stars'][i]=stars_expanded_restaurants_df['stars'][i]


In [98]:
stars_expanded_restaurants_df.head(10)

Unnamed: 0,business_id,name,city,state,postal_code,stars,review_count,categories,BubbleTea,Coffee&Tea,...,Tonkatsu_stars,Donburi_stars,Pita_stars,Austrian_stars,EasternEuropean_stars,Cucinacampana_stars,ConveyorBeltSushi_stars,Poutineries_stars,Coffeeshops_stars,SerboCroatian_stars
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,PA,19107,4.0,80,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",1,1,...,0,0,0,0,0,0,0,0,0,0
1,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,Ashland City,TN,37015,2.0,6,"Burgers, Fast Food, Sandwiches, Food, Ice Crea...",0,0,...,0,0,0,0,0,0,0,0,0,0
2,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,Nashville,TN,37207,1.5,10,"Ice Cream & Frozen Yogurt, Fast Food, Burgers,...",0,0,...,0,0,0,0,0,0,0,0,0,0
3,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,Tampa Bay,FL,33602,4.0,10,"Vietnamese, Food, Restaurants, Food Trucks",0,0,...,0,0,0,0,0,0,0,0,0,0
4,il_Ro8jwPlHresjw9EGmBg,Denny's,Indianapolis,IN,46227,2.5,28,"American (Traditional), Restaurants, Diners, B...",0,0,...,0,0,0,0,0,0,0,0,0,0
5,MUTTqe8uqyMdBl186RmNeA,Tuna Bar,Philadelphia,PA,19106,4.0,245,"Sushi Bars, Restaurants, Japanese",0,0,...,0,0,0,0,0,0,0,0,0,0
6,ROeacJQwBeh05Rqg7F6TCg,BAP,Philadelphia,PA,19147,4.5,205,"Korean, Restaurants",0,0,...,0,0,0,0,0,0,0,0,0,0
7,kfNv-JZpuN6TVNSO6hHdkw,Hibachi Express,Indianapolis,IN,46250,4.0,20,"Steakhouses, Asian Fusion, Restaurants",0,0,...,0,0,0,0,0,0,0,0,0,0
8,9OG5YkX1g2GReZM0AskizA,Romano's Macaroni Grill,Reno,NV,89502,2.5,339,"Restaurants, Italian",0,0,...,0,0,0,0,0,0,0,0,0,0
9,sqSqqLy0sN8n2IZrAbzidQ,Domino's Pizza,White House,TN,37188,3.5,8,"Pizza, Chicken Wings, Sandwiches, Restaurants",0,0,...,0,0,0,0,0,0,0,0,0,0


In [99]:
stars_expanded_restaurants_df.BubbleTea_stars[0]

4

In [100]:
stars_expanded_restaurants_df.BubbleTea_stars[1]

0

## Export Data (table with stars)

In [134]:
# Export to csv
stars_expanded_restaurants_df.to_csv("../Processed_Data/stars_expanded_restaurants.csv", index=False)

In [104]:
# Export to local SQL

# Dependencies
import psycopg2
import sqlalchemy as sqla
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Export
from config import pSQL
# DB password in config.py file

db_path = f'postgresql://postgres:{pSQL}@127.0.0.1:5432/Yelp' # last item is name of db in the server group
# Create database engine
engine = create_engine(db_path)


# Change database address when loading to shared AWS database

In [102]:
# Load
stars_expanded_restaurants_df.to_sql(name='Stars_Expanded_Restaurants',con=engine)
# Load confirmed in Postgres

In [103]:
# Close connection
# connection = engine.connect()
# connection.close()

## Additional preprocessed in SQL:
- Restaurants grouped by zip code
- Total restaurants per zip code counted AND Average stars per zip calculated (new table created)
- Sum of each type/category of restaurant for each zip calculated (new table created)
- Sum of stars for each type of restaurant calculated per zip (new table created)

### Return to notebook/pandas for calculating average star ratings by restaurant category

In [105]:
# Read back tables from local SQL database:
# zip_restaurants = sum of restaurants by type
# zip_stars = sum of stars by restaurant type

connection = engine.connect()

zip_restaurants_df = pd.read_sql("select * from \"zip_restaurants\"", connection)

zip_stars_df = pd.read_sql("select * from \"zip_stars\"", connection)

In [106]:
zip_restaurants_df.head()

Unnamed: 0,postal_code,BubbleTea,Coffee&Tea,Bakeries,Burgers,FastFood,Sandwiches,IceCream&FrozenYogurt,Vietnamese,FoodTrucks,...,Tonkatsu,Donburi,Pita,Austrian,EasternEuropean,Cucinacampana,ConveyorBeltSushi,Poutineries,Coffeeshops,SerboCroatian
0,8001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8002,1.0,9.0,3.0,14.0,18.0,16.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,8003,1.0,4.0,5.0,1.0,2.0,11.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,8004,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [107]:
zip_stars_df.head()

Unnamed: 0,postal_code,BubbleTea_stars,Coffee&Tea_stars,Bakeries_stars,Burgers_stars,FastFood_stars,Sandwiches_stars,IceCream&FrozenYogurt_stars,Vietnamese_stars,FoodTrucks_stars,...,Tonkatsu_stars,Donburi_stars,Pita_stars,Austrian_stars,EasternEuropean_stars,Cucinacampana_stars,ConveyorBeltSushi_stars,Poutineries_stars,Coffeeshops_stars,SerboCroatian_stars
0,8001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8002,4.0,24.0,9.0,30.0,43.0,53.0,0.0,8.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,8003,4.0,13.0,15.0,1.0,4.0,37.0,0.0,12.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,8004,0.0,1.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [112]:
# Reset indices to postal code for both dataframes
zip_restaurants_df.set_index(["postal_code"], inplace=True)

In [113]:
zip_restaurants_df.head()

Unnamed: 0_level_0,BubbleTea,Coffee&Tea,Bakeries,Burgers,FastFood,Sandwiches,IceCream&FrozenYogurt,Vietnamese,FoodTrucks,American(Traditional),...,Tonkatsu,Donburi,Pita,Austrian,EasternEuropean,Cucinacampana,ConveyorBeltSushi,Poutineries,Coffeeshops,SerboCroatian
postal_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8002,1.0,9.0,3.0,14.0,18.0,16.0,0.0,2.0,0.0,20.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8003,1.0,4.0,5.0,1.0,2.0,11.0,0.0,3.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8004,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
zip_stars_df.set_index(["postal_code"], inplace=True)

In [115]:
zip_stars_df.head()

Unnamed: 0_level_0,BubbleTea_stars,Coffee&Tea_stars,Bakeries_stars,Burgers_stars,FastFood_stars,Sandwiches_stars,IceCream&FrozenYogurt_stars,Vietnamese_stars,FoodTrucks_stars,American(Traditional)_stars,...,Tonkatsu_stars,Donburi_stars,Pita_stars,Austrian_stars,EasternEuropean_stars,Cucinacampana_stars,ConveyorBeltSushi_stars,Poutineries_stars,Coffeeshops_stars,SerboCroatian_stars
postal_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8002,4.0,24.0,9.0,30.0,43.0,53.0,0.0,8.0,0.0,57.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8003,4.0,13.0,15.0,1.0,4.0,37.0,0.0,12.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8004,0.0,1.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [125]:
# Divide zip_stars dataframe by zip_restaurants dataframe
zip_avg_stars_df = zip_stars_df/zip_restaurants_df.values[:,:]

In [126]:
zip_avg_stars_df

Unnamed: 0_level_0,BubbleTea_stars,Coffee&Tea_stars,Bakeries_stars,Burgers_stars,FastFood_stars,Sandwiches_stars,IceCream&FrozenYogurt_stars,Vietnamese_stars,FoodTrucks_stars,American(Traditional)_stars,...,Tonkatsu_stars,Donburi_stars,Pita_stars,Austrian_stars,EasternEuropean_stars,Cucinacampana_stars,ConveyorBeltSushi_stars,Poutineries_stars,Coffeeshops_stars,SerboCroatian_stars
postal_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8001,,,,,,,,,,3.00,...,,,,,,,,,,
8002,4.0,2.666667,3.0,2.142857,2.388889,3.312500,,4.0,,2.85,...,,,,,,,,,,
8003,4.0,3.250000,3.0,1.000000,2.000000,3.363636,,4.0,,3.00,...,,,,,,,,,,
8004,,1.000000,,,,3.000000,,,,3.00,...,,,,,,,,,,
8005,,,,,,,,,,3.00,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93118,,,,,,3.000000,,,,3.00,...,,,,,,,,,,
93190,,,3.0,3.000000,,3.000000,,,,,...,,,,,,,,,,
93642,,,3.0,,,3.000000,,,,,...,,,,,,,,,,
95661,,,,,2.000000,2.000000,,,,,...,,,,,,,,,,


In [127]:
# Load zip_summary table from local database
zip_summary_df = pd.read_sql("select * from \"zip_summary\"", connection)

In [128]:
# Reset index for zip_summary
zip_summary_df.set_index(["postal_code"], inplace=True)

In [129]:
zip_summary_df

Unnamed: 0_level_0,total_restaurants,avg_stars,total_reviews
postal_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8001,2,3.750000,18.0
8002,96,3.276042,9827.0
8003,43,3.651163,3314.0
8004,8,3.125000,160.0
8005,1,3.500000,7.0
...,...,...,...
93118,2,3.500000,21.0
93190,3,3.500000,758.0
93642,1,3.500000,45.0
95661,1,2.500000,11.0


In [131]:
# Combine the 3 dataframes
from functools import reduce

frames = [zip_summary_df, zip_restaurants_df, zip_avg_stars_df]

zip_rest_cat_stars_df = reduce(lambda left,right: pd.merge(left,right,on='postal_code'), frames)

In [132]:
zip_rest_cat_stars_df.head()

Unnamed: 0_level_0,total_restaurants,avg_stars,total_reviews,BubbleTea,Coffee&Tea,Bakeries,Burgers,FastFood,Sandwiches,IceCream&FrozenYogurt,...,Tonkatsu_stars,Donburi_stars,Pita_stars,Austrian_stars,EasternEuropean_stars,Cucinacampana_stars,ConveyorBeltSushi_stars,Poutineries_stars,Coffeeshops_stars,SerboCroatian_stars
postal_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8001,2,3.75,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
8002,96,3.276042,9827.0,1.0,9.0,3.0,14.0,18.0,16.0,0.0,...,,,,,,,,,,
8003,43,3.651163,3314.0,1.0,4.0,5.0,1.0,2.0,11.0,0.0,...,,,,,,,,,,
8004,8,3.125,160.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,...,,,,,,,,,,
8005,1,3.5,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,


In [133]:
# Export to csv
zip_rest_cat_stars_df.to_csv("../Processed_Data/restaurant_categories_stars_by_zip.csv")

In [135]:
# Close connection
connection.close()

In [None]:
# Export to AWS database!