# Yelp API Pulled Data Initial Cleaning and Merge (ETL)

This notebook cleans the data pulled from the Yelp Fusion API and merges several pickle files into one dataframe for Yelp data transformation and feature engineering. 

#### Extract:
The restaurant data for the 6014 zip codes (from the merged zillow-redfin housing data) was extracted through an API pull from Yelp Fusion. 
**Code for API pull:** Yelp_Fusion_API.ipynb

**Data Source for Yelp API Clean and Merge:** pickle files from API pull

#### Transform: 

Each pickle file was converted into a dataframe, the dataframes were concatenated, and the result was cleaned in the following steps:

* Drop duplicate values
* Extract columns stored as lists or dictionaries
    - Transactions column is unraveled and one-hot encoded with str.get_dummies, saved as:
        * 'delivery'
        * 'pickup'
        * 'restaurant_reservation'
    - Location column is extracted to get state, city and zip code
    - Latitude and longitude are extracted from the coordinates column
    - Price
        * The dollar signs are given numerical values: 1, 2, 3, 4
        * Null values in the price column are filled with 0
        * The price values are one-hot encoded with pd.get_dummies and saved as:
            - price_value_1.0
            - price_value_2.0
            - price_value_3.0
            - price_value_4.0
* ZIP CODE column : 
    - rename to postal_code to match zillow-redfin dataset 
    - strip white spaces
    - change datatype to float

* Merge with housing data
    - Read in housing data
    - Merge yelp and housing data
    - Check "state" for unusual values.
    - Drop unneeded rows of "state" that do not belong in United States
    - Drop empty rows of "state"
    - reorder the dataframe
    
#### Load

* Save the merged, cleaned and transformed yelp_housing dataframe as csv

The final extracted, transformed dataset has the following data fields:
* 'postal_code': ZIP codes in the United States
* 'City': cities in the United States
* 'State': states in the United States
* 'CountyName': county names in the United States
* '2021': median house price for 2021 per ZIP code
* 'latitude'
* 'longitude'
* 'review_count': count of Yelp reviews
* 'rating': Yelp star rating
* 'categories': restaurant categories (needs cleaning)
* 'price': values converted to 1, 2, 3, 4 per number of dollar signs (0 if missing)
* 'delivery': transaction type (1 if offered, otherwise 0)
* 'pickup': transaction type (1 if offered, otherwise 0)
* 'restaurant_reservation': transaction type (1 if offered, otherwise 0)
* 'price_value_1.0': 1 if the Yelp price is one dollar sign, otherwise 0
* 'price_value_2.0': 1 if the Yelp price is two dollar signs, otherwise 0
* 'price_value_3.0': 1 if the Yelp price is three dollar signs, otherwise 0
* 'price_value_4.0': 1 if the Yelp price is four dollar signs, otherwise 0


# Extract


In [1]:
import requests
import pandas as pd
import pickle
pd.set_option('display.max_rows', None)

In [2]:
# list of all pulled Yelp Fusion Data except one - "data_pull_one.pickle"
list_data = ["data_pull_two.pickle", "data_pull_three.pickle", "data_pull_four.pickle", 
             "data_pull_five.pickle", "data_pull_six.pickle","data_bronx.pickle","data_brooklyn.pickle",
            "data_manhattan.pickle", "data_queens.pickle", "data_staten_island.pickle"]

In [3]:
#Open the first pulled Yelp Fusion Data - "data_pull_one.pickle" as dataframe
with open ('data_pull_one.pickle','rb') as f:
    df = pickle.load(f)

print(len(df))
df.shape

143427


(143427, 17)

In [4]:
#Read each pulled Yelp Fusion file in list_data as a dataframe and append it to a dataframe list

# create empty list
dataframes_list = []

# append datasets into the list
for path in list_data:
    temp_df = pd.read_pickle(path)
    dataframes_list.append(temp_df)

In [5]:
#concatenate all the dataframes
# (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
df = pd.concat([df] + dataframes_list)
print(df.shape)
df.head()

(650610, 17)


Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance,neighborhood
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}","[pickup, delivery]",$$,"{'address1': '1300 W Wellington Ave', 'address...",17732812599,(773) 281-2599,729.039967,60657
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",[delivery],$$,"{'address1': '3207 N Sheffield Ave', 'address2...",13128196111,(312) 819-6111,43.936284,60657
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",[delivery],,"{'address1': '3037 N Clark St', 'address2': No...",17736616389,(773) 661-6389,540.028408,60657
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...","[pickup, delivery]",$$,"{'address1': '3335 N Halsted St', 'address2': ...",17739359663,(773) 935-9663,513.692605,60657
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...","[pickup, delivery]",,"{'address1': '3555 N Broadway', 'address2': ''...",17738575110,(773) 857-5110,998.221727,60657


In [6]:
nyc_df = df.copy()

# Transform

####  Drop duplicate values:

In [7]:
# Drop duplicate values:
nyc_df = df.drop_duplicates(subset=['alias', 'id'], keep='first', inplace=False).copy()

nyc_df.reset_index(inplace=True, drop = True)
nyc_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance,neighborhood
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}","[pickup, delivery]",$$,"{'address1': '1300 W Wellington Ave', 'address...",17732812599,(773) 281-2599,729.039967,60657
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",[delivery],$$,"{'address1': '3207 N Sheffield Ave', 'address2...",13128196111,(312) 819-6111,43.936284,60657
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",[delivery],,"{'address1': '3037 N Clark St', 'address2': No...",17736616389,(773) 661-6389,540.028408,60657
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...","[pickup, delivery]",$$,"{'address1': '3335 N Halsted St', 'address2': ...",17739359663,(773) 935-9663,513.692605,60657
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...","[pickup, delivery]",,"{'address1': '3555 N Broadway', 'address2': ''...",17738575110,(773) 857-5110,998.221727,60657


In [8]:
nyc_df.shape

(251693, 17)
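
The concatenate-then-deduplicate pattern above can be sketched on toy data. The two small frames and their ids and aliases are invented for illustration; only the column names `id` and `alias` come from the notebook. `pd.concat` is used here because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Two toy "pulls" that share one business (hypothetical ids and aliases)
pull_a = pd.DataFrame({"id": ["a1", "b2"],
                       "alias": ["cafe-one", "bar-two"],
                       "rating": [4.0, 3.5]})
pull_b = pd.DataFrame({"id": ["b2", "c3"],
                       "alias": ["bar-two", "deli-three"],
                       "rating": [3.5, 5.0]})

# Stack the pulls into one frame with a fresh 0..n-1 index
combined = pd.concat([pull_a, pull_b], ignore_index=True)

# Keep the first occurrence of each (alias, id) pair
deduped = (
    combined.drop_duplicates(subset=["alias", "id"], keep="first")
    .reset_index(drop=True)
)
print(len(combined), len(deduped))
```

Passing `ignore_index=True` is a design choice of this sketch: it avoids repeated index labels across pulls, which the notebook handles instead with `reset_index` after dropping duplicates.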

### Extract columns stored as a dictionary

#### Transactions

In [9]:
# One-hot encode the transactions lists (e.g. [pickup, delivery])

transactions_dummy = nyc_df['transactions'].str.join(sep=',').str.get_dummies(sep=',')

# Combine new columns with original dataframe:
nyc_df = pd.concat([nyc_df, transactions_dummy], axis=1)
nyc_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance,neighborhood,delivery,pickup,restaurant_reservation
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}","[pickup, delivery]",$$,"{'address1': '1300 W Wellington Ave', 'address...",17732812599,(773) 281-2599,729.039967,60657,1,1,0
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",[delivery],$$,"{'address1': '3207 N Sheffield Ave', 'address2...",13128196111,(312) 819-6111,43.936284,60657,1,0,0
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",[delivery],,"{'address1': '3037 N Clark St', 'address2': No...",17736616389,(773) 661-6389,540.028408,60657,1,0,0
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...","[pickup, delivery]",$$,"{'address1': '3335 N Halsted St', 'address2': ...",17739359663,(773) 935-9663,513.692605,60657,1,1,0
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...","[pickup, delivery]",,"{'address1': '3555 N Broadway', 'address2': ''...",17738575110,(773) 857-5110,998.221727,60657,1,1,0
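
A minimal sketch of how the join-then-`get_dummies` step behaves, using an invented four-row frame. Each list is first joined into a comma-separated string, and `str.get_dummies` then creates one 0/1 column per distinct token; a business with an empty list gets 0 in every column.

```python
import pandas as pd

# Toy data: the lists mimic the shape of the Yelp "transactions" field
toy = pd.DataFrame({
    "transactions": [
        ["pickup", "delivery"],
        ["delivery"],
        [],
        ["restaurant_reservation"],
    ]
})

# ["pickup", "delivery"] -> "pickup,delivery"; [] -> ""
joined = toy["transactions"].str.join(",")

# One indicator column per token, columns in sorted order
dummies = joined.str.get_dummies(sep=",")
print(dummies)
```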


#### Location

In [10]:
# Extract dictionary values for location
nyc_df['city'] = nyc_df['location'].apply(lambda x: x.get('city'))
nyc_df['zip_code'] = nyc_df['location'].apply(lambda x: x.get('zip_code'))
nyc_df['state'] = nyc_df['location'].apply(lambda x: x.get('state'))
nyc_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,...,phone,display_phone,distance,neighborhood,delivery,pickup,restaurant_reservation,city,zip_code,state
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}",...,17732812599,(773) 281-2599,729.039967,60657,1,1,0,Chicago,60657,IL
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",...,13128196111,(312) 819-6111,43.936284,60657,1,0,0,Chicago,60657,IL
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",...,17736616389,(773) 661-6389,540.028408,60657,1,0,0,Chicago,60657,IL
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...",...,17739359663,(773) 935-9663,513.692605,60657,1,1,0,Chicago,60657,IL
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...",...,17738575110,(773) 857-5110,998.221727,60657,1,1,0,Chicago,60657,IL


#### Latitude and Longitude

In [11]:
# Extract dictionary values for latitude and longitude
nyc_df['latitude'] = nyc_df['coordinates'].apply(lambda x: x.get('latitude'))
nyc_df['longitude'] = nyc_df['coordinates'].apply(lambda x: x.get('longitude'))
nyc_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,...,distance,neighborhood,delivery,pickup,restaurant_reservation,city,zip_code,state,latitude,longitude
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}",...,729.039967,60657,1,1,0,Chicago,60657,IL,41.93643,-87.66141
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",...,43.936284,60657,1,0,0,Chicago,60657,IL,41.94015,-87.65386
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",...,540.028408,60657,1,0,0,Chicago,60657,IL,41.93742,-87.6483
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...",...,513.692605,60657,1,1,0,Chicago,60657,IL,41.942829,-87.649185
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...",...,998.221727,60657,1,1,0,Chicago,60657,IL,41.947133,-87.646892
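
The `lambda x: x.get(...)` calls above assume every cell holds a dictionary. As a hedged variant, the sketch below uses a hypothetical helper, `get_field`, that returns `None` when a cell is missing instead of raising an `AttributeError`. The toy rows are invented; only the key names come from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": [
        {"city": "Chicago", "zip_code": "60657", "state": "IL"},
        None,  # a row where the API returned no location
    ],
    "coordinates": [
        {"latitude": 41.93643, "longitude": -87.66141},
        {"latitude": None, "longitude": None},
    ],
})

def get_field(value, key):
    """Return value[key] when value is a dict, otherwise None (hypothetical helper)."""
    return value.get(key) if isinstance(value, dict) else None

# Series.apply forwards extra keyword arguments to the function
for key in ("city", "zip_code", "state"):
    toy[key] = toy["location"].apply(get_field, key=key)
for key in ("latitude", "longitude"):
    toy[key] = toy["coordinates"].apply(get_field, key=key)

print(toy[["city", "zip_code", "state", "latitude", "longitude"]])
```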


#### Price

In [12]:
#replace NaN with 0
nyc_df["price"] = nyc_df["price"].fillna(0)
nyc_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,...,distance,neighborhood,delivery,pickup,restaurant_reservation,city,zip_code,state,latitude,longitude
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}",...,729.039967,60657,1,1,0,Chicago,60657,IL,41.93643,-87.66141
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",...,43.936284,60657,1,0,0,Chicago,60657,IL,41.94015,-87.65386
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",...,540.028408,60657,1,0,0,Chicago,60657,IL,41.93742,-87.6483
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...",...,513.692605,60657,1,1,0,Chicago,60657,IL,41.942829,-87.649185
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...",...,998.221727,60657,1,1,0,Chicago,60657,IL,41.947133,-87.646892


In [13]:
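# Update price to be numerical values:
# map() returns NaN for anything that is not a key, including the 0 fill above,
# so price is filled with 0 again after the dummy columns are created
price = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}
nyc_df['price_value'] = nyc_df['price'].map(price)
nyc_df['price'] = nyc_df['price'].map(price)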

In [14]:
nyc_df = pd.get_dummies(nyc_df, columns=["price_value"])
nyc_df.head()

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,...,restaurant_reservation,city,zip_code,state,latitude,longitude,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0
0,sJOPkTGLi53eB46ykvDeRg,farm-bar-lakeview-chicago,Farm Bar Lakeview,https://s3-media1.fl.yelpcdn.com/bphoto/5THhWC...,False,https://www.yelp.com/biz/farm-bar-lakeview-chi...,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.93643, 'longitude': -87.66141}",...,0,Chicago,60657,IL,41.93643,-87.66141,0,1,0,0
1,iyilEJb1NwUeZcd5JWXTKw,figo-wine-bar-chicago,Figo Wine Bar,https://s3-media1.fl.yelpcdn.com/bphoto/8VnQ7f...,False,https://www.yelp.com/biz/figo-wine-bar-chicago...,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,"{'latitude': 41.94015, 'longitude': -87.65386}",...,0,Chicago,60657,IL,41.94015,-87.65386,0,1,0,0
2,Mud-L2vMAtmvWpjo3tDvaA,zazas-pizzeria-chicago,Zazas Pizzeria,https://s3-media4.fl.yelpcdn.com/bphoto/TRI3fR...,False,https://www.yelp.com/biz/zazas-pizzeria-chicag...,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,"{'latitude': 41.93742, 'longitude': -87.6483}",...,0,Chicago,60657,IL,41.93742,-87.6483,0,0,0,0
3,tCBKjclvCuiuSkiNz1fOjw,wood-chicago,Wood,https://s3-media2.fl.yelpcdn.com/bphoto/TVSzvQ...,False,https://www.yelp.com/biz/wood-chicago?adjust_c...,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,"{'latitude': 41.942829280388, 'longitude': -87...",...,0,Chicago,60657,IL,41.942829,-87.649185,0,1,0,0
4,GLF1TKJ9Qyr67prQIRfquA,la-biznaga-2-chicago,La Biznaga 2,https://s3-media3.fl.yelpcdn.com/bphoto/VuULup...,False,https://www.yelp.com/biz/la-biznaga-2-chicago?...,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,"{'latitude': 41.94713257782274, 'longitude': -...",...,0,Chicago,60657,IL,41.947133,-87.646892,0,0,0,0


In [15]:
nyc_df["price"] = nyc_df["price"].fillna(0)
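
The price steps can be condensed into one pass, sketched here on invented data. Mapping the dollar signs yields a float column with NaN for missing prices, which is why the dummy columns carry names such as `price_value_1.0`, and why a missing price produces 0 in all four of them. The `dtype=int` argument is an addition of this sketch so the indicators are 0/1 integers regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({"price": ["$$", None, "$", "$$$$", "$$$"]})

price_levels = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}

# Missing prices map to NaN, so both columns become float
toy["price_value"] = toy["price"].map(price_levels)
toy["price"] = toy["price"].map(price_levels).fillna(0)

# NaN rows get 0 in every indicator column (dummy_na defaults to False)
toy = pd.get_dummies(toy, columns=["price_value"], dtype=int)
print(toy)
```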

#### Remove unneeded columns

In [16]:
nyc_df.columns

Index(['id', 'alias', 'name', 'image_url', 'is_closed', 'url', 'review_count',
       'categories', 'rating', 'coordinates', 'transactions', 'price',
       'location', 'phone', 'display_phone', 'distance', 'neighborhood',
       'delivery', 'pickup', 'restaurant_reservation', 'city', 'zip_code',
       'state', 'latitude', 'longitude', 'price_value_1.0', 'price_value_2.0',
       'price_value_3.0', 'price_value_4.0'],
      dtype='object')

In [17]:
# Remove columns that we will not be working with:
nyc_df = nyc_df.drop(columns=["id", "alias", "image_url", "url", "coordinates",
                              "phone", "display_phone", "distance", "is_closed", "name", "transactions",
                              "neighborhood"])


In [18]:
nyc_df.head()

Unnamed: 0,review_count,categories,rating,price,location,delivery,pickup,restaurant_reservation,city,zip_code,state,latitude,longitude,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0
0,354,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,2.0,"{'address1': '1300 W Wellington Ave', 'address...",1,1,0,Chicago,60657,IL,41.93643,-87.66141,0,1,0,0
1,212,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",4.5,2.0,"{'address1': '3207 N Sheffield Ave', 'address2...",1,0,0,Chicago,60657,IL,41.94015,-87.65386,0,1,0,0
2,60,"[{'alias': 'pizza', 'title': 'Pizza'}]",5.0,0.0,"{'address1': '3037 N Clark St', 'address2': No...",1,0,0,Chicago,60657,IL,41.93742,-87.6483,0,0,0,0
3,593,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",4.0,2.0,"{'address1': '3335 N Halsted St', 'address2': ...",1,1,0,Chicago,60657,IL,41.942829,-87.649185,0,1,0,0
4,35,"[{'alias': 'mexican', 'title': 'Mexican'}]",5.0,0.0,"{'address1': '3555 N Broadway', 'address2': ''...",1,1,0,Chicago,60657,IL,41.947133,-87.646892,0,0,0,0


#### Reorder dataframe

In [19]:
#reorder dataframe and store as a new dataframe
yelp_merged_df = nyc_df[['zip_code','city', 'state', 'location', 'latitude',
       'longitude', 'review_count', 'rating', 'categories', 'price', 'delivery', 'pickup',
       'restaurant_reservation', 'price_value_1.0', 'price_value_2.0', 'price_value_3.0',
       'price_value_4.0']]
yelp_merged_df.head()

Unnamed: 0,zip_code,city,state,location,latitude,longitude,review_count,rating,categories,price,delivery,pickup,restaurant_reservation,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0
0,60657,Chicago,IL,"{'address1': '1300 W Wellington Ave', 'address...",41.93643,-87.66141,354,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,1,1,0,0,1,0,0
1,60657,Chicago,IL,"{'address1': '3207 N Sheffield Ave', 'address2...",41.94015,-87.65386,212,4.5,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",2.0,1,0,0,0,1,0,0
2,60657,Chicago,IL,"{'address1': '3037 N Clark St', 'address2': No...",41.93742,-87.6483,60,5.0,"[{'alias': 'pizza', 'title': 'Pizza'}]",0.0,1,0,0,0,0,0,0
3,60657,Chicago,IL,"{'address1': '3335 N Halsted St', 'address2': ...",41.942829,-87.649185,593,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,1,1,0,0,1,0,0
4,60657,Chicago,IL,"{'address1': '3555 N Broadway', 'address2': ''...",41.947133,-87.646892,35,5.0,"[{'alias': 'mexican', 'title': 'Mexican'}]",0.0,1,1,0,0,0,0,0


In [20]:
yelp_merged_df.shape

(251693, 17)

#### ZIP CODE column : 
* rename to postal_code
* strip white spaces
* change datatype to float

In [21]:
#Rename zip_code to postal_code to match the housing dataset
yelp_merged_df = yelp_merged_df.rename(columns={"zip_code": "postal_code"})
yelp_merged_df.head()

Unnamed: 0,postal_code,city,state,location,latitude,longitude,review_count,rating,categories,price,delivery,pickup,restaurant_reservation,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0
0,60657,Chicago,IL,"{'address1': '1300 W Wellington Ave', 'address...",41.93643,-87.66141,354,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,1,1,0,0,1,0,0
1,60657,Chicago,IL,"{'address1': '3207 N Sheffield Ave', 'address2...",41.94015,-87.65386,212,4.5,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",2.0,1,0,0,0,1,0,0
2,60657,Chicago,IL,"{'address1': '3037 N Clark St', 'address2': No...",41.93742,-87.6483,60,5.0,"[{'alias': 'pizza', 'title': 'Pizza'}]",0.0,1,0,0,0,0,0,0
3,60657,Chicago,IL,"{'address1': '3335 N Halsted St', 'address2': ...",41.942829,-87.649185,593,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,1,1,0,0,1,0,0
4,60657,Chicago,IL,"{'address1': '3555 N Broadway', 'address2': ''...",41.947133,-87.646892,35,5.0,"[{'alias': 'mexican', 'title': 'Mexican'}]",0.0,1,1,0,0,0,0,0


In [22]:
yelp_merged_df["postal_code"].dtypes

dtype('O')

In [23]:
#strip white spaces
yelp_merged_df["postal_code"] = yelp_merged_df["postal_code"].str.strip()

In [24]:
#change datatype of postal_code to float; errors="coerce" turns non-numeric codes into NaN
yelp_merged_df["postal_code"] = pd.to_numeric(yelp_merged_df["postal_code"], errors="coerce")

In [25]:
yelp_merged_df["postal_code"].nunique()

15521
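
A small sketch of what the strip-and-convert step does to different inputs, on invented values. One consequence worth noting: as floats, a US ZIP code with a leading zero ("01069") and a foreign four-digit code ("1069") become the same key. The second half shows a hypothetical alternative, not used in this notebook, that keeps codes as text and accepts only five-digit strings.

```python
import pandas as pd

raw = pd.Series([" 60657", "01069 ", "1069", "SW1A 1AA", ""])

# The notebook's approach: strip whitespace, then coerce to numbers
as_float = pd.to_numeric(raw.str.strip(), errors="coerce")
# "01069" and "1069" both become 1069.0; non-numeric and empty codes become NaN

# Hypothetical alternative: keep five-digit strings, blank out everything else
stripped = raw.str.strip()
as_text = stripped.where(stripped.str.fullmatch(r"\d{5}"))

print(pd.DataFrame({"raw": raw, "as_float": as_float, "as_text": as_text}))
```

Keeping the codes as strings would require the housing data to store `postal_code` as zero-padded text as well, which is an assumption about that file.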

#### Save yelp data as csv before merge for easy retrieval 

In [26]:
#Save as csv file
yelp_merged_df.to_csv("yelp_api_final.csv")

# Merge housing data and yelp data

In [27]:
#Read in housing data
housing_df = pd.read_csv("Redfin/zillow_redfin_merged.csv")

In [28]:
housing_df.head()

Unnamed: 0.1,Unnamed: 0,postal_code,State,City,CountyName,2021
0,0,10025,NY,New York,New York County,1111806.0
1,1,60657,IL,Chicago,Cook County,507204.0
2,2,10023,NY,New York,New York County,1446973.0
3,3,77494,TX,Katy,Harris County,400394.0
4,4,60614,IL,Chicago,Cook County,646795.0


In [29]:
# drop column "Unnamed: 0"
housing_df = housing_df.drop(columns=["Unnamed: 0"])

#### Merge housing and yelp data

In [51]:
# Inner join on postal_code: keeps only Yelp rows whose postal code appears in the housing data
yelp_housing_df = pd.merge(yelp_merged_df, housing_df, on='postal_code')
print(yelp_housing_df.shape)
yelp_housing_df.head()

(243467, 21)


Unnamed: 0,postal_code,city,state,location,latitude,longitude,review_count,rating,categories,price,...,pickup,restaurant_reservation,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0,State,City,CountyName,2021
0,60657.0,Chicago,IL,"{'address1': '1300 W Wellington Ave', 'address...",41.93643,-87.66141,354,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,...,1,0,0,1,0,0,IL,Chicago,Cook County,507204.0
1,60657.0,Chicago,IL,"{'address1': '3207 N Sheffield Ave', 'address2...",41.94015,-87.65386,212,4.5,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",2.0,...,0,0,0,1,0,0,IL,Chicago,Cook County,507204.0
2,60657.0,Chicago,IL,"{'address1': '3037 N Clark St', 'address2': No...",41.93742,-87.6483,60,5.0,"[{'alias': 'pizza', 'title': 'Pizza'}]",0.0,...,0,0,0,0,0,0,IL,Chicago,Cook County,507204.0
3,60657.0,Chicago,IL,"{'address1': '3335 N Halsted St', 'address2': ...",41.942829,-87.649185,593,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,...,1,0,0,1,0,0,IL,Chicago,Cook County,507204.0
4,60657.0,Chicago,IL,"{'address1': '3555 N Broadway', 'address2': ''...",41.947133,-87.646892,35,5.0,"[{'alias': 'mexican', 'title': 'Mexican'}]",0.0,...,1,0,0,0,0,0,IL,Chicago,Cook County,507204.0
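
The merge is an inner join, so Yelp rows without a matching housing postal code are dropped (251,693 rows before, 243,467 after). The sketch below, on invented frames, shows two optional checks that are not part of the notebook: `validate="many_to_one"` raises an error if a postal code appears more than once in the housing data, and a left join with `indicator=True` lists the codes that found no match.

```python
import pandas as pd

yelp = pd.DataFrame({"postal_code": [60657.0, 60657.0, 99999.0],
                     "rating": [4.0, 4.5, 3.0]})
housing = pd.DataFrame({"postal_code": [60657.0, 10025.0],
                        "2021": [507204.0, 1111806.0]})

# Inner join, asserting that each postal code is unique on the housing side
merged = pd.merge(yelp, housing, on="postal_code",
                  how="inner", validate="many_to_one")

# Diagnostic: which Yelp postal codes have no housing record?
check = pd.merge(yelp, housing, on="postal_code", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "postal_code"].unique()

print(len(merged), unmatched)
```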


In [52]:
# Check "state" for unusual values.
yelp_housing_df.state.value_counts()

CA     23056
NY     22706
TX     22636
FL     11960
OH     11938
PA     10982
IL     10315
MI      7581
NC      7162
GA      7107
IN      7056
VA      6342
TN      6290
MO      5936
WA      5297
AZ      4955
CO      4678
AL      4462
WI      4366
KY      4314
MN      4196
SC      4172
LA      3606
OK      3538
OR      3487
MD      3393
IA      3342
AR      3130
KS      3035
NV      2477
NE      2314
MS      2300
WV      1874
UT      1531
NM      1257
DE      1138
DC       976
VIC      875
QLD      829
ND       677
NJ       676
HI       674
NSW      651
ID       620
MT       616
SD       613
AK       604
WY       511
SA       435
CT       234
ZH       144
TAS       61
84        51
5         26
CHH       25
BY        24
MBH       24
HH        24
MA        23
BCN       17
LU        14
85        12
LAG       11
VWV       10
TI        10
SZ        10
WGN       10
11        10
UR         8
           6
CAV        4
ACT        3
NW         3
AUK        2
DIF        1
ON         1
HE         1

#### Drop unneeded rows
As the counts above show, the Yelp API returned businesses whose "state" is not a US state (for example VIC, QLD and NSW, or numeric codes such as 84). We drop those rows from the dataset.

In [53]:
# State codes to drop. Note: "GA" is also the US postal code for Georgia, so the
# Georgia rows are removed too; "SA" is not in the list and stays in the data.
lists = ["11", "5", "48", "84", "85", "ACT", "AUK", "BCN", "BW", "BY",
         "BZ", "CAV", "CHH", "DIF", "GA", "GE", "HE", "HH", "LAG", "LU",
         "MBH", "NSN", "NSW", "NW", "OAX", "ON", "PKN", "QLD", "SG", "SH",
         "SZ", "TAS", "TI", "V", "VIC", "VWV", "WGN", "ZH", "US", "UR"]
yelp_housing_df = yelp_housing_df.drop(yelp_housing_df[yelp_housing_df.state.isin(lists)].index)

In [54]:
yelp_housing_df.shape

(233488, 21)
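
A hand-written drop list can remove a valid code by accident or miss an unexpected one. As a hedged alternative, not what this notebook does, the sketch below keeps only rows whose state is in an allow-list of the 50 US state codes plus DC. The name `US_STATE_CODES` and the toy frame are assumptions of this sketch.

```python
import pandas as pd

# 50 US states plus the District of Columbia (USPS two-letter codes)
US_STATE_CODES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
    "DC",
}

toy = pd.DataFrame({"state": ["IL", "GA", "VIC", "84", "", "NSW", "DC"]})

# Keep US rows only; foreign, numeric and empty codes fall out together
us_only = toy[toy["state"].isin(US_STATE_CODES)].reset_index(drop=True)
print(us_only["state"].tolist())
```

An allow-list cannot tell a US state from a foreign region that happens to share its two-letter code, so a check against the housing "State" column may still be useful.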

In [55]:
yelp_housing_df.state.value_counts()

CA    23056
NY    22706
TX    22636
FL    11960
OH    11938
PA    10982
IL    10315
MI     7581
NC     7162
IN     7056
VA     6342
TN     6290
MO     5936
WA     5297
AZ     4955
CO     4678
AL     4462
WI     4366
KY     4314
MN     4196
SC     4172
LA     3606
OK     3538
OR     3487
MD     3393
IA     3342
AR     3130
KS     3035
NV     2477
NE     2314
MS     2300
WV     1874
UT     1531
NM     1257
DE     1138
DC      976
ND      677
NJ      676
HI      674
ID      620
MT      616
SD      613
AK      604
WY      511
SA      435
CT      234
MA       23
          6
NH        1
Name: state, dtype: int64

#### Check the 6 rows with an empty "state" value

In [56]:
nan_in_col  = yelp_housing_df[yelp_housing_df['state']==""]
nan_in_col

Unnamed: 0,postal_code,city,state,location,latitude,longitude,review_count,rating,categories,price,...,pickup,restaurant_reservation,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0,State,City,CountyName,2021
183487,1069.0,København,,"{'address1': 'Bremerholm 6', 'address2': '', '...",55.67891,12.58272,3,5.0,"[{'alias': 'restaurants', 'title': 'Restaurant...",0.0,...,0,0,0,0,0,0,MA,Palmer,Hampden County,254289.0
183500,1050.0,København,,"{'address1': 'Kongens Nytorv 13', 'address2': ...",55.679145,12.583886,4,3.5,"[{'alias': 'cafes', 'title': 'Cafes'}]",0.0,...,0,0,0,0,0,0,MA,Huntington,Hampshire County,272608.0
183505,1101.0,København,,"{'address1': 'Ny Østergade 14', 'address2': ''...",55.68165,12.582228,3,3.5,"[{'alias': 'tapasmallplates', 'title': 'Tapas/...",0.0,...,0,0,0,0,0,0,MA,Springfield,Hampden County,187680.0
183508,1101.0,København,,"{'address1': 'Østergade 52', 'address2': '', '...",55.67926,12.58069,3,2.5,"[{'alias': 'seafood', 'title': 'Seafood'}]",2.0,...,0,0,0,1,0,0,MA,Springfield,Hampden County,187680.0
183518,1118.0,København,,"{'address1': 'Sværtegade 2', 'address2': '', '...",55.681114,12.580298,2,4.0,"[{'alias': 'salad', 'title': 'Salad'}, {'alias...",0.0,...,0,0,0,0,0,0,MA,Springfield,Hampden County,242045.0
222986,4760.0,Vordingborg,,"{'address1': 'Slotsruin 1', 'address2': '', 'a...",55.00852,11.91248,1,5.0,"[{'alias': 'localflavor', 'title': 'Local Flav...",2.0,...,0,0,0,1,0,0,ME,Monticello,Aroostook County,106628.0


In [45]:
#drop the 6 rows with an empty "state" (Danish businesses whose postal codes matched US ZIP codes in the merge)
yelp_housing_df = yelp_housing_df[yelp_housing_df["state"]!=""]

In [46]:
yelp_housing_df.shape

(233482, 21)

In [58]:
yelp_housing_df.state.value_counts()

CA    23056
NY    22706
TX    22636
FL    11960
OH    11938
PA    10982
IL    10315
MI     7581
NC     7162
IN     7056
VA     6342
TN     6290
MO     5936
WA     5297
AZ     4955
CO     4678
AL     4462
WI     4366
KY     4314
MN     4196
SC     4172
LA     3606
OK     3538
OR     3487
MD     3393
IA     3342
AR     3130
KS     3035
NV     2477
NE     2314
MS     2300
WV     1874
UT     1531
NM     1257
DE     1138
DC      976
ND      677
NJ      676
HI      674
ID      620
MT      616
SD      613
AK      604
WY      511
SA      435
CT      234
MA       23
          6
NH        1
Name: state, dtype: int64

In [60]:
#reorder dataframe
yelp_final_df = yelp_housing_df[['postal_code','City', 'State', "CountyName", "2021", 'latitude',
       'longitude', 'review_count', 'rating', 'categories', 'price', 'delivery', 'pickup',
       'restaurant_reservation', 'price_value_1.0', 'price_value_2.0', 'price_value_3.0',
       'price_value_4.0']]

In [61]:
yelp_final_df.head()

Unnamed: 0,postal_code,City,State,CountyName,2021,latitude,longitude,review_count,rating,categories,price,delivery,pickup,restaurant_reservation,price_value_1.0,price_value_2.0,price_value_3.0,price_value_4.0
0,60657.0,Chicago,IL,Cook County,507204.0,41.93643,-87.66141,354,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,1,1,0,0,1,0,0
1,60657.0,Chicago,IL,Cook County,507204.0,41.94015,-87.65386,212,4.5,"[{'alias': 'wine_bars', 'title': 'Wine Bars'},...",2.0,1,0,0,0,1,0,0
2,60657.0,Chicago,IL,Cook County,507204.0,41.93742,-87.6483,60,5.0,"[{'alias': 'pizza', 'title': 'Pizza'}]",0.0,1,0,0,0,0,0,0
3,60657.0,Chicago,IL,Cook County,507204.0,41.942829,-87.649185,593,4.0,"[{'alias': 'bars', 'title': 'Bars'}, {'alias':...",2.0,1,1,0,0,1,0,0
4,60657.0,Chicago,IL,Cook County,507204.0,41.947133,-87.646892,35,5.0,"[{'alias': 'mexican', 'title': 'Mexican'}]",0.0,1,1,0,0,0,0,0


In [63]:
yelp_final_df.shape

(233488, 18)

In [64]:
yelp_final_df.postal_code.nunique()

13617

# Load as csv

In [65]:
yelp_final_df.to_csv("yelp_housing_merge.csv")
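
By default `to_csv` also writes the row index, which comes back as an "Unnamed: 0" column when the file is read again, as seen in the housing data above. The sketch below, using an in-memory buffer and an invented one-row frame, compares the default with `index=False`; whether to keep the index in the saved files is a choice left to the notebook author.

```python
import io
import pandas as pd

toy = pd.DataFrame({"postal_code": [60657.0], "rating": [4.0]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```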