# Download data

For my capstone project, I downloaded the dataset from Yelp (https://www.yelp.com/dataset) in August 2019. Yelp prepared the dataset in JSON format so that students can use the data in their projects (https://www.yelp.com/dataset/challenge). The dataset is stored in 5 JSON files. Each file contains a single object type, with one JSON object per line.
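
Because each file stores one JSON object per line, it can be parsed line by line with the standard `json` module. The sketch below uses two made-up records held in memory rather than the real files; `read_json_lines` is a hypothetical helper name that does not appear in the notebook.

```python
import io
import json


def read_json_lines(handle):
    """Parse a file-like object holding one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Two made-up records standing in for a Yelp file (illustrative values only)
sample = io.StringIO(
    '{"business_id": "abc", "stars": 4.0}\n'
    '{"business_id": "def", "stars": 2.5}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["business_id"])
```

The resulting list of dictionaries is the kind of input that `pd.DataFrame` accepts in the cells further down.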

# About files

I loaded the Yelp data challenge datasets and transformed them into Pandas DataFrames:
- **Users** includes information about registered users on Yelp, such as the number of reviews per user, the number of useful reviews they posted, the type of reviews, etc.
- **Review** includes reviews submitted by customers about a business
- **Checkin** includes check-in times of customers at a given business
- **Tip** includes tips per business id
- **Business** contains attributes of businesses listed on Yelp such as categories (e.g. good for lunch), city info, hours, location and Yelp stars. These columns are consistent with Yelp.com.

# Import packages

In [2]:
import json
import warnings
warnings.filterwarnings('ignore')
from copy import deepcopy
import pickle
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Turn business.json into a Pandas DataFrame

In [3]:
file = '/Users/mai/Desktop/yelp_dataset/yelp_business.json'

In [4]:
# Decode JSON
with open(file) as f:
    data = [json.loads(line) for line in f]
df = pd.DataFrame(data)

In [4]:
# Check what the DataFrame looks like
df.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,postal_code,review_count,stars,state
0,30 Eglinton Avenue W,"{'RestaurantsReservations': 'True', 'GoodForMe...",QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga,"{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",1,43.605499,-79.652289,Emerald Chinese Restaurant,L5R 3E7,128,2.5,ON
1,"10110 Johnston Rd, Ste 15","{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...",gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte,"{'Monday': '17:30-21:30', 'Wednesday': '17:30-...",1,35.092564,-80.859132,Musashi Japanese Restaurant,28210,170,4.0,NC
2,"15655 W Roosevelt St, Ste 237",,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services",Goodyear,"{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",1,33.455613,-112.395596,Farmers Insurance - Paul Lorenz,85338,3,5.0,AZ
3,"4209 Stuart Andrew Blvd, Ste F","{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...",HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi...",Charlotte,"{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...",1,35.190012,-80.887223,Queen City Plumbing,28217,4,4.0,NC
4,"Credit Valley Town Plaza, F2 - 6045 Creditview Rd","{'BusinessParking': '{'garage': False, 'street...",68dUKd8_8liJ7in4aWOSEA,"Shipping Centers, Couriers & Delivery Services...",Mississauga,"{'Monday': '9:0-19:0', 'Tuesday': '9:0-20:0', ...",1,43.599475,-79.711584,The UPS Store,L5V 0B1,3,2.5,ON


In [5]:
# Drop unnecessary columns for this project
df.drop(columns=['address', 'hours', 'postal_code'], inplace=True)

In [6]:
# Check what the nested dictionary in the 'attributes' column looks like
# Check the first item
df['attributes'][0]

{'RestaurantsReservations': 'True',
 'GoodForMeal': "{'dessert': False, 'latenight': False, 'lunch': True, 'dinner': True, 'brunch': False, 'breakfast': False}",
 'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
 'Caters': 'True',
 'NoiseLevel': "u'loud'",
 'RestaurantsTableService': 'True',
 'RestaurantsTakeOut': 'True',
 'RestaurantsPriceRange2': '2',
 'OutdoorSeating': 'False',
 'BikeParking': 'False',
 'Ambience': "{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}",
 'HasTV': 'False',
 'WiFi': "u'no'",
 'GoodForKids': 'True',
 'Alcohol': "u'full_bar'",
 'RestaurantsAttire': "u'casual'",
 'RestaurantsGoodForGroups': 'True',
 'RestaurantsDelivery': 'False'}

In [7]:
# Check the inside of nested dictionary - GoodForMeal
df['attributes'][0]['GoodForMeal']

"{'dessert': False, 'latenight': False, 'lunch': True, 'dinner': True, 'brunch': False, 'breakfast': False}"

In [8]:
# Check the inside of nested dictionary - BusinessParking
df['attributes'][0]['BusinessParking']

"{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}"

In [9]:
# Compare the attributes of two other businesses (rows 4 and 1000)
# Each business_id contains different information in the attributes column
print('Example #1', df['attributes'][4])
print('----------')
print('Example #2', df['attributes'][1000])

Example #1 {'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}", 'RestaurantsPriceRange2': '2'}
----------
Example #2 {'RestaurantsPriceRange2': '2', 'RestaurantsAttire': "u'casual'", 'GoodForMeal': "{'dessert': False, 'latenight': True, 'lunch': True, 'dinner': True, 'brunch': False, 'breakfast': False}", 'RestaurantsGoodForGroups': 'True', 'GoodForKids': 'True', 'RestaurantsReservations': 'False', 'Alcohol': "'beer_and_wine'", 'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}", 'Caters': 'False', 'NoiseLevel': "u'loud'", 'HasTV': 'True', 'BikeParking': 'True', 'RestaurantsDelivery': 'False', 'WiFi': "u'no'", 'Ambience': "{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}", 'OutdoorSeating': 'False', 'RestaurantsTakeOut': 'True'}
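
The nested values shown above (for example `GoodForMeal` and `BusinessParking`) are strings that only look like dictionaries, so they have to be parsed before their keys can be used. The notebook does this later with `eval`. As a hedged alternative, `ast.literal_eval` from the standard library parses Python literals without executing arbitrary code. The sketch assumes the strings contain only literals such as `True`, `False`, and `None`; `parse_nested` is a hypothetical helper name.

```python
import ast

raw = "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}"


def parse_nested(value):
    """Turn a stringified literal into a Python object; pass other values through."""
    if isinstance(value, str):
        # The string 'None' parses to None, which can then be treated as missing
        return ast.literal_eval(value)
    return value


parking = parse_nested(raw)
print(type(parking).__name__, parking['lot'])
```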


# Expand the attributes and category columns by using DictVectorizer()

Since each business carries a different set of attributes in the 'attributes' column, I decided to expand the column by using DictVectorizer(). However, three keys hold dictionaries of their own and need to be expanded separately: Ambience, BusinessParking, and GoodForMeal. I set up a separate DictVectorizer() for each of these keys to avoid memory errors and overwriting information. I also supplied the data as a list of dictionaries, which is the input that .fit() and .transform() expect.
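
To make the expansion concrete, the sketch below mimics in plain Python how string-valued entries become indicator columns named `key=value`, which is the naming scheme `DictVectorizer` produces for string values. It is a simplified stand-in for illustration, not scikit-learn's implementation; `one_hot_dicts` is a hypothetical helper and the two records are made up.

```python
def one_hot_dicts(records):
    """Return (sorted feature names, rows of 0.0/1.0) for dicts with string values."""
    names = sorted({f"{key}={value}" for rec in records for key, value in rec.items()})
    rows = []
    for rec in records:
        present = {f"{key}={value}" for key, value in rec.items()}
        rows.append([1.0 if name in present else 0.0 for name in names])
    return names, rows


records = [
    {"WiFi": "u'no'", "HasTV": "False"},
    {"WiFi": "'free'"},
]
names, rows = one_hot_dicts(records)
for name in names:
    print(name)
print(rows)
```

The quoting inside each value is kept as part of the column name, so a value written with a `u` prefix yields a different column from the same value written without it.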

In [7]:
# Remove empty cells in 'attributes' column in df
my_dict = df[['attributes']].dropna()

In [8]:
# Remove empty cells in my_dict
my_dict.dropna(inplace=True)

In [9]:
# Check shape
print(my_dict.shape)

# Check how the attribute column looks like
my_dict.head()

(163772, 1)


Unnamed: 0,attributes
0,"{'RestaurantsReservations': 'True', 'GoodForMe..."
1,"{'GoodForKids': 'True', 'NoiseLevel': 'u'avera..."
3,"{'BusinessAcceptsBitcoin': 'False', 'ByAppoint..."
4,"{'BusinessParking': '{'garage': False, 'street..."
5,"{'RestaurantsPriceRange2': '2', 'BusinessParki..."


In [10]:
# Create extra columns indicating the presence or absence of a category with a value of 1 or 0
dvec_attributes = DictVectorizer()
dmat_attributes = dvec_attributes.fit_transform(my_dict['attributes'])
df_attributes = pd.DataFrame(dmat_attributes.toarray(), columns=dvec_attributes.get_feature_names(), index=my_dict['attributes'].index)
df_attributes.shape

(163772, 915)

In [14]:
# Drop unnecessary columns
# 'BusinessParking', 'GoodForMeal', 'Ambience': I will apply DictVectorizer() separately thus this dataframe doesn't need to keep the columns here
# DietaryRestrictions / BestNights: I found this does not contain insightful data
# HairSpecializesIn: I focus on restaurants thus this has to be removed
# =u: Original Business.JSON has duplicated keys in the nested dictionary. These contain =u

remove_keys = ['BusinessParking', 'GoodForMeal', 'Ambience', 'Music',
               'DietaryRestrictions', 'BestNights', 'HairSpecializesIn', '=u']
remove_col_list = []
# Collect every column whose name contains one of the keys
for key in remove_keys:
    for col in df_attributes.columns:
        if key in col:
            remove_col_list.append(col)
print(remove_col_list[1])
print(len(remove_col_list))

BusinessParking={'garage': False, 'street': False, 'lot': False, 'valet': False}
816
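
The nested loop above can list a column more than once when its name matches several keys. A compact variant, sketched here on made-up column names, keeps each matching column once and preserves column order. `cols` and `keys` are illustrative stand-ins for `df_attributes.columns` and `remove_keys`.

```python
cols = ["Music={'dj': False}", "WiFi=u'no'", "WiFi='no'", "Music=u'none'"]
keys = ["Music", "=u"]

# Keep a column if any key occurs in its name; each column appears at most once
matched = [col for col in cols if any(key in col for key in keys)]
kept = [col for col in cols if col not in matched]

print(matched)
print(kept)
```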


In [15]:
# Remove all the columns listed in remove_col_list from the DataFrame
df_attributes.drop(columns=remove_col_list, inplace=True)
df_attributes.shape

(163772, 99)

In [16]:
# Create columns per ambience feature
# Fill NaN in the empty cells
my_dict['ambience'] = my_dict['attributes'].map(lambda x: (x['Ambience']) if 'Ambience' in x else np.nan)

In [17]:
# Delete keys from 'attributes' column to avoid duplicates
selected_key = 'Ambience'
def dropkey(x, key=selected_key):
    y = deepcopy(x)
    if key in y:
        del y[key]
    return y
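
One detail worth keeping in mind: a default argument such as `key=selected_key` is evaluated once, when the function is defined, so reassigning `selected_key` afterwards does not change which key `dropkey` removes. The sketch below illustrates this with a made-up dictionary and shows the key being passed explicitly instead.

```python
from copy import deepcopy

selected_key = 'Ambience'


def dropkey(x, key=selected_key):
    """Return a copy of dict x without the given key."""
    y = deepcopy(x)
    if key in y:
        del y[key]
    return y


# Made-up attribute dictionary for illustration
attrs = {'Ambience': "{'casual': True}", 'BusinessParking': "{'lot': True}", 'HasTV': 'False'}

selected_key = 'BusinessParking'   # rebinding the name does not change the default
still_default = dropkey(attrs)     # still removes 'Ambience'
explicit = dropkey(attrs, key='BusinessParking')  # removes 'BusinessParking'

print(sorted(still_default))
print(sorted(explicit))
```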

In [18]:
# Apply a function to all the items in the list
my_dict['attributes'] = my_dict['attributes'].map(dropkey)

# Replace the string 'None' with NaN in the column
my_dict.loc[my_dict['ambience'] == 'None', 'ambience'] = np.nan

# Parse the remaining strings into dictionaries
my_dict['ambience'] = my_dict['ambience'].map(lambda x: eval(x) if isinstance(x, str) else x)

In [19]:
# Remove all NaN values before using DictVectorizer()
my_dict.dropna(inplace=True)

In [20]:
# Check the shape
my_dict.shape

(47547, 2)

In [21]:
# Apply DictVectorizer to extract features and put them into a Pandas DataFrame
dvec_ambience = DictVectorizer()
dmat_ambience = dvec_ambience.fit_transform(my_dict['ambience'])

# Create a DataFrame
df_ambience = pd.DataFrame(dmat_ambience.toarray(), columns=dvec_ambience.get_feature_names(), index=my_dict['ambience'].index)

In [22]:
# Check how the ambience features were expanded to columns
df_ambience.columns

Index(['casual', 'classy', 'divey', 'hipster', 'intimate', 'romantic',
       'touristy', 'trendy', 'upscale'],
      dtype='object')

In [23]:
# Change column names for future use
df_ambience.columns = [
                        'ambience_Casual',
                        'ambience_Classy',
                        'ambience_Divey',
                        'ambience_Hipster',
                        'ambience_Intimate',
                        'ambience_Romantic',
                        'ambience_Touristy',
                        'ambience_Trendy',
                        'ambience_Upscale',
                      ]

In [24]:
# Check what the DataFrame looks like
df_ambience.head()

Unnamed: 0,ambience_Casual,ambience_Classy,ambience_Divey,ambience_Hipster,ambience_Intimate,ambience_Romantic,ambience_Touristy,ambience_Trendy,ambience_Upscale
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [25]:
# Check the shape
df_ambience.shape

(47547, 9)

## Create columns per Business Parking features

In [26]:
# Create columns per Business Parking features
# Fill NaN in the empty cells
my_dict['businessparking'] = my_dict['attributes'].map(lambda x: (x['BusinessParking']) if 'BusinessParking' in x else np.nan)
my_dict['businessparking'].head()

0     {'garage': False, 'street': False, 'validated'...
1     {'garage': False, 'street': False, 'validated'...
10    {'garage': False, 'street': False, 'validated'...
11    {'garage': False, 'street': False, 'validated'...
16    {'garage': False, 'street': False, 'validated'...
Name: businessparking, dtype: object

In [27]:
# Follow the same steps as for the Ambience features
my_dict['attributes'] = my_dict['attributes'].map(lambda x: dropkey(x, key='BusinessParking'))
my_dict.loc[my_dict['businessparking'] == 'None', 'businessparking'] = np.nan
my_dict['businessparking'] = my_dict['businessparking'].map(lambda x: eval(x) if isinstance(x, str) else x)

In [28]:
# Apply DictVectorizer to extract features and put them into a Pandas DataFrame
# Rows without parking data are forward-filled with the previous row's dictionary
dvec_parking = DictVectorizer()
dmat_parking = dvec_parking.fit_transform(my_dict['businessparking'].ffill())
df_parking = pd.DataFrame(dmat_parking.toarray(), columns=dvec_parking.get_feature_names(), index=my_dict['businessparking'].index)

In [29]:
# Change column names for modelling and analysis
df_parking.columns = [
                        'parking_Garage',
                        'parking_Lot',
                        'parking_Street',
                        'parking_Valet',
                        'parking_Validated'
                      ]

In [30]:
df_parking.head()

Unnamed: 0,parking_Garage,parking_Lot,parking_Street,parking_Valet,parking_Validated
0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0
16,0.0,1.0,0.0,0.0,0.0


## Create columns per GoodForMeal features

In [31]:
# Fill NaN in the empty cells
my_dict['goodformeal'] = my_dict['attributes'].map(lambda x: (x['GoodForMeal']) if 'GoodForMeal' in x else np.nan)

In [32]:
# Follow the same steps as for the Ambience features
my_dict['attributes'] = my_dict['attributes'].map(lambda x: dropkey(x, key='GoodForMeal'))
my_dict.loc[my_dict['goodformeal'] == 'None', 'goodformeal'] = np.nan
my_dict['goodformeal'] = my_dict['goodformeal'].map(lambda x: eval(x) if isinstance(x, str) else x)

In [33]:
# Apply DictVectorizer to extract features and put them into a Pandas DataFrame
# Rows without meal data are forward-filled with the previous row's dictionary
dvec_meal = DictVectorizer()
dmat_meal = dvec_meal.fit_transform(my_dict['goodformeal'].ffill())
df_meal = pd.DataFrame(dmat_meal.toarray(), columns=dvec_meal.get_feature_names(), index=my_dict['goodformeal'].index)

In [34]:
# Change column names for modelling and analysis
df_meal.columns = [
                        'meal_Breakfast',
                        'meal_Brunch',
                        'meal_Dessert',
                        'meal_Dinner',
                        'meal_Latenight',
                        'meal_Lunch'
                      ]

In [35]:
df_meal.head()

Unnamed: 0,meal_Breakfast,meal_Brunch,meal_Dessert,meal_Dinner,meal_Latenight,meal_Lunch
0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,0.0,0.0,1.0,0.0,1.0
10,0.0,0.0,0.0,1.0,0.0,1.0
11,0.0,0.0,0.0,1.0,0.0,1.0
16,0.0,0.0,0.0,1.0,0.0,1.0


In [36]:
# Check the DataFrame shape
df_meal.shape

(47547, 6)

## Merge all dataframes created from business.json

In [37]:
businessdf = pd.concat([df, df_attributes, df_ambience, df_meal, df_parking], axis=1)

# Drop the original column
businessdf.drop(columns='attributes', inplace=True) 
businessdf.head(2)

Unnamed: 0,business_id,categories,city,is_open,latitude,longitude,name,review_count,stars,state,...,meal_Brunch,meal_Dessert,meal_Dinner,meal_Latenight,meal_Lunch,parking_Garage,parking_Lot,parking_Street,parking_Valet,parking_Validated
0,QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga,1,43.605499,-79.652289,Emerald Chinese Restaurant,128,2.5,ON,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte,1,35.092564,-80.859132,Musashi Japanese Restaurant,170,4.0,NC,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0


In [38]:
# Replace NaN values in the dataframe
businessdf = businessdf.fillna(0)

## Clean up the column names before I save the dataframe

The keys in the business.json file did not translate into clear column names, so I decided to rename the columns for future use.

In [39]:
# Rename columns since these include {}, =, ' in the name (e.g. NoiseLevel='average')
businessdf.columns = businessdf.columns.str.replace("{", "")
businessdf.columns = businessdf.columns.str.replace("}", "")
businessdf.columns = businessdf.columns.str.replace("=", "_")
businessdf.columns = businessdf.columns.str.replace("'", "_")

# Replace none to Unknown
businessdf.columns = businessdf.columns.str.replace('none', 'unknown')
businessdf.columns = businessdf.columns.str.replace('byob', 'bring_your_own_bottle')

# Lowercase column names
businessdf.columns = businessdf.columns.map(lambda x: x.lower())

# Remove 'restaurants' from column names
businessdf.columns = businessdf.columns.map(lambda x: x.replace('restaurants', ''))
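
To see where the double underscores in the cleaned names come from, the sketch below applies the same replacements, in the same order, to three sample names. `clean_name` is a hypothetical helper, and the sample names are assumed examples in the `key=value` form shown in the comment above.

```python
def clean_name(name):
    """Apply the replacements from the cell above to a single column name."""
    replacements = [("{", ""), ("}", ""), ("=", "_"), ("'", "_"),
                    ("none", "unknown"), ("byob", "bring_your_own_bottle")]
    for old, new in replacements:
        name = name.replace(old, new)
    # Lowercase, then strip the 'restaurants' prefix
    return name.lower().replace("restaurants", "")


for raw in ["NoiseLevel='average'", "RestaurantsAttire='casual'", "Alcohol='none'"]:
    print(raw, "->", clean_name(raw))
```

The `=` sign and both quote characters each become an underscore, which is why intermediate names such as `noiselevel__average_` appear before the final renaming step.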

In [40]:
# Rename columns with clearer names
businessdf.rename(columns={
     'is_open': 'business_status','name': 'business_name',
     'review_count': 'review_count_per_business', 'stars': 'average_stars_per_business', 'state': 'US_state',
     'alcohol__beer_and_wine_': 'alcohol_beer_and_wine', 'alcohol__full_bar_': 'alcohol_full_bar', 'alcohol__unknown_': 'alcohol_not_available',
     'bring_your_own_bottlecorkage__no_': 'bring_your_own_bottle_corkage_no',
     'bring_your_own_bottlecorkage__yes_corkage_': 'bring_your_own_bottle_corkage_yes_corkage',
     'bring_your_own_bottlecorkage__yes_free_': 'bring_your_own_bottle_corkage_yes_free',
     'bring_your_own_bottlecorkage_unknown': 'bring_your_own_bottle_corkage_unknown',
     'hastv_false': 'hasTV_false', 'hastv_unknown': 'hasTV_unknown', 'hastv_true': 'hasTV_true',
     'noiselevel__average_': 'noiselevel_average', 'noiselevel__loud_': 'noiselevel_loud',
     'noiselevel__quiet_': 'noiselevel_quiet', 'noiselevel__very_loud_': 'noiselevel_very_loud',    
     'attire__casual_': 'dresscode_casual', 'attire__dressy_': 'dresscode_dressy',
     'attire__formal_': 'dresscode_formal', 'attire_unknown': 'dresscode_unknown',
     'pricerange2_1': 'pricerange_1',
     'pricerange2_2': 'pricerange_2', 'pricerange2_3': 'pricerange_3',
     'pricerange2_4': 'pricerange_4', 'pricerange2_unknown': 'pricerange_unknown',
     'smoking__no_': 'smoking_no', 'smoking__outdoor_': 'smoking_outdoor', 'smoking__yes_': 'smoking_yes',
     'wifi__free_': 'wifi_free', 'wifi__no_': 'wifi_no',
     'wifi__paid_': 'wifi_paid'}, inplace=True)

In [41]:
# Check what the DataFrame looks like
businessdf.head()

Unnamed: 0,business_id,categories,city,business_status,latitude,longitude,business_name,review_count_per_business,average_stars_per_business,US_state,...,meal_brunch,meal_dessert,meal_dinner,meal_latenight,meal_lunch,parking_garage,parking_lot,parking_street,parking_valet,parking_validated
0,QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga,1,43.605499,-79.652289,Emerald Chinese Restaurant,128,2.5,ON,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte,1,35.092564,-80.859132,Musashi Japanese Restaurant,170,4.0,NC,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services",Goodyear,1,33.455613,-112.395596,Farmers Insurance - Paul Lorenz,3,5.0,AZ,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi...",Charlotte,1,35.190012,-80.887223,Queen City Plumbing,4,4.0,NC,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,68dUKd8_8liJ7in4aWOSEA,"Shipping Centers, Couriers & Delivery Services...",Mississauga,1,43.599475,-79.711584,The UPS Store,3,2.5,ON,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


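## Drop one column per attribute group

To help model performance by avoiding redundant columns, I decided to drop one column from each True/False attribute group.
For example, BikeParking was expanded into 'BikeParking=True', 'BikeParking=False', and 'BikeParking=None'. I removed 'BikeParking=False' because the remaining columns already imply it: when 'BikeParking=True' is 0 and 'BikeParking=None' is 0, the business falls under 'BikeParking=False'.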
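
As a small check of this reasoning, the sketch below uses made-up indicator values for a three-level attribute and recovers the dropped column from the other two. The column names are illustrative, and the relation assumes that each row has exactly one of the three levels set.

```python
# Made-up indicator rows for one attribute with the levels true / false / unknown
rows = [
    {"bikeparking_true": 1, "bikeparking_false": 0, "bikeparking_unknown": 0},
    {"bikeparking_true": 0, "bikeparking_false": 1, "bikeparking_unknown": 0},
    {"bikeparking_true": 0, "bikeparking_false": 0, "bikeparking_unknown": 1},
]

# With exactly one level set per row, the false indicator follows from the other two
recovered = [1 - r["bikeparking_true"] - r["bikeparking_unknown"] for r in rows]
original = [r["bikeparking_false"] for r in rows]

print(recovered)
print(original)
```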

In [42]:
# List the columns to drop
drop_col_list = []

for col in businessdf.columns:
    if 'false' in col:
        drop_col_list.append(col)
print(drop_col_list)

['acceptsinsurance_false', 'byob_false', 'bikeparking_false', 'businessacceptsbitcoin_false', 'businessacceptscreditcards_false', 'byappointmentonly_false', 'caters_false', 'coatcheck_false', 'corkage_false', 'dogsallowed_false', 'drivethru_false', 'goodfordancing_false', 'goodforkids_false', 'happyhour_false', 'hasTV_false', 'open24hours_false', 'outdoorseating_false', 'counterservice_false', 'delivery_false', 'goodforgroups_false', 'reservations_false', 'tableservice_false', 'takeout_false', 'wheelchairaccessible_false']


In [43]:
# Drop the listed columns from the DataFrame
businessdf.drop(columns=drop_col_list, inplace=True)

In [44]:
# Pickle results 
businessdf.to_pickle('businessdf.pickle')

In [45]:
# Check if I successfully pickled the data
pickled_businessdf = pd.read_pickle('/Users/mai/Desktop/yelp_dataset/to_submit/businessdf.pickle')
pickled_businessdf.head()

Unnamed: 0,business_id,categories,city,business_status,latitude,longitude,business_name,review_count_per_business,average_stars_per_business,US_state,...,meal_brunch,meal_dessert,meal_dinner,meal_latenight,meal_lunch,parking_garage,parking_lot,parking_street,parking_valet,parking_validated
0,QXAEGFB4oINsVuTFxEYKFQ,"Specialty Food, Restaurants, Dim Sum, Imported...",Mississauga,1,43.605499,-79.652289,Emerald Chinese Restaurant,128,2.5,ON,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,gnKjwL_1w79qoiV3IC_xQQ,"Sushi Bars, Restaurants, Japanese",Charlotte,1,35.092564,-80.859132,Musashi Japanese Restaurant,170,4.0,NC,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,xvX2CttrVhyG2z1dFg_0xw,"Insurance, Financial Services",Goodyear,1,33.455613,-112.395596,Farmers Insurance - Paul Lorenz,3,5.0,AZ,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,HhyxOkGAM07SRYtlQ4wMFQ,"Plumbing, Shopping, Local Services, Home Servi...",Charlotte,1,35.190012,-80.887223,Queen City Plumbing,4,4.0,NC,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,68dUKd8_8liJ7in4aWOSEA,"Shipping Centers, Couriers & Delivery Services...",Mississauga,1,43.599475,-79.711584,The UPS Store,3,2.5,ON,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Check what the review file looks like

In [150]:
reviewjson = '/Users/mai/Desktop/yelp_dataset/review.json'

In [None]:
# Decode JSON
with open(reviewjson) as f:
    reviewdata = [json.loads(line) for line in f]
reviewdf = pd.DataFrame(reviewdata)

In [156]:
# Pickle results 
reviewdf.to_pickle('reviewdf.pickle')

In [157]:
# Check if I successfully pickled the data
pickled_reviewdf = pd.read_pickle('/Users/mai/Desktop/yelp_dataset/to_submit/reviewdf.pickle')
pickled_reviewdf.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,ujmEBvifdJM6h6RLv4wQIg,0,2013-05-07 04:34:36,1.0,Q1sbwvVQXV2734tPgoKj4Q,1.0,Total bill for this horrible service? Over $8G...,6.0,hG7b0MtEbXx5QzbzE6C_VA
1,NZnhc2sEQy3RmzKTZnqtwQ,0,2017-01-14 21:30:33,0.0,GJXCdrto3ASJOqKeVWPi6Q,5.0,I *adore* Travis at the Hard Rock's new Kelly ...,0.0,yXQM5uF2jS6es16SJzNHfg
2,WTqjgwHlXbSFevF32_DJVw,0,2016-11-09 20:09:03,0.0,2TzJjDVDEuAW6MR5Vuc1ug,5.0,I have to say that this office really has it t...,3.0,n6-Gk65cPZL6Uz8qRm3NYw
3,ikCg8xy5JIg_NGPx-MSIDA,0,2018-01-09 20:56:38,0.0,yi0R0Ugj_xUx_Nek0-_Qig,5.0,Went in for a lunch. Steak sandwich was delici...,0.0,dacAIZ6fTM6mqwW5uxkskg
4,b1b1eb3uo-w561D0ZfCEiQ,0,2018-01-30 23:07:38,0.0,11a8sVPMUFtaC7_ABRkmtw,1.0,Today was my second out of three sessions I ha...,7.0,ssoyf2_x0EQMed6fgHeMyQ


# Check what the user file looks like

In [65]:
userjson = '/Users/mai/Desktop/yelp_dataset/user.json'

In [66]:
# Decode JSON
with open(userjson) as f:
    userdata = [json.loads(line) for line in f]
userdf = pd.DataFrame(userdata)

In [160]:
# Pickle results 
userdf.to_pickle('userdf.pickle')

In [162]:
# Check if I successfully pickled the data
pickled_userdf = pd.read_pickle('/Users/mai/Desktop/yelp_dataset/to_submit/userdf.pickle')
pickled_userdf.head()

Unnamed: 0,average_stars,compliment_cool,compliment_cute,compliment_funny,compliment_hot,compliment_list,compliment_more,compliment_note,compliment_photos,compliment_plain,...,cool,elite,fans,friends,funny,name,review_count,useful,user_id,yelping_since
0,4.03,1,0,1,2,0,0,1,0,1,...,25,201520162017.0,5,"c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g...",17,Rashmi,95,84,l6BmjZMeQD3rDxWUbiAiow,2013-10-08 23:11:33
1,3.63,1,0,1,1,0,0,0,0,0,...,16,,4,"kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg...",22,Jenna,33,48,4XChL029mKr5hydo79Ljxg,2013-02-21 22:29:06
2,3.71,0,0,0,0,0,0,1,0,0,...,10,,0,"4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng...",8,David,16,28,bc8C_eETBWL0olvFSJJd0w,2013-10-04 00:16:10
3,4.85,0,0,0,1,0,0,0,0,2,...,14,,5,"RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A...",4,Angela,17,30,dD0gZpBctWGdWo9WlGuhlA,2014-05-22 15:57:30
4,4.08,80,0,80,28,1,1,16,5,57,...,665,2015201620172018.0,39,"mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ...",279,Nancy,361,1114,MM4RJAeH6yuaN8oZDSt0RA,2013-10-23 07:02:50


# Check what the checkin file looks like

In [69]:
checkinjson = '/Users/mai/Desktop/yelp_dataset/checkin.json'

In [70]:
# Decode JSON
with open(checkinjson) as f:
    checkindata = [json.loads(line) for line in f]
checkindf = pd.DataFrame(checkindata)

In [164]:
# Pickle results 
checkindf.to_pickle('checkindf.pickle')

In [166]:
# Check if I successfully pickled the data
pickled_checkindf = pd.read_pickle('/Users/mai/Desktop/yelp_dataset/to_submit/checkindf.pickle')
pickled_checkindf.head()

Unnamed: 0,business_id,date
0,--1UhMGODdWsrMastO9DZw,"2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016..."
1,--6MefnULPED_I942VcFNA,"2011-06-04 18:22:23, 2011-07-23 23:51:33, 2012..."
2,--7zmmkVg-IMGaXbuVd0SQ,"2014-12-29 19:25:50, 2015-01-17 01:49:14, 2015..."
3,--8LPVSo5i0Oo61X01sV9A,2016-07-08 16:43:30
4,--9QQLMTbFzLJ_oT-ON3Xw,"2010-06-26 17:39:07, 2010-08-01 20:06:21, 2010..."
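
In the check-in table, `date` holds all timestamps for a business as one comma-separated string. If individual check-ins are needed later, the string can be split and parsed. The sketch below assumes the `YYYY-MM-DD HH:MM:SS` layout visible in the output and uses only the first two timestamps of row 0.

```python
from datetime import datetime

date_field = "2016-04-26 19:49:16, 2016-08-30 18:36:57"

# Split on the ", " separator and parse each timestamp
checkins = [datetime.strptime(part, "%Y-%m-%d %H:%M:%S")
            for part in date_field.split(", ")]

print(len(checkins), checkins[0].year)
```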


# Check what the tip file looks like

In [73]:
tipjson = '/Users/mai/Desktop/yelp_dataset/tip.json'

In [74]:
# Decode JSON
with open(tipjson) as f:
    tipdata = [json.loads(line) for line in f]
tipdf = pd.DataFrame(tipdata)

In [168]:
# Pickle results 
tipdf.to_pickle('tipdf.pickle')

In [170]:
# Check if I successfully pickled the data
pickled_tipdf = pd.read_pickle('/Users/mai/Desktop/yelp_dataset/to_submit/tipdf.pickle')
pickled_tipdf.head()

Unnamed: 0,business_id,compliment_count,date,text,user_id
0,VaKXUpmWTTWDKbpJ3aQdMw,0,2014-03-27 03:51:24,"Great for watching games, ufc, and whatever el...",UPw5DWs_b-e2JRBS-t37Ag
1,OPiPeoJiv92rENwbq76orA,0,2013-05-25 06:00:56,Happy Hour 2-4 daily with 1/2 price drinks and...,Ocha4kZBHb4JK0lOWvE0sg
2,5KheTjYPu1HcQzQFtm4_vw,0,2011-12-26 01:46:17,Good chips and salsa. Loud at times. Good serv...,jRyO2V1pA4CdVVqCIOPc1Q
3,TkoyGi8J7YFjA6SbaRzrxg,0,2014-03-23 21:32:49,The setting and decoration here is amazing. Co...,FuTJWFYm4UKqewaosss1KA
4,AkL6Ous6A1atZejfZXn1Bg,0,2012-10-06 00:19:27,Molly is definately taking a picture with Sant...,LUlKtaM3nXd-E4N4uOk_fQ
