## DATASET COLLECTION

**Adriana de Vicente**

**Irma Sánchez**

# Download data

We downloaded the dataset from Yelp (https://www.yelp.com/dataset) 

# About files

We loaded and transformed Yelp data challenge datasets into Pandas DataFrames:
- **Users** contains information about registered Yelp users, such as the number of reviews per user, the number of useful reviews they posted, and the type of reviews.
- **Review** contains the reviews that customers submitted about each business.
- **Checkin** contains the check-in times of customers at a given business.
- **Tip** contains tips per business ID.
- **Business** contains the attributes of each business listed on Yelp, such as categories (e.g. good for lunch), city, hours, location, and Yelp stars. These columns are consistent with Yelp.com.

# Import packages

In [1]:
import json
import warnings
warnings.filterwarnings('ignore')
from copy import deepcopy
import pickle
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Turn business.json to a Pandas Dataframe

This code reads a file in JSON format and loads it into a Pandas dataframe.

In [2]:
business_df = pd.read_json('../data/raw/yelp_academic_dataset_business.json',lines=True)
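
The `lines=True` argument tells pandas that the file is in JSON Lines format, with one JSON object per line rather than a single JSON array. A minimal sketch with two made-up records (the sample values are illustrative and not taken from the Yelp file):

```python
from io import StringIO

import pandas as pd

# Two hypothetical records in JSON Lines format: one JSON object per line
sample_jsonl = (
    '{"business_id": "a1", "stars": 4.5, "attributes": {"WiFi": "free"}}\n'
    '{"business_id": "b2", "stars": 3.0, "attributes": null}\n'
)

# lines=True parses each line as a separate record (one row per line)
sample_df = pd.read_json(StringIO(sample_jsonl), lines=True)
print(sample_df.shape)
print(sample_df["attributes"].tolist())
```

Nested objects such as `attributes` stay as Python dictionaries inside a single column, which is why the later cells have to expand them.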

In [3]:
business_df.to_csv('../data/processed/business_df.csv', index=False)


In [4]:
business_df = pd.read_csv('../data/processed/business_df.csv')

In [5]:
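# Optional shortcut: reload the processed DataFrame pickled at the end of this notebook.
# Skip this cell on a first run; the cells below need the raw 'attributes' column, which the pickled version no longer has.
business_df = pd.read_pickle('../data/processed/businessdf.pickle')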


The head() method is used to return the first few rows of a dataframe. 

In [138]:
# Check what the DataFrame looks like
business_df.head()

Unnamed: 0,business_id,business_name,address,city,US_state,postal_code,latitude,longitude,average_stars_per_business,review_count_per_business,...,meal_brunch,meal_dessert,meal_dinner,meal_latenight,meal_lunch,parking_garage,parking_lot,parking_street,parking_valet,parking_validated
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Drop unnecessary columns for this project
business_df.drop(columns=['address', 'hours', 'postal_code'], inplace=True)

In [7]:
# Check what the nested dictionaries in the 'attributes' column look like
business_df.attributes

0                             {'ByAppointmentOnly': 'True'}
1                    {'BusinessAcceptsCreditCards': 'True'}
2         {'BikeParking': 'True', 'BusinessAcceptsCredit...
3         {'RestaurantsDelivery': 'False', 'OutdoorSeati...
4         {'BusinessAcceptsCreditCards': 'True', 'Wheelc...
                                ...                        
150341    {'ByAppointmentOnly': 'False', 'RestaurantsPri...
150342    {'BusinessAcceptsCreditCards': 'True', 'Restau...
150343    {'RestaurantsPriceRange2': '1', 'BusinessAccep...
150344    {'BusinessParking': "{'garage': False, 'street...
150345    {'WheelchairAccessible': 'True', 'BusinessAcce...
Name: attributes, Length: 150346, dtype: object

The `attributes` DataFrame shown below uses the same index as the `business_df` DataFrame and has an extra column called 'business_id' that contains the index values.

In [107]:
attributes

Unnamed: 0,ByAppointmentOnly,BusinessAcceptsCreditCards,BikeParking,RestaurantsPriceRange2,CoatCheck,RestaurantsTakeOut,RestaurantsDelivery,Caters,WiFi,BusinessParking,...,BestNights,BYOB,Corkage,BYOBCorkage,HairSpecializesIn,Open24Hours,RestaurantsCounterService,AgesAllowed,DietaryRestrictions,business_id
0,True,,,,,,,,,,...,,,,,,,,,,0
1,,True,,,,,,,,,...,,,,,,,,,,1
2,False,True,True,2,False,False,False,False,u'no',"{'garage': False, 'street': False, 'validated'...",...,,,,,,,,,,2
3,False,False,True,1,,True,False,True,u'free',"{'garage': False, 'street': True, 'validated':...",...,,,,,,,,,,3
4,,True,True,,,True,,False,,"{'garage': None, 'street': None, 'validated': ...",...,,,,,,,,,,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150341,False,,,3,,,,,,,...,,,,,,,,,,150341
150342,,True,True,2,,,,,u'no',"{'garage': False, 'street': False, 'validated'...",...,,,,,,,,,,150342
150343,,True,,1,,,,,,,...,,,,,,,,,,150343
150344,,True,True,4,,,,,,"{'garage': False, 'street': False, 'validated'...",...,,,,,,,,,,150344
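
The cell that builds `attributes` is not shown above. One way such a frame could be produced, assuming each entry of the `attributes` column is either a dictionary or missing, is sketched here. The toy data and the use of `pd.DataFrame` on a list of dictionaries are our assumptions, not necessarily the original code:

```python
import pandas as pd

# Toy stand-in for business_df with a nested 'attributes' column
toy_business_df = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "attributes": [
        {"ByAppointmentOnly": "True"},
        None,
        {"BikeParking": "True", "WiFi": "u'free'"},
    ],
})

# Expand each dict into columns; missing entries become empty dicts (all-NaN rows)
records = [d if isinstance(d, dict) else {} for d in toy_business_df["attributes"]]
toy_attributes = pd.DataFrame(records, index=toy_business_df.index)

# Store the row index as an explicit 'business_id' column
toy_attributes["business_id"] = toy_attributes.index
print(toy_attributes)
```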


In [116]:
business_df.attributes.head()

0                        {'ByAppointmentOnly': 'True'}
1               {'BusinessAcceptsCreditCards': 'True'}
2    {'BikeParking': 'True', 'BusinessAcceptsCredit...
3    {'RestaurantsDelivery': 'False', 'OutdoorSeati...
4    {'BusinessAcceptsCreditCards': 'True', 'Wheelc...
Name: attributes, dtype: object

In [113]:
# Compare the attributes of two businesses (rows 4 and 1000)
# Each business carries a different set of keys in the 'attributes' column
print('Example #1', business_df['attributes'][4])
print('----------')
print('Example #2', business_df['attributes'][1000])

Example #1 {'BusinessAcceptsCreditCards': 'True', 'WheelchairAccessible': 'True', 'RestaurantsTakeOut': 'True', 'BusinessParking': "{'garage': None, 'street': None, 'validated': None, 'lot': True, 'valet': False}", 'BikeParking': 'True', 'GoodForKids': 'True', 'Caters': 'False'}
----------
Example #2 {'RestaurantsTableService': 'True', 'NoiseLevel': "u'average'", 'BusinessAcceptsCreditCards': 'True', 'RestaurantsAttire': "u'casual'", 'RestaurantsReservations': 'False', 'BusinessParking': "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}", 'DogsAllowed': 'False', 'RestaurantsDelivery': 'False', 'GoodForMeal': "{'dessert': True, 'latenight': None, 'lunch': True, 'dinner': None, 'brunch': None, 'breakfast': None}", 'RestaurantsPriceRange2': '2', 'WiFi': "u'free'", 'HasTV': 'True', 'Caters': 'True', 'BikeParking': 'True', 'GoodForKids': 'True', 'RestaurantsTakeOut': 'True', 'OutdoorSeating': 'True', 'RestaurantsGoodForGroups': 'True', 'WheelchairAccessibl

#  Expand the attributes and category columns by using DictVectorizer()

Since each business has a different amount of attribute information in the 'attributes' column, we decided to expand it into columns using DictVectorizer().
However, the following keys need to be expanded manually because their values are themselves dictionaries: Ambience, BusinessParking, GoodForMeal.
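
To make the expansion concrete, here is a minimal sketch of how `DictVectorizer` treats string values. The three dictionaries are invented examples shaped like the Yelp entries:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical attribute dicts, shaped like the Yelp 'attributes' entries
toy_attrs = [
    {"WiFi": "free", "BikeParking": "True"},
    {"WiFi": "no"},
    {"BikeParking": "False", "Caters": "True"},
]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(toy_attrs)

# String values become one column per "key=value" pair, sorted by name
print(vec.feature_names_)
print(matrix)
```

Because every distinct string value gets its own column, a nested dictionary stored as a string produces one column per distinct string. That is why those keys are handled separately.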

In [30]:
# Remove empty cells in 'attributes' column in df
my_dict = business_df[['attributes']].dropna()

In [31]:
# Remove empty cells in my_dict
my_dict.dropna(inplace=True)

In [32]:
# Check shape
print(my_dict.shape)

# Check what the 'attributes' column looks like
my_dict.head()

(136602, 1)


Unnamed: 0,attributes
0,{'ByAppointmentOnly': 'True'}
1,{'BusinessAcceptsCreditCards': 'True'}
2,"{'BikeParking': 'True', 'BusinessAcceptsCredit..."
3,"{'RestaurantsDelivery': 'False', 'OutdoorSeati..."
4,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc..."


In [33]:
# Create extra columns indicating the presence or absence of a category with a value of 1 or 0
dvec_attributes = DictVectorizer()
dmat_attributes = dvec_attributes.fit_transform(my_dict['attributes'])
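# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use get_feature_names() instead
df_attributes = pd.DataFrame(dmat_attributes.toarray(), columns=dvec_attributes.get_feature_names_out(), index=my_dict['attributes'].index)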
df_attributes.shape

(136602, 3668)

In [34]:
# Drop unnecessary columns
# 'BusinessParking', 'GoodForMeal', 'Ambience': I will apply DictVectorizer() separately thus this dataframe doesn't need to keep the columns here
# DietaryRestrictions / BestNights: I found this does not contain insightful data
# HairSpecializesIn: I focus on restaurants thus this has to be removed
# =u: Original Business.JSON has duplicated keys in the nested dictionary. These contain =u

remove_keys = ['BusinessParking', 'GoodForMeal', 'Ambience', 'Music',
               'DietaryRestrictions', 'BestNights', 'HairSpecializesIn', '=u']

# Collect every column whose name contains one of the keys above
remove_col_list = []
for key in remove_keys:
    for col in df_attributes.columns:
        if key in col:
            remove_col_list.append(col)
print(remove_col_list[1])
print(len(remove_col_list))

BusinessParking={'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}
3570


In [35]:
# Remove all the columns listed in remove_col_list from the DataFrame
df_attributes.drop(columns=remove_col_list, inplace=True)
df_attributes.shape

(136602, 98)

In [36]:
# Create columns per ambience feature
# Fill NaN in the empty cells
my_dict['ambience'] = my_dict['attributes'].map(lambda x: (x['Ambience']) if 'Ambience' in x else np.nan)

In [37]:
# Delete keys from 'attributes' column to avoid duplicates
selected_key = 'Ambience'
def dropkey(x, key=selected_key):
    y = deepcopy(x)
    if key in y:
         del y[key]
    return y

In [38]:
# Remove the 'Ambience' key from every dictionary in the 'attributes' column
my_dict['attributes'] = my_dict['attributes'].map(dropkey)

# Replace the string 'None' with NaN (.loc avoids chained assignment)
my_dict.loc[my_dict['ambience'] == 'None', 'ambience'] = np.nan

# Convert the remaining stringified dictionaries into real dictionaries
my_dict['ambience'] = my_dict['ambience'].map(lambda x: eval(x) if isinstance(x, str) else x)
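
The nested values such as `Ambience` arrive as strings that look like Python dictionaries, and the cell above converts them with `eval`. A more defensive alternative is `ast.literal_eval`, which only evaluates literals and refuses to run arbitrary code. It is sketched here under the assumption that every value is a plain Python literal. The helper name `parse_nested` is hypothetical:

```python
import ast

import numpy as np


def parse_nested(value):
    """Turn a stringified dict into a real dict; map the string 'None' to NaN."""
    if not isinstance(value, str):
        return value  # already a dict, or NaN
    parsed = ast.literal_eval(value)  # only evaluates literals, never code
    return parsed if parsed is not None else np.nan


raw = "{'romantic': False, 'casual': True, 'divey': None}"
print(parse_nested(raw))
print(parse_nested("None"))
```

It could be applied with `my_dict['ambience'].map(parse_nested)` in place of the separate replace and `eval` steps.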

In [39]:
# Remove all NaN values before using DictVectorizer()
my_dict.dropna(inplace=True)

In [40]:
# Check the shape
my_dict.shape

(43728, 2)

In [41]:
# Apply DictVectorizer to extract features and put them into a Pandas DataFrame
dvec_ambience = DictVectorizer()
dmat_ambience = dvec_ambience.fit_transform(my_dict['ambience'])

# Create a DataFrame
df_ambience = pd.DataFrame(dmat_ambience.toarray(), columns=dvec_ambience.get_feature_names_out(), index=my_dict['ambience'].index)

In [42]:
# Check how the ambience features were expanded to columns
df_ambience.columns

Index(['casual', 'classy', 'divey', 'hipster', 'intimate', 'romantic',
       'touristy', 'trendy', 'upscale'],
      dtype='object')

In [43]:
# Change column names for future use
df_ambience.columns = [
                        'ambience_Casual',
                        'ambience_Classy',
                        'ambience_Divey',
                        'ambience_Hipster',
                        'ambience_Intimate',
                        'ambience_Romantic',
                        'ambience_Touristy',
                        'ambience_Trendy',
                        'ambience_Upscale',
                      ]

In [44]:
# Check what the DataFrame looks like
df_ambience.head()

Unnamed: 0,ambience_Casual,ambience_Classy,ambience_Divey,ambience_Hipster,ambience_Intimate,ambience_Romantic,ambience_Touristy,ambience_Trendy,ambience_Upscale
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,1.0,,,,,0.0,,,
14,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# Check the shape
df_ambience.shape

(43728, 9)

## Create columns per Business Parking features

In [46]:
# Create columns per Business Parking features
# Fill NaN in the empty cells
my_dict['businessparking'] = my_dict['attributes'].map(lambda x: (x['BusinessParking']) if 'BusinessParking' in x else np.nan)
my_dict['businessparking'].head()

8     {'garage': False, 'street': False, 'validated'...
11    {'garage': False, 'street': False, 'validated'...
12    {'garage': None, 'street': False, 'validated':...
14    {'garage': False, 'street': False, 'validated'...
15    {u'valet': False, u'garage': None, u'street': ...
Name: businessparking, dtype: object

In [47]:
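# Follow the same steps as for the Ambience features, this time dropping the 'BusinessParking' key
# (dropkey defaults to 'Ambience', so the key has to be passed explicitly)
my_dict['attributes'] = my_dict['attributes'].map(lambda x: dropkey(x, key='BusinessParking'))
my_dict.loc[my_dict['businessparking'] == 'None', 'businessparking'] = np.nan
my_dict['businessparking'] = my_dict['businessparking'].map(lambda x: eval(x) if isinstance(x, str) else x)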

In [48]:
# Apply DictVectorizer to extract features and put them into a Pandas DataFrame
# Forward-fill missing entries so that DictVectorizer receives a dictionary for every row
dvec_parking = DictVectorizer()
dmat_parking = dvec_parking.fit_transform(my_dict['businessparking'].ffill())
df_parking = pd.DataFrame(dmat_parking.toarray(), columns=dvec_parking.get_feature_names_out(), index=my_dict['businessparking'].index)

In [49]:
# Change column names for modelling and analysis
df_parking.columns = [
                        'parking_Garage',
                        'parking_Lot',
                        'parking_Street',
                        'parking_Valet',
                        'parking_Validated'
                      ]

In [50]:
df_parking.head()

Unnamed: 0,parking_Garage,parking_Lot,parking_Street,parking_Valet,parking_Validated
8,0.0,1.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0
12,,1.0,0.0,0.0,
14,0.0,1.0,0.0,0.0,0.0
15,,0.0,1.0,0.0,


## Create columns per GoodForMeal features

In [68]:
# Fill NaN in the empty cells
my_dict['goodformeal'] = my_dict['attributes'].map(lambda x: (x['GoodForMeal']) if 'GoodForMeal' in x else np.nan)

In [69]:
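# Follow the same steps as for the Ambience features, this time dropping the 'GoodForMeal' key
# (dropkey defaults to 'Ambience', so the key has to be passed explicitly)
my_dict['attributes'] = my_dict['attributes'].map(lambda x: dropkey(x, key='GoodForMeal'))
my_dict.loc[my_dict['goodformeal'] == 'None', 'goodformeal'] = np.nan
my_dict['goodformeal'] = my_dict['goodformeal'].map(lambda x: eval(x) if isinstance(x, str) else x)
# Missing values stay in place here; they are back-filled before vectorizing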

In [72]:
# Apply DictVectorizer to extract features and put them into a Pandas DataFrame
# Back-fill missing entries so that DictVectorizer receives a dictionary for every row
dvec_meal = DictVectorizer()
dmat_meal = dvec_meal.fit_transform(my_dict['goodformeal'].bfill())
df_meal = pd.DataFrame(dmat_meal.toarray(), columns=dvec_meal.get_feature_names_out(), index=my_dict['goodformeal'].index)



In [73]:
# Change column names for modelling and analysis
df_meal.columns = [
                        'meal_Breakfast',
                        'meal_Brunch',
                        'meal_Dessert',
                        'meal_Dinner',
                        'meal_Latenight',
                        'meal_Lunch'
                      ]

In [74]:
df_meal.head()

Unnamed: 0,meal_Breakfast,meal_Brunch,meal_Dessert,meal_Dinner,meal_Latenight,meal_Lunch
8,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,1.0
15,,,1.0,1.0,,


In [75]:
# Check the DataFrame shape
df_meal.shape

(43728, 6)

## Merge all dataframes created from business.json

In [77]:
businessdf = pd.concat([business_df, df_attributes, df_ambience, df_meal, df_parking], axis=1)

# Drop the original column
businessdf.drop(columns='attributes', inplace=True) 
businessdf.head(2)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,...,meal_Brunch,meal_Dessert,meal_Dinner,meal_Latenight,meal_Lunch,parking_Garage,parking_Lot,parking_Street,parking_Valet,parking_Validated
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,...,,,,,,,,,,
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,,,,,,,,,,


In [78]:
# Replace NaN values in the dataframe
businessdf = businessdf.fillna(0)

## Clean up the column names before saving the DataFrame

The business.json file did not have clear dictionary keys, so we decided to rename the columns for future use.

In [79]:
# Rename columns since these include {}, =, ' in the name (e.g. NoiseLevel='average')
businessdf.columns = businessdf.columns.str.replace("{", "")
businessdf.columns = businessdf.columns.str.replace("}", "")
businessdf.columns = businessdf.columns.str.replace("=", "_")
businessdf.columns = businessdf.columns.str.replace("'", "_")

# Replace 'none' with 'unknown' and spell out 'byob'
businessdf.columns = businessdf.columns.str.replace('none', 'unknown')
businessdf.columns = businessdf.columns.str.replace('byob', 'bring_your_own_bottle')

# Lowercase column names
businessdf.columns = businessdf.columns.map(lambda x: x.lower())

# Remove 'restaurants' from column names
businessdf.columns = businessdf.columns.map(lambda x: x.replace('restaurants', ''))
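
The same clean-up can be expressed as a single function, which makes the order of the steps easier to see. This is a sketch with a hypothetical helper name, `clean_column_name`, and made-up raw column names in the `key=value` style produced by DictVectorizer:

```python
import pandas as pd


def clean_column_name(name):
    """Hypothetical helper applying the same replacements, in the same order."""
    for ch in "{}":
        name = name.replace(ch, "")
    for ch in "='":
        name = name.replace(ch, "_")
    name = name.replace("none", "unknown")
    name = name.replace("byob", "bring_your_own_bottle")
    return name.lower().replace("restaurants", "")


toy = pd.DataFrame(columns=["NoiseLevel='average'", "RestaurantsPriceRange2=2", "Alcohol='none'"])
toy.columns = toy.columns.map(clean_column_name)
print(toy.columns.tolist())
```

Note that the 'none' and 'byob' replacements run before lowercasing, so they only match text that is already lowercase in the raw name.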

In [80]:
# Rename columns to clearer names
businessdf.rename(columns={
     'is_open': 'business_status','name': 'business_name',
     'review_count': 'review_count_per_business', 'stars': 'average_stars_per_business', 'state': 'US_state',
     'alcohol__beer_and_wine_': 'alcohol_beer_and_wine', 'alcohol__full_bar_': 'alcohol_full_bar', 'alcohol__unknown_': 'alcohol_not_available',
     'bring_your_own_bottlecorkage__no_': 'bring_your_own_bottle_corkage_no',
     'bring_your_own_bottlecorkage__yes_corkage_': 'bring_your_own_bottle_corkage_yes_corkage',
     'bring_your_own_bottlecorkage__yes_free_': 'bring_your_own_bottle_corkage_yes_free',
     'bring_your_own_bottlecorkage_unknown': 'bring_your_own_bottle_corkage_unknown',
     'hastv_false': 'hasTV_false', 'hastv_unknown': 'hasTV_unknown', 'hastv_true': 'hasTV_true',
     'noiselevel__average_': 'noiselevel_average', 'noiselevel__loud_': 'noiselevel_loud',
     'noiselevel__quiet_': 'noiselevel_quiet', 'noiselevel__very_loud_': 'noiselevel_very_loud',    
     'attire__casual_': 'dresscode_casual', 'attire__dressy_': 'dresscode_dressy',
     'attire__formal_': 'dresscode_formal', 'attire_unknown': 'dresscode_unknown',
     'pricerange2_1': 'pricerange_1',
     'pricerange2_2': 'pricerange_2', 'pricerange2_3': 'pricerange_3',
     'pricerange2_4': 'pricerange_4', 'pricerange2_unknown': 'pricerange_unknown',
     'smoking__no_': 'smoking_no', 'smoking__outdoor_': 'smoking_outdoor', 'smoking__yes_': 'smoking_yes',
     'wifi__free_': 'wifi_free', 'wifi__no_': 'wifi_no',
     'wifi__paid_': 'wifi_paid'}, inplace=True)

In [81]:
# Check what the DataFrame looks like
businessdf.head()

Unnamed: 0,business_id,business_name,address,city,US_state,postal_code,latitude,longitude,average_stars_per_business,review_count_per_business,...,meal_brunch,meal_dessert,meal_dinner,meal_latenight,meal_lunch,parking_garage,parking_lot,parking_street,parking_valet,parking_validated
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Drop one column per attribute group 

In [82]:
# Collect the '_false' columns to drop
drop_col_list = []

for col in businessdf.columns:
    if 'false' in col:
        drop_col_list.append(col)
print(drop_col_list)

['acceptsinsurance_false', 'byob_false', 'bikeparking_false', 'businessacceptsbitcoin_false', 'businessacceptscreditcards_false', 'byappointmentonly_false', 'caters_false', 'coatcheck_false', 'corkage_false', 'dogsallowed_false', 'drivethru_false', 'goodfordancing_false', 'goodforkids_false', 'happyhour_false', 'hasTV_false', 'open24hours_false', 'outdoorseating_false', 'counterservice_false', 'delivery_false', 'goodforgroups_false', 'reservations_false', 'tableservice_false', 'takeout_false', 'wheelchairaccessible_false']


In [83]:
# Drop the listed columns from the DataFrame
businessdf.drop(columns=drop_col_list, inplace=True)
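
Each boolean attribute was expanded into a `_true` and a `_false` column, and this step keeps only the `_true` indicator. A toy sketch of the same filter, with made-up column names and values:

```python
import pandas as pd

toy = pd.DataFrame({
    "caters_true": [1.0, 0.0, 0.0],
    "caters_false": [0.0, 1.0, 0.0],  # third row: attribute not reported
    "wifi_free": [1.0, 0.0, 0.0],
})

# Same rule as above: any column with 'false' in its name is dropped
false_cols = [col for col in toy.columns if "false" in col]
reduced = toy.drop(columns=false_cols)
print(reduced.columns.tolist())
```

Because missing values were filled with 0 earlier, a 0 in a remaining `_true` column covers both "false" and "not reported"; the toy rows 1 and 2 become indistinguishable after the drop.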

In [87]:
# Pickle results 
businessdf.to_pickle('../data/processed/businessdf.pickle')

In [88]:
# Check if I successfully pickled the data
pickled_businessdf = pd.read_pickle('../data/processed/businessdf.pickle')
pickled_businessdf.head()

Unnamed: 0,business_id,business_name,address,city,US_state,postal_code,latitude,longitude,average_stars_per_business,review_count_per_business,...,meal_brunch,meal_dessert,meal_dinner,meal_latenight,meal_lunch,parking_garage,parking_lot,parking_street,parking_valet,parking_validated
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
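
Looking at `head()` confirms that the file loads. To check that nothing changed in the round trip, the reloaded frame can also be compared with the original. A small sketch on a toy frame, writing to a temporary directory (the path and data are illustrative):

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"business_id": ["a1", "b2"], "stars": [4.5, 3.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_df.pickle")
    toy_df.to_pickle(path)
    reloaded = pd.read_pickle(path)

# DataFrame.equals compares shape and elements, treating NaNs in the same position as equal
round_trip_ok = toy_df.equals(reloaded)
print(round_trip_ok)
```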


# Check what the review file looks like

In [92]:
# Read the JSON Lines file one record at a time
data = []
with open('../data/raw/yelp_academic_dataset_review.json', encoding='utf8') as data_file:
    for line in data_file:
        data.append(json.loads(line))
review_df = pd.DataFrame(data)

In [93]:
review_df.to_csv('../data/processed/review_df.csv', index=False)

In [95]:
# Pickle results 
review_df.to_pickle('../data/processed/reviewdf.pickle')

In [97]:
# Check if I successfully pickled the data
pickled_reviewdf = pd.read_pickle('../data/processed/reviewdf.pickle')
pickled_reviewdf.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


# Check what the user file looks like

In [150]:
# Read the JSON Lines file one record at a time
data = []
with open('../data/raw/yelp_academic_dataset_user.json', encoding='utf8') as data_file:
    for line in data_file:
        data.append(json.loads(line))
user_df = pd.DataFrame(data)

In [151]:
# Pickle results 
user_df.to_pickle('../data/processed/userdf.pickle')

In [162]:
# Check if I successfully pickled the data
pickled_userdf = pd.read_pickle('../data/processed/userdf.pickle')
pickled_userdf.head()

Unnamed: 0,average_stars,compliment_cool,compliment_cute,compliment_funny,compliment_hot,compliment_list,compliment_more,compliment_note,compliment_photos,compliment_plain,...,cool,elite,fans,friends,funny,name,review_count,useful,user_id,yelping_since
0,4.03,1,0,1,2,0,0,1,0,1,...,25,201520162017.0,5,"c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g...",17,Rashmi,95,84,l6BmjZMeQD3rDxWUbiAiow,2013-10-08 23:11:33
1,3.63,1,0,1,1,0,0,0,0,0,...,16,,4,"kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg...",22,Jenna,33,48,4XChL029mKr5hydo79Ljxg,2013-02-21 22:29:06
2,3.71,0,0,0,0,0,0,1,0,0,...,10,,0,"4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng...",8,David,16,28,bc8C_eETBWL0olvFSJJd0w,2013-10-04 00:16:10
3,4.85,0,0,0,1,0,0,0,0,2,...,14,,5,"RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A...",4,Angela,17,30,dD0gZpBctWGdWo9WlGuhlA,2014-05-22 15:57:30
4,4.08,80,0,80,28,1,1,16,5,57,...,665,2015201620172018.0,39,"mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ...",279,Nancy,361,1114,MM4RJAeH6yuaN8oZDSt0RA,2013-10-23 07:02:50


The code above reads 'yelp_academic_dataset_user.json' line by line into a list of dictionaries called 'data' and builds a Pandas DataFrame called 'user_df' from it. It then pickles 'user_df' as 'userdf.pickle', reads the pickle back in, and displays the first few rows with 'head()'.
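
For files too large to hold comfortably as a list of dictionaries, `pd.read_json` can also stream JSON Lines in chunks. A sketch on in-memory sample data; the chunk size and the records are arbitrary assumptions:

```python
from io import StringIO

import pandas as pd

# Five hypothetical user records, one JSON object per line
sample_jsonl = "\n".join(
    '{"user_id": "u%d", "review_count": %d}' % (i, i * 10) for i in range(5)
)

# With lines=True and chunksize, read_json returns an iterator of DataFrames
reader = pd.read_json(StringIO(sample_jsonl), lines=True, chunksize=2)
chunks = [chunk for chunk in reader]

# Stitch the chunks back together into a single DataFrame
sample_user_df = pd.concat(chunks, ignore_index=True)
print(len(chunks), sample_user_df.shape)
```

Each chunk could also be filtered or reduced before concatenation, so that only the needed rows or columns are kept in memory.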


# Check what the checkin file looks like

In [141]:
checkinjson = '../data/raw/yelp_academic_dataset_checkin.json'

In [142]:

checkinjson = pd.read_json('../data/raw/yelp_academic_dataset_checkin.json', lines=True)
checkinjson.to_csv('../data/processed/checkinjson.csv', index=False)




In [144]:
checkinjson.to_pickle('../data/processed/checkinjson.pickle')

In [145]:
# Check if I successfully pickled the data
pickled_checkindf = pd.read_pickle('../data/processed/checkinjson.pickle')
pickled_checkindf.head()

Unnamed: 0,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012..."
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014..."


The code above reads 'yelp_academic_dataset_checkin.json' into a Pandas DataFrame called 'checkinjson', exports it to 'checkinjson.csv', and pickles it as 'checkinjson.pickle'. Finally, it reads the pickle back in and displays the first few rows with 'head()'.
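
As the preview shows, each `date` entry is a single comma-separated string of timestamps. If a per-business check-in count is needed later, it could be derived along these lines. The column name `checkin_count` and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Toy rows in the same shape as the checkin preview (values are made up)
toy_checkin = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": [
        "2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020-07-24 22:42:27",
        "2013-06-14 23:29:17, 2014-08-13 23:20:22",
    ],
})

# Split on the ", " separator and count the resulting timestamps
toy_checkin["checkin_count"] = toy_checkin["date"].str.split(", ").str.len()

# Parse the first timestamp of each row to confirm the format
first_checkin = pd.to_datetime(toy_checkin["date"].str.split(", ").str[0])
print(toy_checkin[["business_id", "checkin_count"]])
```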


# Check what the tip file looks like

In [147]:
tipdf = pd.read_json('../data/raw/yelp_academic_dataset_tip.json',lines=True)

tipdf.to_csv('../data/processed/tipdf.csv', index=False)



In [148]:
# Pickle results 
tipdf.to_pickle('../data/processed/tipdf.pickle')

In [149]:
# Check if I successfully pickled the data
pickled_tipdf = pd.read_pickle('../data/processed/tipdf.pickle')
pickled_tipdf.head()

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,AGNUgVwnZUey3gcPCJ76iw,3uLgwr0qeCNMjKenHJwPGQ,Avengers time with the ladies.,2012-05-18 02:17:21,0
1,NBN4MgHP9D3cw--SnauTkA,QoezRbYQncpRqyrLH6Iqjg,They have lots of good deserts and tasty cuban...,2013-02-05 18:35:10,0
2,-copOvldyKh1qr-vzkDEvw,MYoRNLb5chwjQe3c_k37Gg,It's open even when you think it isn't,2013-08-18 00:56:08,0
3,FjMQVZjSqY8syIO-53KFKw,hV-bABTK-glh5wj31ps_Jw,Very decent fried chicken,2017-06-27 23:05:38,0
4,ld0AperBXk1h6UbqmM80zw,_uN0OudeJ3Zl_tf6nxg5ww,Appetizers.. platter special for lunch,2012-10-06 19:43:09,0


The code above reads 'yelp_academic_dataset_tip.json' into a Pandas DataFrame called 'tipdf', exports it to 'tipdf.csv', and pickles it as 'tipdf.pickle'. Finally, it reads the pickle back in and displays the first few rows with 'head()'.