## Data Exploration and Processing
We downloaded the Yelp dataset from https://www.yelp.com/dataset. The download is a zip file, so we unzipped it to obtain the dataset in JSON format. yelp_dataset/ is the directory that holds all the downloaded Yelp data.
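
The unzip step can also be done from Python. The following is a minimal sketch using the standard-library `zipfile` module; the helper name `extract_archive` and the archive file name in the usage comment are assumptions for illustration, not names taken from the download page.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract every member of a zip archive into target_dir.

    Returns the list of member names found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Hypothetical usage; the archive name below is an assumption.
# extract_archive('yelp_dataset.zip', 'yelp_dataset')
```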

In [None]:
import pandas as pd

In [None]:
review_file = 'yelp_dataset/yelp_academic_dataset_review.json'

Because of the dataset's size and format, we decided to convert the file from JSON to CSV. The code below performs the conversion.

In [None]:
%%time
# converting reviews dataset
# DataFrame.append was removed in pandas 2.0, so the chunks are concatenated in one call
chunks = pd.read_json(review_file, lines=True, chunksize=10000)
reviews = pd.concat(chunks, ignore_index=True)
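
The conversion above builds the full review table in memory before anything is written. If memory is tight, a streaming variant can write each chunk to the CSV as it is read. This is a sketch rather than the approach used in this notebook: the helper name `json_lines_to_csv` is hypothetical, and it assumes every record has the same keys so that all chunks share one column layout.

```python
import pandas as pd


def json_lines_to_csv(json_path, csv_path, chunksize=10000):
    """Stream a JSON Lines file to CSV chunk by chunk.

    Returns the number of rows written.
    """
    rows = 0
    with pd.read_json(json_path, lines=True, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # The first chunk creates the file and writes the header;
            # later chunks are appended without a header.
            chunk.to_csv(
                csv_path,
                mode='w' if i == 0 else 'a',
                header=(i == 0),
                index=False,
            )
            rows += len(chunk)
    return rows


# Hypothetical usage with the paths from this notebook:
# json_lines_to_csv('yelp_dataset/yelp_academic_dataset_review.json', 'yelp_reviews.csv')
```

Only one chunk is held in memory at a time, at the cost of not having the combined DataFrame available afterwards.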

In [None]:
reviews.head()

In [None]:
reviews.shape

In [None]:
# it's faster to write the results in a csv using pyarrow
import pyarrow as pa
from pyarrow import csv

out = pa.Table.from_pandas(reviews)
file_name = 'yelp_reviews.csv'
# uncomment the line below to write the output to a csv (the next cell reads this file)
# csv.write_csv(out, file_name)

In [None]:
# reading from the newly saved file
reviews_df = pd.read_csv('yelp_reviews.csv')
reviews_df.head()

In [None]:
reviews_df.shape

In [None]:
business_file = 'yelp_dataset/yelp_academic_dataset_business.json'
# DataFrame.append was removed in pandas 2.0, so the chunks are concatenated in one call
chunks = pd.read_json(business_file, lines=True, chunksize=10000)
business = pd.concat(chunks, ignore_index=True)

In [None]:
business.head()

We are interested not only in the reviews but also in the businesses being reviewed. We therefore merge the review dataset with the business dataset, joining on the business ID (business_id).

In [None]:
df = reviews_df.merge(business, on='business_id')
df.head()
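
To make the join's behavior explicit, here is a small sketch on toy frames. All values are made up, and the `suffixes` and `validate` arguments are optional additions rather than part of the merge used in this notebook. `validate='many_to_one'` raises an error if a business_id appears more than once on the business side, and `suffixes` gives readable names to columns that exist in both frames instead of the default `_x` and `_y`.

```python
import pandas as pd

# Toy stand-ins for the review and business frames; all values are made up.
toy_reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'business_id': ['b1', 'b1', 'b9'],
    'stars': [5, 3, 4],
})
toy_business = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'stars': [4.0, 3.5],
    'state': ['CA', 'PA'],
})

merged = toy_reviews.merge(
    toy_business,
    on='business_id',
    how='inner',                        # the default: keep only reviews with a matching business
    suffixes=('_review', '_business'),  # rename overlapping columns instead of _x/_y
    validate='many_to_one',             # fail if business_id is duplicated in toy_business
)
print(merged)
```

The review for `b9` is dropped because no business with that ID exists in the toy business frame; with `how='left'` it would be kept with missing business fields.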

In [None]:
# saving the merged reviews and business dataset
# uncomment below lines to write to csv file
# df_out = pa.Table.from_pandas(df)
# file_name = 'yelp_data.csv'
# csv.write_csv(df_out, file_name)

In [None]:
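# exploring unique categories
# df_pa is not defined in this notebook; the merged frame df is assumed here
# rows with a missing categories value are skipped, since NaN has no .split()
category_list = [i.split(', ') for i in df['categories'].dropna()]
uniques = set()
for categories in category_list:
    uniques.update(categories)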
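
Beyond the set of unique labels, it can help to know how often each label occurs. The sketch below counts labels with `collections.Counter` on a few made-up category strings that follow the same comma-separated layout as the categories column; the sample values are assumptions for illustration.

```python
from collections import Counter

# Made-up category strings in the same 'A, B, C' layout as the categories column.
sample_categories = [
    'Restaurants, Mexican, Tacos',
    'Coffee & Tea, Food',
    None,  # stands in for a business with no categories
    'Restaurants, Italian',
]

category_counts = Counter(
    label
    for entry in sample_categories
    if isinstance(entry, str)  # skip missing values
    for label in entry.split(', ')
)

# The most frequent labels come first.
print(category_counts.most_common(3))
```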

In [None]:
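# dataframe that contains reviews in California
# .copy() makes df_ca independent of df, so later column assignments do not trigger SettingWithCopyWarning
df_ca = df[df['state']=='CA'].copy()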
df_ca.head()

In [None]:
state_list = ['PA', 'AZ', 'LA', 'CA', 'FL', 'IN', 'MO', 'TN', 'NV', 'AB', 'NJ',
       'IL', 'ID', 'DE', 'HI', 'NC', 'CO', 'WA', 'UT', 'TX', 'MT', 'MI',
       'SD', 'XMS', 'MA', 'VI', 'VT']
for state in state_list:
    print('number of records in {}: {}'.format(state, df[df['state']==state].shape[0]))
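
The same per-state counts can be obtained without a hard-coded list of states. This is a sketch on a toy frame with made-up rows; on the real data the call would be applied to the state column of the merged frame.

```python
import pandas as pd

# Toy frame with made-up rows; only the state column matters here.
toy = pd.DataFrame({'state': ['CA', 'PA', 'CA', 'AZ', 'PA', 'CA']})

# value_counts returns one count per distinct value, largest first.
state_counts = toy['state'].value_counts()
for state, n in state_counts.items():
    print('{}: {} records'.format(state, n))
```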

For training, we decided to use restaurant reviews from California, given the state's smaller data size and its diversity. We therefore first collect all the unique category labels for businesses in California, then manually pick out the food-related labels so that non-restaurants can be filtered out of the dataset.

In [None]:
# drop rows without categories; reassigning avoids an in-place change on a slice of df
df_ca = df_ca.dropna(subset=['categories'])
category_list_ca = [i.split(', ') for i in df_ca['categories']]
uniques_ca = set()
for categories in category_list_ca:
    uniques_ca.update(categories)

In [None]:
# pandas cannot build a DataFrame directly from a set, so the labels are sorted into a list first
out_ca_df = pd.DataFrame(sorted(uniques_ca), columns=['labels'])
# uncomment below line to write to a csv file
# out_ca_df.to_csv('ca_labels.csv')

From our manual labeling process, we identified the following list of food-related labels.

In [None]:
food_labels =  ['Acai Bowls',
 'American (New)',
 'American (Traditional)',
 'Arabic',
 'Argentine',
 'Asian Fusion',
 'Australian',
 'Bagels',
 'Bakeries',
 'Barbeque',
 'Bars',
 'Basque',
 'Bed & Breakfast',
 'Beer',
 'Beer Bar',
 'Beer Gardens',
 'Belgian',
 'Beverage Store',
 'Brazilian',
 'Breakfast & Brunch',
 'Brewpubs',
 'British',
 'Bubble Tea',
 'Buffets',
 'Burgers',
 'Cafes',
 'Cafeteria',
 'Cajun/Creole',
 'Candy Stores',
 'Cantonese',
 'Caribbean',
 'Caterers',
 'Champagne Bars',
 'Cheese Shops',
 'Cheesesteaks',
 'Chicken Shop',
 'Chicken Wings',
 'Chinese',
 'Chocolatiers & Shops',
 'Cocktail Bars',
 'Coffee & Tea',
 'Coffee Roasteries',
 'Coffeeshops',
 'Comfort Food',
 'Creperies',
 'Cuban',
 'Cupcakes',
 'Custom Cakes',
 'Delicatessen',
 'Delis',
 'Desserts',
 'Dim Sum',
 'Diners',
 'Dive Bars',
 'Do-It-Yourself Food',
 'Donuts',
 'Empanadas',
 'Ethiopian',
 'Ethnic Food',
 'Falafel',
 'Fast Food',
 'Fish & Chips',
 'Fondue',
 'Food',
 'Food Court',
 'Food Delivery Services',
 'Food Stands',
 'Food Tours',
 'Food Trucks',
 'French',
 'Fruits & Veggies',
 'Gay Bars',
 'Gelato',
 'German',
 'Gluten-Free',
 'Greek',
 'Halal',
 'Hawaiian',
 'Herbs & Spices',
 'Himalayan/Nepalese',
 'Honey',
 'Hong Kong Style Cafe',
 'Hookah Bars',
 'Hot Dogs',
 'Hot Pot',
 'Ice Cream & Frozen Yogurt',
 'Imported Food',
 'Indian',
 'Indonesian',
 'Internet Cafes',
 'Irish',
 'Irish Pub',
 'Italian',
 'Japanese',
 'Juice Bars & Smoothies',
 'Kebab',
 'Kombucha',
 'Korean',
 'Latin American',
 'Lebanese',
 'Live/Raw Food',
 'Lounges',
 'Macarons',
 'Mediterranean',
 'Mexican',
 'Middle Eastern',
 'Modern European',
 'Mongolian',
 'Moroccan',
 'Muay Thai',
 'New Mexican Cuisine',
 'Noodles',
 'Pakistani',
 'Pan Asian',
 'Pasta Shops',
 'Patisserie/Cake Shop',
 'Persian/Iranian',
 'Peruvian',
 'Pizza',
 'Poke',
 'Pop-Up Restaurants',
 'Pretzels',
 'Pubs',
 'Ramen',
 'Restaurants',
 'Salad',
 'Salvadoran',
 'Sandwiches',
 'Scandinavian',
 'Seafood',
 'Shaved Ice',
 'Soul Food',
 'Soup',
 'Southern',
 'Spanish',
 'Speakeasies',
 'Specialty Food',
 'Steakhouses',
 'Sushi Bars',
 'Szechuan',
 'Tacos',
 'Taiwanese',
 'Tapas Bars',
 'Tapas/Small Plates',
 'Tasting Classes',
 'Tea Rooms',
 'Tex-Mex',
 'Thai',
 'Themed Cafes',
 'Turkish',
 'Tuscan',
 'Vegan',
 'Vegetarian',
 'Vietnamese']

In [None]:
# method to identify a restaurant
def is_restaurant(list1, list2):
    for item in list1:
        if item in list2:
            return True
    return False
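
Checking `item in list2` scans the whole list for every category. A set-based variant performs the same test with constant-time lookups. This is a sketch, not the function used in this notebook: the name `has_food_label` and the three-item label set are hypothetical stand-ins.

```python
def has_food_label(categories, food_label_set):
    """Return True if any category in the iterable is present in food_label_set."""
    # isdisjoint is True when the two collections share no element.
    return not food_label_set.isdisjoint(categories)


# A three-item stand-in for the full food label list.
sample_food_labels = frozenset(['Restaurants', 'Pizza', 'Coffee & Tea'])

print(has_food_label('Pizza, Nightlife'.split(', '), sample_food_labels))
print(has_food_label('Shopping, Bookstores'.split(', '), sample_food_labels))
```

With the full list, the set would be built once, for example `frozenset(food_labels)`, and reused for every row.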

For each row, we look at every category in the categories column; if any of them matches our food list, we mark the business as a restaurant: Y for a restaurant, N for a non-restaurant business.

In [None]:
is_res = []
for categories in df_ca['categories']:
    cate_list = categories.split(', ')
    if is_restaurant(cate_list, food_labels):
        is_res.append('Y')
    else:
        is_res.append('N')

df_ca['is_restaurant'] = is_res
df_ca.head()
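
The flag column can also be computed column-wise with `Series.apply`, which avoids building a separate Python list. The sketch below uses a toy frame and a two-item label set, both made up for illustration, and assumes rows with missing categories have already been dropped.

```python
import pandas as pd

# Made-up stand-ins for the food label list and the California frame.
sample_labels = {'Restaurants', 'Pizza'}
toy = pd.DataFrame({'categories': ['Pizza, Nightlife', 'Shopping', 'Restaurants, Bars']})

# Mark a row 'Y' when at least one of its categories is in the label set.
toy['is_restaurant'] = toy['categories'].apply(
    lambda text: 'N' if sample_labels.isdisjoint(text.split(', ')) else 'Y'
)
restaurants_only = toy[toy['is_restaurant'] == 'Y']
print(restaurants_only)
```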

In [None]:
ca_res_df = df_ca[df_ca['is_restaurant'] =='Y']
#  creating California restaurant dataset, uncomment below line to write a csv
# ca_res_df.to_csv('ca_restaurants.csv')