## Dataset Introduction

[Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge)

The Challenge Dataset:

    5.2M reviews and 1.1M tips by 1.3M users for 174K businesses
    1.2M business attributes, e.g., hours, parking availability, ambience
    Aggregated check-ins over time for each of the 174K businesses
    200,000 pictures from the included businesses
    11 metropolitan areas

Cities:

    U.K.: Edinburgh
    Germany: Karlsruhe
    Canada: Montreal and Waterloo
    U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland
    
Files:

    yelp_academic_dataset_business.json
    yelp_academic_dataset_checkin.json
    yelp_academic_dataset_review.json
    yelp_academic_dataset_tip.json
    yelp_academic_dataset_user.json

## Read data from file and load to Pandas DataFrame

**Warning**: Loading all of the data into pandas at once takes a long time.
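One way to soften the load-time problem is to read a line-delimited JSON file in chunks and keep only the rows of interest from each chunk. The sketch below is an illustration under stated assumptions, not part of the original notebook: the file `sample_review.json`, its five rows, and the filter on `business_id` are all hypothetical stand-ins for the real review file.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical miniature stand-in for the review file: one JSON object per line
sample_rows = [
    {"review_id": "r1", "business_id": "b1", "stars": 5},
    {"review_id": "r2", "business_id": "b2", "stars": 3},
    {"review_id": "r3", "business_id": "b1", "stars": 4},
    {"review_id": "r4", "business_id": "b3", "stars": 1},
    {"review_id": "r5", "business_id": "b1", "stars": 2},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_review.json")
with open(sample_path, "w") as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")

# With lines=True and chunksize set, read_json returns an iterator of DataFrames
reader = pd.read_json(sample_path, lines=True, chunksize=2)

# Keep only the rows for business b1 from each two-line chunk
kept = [chunk[chunk["business_id"] == "b1"] for chunk in reader]
df_sample = pd.concat(kept, ignore_index=True)
print(df_sample)
```

Only the filtered rows of each chunk stay in memory, so peak memory depends on the chunk size and the size of the kept subset rather than on the whole file.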

In [15]:
import json
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [6]:
file_business, file_checkin, file_review, file_tip, file_user = [
    'business.json',
    'checkin.json',
    'review.json',
    'tip.json',
    'user.json',
]

#### Review

In [7]:
with open(file_review) as f:
    df_review = pd.DataFrame(json.loads(line) for line in f)

In [8]:
df_review.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,0W4lkclzZThpx3V65bVgig,0,2016-05-28,0,v0i_UHJMo_hPBq9bxWvW4w,5,"Love the staff, love the meat, love the place....",0,bv2nCi5Qv5vroFiqKGopiw
1,AEx2SYEUJmTxVVB18LlCwA,0,2016-05-28,0,vkVSCC7xljjrAI4UGfnKEQ,5,Super simple place but amazing nonetheless. It...,0,bv2nCi5Qv5vroFiqKGopiw
2,VR6GpWIda3SfvPC-lg9H3w,0,2016-05-28,0,n6QzIUObkYshz4dz2QRJTw,5,Small unassuming place that changes their menu...,0,bv2nCi5Qv5vroFiqKGopiw
3,CKC0-MOWMqoeWf6s-szl8g,0,2016-05-28,0,MV3CcKScW05u5LVfF6ok0g,5,Lester's is located in a beautiful neighborhoo...,0,bv2nCi5Qv5vroFiqKGopiw
4,ACFtxLv8pGrrxMm6EgjreA,0,2016-05-28,0,IXvOzsEMYtiJI0CARmj77Q,4,Love coming here. Yes the place always needs t...,0,bv2nCi5Qv5vroFiqKGopiw


#### Business

In [9]:
with open(file_business) as f:
    df_business = pd.DataFrame(json.loads(line) for line in f)

In [10]:
df_business.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,"4855 E Warner Rd, Ste B9","{'AcceptsInsurance': True, 'ByAppointmentOnly'...",FYWN1wneV18bWNgQjJ2GNg,"[Dentists, General Dentistry, Health & Medical...",Ahwatukee,"{'Friday': '7:30-17:00', 'Tuesday': '7:30-17:0...",1,33.33069,-111.978599,Dental by Design,,85044,22,4.0,AZ
1,3101 Washington Rd,"{'BusinessParking': {'garage': False, 'street'...",He-G7vWjzVUysIKrfNbPUQ,"[Hair Stylists, Hair Salons, Men's Hair Salons...",McMurray,"{'Monday': '9:00-20:00', 'Tuesday': '9:00-20:0...",1,40.291685,-80.1049,Stephen Szabo Salon,,15317,11,3.0,PA
2,"6025 N 27th Ave, Ste 1",{},KQPW8lFf1y5BT2MxiSZ3QA,"[Departments of Motor Vehicles, Public Service...",Phoenix,{},1,33.524903,-112.11531,Western Motor Vehicle,,85017,18,1.5,AZ
3,"5000 Arizona Mills Cr, Ste 435","{'BusinessAcceptsCreditCards': True, 'Restaura...",8DShNS-LuFqpEWIp0HxijA,"[Sporting Goods, Shopping]",Tempe,"{'Monday': '10:00-21:00', 'Tuesday': '10:00-21...",0,33.383147,-111.964725,Sports Authority,,85282,9,3.0,AZ
4,581 Howe Ave,"{'Alcohol': 'full_bar', 'HasTV': True, 'NoiseL...",PfOCPjBrlQAnz__NXj9h_w,"[American (New), Nightlife, Bars, Sandwiches, ...",Cuyahoga Falls,"{'Monday': '11:00-1:00', 'Tuesday': '11:00-1:0...",1,41.119535,-81.47569,Brick House Tavern + Tap,,44221,116,3.5,OH


#### Filter a subset

We keep only restaurants in Las Vegas with more than one review, because the full dataset is too large to process comfortably on a single machine.

In [29]:
cond_city = df_business['city'] == 'Las Vegas'
cond_category_not_null = ~df_business['categories'].isnull()
cond_category_restaurant = df_business['categories'].apply(str).str.contains('Restaurants')
cond_review_count = df_business['review_count'] > 1

In [42]:
df_vegas = df_business[cond_city & cond_category_not_null & cond_category_restaurant & cond_review_count]
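To see how the four conditions combine, here is a small worked example on made-up data. The frame `toy_business` and its five rows are hypothetical; only the column names and the filtering logic mirror the cell above.

```python
import pandas as pd

# Hypothetical rows, each chosen to fail (or pass) a different condition
toy_business = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "city": ["Las Vegas", "Las Vegas", "Phoenix", "Las Vegas", "Las Vegas"],
    "categories": [
        ["Restaurants", "Mexican"],   # passes every condition
        ["Dentists"],                 # not a restaurant
        ["Restaurants"],              # wrong city
        None,                         # no categories at all
        ["Italian", "Restaurants"],   # only one review
    ],
    "review_count": [10, 50, 30, 5, 1],
})

cond_city = toy_business["city"] == "Las Vegas"
cond_category_not_null = ~toy_business["categories"].isnull()
# apply(str) turns each list (or None) into text so str.contains can search it
cond_category_restaurant = toy_business["categories"].apply(str).str.contains("Restaurants")
cond_review_count = toy_business["review_count"] > 1

toy_vegas = toy_business[
    cond_city & cond_category_not_null & cond_category_restaurant & cond_review_count
]
print(toy_vegas["name"].tolist())
```

Because `str.contains` does a substring match on the stringified list, any category whose name contains the text "Restaurants" would also match.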

In [79]:
## Create category feature vector
# Collect every distinct category that appears in the Las Vegas subset
category_list = list({category for row in df_vegas['categories'] for category in row})

def feature_vector(category):
    # One entry per business: 1 if the business lists this category, else 0
    return [1 if category in row else 0 for row in df_vegas['categories']]

# One column per category, one row per business.
# Rows are numbered from 0 and no longer carry the index labels of df_vegas.
vec_list = [feature_vector(category) for category in category_list]
feature_df = pd.DataFrame(vec_list).transpose()
feature_df.columns = category_list
feature_df.head()

Unnamed: 0,Trainers,Family Practice,Car Dealers,Makeup Artists,Filipino,Acupuncture,Dim Sum,Breweries,Fireplace Services,Swimming Pools,...,Kebab,Buffets,Persian/Iranian,Street Vendors,Special Education,Festivals,Shanghainese,Scandinavian,Towing,Insurance
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
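The same kind of indicator table can also be built with pandas operations instead of nested Python loops. This is an alternative sketch, not the notebook's own method; the frame `toy` and its three businesses are hypothetical.

```python
import pandas as pd

# Hypothetical businesses, each with a list of categories
toy = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": [
        ["Restaurants", "Mexican"],
        ["Restaurants", "Italian", "Pizza"],
        ["Fast Food", "Restaurants"],
    ],
})

# explode() gives one row per (business, category) pair and repeats the index label
exploded = toy["categories"].explode()

# One dummy column per category, then collapse back to one row per business
indicators = (
    pd.get_dummies(exploded)
    .groupby(level=0)
    .sum()
    .clip(upper=1)   # guard against a category listed twice for one business
    .astype(int)
)
print(indicators)
```

Unlike the loop version, the result keeps the index labels of the input frame, so it can be joined back to the business rows by index.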


In [155]:
# Merge the two American categories into one indicator; clip so that a business
# tagged with both still gets 1 rather than 2
feature_df['American'] = (feature_df['American (Traditional)'] + feature_df['American (New)']).clip(upper=1)
# 'Others' is 1 only when none of the six selected cuisines applies
others = feature_df[['American', 'Fast Food', 'Mexican', 'Chinese', 'Italian', 'Japanese']].sum(axis=1)
feature_df['Others'] = (others == 0).astype(int)

In [174]:
selected_feature = ['American', 'Fast Food', 'Mexican', 'Chinese', 'Italian', 'Japanese', 'Others']
# Copy the slice so that later column assignments modify df_feature itself
df_feature = feature_df[selected_feature].copy()
df_feature.columns = ['american', 'fast_food', 'mexican', 'chinese', 'italian', 'japanese', 'others']

In [175]:
df_feature.head()

Unnamed: 0,american,fast_food,mexican,chinese,italian,japanese,others
0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0
3,0,0,0,0,1,0,0
4,0,1,0,0,0,0,0


In [169]:
## Create attribute feature vector
def to_flag(value):
    # 1 only for an explicit True; False, None, and missing attributes become 0
    return 1 if value is True else 0

def ambience(row, key):
    # 'Ambience' is a nested dict; a business without it yields None
    return row.get('Ambience', {}).get(key)

attributes = df_vegas['attributes']

romantic = [to_flag(ambience(row, 'romantic')) for row in attributes]
classy = [to_flag(ambience(row, 'classy')) for row in attributes]
upscale = [to_flag(ambience(row, 'upscale')) for row in attributes]
casual = [to_flag(ambience(row, 'casual')) for row in attributes]
open_24 = [to_flag(row.get('Open24Hours')) for row in attributes]

# wifi is computed here but never added to df_feature below.
# If 'WiFi' holds strings rather than booleans, every entry is 0.
wifi = [to_flag(row.get('WiFi')) for row in attributes]

# Encode the noise level as a number; any other value, including a missing one, becomes 4
noise = [row.get('NoiseLevel') for row in attributes]
noise_codes = {'quiet': 1, 'average': 2, 'loud': 3}
noise_level = [noise_codes.get(x, 4) for x in noise]

Unnamed: 0,casual,classy,noise_level,open_24,romantic,upscale
0,1,0,2,0,0,0
1,0,0,4,0,0,0
2,0,0,4,0,0,0
3,0,0,4,0,0,0
4,0,0,4,0,0,0


In [218]:
pd.options.mode.chained_assignment = None 
df_feature.loc[:,'romantic'] = romantic
df_feature.loc[:,'casual'] = casual
df_feature.loc[:,'classy'] = classy
df_feature.loc[:,'upscale'] = upscale
df_feature.loc[:,'open_24'] = open_24
df_feature.loc[:,'noise_level'] = noise_level
df_feature.loc[:,'business_id'] = df_vegas['business_id'].values

In [219]:
df_feature.head()

Unnamed: 0,american,fast_food,mexican,chinese,italian,japanese,others,romantic,casual,classy,upscale,open_24,noise_level,business_id
0,1,0,0,0,0,0,0,0,1,0,0,0,2,Pd52CjgyEU3Rb8co6QfTPw
1,0,1,0,0,0,0,0,0,0,0,0,0,4,4srfPk1s8nlm1YusyDUbjg
2,1,0,0,0,0,0,0,0,0,0,0,0,4,n7V4cD-KqqE3OXk0irJTyA
3,0,0,0,0,1,0,0,0,0,0,0,0,4,F0fEKpTk7gAmuSFI0KW1eQ
4,0,1,0,0,0,0,0,0,0,0,0,0,4,Wpt0sFHcPtV5MO9He7yMKQ


In [220]:
df_vegas.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
52,6730 S Las Vegas Blvd,"{'Alcohol': 'full_bar', 'HasTV': True, 'NoiseL...",Pd52CjgyEU3Rb8co6QfTPw,"[Nightlife, Bars, Barbeque, Sports Bars, Ameri...",Las Vegas,"{'Monday': '8:30-22:30', 'Tuesday': '8:30-22:3...",1,36.066914,-115.170848,Flight Deck Bar & Grill,Southeast,89119,13,4.0,NV
53,"6889 S Eastern Ave, Ste 101","{'GoodForMeal': {'dessert': False, 'latenight'...",4srfPk1s8nlm1YusyDUbjg,"[Fast Food, Restaurants, Sandwiches]",Las Vegas,{},1,36.064652,-115.118954,Subway,Southeast,89119,6,2.5,NV
54,"6587 Las Vegas Blvd S, Ste 171","{'RestaurantsTableService': True, 'GoodForMeal...",n7V4cD-KqqE3OXk0irJTyA,"[Arcades, Arts & Entertainment, Gastropubs, Re...",Las Vegas,"{'Monday': '11:00-0:00', 'Tuesday': '11:00-0:0...",1,36.068259,-115.178877,GameWorks,Southeast,89119,349,3.0,NV
91,"4250 S Rainbow Blvd, Ste 1007","{'GoodForMeal': {'dessert': False, 'latenight'...",F0fEKpTk7gAmuSFI0KW1eQ,"[Italian, Restaurants]",Las Vegas,{},0,36.111057,-115.241688,Cafe Mastrioni,Spring Valley,89103,3,1.5,NV
122,3020 E Desert Inn Rd,"{'RestaurantsTableService': False, 'GoodForMea...",Wpt0sFHcPtV5MO9He7yMKQ,"[Restaurants, Fast Food, Burgers]",Las Vegas,"{'Monday': '0:00-0:00', 'Tuesday': '0:00-0:00'...",1,36.130013,-115.10931,McDonald's,Eastside,89121,20,2.0,NV
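The `df_vegas.head()` output keeps the row labels of the full business table (52, 53, 54, ...), while `df_feature` is numbered from 0. That is why `business_id` is assigned through `.values`: pandas aligns a Series on index labels, not on position. The sketch below shows the difference on two hypothetical frames, `subset` and `features`.

```python
import pandas as pd

# Hypothetical filtered frame that kept its original, non-contiguous labels
subset = pd.DataFrame({"business_id": ["x", "y", "z"]}, index=[52, 53, 91])

# Hypothetical feature frame with the default 0, 1, 2 index
features = pd.DataFrame({"american": [1, 0, 0]})

# Assigning the Series aligns on labels; no label matches, so every value is NaN
aligned = features.copy()
aligned["business_id"] = subset["business_id"]

# Assigning the underlying array ignores labels and fills by position
positional = features.copy()
positional["business_id"] = subset["business_id"].values

print(aligned)
print(positional)
```

Positional assignment is only safe when both frames have the same number of rows in the same order, which holds here because the feature lists are built by iterating over `df_vegas` itself.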


In [221]:
### Write df_feature to csv
df_feature.to_csv('df_feature.csv', index=False, encoding='utf-8')
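A quick round trip shows what `index=False` does. The sketch writes a hypothetical two-row frame `toy_feature` to in-memory buffers instead of a file. Without `index=False`, the unnamed index is written as an extra column that `read_csv` labels `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical two-row feature table
toy_feature = pd.DataFrame({
    "american": [1, 0],
    "noise_level": [2, 4],
    "business_id": ["b1", "b2"],
})

# Written without the index: the columns come back unchanged
buf = io.StringIO()
toy_feature.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)

# Written with the default index=True: an extra leading column appears on reload
buf_with_index = io.StringIO()
toy_feature.to_csv(buf_with_index)
buf_with_index.seek(0)
restored_with_index = pd.read_csv(buf_with_index)

print(restored.columns.tolist())
print(restored_with_index.columns.tolist())
```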