# **Yelp Dataset EDA**
---
By Kostas Hatalis (<kostas@gaudium.ai>)

Data source: https://www.kaggle.com/yelp-dataset/yelp-dataset

## Description of Data

The business table has the following attributes:

* **business_id**: ID of the business
* **name**: name of the business
* **neighborhood**: neighborhood of the business
* **address**: address of the business
* **city**: city of the business
* **state**: state of the business
* **postal_code**: postal code of the business
* **latitude**: latitude of the business
* **longitude**: longitude of the business
* **stars**: average rating of the business
* **review_count**: number of reviews received
* **is_open**: 1 if the business is open, 0 otherwise
* **categories**: multiple categories of the business
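
In the sample rows shown later in this notebook, `categories` arrives as a single comma-separated string rather than a list. A minimal sketch of splitting it into individual category names is given here; `split_categories` is a hypothetical helper and not part of the original notebook, and treating any non-string value (such as a missing entry) as "no categories" is an assumption.

```python
def split_categories(categories):
    """Split a raw `categories` string into a list of category names.

    Hypothetical helper: non-string input (None, NaN) yields an empty list.
    """
    if not isinstance(categories, str):
        return []
    # Split on commas and drop surrounding whitespace and empty fragments.
    return [part.strip() for part in categories.split(",") if part.strip()]


# Value copied from the first sample business row.
sample = "Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping"
category_list = split_categories(sample)
print(category_list)
```

With pandas, the same helper could be applied column-wise, for example `businesses['categories'].apply(split_categories)`.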

The review table has the following attributes:

* **review_id**: ID of the review
* **user_id**: ID of the user
* **business_id**: ID of the business
* **stars**: star rating given to the business in this review
* **date**: review date
* **text**: review from the user
* **useful**: number of users who voted the review as useful
* **funny**: number of users who voted the review as funny
* **cool**: number of users who voted the review as cool

The user table has the following attributes:

* **average_stars**: mean star rating across the user's reviews
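* **compliment_cool**: number of "cool" compliments received by the user
* **compliment_cute**: number of "cute" compliments received by the user
* **compliment_funny**: number of "funny" compliments received by the user
* **compliment_hot**: number of "hot" compliments received by the user
* **compliment_list**: number of "list" compliments received by the user
* **compliment_more**: number of "more" compliments received by the user
* **compliment_note**: number of "note" compliments received by the user
* **compliment_photos**: number of "photo" compliments received by the user
* **compliment_plain**: number of "plain" compliments received by the user
* **compliment_profile**: number of "profile" compliments received by the user
* **compliment_writer**: number of "writer" compliments received by the user
* **cool**: number of cool votes sent by the user
* **elite**: years in which the user had elite status
* **fans**: number of fans the user has
* **friends**: user IDs of the user's friends
* **funny**: number of funny votes sent by the user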
* **name**: user's name
* **review_count**: number of reviews the user wrote
* **useful**: number of useful votes sent by the user
* **user_id**: ID of the user
* **yelping_since**: date user started yelping



## Load in Datasets

In [23]:
import json
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
# None shows full cell contents (-1 is deprecated since pandas 1.0).
pd.set_option('display.max_colwidth', None)

In [16]:
from itertools import islice

def load_rows(filepath, nrows=None):
    """
    Load a line-delimited JSON file (one object per line) into a DataFrame.

    Args:
        filepath (str): full path of the dataset file.
        nrows (int, optional): max number of rows to load; None loads every row.

    Returns:
        pandas.DataFrame with one row per JSON object.
    """
    with open(filepath, encoding="utf8") as json_file:
        # islice stops after nrows lines, or at end of file when nrows is None.
        objs = [json.loads(line) for line in islice(json_file, nrows)]
    return pd.DataFrame(objs)
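
`load_rows` parses one JSON object per line and collects every object before building the DataFrame. For quick aggregate checks on the larger files, the same line-by-line reading can be done lazily. The sketch below is one way to do that with the standard library only; `iter_rows` is a hypothetical helper, and the three records written to the temporary file are made up for illustration.

```python
import json
import os
import tempfile
from collections import Counter
from itertools import islice


def iter_rows(filepath, nrows=None):
    """Yield one parsed JSON object per line, stopping after `nrows` lines."""
    with open(filepath, encoding="utf8") as json_file:
        for line in islice(json_file, nrows):
            yield json.loads(line)


# Build a tiny stand-in file; these records are invented for the example.
records = [
    {"review_id": "r1", "stars": 5.0},
    {"review_id": "r2", "stars": 2.0},
    {"review_id": "r3", "stars": 5.0},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
path = tmp.name

# Count star ratings without holding all rows in memory at once.
star_counts = Counter(row["stars"] for row in iter_rows(path))
# Peek at the first two rows only.
first_two = list(iter_rows(path, nrows=2))
os.remove(path)

print(star_counts)
print(first_two)
```

The generator trades convenience for memory: nothing is kept unless the caller keeps it, which can help when only a count or a small sample is needed.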

In [29]:
%%time
import os

data_path = "C:\\Data\\Yelp"

# os.path.join keeps the path separator consistent for all three files.
businesses = load_rows(os.path.join(data_path, 'yelp_academic_dataset_business.json'))
print('Business objects loaded. Count = {}'.format(businesses.shape[0]))

reviews = load_rows(os.path.join(data_path, 'yelp_academic_dataset_review.json'))
print('Review objects loaded. Count = {}'.format(reviews.shape[0]))

users = load_rows(os.path.join(data_path, 'yelp_academic_dataset_user.json'))
print('User objects loaded. Count = {}'.format(users.shape[0]))

Business objects loaded. Count = 209393
Review objects loaded. Count = 8021122
User objects loaded. Count = 8021122
Wall time: 2min 42s


## Example rows of each table
### Business table

In [33]:
businesses.head(2)

| | address | attributes | business_id | categories | city | hours | is_open | latitude | longitude | name | postal_code | review_count | stars | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10913 Bailey Rd | {'BusinessAcceptsCreditCards': 'True', 'BikeParking': 'True', 'GoodForKids': 'False', 'BusinessParking': '{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}', 'ByAppointmentOnly': 'False', 'RestaurantsPriceRange2': '3'} | f9NumwFMBDn751xgFiRbNA | Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping | Cornelius | {'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0', 'Wednesday': '10:0-18:0', 'Thursday': '11:0-20:0', 'Friday': '11:0-20:0', 'Saturday': '11:0-20:0', 'Sunday': '13:0-18:0'} | 1 | 35.462724 | -80.852612 | The Range At Lake Norman | 28031 | 36 | 3.5 | NC |
| 1 | 8880 E Via Linda, Ste 107 | {'GoodForKids': 'True', 'ByAppointmentOnly': 'True'} | Yzvjg0SayhoZgCljUJRF9Q | Health & Medical, Fitness & Instruction, Yoga, Active Life, Pilates | Scottsdale | | 1 | 33.569404 | -111.890264 | Carlos Santo, NMD | 85258 | 4 | 5.0 | AZ |
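
In the first sample row, the values inside `attributes` appear to be strings: booleans show up as `'True'` and `'False'`, the price range as `'3'`, and `BusinessParking` as a dictionary written out as text. A minimal sketch of converting such values into Python objects with `ast.literal_eval` follows. `parse_attributes` is a hypothetical helper, the string encoding is an assumption read off the sample row, and any value that is not a valid Python literal is left unchanged.

```python
import ast


def parse_attributes(attributes):
    """Convert string-encoded attribute values into Python objects.

    Hypothetical helper: non-dict input (None, NaN) yields an empty dict,
    and values that are not valid Python literals stay as they are.
    """
    if not isinstance(attributes, dict):
        return {}
    parsed = {}
    for key, value in attributes.items():
        try:
            parsed[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            parsed[key] = value
    return parsed


# Subset of the attributes from the first sample business row.
sample = {
    "BusinessAcceptsCreditCards": "True",
    "BusinessParking": "{'garage': False, 'street': False, "
                       "'validated': False, 'lot': True, 'valet': False}",
    "RestaurantsPriceRange2": "3",
}
parsed = parse_attributes(sample)
print(parsed["BusinessParking"]["lot"], parsed["RestaurantsPriceRange2"])
```

`ast.literal_eval` only evaluates literals (strings, numbers, booleans, `None`, and containers of these), so it is a safer choice than `eval` for text coming from a data file.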


### Reviews table

In [25]:
reviews.head(2)

| | business_id | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -MhfebM0QIsKt87iDN-FNw | 0 | 2015-04-15 05:21:16 | 0 | xQY8N_XvtGbearJ5X4QryQ | 2.0 | As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It's what real estate agents would call "cozy" or "charming" - basically any euphemism for small.\n\nThat being said, you can still see wonderful art at a gallery of any size, so why the two \*s you ask? Let me tell you:\n\n\* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\n\* it's not kid friendly at all. Seriously, don't bring them.\n\* the security is not trained properly for the show. When the curating and design teams collaborate for exhibitions, there is a definite flow. That means visitors should view the art in a certain sequence, whether it be by historical period or cultural significance (this is how audio guides are usually developed). When I arrived in the gallery I could not tell where to start, and security was certainly not helpful. I was told to "just look around" and "do whatever." \n\nAt such a \*fine\* institution, I find the lack of knowledge and respect for the art appalling. | 5 | OwjRMXRC0KyPrIlcjaXeFQ |
| 1 | lbrU8StCq3yDfr-QMnGrmQ | 0 | 2013-12-07 03:16:52 | 1 | UmFMZ8PyXZTY2QcwzsfQYA | 1.0 | I am actually horrified this place is still in business. My 3 year old son needed a haircut this past summer and the lure of the $7 kids cut signs got me in the door. We had to wait a few minutes as both stylists were working on people. The decor in this place is total garbage. It is so tacky. The sofa they had at the time was a pleather sofa with giant holes in it. And my son noticed ants crawling all over the floor and the furniture. It was disgusting and I should have walked out then. Actually, I should have turned around and walked out upon entering but I didn't. So the older black male stylist finishes the haircut he was doing and it's our turn. I tell him I want a #2 clipper around the back and sides and then hand cut the top into a standard boys cut. Really freaking simple, right? WRONG! Rather than use the clippers and go up to actually cut the hair, he went down. Using it moving downward doesn't cut hair, it just rubs against it. How does this man who has an alleged cosmetology license not know how to use a set of freaking clippers??? I realized almost immediately that he had no idea what he was doing. No idea at all. After about 10 minutes of watching this guy stumble through it, I said "you know what? That's fine.", paid and left. All I wanted to do was get out of that scummy joint and take my son to a real haircut place.\n\nBottom line: DO NOT GO HERE. RUN THE OTHER WAY!!!!! | 1 | nIJD_7ZXHq-FX8byPMOkMQ |
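
The `date` values in the sample rows are plain strings such as `2015-04-15 05:21:16`. A small sketch of turning them into `datetime` objects for time-based analysis is shown here. The format string is inferred from the two sample rows only, and `parse_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Format inferred from the sample rows, e.g. "2015-04-15 05:21:16".
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value):
    """Parse a timestamp string in DATE_FORMAT into a datetime object."""
    return datetime.strptime(value, DATE_FORMAT)


# Dates copied from the two sample review rows.
first_review = parse_timestamp("2015-04-15 05:21:16")
second_review = parse_timestamp("2013-12-07 03:16:52")
print(first_review.year, second_review.year)
print(first_review > second_review)
```

On a full DataFrame, `pd.to_datetime(reviews['date'])` would be the usual vectorized route.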


### User table

In [32]:
users.head(1)

| | average_stars | compliment_cool | compliment_cute | compliment_funny | compliment_hot | compliment_list | compliment_more | compliment_note | compliment_photos | compliment_plain | compliment_profile | compliment_writer | cool | elite | fans | friends | funny | name | review_count | useful | user_id | yelping_since |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.57 | 22 | 0 | 22 | 3 | 1 | 2 | 11 | 0 | 15 | 1 | 10 | 227 | | 14 | oeMvJh94PiGQnx_6GlndPQ, wm1z1PaJKvHgSDRKfwhfDg, IkRib6Xs91PPW7pon7VVig, A8Aq8f0-XvLBcyMk2GJdJQ, eEZM1kogR7eL4GOBZyPvBA, e1o1LN7ez5ckCpQeAab4iw, _HrJVzFaRFUhPva8cwBjpQ, pZeGZGzX-ROT_D5lam5uNg, 0S6EI51ej5J7dgYz3-O0lA, woDt8raW-AorxQM_tIE2eA, hWUnSE5gKXNe7bDc8uAG9A, c_3LDSO2RHwZ94_Q6j_O7w, -uv1wDiaplY6eXXS0VwQiA, QFjqxXn3acDC7hckFGUKMg, ErOqapICmHPTN8YobZIcfQ, mJLRvqLOKhqEdkgt9iEaCQ, VKX7jlScJSA-ja5hYRw12Q, ijIC9w5PRcj3dWVlanjZeg, CIZGlEw-Bp0rmkP8M6yQ9Q, OC6fT5WZ8EU7tEVJ3bzPBQ, UZSDGTDpycDzrlfUlyw2dQ, deL6e_z9xqZTIODKqnvRXQ, 5mG2ENw2PylIWElqHSMGqg, Uh5Kug2fvDd51RYmsNZkGg, 4dI4uoShugD9z84fYupelQ, EQpFHqGT9Tk6YSwORTtwpg, o4EGL2-ICGmRJzJ3GxB-vw, s8gK7sdVzJcYKcPv2dkZXw, vOYVZgb_GVe-kdtjQwSUHw, wBbjgHsrKr7BsPBrQwJf2w, p59u2EC_qcmCmLeX1jCi5Q, VSAZI1eHDrOPRWMK4Q2DIQ, efMfeI_dkhpeGykaRJqxfQ, x6qYcQ8_i0mMDzSLsFCbZg, K_zSmtNGw1fu-vmxyTVfCQ, 5IM6YPQCK-NABkXmHhlRGQ, U_w8ZMD26vnkeeS1sD7s4Q, AbfS_oXF8H6HJb5jFqhrLw, hbcjX4_D4KIfonNnwrH-cg, UKf66_MPz0zHCP70mF6p1g, hK2gYbxZRTqcqlSiQQcrtQ, 2Q45w_Twx_T9dXqlE16xtQ, BwRn8qcKSeA77HLaOTbfiQ, jouOn4VS_DtFPtMR2w8VDA, ESteyJabbfvqas6CEDs3pQ | 225 | Rafael | 553 | 628 | ntlvfPzc8eglqvk92iDIAw | 2007-07-06 03:27:11 |
