<div id="container" style="position:relative;">
<div style="text-align:center"><h1> Yelp Review Analysis </h1></div>
<div style="position:relative; float:right"><img style="height:65px" src ="https://drive.google.com/uc?export=view&id=1EnB0x-fdqMp6I5iMoEBBEuxB_s7AmE2k" />
</div>
<div style="position:relative; float:left"><h4> by Lavanya Kwatra (August, 2020) </h4></div>
</div>

## Introduction:

Founded in October 2004, Yelp is an online platform that publishes crowd-sourced reviews of businesses; it has more recently added Yelp Reservations, an online booking service. Headquartered in San Francisco, California, Yelp develops, hosts and markets its website (www.yelp.com) as well as the Yelp mobile app. Yelp prides itself on its motto of “connecting people with great local businesses,” which creates a marketplace within its user community where the social perception of the businesses it advertises can be studied. (Yelp, Inc., n.d.)

As a result of this marketplace model, businesses effectively advertise to new customers through the user-generated reviews on Yelp and rely heavily on consumer engagement, such as the “useful” votes that other users give to a review of their establishment. Yelp generates insights from its online community of users, who write reviews and rate businesses on a scale of 1 to 5 while interacting with other users and the posted content in a social networking format.

A traditional consumer-centric approach would focus on tracking user activity levels, such as:
1. the total number of reviews written by the user,
2. whether the user leaves "tips" or suggestions for the restaurant online,
3. whether the user checks into restaurants when they dine there,
4. the amount of time the user spends on the platform,
5. the user’s social interaction with other users through “votes” on reviews,
6. whether the user actively finds friends in the user community,
7. whether the user follows top Yelpers or similar users.


People most commonly use Yelp to search for a new restaurant and to judge it by how other users found it. This perception is captured in the “votes” that users give to reviews posted by others in the Yelp community. According to a [study](https://kanuparthy.files.wordpress.com/2016/05/cscwf630.pdf) by Saeideh Bakhshi, Partha Kanuparthy and David A. Shamma at Yahoo Labs, a deeper understanding of the social perception of these "votes" as user feedback can help design better recommendation and social networks. Using Yelp review data from 2013, they found that active and older members tend to engage more on the platform, both by writing more reviews and by reacting to other reviews. Reviews with a higher word count are perceived as more "useful" overall by the community, reviews with higher ratings are perceived as "cool", and reviews with a negative tone are more likely to be voted "funny" by other users. Additionally, they found that new businesses can generate a "review momentum" when their initial reviews receive more "useful" votes, which helps establish the business. (Bakhshi, Kanuparthy, & Shamma, 2013)

------

## Data Selection

Yelp’s website provides a well-documented dataset of six .json files: business, user, review, tips, check-ins and photos.

Because the goal is to study the social perception of restaurants in the Yelp user community, only the relevant files, `review.json`, `business.json` and `user.json`, were downloaded to the local drive. Together, the three files are almost 11 GB in size and contain over 8 million reviews for 200,000+ businesses across North America.

Note: the cells below take a while to run and require first downloading the .json files, which are hosted in an S3 bucket named `yelp` on AWS.
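Because the files are large, it may help to read them in pieces rather than all at once. The sketch below is not what this notebook does; it only illustrates the `chunksize` option of `pd.read_json` on a tiny in-memory stand-in for a JSON Lines file. The toy records, the chunk size of 2 and the name `reviews_small` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large JSON Lines file such as review.json:
# one JSON object per line
json_lines = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great"}\n'
    '{"review_id": "r2", "stars": 1, "text": "Awful"}\n'
    '{"review_id": "r3", "stars": 4, "text": "Good"}\n'
)

# With lines=True and chunksize set, read_json yields DataFrames
# of at most `chunksize` rows instead of loading everything at once
chunks = pd.read_json(json_lines, lines=True, chunksize=2)

# Keep only the columns of interest from each chunk, then stitch them together
frames = [chunk[["review_id", "stars"]] for chunk in chunks]
reviews_small = pd.concat(frames, ignore_index=True)

print(reviews_small.shape)
```

With the real files, the `io.StringIO` object would be replaced by the file path, and a much larger chunk size would likely be more practical.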



In [1]:
# reading in the datasets 
# downloaded from Yelp as .json files
# importing pandas to read the files into a dataframe
import pandas as pd

In [2]:
# reading in business attributes data
business = pd.read_json(r'business.json', lines=True)

In [3]:
# reading in user attributes data
user = pd.read_json(r'user.json', lines=True)

In [4]:
# reading in review attributes data
review = pd.read_json(r'review.json', lines=True)

In [5]:
# viewing the head of the user dataframe
user.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,ntlvfPzc8eglqvk92iDIAw,Rafael,553,2007-07-06 03:27:11,628,225,227,,"oeMvJh94PiGQnx_6GlndPQ, wm1z1PaJKvHgSDRKfwhfDg...",14,...,2,1,0,1,11,15,22,22,10,0
1,FOBRPlBHa3WPHFB5qYDlVg,Michelle,564,2008-04-28 01:29:25,790,316,400,200820092010201120122013,"ly7EnE8leJmyqyePVYFlug, pRlR63iDytsnnniPb3AOug...",27,...,4,5,2,1,33,37,63,63,21,5
2,zZUnPeh2hEp0WydbAZEOOg,Martin,60,2008-08-28 23:40:05,151,125,103,2010,"Uwlk0txjQBPw_JhHsQnyeg, Ybxr1tSCkv3lYA0I1qmnPQ...",5,...,6,0,1,0,3,7,17,17,4,1
3,QaELAmRcDc5TfJEylaaP8g,John,206,2008-09-20 00:08:14,233,160,84,2009,"iog3Nyg1i4jeumiTVG_BSA, M92xWY2Vr9w0xoH8bPplfQ...",6,...,1,0,0,0,7,14,7,7,2,0
4,xvu8G900tezTzbbfqmTKvA,Anne,485,2008-08-09 00:30:27,1265,400,512,200920102011201220142015201620172018,"3W3ZMSthojCUirKEqAwGNw, eTIbuu23j9tOgmIa9POyLQ...",78,...,9,2,1,1,22,28,31,31,19,31


In [6]:
# viewing the head of the business dataframe
business.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh...","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'..."
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,...",
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",
3,6OAZjbxqM5ol29BuHsil3w,Nevada House of Hose,1015 Sharp Cir,North Las Vegas,NV,89030,36.219728,-115.127725,2.5,3,0,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Hardware Stores, Home Services, Building Suppl...","{'Monday': '7:0-16:0', 'Tuesday': '7:0-16:0', ..."
4,51M2Kk903DFYI6gnB5I6SQ,USE MY GUY SERVICES LLC,4827 E Downing Cir,Mesa,AZ,85205,33.428065,-111.726648,4.5,26,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Home Services, Plumbing, Electricians, Handyma...","{'Monday': '0:0-0:0', 'Tuesday': '9:0-16:0', '..."


In [7]:
# viewing the head of the review dataframe
review.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16
1,UmFMZ8PyXZTY2QcwzsfQYA,nIJD_7ZXHq-FX8byPMOkMQ,lbrU8StCq3yDfr-QMnGrmQ,1,1,1,0,I am actually horrified this place is still in...,2013-12-07 03:16:52
2,LG2ZaYiOgpr2DK_90pYjNw,V34qejxNsCbcgD8C0HVk-Q,HQl28KMwrEKHqhFrrDqVNQ,5,1,0,0,I love Deagan's. I do. I really do. The atmosp...,2015-12-05 03:18:11
3,i6g_oA9Yf9Y31qt0wibXpw,ofKDkJKXSKZXu5xJNGiiBQ,5JxlZaqCnk1MnbgRirs40Q,1,0,0,0,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",2011-05-27 05:30:52
4,6TdNDKywdbjoTkizeMce8A,UgMW8bLE0QMJDCkQ1Ax5Mg,IS4cv902ykd8wj1TR0N3-A,4,0,0,0,"Oh happy day, finally have a Canes near my cas...",2017-01-14 21:56:57


In [None]:
# Let's view the shape of each of these dataframes:

In [8]:
business.shape

(209393, 14)

In [9]:
user.shape

(1968703, 22)

In [10]:
review.shape

(8021122, 9)

From the above, the following conclusions are drawn:

- We have 8,021,122 reviews with 9 corresponding review features, 209,393 businesses with 14 corresponding business attributes and a user base of 1,968,703 users with 22 corresponding user dimensions.

- Each review has a primary key, `review_id`, along with a `business_id` that relates it to the reviewed business and a `user_id` that relates it to the user who wrote the review.

- To create a dataset suited to a model that classifies sentiment based on a cluster analysis of users, the three files need to be merged so that all the attributes pertaining to a review sit in one row of the dataframe.
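The key relationships described above can be sketched on toy data. This is a minimal illustration, not the notebook's exact call: the three small frames and their values are made up, the merge is done on plain columns rather than on an index, and `validate="many_to_one"` is an optional extra check that each business and user appears only once on the right-hand side.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
review_toy = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b2", "b1"],
    "stars": [2, 1, 5],            # rating given in the review
})
business_toy = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Gallery", "Hotel"],
    "stars": [3.5, 3.0],           # average rating of the business
})
user_toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "name": ["Jamie", "Girl"],
})

# Many reviews map to one business, and many reviews map to one user
merged = review_toy.merge(business_toy, on="business_id", validate="many_to_one")
merged = merged.merge(user_toy, on="user_id", validate="many_to_one")

# Columns that share a name across frames get the default _x / _y suffixes
print(sorted(merged.columns))
```

Overlapping column names such as `stars` and `name` come out with `_x` (left frame) and `_y` (right frame) suffixes, so `stars_x` is the review's rating and `stars_y` is the business's average rating.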

In [11]:
# merging reviews with business on business_id 
# Note: set index as the business_id to avoid adding the column again
review = review.merge(business.set_index('business_id'), on='business_id')

In [12]:
# merging the new review with business attributes with user on user_id
# Note: set index as the user_id to avoid adding the column again
review = review.merge(user.set_index('user_id'), on='user_id')

In [13]:
# viewing the shape of the new complete review dataframe 
# with the corresponding business & user attributes
# Note that the number of reviews is the same, therefore, same # of rows
# All the columns from the two dataframes have been added to review
review.shape

(8021122, 43)
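The column count of 43 can be checked by hand from the shapes reported earlier. The short calculation below assumes only those three shapes and the fact that each merge key already exists in `review`.

```python
# Column counts taken from the shapes reported above
review_cols, business_cols, user_cols = 9, 14, 22

# Each merge key (business_id, user_id) is already present in review,
# so each merge adds one column fewer than the right-hand frame has
expected_cols = review_cols + (business_cols - 1) + (user_cols - 1)
print(expected_cols)  # 43
```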

In [14]:
# to view all column names in the head
pd.set_option('display.max_columns', None)

In [15]:
# viewing head of the dataframe for sanity check
review.head()

Unnamed: 0,review_id,user_id,business_id,stars_x,useful_x,funny_x,cool_x,text,date,name_x,address,city,state,postal_code,latitude,longitude,stars_y,review_count_x,is_open,attributes,categories,hours,name_y,review_count_y,yelping_since,useful_y,funny_y,cool_y,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16,Bellagio Gallery of Fine Art,3600 S Las Vegas Blvd,Las Vegas,NV,89109,36.112896,-115.177637,3.5,180,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Shopping, Arts & Entertainment, Art Galleries,...","{'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'...",Jamie,58,2008-12-18 09:41:38,85,21,28,,"m2R7pydiZ7vezHD-oZt6-g, EpAkqFJLjztmlAscrTfh_Q...",4,3.36,3,1,0,0,0,2,5,6,6,0,0
1,SjfnCrMCgOiWafnQuCKlhw,OwjRMXRC0KyPrIlcjaXeFQ,9SU7ZZhaFUJJ6m2k5HKHeg,1,4,0,0,SLS just opened in August and they have so man...,2015-03-19 06:17:28,Sahara,2535 Las Vegas Blvd S,Las Vegas,NV,89109,36.142375,-115.156723,3.0,2259,1,"{'BikeParking': 'False', 'Ambience': '{'romant...","Hotels & Travel, Nightlife, Hotels, Event Plan...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...",Jamie,58,2008-12-18 09:41:38,85,21,28,,"m2R7pydiZ7vezHD-oZt6-g, EpAkqFJLjztmlAscrTfh_Q...",4,3.36,3,1,0,0,0,2,5,6,6,0,0
2,t7xOZF5UKXjSpVcXLOSAgw,owbC7FP8SNAlwv6f9S5Stw,-MhfebM0QIsKt87iDN-FNw,2,2,2,0,I have been there. I believe more than once. \...,2014-03-14 08:24:25,Bellagio Gallery of Fine Art,3600 S Las Vegas Blvd,Las Vegas,NV,89109,36.112896,-115.177637,3.5,180,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","Shopping, Arts & Entertainment, Art Galleries,...","{'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'...",Girl,250,2012-09-28 19:36:16,507,265,185,,"3tLPT5PzOemSed-ERxwSQg, TuivYYtdK6y2uOgbE2axXA...",8,3.42,0,1,0,0,0,7,6,5,5,1,0
3,WE2gu5ww_FOR8IYWb10GFQ,owbC7FP8SNAlwv6f9S5Stw,4ZbRwCB9oGibxK21MUZKHA,2,1,3,0,This Smiths sucks ass.\nThey invented Yelp for...,2015-06-23 06:30:12,Smith's,9750 S Maryland Pkwy,Las Vegas,NV,89183,36.012129,-115.134257,2.5,96,1,"{'RestaurantsTakeOut': 'True', 'BikeParking': ...","Beauty & Spas, Delis, Grocery, Cosmetics & Bea...","{'Monday': '6:0-1:0', 'Tuesday': '6:0-1:0', 'W...",Girl,250,2012-09-28 19:36:16,507,265,185,,"3tLPT5PzOemSed-ERxwSQg, TuivYYtdK6y2uOgbE2axXA...",8,3.42,0,1,0,0,0,7,6,5,5,1,0
4,Djql0bAS55_sl7im29wD0g,owbC7FP8SNAlwv6f9S5Stw,4ZbRwCB9oGibxK21MUZKHA,1,1,0,0,I have an update ! \nDo not EVER pay your util...,2018-02-10 21:51:52,Smith's,9750 S Maryland Pkwy,Las Vegas,NV,89183,36.012129,-115.134257,2.5,96,1,"{'RestaurantsTakeOut': 'True', 'BikeParking': ...","Beauty & Spas, Delis, Grocery, Cosmetics & Bea...","{'Monday': '6:0-1:0', 'Tuesday': '6:0-1:0', 'W...",Girl,250,2012-09-28 19:36:16,507,265,185,,"3tLPT5PzOemSed-ERxwSQg, TuivYYtdK6y2uOgbE2axXA...",8,3.42,0,1,0,0,0,7,6,5,5,1,0


In [16]:
# taking a random 6.25% sample of this dataframe
# random_state is set so that the same rows are drawn on every run
sample = review.sample(frac=0.0625, random_state=11)
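As a small illustration of what `frac` and `random_state` do, the sketch below samples a made-up frame of 1,600 rows twice with the same settings. The frame `toy` and its size are assumptions chosen so that 6.25% comes out to a round number.

```python
import pandas as pd

# Hypothetical frame with 1,600 rows
toy = pd.DataFrame({"review_id": range(1600)})

# frac=0.0625 keeps one sixteenth of the rows; a fixed random_state
# makes the draw repeatable
first = toy.sample(frac=0.0625, random_state=11)
second = toy.sample(frac=0.0625, random_state=11)

print(len(first))            # 100
print(first.equals(second))  # True
```

Applied to the 8,021,122 merged reviews, the same fraction yields a sample of roughly 501,000 rows.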

In [7]:
# exporting this sample as a .csv file to local drive
sample.to_csv('sample.csv')
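One detail worth knowing when the sample is read back later: by default `to_csv` also writes the dataframe index, and `read_csv` then loads it as a column named `Unnamed: 0`. The sketch below demonstrates this on a made-up two-row frame using in-memory buffers; whether to pass `index=False` in this notebook is a judgment call that depends on whether the original row labels are needed downstream.

```python
import io
import pandas as pd

# Hypothetical two-row sample with non-default index labels
toy = pd.DataFrame({"review_id": ["r1", "r2"], "stars": [5, 1]}, index=[10, 42])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```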

Now that the reviews have been matched to their corresponding business & user attributes, we can begin to explore the data to understand the social perception of users in the Yelp community with respect to restaurants. The next notebook will deal with further cleaning of this data.

------------------------------------------------