# Yelp Restaurant Review Data Analysis

This notebook cleans the Yelp business, review, and user datasets and merges them into a single dataset of restaurant reviews.

## Business.json Data

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Columns that are used for the initial dataframe
cols = ['business_id', 'city', 'state', 'stars', 'review_count', 'categories']

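# Raw string so that the backslashes in the Windows path are never interpreted as escape sequences
business_df = pd.read_csv(r'..\Yelp Dataset\yelp_business.csv', index_col='business_id', usecols=cols)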
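
As a small illustration of how `usecols` and `index_col` interact, here is a sketch on an in-memory CSV. The sample rows and the `name` column are made up for the example; only the column list comes from the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for yelp_business.csv
sample_csv = io.StringIO(
    "business_id,name,city,state,stars,review_count,categories\n"
    "a1,Foo,Phoenix,AZ,4.0,22,Dentists;Health & Medical\n"
    "b2,Bar,Tempe,AZ,3.0,9,Pizza;Restaurants\n"
)

cols = ['business_id', 'city', 'state', 'stars', 'review_count', 'categories']

# usecols has to include the column named in index_col;
# columns that are not listed (here 'name') are never loaded
sample_df = pd.read_csv(sample_csv, index_col='business_id', usecols=cols)

print(sample_df.shape)             # the index column is not counted as a data column
print(sample_df.columns.tolist())
```

Six columns are requested, but one of them becomes the index, so the resulting frame has five data columns.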

In [21]:
business_df.head()

business_id,city,state,stars,review_count,categories
FYWN1wneV18bWNgQjJ2GNg,Ahwatukee,AZ,4.0,22,Dentists;General Dentistry;Health & Medical;Or...
He-G7vWjzVUysIKrfNbPUQ,McMurray,PA,3.0,11,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
KQPW8lFf1y5BT2MxiSZ3QA,Phoenix,AZ,1.5,18,Departments of Motor Vehicles;Public Services ...
8DShNS-LuFqpEWIp0HxijA,Tempe,AZ,3.0,9,Sporting Goods;Shopping
PfOCPjBrlQAnz__NXj9h_w,Cuyahoga Falls,OH,3.5,116,American (New);Nightlife;Bars;Sandwiches;Ameri...


In [22]:
business_df.shape

(174567, 5)

In [23]:
business_df.describe(include='all')

,city,state,stars,review_count,categories
count,174566,174566,174567.0,174567.0,174567
unique,1093,67,,,76419
top,Las Vegas,AZ,,,Restaurants;Pizza
freq,26775,52214,,,990
mean,,,3.632196,30.137059,
std,,,1.003739,98.208174,
min,,,1.0,3.0,
25%,,,3.0,4.0,
50%,,,3.5,8.0,
75%,,,4.5,23.0,


Because we are visualizing only restaurant review data, we keep only the businesses whose categories attribute contains the value Restaurants.

In [24]:
business_df.categories.value_counts()

Restaurants;Pizza                                                                                                                                                                                       990
Pizza;Restaurants                                                                                                                                                                                       987
Food;Coffee & Tea                                                                                                                                                                                       978
Nail Salons;Beauty & Spas                                                                                                                                                                               936
Coffee & Tea;Food                                                                                                                                                                       

In [25]:
# Matching 'Restaurants' string with categories attribute and extracting a new dataframe
restaurants_df = business_df[business_df['categories'].str.contains('Restaurants')]
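
Note that `str.contains` matches substrings, so any category label that merely contains the word would also pass the filter. Whether the Yelp data has such labels is not checked here; the sketch below uses made-up rows (including a hypothetical `Restaurants Supplies` label) to show the difference between a substring match and an exact-token match.

```python
import pandas as pd

# Hypothetical category strings in the same ';'-separated format
toy = pd.DataFrame({
    'categories': [
        'Italian;Restaurants',
        'Restaurants Supplies;Shopping',   # made-up label that only contains the substring
        'Dentists;Health & Medical',
    ]
})

# Substring match, as in the cell above
substring_mask = toy['categories'].str.contains('Restaurants', regex=False)

# Exact-token match: split on ';' and test list membership
token_mask = toy['categories'].str.split(';').apply(lambda tokens: 'Restaurants' in tokens)

print(substring_mask.tolist())  # [True, True, False]
print(token_mask.tolist())      # [True, False, False]
```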

In [26]:
restaurants_df.shape

(54618, 5)

In [27]:
restaurants_df.head()

business_id,city,state,stars,review_count,categories
PfOCPjBrlQAnz__NXj9h_w,Cuyahoga Falls,OH,3.5,116,American (New);Nightlife;Bars;Sandwiches;Ameri...
o9eMRCWt5PkpLDE0gOPtcQ,Stuttgart,BW,4.0,5,Italian;Restaurants
XOSRcvtaKc_Q5H1SAzN20A,Houston,PA,4.5,3,Breakfast & Brunch;Gluten-Free;Coffee & Tea;Fo...
fNMVV_ZX7CJSDWQGdOM8Nw,Charlotte,NC,3.5,7,Restaurants;American (Traditional)
l09JfMeQ6ynYs5MCJtrcmQ,Toronto,ON,3.0,12,Italian;French;Restaurants


Inspect the DataFrame for missing values.

In [28]:
restaurants_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 54618 entries, PfOCPjBrlQAnz__NXj9h_w to UdEmYOnk2iJDY9lpEPAlJQ
Data columns (total 5 columns):
city            54618 non-null object
state           54618 non-null object
stars           54618 non-null float64
review_count    54618 non-null int64
categories      54618 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 2.5+ MB


Every remaining row is a restaurant, so the Restaurants string in the categories attribute carries no additional information. We remove it.

In [29]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered frame
restaurants_df = restaurants_df.assign(
    categories=restaurants_df['categories'].replace({';Restaurants' : '', 'Restaurants;' : '', 'Restaurants' : ''}, regex=True)
)


In [30]:
restaurants_df.head()

business_id,city,state,stars,review_count,categories
PfOCPjBrlQAnz__NXj9h_w,Cuyahoga Falls,OH,3.5,116,American (New);Nightlife;Bars;Sandwiches;Ameri...
o9eMRCWt5PkpLDE0gOPtcQ,Stuttgart,BW,4.0,5,Italian
XOSRcvtaKc_Q5H1SAzN20A,Houston,PA,4.5,3,Breakfast & Brunch;Gluten-Free;Coffee & Tea;Fo...
fNMVV_ZX7CJSDWQGdOM8Nw,Charlotte,NC,3.5,7,American (Traditional)
l09JfMeQ6ynYs5MCJtrcmQ,Toronto,ON,3.0,12,Italian;French
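
The same clean-up can also be written without regular expressions by splitting on `;`, dropping the exact token, and rejoining. This is an alternative sketch, not the method used in the notebook; `drop_token` is a hypothetical helper and the sample strings are made up.

```python
import pandas as pd

toy = pd.Series([
    'Italian;Restaurants',
    'Restaurants;American (Traditional)',
    'Italian;Restaurants;French',
    'Restaurants',
])

def drop_token(categories: str, token: str = 'Restaurants') -> str:
    """Remove one exact category token and rejoin the rest with ';'."""
    return ';'.join(part for part in categories.split(';') if part != token)

cleaned = toy.apply(drop_token)
print(cleaned.tolist())
# ['Italian', 'American (Traditional)', 'Italian;French', '']
```

A business tagged only as Restaurants ends up with an empty string, which is why an empty category appears in the frequency list later on.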


Next we split the categories column on the ";" character and add the first listed category, which we treat as the primary one, to restaurants_df as a new category column.

In [31]:
# Take the first listed category; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
restaurants_df = restaurants_df.assign(category=restaurants_df.categories.str.split(';').str[0])

# Drop the categories column from the dataframe
r_df_categorized = restaurants_df.drop(columns='categories')
r_df_categorized.head()

business_id,city,state,stars,review_count,category
PfOCPjBrlQAnz__NXj9h_w,Cuyahoga Falls,OH,3.5,116,American (New)
o9eMRCWt5PkpLDE0gOPtcQ,Stuttgart,BW,4.0,5,Italian
XOSRcvtaKc_Q5H1SAzN20A,Houston,PA,4.5,3,Breakfast & Brunch
fNMVV_ZX7CJSDWQGdOM8Nw,Charlotte,NC,3.5,7,American (Traditional)
l09JfMeQ6ynYs5MCJtrcmQ,Toronto,ON,3.0,12,Italian


In [32]:
# Count frequencies of different cuisines
r_df_categorized['category'].value_counts()

Pizza                        3388
Fast Food                    2851
Mexican                      2559
Chinese                      2539
Food                         2530
Italian                      2326
Sandwiches                   2304
American (Traditional)       2047
Burgers                      1826
Nightlife                    1570
Breakfast & Brunch           1498
Bars                         1488
American (New)               1391
Japanese                     1159
Cafes                        1024
Sushi Bars                    908
Thai                          833
Seafood                       801
Indian                        800
Chicken Wings                 787
Barbeque                      672
Vietnamese                    647
Mediterranean                 623
Asian Fusion                  601
Delis                         585
Coffee & Tea                  582
Steakhouses                   565
Canadian (New)                553
Greek                         521
Salad         

In [33]:
# Pick the 50 most frequent categories and read them into a list

# value_counts() sorts by frequency, so the index is already in descending order
food_categories_list = r_df_categorized['category'].value_counts().index.tolist()

# Select the 50 most frequent categories. Non-cuisine entries such as the empty string
# (businesses tagged only 'Restaurants') and 'Event Planning & Services' are removed in the next cell
food_categories_50 = food_categories_list[0:50]
print(food_categories_50)

['Pizza', 'Fast Food', 'Mexican', 'Chinese', 'Food', 'Italian', 'Sandwiches', 'American (Traditional)', 'Burgers', 'Nightlife', 'Breakfast & Brunch', 'Bars', 'American (New)', 'Japanese', 'Cafes', 'Sushi Bars', 'Thai', 'Seafood', 'Indian', 'Chicken Wings', 'Barbeque', 'Vietnamese', 'Mediterranean', 'Asian Fusion', 'Delis', 'Coffee & Tea', 'Steakhouses', 'Canadian (New)', 'Greek', 'Salad', 'French', 'Diners', 'Middle Eastern', 'Korean', 'Event Planning & Services', '', 'Buffets', 'Caribbean', 'Caterers', 'Bakeries', 'Sports Bars', 'Hot Dogs', 'Desserts', 'Vegetarian', 'Specialty Food', 'Pubs', 'Tex-Mex', 'German', 'Latin American', 'Soup']


In [34]:
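# Remove categories that are not useful in our case (for example 'Nightlife'), by their position in food_categories_50:
# Food, Nightlife, Bars, Coffee & Tea, Event Planning & Services, '' (empty), Buffets, Caterers, Sports Bars, Specialty Food, Pubs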

indexes = (4, 9, 11, 25, 34, 35, 36, 38, 40, 44, 45)
food_categories = np.delete(food_categories_50, indexes)
print(food_categories)

['Pizza' 'Fast Food' 'Mexican' 'Chinese' 'Italian' 'Sandwiches'
 'American (Traditional)' 'Burgers' 'Breakfast & Brunch' 'American (New)'
 'Japanese' 'Cafes' 'Sushi Bars' 'Thai' 'Seafood' 'Indian' 'Chicken Wings'
 'Barbeque' 'Vietnamese' 'Mediterranean' 'Asian Fusion' 'Delis'
 'Steakhouses' 'Canadian (New)' 'Greek' 'Salad' 'French' 'Diners'
 'Middle Eastern' 'Korean' 'Caribbean' 'Bakeries' 'Hot Dogs' 'Desserts'
 'Vegetarian' 'Tex-Mex' 'German' 'Latin American' 'Soup']
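
Deleting by position works for this exact list, but the indexes would point at different labels if the frequency order changed. A possible alternative, sketched here on a shortened made-up list, is to exclude the unwanted labels by name.

```python
# Hypothetical shortened list standing in for food_categories_50
top_categories = ['Pizza', 'Fast Food', 'Food', 'Italian', 'Nightlife', '', 'Bars', 'Soup']

# Labels to exclude, written out by name instead of by position
not_cuisines = {'Food', 'Nightlife', 'Bars', ''}

# A list comprehension keeps the original frequency order
cuisines = [c for c in top_categories if c not in not_cuisines]
print(cuisines)  # ['Pizza', 'Fast Food', 'Italian', 'Soup']
```

The result is a plain Python list rather than a NumPy array; `isin` accepts either.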


In [35]:
# Keep only the rows whose category is one of the selected cuisines

r_df_clean = r_df_categorized[r_df_categorized['category'].isin(food_categories)]
r_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38454 entries, PfOCPjBrlQAnz__NXj9h_w to UdEmYOnk2iJDY9lpEPAlJQ
Data columns (total 5 columns):
city            38454 non-null object
state           38454 non-null object
stars           38454 non-null float64
review_count    38454 non-null int64
category        38454 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 1.8+ MB


In [36]:
r_df_clean.describe(include='all')

,city,state,stars,review_count,category
count,38454,38454,38454.0,38454.0,38454
unique,691,37,,,39
top,Toronto,ON,,,Pizza
freq,4890,9545,,,3388
mean,,,3.39113,55.770427,
std,,,0.806129,142.506078,
min,,,1.0,3.0,
25%,,,3.0,7.0,
50%,,,3.5,17.0,
75%,,,4.0,51.0,


## Review.json Data

In [37]:
# Columns that are used for the initial dataframe
cols = ['review_id', 'user_id', 'business_id', 'stars', 'date']

review_df = pd.read_csv(r'..\Yelp Dataset\yelp_review.csv', index_col='review_id', usecols=cols)

In [38]:
review_df.head()

review_id,user_id,business_id,stars,date
vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28
n6QzIUObkYshz4dz2QRJTw,bv2nCi5Qv5vroFiqKGopiw,VR6GpWIda3SfvPC-lg9H3w,5,2016-05-28
MV3CcKScW05u5LVfF6ok0g,bv2nCi5Qv5vroFiqKGopiw,CKC0-MOWMqoeWf6s-szl8g,5,2016-05-28
IXvOzsEMYtiJI0CARmj77Q,bv2nCi5Qv5vroFiqKGopiw,ACFtxLv8pGrrxMm6EgjreA,4,2016-05-28
L_9BTb55X0GDtThi6GlZ6w,bv2nCi5Qv5vroFiqKGopiw,s2I_Ni76bjJNK9yG60iD-Q,4,2016-05-28


In [39]:
# Whole dataset has: 5261668 entries
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5261668 entries, vkVSCC7xljjrAI4UGfnKEQ to ldsIs3sGXPJ7WM7VyAm4lQ
Data columns (total 4 columns):
user_id        object
business_id    object
stars          int64
date           object
dtypes: int64(1), object(3)
memory usage: 200.7+ MB


In [40]:
review_df.describe()

,stars
count,5261668.0
mean,3.727739
std,1.433593
min,1.0
25%,3.0
50%,4.0
75%,5.0
max,5.0


In [41]:
review_df[review_df.duplicated()]

review_id,user_id,business_id,stars,date


In [42]:
review_df.isna().sum()

user_id        0
business_id    0
stars          0
date           0
dtype: int64

The dataset has no missing values and no duplicate rows.
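
One caveat: `DataFrame.duplicated()` compares column values only and ignores the index, so repeated `review_id` labels would not be flagged by the check above. The sketch below shows a separate index check on made-up rows; it is an optional extra and was not part of the original run.

```python
import pandas as pd

# Hypothetical reviews; 'r1' appears twice as an index label with different values
toy_reviews = pd.DataFrame(
    {
        'user_id': ['u1', 'u2', 'u3'],
        'business_id': ['b1', 'b1', 'b2'],
        'stars': [5, 4, 3],
        'date': ['2016-05-28', '2016-05-29', '2016-05-30'],
    },
    index=pd.Index(['r1', 'r1', 'r2'], name='review_id'),
)

# duplicated() on the frame compares column values only, so nothing is flagged
n_duplicate_rows = int(toy_reviews.duplicated().sum())

# The index has to be checked separately
n_duplicate_ids = int(toy_reviews.index.duplicated().sum())

# Total number of missing cells across all columns
n_missing = int(toy_reviews.isna().sum().sum())

print(n_duplicate_rows, n_duplicate_ids, n_missing)  # 0 1 0
```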

## User.json Data

In [45]:
# Columns that are used for the initial dataframe
cols = ['user_id', 'review_count', 'average_stars']

user_df = pd.read_csv(r'..\Yelp Dataset\yelp_user.csv', index_col='user_id', usecols=cols)

In [46]:
user_df.head()

user_id,review_count,average_stars
JJ-aSuM4pCFPdkfoZ34q0Q,10,3.7
uUzsFQn_6cXDh6rPNGbIFA,1,2.0
mBneaEEH5EMyxaVyqS-72A,6,4.67
W5mJGs-dcDWRGEhAzUYtoA,3,4.67
4E8--zUZO1Rr1IBK4_83fg,11,3.45


In [47]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1326100 entries, JJ-aSuM4pCFPdkfoZ34q0Q to q-1Tz4SvaTpGEMhI_xwm0Q
Data columns (total 2 columns):
review_count     1326100 non-null int64
average_stars    1326100 non-null float64
dtypes: float64(1), int64(1)
memory usage: 30.4+ MB


In [48]:
user_df.describe()

,review_count,average_stars
count,1326100.0,1326100.0
mean,23.11717,3.710841
std,79.09808,1.120721
min,0.0,1.0
25%,2.0,3.09
50%,5.0,3.9
75%,15.0,4.61
max,11954.0,5.0


## Merging DataFrames

Before merging, we inspect each dataframe and rename the columns whose names clash across dataframes (stars and review_count), so that the merged columns stay unambiguous.
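
To illustrate why the renaming matters, here is a sketch on two made-up frames that both have a `stars` column. Without renaming, `pd.merge` disambiguates the clash with its default `_x`/`_y` suffixes, which say nothing about where each column came from.

```python
import pandas as pd

# Hypothetical miniature versions of the review and restaurant frames
left = pd.DataFrame({'business_id': ['b1', 'b1'], 'stars': [5, 3]})
right = pd.DataFrame({'stars': [4.0]}, index=pd.Index(['b1'], name='business_id'))

# Without renaming: the clashing columns come out as 'stars_x' and 'stars_y'
merged = pd.merge(left, right, how='inner', left_on='business_id', right_index=True)
print(merged.columns.tolist())

# With explicit renaming first, the origin of each column stays visible
renamed = pd.merge(
    left.rename(columns={'stars': 'review_stars'}),
    right.rename(columns={'stars': 'restaurant_stars'}),
    how='inner', left_on='business_id', right_index=True,
)
print(renamed.columns.tolist())
```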

In [49]:
# rename() without inplace returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered frame
r_df_clean = r_df_clean.rename(columns={'stars': 'restaurant_stars', 'review_count': 'restaurant_review_count'})
r_df_clean.head()

business_id,city,state,restaurant_stars,restaurant_review_count,category
PfOCPjBrlQAnz__NXj9h_w,Cuyahoga Falls,OH,3.5,116,American (New)
o9eMRCWt5PkpLDE0gOPtcQ,Stuttgart,BW,4.0,5,Italian
XOSRcvtaKc_Q5H1SAzN20A,Houston,PA,4.5,3,Breakfast & Brunch
fNMVV_ZX7CJSDWQGdOM8Nw,Charlotte,NC,3.5,7,American (Traditional)
l09JfMeQ6ynYs5MCJtrcmQ,Toronto,ON,3.0,12,Italian


In [50]:
r_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38454 entries, PfOCPjBrlQAnz__NXj9h_w to UdEmYOnk2iJDY9lpEPAlJQ
Data columns (total 5 columns):
city                       38454 non-null object
state                      38454 non-null object
restaurant_stars           38454 non-null float64
restaurant_review_count    38454 non-null int64
category                   38454 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 1.8+ MB


In [54]:
review_df.rename(columns={'stars': 'review_stars', 'date': 'review_date'}, inplace=True)
review_df.head()

review_id,user_id,business_id,review_stars,review_date
vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28
n6QzIUObkYshz4dz2QRJTw,bv2nCi5Qv5vroFiqKGopiw,VR6GpWIda3SfvPC-lg9H3w,5,2016-05-28
MV3CcKScW05u5LVfF6ok0g,bv2nCi5Qv5vroFiqKGopiw,CKC0-MOWMqoeWf6s-szl8g,5,2016-05-28
IXvOzsEMYtiJI0CARmj77Q,bv2nCi5Qv5vroFiqKGopiw,ACFtxLv8pGrrxMm6EgjreA,4,2016-05-28
L_9BTb55X0GDtThi6GlZ6w,bv2nCi5Qv5vroFiqKGopiw,s2I_Ni76bjJNK9yG60iD-Q,4,2016-05-28


In [55]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5261668 entries, vkVSCC7xljjrAI4UGfnKEQ to ldsIs3sGXPJ7WM7VyAm4lQ
Data columns (total 4 columns):
user_id         object
business_id     object
review_stars    int64
review_date     object
dtypes: int64(1), object(3)
memory usage: 200.7+ MB


In [57]:
user_df.rename(columns={'review_count': 'user_review_count', 'average_stars': 'user_average_stars'}, inplace=True)
user_df.head()

user_id,user_review_count,user_average_stars
JJ-aSuM4pCFPdkfoZ34q0Q,10,3.7
uUzsFQn_6cXDh6rPNGbIFA,1,2.0
mBneaEEH5EMyxaVyqS-72A,6,4.67
W5mJGs-dcDWRGEhAzUYtoA,3,4.67
4E8--zUZO1Rr1IBK4_83fg,11,3.45


In [58]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1326100 entries, JJ-aSuM4pCFPdkfoZ34q0Q to q-1Tz4SvaTpGEMhI_xwm0Q
Data columns (total 2 columns):
user_review_count     1326100 non-null int64
user_average_stars    1326100 non-null float64
dtypes: float64(1), int64(1)
memory usage: 30.4+ MB


There are no missing values, so we can start by joining review_df and user_df on user_id. Both merges use an inner join: the first keeps only the reviews whose author appears in user_df, and the second keeps only the reviews of restaurants in r_df_clean.

In [59]:
user_review_df = pd.merge(review_df, user_df, how='inner', left_on="user_id" , right_index= True)
user_review_df.head()

review_id,user_id,business_id,review_stars,review_date,user_review_count,user_average_stars
vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28,6,4.67
n6QzIUObkYshz4dz2QRJTw,bv2nCi5Qv5vroFiqKGopiw,VR6GpWIda3SfvPC-lg9H3w,5,2016-05-28,6,4.67
MV3CcKScW05u5LVfF6ok0g,bv2nCi5Qv5vroFiqKGopiw,CKC0-MOWMqoeWf6s-szl8g,5,2016-05-28,6,4.67
IXvOzsEMYtiJI0CARmj77Q,bv2nCi5Qv5vroFiqKGopiw,ACFtxLv8pGrrxMm6EgjreA,4,2016-05-28,6,4.67
L_9BTb55X0GDtThi6GlZ6w,bv2nCi5Qv5vroFiqKGopiw,s2I_Ni76bjJNK9yG60iD-Q,4,2016-05-28,6,4.67


In [61]:
user_review_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5261662 entries, vkVSCC7xljjrAI4UGfnKEQ to ldsIs3sGXPJ7WM7VyAm4lQ
Data columns (total 6 columns):
user_id               object
business_id           object
review_stars          int64
review_date           object
user_review_count     int64
user_average_stars    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 281.0+ MB
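
An inner join silently drops rows that have no match on the other side. One way to list such rows, sketched here on made-up data, is a left join with `indicator=True`; this check was not part of the original run.

```python
import pandas as pd

# Hypothetical miniature frames; user 'u9' has a review but no user record
toy_reviews = pd.DataFrame(
    {'user_id': ['u1', 'u2', 'u9'], 'review_stars': [5, 4, 1]},
    index=pd.Index(['r1', 'r2', 'r3'], name='review_id'),
)
toy_users = pd.DataFrame(
    {'user_review_count': [10, 3]},
    index=pd.Index(['u1', 'u2'], name='user_id'),
)

# A left join keeps every review; the _merge column records whether a user row was found
checked = pd.merge(toy_reviews, toy_users, how='left',
                   left_on='user_id', right_index=True, indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['user_id'].tolist())  # ['u9']
```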


In [62]:
user_review_df.describe()

,review_stars,user_review_count,user_average_stars
count,5261662.0,5261662.0,5261662.0
mean,3.727738,121.65,3.737311
std,1.433593,341.3802,0.7826569
min,1.0,0.0,1.0
25%,3.0,7.0,3.4
50%,4.0,24.0,3.8
75%,5.0,97.0,4.2
max,5.0,11954.0,5.0


In [63]:
user_review_df.isna().sum()

user_id               0
business_id           0
review_stars          0
review_date           0
user_review_count     0
user_average_stars    0
dtype: int64

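The inner join kept 5,261,662 of the 5,261,668 reviews, so 6 reviews were dropped, presumably because their user_id has no match in user_df. Now we can join user_review_df with r_df_clean (containing restaurant information) on business_id to form the final DataFrame.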

In [64]:
final_df = pd.merge(user_review_df, r_df_clean, how='inner', left_on="business_id" , right_index= True)
final_df.head()

review_id,user_id,business_id,review_stars,review_date,user_review_count,user_average_stars,city,state,restaurant_stars,restaurant_review_count,category
vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28,6,4.67,Montréal,QC,4.0,84,Diners
vm1b1keOzwHjtGZEPPuYXA,xYciRtVZ1PW4IxSX4oJ1aw,AEx2SYEUJmTxVVB18LlCwA,5,2016-02-22,177,3.41,Montréal,QC,4.0,84,Diners
jUzausdZ_ujqe_n8BlBj-g,DVOOF0Z627DyrZ4XKQbTgA,AEx2SYEUJmTxVVB18LlCwA,5,2017-08-08,40,3.98,Montréal,QC,4.0,84,Diners
SXwA9KZ-Nc_hMARk_3cJ7g,5Ymfsf9fAYz-Ds_p0xawVQ,AEx2SYEUJmTxVVB18LlCwA,5,2013-03-29,79,4.52,Montréal,QC,4.0,84,Diners
oCRDwF3tszAkeszSfxwthg,5JoKz3mU42Cp906KRXDwJw,AEx2SYEUJmTxVVB18LlCwA,4,2009-01-17,3,4.0,Montréal,QC,4.0,84,Diners
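
This merge relies on each business_id appearing only once in the restaurant frame, otherwise reviews would be duplicated. As a hedged sketch on made-up data, the `validate` argument of `pd.merge` can state that assumption explicitly; it raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical miniature frames; 'b3' is a business that is not in the restaurant frame
toy_user_reviews = pd.DataFrame(
    {'business_id': ['b1', 'b1', 'b2', 'b3'], 'review_stars': [5, 4, 3, 2]},
    index=pd.Index(['r1', 'r2', 'r3', 'r4'], name='review_id'),
)
toy_restaurants = pd.DataFrame(
    {'category': ['Diners', 'Pizza']},
    index=pd.Index(['b1', 'b2'], name='business_id'),
)

# many_to_one: many reviews per restaurant, exactly one restaurant row per key
toy_final = pd.merge(toy_user_reviews, toy_restaurants, how='inner',
                     left_on='business_id', right_index=True,
                     validate='many_to_one')

print(len(toy_final))  # 3, the review of 'b3' is dropped by the inner join
```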


In [65]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2144848 entries, vkVSCC7xljjrAI4UGfnKEQ to 99hWXPtbpzXXbsL43RYuKA
Data columns (total 11 columns):
user_id                    object
business_id                object
review_stars               int64
review_date                object
user_review_count          int64
user_average_stars         float64
city                       object
state                      object
restaurant_stars           float64
restaurant_review_count    int64
category                   object
dtypes: float64(2), int64(3), object(6)
memory usage: 196.4+ MB


In [66]:
final_df.isna().sum()

user_id                    0
business_id                0
review_stars               0
review_date                0
user_review_count          0
user_average_stars         0
city                       0
state                      0
restaurant_stars           0
restaurant_review_count    0
category                   0
dtype: int64

In [67]:
final_df[final_df.duplicated()].count()

user_id                    0
business_id                0
review_stars               0
review_date                0
user_review_count          0
user_average_stars         0
city                       0
state                      0
restaurant_stars           0
restaurant_review_count    0
category                   0
dtype: int64

In [68]:
final_df.shape

(2144848, 11)

## Data output

In [71]:
final_df.to_csv(r'..\Yelp Dataset\merged_data.csv')
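
When the merged file is read back later, the index and the date column need to be restored explicitly, since CSV stores everything as text. A minimal round-trip sketch on made-up rows, writing to a temporary directory instead of the real dataset folder:

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row stand-in for final_df
toy_final = pd.DataFrame(
    {'review_stars': [5, 4], 'review_date': ['2016-05-28', '2016-02-22']},
    index=pd.Index(['r1', 'r2'], name='review_id'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'merged_data.csv')
    toy_final.to_csv(out_path)  # the index is written as the first column
    # Restore the index and parse the date column while loading
    reloaded = pd.read_csv(out_path, index_col='review_id', parse_dates=['review_date'])

print(reloaded.dtypes)
```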