## Import Library & Dataset Processing
* Import the necessary libraries
* Preprocess the datasets by merging them

In [1]:
import pandas as pd

In [2]:
users = pd.read_csv('data/users.csv')
reviews = pd.read_csv('data/reviews.csv')
restaurants = pd.read_csv('data/restaurants.csv')

Yelp's recommendation system flags each review with one of four labels:
* N - Reviews confirmed as not fake
* Y - Reviews confirmed as fake (least frequent label)
* NR - Reviews not confirmed as fake but flagged for investigation (most frequent label)
* YR - Reviews suspected to be fake that require more review before confirmation

In [3]:
reviews['flagged'].value_counts()

flagged
NR    402774
YR    318678
N      58716
Y       8303
Name: count, dtype: int64
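
The counts above show how uneven the four labels are. As a rough check, the sketch below recomputes each label's share of all reviews. The numbers are copied from the printed output rather than re-read from `data/reviews.csv`, and the names `flag_counts` and `flag_share` are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
flag_counts = pd.Series(
    {"NR": 402774, "YR": 318678, "N": 58716, "Y": 8303},
    name="count",
)

# Share of each label among all reviews, as a percentage.
flag_share = (flag_counts / flag_counts.sum() * 100).round(2)
print(flag_share)

# Labels with the largest and smallest counts.
print(flag_counts.idxmax(), flag_counts.idxmin())
```

On the real data, `reviews['flagged'].value_counts(normalize=True)` should give the same proportions directly.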

In [4]:
# Inner join (the pd.merge default) of reviews and users on reviewerID
merged = pd.merge(reviews, users, on='reviewerID')

In [5]:
merged.shape

(708268, 22)
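
`reviews` and `users` share some column names (for example `usefulCount`), so `pd.merge` appends `_x` and `_y` to tell them apart. A minimal sketch on toy data, assuming invented rows and only one of the shared columns, shows where those suffixes come from and how the `suffixes` parameter could name the columns at merge time instead:

```python
import pandas as pd

# Toy stand-ins for the real tables; the rows are invented for illustration.
reviews_toy = pd.DataFrame({
    "reviewerID": ["u1", "u2", "u3"],
    "usefulCount": [3, 0, 1],
})
users_toy = pd.DataFrame({
    "reviewerID": ["u1", "u2"],
    "usefulCount": [40, 7],
})

# Default inner join: the unmatched reviewer u3 is dropped, and the shared
# column name is disambiguated with _x (left) and _y (right).
default_merge = pd.merge(reviews_toy, users_toy, on="reviewerID")
print(list(default_merge.columns))

# Same join with explicit suffixes, so no rename is needed afterwards.
named_merge = pd.merge(
    reviews_toy,
    users_toy,
    on="reviewerID",
    how="inner",
    suffixes=("_review", "_user"),
)
print(list(named_merge.columns))
```

Because the join is an inner join, reviews whose `reviewerID` has no match in `users` are dropped, which is why the merged frame can have fewer rows than `reviews`.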

In [6]:
# Keep only the confirmed labels (N, Y); NR and YR reviews are unconfirmed, so drop them
merged_filtered = merged[~merged['flagged'].isin(['YR', 'NR'])]

In [7]:
merged_filtered['flagged'].value_counts()

flagged
N    20752
Y     6206
Name: count, dtype: int64
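
A boolean-mask filter like the one above returns a new frame that pandas may still treat as a slice of the original. The sketch below, on invented toy rows, keeps the confirmed labels and takes an explicit `.copy()` so that later in-place changes act on an independent frame. The `is_fake` column is a hypothetical binary target added only for illustration; it is not part of the dataset.

```python
import pandas as pd

# Toy frame with invented rows; only the 'flagged' labels mirror the dataset.
toy = pd.DataFrame({
    "reviewID": ["r1", "r2", "r3", "r4"],
    "flagged": ["N", "YR", "Y", "NR"],
})

# Keep only the confirmed labels and take an explicit copy, so later
# in-place changes act on an independent frame rather than a slice.
confirmed = toy[toy["flagged"].isin(["N", "Y"])].copy()

# Hypothetical binary target: 1 for fake (Y), 0 for not fake (N).
confirmed["is_fake"] = (confirmed["flagged"] == "Y").astype(int)
print(confirmed)
```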

In [9]:
# Assign the renamed frame back instead of using inplace=True on a filtered slice,
# which triggers pandas' SettingWithCopyWarning.
merged_filtered = merged_filtered.rename(columns={'usefulCount_x': 'usefulCount_review',
                   'coolCount_x': 'coolCount_review',
                   'funnyCount_x': 'funnyCount_review',
                   'usefulCount_y': 'usefulCount_user',
                   'coolCount_y': 'coolCount_user',
                   'funnyCount_y': 'funnyCount_user'})


In [12]:
restaurant_merged = pd.merge(merged_filtered, restaurants, on='restaurantID')

In [16]:
restaurant_merged.rename(columns={'name_x': 'name_user',
                                  'location_x': 'location_user',
                                  'reviewCount_x': 'reviewCount_user',
                                  'name_y': 'name_restaurant',
                                  'location_y': 'location_restaurant',
                                  'reviewCount_y': 'reviewCount_restaurant',
                                  'rating_x': 'rating_review',
                                  'rating_y': 'rating_restaurant'}, inplace=True)

In [17]:
restaurant_merged.columns

Index(['date', 'reviewID', 'reviewerID', 'reviewContent', 'rating_review',
       'usefulCount_review', 'coolCount_review', 'funnyCount_review',
       'flagged', 'restaurantID', 'name_user', 'location_user', 'yelpJoinDate',
       'friendCount', 'reviewCount_user', 'firstCount', 'usefulCount_user',
       'coolCount_user', 'funnyCount_user', 'complimentCount', 'tipCount',
       'fanCount', 'name_restaurant', 'location_restaurant',
       'reviewCount_restaurant', 'rating_restaurant', 'categories', 'address',
       'Hours', 'GoodforKids', 'AcceptsCreditCards', 'Parking', 'Attire',
       'GoodforGroups', 'PriceRange', 'TakesReservations', 'Delivery',
       'Takeout', 'WaiterService', 'OutdoorSeating', 'WiFi', 'GoodFor',
       'Alcohol', 'NoiseLevel', 'Ambience', 'HasTV', 'Caters',
       'WheelchairAccessible', 'webSite', 'phoneNumber', 'filReviewCount'],
      dtype='object')
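
The column list above contains no leftover `_x` or `_y` names. A small helper can make that check explicit after each merge and rename. `leftover_suffix_columns` is a hypothetical helper, not a pandas function, and the toy frame uses invented columns:

```python
import pandas as pd

def leftover_suffix_columns(df: pd.DataFrame) -> list:
    """Return column names that still end in the default merge suffixes."""
    return [col for col in df.columns if col.endswith(("_x", "_y"))]

# Toy frame with invented columns to exercise the helper.
toy = pd.DataFrame(columns=["rating_review", "rating_restaurant", "name_x"])
print(leftover_suffix_columns(toy))
```

Calling `leftover_suffix_columns(restaurant_merged)` on the real frame would be expected to return an empty list, given the columns printed above.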