In [1]:
import pandas as pd

# Data

Yelp Open Dataset documentation: https://www.yelp.com/dataset/documentation/main

# V1

We start with reviews only, so that we can build a user-item matrix.
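
As a rough sketch of what such a matrix could look like, the snippet below pivots a small made-up ratings table into one row per user and one column per business. The column names follow the ones used later in this notebook; the toy values are assumptions for illustration only. It also assumes each (user, business) pair appears at most once, which `pivot` requires.

```python
import pandas as pd

# Toy ratings in long format (made-up values), one row per (user, business) pair
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b2", "b2", "b3"],
    "review_stars": [5, 3, 4, 1],
})

# Rows are users, columns are businesses, cells hold the rating.
# Pairs without a review become NaN.
user_item = toy.pivot(index="user_id", columns="business_id", values="review_stars")
print(user_item)
```

A dense pivot like this is only practical for a small sample; for the full dataset a sparse representation would probably be needed.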

## Data Wrangling

- select only open businesses (no restaurant category filter is applied yet)
- merge reviews with the open businesses (keeping city and state)
- select a state or city with a reasonable number of reviews

In [26]:
# Load the business metadata; city and state are used later to sample the reviews
df_business = pd.read_json('../data/external/yelp_dataset/yelp_academic_dataset_business.json', lines=True)
# Keep only open businesses and the columns needed for the merge
cols = ["business_id", "city", "state"]
df_clean_business = df_business.loc[df_business['is_open'] == 1, cols]

In [19]:
# load reviews
review = pd.read_json('../data/external/yelp_dataset/yelp_academic_dataset_review.json', lines=True, chunksize=1000000)

# The review file is large, so it is read in chunks of 1,000,000 rows
chunk_list = []
for chunk_review in review:
    # Drop columns that aren't needed
    chunk_review = chunk_review.drop(['review_id','useful','funny','cool', 'text'], axis=1)
    # Rename the rating column so it cannot be confused with a business's overall star rating
    chunk_review = chunk_review.rename(columns={'stars': 'review_stars'})
    # Inner merge with the filtered business table so only reviews of open businesses remain
    chunk_merged = pd.merge(df_clean_business, chunk_review, on='business_id', how='inner')
    chunk_list.append(chunk_merged)
    # Stop after the first chunk, so only the first 1,000,000 reviews are used
    break
# Concatenate the processed chunks into a single dataframe
df = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)

# If a user reviewed the same business more than once, keep only the most recent review
df = df.sort_values(by="date").drop_duplicates(subset=['business_id', 'user_id'], keep='last').drop(columns='date')

In [23]:
df.state.value_counts()

PA    178139
FL    125503
LA     99019
TN     70519
MO     60949
IN     56071
NV     46098
AZ     45967
CA     38037
NJ     29689
ID     15586
AB     12507
DE      6594
IL      5846
HI        17
WA         6
Name: state, dtype: int64

In [24]:
# Use NJ as the sample: enough reviews to work with (29,689), but not too many
df_nj = df[df.state=='NJ'].drop(columns=['state', 'city'])

In [25]:
df_nj.sample(3)

                   business_id                 user_id  review_stars
293835  8yXzNnRUzZ0r-947fpZL7A  PJYDtMQBKNGVbZ2m5L3ITQ             5
605353  NebCGmQMd058CRT-xs13tQ  hv0gqAwHKFNBrqqImuk1Iw             3
797422  G2_1Q5TnESlgdQDufHtqIw  3v-uRcbXNmO6IC99olzOrg             5


## Recsys Methods

- popularity: top restaurants with highest score
- TODO
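
A minimal sketch of the popularity baseline, assuming "score" means the mean of `review_stars` per business. The helper name `top_popular` and the `min_reviews` threshold are hypothetical and not part of the notebook; the threshold is there so that a business with a single 5-star review does not automatically rank first. The toy data is made up.

```python
import pandas as pd


def top_popular(ratings: pd.DataFrame, n: int = 10, min_reviews: int = 2) -> pd.DataFrame:
    """Rank businesses by mean rating, ignoring those with too few reviews."""
    stats = (
        ratings.groupby("business_id")["review_stars"]
        .agg(mean_stars="mean", n_reviews="count")
    )
    # Drop businesses that do not have enough reviews to trust the mean
    stats = stats[stats["n_reviews"] >= min_reviews]
    # Highest mean first; ties are broken by the number of reviews
    return stats.sort_values(["mean_stars", "n_reviews"], ascending=False).head(n)


# Toy data (made up) with the same columns as df_nj
toy = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b2", "b2", "b3"],
    "user_id": ["u1", "u2", "u1", "u2", "u3", "u1"],
    "review_stars": [5, 4, 3, 4, 2, 5],
})
top = top_popular(toy, n=2, min_reviews=2)
print(top)
```

With the real data, the call would presumably be `top_popular(df_nj)`. The same list is returned for every user, which is what makes this a non-personalized baseline.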

Check:
- https://towardsdatascience.com/how-i-implemented-explainable-movie-recommendations-using-python-7aa42a0af023 
- https://towardsdatascience.com/yelp-restaurant-recommendation-system-capstone-project-264fe7a7dea1