# Yelp Data Filter Script

This script processes the full Yelp dataset JSON files and produces filtered CSVs 
containing only data relevant to restaurant reviews. Specifically, it:

1. Loads `business.json` and filters businesses to only those in restaurant-related categories.
2. Filters `review.json`, `checkin.json`, and `tip.json` to include only entries 
   associated with the filtered restaurants.
3. Filters `user.json` to include only users who wrote reviews in the filtered review dataset.
4. Saves all filtered datasets to CSV for easier analysis.

To keep memory usage manageable, the review, check-in, tip, and user files are read 
in chunks of 100,000 lines; `business.json` is loaded in a single call.

In [1]:
import json
import pandas as pd
import numpy as np

In [2]:
# Load all of the businesses
file = "full-dataset/yelp_academic_dataset_business.json"
df = pd.read_json(file, lines=True) # lines=True because each line in the file is its own JSON object

# Keep only the restaurant-related businesses by keyword (case-insensitive substring match on categories)
keywords = ["Restaurants", "Food", "Cafes", "Coffee & Tea", "Bakeries", "Bars", "Fast Food", "Pizza", "Sandwiches", "Breakfast & Brunch"]
pattern = "|".join(keywords)

# Saving this dataframe so I can use business_id to cross-reference with other dataframes
# Note: could use similar logic to filter by city as well
df_restaurants = df[df['categories'].str.contains(pattern, case=False, na=False)]
restaurant_ids = df_restaurants["business_id"].unique()
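
One thing to keep in mind about the keyword filter above: `str.contains` does substring matching, so a keyword like "Food" also matches categories such as "Pet Food". Below is a minimal sketch on made-up toy data comparing the substring match with a stricter whole-category match. The toy rows, the shortened keyword list, and the helper `has_exact_category` are all hypothetical, and the stricter version assumes `categories` is a comma-separated string.

```python
import pandas as pd

# Hypothetical toy data; the rows are made up for illustration.
toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "categories": [
        "Restaurants, Pizza",
        "Pet Food, Pet Stores",
        "Hair Salons, Barbers",
        None,
    ],
})

keywords = ["Restaurants", "Food", "Bars"]
pattern = "|".join(keywords)

# Substring match, same approach as the cell above: "Pet Food" matches "Food".
loose = toy[toy["categories"].str.contains(pattern, case=False, na=False)]

# Stricter alternative (assumes comma-separated categories):
# split the string and compare whole category names.
wanted = {k.lower() for k in keywords}

def has_exact_category(cats):
    if not isinstance(cats, str):
        return False  # missing categories never match
    return any(c.strip().lower() in wanted for c in cats.split(","))

strict = toy[toy["categories"].apply(has_exact_category)]

print(loose["business_id"].tolist())
print(strict["business_id"].tolist())
```

Which version is preferable depends on the analysis: the substring match casts a wider net, while the whole-category match avoids accidental hits but misses multi-word categories that merely contain a keyword (for example "Fast Food" unless it is listed explicitly).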

In [3]:
# Load a Yelp JSON-lines file in chunks (rather than all at once) and keep only
# the rows whose business_id belongs to one of the restaurants identified above
def filter_datasets_chunked(filename, chunksize=100000):
    filtered_chunks = []
    for chunk in pd.read_json(filename, lines=True, chunksize=chunksize):
        # restaurant_ids is the global array built in the previous cell
        filtered = chunk[chunk["business_id"].isin(restaurant_ids)]
        filtered_chunks.append(filtered)

    return pd.concat(filtered_chunks, ignore_index=True)
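
To see what the chunked filter does without touching the full dataset, here is a small self-contained sketch. It uses a hypothetical variant, `filter_json_chunked`, that takes the column name and the allowed IDs as arguments instead of reading a global, and it runs against a tiny temporary JSON-lines file with made-up records.

```python
import json
import os
import tempfile

import pandas as pd

def filter_json_chunked(filename, column, allowed_ids, chunksize=100000):
    """Hypothetical variant of the function above with explicit arguments."""
    allowed = set(allowed_ids)
    kept = []
    # With lines=True and a chunksize, read_json yields one DataFrame per chunk
    for chunk in pd.read_json(filename, lines=True, chunksize=chunksize):
        kept.append(chunk[chunk[column].isin(allowed)])
    return pd.concat(kept, ignore_index=True)

# Made-up records standing in for lines of a review file
records = [
    {"review_id": "r1", "business_id": "a", "stars": 5},
    {"review_id": "r2", "business_id": "b", "stars": 3},
    {"review_id": "r3", "business_id": "c", "stars": 4},
    {"review_id": "r4", "business_id": "b", "stars": 1},
    {"review_id": "r5", "business_id": "a", "stars": 2},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reviews.json")
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # chunksize=2 forces three chunks so the concatenation step is exercised
    demo = filter_json_chunked(path, "business_id", ["a", "c"], chunksize=2)

print(demo)
```

Passing the column name explicitly means the same helper could also cover a `user_id` filter, at the cost of a slightly longer call.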

In [4]:
# Keep only the rows tied to the filtered restaurants in the check-in, review, and tip datasets
filtered_checkin = filter_datasets_chunked("full-dataset/yelp_academic_dataset_checkin.json")
filtered_review = filter_datasets_chunked("full-dataset/yelp_academic_dataset_review.json")
filtered_tip = filter_datasets_chunked("full-dataset/yelp_academic_dataset_tip.json")

In [5]:
# Apply the same chunked approach to the users dataset, keeping only users who
# wrote at least one review in the filtered review dataset
user_ids = filtered_review["user_id"].unique()
filename = "full-dataset/yelp_academic_dataset_user.json"
filtered_chunks = []
for chunk in pd.read_json(filename, lines=True, chunksize=100000):
    filtered = chunk[chunk["user_id"].isin(user_ids)]
    filtered_chunks.append(filtered)
    
filtered_user = pd.concat(filtered_chunks, ignore_index=True)

In [14]:
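# Save all of these datasets in CSV format (to_csv fails if the output directory is missing, so create it first)
import os
os.makedirs("data-restaurants-only", exist_ok=True)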
df_restaurants.to_csv("data-restaurants-only/business.csv", index=False)
filtered_checkin.to_csv("data-restaurants-only/checkin.csv", index=False)
filtered_review.to_csv("data-restaurants-only/review.csv", index=False)
filtered_tip.to_csv("data-restaurants-only/tip.csv", index=False)
filtered_user.to_csv("data-restaurants-only/user.csv", index=False)
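
One caveat when saving to CSV: any column that holds nested objects (I am assuming here that the business data has dict-valued columns such as `hours`) is written as the Python repr of the dict, which is awkward to parse later. A possible workaround, sketched below on made-up data, is to serialize those cells to JSON strings before saving and decode them after loading. The toy frame and the in-memory buffer are just for illustration.

```python
import io
import json

import pandas as pd

# Made-up row with a dict-valued column, standing in for a nested JSON field
toy = pd.DataFrame({
    "business_id": ["a"],
    "hours": [{"Monday": "9:0-17:0", "Tuesday": "9:0-17:0"}],
})

# Serialize dict cells as JSON strings so they survive the CSV round trip
out = toy.copy()
out["hours"] = out["hours"].apply(
    lambda v: json.dumps(v) if isinstance(v, dict) else v
)

# Write to an in-memory buffer instead of a file for this demo
buf = io.StringIO()
out.to_csv(buf, index=False)
buf.seek(0)

# Read back and decode the JSON strings into dicts again
back = pd.read_csv(buf)
back["hours"] = back["hours"].apply(
    lambda v: json.loads(v) if isinstance(v, str) else v
)

print(back.loc[0, "hours"])
```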