## Yelp dataset read-in code
For: CAPP 30254 - Machine Learning for Public Policy<br>
Spring 2023<br>
By: Matt Jackson<br>

The original Yelp review dataset is over 5GB in size, has nearly 7 million rows (6,990,280), and contains several columns that we don't need. That makes it both unwieldy and too big to play nicely with GitHub or pandas.

This code takes the original Yelp dataset into pandas, removes the unnecessary columns, and creates two kinds of sample file for exploratory analysis and model training:

* `yelp_true_sample`: A truly random sample of 100,000 of the Yelp reviews
* `yelp_oversample`: An "oversample" containing 20,000 reviews of each star rating (1 through 5), for a total of 100,000 reviews

It does _not_ do any text pre-processing.
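
To make the difference between the two sample types concrete, here is a minimal sketch on a made-up toy DataFrame. The `toy` data and the sample sizes are illustrative assumptions, not the Yelp data. It uses `groupby(...).sample(...)` (available in pandas 1.1 and later) as a compact stand-in for per-rating sampling; the notebook itself filters by rating instead.

```python
import pandas as pd

# Toy stand-in for the reviews table: ratings are deliberately imbalanced
toy = pd.DataFrame({
    "stars": [5] * 60 + [4] * 20 + [3] * 10 + [2] * 5 + [1] * 5,
    "text": [f"review {i}" for i in range(100)],
})

# "True" sample: every row is equally likely, so the sample tends to
# mirror the imbalance in the full data
true_sample = toy.sample(n=20, random_state=2023, replace=False)

# Balanced sample: the same number of rows is drawn from each star rating
balanced_sample = toy.groupby("stars").sample(n=4, random_state=2023, replace=False)

print(true_sample["stars"].value_counts().sort_index())
print(balanced_sample["stars"].value_counts().sort_index())
```

The balanced version gives a model equal exposure to every rating, at the cost of no longer reflecting how often each rating actually occurs.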

In [1]:
import pandas as pd
import numpy as np

_There is no need to run this once the sample datasets are available. Preserving the code for posterity and reproducibility, though._

In [2]:
# Read in the original Yelp JSON as "chunks", which are smaller and can be concatenated
# to recreate the whole dataset without crashing pandas
# https://stackoverflow.com/questions/46790390/how-to-read-a-large-json-in-pandas

# "yelp_dataset/yelp_academic_dataset_review.json" is the original 5GB file
# This takes about 5 minutes.
chunks = pd.read_json("yelp_dataset/yelp_academic_dataset_review.json", lines=True, chunksize=10000)
# There are 700 chunks
# Collect the chunks in a list and concatenate once at the end, which avoids
# re-copying the growing DataFrame on every iteration
chunk_list = []
for i, chunk in enumerate(chunks):
    print(f"Now concatenating chunk {i}...")
    chunk_list.append(chunk)
reviews = pd.concat(chunk_list)
print("Full dataset load complete")

Now concatenating chunk 0...
Now concatenating chunk 1...
Now concatenating chunk 2...
Now concatenating chunk 3...
Now concatenating chunk 4...
Now concatenating chunk 5...
Now concatenating chunk 6...
Now concatenating chunk 7...
Now concatenating chunk 8...
Now concatenating chunk 9...
Now concatenating chunk 10...
Now concatenating chunk 11...
Now concatenating chunk 12...
Now concatenating chunk 13...
Now concatenating chunk 14...
Now concatenating chunk 15...
Now concatenating chunk 16...
Now concatenating chunk 17...
Now concatenating chunk 18...
Now concatenating chunk 19...
Now concatenating chunk 20...
Now concatenating chunk 21...
Now concatenating chunk 22...
Now concatenating chunk 23...
Now concatenating chunk 24...
Now concatenating chunk 25...
Now concatenating chunk 26...
Now concatenating chunk 27...
Now concatenating chunk 28...
Now concatenating chunk 29...
Now concatenating chunk 30...
Now concatenating chunk 31...
Now concatenating chunk 32...
...
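
Since only three columns are kept in the end, one possible memory-saving variation is to drop the unneeded columns from each chunk before concatenating. The sketch below shows the idea on a tiny in-memory JSON Lines string. The records, the `keep_cols` and `reviews_small` names, and the chunk size of 2 are all made up for illustration; this is not how the notebook's data was actually loaded.

```python
import io
import pandas as pd

# Tiny made-up JSON Lines data standing in for the 5GB review file
json_lines = "\n".join([
    '{"review_id": "a", "business_id": "x", "stars": 5, "text": "Great", "date": "2020-01-01 10:00:00"}',
    '{"review_id": "b", "business_id": "y", "stars": 1, "text": "Awful", "date": "2020-02-01 11:00:00"}',
    '{"review_id": "c", "business_id": "x", "stars": 3, "text": "Okay", "date": "2020-03-01 12:00:00"}',
    '{"review_id": "d", "business_id": "z", "stars": 4, "text": "Good", "date": "2020-04-01 13:00:00"}',
    '{"review_id": "e", "business_id": "y", "stars": 2, "text": "Meh", "date": "2020-05-01 14:00:00"}',
])

keep_cols = ["stars", "text", "date"]

# With lines=True and a chunksize, read_json returns an iterator of DataFrames
reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=2)

# Trim each chunk first, then concatenate once at the end
trimmed_chunks = [chunk.loc[:, keep_cols] for chunk in reader]
reviews_small = pd.concat(trimmed_chunks)

print(reviews_small.shape)
```

Trimming early means the full-width table never has to sit in memory all at once, which may matter on machines with limited RAM.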

In [3]:
#print(reviews.shape) #(6990280, 9)

#preserving date column in case time-based analysis is relevant later
reviews = reviews.loc[:, ['stars', 'text', 'date']]
reviews

| | stars | text | date |
|---|---|---|---|
| 0 | 3 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | 5 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | 3 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | 5 | Wow!  Yummy, different,  delicious.   Our favo... | 2015-01-04 00:01:03 |
| 4 | 4 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
| ... | ... | ... | ... |
| 6990275 | 5 | Latest addition to services from ICCU is Apple... | 2014-12-17 21:45:20 |
| 6990276 | 5 | This spot offers a great, affordable east week... | 2021-03-31 16:55:10 |
| 6990277 | 4 | This Home Depot won me over when I needed to g... | 2019-12-30 03:56:30 |
| 6990278 | 5 | For when I'm feeling like ignoring my calorie-... | 2022-01-19 18:59:27 |


In [None]:
#Takes about 47 seconds to write back out full dataset if you choose to do that
#reviews.to_csv("yelp_review_dataset_3col.csv")

### Generate true sample of 100,000 reviews

In [12]:
yelp_true_sample = reviews.sample(n=100000, random_state=2023, replace=False)
yelp_true_sample.reset_index(drop=True, inplace=True)

#This file is about 59.8MB and takes ~1 second to write to file
yelp_true_sample.to_csv("yelp_true_sample_100k.csv")
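
Because `to_csv` is called here without `index=False`, the row index is written as an unnamed first column, and reading the file back naively produces an extra `Unnamed: 0` column. Below is a minimal sketch of a round trip that avoids this, using a small made-up DataFrame and a temporary file (both hypothetical) in place of the real sample:

```python
import os
import tempfile
import pandas as pd

# Small made-up stand-in for yelp_true_sample
sample = pd.DataFrame({
    "stars": [5, 3, 1],
    "text": ["Loved it", "It was fine", "Never again"],
    "date": ["2020-01-01 10:00:00", "2021-06-15 18:30:00", "2022-03-09 08:45:00"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_sample.csv")
    sample.to_csv(path)  # the index is written as an unnamed first column

    # Naive read: the saved index comes back as a data column
    naive = pd.read_csv(path)

    # Treat the first column as the index and parse the date strings
    restored = pd.read_csv(path, index_col=0, parse_dates=["date"])

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` would be another way to avoid the extra column, at the cost of changing the files this notebook produces.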


### Generate oversample with 20,000 reviews of each star rating

In [22]:
# Draw 20,000 reviews for each star rating (1 through 5), then stack them
yelp_oversample = pd.concat(
    [reviews.loc[reviews['stars'] == rating, :].sample(n=20000,
                                                        random_state=2023,
                                                        replace=False)
     for rating in range(1, 6)]
)
yelp_oversample.reset_index(drop=True, inplace=True)
yelp_oversample.to_csv("yelp_oversample_20k_per_rating.csv")

In [20]:
#testing whether oversample worked right
# for i in range(1,6):
#     print(yelp_oversample.loc[(yelp_oversample.loc[:,'stars'] == i)].shape)

(20000, 3)
(20000, 3)
(20000, 3)
(20000, 3)
(20000, 3)
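
A more compact way to run the same check is `value_counts()`, which tallies every rating in one call. This is a sketch on a made-up balanced frame; with the real data, the hypothetical `balanced` frame would be replaced by `yelp_oversample`.

```python
import pandas as pd

# Made-up balanced frame: 4 rows per star rating
balanced = pd.DataFrame({
    "stars": [1, 2, 3, 4, 5] * 4,
    "text": ["placeholder review"] * 20,
})

# One count per rating, ordered from 1 star to 5 stars
counts = balanced["stars"].value_counts().sort_index()
print(counts)

# The sample is balanced if every rating has the same count
is_balanced = counts.nunique() == 1
print(is_balanced)
```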
