# Converting Yelp Review Dataset to CSV Format

Because the raw data is stored as line-delimited JSON (one JSON object per line), we need to convert it to CSV format.
The dataset is also large, so we will take a sample of it at the end.

## Importing libraries

In [13]:
import csv
import json
import os

import pandas as pd
from tqdm import tqdm

## Constants

In [14]:
raw_file_path = "data/yelp_academic_dataset_review.json"
csv_file_path = "data/reviews.csv"
sample_csv_file_path = "data/reviews_sample.csv"

## File Information and Structure

In [15]:
file_size = os.path.getsize(raw_file_path)
file_size_gb = file_size / (1024**3)
print(f"File Size: {file_size_gb:.2f} GB")

print("File head:")
with open(raw_file_path, "r", encoding="utf-8") as file:
    for _ in range(5):  # Read the first 5 lines
        print(file.readline().strip())

File Size: 4.98 GB
File head:
{"review_id":"KU_O5udG6zpxOg-VcAEodg","user_id":"mh_-eMZ6K5RLWhZyISBhwA","business_id":"XQfwVwDr-v0ZS3_CbbE5Xw","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.","date":"2018-07-07 22:09:11"}
{"review_id":"BiTunyQ73aT9WBnpR9DZGw","user_id":"OyoGAe7OKpv6SyGZT5g77Q","business_id":"7ATYjTIgM3jUlt4UM3IypQ","stars":5.0,"useful":1,"funny":0,"cool":1,"text":"I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle.

## Reading the raw data

In [16]:
%%time
with open(raw_file_path, "r", encoding="utf-8") as file:
    data = []
    for line in tqdm(file, desc="Reading lines"):
        data.append(json.loads(line))

Reading lines: 6990280it [00:24, 286092.96it/s]


CPU times: user 19 s, sys: 3.22 s, total: 22.2 s
Wall time: 24.5 s


## Saving to CSV

In [17]:
with open(csv_file_path, "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = data[0].keys()
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for row in tqdm(data, desc="Writing rows"):
        writer.writerow(row)

Writing rows: 100%|██████████| 6990280/6990280 [02:12<00:00, 52669.34it/s]
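
The read and write steps above keep every parsed record in a Python list before anything is written. If memory is tight, the same conversion can probably be done one line at a time. The following is a minimal sketch of that idea, not the approach used in this notebook; the function name `jsonl_to_csv` is hypothetical, and it assumes every record has the same keys as the first one.

```python
import csv
import json


def jsonl_to_csv(src_path, dst_path):
    """Stream a JSON Lines file to CSV one record at a time; return the row count."""
    rows_written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = None
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if writer is None:
                # Assumption: the first record's keys define the CSV columns.
                writer = csv.DictWriter(dst, fieldnames=list(record.keys()))
                writer.writeheader()
            writer.writerow(record)
            rows_written += 1
    return rows_written
```

Called as `jsonl_to_csv(raw_file_path, csv_file_path)`, this should produce the same columns as the cell above while holding only one record in memory at a time. A record with a key missing from the first record would make `DictWriter` raise a `ValueError`.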


## Loading into DataFrame

In [18]:
df = pd.read_csv(csv_file_path, low_memory=True)
print(f"DataFrame shape: {df.shape}")

DataFrame shape: (6990280, 9)
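
The nine columns match the keys visible in the file head (`review_id`, `user_id`, `business_id`, `stars`, `useful`, `funny`, `cool`, `text`, `date`). Declaring column types up front may reduce memory use and gives a proper datetime column. The sketch below is one possible way to do that; the names `REVIEW_DTYPES` and `load_reviews` are hypothetical, and it assumes the numeric columns contain no missing values.

```python
import pandas as pd

# Assumption: these columns are always present and never empty.
REVIEW_DTYPES = {
    "stars": "float32",
    "useful": "int32",
    "funny": "int32",
    "cool": "int32",
}


def load_reviews(csv_source):
    """Read the reviews CSV with explicit numeric dtypes and a parsed date column."""
    return pd.read_csv(csv_source, dtype=REVIEW_DTYPES, parse_dates=["date"])
```

For example, `load_reviews(csv_file_path).dtypes` would show the narrower numeric types and a datetime `date` column. If any of the integer columns did contain blanks, `read_csv` would raise an error and a nullable type such as `"Int32"` would be needed instead.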


## Sampling the dataset to reduce size

In [22]:
sample_df = df.sample(frac=0.01, random_state=42)  # Taking a 1% sample
print(f"Sample DataFrame shape: {sample_df.shape}")

Sample DataFrame shape: (69903, 9)
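
Sampling with `df.sample` requires the full DataFrame in memory first. Where that is not practical, a rough alternative is to read the CSV in chunks and sample each chunk with the same `frac`. This is only a sketch under assumptions: the function name `sample_csv_in_chunks` and the chunk size are hypothetical, and the result is not the same set of rows that a single `df.sample` call would select.

```python
import pandas as pd


def sample_csv_in_chunks(csv_path, frac, chunksize=100_000, random_state=42):
    """Sample roughly `frac` of the rows of a CSV without loading it whole."""
    parts = []
    reader = pd.read_csv(csv_path, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Offset the seed per chunk so chunks do not reuse the same row positions.
        parts.append(chunk.sample(frac=frac, random_state=random_state + i))
    # Assumption: the file has at least one data row, so `parts` is not empty.
    return pd.concat(parts, ignore_index=True)
```

Because each chunk is sampled independently, the total row count is the sum of per-chunk rounded counts and can differ slightly from `round(frac * total_rows)`.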


## Saving the sample DataFrame to CSV

In [24]:
sample_df.to_csv(sample_csv_file_path, index=False)