## Title
Raw Data Extraction

### By:
Juan Gómez

### Date:
2025-05-16

### Description:

This notebook loads and joins the user, review, and business tables of the Yelp dataset to build the training data structure.
It extracts user, review, and business metadata to create a clean dataset for message classification.

# Data load

## 1. Imports and configuration


In [1]:
from pathlib import Path

BASE_DIR = Path.cwd().resolve().parents[1]
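
`BASE_DIR` depends on where the notebook is started: `parents[1]` climbs two levels from the working directory. A minimal sketch of that lookup, assuming a hypothetical layout in which the notebook sits two folders below the project root (the path here is made up for illustration):

```python
from pathlib import PurePosixPath

# Hypothetical working directory, two levels below the project root.
cwd = PurePosixPath("/home/me/project/notebooks/extraction")

print(cwd.parents[0])  # /home/me/project/notebooks
print(cwd.parents[1])  # /home/me/project
```

If the notebook is launched from a different folder depth, the index passed to `parents` has to change accordingly.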

In [2]:
import pandas as pd

In [3]:
pd.set_option("display.max_columns", None)

## 2. Load dataset

### Business

In [4]:
df_business = pd.read_json(BASE_DIR / "data/01_raw/yelp_academic_dataset_business.json", lines=True)

### User

In [5]:
# Read the user file in chunks to limit memory use; keep only users with at least one review.
df_user = pd.concat(
    [
        chunk[chunk["review_count"] > 0]
        for chunk in pd.read_json(
            BASE_DIR / "data/01_raw/yelp_academic_dataset_user.json",
            lines=True,
            chunksize=100000
        )
    ],
    ignore_index=True
)

### Review

In [6]:
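# Read the review file in chunks; keep only reviews with non-empty text and at least one "useful" vote.
df_review = pd.concat(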
    [
        chunk[(chunk["text"].str.strip() != "") & (chunk["useful"] > 0)]
        for chunk in pd.read_json(
            BASE_DIR / "data/01_raw/yelp_academic_dataset_review.json",
            lines=True,
            chunksize=100000
        )
    ],
    ignore_index=True
)
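
The user and review cells share one pattern: read a JSON Lines file in chunks, filter each chunk, and concatenate the survivors. The sketch below wraps that pattern in a helper and runs it on a tiny in-memory sample. The name `read_json_filtered` and the sample rows are assumptions made for illustration and are not part of the notebook.

```python
from io import StringIO

import pandas as pd


def read_json_filtered(source, keep, chunksize=100_000):
    """Read a JSON Lines source in chunks, keeping rows where keep(chunk) is True."""
    reader = pd.read_json(source, lines=True, chunksize=chunksize)
    return pd.concat([chunk[keep(chunk)] for chunk in reader], ignore_index=True)


# Made-up stand-in for the review file: one valid row, one blank text, one with no votes.
sample = StringIO(
    '{"review_id": "r1", "text": "Great food", "useful": 2}\n'
    '{"review_id": "r2", "text": "   ", "useful": 5}\n'
    '{"review_id": "r3", "text": "Slow service", "useful": 0}\n'
)

df_sample = read_json_filtered(
    sample,
    keep=lambda c: (c["text"].str.strip() != "") & (c["useful"] > 0),
    chunksize=2,  # small on purpose, so the sample spans two chunks
)
print(df_sample["review_id"].tolist())
```

Filtering inside the loop means only the retained rows of each chunk stay in memory, which is the point of reading with `chunksize`.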

In [7]:
# Keep the 1,500,000 most recent reviews.
df_review2 = df_review.sort_values("date", ascending=False).head(1_500_000)

## 3. Enrich reviews with user data

In [21]:
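# Left join keeps every review; user columns that clash with review columns get the "_user" suffix.
df_review3 = df_review2.merge(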
    df_user,
    on=["user_id"],
    how="left",
    suffixes=("", "_user")
)
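
A left join on `user_id` should keep exactly one row per review, provided `user_id` is unique in the user table. The sketch below shows one way to check that with two real `merge` parameters, `validate` and `indicator`. The toy frames are made up for illustration.

```python
import pandas as pd

reviews = pd.DataFrame(
    {"review_id": ["r1", "r2", "r3"], "user_id": ["u1", "u2", "u9"], "useful": [1, 3, 2]}
)
users = pd.DataFrame(
    {"user_id": ["u1", "u2"], "name": ["Ana", "Luis"], "useful": [10, 20]}
)

merged = reviews.merge(
    users,
    on="user_id",
    how="left",
    suffixes=("", "_user"),
    validate="many_to_one",  # raises MergeError if user_id repeats in `users`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Reviews whose user_id has no match in the user table.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)  # 3 1
```

Rows flagged `left_only` are reviews whose user was filtered out or is missing, so their user columns are null.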

## 4. Enrich reviews with business data

In [31]:
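# Left join keeps every review; business columns that clash with existing ones
# get the "_business" suffix so they are not mislabeled as user columns.
df_review4 = df_review3.merge(
    df_business,
    on=["business_id"],
    how="left",
    suffixes=("", "_business")
)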
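
Assuming the standard Yelp schema, the business table shares column names such as `name`, `stars`, and `review_count` with the columns already present after the user join, so the suffix decides how the business versions are labeled. This sketch uses made-up rows and a `_business` suffix to keep the two origins apart:

```python
import pandas as pd

# Made-up row standing in for a review already enriched with user columns.
reviews_users = pd.DataFrame(
    {"business_id": ["b1"], "stars": [4], "name": ["Ana"], "review_count": [12]}
)
business = pd.DataFrame(
    {"business_id": ["b1"], "name": ["Cafe Sol"], "stars": [4.5], "review_count": [230]}
)

merged = reviews_users.merge(
    business, on="business_id", how="left", suffixes=("", "_business")
)
print(sorted(merged.columns))
```

The left-hand columns keep their names, and each clashing business column appears with the suffix, for example `name_business` and `stars_business`.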

## 5. Save raw review data

In [42]:
df_review4.to_parquet(
    BASE_DIR / "data/01_raw/data_message_classifier_raw.parquet",
    index=False
)