## Title
Raw Data Extraction

### By:
Juan Gómez

### Date:
2025-05-16

### Description:

This notebook loads the Yelp dataset files and joins them to build the training structure.
It extracts user, review, and business metadata to create a clean dataset for message classification.

# Data load

## Imports and configuration

In [1]:
from pathlib import Path

BASE_DIR = Path.cwd().resolve().parents[1]
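As a small illustration of how `parents[1]` picks the project root, the sketch below uses a hypothetical working directory (the path is an assumption, not the author's actual layout) and `PurePosixPath` so that it behaves the same on any platform.

```python
from pathlib import PurePosixPath

# Hypothetical working directory: a notebook folder two levels below the project root.
cwd = PurePosixPath("/home/user/project/notebooks/exploration")

# parents[0] is the immediate parent; parents[1] is one level above that.
base_dir = cwd.parents[1]

# Data paths are then built relative to the project root.
raw_file = base_dir / "data/01_raw/yelp_academic_dataset_business.json"
```

If the notebook is started from a different folder depth, the index passed to `parents` would need to change accordingly.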

In [2]:
import pandas as pd

In [3]:
pd.set_option("display.max_columns", None)

## Load dataset

### Business

In [4]:
# df_business = pd.read_json(BASE_DIR / "data/01_raw/yelp_academic_dataset_business.json", lines=True)

### User

In [5]:
# df_user = pd.concat(
#     [
#         chunk[chunk["review_count"] > 0]
#         for chunk in pd.read_json(
#             BASE_DIR / "data/01_raw/yelp_academic_dataset_user.json",
#             lines=True,
#             chunksize=100000,
#         )
#     ],
#     ignore_index=True,
# )
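The chunked load above can be sketched on a tiny synthetic file. The helper name `load_filtered_jsonl` and the sample rows are assumptions made for illustration; only `pd.read_json(..., lines=True, chunksize=...)` and `pd.concat` come from the cell itself.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_filtered_jsonl(path, keep, chunksize=100_000):
    """Read a JSON Lines file in chunks, keeping rows where keep(chunk) is True."""
    # The reader yields one DataFrame per chunk, so only one chunk
    # plus the already-filtered rows are held in memory at a time.
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        kept = [chunk[keep(chunk)] for chunk in reader]
    return pd.concat(kept, ignore_index=True)


# Synthetic stand-in for the user file (hypothetical rows, not Yelp data).
rows = [
    {"user_id": "u1", "review_count": 3},
    {"user_id": "u2", "review_count": 0},
    {"user_id": "u3", "review_count": 7},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "users.json"
    sample_path.write_text("\n".join(json.dumps(r) for r in rows))
    df_user_sample = load_filtered_jsonl(
        sample_path,
        keep=lambda chunk: chunk["review_count"] > 0,
        chunksize=2,
    )
```

Filtering inside the loop, rather than after concatenation, is what keeps peak memory low when most rows are discarded.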

### Review

In [6]:
# df_review = pd.concat(
#     [
#         chunk[(chunk["text"].str.strip() != "") & (chunk["useful"] > 0)]
#         for chunk in pd.read_json(
#             BASE_DIR / "data/01_raw/yelp_academic_dataset_review.json",
#             lines=True,
#             chunksize=100000,
#         )
#     ],
#     ignore_index=True,
# )

In [7]:
# df_review2 = df_review.sort_values("date", ascending=False).head(1000100)
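A minimal sketch of the two review steps above (drop blank or unhelpful reviews, then keep the most recent ones), run on a hypothetical four-row frame. The sample rows and the `n_latest` value are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical reviews; column names mirror the ones used in the cells above.
reviews = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3", "r4"],
        "text": ["Great tacos", "   ", "Slow service", "Fine"],
        "useful": [2, 5, 1, 0],
        "date": pd.to_datetime(
            ["2021-01-05", "2022-03-01", "2022-01-10", "2022-06-30"]
        ),
    }
)

# Keep reviews whose text is not blank after stripping whitespace
# and that received at least one "useful" vote.
has_text = reviews["text"].str.strip() != ""
is_useful = reviews["useful"] > 0
filtered = reviews[has_text & is_useful]

# Keep the n most recent of the remaining reviews.
n_latest = 1
latest = filtered.sort_values("date", ascending=False).head(n_latest)
```

Here `r2` is dropped for having whitespace-only text and `r4` for having no useful votes, so the most recent surviving review is `r3`.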

## Enrich reviews with user data

In [8]:
# df_review3 = df_review2.merge(df_user, on=["user_id"], how="left", suffixes=("", "_user"))

## Enrich reviews with business data

In [9]:
# df_review4 = df_review3.merge(df_business, on=["business_id"], how="left", suffixes=("", "_business"))
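The two enrichment merges can be sketched on toy frames. pandas applies `suffixes` only to column names that exist in both frames, so a distinct suffix per merge keeps the origin of each colliding column recoverable. The sample rows and the `validate="many_to_one"` check are assumptions added for illustration; they are not part of the original cells.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
reviews = pd.DataFrame(
    {
        "review_id": ["r1", "r2"],
        "user_id": ["u1", "u9"],
        "business_id": ["b1", "b1"],
        "stars": [5, 2],
    }
)
users = pd.DataFrame({"user_id": ["u1"], "name": ["Ana"]})
businesses = pd.DataFrame(
    {"business_id": ["b1"], "name": ["Cafe"], "stars": [4.5]}
)

# Left merge keeps every review, even when the user is missing (u9).
# validate raises if user_id is not unique in the right-hand frame.
enriched = reviews.merge(
    users, on="user_id", how="left",
    suffixes=("", "_user"), validate="many_to_one",
)

# "name" and "stars" now collide with business columns,
# so the business versions receive the "_business" suffix.
enriched = enriched.merge(
    businesses, on="business_id", how="left",
    suffixes=("", "_business"), validate="many_to_one",
)
```

Note that the user's `name` column carries no suffix here because nothing collided with it in the first merge; only the later business columns are renamed.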

## Save raw reviews data

In [10]:
# df_review4.to_parquet(BASE_DIR / "data/02_intermediate/data_message_classifier_interm.parquet", index=False)

# Test Data Extract

In [11]:
import os

os.chdir("/Users/agomezj/Desktop/Juan-G/ml-message-classifier/")
print(os.getcwd())

/Users/agomezj/Desktop/Juan-G/ml-message-classifier


In [12]:
from sklearn import set_config

from src.pipelines.feature_pipeline.feature_pipeline import feature_pipeline

In [13]:
extract = feature_pipeline.named_steps["extract"].transform(None)

2025-05-19 22:42:17.521 | INFO     | src.data.extract:_load_user:83 - Loading user data from /Users/agomezj/Desktop/Juan-G/ml-message-classifier/data/01_raw/yelp_academic_dataset_user.json
2025-05-19 22:42:36.773 | INFO     | src.data.extract:_load_review:102 - Loading review data from /Users/agomezj/Desktop/Juan-G/ml-message-classifier/data/01_raw/yelp_academic_dataset_review.json
2025-05-19 22:43:18.028 | INFO     | src.data.extract:_load_business:121 - Loading business data from /Users/agomezj/Desktop/Juan-G/ml-message-classifier/data/01_raw/yelp_academic_dataset_business.json
2025-05-19 22:43:22.323 | INFO     | src.data.extract:_merge_all:142 - Merging review, user, and business data
2025-05-19 22:43:39.449 | INFO     | src.data.extract:_save_if_

In [14]:
set_config(display="diagram")
feature_pipeline
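The project's `feature_pipeline` is not shown in this notebook, so the following is only a guess at its general shape: a scikit-learn `Pipeline` whose first step ignores its input and produces the data itself. The classes `ExtractStep` and `AddTextLength` and their placeholder rows are hypothetical; `Pipeline`, `named_steps`, and `set_config` are the real scikit-learn API.

```python
import pandas as pd
from sklearn import set_config
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ExtractStep(BaseEstimator, TransformerMixin):
    """Hypothetical source step: ignores its input and returns a DataFrame."""

    def fit(self, X=None, y=None):
        return self

    def transform(self, X=None):
        # Placeholder rows; a real step would presumably read the raw files.
        return pd.DataFrame(
            {"review_id": ["r1", "r2"], "text": ["Great tacos", "Slow service"]}
        )


class AddTextLength(BaseEstimator, TransformerMixin):
    """Hypothetical feature step: adds the character length of each review."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.assign(text_length=X["text"].str.len())


sketch_pipeline = Pipeline(
    [("extract", ExtractStep()), ("text_length", AddTextLength())]
)

# Render estimators as an HTML diagram when displayed in a notebook.
set_config(display="diagram")

# Run a single named step in isolation, as in the cells above.
extract_only = sketch_pipeline.named_steps["extract"].transform(None)
with_length = sketch_pipeline.named_steps["text_length"].transform(extract_only)
```

Calling a step through `named_steps` makes it possible to inspect one stage's output without running the rest of the pipeline.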