## Title
Raw Data Exploration

### By:
Juan Gómez

### Date:
2024-05-16

### Description:

This notebook explores a raw extract of the Yelp Open Dataset. It reports basic statistics, checks for missing values, and looks at trends in popularity, business categories, and ratings. The goal is to understand the data before building a recommendation system.

## Import libraries

In [1]:
from pathlib import Path

import pandas as pd
import pyarrow as pa

## Load data

In [2]:
pd.set_option("display.max_columns", None)
BASE_DIR = Path.cwd().resolve().parents[1]

In [3]:
df = pd.read_parquet(BASE_DIR / "data/01_raw/review_user_business_extract.parquet")

## Exploration

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000100 entries, 0 to 1000099
Data columns (total 43 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   review_id              1000100 non-null  object        
 1   user_id                1000100 non-null  object        
 2   business_id            1000100 non-null  object        
 3   stars                  1000100 non-null  int64         
 4   useful                 1000100 non-null  int64         
 5   funny                  1000100 non-null  int64         
 6   cool                   1000100 non-null  int64         
 7   text                   1000100 non-null  object        
 8   date                   1000100 non-null  datetime64[ns]
 9   name                   1000092 non-null  object        
 10  review_count           1000092 non-null  float64       
 11  yelping_since          1000092 non-null  object        
 12  useful_user            10000

In [5]:
df.sample(5)

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,name,review_count,yelping_since,useful_user,funny_user,cool_user,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos,name_business,address,city,state,postal_code,latitude,longitude,stars_business,review_count_business,is_open,attributes,categories,hours
193295,dlZld5VjuF3UlVf1bneHBQ,aJ-RI7oOjhZIYfEBNswUrQ,2F-TFWmAc-rkLBmK_ZoTOw,1,1,1,1,a few years back i was training here for a pos...,2021-02-28 19:37:09,Gracie,5.0,2021-01-18 02:39:17,2.0,1.0,1.0,,,0.0,4.2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,Naples At The Warehouse,2 S Main St,Mullica Hill,NJ,08062,39.735601,-75.22498,3.5,95,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Italian, Sandwiches, Restaurants, Pizza","{'Friday': '10:0-23:0', 'Monday': None, 'Satur..."
216212,VEGDz_FY9doYT1R7N48j1Q,qMWKHCNOzB6dzWiBRpr2og,c7WZXqCRHWSJWEkD6Fbswg,5,1,0,0,Great resource for those of us who love readin...,2021-01-22 01:28:29,Bob,1.0,2015-11-05 18:31:25,1.0,0.0,0.0,,,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Wee Book Inn,10332 Jasper Avenue,Edmonton,AB,T5J 1Y7,53.541075,-113.498411,4.5,8,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Bookstores, Shopping, Books, Mags, Music & Video","{'Friday': '9:0-0:0', 'Monday': '9:0-22:0', 'S..."
937153,REfd8tAyY_AA812N8s0Nmw,3q2cQC60mNmSHH3LYC_j_g,tBMy2DhoMF3SbZqB-4yEtA,2,5,0,0,"To qualify my review, I'll say that I shop at ...",2018-08-14 19:29:08,Aaron,237.0,2012-06-03 19:46:05,577.0,84.0,280.0,2015201620172018201920202021,"zdiMak6UuWRV5AlvsB8_7A, _8HnZjh_XZwQLmuCM1rNVQ...",23.0,3.91,22.0,6.0,0.0,0.0,0.0,88.0,22.0,27.0,27.0,17.0,5.0,All My Relations,7218 Rockville Rd,Indianapolis,IN,46214,39.764877,-86.287506,3.5,20,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Health & Medical, Doctors, Jewelry, Naturopath...","{'Friday': '11:0-19:0', 'Monday': '11:0-19:0',..."
336314,Wl7DSbCeIkRQyQ5XNwRBkg,e1-FrjkLIrrIueMBJwb8og,6TNz9PRdx14NgL24f880dQ,5,1,1,1,Still amazing. Props to them and the way they'...,2020-07-07 00:03:19,Carly,85.0,2019-05-01 18:33:43,99.0,35.0,80.0,201920202021,"vsYzJgUK6QV9m0bexiV2aQ, NiHBWzjE16B8hJOHyyou0g...",12.0,4.61,1.0,5.0,1.0,0.0,0.0,6.0,4.0,7.0,7.0,2.0,3.0,El Mariachi Mexican Restaurant,614 Thompson Ln,Nashville,TN,37204,36.111524,-86.753715,3.5,78,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Food, Desserts, Breakfast & Brunch, Mexican, R...","{'Friday': '11:0-22:30', 'Monday': '0:0-0:0', ..."
681053,qV7-Om9qU-Yx58sUaidWCA,xalgcjscRLNPuyaAeKNThA,a-pcOIKQTLdYsPyrDbUMhw,4,2,0,0,This restaurant is located near the Capitol. I...,2019-05-05 13:04:17,Tank,1647.0,2016-10-09 18:10:18,4223.0,228.0,2725.0,2018201920202021,"cUTAO7xfjpRuOcXefbF2Bw, Cu6dcKerS5U4Pv47Kn6UyQ...",45.0,4.05,18.0,0.0,2.0,0.0,0.0,45.0,20.0,50.0,50.0,36.0,20.0,Lazeez,115 W Market St,Indianapolis,IN,46204,39.768473,-86.160644,3.5,121,0,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Salad, Restaurants, Hookah Bars, Bars, Nightli...","{'Friday': '10:30-1:0', 'Monday': '10:30-21:0'..."


### Null Values

In [6]:
print("\nNull values in the Yelp dataset:")
null_counts = df.isnull().sum()
display(null_counts[null_counts > 0].sort_values(ascending=False))


Null values in the Yelp dataset:


hours                 49876
attributes            37401
categories              136
review_count              8
compliment_photos         8
compliment_writer         8
compliment_funny          8
compliment_cool           8
compliment_plain          8
compliment_note           8
compliment_list           8
compliment_cute           8
name                      8
compliment_more           8
compliment_hot            8
average_stars             8
fans                      8
friends                   8
elite                     8
cool_user                 8
funny_user                8
useful_user               8
yelping_since             8
compliment_profile        8
dtype: int64

In [7]:
df2 = df.dropna(subset=["useful_user", "funny_user", "cool_user", "fans"])
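Every user-level column in the null report above is missing in the same number of rows (8), so dropping on a handful of them should remove exactly those rows. A minimal sketch of that check on a made-up three-row frame (the `toy` data and the `user_cols` name are illustrative, not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature of the joined table: the second review has no user data.
toy = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3"],
        "useful_user": [2.0, None, 7.0],
        "funny_user": [1.0, None, 0.0],
        "cool_user": [1.0, None, 3.0],
        "fans": [0.0, None, 12.0],
    }
)

user_cols = ["useful_user", "funny_user", "cool_user", "fans"]

# Rows where any user-level field is missing.
missing_user = toy[user_cols].isnull().any(axis=1)

cleaned = toy.dropna(subset=user_cols)
rows_dropped = len(toy) - len(cleaned)

# dropna(subset=...) removes exactly the rows flagged above.
assert rows_dropped == int(missing_user.sum())
print(f"Dropped {rows_dropped} row(s); kept index {list(cleaned.index)}")
```

Note that `dropna` keeps the original index labels, so the result has gaps in its index unless `reset_index(drop=True)` is applied afterwards.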

In [8]:
print("\nColumns with more than 30% missing values:")
null_threshold = 30
null_percent = df.isnull().mean() * 100  # percentage of null values per column
display(null_percent[null_percent > null_threshold].sort_values(ascending=False))


Columns with more than 30% missing values:


Series([], dtype: float64)
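The result is empty because even the sparsest column, `hours`, is missing in only 49876 of 1000100 rows, which is about 5%. Showing counts and percentages side by side makes that easier to see. The sketch below does this on a small made-up frame; `null_summary` and `toy` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy frame standing in for the real extract (values are made up).
toy = pd.DataFrame(
    {
        "hours": [None, "9-5", None, "10-6"],
        "categories": ["Food", None, "Bars", "Pizza"],
        "stars": [5, 4, 3, 1],
    }
)


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns with at least one null."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame(
        {"null_count": counts, "null_percent": counts / len(frame) * 100}
    )
    return summary[summary["null_count"] > 0].sort_values(
        "null_count", ascending=False
    )


summary = null_summary(toy)
print(summary)
```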

### Remove columns

In [9]:
df3 = df2.drop(
    columns=[
        "user_id",
        "business_id",
        "name",
        "yelping_since",
        "latitude",
        "friends",
        "postal_code",
        "longitude",
        "compliment_hot",
        "compliment_more",
        "compliment_profile",
        "compliment_cute",
        "compliment_list",
        "compliment_note",
        "compliment_plain",
        "compliment_cool",
        "compliment_funny",
        "compliment_writer",
        "compliment_photos",
        "attributes",
        "hours",
    ]
)
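Eleven of the dropped columns share the `compliment_` prefix, so one possible way to shorten the list is to select them by prefix. This is only a sketch on an empty made-up frame; `toy`, `compliment_cols`, and `other_cols` are illustrative names.

```python
import pandas as pd

# Empty frame with a small, made-up subset of the real column names.
toy = pd.DataFrame(
    columns=[
        "review_id",
        "user_id",
        "stars",
        "compliment_hot",
        "compliment_more",
        "hours",
    ]
)

# Collect every column that starts with the shared prefix.
compliment_cols = [c for c in toy.columns if c.startswith("compliment_")]

# Columns without a common prefix are still listed explicitly.
other_cols = ["user_id", "hours"]

trimmed = toy.drop(columns=other_cols + compliment_cols)
print(list(trimmed.columns))
```

The explicit list used in the notebook is more verbose but also safer, since a prefix match would silently pick up any new `compliment_` column added upstream.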

In [10]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000092 entries, 0 to 1000099
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   review_id              1000092 non-null  object        
 1   stars                  1000092 non-null  int64         
 2   useful                 1000092 non-null  int64         
 3   funny                  1000092 non-null  int64         
 4   cool                   1000092 non-null  int64         
 5   text                   1000092 non-null  object        
 6   date                   1000092 non-null  datetime64[ns]
 7   review_count           1000092 non-null  float64       
 8   useful_user            1000092 non-null  float64       
 9   funny_user             1000092 non-null  float64       
 10  cool_user              1000092 non-null  float64       
 11  elite                  1000092 non-null  object        
 12  fans                   1000092 no

In [11]:
df3.shape

(1000092, 22)

### Categorical Variables

In [12]:
cols_categoric = [
    "stars",
    "elite",
    "city",
    "state",
]

In [13]:
df3[cols_categoric] = df3[cols_categoric].astype("category")

- Ordinal: stars

- Nominal: elite, city, state
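A plain `astype("category")` produces an unordered categorical, which is why the schema further down reports `ordered=0` for `stars`. If the ordinal nature of `stars` should be encoded in the dtype, one option is an explicit `pd.CategoricalDtype`. The sketch below uses made-up ratings; `star_dtype` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({"stars": [5, 1, 3, 4, 5, 2]})

# Declare the five star levels explicitly and mark them as ordered.
star_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
toy["stars"] = toy["stars"].astype(star_dtype)

print(toy["stars"].cat.ordered)

# min/max and inequality comparisons only work on ordered categoricals.
print(toy["stars"].min(), toy["stars"].max())
high = toy["stars"] >= 4
print(int(high.sum()))
```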

### Numerical Variables

In [14]:
cols_numeric = [
    "useful",
    "funny",
    "cool",
    "review_count",
    "useful_user",
    "funny_user",
    "cool_user",
    "fans",
    "average_stars",
]

- Float

In [15]:
cols_numeric_float = ["average_stars"]

In [16]:
df3[cols_numeric_float] = df3[cols_numeric_float].astype("float")

- Int

In [17]:
cols_numeric_int = [
    "useful",
    "funny",
    "cool",
    "review_count",
    "useful_user",
    "funny_user",
    "cool_user",
    "fans",
]

In [18]:
df3[cols_numeric_int] = df3[cols_numeric_int].astype("int32")
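Columns such as `review_count` and `fans` were loaded as `float64`, presumably because the eight missing rows forced a float representation. Before casting them to `int32`, it can be worth confirming that no fractional or out-of-range values would be lost. A minimal sketch, assuming a hypothetical helper `can_cast_to_int32` and made-up data:

```python
import numpy as np
import pandas as pd


def can_cast_to_int32(series: pd.Series) -> bool:
    """Return True if a float series holds only whole, in-range, non-null values."""
    if series.isnull().any():
        return False
    info = np.iinfo(np.int32)
    whole = (series % 1 == 0).all()
    in_range = series.between(info.min, info.max).all()
    return bool(whole and in_range)


toy = pd.DataFrame(
    {
        "review_count": [5.0, 237.0, 1647.0],  # whole numbers stored as float
        "average_stars": [4.2, 3.91, 4.05],  # genuinely fractional
    }
)

for column in toy.columns:
    print(column, can_cast_to_int32(toy[column]))
```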

### Boolean Variables

In [19]:
cols_boolean = ["is_open"]

In [20]:
df3[cols_boolean] = df3[cols_boolean].astype("bool")

### String Variables

In [21]:
cols_string = ["review_id", "text", "address", "categories"]

In [22]:
df3[cols_string] = df3[cols_string].astype("object")

### Date Variables

In [23]:
col_date = ["date"]

In [24]:
df3[col_date] = df3[col_date].astype("datetime64[ns]")
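The per-group casts above could also be expressed as a single column-to-dtype mapping passed to `DataFrame.astype`, which keeps the whole target schema in one place. The sketch below uses a made-up two-row frame and only a subset of the columns; `dtype_map` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "stars": [5, 1],
        "useful": [1, 5],
        "fans": [0.0, 23.0],
        "average_stars": [4.2, 3.91],
        "is_open": [1, 0],
        "date": pd.to_datetime(["2021-02-28 19:37:09", "2018-08-14 19:29:08"]),
    }
)

# One mapping from column name to target dtype (illustrative subset).
dtype_map = {
    "stars": "category",
    "useful": "int32",
    "fans": "int32",
    "average_stars": "float64",
    "is_open": "bool",
    "date": "datetime64[ns]",
}

typed = toy.astype(dtype_map)
print(typed.dtypes)
```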

### Schema

In [25]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000092 entries, 0 to 1000099
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   review_id              1000092 non-null  object        
 1   stars                  1000092 non-null  category      
 2   useful                 1000092 non-null  int32         
 3   funny                  1000092 non-null  int32         
 4   cool                   1000092 non-null  int32         
 5   text                   1000092 non-null  object        
 6   date                   1000092 non-null  datetime64[ns]
 7   review_count           1000092 non-null  int32         
 8   useful_user            1000092 non-null  int32         
 9   funny_user             1000092 non-null  int32         
 10  cool_user              1000092 non-null  int32         
 11  elite                  1000092 non-null  category      
 12  fans                   1000092 no

In [26]:
schema = pa.Schema.from_pandas(df3, preserve_index=False)

In [27]:
schema

review_id: string
stars: dictionary<values=int64, indices=int8, ordered=0>
useful: int32
funny: int32
cool: int32
text: string
date: timestamp[ns]
review_count: int32
useful_user: int32
funny_user: int32
cool_user: int32
elite: dictionary<values=string, indices=int16, ordered=0>
fans: int32
average_stars: double
name_business: string
address: string
city: dictionary<values=string, indices=int16, ordered=0>
state: dictionary<values=string, indices=int8, ordered=0>
stars_business: double
review_count_business: int64
is_open: bool
categories: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 2774

## Basic Statistics

In [28]:
df3.describe(include=["object", "category"])

Unnamed: 0,review_id,stars,text,elite,name_business,address,city,state,categories
count,1000092,1000092,1000092,1000092.0,1000092,1000092.0,1000092,1000092,999956
unique,1000092,5,999102,700.0,79536,91369.0,1228,24,62454
top,LgfSWgq5DzgoFzNW6YwfSg,5,So they have policy changes and NEVER Let the ...,,McDonald's,,Philadelphia,PA,"Restaurants, Mexican"
freq,1,463862,7,610308.0,3022,14713.0,113054,200366,7277
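The table shows 1000092 reviews but only 999102 unique texts, so 990 rows repeat a text that already appears elsewhere, and the most common text occurs 7 times. A minimal sketch of how such repeats could be located, using made-up reviews (`toy`, `repeat_mask`, and `repeated_texts` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3", "r4", "r5"],
        "text": ["Great food", "Never again", "Great food", "Solid", "Great food"],
    }
)

# Rows whose text already appeared earlier in the frame.
repeat_mask = toy["text"].duplicated(keep="first")
n_repeats = int(repeat_mask.sum())

# How often each repeated text occurs in total.
text_counts = toy["text"].value_counts()
repeated_texts = text_counts[text_counts > 1]

print(n_repeats)
print(repeated_texts)
```

Whether repeated texts are spam, templated replies, or genuine coincidences cannot be told from the counts alone; inspecting the matching rows is the next step.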


In [29]:
MAX_UNIQUE_DISPLAY = 20

for column in cols_numeric:
    print(f"\n🔍 Column analysis: {column}")
    print("-" * 50)

    # Step 1: Summary statistics
    print("📊 Summary statistics:")
    print(df3[column].describe())

    # Step 2: Unique values (limit if too many)
    unique_vals = df3[column].unique()
    print(f"\n🔢 Unique values ({len(unique_vals)}):")
    print(
        unique_vals
        if len(unique_vals) <= MAX_UNIQUE_DISPLAY
        else unique_vals[:MAX_UNIQUE_DISPLAY]
    )

    # Step 3: Value counts (only if few unique values)
    if df3[column].nunique() <= MAX_UNIQUE_DISPLAY:
        print("\n📈 Value counts:")
        print(df3[column].value_counts().sort_index())


🔍 Column analysis: useful
--------------------------------------------------
📊 Summary statistics:
count    1.000092e+06
mean     2.436598e+00
std      4.411195e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      5.390000e+02
Name: useful, dtype: float64

🔢 Unique values (193):
[ 1  3  2  6  8  5  4 10  7 25 23 52 11 13 12 18  9 67 14 15]

🔍 Column analysis: funny
--------------------------------------------------
📊 Summary statistics:
count    1.000092e+06
mean     5.257036e-01
std      2.165200e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.800000e+02
Name: funny, dtype: float64

🔢 Unique values (119):
[ 0  1  2  4  3  5  6  7 27  9 18  8 10 19 21 23 26 16 13 11]

🔍 Column analysis: cool
--------------------------------------------------
📊 Summary statistics:
count    1.000092e+06
mean     1.156940e+00
std      3.798284e+00
min      0.000000e+00
25%      0
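The vote counts are heavily right-skewed: for `useful`, the median is 1 and the 75th percentile is 2, while the maximum is 539. For columns like this, upper quantiles and a log transform tend to be more informative than the mean and standard deviation. The sketch below uses made-up counts; the cutoff of 10 is arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up vote counts with one extreme value, mimicking a long right tail.
votes = pd.Series([1, 1, 1, 2, 1, 3, 2, 1, 539, 1], name="useful")

# Upper quantiles say more about the tail than mean and std do.
print(votes.quantile([0.5, 0.9, 0.99]))

# log1p compresses the tail and is defined at zero.
log_votes = np.log1p(votes)
print(log_votes.describe())

# Share of rows above a chosen cutoff.
share_above = (votes > 10).mean()
print(share_above)
```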

## Test Data Validation

In [30]:
import os

# Move to the project root so that the `src` package is importable (machine-specific path).
os.chdir("/Users/agomezj/Desktop/Juan-G/ml-message-classifier/")
print(os.getcwd())

/Users/agomezj/Desktop/Juan-G/ml-message-classifier


In [31]:
extract = pd.read_parquet(BASE_DIR / "data/01_raw/review_user_business_extract.parquet")

In [32]:
from src.pipelines.feature_pipeline.feature_pipeline import feature_pipeline

In [33]:
validate = feature_pipeline.named_steps["validate"].fit_transform(extract)

2025-05-20 11:21:27.149 | INFO     | src.data.validate:transform:137 - Starting data validation process
2025-05-20 11:21:27.150 | INFO     | src.data.validate:transform:140 - Dropping unnecessary columns
2025-05-20 11:21:27.445 | INFO     | src.data.validate:transform:145 - Converting columns to appropriate data types
2025-05-20 11:21:27.446 | DEBUG    | src.data.validate:_safe_cast:84 - Casting to category: ['stars', 'elite', 'city', 'state']
2025-05-20 11:21:27.539 | DEBUG    | src.data.validate:_safe_cast:84 - Casting to float: ['average_stars']
2025-05-20 11:21:27.541 | DEBUG    | src.data.validate:_safe_cast:84 - Casting to int32: ['useful', 'funny'