## Title
Raw Data Exploration

### By:
Juan Gómez

### Date:
2024-05-16

### Description:

This notebook explores the raw Yelp Open Dataset. It shows basic statistics, checks for missing values, and looks at trends in popularity, business categories, and ratings. The goal is to understand the data before building a recommendation system.

## Import libraries

In [1]:
from pathlib import Path

import pandas as pd
import pyarrow as pa

## Load data

In [2]:
pd.set_option("display.max_columns", None)
BASE_DIR = Path.cwd().resolve().parents[1]
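
`Path.cwd().resolve().parents[1]` climbs two directory levels from the working directory, so it only resolves to the project root when the notebook runs from a folder two levels below it. The sketch below illustrates the indexing with a made-up path; the directory names are assumptions, since the real project layout is not shown here.

```python
from pathlib import PurePosixPath

# Hypothetical working directory; the actual layout is not shown in this notebook.
cwd = PurePosixPath("/home/user/project/notebooks/eda")

# parents[0] is the immediate parent, parents[1] is one level above that.
base_dir = cwd.parents[1]
data_path = base_dir / "data/02_intermediate/example.parquet"

print(base_dir)
print(data_path)
```

If the notebook is moved to a different depth, the index passed to `parents` has to change accordingly.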

In [3]:
df = pd.read_parquet(BASE_DIR / "data/02_intermediate/data_message_classifier_interm.parquet")

## Exploration

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000100 entries, 0 to 1000099
Data columns (total 43 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   review_id           1000100 non-null  object        
 1   user_id             1000100 non-null  object        
 2   business_id         1000100 non-null  object        
 3   stars               1000100 non-null  int64         
 4   useful              1000100 non-null  int64         
 5   funny               1000100 non-null  int64         
 6   cool                1000100 non-null  int64         
 7   text                1000100 non-null  object        
 8   date                1000100 non-null  datetime64[ns]
 9   name                1000092 non-null  object        
 10  review_count        1000092 non-null  float64       
 11  yelping_since       1000092 non-null  object        
 12  useful_user         1000092 non-null  float64       
 13  funny_user  

In [5]:
df.sample(5)

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,name,review_count,yelping_since,useful_user,funny_user,cool_user,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos,name_user,address,city,state,postal_code,latitude,longitude,stars_user,review_count_user,is_open,attributes,categories,hours
295744,rNBmnFOUvnnTiCzhIxeI6w,-YAlSBsxzwE-Vg4hm1MDPg,9xW7LsJpyhVZFRb6z9xorg,5,6,0,2,Penne Arrabiata. It is a simple dish of pasta ...,2020-09-10 00:12:54,Jacob,162.0,2008-06-03 01:46:02,191.0,63.0,117.0,201920202021.0,"xZZJ-d4WVdA8ht2xLL1V4Q, mx4vcqwgSHMjJg0-JbCkpg...",21.0,4.28,6.0,6.0,1.0,0.0,0.0,7.0,12.0,11.0,11.0,6.0,0.0,Ca' Dario Goleta,"250 Storke Rd, Unit B",Goleta,CA,93117,34.42954,-119.869271,4.0,172,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Pizza, Nightlife, Italian, Bars, Restaurants, ...","{'Friday': '16:0-22:0', 'Monday': '16:0-21:0',..."
870728,CKnQQEwLrihL5OIaDOCL5Q,Vg0-Hj1sk92UhPBcoFqalw,GWqPmrWu0kXB_-gB1H-j6A,4,2,0,1,This review is long overdue! \nI finally had t...,2018-10-25 18:45:09,Nikky,159.0,2018-05-21 15:13:35,357.0,54.0,192.0,2018201920202021.0,"p45X9TF6ZlCjsvAG_ag6Xw, aYa5DvAzmOfdUOHJLiWAYg...",39.0,4.02,6.0,0.0,0.0,0.0,0.0,15.0,11.0,10.0,10.0,6.0,9.0,Love & Honey Fried Chicken,1100 N Front St,Philadelphia,PA,19123,39.967481,-75.136957,4.5,409,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Restaurants, Chicken Shop, Chicken Wings, Sand...","{'Friday': '11:0-16:0', 'Monday': '0:0-0:0', '..."
493788,VY4MH8fsEuBaFQ9D2KmGKQ,gMpdDTOmjnRAfVqFmE11dQ,eYoOM8C9mEpdZx_wAsqcug,4,1,0,1,This place is great! We tried the Carnitas Bu...,2019-11-12 12:09:27,Kaleb,53.0,2012-11-06 22:33:35,42.0,5.0,24.0,2020.0,"AHRrG3T1gJpHvtpZ-K0G_g, z8VkURoX-QGqs3pZNqNHkQ...",2.0,3.93,4.0,1.0,0.0,0.0,0.0,4.0,2.0,3.0,3.0,2.0,0.0,First Watch,3309 East 86th St,Indianapolis,IN,46240,39.91084,-86.1091,4.5,262,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","American (Traditional), Food, Restaurants, Caf...","{'Friday': '7:0-14:30', 'Monday': '0:0-0:0', '..."
877586,H7OJj5kdq0RpNJkBGlPziA,MfRZLLjL6gwiG2USIjIJtg,UQOR4jwKlNzxvgGjZTXIYA,5,3,2,4,It's rare that you find a hole in a wall and e...,2018-10-18 04:31:33,Tony,48.0,2018-05-26 15:51:27,110.0,71.0,43.0,2018.0,"cBx_dQqxv3-2qL5wUhoD5A, blr4eOJ2JS99rS46s_sRPQ...",8.0,3.42,5.0,2.0,0.0,0.0,0.0,13.0,7.0,6.0,6.0,6.0,4.0,Zwick's Pretzels,12415 - 107 Avenue NW,Edmonton,AB,T5M 1Z2,53.550662,-113.536536,4.5,97,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Desserts, Bakeries, Sandwiches, Food, Restaurants","{'Friday': '10:0-18:0', 'Monday': None, 'Satur..."
950624,nnr0XiPOPfFR98RA1yoeDQ,NhLnQfghAGOanvdOVROVDw,RSnJE62UUBttxJCHQY6FBA,1,1,0,0,Please do not EVER use this company and get st...,2018-08-02 00:24:36,Chloe,1.0,2018-08-02 00:24:34,1.0,0.0,0.0,,"hUXHgSfTFFz3cwVgNDzqxQ, IY7kifWf26GkV7Be3TH0aQ...",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Go Go Philly Movers,,Philadelphia,PA,19130,39.967819,-75.175462,4.0,35,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Home Services, Movers","{'Friday': '9:0-21:0', 'Monday': '9:0-21:0', '..."


### Null Values

In [6]:
print("\nNull values in the Yelp dataset:")
null_counts = df.isnull().sum()
display(null_counts[null_counts > 0].sort_values(ascending=False))


Null values in the Yelp dataset:


hours                 49876
attributes            37401
categories              136
review_count              8
compliment_photos         8
compliment_writer         8
compliment_funny          8
compliment_cool           8
compliment_plain          8
compliment_note           8
compliment_list           8
compliment_cute           8
name                      8
compliment_more           8
compliment_hot            8
average_stars             8
fans                      8
friends                   8
elite                     8
cool_user                 8
funny_user                8
useful_user               8
yelping_since             8
compliment_profile        8
dtype: int64

In [7]:
df2 = df.dropna(subset=["useful_user", "funny_user", "cool_user", "fans"])
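
Restricting `dropna` to a subset of user columns removes only the rows where those columns are null, presumably reviews without a matching user record, and leaves rows with nulls in other columns such as `hours` untouched. In this notebook the row count goes from 1,000,100 in the first `df.info()` to 1,000,092 after the drop, which matches the 8 nulls reported for each user column. A small sketch on invented toy data shows how to confirm the number of rows removed:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the joined review/user data; values are invented.
toy = pd.DataFrame(
    {
        "review_id": ["a", "b", "c", "d"],
        "useful_user": [10.0, np.nan, 3.0, 7.0],
        "fans": [1.0, np.nan, 0.0, 2.0],
        "hours": [None, "9-5", None, "9-5"],  # nulls here must not trigger a drop
    }
)

subset = ["useful_user", "fans"]
cleaned = toy.dropna(subset=subset)

rows_dropped = len(toy) - len(cleaned)
print(f"Rows dropped: {rows_dropped}")
```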

In [8]:
print("\nColumns with more than 30% missing values:")
null_threshold = 30
null_percent = df.isnull().mean() * 100  # calculate % of null values
display(null_percent[null_percent > null_threshold].sort_values(ascending=False))


Columns with more than 30% missing values:


Series([], dtype: float64)
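
The empty result is expected: the largest null count, `hours` with 49,876 of 1,000,100 rows, is about 5%. The same check can be wrapped in a small helper that reports counts and percentages together. `null_report` is a hypothetical name used for illustration, not something defined elsewhere in this notebook, and the data is invented.

```python
import numpy as np
import pandas as pd


def null_report(frame: pd.DataFrame, threshold_pct: float = 0.0) -> pd.DataFrame:
    """Return null counts and percentages for columns above threshold_pct."""
    counts = frame.isnull().sum()
    percent = frame.isnull().mean() * 100
    report = pd.DataFrame({"null_count": counts, "null_pct": percent})
    return report[report["null_pct"] > threshold_pct].sort_values(
        "null_pct", ascending=False
    )


# Invented data: column "a" is 50% null, "b" is 25% null, "c" has no nulls.
toy = pd.DataFrame(
    {"a": [1, np.nan, np.nan, 4], "b": [1, 2, 3, np.nan], "c": [1, 2, 3, 4]}
)
print(null_report(toy, threshold_pct=30))
```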

### Remove columns

In [9]:
df3 = df2.drop(
    columns=[
        "user_id",
        "business_id",
        "name",
        "yelping_since",
        "name_user",
        "latitude",
        "friends",
        "postal_code",
        "longitude",
        "compliment_hot",
        "compliment_more",
        "compliment_profile",
        "compliment_cute",
        "compliment_list",
        "compliment_note",
        "compliment_plain",
        "compliment_cool",
        "compliment_funny",
        "compliment_writer",
        "compliment_photos",
        "attributes",
        "hours",
        "review_count_user",
    ]
)

In [10]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000092 entries, 0 to 1000099
Data columns (total 20 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   review_id      1000092 non-null  object        
 1   stars          1000092 non-null  int64         
 2   useful         1000092 non-null  int64         
 3   funny          1000092 non-null  int64         
 4   cool           1000092 non-null  int64         
 5   text           1000092 non-null  object        
 6   date           1000092 non-null  datetime64[ns]
 7   review_count   1000092 non-null  float64       
 8   useful_user    1000092 non-null  float64       
 9   funny_user     1000092 non-null  float64       
 10  cool_user      1000092 non-null  float64       
 11  elite          1000092 non-null  object        
 12  fans           1000092 non-null  float64       
 13  average_stars  1000092 non-null  float64       
 14  address        1000092 non-null  object

In [11]:
df3.shape

(1000092, 20)

### Categorical Variables

In [12]:
cols_categoric = [
    "stars",
    "elite",
    "city",
    "state",
]

In [13]:
df3[cols_categoric] = df3[cols_categoric].astype("category")

- Ordinal: stars

- Nominal: elite, city, state
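
Note that `astype("category")` produces an unordered categorical, so pandas does not know that `stars` runs from 1 up to 5 (the Arrow schema printed later in this notebook reports `ordered=0` for it). If the ordinal nature matters downstream, one option is an explicit ordered dtype. This is a sketch on toy values, not the cast used in this notebook:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

stars = pd.Series([5, 1, 4, 5, 3])

# Declare the full rating scale and mark it as ordered.
stars_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
stars_ordinal = stars.astype(stars_dtype)

print(stars_ordinal.cat.ordered)
print((stars_ordinal >= 4).tolist())
print(stars_ordinal.min(), stars_ordinal.max())
```

With an ordered dtype, comparisons and `min`/`max` follow the declared category order; on an unordered categorical those operations raise a `TypeError`.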

### Numerical Variables

In [14]:
cols_numeric = [
    "useful",
    "funny",
    "cool",
    "review_count",
    "useful_user",
    "funny_user",
    "cool_user",
    "fans",
    "average_stars",
    "stars_user",
]

- Float

In [15]:
cols_numeric_float = ["stars_user", "average_stars"]

In [16]:
df3[cols_numeric_float] = df3[cols_numeric_float].astype("float")

- Int

In [17]:
cols_numeric_int = [
    "useful",
    "funny",
    "cool",
    "review_count",
    "useful_user",
    "funny_user",
    "cool_user",
    "fans",
]

In [18]:
df3[cols_numeric_int] = df3[cols_numeric_int].astype("int32")
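
The user-level columns are `float64` in the loaded frame, most likely because they held nulls before the `dropna` step. `astype("int32")` truncates fractional values without warning, and values outside the `int32` range are not guaranteed to convert cleanly, so a pre-check can be worthwhile. The sketch below uses a hypothetical helper, `to_int32_checked`, on invented values:

```python
import numpy as np
import pandas as pd


def to_int32_checked(series: pd.Series) -> pd.Series:
    """Cast to int32 only if every value is a whole number within int32 range."""
    if series.isnull().any():
        raise ValueError(f"{series.name}: contains nulls")
    if not (series % 1 == 0).all():
        raise ValueError(f"{series.name}: contains fractional values")
    info = np.iinfo(np.int32)
    if series.min() < info.min or series.max() > info.max:
        raise ValueError(f"{series.name}: values outside int32 range")
    return series.astype("int32")


fans = pd.Series([0.0, 21.0, 39.0], name="fans")
print(to_int32_checked(fans).dtype)
```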

### Boolean Variables

In [19]:
cols_boolean = ["is_open"]

In [20]:
df3[cols_boolean] = df3[cols_boolean].astype("bool")

### String Variables

In [21]:
cols_string = ["review_id", "text", "address", "categories"]

In [22]:
df3[cols_string] = df3[cols_string].astype("object")

### Date Variables

In [23]:
col_date = ["date"]

In [24]:
df3[col_date] = df3[col_date].astype("datetime64[ns]")

### Schema

In [25]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000092 entries, 0 to 1000099
Data columns (total 20 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   review_id      1000092 non-null  object        
 1   stars          1000092 non-null  category      
 2   useful         1000092 non-null  int32         
 3   funny          1000092 non-null  int32         
 4   cool           1000092 non-null  int32         
 5   text           1000092 non-null  object        
 6   date           1000092 non-null  datetime64[ns]
 7   review_count   1000092 non-null  int32         
 8   useful_user    1000092 non-null  int32         
 9   funny_user     1000092 non-null  int32         
 10  cool_user      1000092 non-null  int32         
 11  elite          1000092 non-null  category      
 12  fans           1000092 non-null  int32         
 13  average_stars  1000092 non-null  float64       
 14  address        1000092 non-null  object

In [26]:
schema = pa.Schema.from_pandas(df3, preserve_index=False)

In [27]:
schema

review_id: string
stars: dictionary<values=int64, indices=int8, ordered=0>
useful: int32
funny: int32
cool: int32
text: string
date: timestamp[ns]
review_count: int32
useful_user: int32
funny_user: int32
cool_user: int32
elite: dictionary<values=string, indices=int16, ordered=0>
fans: int32
average_stars: double
address: string
city: dictionary<values=string, indices=int16, ordered=0>
state: dictionary<values=string, indices=int8, ordered=0>
stars_user: double
is_open: bool
categories: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 2501

## Basic Statistics

In [28]:
df3.describe(include=["object", "category"])

| | review_id | stars | text | elite | address | city | state | categories |
|---|---|---|---|---|---|---|---|---|
| count | 1000092 | 1000092 | 1000092 | 1000092 | 1000092 | 1000092 | 1000092 | 999956 |
| unique | 1000092 | 5 | 999102 | 700 | 91369 | 1228 | 24 | 62454 |
| top | LgfSWgq5DzgoFzNW6YwfSg | 5 | So they have policy changes and NEVER Let the ... | | | Philadelphia | PA | Restaurants, Mexican |
| freq | 1 | 463862 | 7 | 610308 | 14713 | 113054 | 200366 | 7277 |


In [29]:
MAX_UNIQUE_DISPLAY = 20

for column in cols_numeric:
    print(f"\n🔍 Column analysis: {column}")
    print("-" * 50)

    # Step 1: Summary statistics
    print("📊 Summary statistics:")
    print(df3[column].describe())

    # Step 2: Unique values (limit if too many)
    unique_vals = df3[column].unique()
    print(f"\n🔢 Unique values ({len(unique_vals)}):")
    print(
        unique_vals if len(unique_vals) <= MAX_UNIQUE_DISPLAY else unique_vals[:MAX_UNIQUE_DISPLAY]
    )

    # Step 3: Value counts (only if few unique values)
    if df3[column].nunique() <= MAX_UNIQUE_DISPLAY:
        print("\n📈 Value counts:")
        print(df3[column].value_counts().sort_index())


🔍 Column analysis: useful
--------------------------------------------------
📊 Summary statistics:
count    1.000092e+06
mean     2.436598e+00
std      4.411195e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      5.390000e+02
Name: useful, dtype: float64

🔢 Unique values (193):
[ 1  3  2  6  8  5  4 10  7 25 23 52 11 13 12 18  9 67 14 15]

🔍 Column analysis: funny
--------------------------------------------------
📊 Summary statistics:
count    1.000092e+06
mean     5.257036e-01
std      2.165200e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.800000e+02
Name: funny, dtype: float64

🔢 Unique values (119):
[ 0  1  2  4  3  5  6  7 27  9 18  8 10 19 21 23 26 16 13 11]

🔍 Column analysis: cool
--------------------------------------------------
📊 Summary statistics:
count    1.000092e+06
mean     1.156940e+00
std      3.798284e+00
min      0.000000e+00
25%      0
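
The summary for `useful` (median 1, 75th percentile 2, maximum 539) points to a heavily right-skewed distribution, and `funny` shows a similar pattern. One way to quantify this, and to see how a log transform changes it, is sketched below on invented counts. The `log1p` transform is only an illustration; this notebook does not apply it.

```python
import numpy as np
import pandas as pd

# Invented vote counts with a long right tail, loosely mimicking the pattern above.
votes = pd.Series(
    [1] * 50 + [2] * 20 + [3] * 10 + [5] * 5 + [10] * 3 + [50, 539],
    name="useful",
)

print(votes.quantile([0.5, 0.9, 0.99]))
print(f"skew (raw):   {votes.skew():.2f}")

# log1p compresses the tail while keeping zero counts at zero.
votes_log = np.log1p(votes)
print(f"skew (log1p): {votes_log.skew():.2f}")
```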

## Save primary reviews data

In [30]:
df3.to_parquet(
    BASE_DIR / "data/03_primary/data_message_classifier_primary.parquet",
    index=False,
    schema=schema,
)