# EDA Overview

This notebook performs exploratory data analysis (EDA) on the user behavior dataset.

Objectives:
- Understand data structure and scale
- Inspect data quality and types
- Explore basic user behavior distributions
- Prepare for downstream analysis (SQL / BI)

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

In [2]:
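# Raw monthly event logs for October and November 2019
oct_path = "../data/raw/2019-Oct.csv"
nov_path = "../data/raw/2019-Nov.csv"

# Each file is read fully into memory with pandas' default dtypes
df_oct = pd.read_csv(oct_path)
df_nov = pd.read_csv(nov_path)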

In [3]:
df_oct.shape, df_nov.shape

((42448764, 9), (67501979, 9))

In [4]:
# Stack both months and rebuild a continuous 0..n-1 row index
df = pd.concat([df_oct, df_nov], ignore_index=True)

In [9]:
output_path = "../data/processed/events_2019_10_11.csv"

df.to_csv(
    output_path,
    index=False
)

In [5]:
df.shape

(109950743, 9)

In [6]:
df.head()

,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2019-10-01 00:00:00 UTC,view,44600062,2103807459595387724,,shiseido,35.79,541312140,72d76fde-8bb3-4e00-8c23-a032dfed738c
1,2019-10-01 00:00:00 UTC,view,3900821,2053013552326770905,appliances.environment.water_heater,aqua,33.2,554748717,9333dfbd-b87a-4708-9857-6336556b0fcc
2,2019-10-01 00:00:01 UTC,view,17200506,2053013559792632471,furniture.living_room.sofa,,543.1,519107250,566511c2-e2e3-422b-b695-cf8e6e792ca8
3,2019-10-01 00:00:01 UTC,view,1307067,2053013558920217191,computers.notebook,lenovo,251.74,550050854,7c90fc70-0e80-4590-96f3-13c02c18c713
4,2019-10-01 00:00:04 UTC,view,1004237,2053013555631882655,electronics.smartphone,apple,1081.98,535871217,c6bd7419-2748-4c56-95b4-8cec9ff8b80d


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109950743 entries, 0 to 109950742
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   event_time     object 
 1   event_type     object 
 2   product_id     int64  
 3   category_id    int64  
 4   category_code  object 
 5   brand          object 
 6   price          float64
 7   user_id        int64  
 8   user_session   object 
dtypes: float64(1), int64(3), object(5)
memory usage: 7.4+ GB
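
The `df.info()` output above shows `event_time` stored as `object` and a footprint above 7.4 GB. One possible way to reduce this, sketched here under stated assumptions, is to declare dtypes at load time and parse the timestamp explicitly. The helper name `load_events`, the choice of `category` columns, and the timestamp format string are illustrative assumptions, not part of the original notebook. The inline sample reuses two rows from the `df.head()` output.

```python
import io

import pandas as pd

# Assumption: these text columns repeat often enough to benefit from the category dtype.
CATEGORY_COLUMNS = ("event_type", "category_code", "brand")

# Assumption: every timestamp follows the "2019-10-01 00:00:00 UTC" pattern seen in df.head().
EVENT_TIME_FORMAT = "%Y-%m-%d %H:%M:%S UTC"


def load_events(source):
    """Read an event CSV (path or file-like object) with leaner dtypes."""
    events = pd.read_csv(source, dtype={col: "category" for col in CATEGORY_COLUMNS})
    # Parse the timestamp strings into timezone-aware datetimes.
    events["event_time"] = pd.to_datetime(
        events["event_time"], format=EVENT_TIME_FORMAT, utc=True
    )
    return events


# Two rows copied from the df.head() output, used as a small in-memory sample.
sample_csv = io.StringIO(
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session\n"
    "2019-10-01 00:00:00 UTC,view,44600062,2103807459595387724,,shiseido,35.79,541312140,72d76fde-8bb3-4e00-8c23-a032dfed738c\n"
    "2019-10-01 00:00:01 UTC,view,17200506,2053013559792632471,furniture.living_room.sofa,,543.1,519107250,566511c2-e2e3-422b-b695-cf8e6e792ca8\n"
)
sample_events = load_events(sample_csv)
print(sample_events.dtypes)
```

If the monthly files are loaded this way and then concatenated, categorical columns whose category sets differ between files fall back to `object` in `pd.concat`, so it may be simpler to apply the categorical conversion after concatenation.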


In [8]:
df.describe()

,product_id,category_id,price,user_id
count,109950700.0,109950700.0,109950700.0,109950700.0
mean,11755770.0,2.057707e+18,291.6348,536669800.0
std,15435640.0,1.949326e+16,356.68,21451730.0
min,1000365.0,2.053014e+18,0.0,10300220.0
25%,1005256.0,2.053014e+18,67.96,516262900.0
50%,5100396.0,2.053014e+18,164.93,532641500.0
75%,17200510.0,2.053014e+18,360.11,556331200.0
max,100028600.0,2.187708e+18,2574.07,579969900.0
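
The summary above reports a minimum `price` of 0.0, and the `df.head()` output shows empty `category_code` and `brand` values, so a data-quality pass fits the stated objectives. The following is a minimal sketch of such a check. The helper `quality_summary` and the sample frame are hypothetical, and the sample values are illustrative rather than taken from the full dataset.

```python
import numpy as np
import pandas as pd


def quality_summary(frame):
    """Return per-column missing-value counts and percentages."""
    missing = frame.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })


# Hypothetical sample that mirrors three of the dataset's columns.
sample = pd.DataFrame({
    "category_code": ["electronics.smartphone", np.nan, "computers.notebook", np.nan],
    "brand": ["apple", "shiseido", np.nan, "aqua"],
    "price": [1081.98, 35.79, 0.0, 33.2],
})

summary = quality_summary(sample)
# Non-positive prices are worth a separate look before any revenue analysis.
zero_price_rows = int((sample["price"] <= 0).sum())

print(summary)
print("rows with non-positive price:", zero_price_rows)
```

On the real data, the same calls would be `quality_summary(df)` and `(df["price"] <= 0).sum()`.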


In [None]:
df.nunique()

In [None]:
df['event_type'].value_counts()

Views account for most user actions, while purchases make up a relatively small share,
which is consistent with a typical e-commerce conversion funnel.
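
To put numbers on this observation, the raw counts can be paired with their share of all events. The snippet below is a minimal sketch. The helper `event_type_share` is hypothetical, and the five-row sample is invented for illustration, so its proportions say nothing about the real dataset.

```python
import pandas as pd


def event_type_share(frame):
    """Return the count and share of each event type, most frequent first."""
    counts = frame["event_type"].value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Invented sample: four views and one purchase.
sample = pd.DataFrame({"event_type": ["view", "view", "view", "view", "purchase"]})

shares = event_type_share(sample)
print(shares)
```

On the full frame, `event_type_share(df)` would give the actual view and purchase proportions.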