# 📓 Day 2 – Exploratory Data Analysis

# Day 2: Exploratory Data Analysis (EDA)
In today's notebook, we explore the structure of the OTTO event dataset.
We aim to understand user behavior, event distribution, and item popularity.
The insights from this step will guide feature engineering in later stages.

## 🧭 Main Steps
1. Load data efficiently  
2. Inspect dataset shape and schema  
3. Event type distribution  
4. Session length analysis  
5. Item popularity statistics  
6. Visualize key findings  

---

## 1. Load data

In [None]:
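import pandas as pd

path = "/kaggle/input/otto-recsys-dataset/train.parquet"

# read_parquet loads the whole file; keep only the first 500,000 rows
# so the exploration below stays fast
df = pd.read_parquet(path).head(500_000)

df.info()
df.head()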

## 2. Inspect structure

In [None]:
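# Number of rows and columns
print(df.shape)
# Column dtypes and memory usage
df.info()
# First record, to see what a single row looks like
df.iloc[0]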

## 3. Event type distribution

In [None]:
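import matplotlib.pyplot as plt

print("Columns:", df.columns.tolist())

# Assumes the "ts" column holds each session's events as a list of dicts
# (with at least "aid" and "type" keys, which the cells below rely on);
# adjust the column name if the printed columns differ.
# Reset the index so the exploded rows line up with the normalized frame.
events = df.explode("ts").reset_index(drop=True)
events_normalized = pd.json_normalize(events["ts"].tolist())

# Merge the event fields back onto the original session columns
events_full = pd.concat([events.drop(columns="ts"), events_normalized], axis=1)

events_full.head()

# Statistics: number of events per type
event_counts = events_full["type"].value_counts()

plt.figure(figsize=(6, 4))
event_counts.plot(kind="bar")
plt.title("Event Type Distribution")
plt.xlabel("Event Type")
plt.ylabel("Count")
plt.show()

event_counts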
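
Raw counts are easier to compare across samples when they are also expressed as shares. The sketch below shows one way to do that with `value_counts(normalize=True)`. It runs on a toy series standing in for `events_full["type"]`, so the numbers are illustrative and not results from the dataset.

```python
import pandas as pd

# Toy stand-in for events_full["type"]; the real values come from the dataset
toy_types = pd.Series(["clicks"] * 7 + ["carts"] * 2 + ["orders"] * 1, name="type")

counts = toy_types.value_counts()                # absolute number of events per type
shares = toy_types.value_counts(normalize=True)  # fraction of all events per type

# Put counts and percentage shares side by side
summary = pd.DataFrame({"count": counts, "share_pct": (shares * 100).round(1)})
print(summary)
```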

## 4. Session Length Analysis

In [None]:
# Number of events in each session
session_len = events_full.groupby("session")["aid"].count()

plt.figure(figsize=(6,4))
session_len.hist(bins=50)
plt.title("Session Length Distribution")
plt.xlabel("Number of Events")
plt.ylabel("Frequency")
plt.show()

session_len.describe()

## 5. Item Popularity

In [None]:
# Number of events per item, sorted from most to least popular
item_popularity = events_full["aid"].value_counts()

plt.figure(figsize=(6,4))
plt.plot(item_popularity.values[:2000])
plt.title("Item Popularity (Top 2000)")
plt.xlabel("Item Rank")
plt.ylabel("Count")
plt.show()

item_popularity.head()
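
A rank plot shows the shape of the curve, but it can also help to ask how many top items are needed to cover a given share of all events. The sketch below computes that from cumulative counts. It uses a toy series in place of `events_full["aid"]`, and the helper `items_needed` is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for events_full["aid"]
toy_aids = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 4])

popularity = toy_aids.value_counts()                # sorted, most popular first
coverage = popularity.cumsum() / popularity.sum()   # cumulative share of events


def items_needed(coverage, target):
    """Smallest number of top items whose events reach the target share."""
    position = np.searchsorted(coverage.values, target) + 1
    # Clamp in case the target is above the maximum achievable share
    return int(min(position, len(coverage)))


print(items_needed(coverage, 0.75))
```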

# 📝 Key Insights from Day 2

### 1. Event Type Distribution
- **Clicks** dominate the dataset
- **Carts** and **Orders** are less frequent but more valuable

Because carts and orders are rarer but more valuable, later steps will need to weight event types rather than count them all equally.
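
One simple form of weighting is to give each event type a fixed weight and sum the weights per item. The sketch below compares that score with raw counts on toy data. The values in `TYPE_WEIGHTS` are hypothetical and chosen only for illustration; they are not taken from this notebook.

```python
import pandas as pd

# Hypothetical weights: illustrative values, not taken from the notebook
TYPE_WEIGHTS = {"clicks": 1.0, "carts": 3.0, "orders": 6.0}

toy_events = pd.DataFrame({
    "aid":  [10, 10, 10, 20, 20, 30],
    "type": ["clicks", "clicks", "clicks", "carts", "orders", "clicks"],
})
# Attach a weight to every event according to its type
toy_events["weight"] = toy_events["type"].map(TYPE_WEIGHTS)

raw_counts = toy_events["aid"].value_counts()
weighted_score = (
    toy_events.groupby("aid")["weight"].sum().sort_values(ascending=False)
)

print(raw_counts)
print(weighted_score)
```

In this toy example item 10 has the most events, but item 20 ranks first once its cart and order are weighted.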

---

### 2. Session Length
- Most sessions contain **5–30 events**
- A small number of extremely long sessions exist

Later steps will trim or normalize these long sessions.
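
As a rough sketch of what trimming could look like, the helper below keeps only the most recent events of each session. It assumes an event table with `session`, `aid`, and a per-event timestamp column named `ts`. The function name `keep_last_events` and the `max_len` cutoff are hypothetical.

```python
import pandas as pd


def keep_last_events(events, max_len):
    """Keep only the max_len most recent events of each session."""
    ordered = events.sort_values(["session", "ts"])
    # tail() keeps the last rows of every group, in their sorted order
    return ordered.groupby("session").tail(max_len).reset_index(drop=True)


toy = pd.DataFrame({
    "session": [1, 1, 1, 1, 2, 2],
    "aid":     [5, 6, 7, 8, 9, 5],
    "ts":      [1, 2, 3, 4, 1, 2],
})
trimmed = keep_last_events(toy, max_len=2)
print(trimmed)
```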

---

### 3. Item Popularity
- Popularity follows a **power-law distribution**

To account for this, we might use:
- item embeddings
- frequency-based priors
- popularity decay
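
To make the last option concrete, here is a minimal sketch of popularity decay: each event contributes a weight that halves every `half_life` time units of age, so recent activity counts more than old activity. The function name, the `half_life` parameter, and the assumption of a per-event timestamp column `ts` are all illustrative.

```python
import numpy as np
import pandas as pd


def decayed_popularity(events, half_life):
    """Sum per-item weights that halve every half_life time units of age."""
    # Age of each event relative to the most recent event in the table
    age = events["ts"].max() - events["ts"]
    weight = np.power(0.5, age / half_life)
    return weight.groupby(events["aid"]).sum().sort_values(ascending=False)


toy = pd.DataFrame({
    "aid": [1, 1, 1, 2, 2],
    "ts":  [0, 0, 0, 10, 10],
})
scores = decayed_popularity(toy, half_life=5)
print(scores)
```

In this toy example item 1 has more events, but they are older, so item 2 ends up with the higher decayed score.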

---

Tomorrow (Day 3) we will:  
✔ Build session-level features  
✔ Construct co-visitation matrices