# E-Commerce Data Preprocessing & Exploratory Data Analysis

## 1. Project Overview
This project aims to build a Multi-Model Recommendation System combining Content-Based, Collaborative, and Sequential approaches. This notebook focuses on the initial **Data Preprocessing and Exploratory Data Analysis (EDA)** steps.

### Dataset Description
The dataset is the **e-commerce clickstream dataset** from Kaggle: *"Ecommerce Behavior Data from Multi-Category Store"* (October & November 2019).
* **Source:** [Kaggle Link](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store)
* **Scale:** Millions of rows representing user-product interactions.
* **Key Features:** `user_id`, `product_id`, `event_type`, `brand`, `price`, `category_code`.

## 2. Setup and Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder

warnings.simplefilter("ignore")

## 3. Data Preprocessing Pipeline
We apply the following steps to clean and reduce the dataset for efficient training:
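1.  **Data Cleaning:** Remove rows with missing critical values (`user_id`, `product_id`, `event_type`, `brand`, `price`, `category_code`) and drop duplicate (`user_id`, `product_id`, `event_type`) rows.
2.  **Filtering:** Keep only `view` and `purchase` events.
3.  **Sparsity Reduction:** Retain only the top **50,000 users** and **10,000 products** so the models learn from users and items with enough interactions. Ranking users and products requires counts over the whole dataset, so the chunk-level function below leaves this step as a placeholder.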
4.  **Sampling:** Downsample to 20% of the data for manageable training times.
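
The sparsity-reduction step can be sketched as follows. This is a minimal illustration, assuming "top" means ranked by number of interaction rows; the helper name `keep_top_entities` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def keep_top_entities(df, n_users, n_products):
    """Keep rows whose user AND product are both among the most frequent ones.

    Hypothetical helper: ranks by row count, which is an assumption about
    what "top" means in the step list above.
    """
    top_users = df['user_id'].value_counts().nlargest(n_users).index
    top_products = df['product_id'].value_counts().nlargest(n_products).index
    mask = df['user_id'].isin(top_users) & df['product_id'].isin(top_products)
    return df[mask].reset_index(drop=True)

# Toy data for illustration only
toy = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
    'product_id': ['p1', 'p2', 'p1', 'p1', 'p3', 'p2'],
})
filtered = keep_top_entities(toy, n_users=2, n_products=2)
print(filtered)
```

On the real data the call would presumably be `keep_top_entities(full_data, 50_000, 10_000)`, applied to the combined dataset rather than to individual chunks.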

In [None]:
def preprocess_chunk(df):
    """Clean, filter, and downsample one chunk of the raw clickstream CSV."""
    cols = ['user_id', 'product_id', 'event_type', 'brand', 'price', 'category_code']

    # 1. Drop rows missing any critical column
    df = df.dropna(subset=cols)

    # 2. Keep one row per (user, product, event type).
    #    Note: this only removes duplicates within the current chunk.
    df = df.drop_duplicates(subset=['user_id', 'product_id', 'event_type'])

    # 3. Keep only view and purchase events; copy so the assignments
    #    below do not write to a slice of the original chunk
    df = df[df['event_type'].isin(['view', 'purchase'])].copy()

    # 4. Top-user / top-product filtering is not applied here: it needs
    #    counts over the full dataset, which a single chunk cannot provide.

    # 5. Store IDs as strings
    df['user_id'] = df['user_id'].astype(str)
    df['product_id'] = df['product_id'].astype(str)

    # 6. Sample 20% of the chunk (fixed seed for reproducibility)
    df = df.sample(frac=0.2, random_state=42)

    return df[cols]

# Processing October Data
chunk_size = 300_000
print("Processing October Data...")
for i, chunk in enumerate(pd.read_csv('2019-Oct.csv', chunksize=chunk_size)):
    print(f"Processing chunk {i}...", end='\r')
    processed_chunk = preprocess_chunk(chunk)
    mode = 'w' if i == 0 else 'a'
    header = i == 0
    processed_chunk.to_csv('processed_data1.csv', mode=mode, header=header, index=False)

print("\nProcessing November Data...")
for i, chunk in enumerate(pd.read_csv('2019-Nov.csv', chunksize=chunk_size)):
    print(f"Processing chunk {i}...", end='\r')
    processed_chunk = preprocess_chunk(chunk)
    mode = 'w' if i == 0 else 'a'
    header = i == 0
    processed_chunk.to_csv('processed_data.csv', mode=mode, header=header, index=False)
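
Because each chunk is processed in isolation, global activity counts have to be gathered in a separate pass over the file. The sketch below shows one possible way to do that with `collections.Counter`; the in-memory CSV, the chunk size of 2, and the cut-offs are illustrative assumptions, not values from the original pipeline.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory CSV standing in for a raw monthly file (illustrative data only)
raw_csv = io.StringIO(
    "user_id,product_id,event_type\n"
    "1,10,view\n"
    "1,11,view\n"
    "2,10,purchase\n"
    "1,10,purchase\n"
    "3,12,view\n"
)

user_counts, product_counts = Counter(), Counter()

# Counting pass: read only the ID columns and accumulate counts chunk by chunk
for chunk in pd.read_csv(raw_csv, usecols=['user_id', 'product_id'], chunksize=2):
    user_counts.update(chunk['user_id'].value_counts().to_dict())
    product_counts.update(chunk['product_id'].value_counts().to_dict())

# Most frequent IDs, converted to strings to match the type used in preprocess_chunk
top_users = {str(u) for u, _ in user_counts.most_common(1)}
top_products = {str(p) for p, _ in product_counts.most_common(1)}
print(top_users, top_products)
```

With the resulting ID sets in hand, a second pass could filter each chunk with `isin` before sampling. Whether to count before or after deduplication is a design choice the original notebook does not specify.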

## 4. Final Aggregation & EDA
Combine the processed files and verify the data distribution.

In [None]:
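# Load the processed files. October was written to processed_data1.csv
# and November to processed_data.csv; read IDs back as strings.
id_dtypes = {'user_id': str, 'product_id': str}
data_oct = pd.read_csv('processed_data1.csv', dtype=id_dtypes)
data_nov = pd.read_csv('processed_data.csv', dtype=id_dtypes)
full_data = pd.concat([data_oct, data_nov], ignore_index=True)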

# Save the final Master Dataset for all models
full_data.to_csv("Ecommerce_Dataset.csv", index=False)

print(f"Final Dataset Shape: {full_data.shape}")
display(full_data.head())
display(full_data.describe())
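
To check the data distribution beyond `describe()`, a few headline statistics can be computed directly. The following is a rough sketch on toy data; `summarize_interactions` is a hypothetical helper, and the sparsity definition used here (share of user-product pairs with no observed interaction) is an assumption.

```python
import pandas as pd

def summarize_interactions(df):
    """Return headline statistics for an interaction table (hypothetical helper)."""
    n_users = df['user_id'].nunique()
    n_products = df['product_id'].nunique()
    n_pairs = len(df[['user_id', 'product_id']].drop_duplicates())
    return {
        # Share of each event type, e.g. view vs. purchase
        'event_share': df['event_type'].value_counts(normalize=True).to_dict(),
        'n_users': n_users,
        'n_products': n_products,
        # Fraction of the user x product matrix with no observed interaction
        'sparsity': 1 - n_pairs / (n_users * n_products),
    }

# Toy data for illustration only
toy = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u2', 'u2'],
    'product_id': ['p1', 'p2', 'p1', 'p1'],
    'event_type': ['view', 'view', 'view', 'purchase'],
})
summary = summarize_interactions(toy)
print(summary)
```

Calling `summarize_interactions(full_data)` on the combined dataset would give the view/purchase split and an indication of how sparse the interaction matrix remains after preprocessing.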