# Explaining the Data Preprocessing Pipeline

This notebook explains the 3-step data preprocessing pipeline used to clean, merge, and balance the product category dataset. The goal is to transform the raw data into a format suitable for training a deep learning model for image-based product classification.

## The Problem: Many Imbalanced and Redundant Categories

The original dataset has a large number of categories (248), many of which are semantically similar (e.g., "Men's T-Shirts" and "Men's Tees"). Furthermore, the number of products per category is highly imbalanced. Both issues make it harder to train a robust classification model: near-duplicate labels are ambiguous, and sparse categories provide few training examples.

## The Solution: A 3-Step Pipeline

To address this, we use a three-script pipeline:

1.  `create_category_mappings.py`: Merges semantically similar categories using NLP and clustering.
2.  `apply_category_mapping.py`: Applies this new mapping to the main products dataset.
3.  `filter_by_product_count.py`: Filters out categories with too few products and re-indexes the final category IDs.

### Step 1: Semantic Category Merging (`create_category_mappings.py`)

**Goal:** To group similar categories based on their names.

**How it works:**

1.  **Load Categories:** It starts by loading the original `categories.csv` file.
2.  **Generate Embeddings:** It uses the `sentence-transformers` library with the `all-MiniLM-L6-v2` model to convert each category name into a numerical vector (embedding). These embeddings capture the semantic meaning of the names.
3.  **Cluster Embeddings:** It then uses `KMeans` clustering to group these embeddings into a smaller number of clusters (`NUM_CLUSTERS = 50`). Categories with similar meanings will have embeddings that are close to each other in the vector space and will be grouped into the same cluster.
4.  **Create Mapping File:** The script creates a new file, `category_mapping.csv`, which contains the original `category_id` and a new `merged_category_id` (the cluster ID).

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# The original file with all 248 categories
CATEGORIES_FILE_PATH = "data/raw/categories.csv"
# The desired number of final, merged categories
NUM_CLUSTERS = 50
# The output file that will store the mapping from old to new categories
OUTPUT_MAPPING_FILE = "data/processed/category_mapping.csv"

# Load the original categories data
categories_df = pd.read_csv(CATEGORIES_FILE_PATH)
categories_df.rename(columns={'id': 'category_id'}, inplace=True)

# Generate sentence embeddings for category names
model = SentenceTransformer('all-MiniLM-L6-v2')
category_embeddings = model.encode(categories_df['category_name'].tolist())

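# Cluster the embeddings into NUM_CLUSTERS groups
# (random_state fixes the initialization; n_init='auto' requires scikit-learn >= 1.2)
kmeans = KMeans(n_clusters=NUM_CLUSTERS, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(category_embeddings)
# Use each category's cluster label as its merged category ID
categories_df['merged_category_id'] = cluster_labels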

# Save the mapping
output_df = categories_df[['category_id', 'category_name', 'merged_category_id']]
output_df.to_csv(OUTPUT_MAPPING_FILE, index=False)
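
Because K-means always returns exactly `NUM_CLUSTERS` groups, it can be worth checking which category names ended up together before the mapping is applied. The following is a minimal sketch, assuming the mapping has the three columns written above; the toy rows and the helper name `summarize_clusters` are illustrative and not part of the original script.

```python
import pandas as pd

def summarize_clusters(mapping_df: pd.DataFrame) -> pd.Series:
    """Return, for each merged_category_id, the sorted list of original category names."""
    return mapping_df.groupby('merged_category_id')['category_name'].apply(sorted)

# Toy stand-in for category_mapping.csv (hypothetical rows, for illustration only)
toy_mapping = pd.DataFrame({
    'category_id': [1, 2, 3, 4],
    'category_name': ["Men's T-Shirts", "Men's Tees", "Laptops", "Notebook Computers"],
    'merged_category_id': [0, 0, 1, 1],
})

cluster_summary = summarize_clusters(toy_mapping)
for cluster_id, names in cluster_summary.items():
    print(cluster_id, names)
```

On the real file, the same call could be run on `pd.read_csv(OUTPUT_MAPPING_FILE)` to review all 50 clusters by eye.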

### Step 2: Applying the Mapping (`apply_category_mapping.py`)

**Goal:** To apply the newly created category mapping to the entire product dataset.

**How it works:**

1.  **Load Datasets:** It loads both the large, original `products.csv` and the `category_mapping.csv` created in the previous step.
2.  **Map IDs:** It efficiently maps the `category_id` in the products DataFrame to the `merged_category_id` from the mapping file.
3.  **Save Cleaned File:** It saves a new file, `products_cleaned.csv`, containing the product information with the new `merged_category_id`. 

In [None]:
import pandas as pd

ORIGINAL_PRODUCTS_FILE = "data/raw/products.csv"
CATEGORY_MAPPING_FILE = "data/processed/category_mapping.csv"
OUTPUT_PRODUCTS_FILE = "data/processed/products_cleaned.csv"

# Load the datasets
products_df = pd.read_csv(ORIGINAL_PRODUCTS_FILE)
mapping_df = pd.read_csv(CATEGORY_MAPPING_FILE)

# Build a category_id -> merged_category_id lookup dictionary
id_to_merged_id = mapping_df.set_index('category_id')['merged_category_id'].to_dict()

# Apply the mapping
products_df['merged_category_id'] = products_df['category_id'].map(id_to_merged_id)

# Save the result
final_df = products_df[['asin', 'title', 'imgUrl', 'merged_category_id']]
final_df.to_csv(OUTPUT_PRODUCTS_FILE, index=False)
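
One thing to watch for in this step: `Series.map` with a dictionary returns `NaN` for any `category_id` that has no entry in the mapping, which also turns the integer column into floats. The sketch below shows one way to list such IDs before saving. The helper name `find_unmapped_ids` and the toy data are assumptions for illustration, not part of the original script.

```python
import pandas as pd

def find_unmapped_ids(products_df: pd.DataFrame, id_to_merged_id: dict) -> list:
    """Return the sorted category_id values that have no entry in the mapping."""
    mapped = products_df['category_id'].map(id_to_merged_id)
    return sorted(products_df.loc[mapped.isna(), 'category_id'].unique().tolist())

# Toy data (hypothetical): category 99 is missing from the mapping
toy_products = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'category_id': [1, 2, 99],
})
toy_id_to_merged_id = {1: 0, 2: 0}

unmapped_ids = find_unmapped_ids(toy_products, toy_id_to_merged_id)
print(unmapped_ids)  # [99]
```

If the list is non-empty, the affected rows could be dropped or the mapping regenerated; which option is appropriate depends on the dataset.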

### Step 3: Filtering by Product Count (`filter_by_product_count.py`)

**Goal:** To ensure each category has enough products for effective model training and to create final, sequential category IDs.

**How it works:**

1.  **Count Products:** It loads the `products_cleaned.csv` and counts the number of products in each `merged_category_id`. 
2.  **Filter Categories:** It identifies and keeps only the categories that have at least `MIN_PRODUCT_COUNT` products (5000 in the code below).
3.  **Re-index Category IDs:** This is a crucial step. After filtering, the remaining `merged_category_id`s might have gaps (e.g., 0, 2, 5, 8). Most machine learning frameworks expect class labels to be sequential (0, 1, 2, 3...); PyTorch's `nn.CrossEntropyLoss`, for example, expects class indices in the range 0 to C-1 for C classes. This script re-indexes the filtered category IDs to be continuous.
4.  **Overwrite Files:** Finally, it overwrites both `products_cleaned.csv` and `category_mapping.csv` with the filtered and re-indexed data. This results in the final, analysis-ready dataset.

In [None]:
import pandas as pd

MIN_PRODUCT_COUNT = 5000
PRODUCTS_FILE = "data/processed/products_cleaned.csv"
MAPPING_FILE = "data/processed/category_mapping.csv"

# Load data
products_df = pd.read_csv(PRODUCTS_FILE)
mapping_df = pd.read_csv(MAPPING_FILE)

# Calculate product counts and identify valid categories
category_counts = products_df['merged_category_id'].value_counts()
valid_category_ids = category_counts[category_counts >= MIN_PRODUCT_COUNT].index.tolist()

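# Filter dataframes (copy so the column assignments below do not trigger SettingWithCopyWarning)
products_df = products_df[products_df['merged_category_id'].isin(valid_category_ids)].copy()
mapping_df = mapping_df[mapping_df['merged_category_id'].isin(valid_category_ids)].copy()

# Re-index the final category IDs to be sequential
unique_ids = products_df['merged_category_id'].unique()
old_to_new_id_map = {old_id: new_id for new_id, old_id in enumerate(unique_ids)}
products_df['final_category_id'] = products_df['merged_category_id'].map(old_to_new_id_map)
# Apply the same re-indexing to the mapping so both files use the same final IDs
mapping_df['merged_category_id'] = mapping_df['merged_category_id'].map(old_to_new_id_map)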

# Clean up and rename columns
products_df = products_df.drop(columns=['merged_category_id'])
products_df.rename(columns={'final_category_id': 'category_id'}, inplace=True)

# Overwrite the files with the final data
products_df.to_csv(PRODUCTS_FILE, index=False)
mapping_df.to_csv(MAPPING_FILE, index=False)
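
To make the re-indexing concrete, here is a small sketch that runs the same logic on toy labels with gaps, mirroring the (0, 2, 5, 8) example, and then checks that the result is contiguous. The toy data and the helper `labels_are_contiguous` are illustrative assumptions. Note that `unique()` returns values in order of first appearance, so the new IDs follow row order rather than numeric order.

```python
import pandas as pd

# Toy labels with gaps (hypothetical data)
toy_products = pd.DataFrame({'merged_category_id': [5, 0, 8, 2, 5, 0]})

# Same re-indexing logic as the script: enumerate the IDs in order of first appearance
unique_ids = toy_products['merged_category_id'].unique()
old_to_new_id_map = {int(old_id): new_id for new_id, old_id in enumerate(unique_ids)}
toy_products['category_id'] = toy_products['merged_category_id'].map(old_to_new_id_map)

def labels_are_contiguous(labels: pd.Series) -> bool:
    """True if the labels are exactly 0..C-1, where C is the number of distinct labels."""
    return sorted(labels.unique().tolist()) == list(range(labels.nunique()))

print(old_to_new_id_map)  # {5: 0, 0: 1, 8: 2, 2: 3}
print(labels_are_contiguous(toy_products['category_id']))  # True
```

A check like this could be run on the final `category_id` column before training, since a gap in the labels would otherwise only surface as an error or a silently unused output unit in the model.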

## Conclusion

This pipeline transforms a large, messy, and imbalanced dataset into a cleaner, well-structured one: similar categories are merged, sparse categories are removed, and the remaining labels are sequential. The final `products_cleaned.csv` and `category_mapping.csv` files are ready to be used for training a deep learning model.