
# Introduction

This notebook applies **Feature Engineering** techniques to the product titles and compares several Machine Learning algorithms. The goal is a robust model for automated product classification.

This notebook builds directly upon the data cleansing and exploratory insights gained from `01_data_exploration.ipynb`.

---

## 📥 Initialization and Data Loading

In [1]:
# Initialization and Data Loading

import pandas as pd
import numpy as np
import re
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from scipy.sparse import hstack
import warnings
warnings.filterwarnings('ignore') # Suppress warnings for cleaner output

# Load the data and perform minimal essential cleaning (reproduction of initial steps)
url = "https://raw.githubusercontent.com/lav162329/product-category-classifier/main/data/products.csv"
df = pd.read_csv(url)

# Essential cleaning steps
df.columns = df.columns.str.strip()
df.dropna(subset=['Product Title', 'Category Label'], inplace=True)
df['Product Title'] = df['Product Title'].astype(str).str.lower()
df.reset_index(drop=True, inplace=True)

print(f"Data for modeling loaded and cleaned. Shape: {df.shape}")
print("\nReady to proceed with Feature Engineering.")

Data for modeling loaded and cleaned. Shape: (35096, 8)

Ready to proceed with Feature Engineering.


# Phase 1: Feature Engineering

## 2: Detailed Feature Engineering

This step derives numerical features from each product title: character length, word count, and binary flags for units, dimensions, digits, and common tech terms. The regular expressions use **non-capturing groups `(?:...)`** because pandas' `str.contains` emits a `UserWarning` when a pattern contains capturing groups. These features are intended to give the model structural context beyond the raw title text; their actual effect on classification performance should be judged from the model comparison.
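The warning behaviour can be checked in isolation. The following is a minimal sketch, assuming a recent pandas version; the toy titles and the helper `count_group_warnings` are hypothetical and exist only for this illustration. It runs the same match once with capturing groups and once with non-capturing groups and counts the "match groups" warnings each one raises.

```python
import warnings

import pandas as pd

# Hypothetical toy titles, not taken from the dataset.
titles = pd.Series(["samsung galaxy s9 64gb", "bosch kettle 1.7 l", "lego city set"])


def count_group_warnings(pattern):
    """Return the boolean mask and the number of 'match groups' warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is silenced or deduplicated
        mask = titles.str.contains(pattern, regex=True)
    n_warnings = sum("match groups" in str(w.message) for w in caught)
    return mask.tolist(), n_warnings


# Same pattern, written with capturing and with non-capturing groups.
capturing_mask, capturing_warnings = count_group_warnings(r'(\d+)\s*(gb|tb|l)\b')
plain_mask, plain_warnings = count_group_warnings(r'(?:\d+)\s*(?:gb|tb|l)\b')

print(capturing_mask == plain_mask)        # both patterns select the same rows
print(capturing_warnings, plain_warnings)  # only the capturing version warns
```

Both versions select the same rows; the only difference is that the capturing version asks pandas to track groups that `str.contains` never returns.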

---

In [2]:
# Detailed Feature Engineering.

# Define the feature creation function
def create_engineered_features(df):
    """Add structural and regex-based binary features derived from 'Product Title'.

    The new columns are added to the DataFrame that is passed in, and the
    same object is returned.
    """

    titles = df['Product Title']

    # Structural Features
    df['title_length'] = titles.str.len()
    df['word_count'] = titles.str.split().str.len()

    # Binary Specification Markers (using non-capturing groups to prevent warnings)

    # 1. has_storage_unit: Capacity, volume, or weight (MB, GB, TB, L, litre, KG)
    storage_pattern = r'(?:\d+)\s*(?:gb|tb|mb|l|kg|litre)\b'
    df['has_storage_unit'] = titles.str.contains(storage_pattern, regex=True).astype(int)

    # 2. has_dimension: Size (cm, inch, ")
    # \b is applied only to the word units: a quote is not a word character,
    # so '"\b' would fail to match a quote followed by a space or end of string.
    dimension_pattern = r'(?:\d+(?:\.\d+)?)\s*(?:(?:cm|inch)\b|")'
    df['has_dimension'] = titles.str.contains(dimension_pattern, regex=True).astype(int)

    # 3. has_digit: Presence of any number
    df['has_digit'] = titles.str.contains(r'\d+', regex=True).astype(int)

    # 4. is_tech_product: Common tech acronyms as standalone tokens
    # (tokens fused to digits, e.g. "12mp" or "3.2ghz", are not matched by \b...\b)
    tech_pattern = r'\b(?:ssd|led|usb|hdmi|wifi|ghz|mp|core)\b'
    df['is_tech_product'] = titles.str.contains(tech_pattern, regex=True).astype(int)

    return df

# Apply and define feature columns
df = create_engineered_features(df)
FEATURE_COLS = ['title_length', 'word_count', 'has_storage_unit', 'has_dimension', 'has_digit', 'is_tech_product']

print("Engineered features created successfully.")
print(df[FEATURE_COLS].head())

Engineered features created successfully.
   title_length  word_count  has_storage_unit  has_dimension  has_digit  \
0            31           6                 1              0          1   
1            35           7                 1              0          1   
2            70          13                 1              0          1   
3            35           7                 1              0          1   
4            54          11                 1              0          1   

   is_tech_product  
0                0  
1                0  
2                0  
3                0  
4                0  
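
Word boundaries are easy to get wrong in unit patterns like these, so a spot check on a few hand-written titles can be worthwhile before trusting the flags. The sketch below uses plain `re`; the sample titles and all four pattern variants are hypothetical and chosen only to illustrate how `\b` behaves next to a quote character and next to digits.

```python
import re

# Hypothetical toy titles, not taken from the dataset.
samples = ['dell monitor 24" led', 'canon camera 12mp', 'sony tv 55 inch']

# \b after a quote only matches if a word character follows the quote.
quote_then_boundary = re.compile(r'\d+\s*(?:inch|")\b')
quote_without_boundary = re.compile(r'\d+\s*(?:inch\b|")')

# \b before "mp" fails when a digit precedes it, since digits are word characters.
bounded_mp = re.compile(r'\bmp\b')
digit_prefixed_mp = re.compile(r'(?:\b|\d)mp\b')

results = {
    title: {
        'quote_then_boundary': bool(quote_then_boundary.search(title)),
        'quote_without_boundary': bool(quote_without_boundary.search(title)),
        'bounded_mp': bool(bounded_mp.search(title)),
        'digit_prefixed_mp': bool(digit_prefixed_mp.search(title)),
    }
    for title in samples
}

for title, flags in results.items():
    print(title, flags)
```

Running the same kind of check against a handful of real titles from `df` shows whether a flag fires where it is expected to.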
