In [1]:
# eCommerce Product Categorization

This notebook implements a machine learning solution to categorize eCommerce products based on textual descriptions. The project includes:
- Simulating a dataset of product descriptions and categories.
- Preprocessing the data for machine learning.
- Building and evaluating a classification model.



In [3]:
## 1. Dataset Simulation

A simulated dataset is created with:
- `Product_ID`: Unique identifiers for products.
- `Description`: Text describing the product.
- `Category`: Categories such as Electronics, Apparel, etc.



In [5]:
import pandas as pd
import random

# Simulating the dataset
categories = ["Electronics", "Apparel", "Home & Kitchen", "Accessories", "Sports"]
descriptions = [
    "Wireless earbuds with noise cancellation",
    "Cotton t-shirt with a graphic print",
    "Set of 4 ceramic dinner plates",
    "Smartphone with 128GB storage and 12MP camera",
    "Leather wallet with multiple card slots",
    "Yoga mat with non-slip surface",
    "Stainless steel water bottle with 1L capacity",
    "Bluetooth speaker with 10-hour battery life",
    "Running shoes with extra cushioning",
    "LED desk lamp with adjustable brightness"
]

random.seed(42)
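# Note: Category is sampled independently of Description, so the labels
# carry no information about the text.
data = pd.DataFrame({
    "Product_ID": range(1, 101),
    "Description": [random.choice(descriptions) for _ in range(100)],
    "Category": [random.choice(categories) for _ in range(100)]
})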

data.head()


index,Product_ID,Description,Category
0,1,Cotton t-shirt with a graphic print,Apparel
1,2,Wireless earbuds with noise cancellation,Sports
2,3,Leather wallet with multiple card slots,Sports
3,4,Smartphone with 128GB storage and 12MP camera,Home & Kitchen
4,5,Smartphone with 128GB storage and 12MP camera,Sports
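
In the output above, the same description ("Smartphone with 128GB storage and 12MP camera") appears under both Home & Kitchen and Sports, because `Category` is drawn independently of `Description`. A minimal sketch of one alternative follows. It assumes a hypothetical `description_to_category` mapping; the assignments are illustrative guesses and are not part of the original notebook.

```python
import random

# Hypothetical mapping from each description to a single category.
# These assignments are illustrative assumptions, not from the original notebook.
description_to_category = {
    "Wireless earbuds with noise cancellation": "Electronics",
    "Cotton t-shirt with a graphic print": "Apparel",
    "Set of 4 ceramic dinner plates": "Home & Kitchen",
    "Smartphone with 128GB storage and 12MP camera": "Electronics",
    "Leather wallet with multiple card slots": "Accessories",
    "Yoga mat with non-slip surface": "Sports",
    "Stainless steel water bottle with 1L capacity": "Sports",
    "Bluetooth speaker with 10-hour battery life": "Electronics",
    "Running shoes with extra cushioning": "Apparel",
    "LED desk lamp with adjustable brightness": "Home & Kitchen",
}

rng = random.Random(42)  # local generator, so the global random state is untouched
descriptions = list(description_to_category)

# Sample a description, then look its category up instead of sampling it separately.
rows = []
for product_id in range(1, 101):
    description = rng.choice(descriptions)
    rows.append({
        "Product_ID": product_id,
        "Description": description,
        "Category": description_to_category[description],
    })

print(rows[0])
```

Passing `rows` to `pd.DataFrame` would give a frame with the same three columns as `data`. With only ten distinct descriptions, a classifier can simply memorize this mapping, so scores on such data would say little about generalization.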


In [None]:
## 2. Data Preprocessing

Text descriptions are cleaned to remove special characters and normalize the text:
- Convert all text to lowercase.
- Remove special characters and numbers.
- Normalize whitespace.


In [7]:
import re

# Cleaning function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove special characters and numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    return text

data['Cleaned_Description'] = data['Description'].apply(clean_text)
data.head()


index,Product_ID,Description,Category,Cleaned_Description
0,1,Cotton t-shirt with a graphic print,Apparel,cotton tshirt with a graphic print
1,2,Wireless earbuds with noise cancellation,Sports,wireless earbuds with noise cancellation
2,3,Leather wallet with multiple card slots,Sports,leather wallet with multiple card slots
3,4,Smartphone with 128GB storage and 12MP camera,Home & Kitchen,smartphone with gb storage and mp camera
4,5,Smartphone with 128GB storage and 12MP camera,Sports,smartphone with gb storage and mp camera
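
The cleaned column above shows two side effects of the cleaning rules: deleting the hyphen merges "t-shirt" into "tshirt", and deleting digits leaves bare unit tokens such as "gb" and "mp". The short check below repeats `clean_text` so it runs on its own and applies it to two descriptions from the dataset plus one made-up string with irregular spacing (the third sample is an assumption added for illustration).

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)      # drop everything except letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

samples = [
    "Cotton t-shirt with a graphic print",
    "Smartphone with 128GB storage and 12MP camera",
    "  Set of 4   ceramic dinner plates ",  # hypothetical input with extra spaces
]

cleaned = [clean_text(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(repr(original), "->", repr(result))
```

If keeping "t" and "shirt" as separate tokens matters, one option is to replace the removed characters with a space rather than an empty string; whether that helps depends on the vectorizer and the data.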


In [None]:
## 3. Model Training and Evaluation

A Random Forest Classifier is trained to categorize products based on their descriptions. Model evaluation includes:
- Accuracy
- Precision, Recall, and F1-Score


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Splitting the dataset
X = data['Cleaned_Description']
y = data['Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=100)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Model Training
model = RandomForestClassifier(random_state=42)
model.fit(X_train_tfidf, y_train)

# Model Evaluation
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.3
Classification Report:
                 precision    recall  f1-score   support

   Accessories       0.40      0.67      0.50         3
       Apparel       0.43      0.75      0.55         4
   Electronics       0.00      0.00      0.00         4
Home & Kitchen       0.00      0.00      0.00         3
        Sports       0.33      0.17      0.22         6

      accuracy                           0.30        20
     macro avg       0.23      0.32      0.25        20
  weighted avg       0.25      0.30      0.25        20
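
To make the report easier to read, the sketch below recomputes precision, recall, and F1 for the three categories with at least one correct prediction. The counts are not printed by the notebook; they are inferred from the rounded figures and the support column (for example, recall 0.67 on support 3 implies 2 correct predictions, and precision 0.40 then implies 5 predictions in total), so treat them as an assumption.

```python
# (correct predictions, total predictions, support) per category.
# Inferred from the rounded classification report, not taken from the notebook.
counts = {
    "Accessories": (2, 5, 3),
    "Apparel": (3, 7, 4),
    "Sports": (1, 3, 6),
}

def precision_recall_f1(correct, predicted, support):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / support if support else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

scores = {name: precision_recall_f1(*c) for name, c in counts.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:<12} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Electronics and Home & Kitchen contribute no correct predictions,
# so accuracy is the correct count above divided by the 20 test items.
accuracy = sum(correct for correct, _, _ in counts.values()) / 20
print("accuracy:", accuracy)
```

Under these inferred counts, 6 of the 20 test products are classified correctly, which matches the reported accuracy of 0.30.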



In [None]:
## 4. Conclusion

- The Random Forest Classifier was trained and evaluated on a simulated dataset of eCommerce products.
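- The model reached an accuracy of only 0.30 on the 20-product test set, close to the 0.20 expected from random guessing across five categories, and scored 0.00 precision and recall on Electronics and Home & Kitchen. This is expected, because the simulated categories were assigned at random, independently of the descriptions.
- Future work should first use a dataset whose categories actually depend on the descriptions; advanced text classification models such as BERT can then be explored for improved accuracy.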
