## Imports and Loading the Dataset

In [24]:
import pandas as pd
import re
import fasttext
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

In [25]:
df = pd.read_csv('ecommerceDataset.csv', names=["category", "description"], header=None)
df.head()

| | category | description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


## Preprocess the Data
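We preprocess the descriptions by replacing punctuation (except apostrophes) with spaces, collapsing repeated spaces, and converting the text to lowercase. We also rename "Clothing & Accessories" to "Clothing_Accessories" so the label contains no spaces, and prepend the `__label__` prefix that FastText expects to each category.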

In [26]:
def preprocess_text(text):
    text = re.sub(r'\n', ' ', text)  # Replace newlines with spaces
    text = re.sub(r'[^\w\s\']', ' ', text)  # Replace punctuation (except apostrophes) with spaces
    text = re.sub(' +', ' ', text)  # Collapse repeated spaces
    return text.strip().lower()  # Trim and convert to lowercase

def preprocess_data(df):
    df = df.dropna()  # Drop rows with missing values
    # Assign the result back instead of calling replace(..., inplace=True) on a column selection,
    # which triggers a pandas FutureWarning and will stop working in pandas 3.0
    df['category'] = df['category'].replace("Clothing & Accessories", "Clothing_Accessories")
    df['category'] = '__label__' + df['category'].astype(str)  # Add FastText label prefix
    df['category_description'] = df['category'] + ' ' + df['description'].apply(preprocess_text)
    return df


df = preprocess_data(df)
df.head()


| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household paper plane design framed w... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household saf 'floral' framed paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household saf 'uv textured modern art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__Household saf flower print framed pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__Household incredible gifts india wood... |
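To make the cleaning steps concrete, the sketch below applies the same three regular-expression substitutions to a single string. The sample description is made up for illustration and is not a row from the dataset.

```python
import re

def preprocess_text(text):
    text = re.sub(r'\n', ' ', text)          # Newlines become spaces
    text = re.sub(r'[^\w\s\']', ' ', text)   # Punctuation (except apostrophes) becomes spaces
    text = re.sub(' +', ' ', text)           # Runs of spaces collapse to one
    return text.strip().lower()

# Hypothetical raw description with brackets, a comma, and a newline
sample = "SAF 'Floral' Framed Painting (Wood, 30 inch x 10 inch)\nSpecial Effect"
cleaned = preprocess_text(sample)
print(cleaned)
# saf 'floral' framed painting wood 30 inch x 10 inch special effect
```

Note that apostrophes survive the cleaning, which matches the `'floral'` tokens visible in the `category_description` column above.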


## Train-Test Split
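We split the data into training (80%) and testing (20%) sets using `train_test_split`. Because no `random_state` is set, the split, and therefore the metrics reported later, will vary slightly between runs.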

In [27]:
train, test = train_test_split(df, test_size=0.2)
train.shape, test.shape


((40339, 3), (10085, 3))
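The two shapes can be sanity-checked with a little arithmetic. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set; the total row count is inferred from the shapes shown above.

```python
import math

n_train, n_test = 40339, 10085   # Row counts taken from the shapes above
n_total = n_train + n_test       # Rows remaining after dropping missing values

# Assumed rounding rule: test rows = ceil(test_size * n_total)
expected_test = math.ceil(0.2 * n_total)
expected_train = n_total - expected_test

print(n_total, expected_train, expected_test)
```

The third column in each shape is the `category_description` column added during preprocessing.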

## Save the Data for FastText
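FastText's supervised mode reads plain-text files with one example per line: the `__label__`-prefixed category followed by the text. We therefore write only the `category_description` column, without an index or header, to separate files for the training and testing sets.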

In [28]:
def save_fasttext_format(df, file_name, text_column):
    df.to_csv(file_name, columns=[text_column], index=False, header=False)

save_fasttext_format(train, "ecommerce.train", "category_description")
save_fasttext_format(test, "ecommerce.test", "category_description")
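As a quick illustration of the file layout, the sketch below writes two made-up rows in the same `__label__<category> <text>` form to a temporary file and parses them back. The rows, the file name `toy.train`, and the helper `split_label` are assumptions for illustration only.

```python
import os
import tempfile

# Two hypothetical examples in FastText's supervised format
rows = [
    "__label__Books think and grow rich deluxe edition",
    "__label__Household wooden framed wall hanging",
]

path = os.path.join(tempfile.mkdtemp(), "toy.train")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")

def split_label(line):
    """Split one FastText-format line into (label, text)."""
    label, _, text = line.rstrip("\n").partition(" ")
    return label, text

with open(path, encoding="utf-8") as f:
    parsed = [split_label(line) for line in f]

print(parsed[0])
# ('__label__Books', 'think and grow rich deluxe edition')
```

Opening `ecommerce.train` in a text editor and checking that every line starts with `__label__` is a cheap way to confirm the export worked.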

## Train the FastText Model
We train a supervised FastText model on the training file, using a learning rate of 0.5, 3 epochs, and word bigrams (`wordNgrams=2`).

In [29]:
def train_model(train_file, lr=0.5, epoch=3, wordNgrams=2):
    model = fasttext.train_supervised(input=train_file, lr=lr, epoch=epoch, wordNgrams=wordNgrams)
    return model

model = train_model("ecommerce.train")

Read 4M words
Number of words:  79409
Number of labels: 4
Progress: 100.0% words/sec/thread: 1088189 lr:  0.000000 avg.loss:  0.144561 ETA:   0h 0m 0s
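With `wordNgrams=2`, FastText uses pairs of adjacent words as features in addition to single words, which lets the classifier capture some word-order information. The sketch below only illustrates which unigrams and bigrams a short text yields. It is not FastText's implementation, which hashes n-grams into buckets internally, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("think and grow rich")
print(features)
# ['think', 'and', 'grow', 'rich', 'think and', 'and grow', 'grow rich']
```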


## Evaluate the Model
We'll evaluate the model using the test data by calculating the classification report and confusion matrix.

In [30]:
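def evaluate_model(model, test_data):
    # Strip only the trailing newline; slicing with [:-2] would also cut the last character of the text
    rows = [row.rstrip('\n') for row in test_data]
    y_true = [row.split()[0] for row in rows]  # The first token is the true label
    # The text is everything after the first space; predict returns (labels, probabilities)
    y_pred = [model.predict(row.split(' ', 1)[1])[0][0] for row in rows]
    print("Classification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))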

with open("ecommerce.test") as f:
    test_data = f.readlines()

evaluate_model(model, test_data)

Classification Report:
                                precision    recall  f1-score   support

               __label__Books       0.97      0.98      0.97      2368
__label__Clothing_Accessories       0.99      0.98      0.98      1745
         __label__Electronics       0.97      0.96      0.97      2080
           __label__Household       0.98      0.98      0.98      3892

                     accuracy                           0.98     10085
                    macro avg       0.98      0.97      0.98     10085
                 weighted avg       0.98      0.98      0.98     10085

Confusion Matrix:
 [[2312    7   16   33]
 [  16 1708    5   16]
 [  32    9 2000   39]
 [  33    4   32 3823]]
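The report and the matrix describe the same results. In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, in sorted label order. The sketch below recomputes accuracy, precision, and recall from the matrix printed above using plain Python; the short label names are used only for readability.

```python
labels = ["Books", "Clothing_Accessories", "Electronics", "Household"]
cm = [
    [2312, 7, 16, 33],
    [16, 1708, 5, 16],
    [32, 9, 2000, 39],
    [33, 4, 32, 3823],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))   # Diagonal = correct predictions
accuracy = correct / total

metrics = {}
for i, label in enumerate(labels):
    row_sum = sum(cm[i])                  # True examples of this class (support)
    col_sum = sum(row[i] for row in cm)   # Examples predicted as this class
    metrics[label] = (cm[i][i] / col_sum, cm[i][i] / row_sum)  # (precision, recall)

print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Each row sum equals the support column of the report (for example, 2312 + 7 + 16 + 33 = 2368 for Books).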


## Example Predictions
Let's test the model with a few example predictions.

In [31]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__Electronics',), array([0.9776547]))

In [32]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__Clothing_Accessories',), array([1.00001001]))

In [33]:
model.predict("think and grow rich deluxe edition")

(('__label__Books',), array([1.00001001]))
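`model.predict` returns a tuple of labels together with an array of probabilities. The reported probability can come out marginally above 1, as in the last two outputs, which appears to be a numerical artifact rather than a meaningful value. A small hypothetical helper such as `tidy_prediction` below could strip the `__label__` prefix and cap the probability at 1.0. It operates on raw output copied from above, with a plain list standing in for the NumPy array.

```python
def tidy_prediction(prediction, prefix="__label__"):
    """Turn a (labels, probabilities) pair into (clean_label, capped_probability)."""
    labels, probs = prediction
    label = labels[0]
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, min(float(probs[0]), 1.0)

# Raw output copied from the clothing example; a list stands in for the array
raw = (("__label__Clothing_Accessories",), [1.00001001])
result = tidy_prediction(raw)
print(result)
# ('Clothing_Accessories', 1.0)
```

New inputs should go through the same `preprocess_text` cleaning as the training data before being passed to `model.predict`, since the model only saw lowercase, punctuation-free text.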