This is a classification-based E-commerce text dataset covering 4 categories: "Electronics", "Household", "Books" and "Clothing & Accessories", which together account for almost 80% of any E-commerce website.

https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification


## Installation of Transformers

In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [4]:
import os
import gdown
import pickle
import zipfile
import warnings
import numpy as np
import pandas as pd

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

warnings.filterwarnings('ignore')

import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

## Downloading the preprocessed data from Google Drive

In [3]:
!gdown https://drive.google.com/uc?id=1-4yfh2Q2dONZyGVaJbn23jQvcKqr_n8Q
!gdown https://drive.google.com/uc?id=10FwxJr-2aQ5Jr8nypw3y8d_q0H3ZcPhW

Downloading...
From: https://drive.google.com/uc?id=1-4yfh2Q2dONZyGVaJbn23jQvcKqr_n8Q
To: /content/ecommerce_product_classification.csv
100% 65.1M/65.1M [00:01<00:00, 54.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=10FwxJr-2aQ5Jr8nypw3y8d_q0H3ZcPhW
To: /content/Ecommerce_Product_Classification.zip
100% 247M/247M [00:02<00:00, 101MB/s]


## Unzipping the label encoder and the pretrained DistilBERT model

In [5]:
# Open the archive in a context manager so the file handle is closed after extraction
with zipfile.ZipFile('/content/Ecommerce_Product_Classification.zip') as archive:
    archive.extractall("/content")

## Loading a pandas dataframe

In [6]:
df = pd.read_csv('ecommerce_product_classification.csv')
print(df.shape)
df.head()

(50424, 4)


| | product_type | description | clean_description | encoded_product_type |
|---|---|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... | paper plane design framed wall hanging motivat... | 3 |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | saf ' floral ' framed painting ( wood , inch x... | 3 |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... | saf ' uv textured modern art print framed ' pa... | 3 |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... | saf flower print framed painting ( synthetic ,... | 3 |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... | incredible gifts india wooden happy birthday u... | 3 |


In [7]:
df['clean_description'][0]

'paper plane design framed wall hanging motivational office decor art prints ( 8.7 x 8.7 inch ) - set painting synthetic frame uv textured print gives multi effects attracts . special series paintings makes wall beautiful gives royal touch . painting ready hang , proud possess unique painting niche apart . use modern efficient printing technology prints , inks precision epson , roland hp printers . innovative hd printing technique results durable spectacular looking prints highest lifetime . print solely - notch % inks , achieve brilliant true colours . high level uv resistance , prints retain beautiful colours years . add colour style living space digitally printed painting . pleasure eternal bliss.so bring home elegant print lushed rich colors makes sheer elegance friends family.it treasured forever lucky recipient . liven place intriguing paintings high definition hd graphic digital prints home , office room .'

## Checking null values


In [8]:
df.isnull().sum()

product_type            0
description             0
clean_description       5
encoded_product_type    0
dtype: int64

## Dropping rows having null values

In [9]:
df = df.dropna(subset=['clean_description'])

In [10]:
df.isnull().sum()

product_type            0
description             0
clean_description       0
encoded_product_type    0
dtype: int64
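
One detail worth keeping in mind for the lookups later in this notebook: `dropna` keeps the original index labels, so `df['clean_description'][1234]` selects the row labelled 1234, not the 1235th remaining row, and it would raise a `KeyError` if that label had been one of the dropped rows. The following is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame with one missing clean_description
toy = pd.DataFrame({
    "product_type": ["Books", "Household", "Electronics"],
    "clean_description": ["a novel", None, "usb cable"],
})

# Drop only the rows where clean_description is missing
cleaned = toy.dropna(subset=["clean_description"])

# The surviving rows keep their original labels, leaving a gap at 1
kept_labels = list(cleaned.index)

# Optional: renumber the rows from 0 if positional-style lookups are preferred
renumbered = cleaned.reset_index(drop=True)
```

Whether to call `reset_index(drop=True)` is a design choice; the notebook works without it because the labels it looks up were not among the dropped rows.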

## Loading the model and the tokenizer for inference


In [11]:
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

In [12]:
save_directory = "/content/saved_models"

In [13]:
loaded_tokenizer = DistilBertTokenizerFast.from_pretrained(save_directory)
loaded_model = TFDistilBertForSequenceClassification.from_pretrained(save_directory)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at /content/saved_models.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


## Load the label encoder pickle file

In [14]:
# Load the fitted label encoder so predicted class indices can be mapped back to category names
encoded_pickle_file = "/content/label_encoder/ecommerce_product_classification_label_encoder.pkl"
with open(encoded_pickle_file, 'rb') as encoder_file:
    loaded_label_encoder = pickle.load(encoder_file)


## Inference

In [16]:
test_text = df['clean_description'][1234]
test_text

'deckup bei - door shoe rack wooden legs ( dark wenge , matte finish ) color : brown engineered wood . comes doors - footwear invisible closed . ventilated clean hygienic .'

In [17]:
predict_input = loaded_tokenizer.encode(
    test_text,
    truncation=True,
    padding=True,
    return_tensors="tf",
)
print(predict_input)

tf.Tensor(
[[  101  5877  6279 21388  1011  2341 10818 14513  4799  3456  1006  2601
  19181  3351  1010  4717  2063  3926  1007  3609  1024  2829 13685  3536
   1012  3310  4303  1011  3329 16689  8841  2701  1012 18834 11733  3064
   4550  1044  2100 11239  8713  1012   102]], shape=(1, 43), dtype=int32)


In [20]:
output = loaded_model(predict_input)[0]
print(output)

tf.Tensor([[-3.1534643 -2.4930713 -3.198036   6.2556047]], shape=(1, 4), dtype=float32)
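
The four values printed above are raw logits, one per category, not probabilities. If a confidence score is wanted, a softmax can be applied to them. The sketch below uses only the standard library and copies the logits from the output above; the helper name `softmax` is illustrative. Inside the notebook itself, `tf.nn.softmax(output, axis=1)` should give the same result.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Logits copied from the model output shown above
logits = [-3.1534643, -2.4930713, -3.198036, 6.2556047]
probs = softmax(logits)

# Index of the most probable class, equivalent to the argmax over the logits
best = max(range(len(probs)), key=probs.__getitem__)
```

Because softmax preserves ordering, `best` matches the index that `tf.argmax` picks from the logits in the next cell.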


In [21]:
prediction_value = tf.argmax(output, axis=1).numpy()[0]
prediction_value

3

## Prediction

In [22]:
loaded_label_encoder.inverse_transform([prediction_value])[0]

'Household'

## Original label

In [23]:
df['product_type'][1234]


'Household'

## Inference pipeline

In [27]:
# Tokenize one cleaned description, run the model, and map the top class index back to its label
test_text = df['clean_description'][34567]
predict_input = loaded_tokenizer.encode(
    test_text,
    truncation=True,
    padding=True,
    return_tensors="tf",
)
output = loaded_model(predict_input)[0]  # logits, one score per category
prediction_value = tf.argmax(output, axis=1).numpy()[0]
final_prediction_value = loaded_label_encoder.inverse_transform([prediction_value])[0]
print(final_prediction_value)

Clothing & Accessories


In [28]:
df['product_type'][34567]

'Clothing & Accessories'

In [29]:
df['clean_description'][34567]

'rupa thermocot men cotton thermal'
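
Since the inference pipeline above repeats the same steps for every description, it could be wrapped in a small helper. This is only a sketch: the function name `classify_description` and its parameters are hypothetical, and it assumes the tokenizer, model and label encoder behave like the `loaded_tokenizer`, `loaded_model` and `loaded_label_encoder` objects used in this notebook. It uses NumPy's `argmax` in place of `tf.argmax`, which should give the same index.

```python
import numpy as np

def classify_description(text, tokenizer, model, label_encoder):
    """Return the predicted category name for one cleaned description."""
    # Tokenize into a batch of one sequence, as in the cells above
    input_ids = tokenizer.encode(
        text,
        truncation=True,
        padding=True,
        return_tensors="tf",
    )
    # The first element of the model output holds the logits, shape (1, num_classes)
    logits = model(input_ids)[0]
    # Pick the highest-scoring class for the single item in the batch
    class_index = int(np.argmax(np.asarray(logits), axis=1)[0])
    # Map the integer class back to its category name
    return label_encoder.inverse_transform([class_index])[0]
```

In the notebook it would be called as `classify_description(df['clean_description'][34567], loaded_tokenizer, loaded_model, loaded_label_encoder)`.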