#Text Classification Using FastText

Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

- We have a dataset of ecommerce item descriptions in 4 categories: Household, Electronics, Clothing & Accessories, and Books.

- The task at hand is to classify a product into one of the above 4 categories based on the product description.

In [1]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp311-cp311-linux_x86_64.whl size=4313475 sha256=494583d0989074b13871a64688d63e157ecb0cd2d5eceec0f77e7147c403e8c0
  Stored in directory: /root/.cache/pip/wheels/65/4f/35/5057db0249224e9ab55a51

In [2]:
import pandas as pd

df = pd.read_csv("/content/ecommerce_dataset.csv", names=["category", "description"], header=None)

df.head()

Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [3]:
df.shape

(50425, 2)

In [4]:
df['category'].value_counts()

category,count
Household,19313
Books,11820
Electronics,10621
Clothing & Accessories,8671


In [5]:
#looking for nan samples
df.isna().sum()

Unnamed: 0,0
category,0
description,1


Only one sample has a missing description, so we drop it.

In [6]:
df.dropna(inplace=True)
df.shape

(50424, 2)

For supervised training, fastText expects one example per line: a label prefixed with `__label__`, followed by the text, for example `__label__books the book description or whatever`. A label must be a single token, so we first replace the spaces in the category "Clothing & Accessories" with underscores. We then add the `__label__` prefix to each category and combine category and description into one new column.
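As a minimal sketch of the line format described above, the helper below builds one training line from a category and a description. The function name `to_fasttext_line` is hypothetical and is not part of fastText or of this notebook; the cells that follow do the same job with pandas column operations.

```python
def to_fasttext_line(category: str, description: str) -> str:
    """Build one line in the form '__label__<category> <description>'."""
    # A label must be a single token, so whitespace inside the category
    # is replaced by underscores.
    label = "__label__" + "_".join(category.split())
    # Each example must sit on exactly one line, so newlines and repeated
    # spaces in the description are collapsed into single spaces.
    text = " ".join(description.split())
    return f"{label} {text}"


print(to_fasttext_line("Books", "Think and Grow Rich\n(Deluxe Edition)"))
```

Collapsing newlines matters because fastText treats every line of the input file as a separate example.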

In [7]:
df['category'] = df['category'].replace("Clothing & Accessories", "Clothing_and_Accessories")
df['category'].unique()


array(['Household', 'Books', 'Clothing_and_Accessories', 'Electronics'],
      dtype=object)

In [8]:
df['category'] = '__label__' + df['category'].astype(str)
df.head()

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [9]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__Household SAF Flower Print Framed Pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__Household Incredible Gifts India Wood...


#Preprocessing

In [10]:
import re

def preprocessing(text):
  # replace punctuation with a space; underscores count as word characters,
  # so the __label__ prefix is left intact
  text = re.sub(r'[^\w\s]', ' ', text)

  # collapse runs of spaces into a single space
  text = re.sub(r' +', ' ', text)

  # strip leading and trailing whitespace and lowercase everything (the label too)
  return text.strip().lower()

In [11]:
df['category_description'] = df['category_description'].map(preprocessing)
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf floral framed painting ...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf uv textured modern art ...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


In [12]:
df['category_description'][0]

'__label__household paper plane design framed wall hanging motivational office decor art prints 8 7 x 8 7 inch set of 4 painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it this is an special series of paintings which makes your wall very beautiful and gives a royal touch this painting is ready to hang you would be proud to possess this unique painting that is a niche apart we use only the most modern and efficient printing technology on our prints with only the and inks and precision epson roland and hp printers this innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime we print solely with top notch 100 inks to achieve brilliant and true colours due to their high level of uv resistance our prints retain their beautiful colours for many years add colour and style to your living space with this digitally printed painting some are for pleasure and some for eternal bli

#Train Test Split

In [13]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [14]:
print("Train Dataset: ", train.shape)
print("Test Dataset: ", test.shape)

Train Dataset:  (40339, 3)
Test Dataset:  (10085, 3)


In [15]:
train.to_csv("ecommerce.train", columns=['category_description'], header=False, index=False)
test.to_csv("ecommerce.test", columns=['category_description'], header=False, index=False)

fastText offers two training modes: an unsupervised mode (`train_unsupervised`), which learns word embeddings, and a supervised mode (`train_supervised`), which trains a text classifier. In supervised mode, fastText learns the word embeddings and the classifier together from the labelled lines, then uses those embeddings to predict the category.
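To make the idea of "embeddings feeding a classifier" concrete, here is a toy forward pass: average the word vectors of a text, apply a linear layer, and take a softmax. This is only an illustrative sketch of the general approach. The 2-dimensional vectors and weights are made-up values, the names `EMBEDDINGS`, `LABEL_WEIGHTS`, and `predict` are hypothetical, and the real library also uses n-gram features and learns all of these values during training.

```python
import math

# Toy embedding table (made-up 2-dimensional vectors, not learned values).
EMBEDDINGS = {
    "camera": [1.0, 0.0],
    "battery": [0.8, 0.2],
    "jacket": [0.0, 1.0],
    "winter": [0.1, 0.9],
}

# Toy linear layer: one weight vector per label (also made up).
LABEL_WEIGHTS = {
    "__label__electronics": [2.0, -1.0],
    "__label__clothing_and_accessories": [-1.0, 2.0],
}


def average_embedding(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(text):
    """Return (label, probability) for the highest-scoring label."""
    hidden = average_embedding(text.lower().split())
    labels = list(LABEL_WEIGHTS)
    scores = [sum(w * h for w, h in zip(LABEL_WEIGHTS[label], hidden))
              for label in labels]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


print(predict("winter jacket"))
print(predict("camera battery"))
```

With these toy values, "winter jacket" scores highest for the clothing label and "camera battery" for the electronics label.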

In [16]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10084, 0.9674732249107497, 0.9674732249107497)

In the output above, the first value (10084) is the number of test samples fastText evaluated. The second value is the precision at 1 and the third is the recall at 1. Because each sample has exactly one true label and the model predicts one label per sample, the two values are identical.
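The sketch below shows why precision and recall coincide in this single-label setting. The function `precision_recall_at_1` and the four example labels are hypothetical illustrations, not fastText code or data from this notebook.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Compute precision@1 and recall@1 when every sample has one true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    # precision@1: correct predictions / predictions made (one per sample)
    precision = correct / len(predicted_labels)
    # recall@1: correct predictions / true labels (one per sample)
    recall = correct / len(true_labels)
    return precision, recall


y_true = ["books", "household", "electronics", "books"]
y_pred = ["books", "household", "books", "books"]
print(precision_recall_at_1(y_true, y_pred))  # (0.75, 0.75)
```

Both denominators equal the number of samples, so both metrics reduce to plain accuracy here.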

In [17]:
model.predict("Looking for a comfortable jacket for the winter season.")

(('__label__clothing_and_accessories',), array([0.99200171]))

In [18]:
model.predict("The new smartphone has an impressive camera and long battery life.")

(('__label__electronics',), array([0.99579531]))

In [19]:
model.predict("office decor art prints 8 7 x 8 7 inch set of 4 painting")

(('__label__household',), array([0.97096372]))

In [20]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000942]))

In [28]:
model.get_nearest_neighbors("sony")

[(0.9993633031845093, 'fo'),
 (0.9993615746498108, 'godox'),
 (0.9993606209754944, 'tewtross'),
 (0.9993597269058228, 'reptiles'),
 (0.9993539452552795, 'hmd'),
 (0.9993538856506348, 'licensee'),
 (0.9993538856506348, 'appsshazam'),
 (0.9993538856506348, 'phonewindows'),
 (0.9993538856506348, 'onethe'),
 (0.9993538856506348, 'lifeuse')]