### NLP Tutorial: Text Classification Using FastText
Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
We have a dataset of e-commerce item descriptions covering 4 categories:

1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task at hand is to classify a product into one of these 4 categories based on its description.

In [2]:
import pandas as pd

df = pd.read_csv("Sample CSV/ecommerce_dataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head()

(50425, 2)


| | category | description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


In [3]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [4]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [5]:
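# fastText labels are whitespace-delimited, so give this category a space-free name.
# Assigning the result back avoids relying on an in-place update through attribute access.
df['category'] = df['category'].replace("Clothing & Accessories", "Clothing_Accessories")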

In [6]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

In [7]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

| | category | description |
|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... |


In [8]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household Paper Plane Design Framed W... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household SAF 'Floral' Framed Paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household SAF 'UV Textured Modern Art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__Household SAF Flower Print Framed Pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__Household Incredible Gifts India Wood... |

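The two cells above build the line format that fastText's supervised mode reads: one example per line, with the label marked by the `__label__` prefix and separated from the text by whitespace. That is also why the category name cannot contain spaces. A minimal sketch of the same idea without pandas follows; the helper name `to_fasttext_line` is hypothetical and not part of fastText.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Build one supervised training line: '<prefix><label> <text>'."""
    # Labels are whitespace-delimited, so join the words of the
    # category name with underscores (dropping the ampersand).
    safe_label = "_".join(label.replace("&", " ").split())
    return f"{prefix}{safe_label} {text}"


line = to_fasttext_line("Clothing & Accessories", "men's cotton t shirt")
print(line)  # __label__Clothing_Accessories men's cotton t shirt
```

Doing the clean-up inside one helper keeps the label rule in a single place if more categories are added later.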

***Pre-processing***

1. Remove punctuation (apostrophes are kept)
2. Remove extra spaces
3. Convert the entire text to lower case

In [9]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [10]:
def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower()

In [11]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__household paper plane design framed w... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__household saf 'floral' framed paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__household saf 'uv textured modern art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__household saf flower print framed pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__household incredible gifts india wood... |

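Because `preprocess` runs on the combined `category_description` column, it also touches the label. The short check below shows that the `__label__` prefix survives (the underscore counts as a word character for `\w`) and that the label itself ends up lower-cased. The sample string is made up for illustration.

```python
import re


def preprocess(text):
    # Replace anything that is not a word character, whitespace, or apostrophe.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse the runs of spaces left behind by the substitution.
    text = re.sub(" +", " ", text)
    return text.strip().lower()


raw = "__label__Clothing_Accessories Men's T-Shirt (Blue, Size: M)"
clean = preprocess(raw)
print(clean)  # __label__clothing_accessories men's t shirt blue size m
```

The lower-casing is why the trained model later reports labels such as `__label__electronics` rather than `__label__Electronics`.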

### Train Test Split

In [12]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [13]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [14]:
train.to_csv("Sample CSV/ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("Sample CSV/ecommerce.test", columns=["category_description"], index=False, header=False)

In [15]:
import fasttext

model = fasttext.train_supervised(input="Sample CSV/ecommerce.train")
model.test("Sample CSV/ecommerce.test")

(10082, 0.9662765324340409, 0.9662765324340409)

The first value (10082) is the number of test examples evaluated. The second and third values are precision and recall at k=1 (the default), respectively. The model reaches about 96.6% precision, which is a strong result for a model trained with default settings.
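
To make the two metrics concrete, here is a small pure-Python sketch of precision and recall at k. It is an illustration of the definitions, not fastText's own code, and the labels in the example are made up.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """true_labels: list of sets; ranked_predictions: list of ranked label lists."""
    n_correct = 0    # predicted labels that are actually true
    n_predicted = 0  # labels the model was allowed to predict (up to k per example)
    n_true = 0       # true labels across all examples
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_true += len(truth)
    return n_correct / n_predicted, n_correct / n_true


truth = [{"books"}, {"household"}, {"electronics"}, {"books"}]
predictions = [
    ["books", "household"],
    ["electronics", "household"],
    ["electronics", "books"],
    ["books", "electronics"],
]

at_1 = precision_recall_at_k(truth, predictions, k=1)
at_2 = precision_recall_at_k(truth, predictions, k=2)
print(at_1)  # (0.75, 0.75)
print(at_2)  # (0.5, 1.0)
```

With exactly one true label per example and k=1, the number of predictions equals the number of true labels, so precision and recall come out identical, as they do in the output above. Raising k trades precision for recall.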

*Now let's predict the category for a few product descriptions.*

In [16]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__electronics',), array([0.99685782]))

In [17]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [18]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000989]))
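
The strings passed to `model.predict` above are already lower-cased and free of punctuation. For raw product text, it makes sense to apply the same cleaning used on the training data first. The sketch below wraps both steps; the helper name `predict_category` is hypothetical, and the use of `\s+` (which also folds tabs and line breaks into a single space, since `predict` expects one line of text) is a small variation on the tutorial's `preprocess`.

```python
import re


def preprocess(text):
    text = re.sub(r"[^\w\s']", " ", text)
    # Variation: \s+ instead of ' +' so line breaks and tabs are collapsed too.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def predict_category(model, raw_text, k=1):
    """Clean raw_text like the training data, then return (label, probability) pairs."""
    labels, probabilities = model.predict(preprocess(raw_text), k=k)
    return [
        (label.replace("__label__", ""), float(prob))
        for label, prob in zip(labels, probabilities)
    ]
```

With the model trained earlier, a call would look like `predict_category(model, "Think and Grow Rich: Deluxe Edition!")`, and passing `k=2` returns the two most likely categories.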

In [19]:
model.get_nearest_neighbors("painting")

[(0.9979185461997986, 'door'),
 (0.997356116771698, 'vacuum'),
 (0.9970409870147705, 'lid'),
 (0.9967893362045288, 'lamp'),
 (0.9966614842414856, 'guard'),
 (0.995917558670044, 'furniture'),
 (0.9958286881446838, 'bulb'),
 (0.995821475982666, 'artificial'),
 (0.9954538941383362, 'rack'),
 (0.9951730966567993, 'cook')]

In [20]:
model.get_nearest_neighbors("sony")

[(0.9982239007949829, 'binoculars'),
 (0.9976645112037659, 'external'),
 (0.9974766373634338, 'antenna'),
 (0.9971629977226257, 'dvd'),
 (0.9971431493759155, 'binocular'),
 (0.9968652725219727, 'point'),
 (0.9967527985572815, 'output'),
 (0.9963108897209167, 'stereo'),
 (0.9959954023361206, 'magnification'),
 (0.9958712458610535, 'glossy')]

In [21]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, 'p192gr1'),
 (0.0, 'jacguard'),
 (0.0, '81de0048in')]
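
One plausible reading of the all-zero scores: "banglore" does not occur in the training vocabulary, and if the model was trained without subword n-grams (assumed here, not checked against this run's settings), the word's vector would be all zeros. The cosine similarity between a zero vector and any other vector is then 0. The sketch below illustrates this with made-up vectors; the zero-norm guard is an assumption of this example, not necessarily how fastText handles the case.

```python
import math


def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Guard against division by zero when either vector has no length.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


unknown_word = [0.0, 0.0, 0.0]   # stand-in for an out-of-vocabulary word
known_word = [0.2, -0.5, 0.1]    # stand-in for a word seen during training

zero_score = cosine_similarity(unknown_word, known_word)
self_score = cosine_similarity(known_word, known_word)
print(zero_score)  # 0.0
print(self_score)
```

Under that reading, the neighbours listed for "banglore" carry no meaning: every vocabulary word ties at 0.0, so the order is arbitrary.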