### Text Classification using fastText

##### Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

This dataset contains e-commerce item descriptions, each belonging to one of 4 categories:
1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task is to classify a product into one of these 4 categories based on its description.

In [1]:
# Import pandas and load the dataset
# The CSV has no header row, so the column names are supplied explicitly
import pandas as pd
df = pd.read_csv('ecommerce_dataset.csv', names=['category', 'description'], header=None)
df.head()

Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [2]:
df.category.value_counts()

category
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64

In [3]:
df.isnull().sum()

category       0
description    1
dtype: int64

In [4]:
df.dropna(inplace=True)
df.isnull().sum()

category       0
description    0
dtype: int64

In [5]:
df.shape

(50424, 2)

In [7]:
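# A fastText label must be a single token with no spaces, so replace ' & ' with '_'
# Assigning the result back avoids calling replace(inplace=True) on a column accessor
df['category'] = df['category'].replace('Clothing & Accessories', 'Clothing_Accessories')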
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

In [8]:
# fastText identifies labels by the prefix __label__ (two underscores on each side)
df['category'] = "__label__" + df['category'].astype(str)
df.head(3)

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
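
The two label steps above (removing characters that would split a label, then adding the prefix) can be folded into one helper. This is a minimal sketch: `to_fasttext_label` is a hypothetical name that does not appear in the notebook, and it assumes the default fastText label prefix `__label__`.

```python
import re

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def to_fasttext_label(category):
    """Turn a raw category name into a single-token fastText label (hypothetical helper)."""
    # Any run of non-word characters (spaces, '&', ...) becomes one underscore
    token = re.sub(r"\W+", "_", category.strip())
    return LABEL_PREFIX + token


labels = [to_fasttext_label(c) for c in ["Household", "Clothing & Accessories"]]
print(labels)
```

With pandas this could be applied as `df['category'].map(to_fasttext_label)`, which would replace the separate `replace` and string-concatenation cells.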


In [15]:
df['category_description']=df['category']+" "+df['description']
df.head(3)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed Wal...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Painting ...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art P...


In [10]:
df['category_description'][0]

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and some 

**Pre-processing**
1. Remove punctuation (apostrophes are kept)
2. Remove extra spaces
3. Convert the entire text to lower case

In [11]:
import re

In [12]:
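def preprocess(text):
    # Replace everything except word characters, whitespace and apostrophes with a space
    text = re.sub(r'[^\w\s\']', ' ', text)
    # Collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)
    # Trim the ends and lowercase (this also lowercases the label)
    return text.strip().lower()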
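
To see what the three pre-processing steps do to a single row, here is a small self-contained check. The function is repeated so the snippet runs on its own; the sample string is made up for illustration and is not taken from the dataset.

```python
import re


def preprocess(text):
    # Same three steps as the notebook cell: strip punctuation, squeeze spaces, lowercase
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()


# Illustrative input only (not a real dataset row)
sample = "__label__Household  SAF 'Floral' Framed Painting (Wood, 30 inch x 10 inch)"
cleaned = preprocess(sample)
print(cleaned)
# __label__household saf 'floral' framed painting wood 30 inch x 10 inch
```

Underscores count as word characters for `\w`, so the `__label__` prefix survives the punctuation step, while the label text itself is lowercased along with the description.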

In [16]:
df['category_description']=df['category_description'].map(preprocess)

In [17]:
df['category_description']

0        __label__household paper plane design framed wal...
1        __label__household saf 'floral' framed painting ...
2        __label__household saf 'uv textured modern art p...
3        __label__household saf flower print framed paint...
4        __label__household incredible gifts india wooden...
                               ...                        
50420    __label__electronics strontium microsd class 10 ...
50421    __label__electronics crossbeats wave waterproof ...
50422    __label__electronics karbonn titanium wind w4 wh...
50423    __label__electronics samsung guru fm plus sm b11...
50424    __label__electronics micromax canvas win w121 white
Name: category_description, Length: 50424, dtype: object

### Train-Test Split

In [19]:
from sklearn.model_selection import train_test_split
train,test=train_test_split(df,test_size=0.2)

In [20]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [21]:
test.head(3)

Unnamed: 0,category,description,category_description
10249,__label__Household,"Milton Aqua 1000 Stainless Steel Water Bottle,...",__label__household milton aqua 1000 stainless st...
40927,__label__Electronics,WireScorts Interface USB3.0 and 3 Ports USB Hu...,__label__electronics wirescorts interface usb3 0...
2862,__label__Household,Crompton Greaves Ozone 55 Ltrs Desert Air Cool...,__label__household crompton greaves ozone 55 ltr...


In [30]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)
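
fastText reads one example per line and expects each line to begin with the label. A CSV writer may wrap a field in double quotes when it contains a newline or another special character, which would hide the `__label__` prefix on that line. The sketch below shows one way to write the file directly and then verify it. `write_fasttext_file` and `count_bad_lines` are hypothetical helpers, and the two rows are toy stand-ins for `train["category_description"]`.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def write_fasttext_file(lines, path):
    """Write one example per line, collapsing all internal whitespace (including newlines)."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(" ".join(line.split()) + "\n")


def count_bad_lines(path):
    """Count lines that do not start with the label prefix."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if not line.startswith(LABEL_PREFIX))


# Toy rows; the second one contains an embedded newline
rows = [
    "__label__household saf 'floral' framed painting",
    "__label__books a short\nhistory of nearly everything",
]
path = os.path.join(tempfile.mkdtemp(), "toy.train")
write_fasttext_file(rows, path)
bad = count_bad_lines(path)
print(bad)
```

Running `count_bad_lines` on `ecommerce.train` and `ecommerce.test` is a cheap way to confirm that every row will be parsed as a labelled example before training.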

In [35]:
import fasttext

model=fasttext.train_supervised(input='ecommerce.train')
model.test('ecommerce.test')

(10084, 0.9693573978579929, 0.9693573978579929)

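`model.test` returns a tuple. The first element (10084) is the number of test examples evaluated; the second and third are precision and recall at k=1. Both are about 97%, which is a strong result for a model trained with default settings. The two values are identical because every example has exactly one label and the model predicts exactly one label per example.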
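
To make the meaning of the tuple concrete, the sketch below computes precision and recall at k on toy data in plain Python. It illustrates the definitions only and is not fastText's internal code; `precision_recall_at_k` is a hypothetical name.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """Return (n_examples, precision@k, recall@k).

    true_labels: list of sets of gold labels, one set per example
    ranked_predictions: list of label lists, best guess first
    """
    hits = 0
    n_predicted = 0
    n_true = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)  # denominator for precision
        n_true += len(truth)       # denominator for recall
    return len(true_labels), hits / n_predicted, hits / n_true


# Toy data: four single-label examples, three predicted correctly
truth = [{"household"}, {"books"}, {"electronics"}, {"books"}]
preds = [["household"], ["books"], ["household"], ["books"]]
result = precision_recall_at_k(truth, preds, k=1)
print(result)
# (4, 0.75, 0.75)
```

With one gold label and one prediction per example the two denominators are equal, so precision and recall both reduce to plain accuracy.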

### Prediction

In [36]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__electronics',), array([0.99864846]))

In [37]:
model.predict("I love to wear cotton dresses in summer")

(('__label__clothing_accessories',), array([0.9997713]))

In [39]:
model.predict("My phone is not working i need to upgrade to iphone15")

(('__label__electronics',), array([0.63933337]))
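
The model was trained on lowercased text with punctuation removed, while the sentences above are passed in raw. A reasonable precaution is to run the same `preprocess` step before predicting, so that words such as "I" or "My" match the training vocabulary. The wrapper below is a sketch: `predict_clean` is a hypothetical helper, and `model` is assumed to be the object returned by `fasttext.train_supervised`.

```python
import re


def preprocess(text):
    # Same normalisation used when building the training file
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()


def predict_clean(model, text, k=1):
    """Normalise text the same way as the training data, then call model.predict."""
    return model.predict(preprocess(text), k=k)


# Example call, assuming a trained fastText model named `model` is in scope:
# predict_clean(model, "My phone is not working, I need to upgrade to iPhone15")
```

Whether this changes the predicted label or only the confidence depends on the input; it mainly keeps inference consistent with training.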

In [42]:
model.get_nearest_neighbors("cloth")

[(0.9880392551422119, 'off'),
 (0.9866911768913269, 'cut'),
 (0.9838992357254028, 'coin'),
 (0.9835329651832581, 'warranty'),
 (0.9835214018821716, 'wide'),
 (0.9822599291801453, 'its'),
 (0.9766194820404053, 'dark'),
 (0.9764715433120728, 'your'),
 (0.9754511117935181, 'tube'),
 (0.9738571047782898, 'starword')]

In [43]:
model.get_nearest_neighbors("electronics")

[(0.994606614112854, 'writing'),
 (0.9932381510734558, 'my'),
 (0.9910879731178284, '2pcs'),
 (0.9892962574958801, 'computer'),
 (0.9892043471336365, 'volume'),
 (0.9886366724967957, 'improve'),
 (0.988511323928833, 'class'),
 (0.9872010350227356, 'offers'),
 (0.9863372445106506, 'network'),
 (0.9860197901725769, 'fm')]