## Text Classification using fastText

* Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

* We have a dataset of e-commerce item descriptions covering 4 categories:
    1. Household
    2. Electronics
    3. Clothing and Accessories
    4. Books

* The task at hand is to classify a product into one of the above 4 categories based on its description.

In [3]:
# Import the dataset (the CSV has no header row, so we name the columns ourselves):
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
df.sample(5)

| | category | description |
|---|---|---|
| 9957 | Household | Gtc 3 In 1 Large Sink Set Dish Rack Drainer Wi... |
| 5278 | Household | PARADIGM PICTURES Wood and Metal Colourful Win... |
| 31956 | Clothing & Accessories | elegante' Combo of UV protected Black Clubmast... |
| 1396 | Household | GSI Multicolour 10 x 10 cm Durable 0 to 9 Numb... |
| 21351 | Books | Barefoot to Boots: The Many Lives of Indian Fo... |


In [2]:
# Dataset shape is:
print(df.shape)

(50425, 2)


In [4]:
# Explore the category column:
df.category.value_counts()

Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: category, dtype: int64

* The classes are moderately imbalanced (Household has a little over twice as many samples as Clothing & Accessories), but not severely enough to need special handling here.
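
To put a number on the imbalance, the counts printed above can be turned into class shares. This is a small sketch that uses only the figures from the `value_counts()` output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Household": 19313,
    "Books": 11820,
    "Electronics": 10621,
    "Clothing & Accessories": 8671,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<24}{share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest = {imbalance_ratio:.2f}")
```

A ratio in this range is usually manageable without resampling, although per-class metrics are still worth checking after training.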

In [5]:
# Now let's drop any rows with missing (NaN) values:
df.dropna(inplace=True)
df.shape

(50424, 2)

* Only one row had a missing value.

In [6]:
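# Next we replace the category value 'Clothing & Accessories' with 'Clothing_Accessories',
# because fastText splits text on whitespace and a label must be a single token.
# Assigning the result back avoids relying on inplace=True on a selected column.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")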

In [7]:
# Now to check:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)
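
The rename matters because fastText reads each training line as whitespace-separated tokens and treats tokens that start with the label prefix as labels. The sketch below mimics that split in plain Python to show what would happen to a label containing spaces. The helper `split_labels` is hypothetical and not part of fastText.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix

def split_labels(line):
    """Mimic how a whitespace tokenizer separates labels from words."""
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, words

bad = "__label__Clothing & Accessories men's cotton t shirt"
good = "__label__Clothing_Accessories men's cotton t shirt"

# The label is cut to '__label__Clothing'; '&' and 'Accessories' end up as ordinary words.
print(split_labels(bad))
# The label stays intact.
print(split_labels(good))
```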

In [8]:
# fastText expects every label to carry a prefix ('__label__' by default), so we prepend it to the values of the
# existing 'category' column.
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

| | category | description |
|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... |


In [9]:
# fastText reads one training example per line, with the label followed by the text. So we merge the 'category'
# and 'description' columns into a third column, 'category_description', which we will later export to a text file.
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household Paper Plane Design Framed W... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household SAF 'Floral' Framed Paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household SAF 'UV Textured Modern Art... |


In [10]:
# To see the first sample:
df.category_description[0]

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and so

### Pre-processing

1. Remove punctuation
2. Remove extra spaces
3. Make the entire sentence lower case

In [13]:
# To remove punctuation, collapse extra spaces and lower-case the text, we define a small function based on
# regular expressions:
import re

def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower()
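
To see what each step does, the same function can be run on a short string. The sample below is shortened from the first row of the dataset; everything else matches the cell above. Note that `\w` also matches the underscore, so the `__label__` prefix survives, while `lower()` lower-cases the label along with the text.

```python
import re

def preprocess(text):
    text = re.sub(r"[^\w\s']", " ", text)  # punctuation -> space (keeps letters, digits, '_' and apostrophes)
    text = re.sub(" +", " ", text)          # collapse runs of spaces into one
    return text.strip().lower()

sample = "__label__Household Art Prints (8.7 X 8.7 inch) - Set of 4"
cleaned = preprocess(sample)
print(cleaned)
```

Because the second pattern matches only the space character, tabs and newlines inside a description are left as they are.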

In [14]:
# Next we apply the function to the 'category_description' column:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__household paper plane design framed w... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__household saf 'floral' framed paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__household saf 'uv textured modern art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__household saf flower print framed pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__household incredible gifts india wood... |


In [15]:
# If we check the first sample again, the text is now clean:
df.category_description[0]

'__label__household paper plane design framed wall hanging motivational office decor art prints 8 7 x 8 7 inch set of 4 painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it this is an special series of paintings which makes your wall very beautiful and gives a royal touch this painting is ready to hang you would be proud to possess this unique painting that is a niche apart we use only the most modern and efficient printing technology on our prints with only the and inks and precision epson roland and hp printers this innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime we print solely with top notch 100 inks to achieve brilliant and true colours due to their high level of uv resistance our prints retain their beautiful colours for many years add colour and style to your living space with this digitally printed painting some are for pleasure and some for eternal bli

In [16]:
# Next we call train_test_split to split the dataset into train and test sets (80/20):
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

In [17]:
# To check train and test shapes:
train.shape, test.shape

((40339, 3), (10085, 3))
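
The two row counts follow from the 80/20 split of the 50424 rows left after `dropna()`. The sketch below reproduces them under the assumption that scikit-learn rounds the test count up and the train count down, which is consistent with the shapes shown above.

```python
import math

n_samples = 50424   # rows left after dropna()
test_size = 0.2

# Assumption: the test count is rounded up and the train count is rounded down.
n_test = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_test)
```

Since the split above is called without `random_state`, the rows that land in each set change from run to run, so the scores further down will vary slightly.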

In [18]:
# Next we write the train and test sets to plain-text files (one example per line), because 'train_supervised()'
# reads its input from a text file.
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)
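
Since `to_csv` is a CSV writer, it may wrap a line in double quotes when the text contains characters that are special to CSV. The pre-processing above already strips commas and double quotes, so this should be rare, but a quick sanity check is cheap. The helper `count_bad_lines` below is a hypothetical sketch, and it runs on a tiny stand-in file so that it works on its own.

```python
import os
import tempfile

def count_bad_lines(path, prefix="__label__"):
    """Return the number of non-empty lines that do not start with the label prefix."""
    bad = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip() and not line.startswith(prefix):
                bad += 1
    return bad

# Stand-in file for illustration; in the notebook, pass "ecommerce.train" instead.
with tempfile.NamedTemporaryFile("w", suffix=".train", delete=False, encoding="utf-8") as tmp:
    tmp.write("__label__books think and grow rich deluxe edition\n")
    tmp.write('"__label__household a line that a csv writer wrapped in quotes"\n')
    demo_path = tmp.name

n_bad = count_bad_lines(demo_path)
print(n_bad)
os.remove(demo_path)
```

A non-zero count on the real files would point to lines whose label fastText cannot recognise.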

In [19]:
# Next we train the model on the train file and evaluate it on the test file.
# In the previous notebook we used the unsupervised method to learn word embeddings; 'train_supervised' is the
# method for classification. It learns word embeddings and a classifier on top of them in a single step.
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10084, 0.9685640618802063, 0.9685640618802063)

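* '10084' is the number of test samples evaluated, and the two '0.9685640...' values are precision and recall at 1 (P@1 and R@1). They are equal because every sample has exactly one label and the model predicts exactly one label per sample.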
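
The numbers returned by `model.test` can be unpacked a little further. The sketch below assumes the default of one predicted label per sample and uses only the values printed above; it recovers the number of correct predictions and shows why precision, recall and F1 coincide in this single-label setting.

```python
n_samples = 10084
precision_at_1 = 0.9685640618802063

# With one prediction per sample and one true label per sample,
# precision and recall share the same numerator and denominator.
n_correct = round(precision_at_1 * n_samples)
recall_at_1 = n_correct / n_samples
f1 = 2 * precision_at_1 * recall_at_1 / (precision_at_1 + recall_at_1)

print(n_correct, recall_at_1, f1)
```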

In [20]:
# Now let's predict the category of a few items:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__electronics',), array([0.99115664]))

In [21]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [22]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000989]))
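
The strings passed to `predict` above are already lower-cased and free of punctuation, matching the training data. For raw product text, it may help to apply the same cleaning first. The helper `predict_category` below is a hypothetical sketch: it takes any object with a fastText-style `predict(text, k=1)` method, cleans the input, and strips the label prefix from the result.

```python
import re

def preprocess(text):
    # Same cleaning as in the training pipeline above.
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip().lower()

def predict_category(model, raw_text, prefix="__label__"):
    """Hypothetical helper: clean raw text, predict, and strip the label prefix."""
    # predict() handles one line at a time, so fold newlines and tabs into spaces first.
    cleaned = " ".join(preprocess(raw_text).split())
    labels, probs = model.predict(cleaned, k=1)
    return labels[0][len(prefix):], float(probs[0])
```

With the model trained above, a call would look like `predict_category(model, "Think and Grow Rich: Deluxe Edition")`.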

In [32]:
# Training the classifier also learns word embeddings, and we can query them directly:
model.get_nearest_neighbors("sony")

[(0.9990784525871277, 'feisty'),
 (0.9990700483322144, 'ralink'),
 (0.9990700483322144, '0oc'),
 (0.9990700483322144, '7601'),
 (0.9990519285202026, 'rivals'),
 (0.9990459084510803, 'gamingphoto'),
 (0.999028742313385, '867'),
 (0.999028742313385, '1167'),
 (0.999011218547821, 'visualize'),
 (0.9990058541297913, 'hommie')]
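
The scores in this output are similarities between word vectors; to our knowledge, `get_nearest_neighbors` ranks words by cosine similarity. The sketch below computes that measure in plain Python on made-up vectors (not values taken from the model) to show that a score near 1.0 means two vectors point in almost the same direction. Scores this close to 1.0 for seemingly unrelated words may indicate that the supervised embeddings are tuned to separate the four categories rather than to capture word meaning.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors for illustration only.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, different length
c = [-3.0, 0.0, 1.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: no shared direction
```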
