In [1]:
%pip install fasttext

Note: you may need to restart the kernel to use updated packages.


## Text Classification Using FastText

In [3]:
import fasttext
import pandas as pd

In [22]:
# The CSV has no header row, so supply the column names explicitly.
df = pd.read_csv("ecommerce_dataset2.csv", names=["category", "description"], header=None)

In [23]:
df.head(5)

Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [24]:
df.columns

Index(['category', 'description'], dtype='object')

In [27]:
df.category.value_counts()

category
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64

In [29]:
df.shape

(50425, 2)

In [32]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [35]:
# fastText splits tokens on whitespace, so a label must not contain spaces.
# Assign the result back instead of calling replace(..., inplace=True) on df.category,
# which acts on a copy and raises a FutureWarning in recent pandas versions.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

In [36]:
df.category.value_counts()

category
Household               19313
Books                   11820
Electronics             10621
Clothing_Accessories     8670
Name: count, dtype: int64

In [37]:
df['category']="__label__"+df['category'].astype(str)
df.head(3)

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...


In [49]:
df['category_description']=df['category']+ " " +df['description']
df.head(4)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__Household SAF Flower Print Framed Pai...


In [50]:
df['category_description'][0]

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and so
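
The string above is the line format fastText's supervised mode reads: a `__label__`-prefixed token followed by the text, one example per line. As a minimal sketch of that format outside pandas, the helper below builds one such line. The name `make_fasttext_line` and the sample inputs are made up for illustration and are not part of the notebook.

```python
import re

LABEL_PREFIX = "__label__"  # fastText's default label prefix

def make_fasttext_line(category: str, description: str) -> str:
    """Build one training line: '__label__<category> <description>'."""
    # A label must be a single whitespace-free token, so join its words with "_".
    label = "_".join(re.findall(r"\w+", category))
    # One example per line: collapse newlines and repeated spaces in the text.
    text = " ".join(description.split())
    return f"{LABEL_PREFIX}{label} {text}"

# Hypothetical inputs, chosen to show both clean-up steps.
print(make_fasttext_line("Clothing & Accessories", "Red cotton\nshirt"))
# -> __label__Clothing_Accessories Red cotton shirt
```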

In [54]:
import re

def preprocess(text):
    # Replace everything except word characters, whitespace and apostrophes with a space.
    # \w also matches "_", which is why the __label__ prefix survives this step.
    text = re.sub(r'[^\w\s\']', ' ', text)
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    return text.strip().lower()  # Trim spaces & convert to lowercase
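
To see what `preprocess` does to a full training line, here is a small self-contained check. The sample string is invented for illustration. Note that the label is lowercased along with the text, so the model ends up learning labels such as `__label__books` rather than `__label__Books`.

```python
import re

def preprocess(text):
    # Same three steps as the cell above.
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation, keep "_" and "'"
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip().lower()

# Hypothetical sample line, not taken from the dataset.
sample = "__label__Books  C++ Primer (5th Edition), Paperback!"
print(preprocess(sample))
# -> __label__books c primer 5th edition paperback
```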


In [55]:
df['category_description']=df['category_description'].map(preprocess)

In [56]:
df.head(4)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...


In [57]:
from sklearn.model_selection import train_test_split
train,test=train_test_split(df,test_size=0.2)

In [58]:
train.shape,test.shape

((40339, 3), (10085, 3))
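
The split above is random on every run and does not take the class imbalance into account. `train_test_split` accepts `random_state` and `stratify` arguments for this. The sketch below illustrates the same idea with only the standard library, so it is a stand-in for what stratification does, not scikit-learn's implementation. The function name and the toy rows are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_size=0.2, seed=42):
    """Split rows into (train, test), keeping each label's share roughly equal."""
    rng = random.Random(seed)           # fixed seed: same split on every run
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(by_label):      # sorted for a deterministic order
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy data: 10 "books" rows and 20 "household" rows.
rows = [("__label__books", f"book {i}") for i in range(10)]
rows += [("__label__household", f"item {i}") for i in range(20)]
train_rows, test_rows = stratified_split(rows, label_of=lambda r: r[0])
print(len(train_rows), len(test_rows))  # -> 24 6
```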

In [59]:
train.head(3)

Unnamed: 0,category,description,category_description
20911,__label__Books,Exploring C Came 1972 and Dennis Ritchie gave ...,__label__books exploring c came 1972 and denni...
13536,__label__Household,Grizzly Stainless Steel Pasta Roller with Tagl...,__label__household grizzly stainless steel pas...
20404,__label__Books,UGC/NET/JRF/SET English Literature (Paper-II A...,__label__books ugc net jrf set english literat...


In [69]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)


In [78]:
model = fasttext.train_supervised(input="ecommerce.train")


Read 4M words
Number of words:  77226
Number of labels: 4
Progress: 100.0% words/sec/thread: 2668888 lr:  0.000000 avg.loss:  0.187566 ETA:   0h 0m 0s


In [84]:
test["category_description"].to_csv("ecommerce.test", index=False, header=False, encoding="utf-8")
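
`to_csv` applies CSV quoting to any field that contains a comma, a double quote or a newline. The `preprocess` step already removes those characters, so the files written here come out as plain lines. As a sketch that does not rely on that, the hypothetical helper below writes the lines directly. The function name, file name and sample lines are assumptions.

```python
import os
import tempfile
from pathlib import Path

def write_fasttext_file(lines, path):
    """Write one example per line, UTF-8, with no CSV quoting or header."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            # A newline inside an example would split it into two examples.
            f.write(" ".join(line.split()) + "\n")
    return Path(path)

# Demo with made-up lines in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_fasttext_file(
    ["__label__books exploring c", '__label__household 12" steel\npan'],
    os.path.join(demo_dir, "demo.train"),
)
print(demo_path.read_text(encoding="utf-8"))
```

With the DataFrames in this notebook, the call would look like `write_fasttext_file(test["category_description"], "ecommerce.test")`.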


In [87]:
model.test("ecommerce.test")


(10085, 0.9688646504709966, 0.9688646504709966)
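
The tuple returned by `model.test` is (number of examples, precision at 1, recall at 1). With exactly one label per example and one prediction per example, the two values coincide, which is why they are identical here. A minimal sketch for turning the tuple into something more readable follows. The helper name is an assumption, and the correct count is derived by rounding, not reported by fastText.

```python
def summarize_test_result(result):
    """Unpack the (n_examples, precision, recall) tuple and add derived values."""
    n, precision, recall = result
    denom = precision + recall
    f1 = 0.0 if denom == 0 else 2 * precision * recall / denom
    return {
        "examples": n,
        "precision@1": precision,
        "recall@1": recall,
        "f1": f1,
        # With one prediction per example, precision * n is the number correct.
        "correct": round(n * precision),
    }

# The tuple printed by the cell above.
summary = summarize_test_result((10085, 0.9688646504709966, 0.9688646504709966))
print(summary)
```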

In [88]:
model.predict("kalyan")

(('__label__books',), array([0.9999994]))

In [99]:
model.predict("pc cpu 500 gb")

(('__label__electronics',), array([0.66082937]))

In [100]:
model.predict("red shirt")

(('__label__clothing_accessories',), array([0.99909687]))
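
The queries above are already lowercase and free of punctuation. Since the model was trained on preprocessed text, raw input should probably go through the same cleaning before `predict`. The wrapper below is a sketch of that. `classify` is a hypothetical helper, and it assumes only that `model.predict(text, k=...)` returns a `(labels, probabilities)` pair as shown in the outputs above.

```python
import re

def preprocess(text):
    # Same cleaning that was applied to the training data.
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

def classify(model, text, k=1):
    """Return [(label_without_prefix, probability), ...] for the top-k labels."""
    labels, probs = model.predict(preprocess(text), k=k)
    return [
        (label.replace("__label__", "", 1), float(p))
        for label, p in zip(labels, probs)
    ]

# Usage with the trained model from this notebook (not run here):
# classify(model, "PC, CPU 500 GB!")
# classify(model, "Red Shirt", k=2)   # top two labels with their probabilities
```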

In [114]:
model.get_nearest_neighbors("love")

[(0.9963963627815247, 'neurosensory'),
 (0.9963963627815247, 'neuropop'),
 (0.9963963627815247, 'warick'),
 (0.9962460398674011, 'mips'),
 (0.9959198236465454, 'dietetics'),
 (0.9958067536354065, 'projections'),
 (0.9956098794937134, 'blufury'),
 (0.9955282807350159, 'ri'),
 (0.9954931139945984, 'mush'),
 (0.9953666925430298, 'craziest')]
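
The first number in each pair is a cosine similarity between word vectors. The neighbours of "love" look unrelated, and all score around 0.996. A likely reason is that a supervised model learns its word vectors only to separate the four categories, not to capture word meaning, so many vectors end up pointing in nearly the same direction. The sketch below shows how such a score is computed. It is a plain-Python illustration, not fastText's own code, and the vectors are toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors: same direction gives 1.0, orthogonal gives 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))

# With the trained model, two word vectors could be compared like this (not run here):
# cosine_similarity(model.get_word_vector("love"), model.get_word_vector("mush"))
```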