# Text Classifying Products with the fastText Library

In [1]:
import random
import numpy
import fasttext
import csv
import re
import pandas as pd
from sklearn.model_selection import train_test_split

## Import and Explore Data

'train_en.csv' is a file given to me by a colleague. It contains 500,000 rows of raw product names, each with its product category. I read it into a dataframe called data_test. *I probably should've chosen a different name, because the name suggests a test split, when this is actually the entire data set I'm going to use for this model.*

In [2]:
data_test = pd.read_csv('train_en.csv')
print("any null values:", data_test.isnull().values.any())

any null values: False


In [7]:
len(data_test)

500000

In [3]:
data_test.head(5)

Unnamed: 0,product_title,category
0,Recollections Color Splash Clear Stamps & Stencil,Hobbies & Stationery
1,"soap,lotion scrub set 400",Health & Personal Care
2,Spigen Galaxy S10e Case Tough Armor Gunmetal,Mobile Accessories
3,Acrylic Lanalon Bright Red,Hobbies & Stationery
4,303 FLAT SHEET/Blanket 100% cotton,Home & Living


In [9]:
data_test.iloc[233523,0]

'COD Saudi Gold 18K Bracelet 7.5\\",Women Accessories\n1587702033,Xiaomi Redmi 5plus Beatle Series Case,Mobile Accessories\n2016255207,COD! ORIGINAL FOSSIL ES-9075 WATCH FOR WOMEN-BOUGHT IN US!,Women Accessories\n366936664,DM TALLGEESE 1 MG 1/100,Toys'

Some rows of the data (such as the one above) contain multiple products and their categories merged into the product_title column, and some contain emojis and other non-ASCII characters.

## Cleaning Data and Feature Engineering

In [10]:
#function to get rid of emojis and strip trailing white spaces
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii').strip()
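
To see what this function keeps and drops, here is a small check on made-up strings (the sample titles are illustrative, not rows from the data set). Note that every non-ASCII character is removed, not only emojis, so accented letters disappear as well.

```python
def deEmojify(inputString):
    # Drop every character that cannot be encoded as ASCII, then trim whitespace.
    return inputString.encode('ascii', 'ignore').decode('ascii').strip()

# Illustrative inputs, not rows from train_en.csv.
samples = ["Cute bag \U0001F45C  ", "Caf\u00e9 mug", "plain title"]
cleaned = [deEmojify(s) for s in samples]
print(cleaned)
```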

In [11]:
#apply the function to every value in both columns
data_test["product_title"] = data_test["product_title"].apply(deEmojify)
data_test["category"] = data_test["category"].apply(deEmojify)

In [35]:
# test
# print(data_test.iloc[188, :])
# print(data_test.iloc[1763,:])
# print(data_test.iloc[1754,:])
# print(data_test.iloc[213128,:])
# print(data_test.loc[data_test['product_title'] == 'Blue water XY02 cool mist humidifier 2 in 1',:])
# print(data_test.iloc[211078,:])
print(data_test.iloc[233523,:])

product_title    COD Saudi Gold 18K Bracelet 7.5\",Women Access...
category                                     Games & Collectibles"
Name: 233523, dtype: object


In [7]:
data_test.tail()
data_test.info()

Unnamed: 0,product_title,category
499995,rocker arm roller racing mio,Motors
499996,Secosana (preloved bag),Women's Bags
499997,jag bag,Women's Bags
499998,Baby wipes 15 sheets (Alcohol and Paraben Free...,Babies & Kids
499999,PRE-LOVED ORIGINAL GREEN FINO BAG,Women's Bags


Now the data contains only ASCII characters.

### Create New Column: Give Each Category a Numerical Index

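For the fastText model to recognize labels (the categories, in this case) and distinguish them from the words that make up the product title (explained further below), it's much easier to keep each label short and free of spaces. Category names such as "Home & Living" contain spaces, which fastText would split into separate tokens, so I'm giving each category a numerical index.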

In [11]:
#map categories into numbers by getting unique categories
unique_categories = data_test.category.unique()

In [12]:
#make dictionary of the category mappings
dict_map = {category: index for index, category in enumerate(unique_categories)}

In [13]:
for x in dict_map:
    print(x, ":", dict_map[x])

Hobbies & Stationery : 0
Health & Personal Care : 1
Mobile Accessories : 2
Home & Living : 3
Women's Apparel : 4
Women Shoes : 5
Babies & Kids : 6
Women Accessories : 7
Toys, Games & Collectibles : 8
Groceries : 9
Motors : 10
Makeup & Fragrances : 11
Women's Bags : 12
Men's Apparel : 13
Pet Care : 14
Men's Bags & Accessories : 15
Sports & Travel : 16
Men Shoes : 17
Gaming : 18
Laptops & Computers : 19
Home Entertainment : 20
Mobiles & Gadgets : 21
Cameras : 22
Home Appliances : 23
Consumer Electronics : 24
Games & Collectibles" : 25
Digital Goods & Vouchers : 26


In [14]:
#new dataframe with all the data plus a numerical category column
df_index = data_test.copy()
df_index["category_index"] = data_test.category.map(dict_map)
df_index.head(10)

Unnamed: 0,product_title,category,category_index
0,Recollections Color Splash Clear Stamps & Stencil,Hobbies & Stationery,0
1,"soap,lotion scrub set 400",Health & Personal Care,1
2,Spigen Galaxy S10e Case Tough Armor Gunmetal,Mobile Accessories,2
3,Acrylic Lanalon Bright Red,Hobbies & Stationery,0
4,303 FLAT SHEET/Blanket 100% cotton,Home & Living,3
5,Korean Set,Women's Apparel,4
6,High-grade keychain,Home & Living,3
7,CODChanel Black/White Sneaker Shoes For Women,Women Shoes,5
8,Cat eyeglasses,Women's Apparel,4
9,Baby shoes by Stride Rite (BRAND NEW) (3-6 mon...,Babies & Kids,6


Now that we have the category as an index, we can get rid of the category column with the longer category names.

In [15]:
#new dataframe with only product_title and category_index
df = df_index[['product_title', 'category_index']]
df.head()

Unnamed: 0,product_title,category_index
0,Recollections Color Splash Clear Stamps & Stencil,0
1,"soap,lotion scrub set 400",1
2,Spigen Galaxy S10e Case Tough Armor Gunmetal,2
3,Acrylic Lanalon Bright Red,0
4,303 FLAT SHEET/Blanket 100% cotton,3


### Formatting the Labels
The input to fastText is a text file in which each line is a string containing the text together with its label(s). Every label starts with the "__label__" prefix, which is how fastText tells a label apart from an ordinary word. The model is then trained to predict the labels given the words in the document. So now I will add \_\_label\_\_ in front of each category index so that fastText reads it as a label, and then combine the label and the product title into a single string.
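
As a minimal sketch of the target format, each example ends up as one line: the label token followed by the title. The helper name `to_fasttext_line` is hypothetical and not part of fastText. Because fastText reads one example per line, the sketch also collapses embedded whitespace (including newlines) into single spaces; that extra step is an assumption about what is desirable here, not something the cells in this notebook do.

```python
def to_fasttext_line(category_index, product_title):
    # Hypothetical helper: build "__label__<index> <title>" on a single line.
    # split() followed by join collapses runs of whitespace, including newlines.
    title = " ".join(str(product_title).split())
    return "__label__" + str(category_index) + " " + title

line = to_fasttext_line(2, "Spigen Galaxy S10e Case\nTough Armor Gunmetal")
print(line)
```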

In [16]:
#add __label__ in front of the labels for fastText to read
#df.iloc[:,1] #select category column
df_labeled = df.copy()
df_labeled['category_index'] = '__label__' + df_labeled['category_index'].astype(str)
df_labeled.head()

Unnamed: 0,product_title,category_index
0,Recollections Color Splash Clear Stamps & Stencil,__label__0
1,"soap,lotion scrub set 400",__label__1
2,Spigen Galaxy S10e Case Tough Armor Gunmetal,__label__2
3,Acrylic Lanalon Bright Red,__label__0
4,303 FLAT SHEET/Blanket 100% cotton,__label__3


In [17]:
#put category and product_title together
#format I want: __label__ product
category_prod = df_labeled['category_index'] + " " + df_labeled['product_title']
print(category_prod)

0         __label__0 Recollections Color Splash Clear St...
1                      __label__1 soap,lotion scrub set 400
2         __label__2 Spigen Galaxy S10e Case Tough Armor...
3                     __label__0 Acrylic Lanalon Bright Red
4             __label__3 303 FLAT SHEET/Blanket 100% cotton
                                ...                        
499995             __label__10 rocker arm roller racing mio
499996                  __label__12 Secosana (preloved bag)
499997                                  __label__12 jag bag
499998    __label__6 Baby wipes 15 sheets (Alcohol and P...
499999        __label__12 PRE-LOVED ORIGINAL GREEN FINO BAG
Length: 500000, dtype: object


In [18]:
type(category_prod)

pandas.core.series.Series

As seen in the section where I explored the data, the data isn't clean: some rows contain multiple products and categories jumbled together in one string. Looking at the data, I noticed that these rows all contain a substring of 10 digits (what looks like an ID from one of the merged rows). Since we have a pretty large data set to work with, I'm going to drop these rows rather than try to repair them.
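
Before filtering, it helps to check what a 10-digit pattern actually flags. The strings below are made-up examples, not rows from the file. The last one illustrates a caveat: a legitimate title that happens to contain a 10-digit number would also match, so this filter may drop a few valid rows.

```python
import re

# Ten consecutive digits, the signature seen in the merged rows.
ID_PATTERN = re.compile(r"\d{10}")

examples = [
    "__label__2 Spigen Galaxy S10e Case Tough Armor Gunmetal",
    "__label__8 Bracelet,Women Accessories\n1587702033,Xiaomi Case",
    "__label__2 Load card call 0917123456",  # valid-looking title, still flagged
]
flags = [bool(ID_PATTERN.search(text)) for text in examples]
print(flags)
```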

In [131]:
## GET RID OF LINES THAT CONTAIN A 10-DIGIT NUMBER (MERGED ROWS)
print(len(category_prod))
#flag every line that contains 10 consecutive digits
has_id = category_prod.str.contains(r'\d{10}')
count = int(has_id.sum())
#keep only the lines that were not flagged
category_prod = category_prod[~has_id]
print(count)
print(len(category_prod))

500000
454
499546


After cleaning out the data, we have 499546 lines to work with.

## Train Test Split

In [132]:
#split into train and test
train, test = train_test_split(category_prod, test_size=0.3, train_size=0.7, random_state=42)

In [134]:
#write test and train into files
#"w" overwrites the file, so re-running the cell doesn't append duplicate lines
with open("train.txt", "w") as f_train:
    for line in train:
        f_train.write(line + "\n")

with open("test.txt", "w") as f_test:
    for line in test:
        f_test.write(line + "\n")


I used the train_supervised function rather than train_unsupervised because the products are already labeled with their categories, which makes this a supervised classification problem.

In [12]:
#train model
model = fasttext.train_supervised(input="train.txt")

In [13]:
model.predict("Sterling Silver Ladies Dangling Earrings DE24TB")

(('__label__7',), array([1.0000087]))

In [14]:
#test using the model
model.test("test.txt") #(number of examples, precision at 1, recall at 1)

(149864, 0.8053435114503816, 0.8053435114503816)
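
The two scores are identical because every line carries exactly one label and the model is asked for exactly one prediction per line. Precision divides the number of correct predictions by the number of predictions made, recall divides it by the number of true labels, and here both denominators equal the number of test lines. A small worked sketch with made-up labels (the function name is hypothetical):

```python
def precision_recall_at_1(true_labels, predicted_labels):
    # One true label and one predicted label per example.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # correct / predictions made
    recall = correct / len(true_labels)          # correct / true labels
    return precision, recall

# Illustrative labels only, not model output.
truth = ["__label__7", "__label__4", "__label__2", "__label__0"]
preds = ["__label__7", "__label__16", "__label__2", "__label__0"]
result = precision_recall_at_1(truth, preds)
print(result)
```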

In [15]:
#try a random phrase to see whether the model predicts a label/category that makes sense
model.predict("track and field spikes")

(('__label__4',), array([0.77857149]))

Label 4 is Women's Apparel... I'm not sure this completely fits into that category.
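
To avoid looking labels up by hand, a prediction can be decoded with an inverse of the category mapping. A minimal sketch: the dictionary below holds only a few entries copied from the mapping printed earlier, and `decode_label` is a hypothetical helper, not a fastText function.

```python
# A few entries copied from the category mapping printed earlier.
dict_map = {
    "Women's Apparel": 4,
    "Women Accessories": 7,
    "Sports & Travel": 16,
}
# Invert it: index -> category name.
index_to_category = {index: name for name, index in dict_map.items()}

def decode_label(label):
    # Hypothetical helper: "__label__4" -> "Women's Apparel".
    index = int(label.replace("__label__", "", 1))
    return index_to_category[index]

print(decode_label("__label__4"))
```

With a trained model and the full mapping, `decode_label(model.predict(text)[0][0])` would give the category name directly. `model.predict` also accepts a `k` argument, which returns several candidate labels and can be useful for ambiguous phrases like this one.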

## Tuning the Model

I'm going to try to improve the model's performance by adjusting the wordNgrams parameter, which defaults to 1 (unigrams only). Here I use word bigrams (wordNgrams = 2) as well: in addition to single words, the model also sees pairs of consecutive tokens, which can help in classification problems where word order matters, such as sentiment analysis.
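
To make the idea concrete, here is a rough illustration of the features a title yields when bigrams are included. This is only a sketch: it splits on whitespace and lists the n-grams as strings, whereas fastText builds and hashes them internally, so the function below is not how the library itself is implemented.

```python
def word_ngrams(text, max_n=2):
    # List all word n-grams of length 1..max_n, in order of appearance.
    tokens = text.split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

features = word_ngrams("track and field spikes")
print(features)
```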

In [185]:
#wordNgrams=2
model2 = fasttext.train_supervised(input="train.txt", wordNgrams=2)

In [182]:
model2.predict("Sterling Silver Ladies Dangling Earrings DE24TB")

(('__label__7',), array([1.00000918]))

In [186]:
#test
model2.test("test.txt") #(n, precision, recall)

(149864, 0.8101345219665831, 0.8101345219665831)

The precision increased slightly, from 80.53% to 81.01%, after adjusting the wordNgrams parameter. I also tried adjusting other parameters, such as the number of epochs and the learning rate, and scaling things up with hierarchical softmax, but they didn't help much, so I kept the model as it is. I think a precision of 81.01% is pretty good for my first time working with a classification model, though it can definitely be improved.