# Product Categorisation

The correct categorisation of products is a very important step for online 
shops. This step might look straightforward, but it can easily be a nightmare. 
This could be due to any number of difficulties, including: 1) the number of 
available categories can be huge, 2) the number of available categories can 
change constantly, and 3) new products may be added daily.

Manual categorisation of products can be a tedious and labour-intensive task.
Therefore, it makes sense to automate this process. One method is to use 
machine learning.

In this task, I will implement a product categoriser based on the following 
explanation.


## Approach Description

As shown in the table below, there are three levels of categories. Level 1 is 
the highest and most generic and level 3 is the most specific. The idea is to 
create multiple models over a number of iterations. 
In the first pass we create one model using the features and the Level 1 
column as the class variable. 
In the next pass (to predict level 2), we create one separate model for each 
unique category in Level 1.
Similarly, for Level 3, the number of models we create is the same as the 
number of distinct combinations of the categories in Level 1 and 
Level 2.

<img src="Product_categorisation.png">

The total number of models we need to create depends on the number of 
categories. For example, if the number of unique categories was 12, 22 and 
31 in Levels 1, 2 and 3 respectively, then we would create one model in the 
first pass, 12 models in the second pass and up to 264 (12 \* 22) models in 
the third pass, one for each combination of Level 1 and Level 2. This gives 
an upper bound of 277 models. In practice, only combinations that occur in 
the data need a model (in a mock run of a model solution to this task, the 
number of models was between 50 and 60).
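
The arithmetic above can be checked with a short sketch. The category counts are the illustrative figures from the text, and `observed_pairs` is a made-up toy sample rather than the real dataset; it only shows why the number of models actually trained can fall well below the upper bound.

```python
# Upper bound on the number of models, using the example figures from the text
n_level1, n_level2 = 12, 22

first_pass = 1                      # one model predicts Level 1
second_pass = n_level1              # one Level 2 model per Level 1 category
third_pass = n_level1 * n_level2    # one Level 3 model per (Level 1, Level 2) pair

upper_bound = first_pass + second_pass + third_pass
print(upper_bound)

# Hypothetical toy data: only pairs that actually occur need a model
observed_pairs = {("A", "a1"), ("A", "a2"), ("B", "b1")}
level1_seen = {l1 for l1, _ in observed_pairs}
models_needed = 1 + len(level1_seen) + len(observed_pairs)
print(models_needed)
```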

### Import Libraries

In [1]:
import numpy as np
import pandas as pd

### Load and explore the data (4 marks)

In [2]:
# Import the data
df=pd.read_csv("./product-cat-dataset.csv")
# Show the first 5 rows of the dataset
df.head()

Unnamed: 0,Description,Level_1,Level_2,Level_3
0,gerb cap help keep littl on head cov warm day ...,09BF5150,C7E19,D06E
1,newborn inf toddl boy hoody jacket oshkosh b g...,2CEC27F1,ADAD6,98CF
2,tut ballet anym leap foxy fash ruffl tul toddl...,09BF5150,C7E19,D06E
3,newborn inf toddl boy hoody jacket oshkosh b g...,2CEC27F1,ADAD6,98CF
4,easy keep feel warm cozy inf toddl girl hoody ...,2CEC27F1,ADAD6,98CF


In [3]:
# Check the number of rows and columns
df.shape

(10649, 4)

In [4]:
#checking if there are any null entries
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10649 entries, 0 to 10648
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  10637 non-null  object
 1   Level_1      10649 non-null  object
 2   Level_2      10649 non-null  object
 3   Level_3      10649 non-null  object
dtypes: object(4)
memory usage: 332.9+ KB


As we can see, the Description column has fewer non-null values (10637) than the other columns (10649), so it contains missing entries.

In [5]:
#describing data and checking unique values
df.describe()

Unnamed: 0,Description,Level_1,Level_2,Level_3
count,10637,10649,10649,10649
unique,9677,15,39,43
top,glory gorg col fing complet outfit express moo...,B092BA29,2D5A3,28A7
freq,24,900,797,797


Level_1 has 15 unique values, Level_2 has 39 and Level_3 has 43.

### Deal with Missing Data (4 marks)

In [6]:
# Count the null entries in each column
missing_values_count=df.isnull().sum()
missing_values_count

Description    12
Level_1         0
Level_2         0
Level_3         0
dtype: int64

Description has a total of 12 missing values

In [7]:
# Percentage of missing cells across the whole dataset
total_cells=np.prod(df.shape)
total_missing=missing_values_count.sum()

percent_missing=(total_missing/total_cells)*100
print(f"Percentage of missing cells in the dataset: {round(percent_missing,2)}%")


Percentage of missing cells in the dataset: 0.03%


Most probably these entries are missing because they were not recorded rather than because they do not exist. Since the amount of missing data is negligible, we can drop the affected rows.
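
As a quick check on the figure above, the missing share can be computed in two ways: relative to every cell in the table, or relative to the rows of the Description column alone. The sketch below uses only the counts reported earlier (12 missing values, 10649 rows, 4 columns); the variable names are illustrative.

```python
n_rows, n_cols, n_missing = 10649, 4, 12

# Share of all cells in the table that are missing
pct_of_cells = n_missing / (n_rows * n_cols) * 100

# Share of Description entries (one per row) that are missing
pct_of_rows = n_missing / n_rows * 100

print(f"{pct_of_cells:.2f}% of cells, {pct_of_rows:.2f}% of Description rows")
```

Either way the share is small, so dropping the affected rows loses very little data.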

In [8]:
# Dropping missing rows
df_new=df.dropna()
print(f"Rows in original dataset: {df.shape[0]}\n")
print(f"Rows with na's dropped: {df_new.shape[0]} \n")



Rows in original dataset: 10649

Rows with na's dropped: 10637 



### Drop Classes where the number of instances is < 10 (4 marks)

In [9]:
# Apply to Level_1 
n_instances_l1=(df_new["Level_1"].value_counts(ascending=True)<10).sum()
print(f"the number of classes with instances <10 is : {n_instances_l1}")


the number of classes with instances <10 is : 0


In [10]:
# Apply to Level_2
n_instances_l2=(df_new["Level_2"].value_counts(ascending=True)<10).sum()
print(f"the number of classes with instances <10 is : {n_instances_l2}")

the number of classes with instances <10 is : 3


In [11]:
print(f"shape before {df_new.shape}")
df_new=df_new[df_new.groupby('Level_2')["Level_2"].transform('count')>=10]
print(f"shape after drop {df_new.shape}")


shape before (10637, 4)
shape after drop (10629, 4)


In [12]:
# Apply to Level_3
n_instances_l3=(df_new["Level_3"].value_counts(ascending=True)<10).sum()
print(f"the number of classes with instances <10 is : {n_instances_l3}")



the number of classes with instances <10 is : 2


In [13]:
print(f"shape before {df_new.shape}")
df_new=df_new[df_new.groupby('Level_3')["Level_3"].transform('count')>=10]
print(f"shape after drop {df_new.shape}")

shape before (10629, 4)
shape after drop (10627, 4)
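
The filter used above keeps a row only when its class appears at least 10 times. The same logic in plain Python, on a made-up list of labels and with a lower threshold so the example stays small, might look like this (`labels` and `min_count` are hypothetical names):

```python
from collections import Counter

labels = ["A", "A", "A", "B", "C", "C", "C", "C"]   # toy class column
min_count = 3                                        # the notebook uses 10

# Count how often each class occurs, then keep rows of frequent classes only
counts = Counter(labels)
kept = [label for label in labels if counts[label] >= min_count]

print(kept)
```

Because dropping Level_2 classes removes rows, the Level_3 counts are recomputed afterwards, which is why the notebook filters the levels in sequence.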


### Now let's write a Function to Prepare Text (4 marks)
We will apply it to our DataFrame later on.

This function receives a text string and performs the following:

* Convert text to lower case
* Remove punctuation marks
* Apply stemming using the popular Snowball or Porter Stemmer (optional)
* Apply NGram Tokenisation
* Return the tokenised text as a list of strings
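
As a rough illustration of the steps listed above, here is a library-free sketch that skips the optional stemming step. The function name `simple_ngrams` is hypothetical and is not the notebook's implementation.

```python
import string

def simple_ngrams(text, n=1):
    """Lower-case, strip punctuation and return word n-grams as strings."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Slide a window of n tokens over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = simple_ngrams("Warm, cozy toddler jacket!", n=2)
print(bigrams)
```

With `n=1` the function returns the individual words; a text shorter than `n` words yields an empty list.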

In [14]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to C:\Users\Massimiliano
[nltk_data]     Gargano\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
# Import libraries
from nltk.stem.snowball import SnowballStemmer
from nltk import ngrams
import string

def process_text(text, n = 1):
    """
    Takes in a string of text, then performs the following:
    1. Convert text to lower case and remove all punctuation
    2. Optionally apply stemming
    3. Apply Ngram Tokenisation
    4. Returns the tokenised text as a list
    """
    # 1. Lower-case the text and strip punctuation
    text=text.lower()
    text=text.translate(str.maketrans('', '', string.punctuation))

    # 2. Stemming: stem() is called once on the whole string, so the stemmer
    #    treats it as a single word and in practice only the ending of the
    #    final word is stemmed
    en = SnowballStemmer('english')
    text=en.stem(text)

    # 3. Build n-grams from the whitespace-separated tokens
    grams = ngrams(text.split(), n)

    # 4. Join each n-gram tuple back into a single string
    tokenised=[" ".join(gram) for gram in grams]

    return tokenised

In [16]:
# Here is an example function call
process_text("Here we're testing the process_text function, results are as follows:", n = 3)

['here were testing',
 'were testing the',
 'testing the processtext',
 'the processtext function',
 'processtext function results',
 'function results are',
 'results are as',
 'are as follow']

### Now let's apply TF-IDF to extract features from plain text (10 marks)

In [17]:
# Here you apply the process_text function to the Description column of the data
# Then you pass the results to the bag-of-words transformer

# Create the corpus: tokenise every description
corpus=[process_text(i) for i in df_new.Description]

# Join the tokens back into one string per document, as CountVectorizer expects
corpus=[" ".join(subset) for subset in corpus]

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
#initiate CountVectorizer()
vectorizer=CountVectorizer()
vectorizer


CountVectorizer()

In [19]:
#let's generate the word counts matrix
X=vectorizer.fit_transform(corpus)
X.shape

(10627, 16635)

The result is the bag-of-words count matrix for the entire corpus: a large, sparse matrix with one row per product description (10627) and one column per vocabulary term (16635). Next, we convert these counts into TF-IDF weights.

In [20]:
# After that you pass the result of the previous step to sklearn's TfidfTransformer
# which will convert them into a feature matrix
# See here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
from sklearn.feature_extraction.text import TfidfTransformer
#initiate TfidfTransformer()
transformer = TfidfTransformer()
transformer


TfidfTransformer()

In [21]:
tfidf = transformer.fit_transform(X)
tfidf


<10627x16635 sparse matrix of type '<class 'numpy.float64'>'
	with 298015 stored elements in Compressed Sparse Row format>
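
To make the weighting concrete, the sketch below recomputes TF-IDF by hand for a tiny made-up corpus. It assumes scikit-learn's default settings for `TfidfTransformer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the documents and the helper name `tfidf_rows` are illustrative only.

```python
import math
from collections import Counter

docs = ["toddler jacket warm", "toddler hat", "warm warm blanket"]  # toy corpus

def tfidf_rows(documents):
    tokenised = [doc.split() for doc in documents]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    df = {t: sum(t in doc for doc in tokenised) for t in vocab}
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalise the row so every document has unit length
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "jacket" appears in only one document of the corpus and therefore receives a higher weight than "toddler", which appears in two.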

In [22]:
# The resulting matrix is in sparse format, we can transform it into dense
# Code prepared for you so you can see what results look like
text_tfidf = pd.DataFrame(tfidf.toarray())
text_tfidf.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16625,16626,16627,16628,16629,16630,16631,16632,16633,16634
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Now the Data is Ready for Classifier Usage

### Split Data into Train and Test sets (4 marks)

In [23]:
# Train/Test split
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#target
classes=["Level_1","Level_2","Level_3"]
y=df_new[classes]

#features
X=text_tfidf

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

tree=DecisionTreeClassifier(random_state=1)
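
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. A small sketch of the resulting sizes for the 10627 rows prepared above, assuming scikit-learn rounds the test size up and uses the remainder for training:

```python
import math

n_samples = 10627
test_fraction = 0.25   # scikit-learn's default when test_size is not given

n_test = math.ceil(test_fraction * n_samples)   # test size is rounded up
n_train = n_samples - n_test                     # the rest is used for training

print(n_train, n_test)
```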

In [24]:
# You might need to reset index in each dataframe (depends on you how you do things)
# done for you to make it clearer
train_X.reset_index(inplace=True, drop=True)
test_X.reset_index(inplace=True, drop=True)
train_y.reset_index(inplace=True, drop=True)
test_y.reset_index(inplace=True, drop=True)

In [25]:
# You might need to take classes as separate columns (depends on you how you do things)
class1 = train_y['Level_1'].astype(str)
class2 = train_y['Level_2'].astype(str)
class3 = train_y['Level_3'].astype(str)

## Model training for the three levels (8 marks)

In [26]:
# Create and save model for level 1
import pickle
#Fit the model
tree.fit(train_X, class1)

#save model for level 1
with open('level1.pk', 'wb') as cls:
    pickle.dump(tree, cls)

In [27]:
#list of unique values for level 1
level1_uniques=list(df_new.Level_1.unique())


In [28]:
## Create and save models for level 2
# Train and save one level 2 model for each unique value in level 1
n_models_2=[]

for i in level1_uniques:
    # Row labels of the training samples whose level 1 class is i
    x=list(class1[class1==i].index)
    # Fit the model on those rows only
    tree.fit(train_X.loc[x],class2[x])

    # Save the level 2 classifier for this level 1 class
    with open(f"level2_{i}.pk", "wb") as cls:
        pickle.dump(tree,cls)
    n_models_2.append(i)
    



In [29]:
## Create and save models for level 3
# List of unique values for level 2
level2_uniques=list(df_new.Level_2.unique())

# Train and save one level 3 model for each unique value in level 2
n_models_3=[]

for j in level2_uniques:
    # Row labels of the training samples whose level 2 class is j
    x=list(class2[class2==j].index)
    # Fit the model on those rows only
    tree.fit(train_X.loc[x],class3[x])

    # Save the level 3 classifier for this level 2 class
    with open(f"level3_{j}.pk", "wb") as cls:
        pickle.dump(tree,cls)
    n_models_3.append(j)
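
The per-parent training loop can be pictured with a small library-free sketch. It groups toy training rows by their parent category so that each child model only sees its own slice; the data and the name `rows_by_parent` are hypothetical.

```python
from collections import defaultdict

# Toy (Level_1, Level_2) labels for six training rows
train_labels = [("A", "a1"), ("A", "a2"), ("B", "b1"),
                ("A", "a1"), ("B", "b1"), ("B", "b2")]

# Map each parent (Level_1) class to the row positions that belong to it
rows_by_parent = defaultdict(list)
for position, (level1, _level2) in enumerate(train_labels):
    rows_by_parent[level1].append(position)

# Each parent gets its own model, trained on its rows and child classes only
for parent, positions in rows_by_parent.items():
    targets = sorted({train_labels[p][1] for p in positions})
    print(f"model level2_{parent}: rows {positions}, classes {targets}")
```

A model trained this way can only predict child classes it saw under that parent, so a Level 1 mistake at prediction time cannot be corrected further down the hierarchy.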
        
        


## Predict the test set (8 marks)

In [30]:
# Creating an empty Dataframe with column names only (depends on you how you do things)
results = pd.DataFrame(columns=['Level1_Pred', 'Level2_Pred', 'Level3_Pred'])

## Here we reload the saved models and use them to predict the levels
# load model for level 1 (done for you)
with open('level1.pk', 'rb') as nb:
    model = pickle.load(nb)


## loop through the test data, predict level 1, then based on that predict level 2
## and based on level 2 predict level 3 (you need to load saved models accordingly)

#Level 1 prediction
l1_pred=model.predict(test_X)


    

In [31]:
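# Level 2 predictions
l2_pred=[]
for i in range(len(test_X)):
    # Load the level 2 model that matches the predicted level 1 class
    with open(f'level2_{l1_pred[i]}.pk', 'rb') as nb:
        model = pickle.load(nb)

    # Predict only the current row rather than the whole test set
    l2_pred.append(model.predict(test_X.iloc[[i]])[0])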

    

In [32]:
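# Level 3 predictions
l3_pred=[]

for i in range(len(test_X)):
    # Load the level 3 model that matches the predicted level 2 class
    with open(f'level3_{l2_pred[i]}.pk', 'rb') as nb:
        model = pickle.load(nb)

    # Predict only the current row rather than the whole test set
    l3_pred.append(model.predict(test_X.iloc[[i]])[0])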
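
One way to avoid reloading a model for every row is to group the test rows by their predicted parent class and call each child model once per group. The sketch below shows the idea with stand-in models (plain functions in a dictionary named `child_models`, both hypothetical) instead of the pickled classifiers.

```python
from collections import defaultdict

# Stand-in "models": each maps a list of feature rows to predicted labels
child_models = {
    "A": lambda rows: ["a1" if r[0] > 0 else "a2" for r in rows],
    "B": lambda rows: ["b1" for _ in rows],
}

test_rows = [[1.0], [-1.0], [0.5], [2.0]]
parent_pred = ["A", "A", "B", "A"]        # predictions from the previous level

# Collect the row positions routed to each parent class
groups = defaultdict(list)
for position, parent in enumerate(parent_pred):
    groups[parent].append(position)

# Predict each group in one call, then write results back in original order
child_pred = [None] * len(test_rows)
for parent, positions in groups.items():
    preds = child_models[parent]([test_rows[p] for p in positions])
    for p, label in zip(positions, preds):
        child_pred[p] = label

print(child_pred)
```

With real classifiers, each pickled model would be loaded once per group and its `predict` called on the matching slice of the test set.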

In [33]:
results["Level1_Pred"]=l1_pred
results["Level2_Pred"]=l2_pred
results["Level3_Pred"]=l3_pred

# Check the predictions dataset
results

Unnamed: 0,Level1_Pred,Level2_Pred,Level3_Pred
0,2CEC27F1,BAE8A,2ABA
1,90A8B052,C719A,A0E2
2,4513C920,E69F5,DDD5
3,AAC8EE56,9B69F,80C4
4,B092BA29,375FE,1F61
...,...,...,...
2652,EFEF723B,02FA0,078B
2653,96F95EEC,36080,C563
2654,AAC8EE56,914A1,D97D
2655,014303D1,7AED7,6539


## Compute Accuracy on each level (4 marks)
Now that you have the predictions for each level (in the test data) as well as the actual levels, you can compute the accuracy.

In [34]:
from sklearn.metrics import accuracy_score, classification_report

In [35]:
# Level 1 accuracy
print(accuracy_score(test_y['Level_1'], results["Level1_Pred"]))


0.8182160331200602


In [36]:
# Level 2 accuracy
print(accuracy_score(test_y['Level_2'], results["Level2_Pred"]))


0.7395558901016184


In [37]:
# Level 3 accuracy
print(accuracy_score(test_y['Level_3'], results["Level3_Pred"]))


0.7278885961610839
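
Accuracy is simply the fraction of rows where the prediction matches the true label. Because the levels are predicted in a cascade, it can also be informative to look at how often all three levels are correct at once. A plain-Python sketch on made-up labels (all names and values here are illustrative, not results from the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy (Level_1, Level_2, Level_3) labels for four rows
true_rows = [("A", "a1", "x"), ("A", "a2", "y"), ("B", "b1", "z"), ("B", "b1", "z")]
pred_rows = [("A", "a1", "x"), ("A", "a1", "x"), ("B", "b1", "z"), ("A", "a1", "x")]

# Per-level accuracy: compare one column of the labels at a time
level_acc = []
for level in range(3):
    acc = accuracy([r[level] for r in true_rows], [r[level] for r in pred_rows])
    level_acc.append(acc)
    print(f"Level {level + 1} accuracy: {acc:.2f}")

# Exact-match accuracy: all three levels must be right for a row to count
exact_acc = accuracy(true_rows, pred_rows)
print(f"All levels correct: {exact_acc:.2f}")
```

The exact-match figure can never exceed the lowest per-level accuracy, which mirrors how errors at Level 1 propagate to the levels below it.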


## Well done!