# MACHINE LEARNING IN PRODUCT CATEGORIZATION:
## APPLICATION OF A SUPERVISED LEARNING MODEL IN A REAL E-COMMERCE DATASET

This project shows how to implement a simple and fast solution to the
product categorization problem using a supervised machine learning model.

## 1 DATA EXPLORATION
First, let's take a look at our original data: a Walmart dataset with more than 2.6 million products.

### 1.1 SAMPLE

In [5]:
# coding=utf-8

import pandas as pd

# 1.2 GB
INPUT_FILE = 'dataset/zxpd_201712121136_12011_38441474.csv'

# read input file
data = pd.read_csv(INPUT_FILE, sep=';', encoding='utf-8',names = ['programId',
                                                                   'zupid',
                                                                   'name',
                                                                   'desc',
                                                                   'price',
                                                                   'priceOld',
                                                                   'brand',
                                                                   'date',
                                                                   'medium',
                                                                   'large',
                                                                   'link',
                                                                   'path',
                                                                   'main',
                                                                   'sub',
                                                                   'third',
                                                                   'ean',
                                                                   'small',
                                                                   'available',
                                                                   'img1',
                                                                   'gender'])

# print a product
print(data.loc[1919944])

programId                                                12011
zupid                         6267bf6a0ebb5fb08b708bbc4954c45c
name         Smart TV Sony KDL55W955B LED 55' 3D Full HD Sm...
desc                                                       NaN
price                                                     4999
priceOld                                                   NaN
brand                                                     Sony
date                                       01/12/2017 11:03:00
medium       https://static.wmobjects.com.br/imgres/arquivo...
large                                                      NaN
link         http://ad.zanox.com/ppc/?38441474C295166823&UL...
path                              Eletrônicos / TVs / Smart TV
main                                                       NaN
sub                                                        NaN
third                                                      NaN
ean                                                4.90

### 1.2 SIZE

In [6]:
import os

# get metadata
statinfo = os.stat(INPUT_FILE)

# print file size, in GB
print(format(statinfo.st_size / pow(1024,3),'.2') + ' GB')

# print number of products (rows)
print(format(len(data),',') + ' products')

# free memory
del data

1.2 GB
2,625,975 products
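
Loading the whole 1.2 GB file only to count its rows is memory-hungry. As a hedged alternative, the sketch below counts rows chunk by chunk with the `chunksize` option of `pd.read_csv`. The helper name `count_rows` and the chunk size are illustrative assumptions, and the file is assumed to be header-less, as the notebook's use of `names` suggests.

```python
import io
import pandas as pd


def count_rows(source, chunksize=100_000):
    """Count rows of a ';'-separated, header-less CSV without loading it all."""
    total = 0
    # usecols=[0]: parsing only the first column is enough for counting
    for chunk in pd.read_csv(source, sep=';', header=None, usecols=[0],
                             chunksize=chunksize):
        total += len(chunk)
    return total


# tiny in-memory stand-in for the real file
sample = io.StringIO("12011;a;TV\n12011;b;Radio\n12011;c;Phone\n")
print(format(count_rows(sample, chunksize=2), ',') + ' products')
```

With the real dataset, `count_rows(INPUT_FILE)` would replace `len(data)`, at the cost of a second pass over the file if the data is also needed in memory.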


## 2 DATA PREPROCESSING
Now let's transform the original data. We will normalize it, create new columns and remove everything that is useless to our model, including ~1.4 million unavailable products.

In [7]:
from dsLib import *
from time import time

OUTPUT_FILE = 'dataset/preprocessed.csv'

# start time
start = time()

# open input and output files (both are closed automatically)
with open(INPUT_FILE, 'r', encoding='utf-8') as file, \
     open(OUTPUT_FILE, 'w', encoding='utf-8') as output:

    # write header
    output.write(header())

    # for each line...
    for line in file:

        # transform into a list
        lst = transform_line(line)

        # skip invalid products
        if not valid(lst):
            continue

        # apply changes
        line = format_line(lst)

        # save
        output.write(line)

# print preprocessing time
print('{:.2f} seconds'.format(time() - start))


223.81 seconds
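
The `dsLib` helpers used above (`header`, `transform_line`, `valid`, `format_line`) are not shown in this notebook. The sketch below is only a guess at what such a pipeline could look like: every function name, the accepted `available` values and the validity rules are hypothetical, and the derived columns (`room`, `vehicle`, ...) are simply left blank. It also assumes no field contains a quoted `;`.

```python
# Hypothetical stand-ins for the dsLib helpers; the real ones are not shown.
IN_COLS = ['programId', 'zupid', 'name', 'desc', 'price', 'priceOld', 'brand',
           'date', 'medium', 'large', 'link', 'path', 'main', 'sub', 'third',
           'ean', 'small', 'available', 'img1', 'gender']
OUT_COLS = ['name', 'brand', 'gender', 'room', 'vehicle', 'console', 'device',
            'pet', 'mattress', 'cup', 'category']
AVAILABLE_VALUES = {'1', 'true'}  # assumption: how the feed marks availability


def output_header():
    """Header line of the preprocessed file."""
    return ';'.join(OUT_COLS) + '\n'


def split_line(line):
    """Split one raw line into a {column: value} dict, or None if malformed."""
    fields = line.rstrip('\n').split(';')
    if len(fields) != len(IN_COLS):
        return None
    return dict(zip(IN_COLS, (f.strip() for f in fields)))


def is_valid(row):
    """Keep only well-formed, available products that have a category path."""
    return (row is not None
            and row['available'].lower() in AVAILABLE_VALUES
            and row['path'] != '')


def to_output_line(row):
    """Normalize the text columns; derived columns stay blank in this sketch."""
    out = dict.fromkeys(OUT_COLS, '')
    out['name'] = row['name'].lower()
    out['brand'] = row['brand'].lower()
    out['gender'] = row['gender']
    out['category'] = row['path']
    return ';'.join(out[c] for c in OUT_COLS) + '\n'


# build one raw line from a few known fields and run it through the pipeline
values = dict.fromkeys(IN_COLS, '')
values.update(name='Smart TV', brand='Sony', available='true',
              path='Eletrônicos / TVs / Smart TV')
row = split_line(';'.join(values[c] for c in IN_COLS) + '\n')
if is_valid(row):
    print(to_output_line(row), end='')
```

Keeping the validity check separate from the formatting step mirrors the loop above: invalid lines are skipped before any normalization work is done on them.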


### 2.1 SAMPLE

In [8]:
# read output file
data = pd.read_csv(OUTPUT_FILE, sep=';', encoding='utf-8',names = ['name',
                                                                   'brand',
                                                                   'gender',
                                                                   'room',
                                                                   'vehicle',
                                                                   'console',
                                                                   'device',
                                                                   'pet',
                                                                   'mattress',
                                                                   'cup',
                                                                   'category'])

# print a product
print(data.loc[1211643])

name                                     perfume 212 sexy
brand                                    carolina herrera
gender                                                  1
room                                                     
vehicle                                                  
console                                                  
device                                                   
pet                                                      
mattress                                                 
cup                                                      
category    Perfumaria e Cosméticos / Perfumes / Feminino
Name: 1211643, dtype: object


### 2.2 SIZE

In [9]:
# get metadata
statinfo = os.stat(OUTPUT_FILE)

# print file size, in GB
print(format(statinfo.st_size / pow(1024,3),'.2') + ' GB')

# print number or products (rows)
print(format(len(data),',') + ' products')

# free memory
del data

0.12 GB
1,235,888 products
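
As a quick sanity check on the "~1.4 million" figure, the row counts printed in sections 1.2 and 2.2 can be compared directly. The snippet below only redoes that arithmetic with the numbers shown in the outputs; it does not touch the dataset.

```python
rows_before = 2_625_975   # products in the original file (section 1.2)
rows_after = 1_235_888    # products after preprocessing (section 2.2)

removed = rows_before - rows_after
print(format(removed, ',') + ' products removed')   # 1,390,087 products removed
print('{:.1%} of the original rows'.format(removed / rows_before))
```

The difference covers every row dropped by preprocessing, so it is an upper bound on the number of unavailable products rather than an exact count.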


## 3 IMPLEMENTATION
With our dataset ready, let's read it, encode the features and labels, and train the model. To finish, let's predict the product categories with our Decision Tree classifier.
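
The encoding step relies on scikit-learn's `LabelEncoder`, which maps each distinct string to an integer code. The toy example below (the brand values are made up, not taken from the dataset) shows the mapping and how `inverse_transform` recovers the original strings.

```python
from sklearn import preprocessing

brands = ['sony', 'lg', 'sony', 'apple']   # toy values, not from the dataset

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(brands)

print(encoder.classes_.tolist())                   # ['apple', 'lg', 'sony']
print(codes.tolist())                              # [2, 1, 2, 0]
print(encoder.inverse_transform(codes).tolist())   # ['sony', 'lg', 'sony', 'apple']
```

The codes follow the sorted order of the distinct values, so they carry no real ranking. A decision tree can still split on them, but the numeric order itself should not be read as meaningful.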

In [10]:
from mlLib import *
import pandas as pd
import numpy as np
from time import time
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

INPUT_FILE = OUTPUT_FILE

NUMBER_OF_FEATURES = 10

# start time
start = time()

# read data
data = pd.read_csv(INPUT_FILE, sep=';', header=0, converters={'price': float}, encoding='utf-8')

# price scaling
# data['price'] = np.log(data['price'])

# feature data (all rows, all columns except the last); copy it so the
# in-place encoding below does not write into a view of `data`
X_all = data.iloc[:, :-1].copy()

# target data (all rows, only the last column)
y_all = data.iloc[:, -1]

# save memory
del data

# encode features
X_encoder = []
for col in range(0, NUMBER_OF_FEATURES):
    X_encoder.append(preprocessing.LabelEncoder())
    X_encoder[col].fit(X_all.iloc[:, col].astype('str'))
    X_all.iloc[:, col] = X_encoder[col].transform(X_all.iloc[:, col].astype('str'))

# encode labels
y_encoder = preprocessing.LabelEncoder()
y_encoder.fit(y_all)
y_all = y_encoder.transform(y_all)

# shuffle and split the dataset into training and testing points
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=55)

# initialize the classifier
clf = DecisionTreeClassifier(random_state=86)

# grid search (optional hyperparameter tuning)
# clf = best_estimator(clf, X_train, y_train)

# fit
clf.fit(X_train, y_train)

# predict (save memory!)
# predict_loop(clf, X_test, y_test, X_encoder, y_encoder)

# predict (no output)
print('accuracy: {:.2f}%'.format(100. * accuracy_score(y_test, clf.predict(X_test))))

# print processing time
print('{:.2f} seconds'.format(time() - start))

accuracy: 90.13%
141.62 seconds
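
The commented-out `best_estimator()` call comes from `mlLib`, whose code is not shown here. A minimal sketch of what such a helper might do, assuming it wraps scikit-learn's `GridSearchCV`, is given below; the function name `tune_tree`, the parameter grid and the synthetic data are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tune_tree(clf, X, y):
    """Hypothetical stand-in for best_estimator(): cross-validated grid search."""
    param_grid = {                      # illustrative values only
        'max_depth': [5, 10, None],
        'min_samples_split': [2, 10],
    }
    search = GridSearchCV(clf, param_grid, scoring='accuracy', cv=3)
    search.fit(X, y)
    # by default the best parameter combination is refit on all of X, y
    return search.best_estimator_


# small synthetic problem so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
best = tune_tree(DecisionTreeClassifier(random_state=86), X_demo, y_demo)
print(best.get_params()['max_depth'], best.get_params()['min_samples_split'])
```

On the full dataset a grid search fits the tree once per parameter combination and fold, so the running time would grow well beyond the ~2.5 minutes measured above.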


## 4 RESULTS
Our model read, encoded, trained on and evaluated ~1.2 million products in under 2.5 minutes (141.62 seconds), reaching ~90% accuracy on the held-out 20% test split. Not bad.

PS: to inspect the wrong predictions, uncomment the *predict_loop()* call. Because the data must be decoded first, the whole process then takes ~10 minutes.
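
Since `predict_loop()` itself is not shown, here is a hedged sketch of one way to list wrong predictions with decoded category names. The helper `wrong_predictions` and the toy data are hypothetical; unlike the original call, this version decodes only the labels of the misclassified rows, not the features.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier


def wrong_predictions(clf, X_test, y_test, y_encoder):
    """Return (row position, expected category, predicted category) per miss."""
    y_true = np.asarray(y_test)
    y_pred = clf.predict(X_test)
    misses = np.flatnonzero(y_pred != y_true)
    expected = y_encoder.inverse_transform(y_true[misses])
    predicted = y_encoder.inverse_transform(y_pred[misses])
    return list(zip(misses.tolist(), expected.tolist(), predicted.tolist()))


# toy data: one feature, two categories
y_encoder = preprocessing.LabelEncoder().fit(['Perfumes', 'TVs'])
clf = DecisionTreeClassifier(random_state=86).fit([[0], [1]], [0, 1])

# the second test label is deliberately wrong, so it shows up as a miss
for pos, expected, predicted in wrong_predictions(clf, [[0], [1]], [0, 0], y_encoder):
    print('row {}: expected {!r}, predicted {!r}'.format(pos, expected, predicted))
```

Decoding only the misses keeps the extra work proportional to the error count (roughly 10% of the test split here) instead of the whole dataset.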