# Exercise 1 - Sistemas Inteligentes

The objective of this project is to create a model pipeline for a **categorization model**.

**Authors**
- Luís Vendramin
- Matheus Garcia

---

In [1]:
import os
import pandas as pd

from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# 1. Data and Exploratory Analysis

In [2]:
data = pd.read_csv(os.environ['DATASET_PATH'])
data.head()

Unnamed: 0,product_id,seller_id,query,search_page,position,title,concatenated_tags,creation_date,price,weight,express_delivery,minimum_quantity,view_counts,order_counts,category
0,11394449,8324141,espirito santo,2,6,Mandala Espírito Santo,mandala mdf,2015-11-14 19:42:12,171.89,1200.0,1,4,244,,Decoração
1,15534262,6939286,cartao de visita,2,0,Cartão de Visita,cartao visita panfletos tag adesivos copos lon...,2018-04-04 20:55:07,77.67,8.0,1,5,124,,Papel e Cia
2,16153119,9835835,expositor de esmaltes,1,38,Organizador expositor p/ 70 esmaltes,expositor,2018-10-13 20:57:07,73.920006,2709.0,1,1,59,,Outros
3,15877252,8071206,medidas lencol para berco americano,1,6,Jogo de Lençol Berço Estampado,t jogo lencol menino lencol berco,2017-02-27 13:26:03,118.770004,0.0,1,1,180,1.0,Bebê
4,15917108,7200773,adesivo box banheiro,3,38,ADESIVO BOX DE BANHEIRO,adesivo box banheiro,2017-05-09 13:18:38,191.81,507.0,1,6,34,,Decoração


## 1.1. Check null values

Only `weight`, `order_counts` and `concatenated_tags` have missing values. Since we do not use `weight` or `order_counts`, their missing values are not a problem. We do use `concatenated_tags` in the model, so we fill its null values with an empty string.

In [3]:
data.isnull().mean().round(5).sort_values()

product_id           0.00000
seller_id            0.00000
query                0.00000
search_page          0.00000
position             0.00000
title                0.00000
creation_date        0.00000
price                0.00000
express_delivery     0.00000
minimum_quantity     0.00000
view_counts          0.00000
category             0.00000
concatenated_tags    0.00005
weight               0.00153
order_counts         0.52908
dtype: float64

## 1.2. Check the target balance

Our target variable `category` is unbalanced: about 46% of the observations belong to "Lembrancinhas", while only about 3% and 2% belong to "Outros" and "Bijuterias e Jóias", respectively. It is therefore necessary to evaluate the model with metrics that cope better with unbalanced data, such as **Recall**, **F1** or **AUC**.

In [4]:
data.category.value_counts(normalize=True).round(2)

Lembrancinhas         0.46
Decoração             0.23
Bebê                  0.18
Papel e Cia           0.07
Outros                0.03
Bijuterias e Jóias    0.02
Name: category, dtype: float64
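
To illustrate why plain accuracy can mislead on an unbalanced target, here is a minimal sketch on hypothetical toy labels (not the project data). A "model" that always predicts the most frequent class scores well on accuracy and poorly on macro-averaged recall.

```python
from collections import Counter

# Hypothetical toy labels: one dominant class and two rare ones.
y_true = ["A"] * 8 + ["B"] + ["C"]

# A trivial "model" that always predicts the most frequent class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

# Accuracy: fraction of correct predictions overall.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(label):
    """Correct predictions for `label` divided by its true members."""
    hits = sum(t == p == label for t, p in zip(y_true, y_pred))
    return hits / y_true.count(label)


# Macro recall: unweighted mean of the per-class recalls.
classes = sorted(set(y_true))
macro_recall = sum(recall(c) for c in classes) / len(classes)

print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
# accuracy=0.80, macro recall=0.33
```

The rare classes are never predicted, which the macro average exposes and the accuracy hides.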

# 2. Model

We build a very simple model that uses only the text fields `title` and `concatenated_tags`.

In the first part of our model pipeline we:
* Concatenate `title` and `concatenated_tags`
* Apply the upper-case function to both fields
* Fill the null values in `concatenated_tags`

In [5]:
(data['title'].str.upper()+' '+data['concatenated_tags'].str.upper()).head()

0                   MANDALA ESPÍRITO SANTO MANDALA MDF
1    CARTÃO DE VISITA CARTAO VISITA PANFLETOS TAG A...
2       ORGANIZADOR EXPOSITOR P/ 70 ESMALTES EXPOSITOR
3    JOGO DE LENÇOL BERÇO ESTAMPADO T JOGO LENCOL M...
4         ADESIVO BOX DE BANHEIRO ADESIVO BOX BANHEIRO
dtype: object
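
The preview above does not fill the missing tags, so any row without tags would come out as a missing value. A small sketch on a hypothetical two-row frame (the `toy` frame and its values are made up for illustration) shows the effect of the fill step:

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no tags.
toy = pd.DataFrame({
    "title": ["Mandala", "Quadro"],
    "concatenated_tags": ["mdf", None],
})

# Without fillna, the missing tag makes the whole concatenation missing.
raw = toy["title"].str.upper() + " " + toy["concatenated_tags"].str.upper()

# Filling the tags first keeps the title for that row.
filled = (
    toy["title"].str.upper()
    + " "
    + toy["concatenated_tags"].str.upper().fillna("")
)

print(raw.isna().tolist())
print(filled.tolist())
```

Because `.fillna("")` binds more tightly than `+`, it applies only to the tags column, which is the only one of the two fields with missing values.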

In [6]:
# Preprocessing: upper-case both text fields, fill missing tags, and join them
X = data['title'].str.upper() + ' ' + data['concatenated_tags'].str.upper().fillna('')
y = data.category

# Encode the category labels as integers so the target is numeric
le = LabelEncoder()
le.fit(y)
y = le.transform(y)
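
As a minimal sketch of what the encoding does, the example below runs `LabelEncoder` on a few hypothetical labels (the names `labels` and `encoder` are illustrative) and maps the integer codes back to category names with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, using category names from the dataset preview.
labels = ["Decoração", "Bebê", "Outros", "Bebê"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # fit and transform in one call

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(encoder.classes_))  # ['Bebê', 'Decoração', 'Outros']
print(codes.tolist())          # [1, 0, 2, 0]

# inverse_transform maps integer predictions back to category names.
decoded = encoder.inverse_transform(codes)
```

The same `inverse_transform` call applies to the model's integer predictions when readable category names are needed.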

In the second part, we apply the scikit-learn classes `CountVectorizer()` and `TfidfTransformer()` to our text explanatory variable to encode it as a numerical matrix. For more details, see the scikit-learn tutorial on [working with text data][1].
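
The sketch below walks through the two encoding steps on a hypothetical two-document corpus (the `docs` list is illustrative, not the project data). It assumes the default settings of both classes, as used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical two-document corpus in the same upper-case form as X.
docs = ["MANDALA ESPÍRITO SANTO MANDALA MDF", "ADESIVO BOX DE BANHEIRO"]

vect = CountVectorizer()           # tokenizes and counts words
counts = vect.fit_transform(docs)  # sparse matrix: documents x vocabulary

tfidf = TfidfTransformer()         # reweights counts by inverse document frequency
weights = tfidf.fit_transform(counts)

# CountVectorizer lowercases by default, so the vocabulary is lower-case.
print(sorted(vect.vocabulary_))

# With the default norm="l2", every row has unit Euclidean length.
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(counts.shape, row_norms)
```

Since `CountVectorizer` lowercases its input by default, the earlier upper-casing step does not change the resulting vocabulary under these default settings.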

Finally, we build a Random Forest Classifier tuned with a Grid Search over 5 stratified folds. During model training, we took care to account for the imbalance of the target variable. Therefore, we:
* Used the ROC AUC score (one-vs-rest) in the Grid Search
* Set `class_weight` to "balanced" in the Random Forest Classifier
* Implemented a stratified cross-validation.

[1]: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
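
For intuition about the "balanced" setting, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on a hypothetical encoded target (`y_toy` is illustrative):

```python
from collections import Counter

# Hypothetical encoded target: class 0 is common, classes 1 and 2 are rarer.
y_toy = [0] * 6 + [1] * 3 + [2] * 1

counts = Counter(y_toy)
n_samples = len(y_toy)
n_classes = len(counts)

# Documented formula for class_weight="balanced":
# n_samples / (n_classes * count_of_class)
balanced = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

for label, weight in balanced.items():
    print(label, round(weight, 3))
```

The rarest class receives the largest weight, so its misclassifications count more during training.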

In [7]:
text_clf = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', RandomForestClassifier(random_state=42))
])

param_grid = { 
    'clf__n_estimators': [100, 200, 500],
    'clf__max_depth' : [2,3,4,5,6,7],
    'clf__class_weight': ["balanced"]  # Reweight classes to compensate for the unbalanced target
}

In [None]:
%%time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

CV_rfc = GridSearchCV(estimator=text_clf, 
                      param_grid=param_grid, 
                      cv=StratifiedKFold(random_state=42, shuffle=True, n_splits=5), 
                      scoring='roc_auc_ovr', 
                      n_jobs=-1,
                      verbose=10,
                      refit=True)

CV_rfc.fit(X_train, y_train);

Fitting 5 folds for each of 18 candidates, totalling 90 fits
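
The counts in the log line follow directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A short sketch of that arithmetic, using the same grid as above:

```python
from itertools import product

# Same grid as defined for the search above.
param_grid = {
    "clf__n_estimators": [100, 200, 500],
    "clf__max_depth": [2, 3, 4, 5, 6, 7],
    "clf__class_weight": ["balanced"],
}
n_splits = 5

# Every combination of values is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

total_fits = len(candidates) * n_splits
print(len(candidates), total_fits)  # 18 90
```

With `refit=True`, the best candidate is then fitted once more on the whole training set; that extra fit is not part of the 90.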


In [None]:
CV_rfc.best_estimator_

# 3. Metrics and Results

As we can see, the weighted average recall of our model is 75% and the weighted average precision is 81%. The macro average precision is weaker because of the "Outros" class: this category is probably too generic, so it is harder to build a classifier for it than for the others.
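
To make the difference between the two averages concrete, here is a small sketch with hypothetical per-class precision and support values (these numbers are invented for illustration and are not the model's results):

```python
# Hypothetical per-class precision and support (number of true samples).
precision = {"common": 0.90, "medium": 0.80, "rare": 0.30}
support = {"common": 70, "medium": 25, "rare": 5}

# Macro average: every class counts equally.
macro = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro={macro:.3f}, weighted={weighted:.3f}")
```

A single weak, rare class pulls the macro average down sharply while barely moving the weighted one, which is the pattern described above for "Outros".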

In [None]:
confusion_matrix(y_test, CV_rfc.predict(X_test))

In [None]:
print(classification_report(y_test, CV_rfc.predict(X_test), target_names=list(le.classes_)))

In [None]:
print(classification_report(y_train, CV_rfc.predict(X_train), target_names=list(le.classes_)))

# 4. Exporting model

In [None]:
dump(CV_rfc, os.environ['MODEL_PATH'])
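
Since the target was label-encoded, the exported search object predicts integer codes, and the encoder itself is not part of the saved file. One possible approach, sketched here with placeholder objects and a hypothetical file name, is to store the class names next to the model and restore both with `load`:

```python
import os
import tempfile

from joblib import dump, load

# Hypothetical stand-ins for the fitted objects; in the notebook these
# would be CV_rfc and list(le.classes_).
bundle = {
    "model": "fitted-search-placeholder",
    "classes": ["Bebê", "Decoração"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    dump(bundle, path)     # serialize both objects to one file
    restored = load(path)  # read them back, e.g. in a serving process

print(restored["classes"])
```

This is only a sketch of the round trip; the notebook itself saves `CV_rfc` alone to `MODEL_PATH`.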