# Ensembled Models I

In this notebook, we explore different combinations of the previously trained NLP and image models.


In [3]:
import os
os.chdir('/home/app/src')
import time
import joblib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin

from utils import evaluation
from utils.build_df import build_df
from utils import tree_utils
from utils.text_normalizer import normalize_corpus
from utils.decoder import decode_id_path, decode_id

from utils import utils_img
from utils import efficientnet

%load_ext autoreload
%autoreload 2

2022-12-27 18:19:50.282783: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-27 18:19:50.282817: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[nltk_data] Downloading package stopwords to /home/app/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 1. Labels 



The `build_df()` function returns a new dataset with a custom leaf (label) for each product, chosen according to a threshold on the minimum number of products per category.

Call `build_df()` to extract the labels.

In [4]:
cat = build_df(json_path='data/products.json', 
             threshold=100, 
             preprocessed_csv='data/normalized_data.csv'
            ) 

In [5]:
y = cat['leaf']

We recreate the hierarchical structure of our categories with our `make_tree()` function.

The nodes are extracted from the same dataframe generated by `build_df()`. We will use the tree later to get the distance between predicted and true categories when applying the `get_performance()` function.

In [6]:
tree_dict = tree_utils.make_tree(cat, cat['category'], 'Categories', display_tree= True)

Categories - All categories
├── pcmcat312300050015 - Connected Home & Housewares
│   ├── pcmcat248700050021 - Housewares
│   │   ├── pcmcat303600050001 - Household Batteries
│   │   └── pcmcat179100050006 - Outdoor Living
│   │       ├── pcmcat179200050003 - Grills
│   │       ├── pcmcat179200050008 - Patio Furniture & Decor
│   │       │   └── pcmcat748300322875 - Outdoor Seating
│   │       └── pcmcat179200050013 - Outdoor Heating
│   ├── abcat0802000 - Telephones & Communication
│   │   ├── abcat0811011 - Telephone Accessories
│   │   └── abcat0802001 - Cordless Telephones
│   │       └── pcmcat159300050002 - Systems
│   ├── abcat0805000 - Office Electronics
│   │   └── abcat0511001 - Printers, Ink & Toner
│   │       └── pcmcat266500050030 - All Printers
│   ├── pcmcat275600050000 - Office & School Supplies
│   │   └── abcat0807000 - Printer Ink & Toner
│   │       ├── abcat0807001 - Printer Ink
│   │       ├── pcmcat335400050008 - 3D Printer Filament
│   │       └── abcat0807009 -
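
The distance between a predicted and a true category mentioned above can be illustrated with a small sketch. The exact metric used inside `get_performance()` is not shown in this notebook, so this assumes the distance is the number of edges between two nodes of the tree; `path_distance` is a hypothetical helper, and the paths are taken from the tree printed above.

```python
def path_distance(path_a, path_b):
    """Number of edges between two nodes, given their root-to-node paths."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    # steps up from node A to the last shared ancestor, then down to node B
    return (len(path_a) - common) + (len(path_b) - common)


# Root-to-node paths copied from the tree printed above
grills = ["Categories", "pcmcat312300050015", "pcmcat248700050021",
          "pcmcat179100050006", "pcmcat179200050003"]
heating = ["Categories", "pcmcat312300050015", "pcmcat248700050021",
           "pcmcat179100050006", "pcmcat179200050013"]
batteries = ["Categories", "pcmcat312300050015", "pcmcat248700050021",
             "pcmcat303600050001"]

print(path_distance(grills, grills))     # 0: same category
print(path_distance(grills, heating))    # 2: siblings under Outdoor Living
print(path_distance(grills, batteries))  # 3: only share Housewares
```

Under this assumption, a wrong prediction inside the right branch is penalised less than a prediction in a different branch, which is what an average distance close to 0 would reflect.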

## 3. Features

We use the already normalized dataset (see the 'prepare_dataset' tutorial notebook).

In [7]:
df = pd.read_csv('data/normalized_data.csv')
df.head()

| Unnamed: 0 | name | description | nm_and_desc | category | image | name_and_description |
|---|---|---|---|---|---|---|
| 0 | duracel aaa batteri 4pack | compat select electron devic aaa size duralock... | Duracell - AAA Batteries (4-Pack) Compatible w... | [{'id': 'pcmcat312300050015', 'name': 'Connect... | http://www.bestbuy.com/site/duracell-aaa-batte... | duracel aaa batteri 4pack compat select electr... |
| 1 | duracel aa 15v coppertop batteri 4pack | longlast energi duralock power preserv technol... | Duracell - AA 1.5V CopperTop Batteries (4-Pack... | [{'id': 'pcmcat312300050015', 'name': 'Connect... | http://www.bestbuy.com/site/duracell-aa-1-5v-c... | duracel aa 15v coppertop batteri 4pack longlas... |
| 2 | duracel aa batteri 8pack | compat select electron devic aa size duralock ... | Duracell - AA Batteries (8-Pack) Compatible wi... | [{'id': 'pcmcat312300050015', 'name': 'Connect... | http://www.bestbuy.com/site/duracell-aa-batter... | duracel aa batteri 8pack compat select electro... |
| 3 | energ max batteri aa 4pack | 4pack aa alkalin batteri batteri tester includ | Energizer - MAX Batteries AA (4-Pack) 4-pack A... | [{'id': 'pcmcat312300050015', 'name': 'Connect... | http://www.bestbuy.com/site/energizer-max-batt... | energ max batteri aa 4pack 4pack aa alkalin ba... |
| 4 | duracel c batteri 4pack | compat select electron devic c size duralock p... | Duracell - C Batteries (4-Pack) Compatible wit... | [{'id': 'pcmcat312300050015', 'name': 'Connect... | http://www.bestbuy.com/site/duracell-c-batteri... | duracel c batteri 4pack compat select electron... |


Since our NLP models can be trained on different features (name, description, or the concatenation of name and description; see the model training notebooks), we can select from the following options:

In [8]:
name = df['name'].apply(str)
description = df['description'].apply(str)
name_and_description = df['name_and_description'].apply(str)
image = df['image']

## 2. First NLP Model: trained with names of products (BL0)



### 2.1. Feature selection

In [9]:
X_a = name

### 2.2. Train/Test split

In [10]:
X_a_train, X_a_test, y_train, y_test = train_test_split(
    X_a, y,
    test_size=0.20, 
    random_state=42,
    stratify = y
)

In [11]:
X_a_train.head()

7029                conair suprem 2in1 hot air brush white
26164    hp slimlin desktop intel pentium 4gb memori 50...
46217    mb quart discu 1200w class sq ab bridgeabl 2ch...
13187                  sabr window glass alarm 2pack white
41483                   elit beat agent preown nintendo ds
Name: name, dtype: object

### 2.3. Feature engineering

In [19]:
tfid_vectorizer_BL0 = TfidfVectorizer(max_features=5000,
                                      ngram_range=(1, 3),
                                      use_idf=False,
                                     ) 
tfid_vectorizer_BL0.fit(X_a_train)
#joblib.dump(tfid_vectorizer_BL0, '/home/app/src/model/vect_BL0')

In [20]:
X_a_train = tfid_vectorizer_BL0.transform(X_a_train)
X_a_test = tfid_vectorizer_BL0.transform(X_a_test)

In [21]:
X_a_train[0]

<1x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>
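
With `use_idf=False` and the default `norm='l2'`, the vectorizer reduces to n-gram counts scaled to unit length. The sketch below reproduces that idea in plain Python for one product name. It is a simplification: it splits on whitespace and ignores the vocabulary limit (`max_features`), so the real vectorizer can keep fewer terms per row. `ngrams` and `l2_term_frequencies` are hypothetical helper names.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams of length n_min..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def l2_term_frequencies(text):
    """Raw n-gram counts scaled so that the vector has unit L2 norm."""
    counts = Counter(ngrams(text.split()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


vec = l2_term_frequencies("duracel aa batteri 8pack")
# 4 unigrams + 3 bigrams + 2 trigrams, each appearing once
print(len(vec))           # 9
print(vec["aa batteri"])  # 1/3, since the norm of nine ones is 3
```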

### 2.4. Modelling

In [22]:
logreg_BL0 = LogisticRegression(max_iter=7000, 
                            n_jobs=-1, 
                            multi_class='multinomial', 
                            solver='newton-cg',
                            random_state=42)

In [23]:
logreg_BL0.fit(X_a_train, y_train)

In [24]:
# save the model 
# filename = '/home/app/src/model/model_BL0'
# joblib.dump(logreg_BL0, filename)

In [25]:
# getting predictions
y_pred_a = logreg_BL0.predict(X_a_test)
y_pred_a.shape

(10330,)

In [26]:
y_pred_a_prob = logreg_BL0.predict_proba(X_a_test)

In [27]:
y_pred_a_prob.shape

(10330, 213)

In [28]:
len(y_pred_a_prob[0])

213
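
Each row of `predict_proba` holds one probability per class, in the order of `classes_` (213 classes here), and the hard prediction is the class with the largest probability. A minimal sketch with made-up labels and probabilities:

```python
import numpy as np

# Hypothetical class labels and probabilities for two products
classes = np.array(["Grills", "Keyboards", "Printer Ink"])
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])

# Every row is a probability distribution over the classes
assert np.allclose(probs.sum(axis=1), 1.0)

# The predicted label is the class at the position of the row maximum
predicted = classes[np.argmax(probs, axis=1)]
print(predicted)  # ['Keyboards' 'Grills']
```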

### 2.5. General evaluation

In [29]:
evaluation.get_performance(model=logreg_BL0,
                           pred_labels=y_pred_a, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer_BL0,
                           probs=y_pred_a_prob,
                           average='micro',
                           tree= tree_dict)

Model Performance metrics:
------------------------------
Accuracy: 0.8191674733785091
Precision: 0.8191674733785091
Recall: 0.8191674733785091
F1 Score: 0.8191674733785091
Average distance between nodes categories: 0.40609874152952563
Top 5 Score: 0.9613746369796708

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.89      1.00      0.94        47
                  A/V Cables & Connectors       0.68      0.81      0.74        90
                  Action Camcorder Mounts       0.59      0.61      0.60        28
           Activity Trackers & Pedometers       0.94      0.85      0.89        39
              Adapters, Cables & Chargers       0.66      0.72      0.69        71
                         Air Conditioners       1.00      0.96      0.98        28
             Air Purifier Filters & Parts       0.94      0.81      0.87        21
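
Accuracy, precision, recall and F1 are identical in the metrics above because they are micro-averaged on a single-label problem: every misclassified sample counts as one false positive (for the predicted class) and one false negative (for the true class). A small sketch with made-up labels shows the equality:

```python
from collections import Counter

# Toy labels, not taken from the dataset
y_true = ["a", "a", "b", "c", "c", "c"]
y_pred = ["a", "b", "b", "c", "a", "c"]

tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1  # the predicted class gains a false positive
        fn[t] += 1  # the true class gains a false negative

# Micro-averaging pools the counts over all classes before dividing
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
micro_recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_precision, micro_recall, accuracy)  # all three are 4/6
```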
      

## 3. Second NLP Model: trained with concatenation of name and description of products (BL1)

### 3.1. Feature selection 

In [30]:
X_b = name_and_description
X_b.head()

0    duracel aaa batteri 4pack compat select electr...
1    duracel aa 15v coppertop batteri 4pack longlas...
2    duracel aa batteri 8pack compat select electro...
3    energ max batteri aa 4pack 4pack aa alkalin ba...
4    duracel c batteri 4pack compat select electron...
Name: name_and_description, dtype: object

### 3.2. Train/Test split

In [31]:
X_b_train, X_b_test, y_train, y_test = train_test_split(
    X_b, y,
    test_size=0.20, 
    random_state=42,
    stratify = y
)

In [32]:
X_b_train.head()

7029     conair suprem 2in1 hot air brush white 150 wat...
26164    hp slimlin desktop intel pentium 4gb memori 50...
46217    mb quart discu 1200w class sq ab bridgeabl 2ch...
13187    sabr window glass alarm 2pack white sabr windo...
41483    elit beat agent preown nintendo ds prepar rock...
Name: name_and_description, dtype: object

### 3.3. Feature engineering

In [33]:
tfid_vectorizer_BL1 = TfidfVectorizer(max_features=3000,
                                  ngram_range=(1, 2),
                                  use_idf=False,
                                  min_df=1,
                                  norm='l2',
                                  smooth_idf=True
                                 )

In [34]:
tfid_vectorizer_BL1.fit(X_b_train)
#joblib.dump(tfid_vectorizer_BL1, '/home/app/src/model/vect_BL1')

In [35]:
X_b_train = tfid_vectorizer_BL1.transform(X_b_train)
X_b_test = tfid_vectorizer_BL1.transform(X_b_test)

In [36]:
X_b_train[0]

<1x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 22 stored elements in Compressed Sparse Row format>

### 3.4. Modelling

In [37]:
logreg_BL1 = LogisticRegression(max_iter=2500, 
                            n_jobs=-1, 
                            multi_class='multinomial', 
                            solver='newton-cg',
                            random_state=42)

In [38]:
logreg_BL1.fit(X_b_train, y_train)

In [39]:
# save the model to disk
# filename = '/home/app/src/model/model_BL1'
# joblib.dump(logreg_BL1, filename)

In [40]:
y_pred_b = logreg_BL1.predict(X_b_test)

In [41]:
y_pred_b_prob = logreg_BL1.predict_proba(X_b_test)

### 3.5. General evaluation

In [42]:
evaluation.get_performance(model=logreg_BL1,
                           pred_labels=y_pred_b, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer_BL1,
                           probs=y_pred_b_prob,
                           average='micro',
                           tree= tree_dict)

Model Performance metrics:
------------------------------
Accuracy: 0.8207163601161666
Precision: 0.8207163601161666
Recall: 0.8207163601161666
F1 Score: 0.8207163601161666
Average distance between nodes categories: 0.3940948693126815
Top 5 Score: 0.965053242981607

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.87      1.00      0.93        47
                  A/V Cables & Connectors       0.68      0.84      0.76        90
                  Action Camcorder Mounts       0.64      0.75      0.69        28
           Activity Trackers & Pedometers       0.95      0.95      0.95        39
              Adapters, Cables & Chargers       0.67      0.73      0.70        71
                         Air Conditioners       0.96      0.96      0.96        28
             Air Purifier Filters & Parts       1.00      0.86      0.92        21
        

## 4. Computer Vision Model: trained on images of the products

**TO-REVISE REFERENCE** see notebook images 

In [None]:
CONFIG_YML = "/home/app/src/model/exp4.yml"

TEST_FOLDER = "/home/app/src/uploads/"

WEIGHTS = "/home/app/src/model/model.06-2.0593.h5"

config = utils_img.load_config(CONFIG_YML)

MODEL_CLASSES = logreg_BL0.classes_

cnn_model = efficientnet.create_model(weights=WEIGHTS)


predictions, labels, probs = utils_img.predict_from_folder(
    folder=TEST_FOLDER, 
    model=cnn_model, 
    input_size=config["data"]["image_size"], 
    class_names=MODEL_CLASSES,
)

In [None]:
## csv 

----------------------------------------------------------------------

## 5. Ensembled Model A: average of predicted probabilities given by previous models

The first way to combine the models is to average the predicted probabilities for each label.

In this case every model's probability gets the same weight, that is:

$$ w_{n} + w_{d} + w_{i} = 1 $$
with $$ w_{n} = w_{d} = w_{i} $$

where $w_{n}$, $w_{d}$ and $w_{i}$ are the weights applied to the per-sample probabilities given by the name model (BL0), the name-and-description model (BL1) and the image model, respectively.

$$ P(x) = p_{n}(x)  w_{n}  + p_{d}(x)  w_{d} + p_{i}(x)  w_{i}$$ 

In [43]:
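# average model
# y_pred_prob_C holds the averaged probabilities used in the cells below.
# Only the two NLP models are averaged here: the image-model probabilities
# come from a test set of a different size (see the note below).
y_pred_prob_C = (y_pred_a_prob + y_pred_b_prob) / 2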
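
The weighted-average formula can be sketched for any number of models whose probability matrices share the same shape. `weighted_average` is a hypothetical helper and the probabilities are made up; it only illustrates the formula and is not the code used to produce the results in this notebook.

```python
import numpy as np


def weighted_average(prob_list, weights):
    """Weighted average of per-model probability matrices of identical shape."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    stacked = np.stack(prob_list)  # shape: (n_models, n_samples, n_classes)
    # contract the model axis: sum over m of weights[m] * stacked[m]
    return np.tensordot(weights, stacked, axes=1)


# One sample, three classes, three hypothetical models
p_n = np.array([[0.7, 0.2, 0.1]])
p_d = np.array([[0.5, 0.4, 0.1]])
p_i = np.array([[0.3, 0.3, 0.4]])

avg = weighted_average([p_n, p_d, p_i], [1 / 3, 1 / 3, 1 / 3])
print(avg)  # approximately [[0.5 0.3 0.2]]
```

Because the weights sum to 1 and every input row sums to 1, each row of the result is again a probability distribution.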

**TO-REVISE**

As mentioned in the image model notebook, the NLP models (BL0 and BL1) were trained on 41316 samples and tested on 10330, whereas the image model was trained on 40024 samples and tested on 10006. The difference is caused by broken URLs and GIFs (see the notebooks on dataset preparation for the image model).


Because of this difference in size, it is not possible to directly compare the probabilities predicted by the two kinds of models (NLP and image). However, given that our main focus is on NLP models, we can make this comparison between the two NLP models:

In [45]:
tk_c = top_k_accuracy_score(y_test, y_pred_prob_C, k=5)
tk_a = top_k_accuracy_score(y_test,y_pred_a_prob, k=5)
tk_b = top_k_accuracy_score(y_test,y_pred_b_prob, k=5)
print(f""" 
Top K=5 accuracy score:
----------------------
Model A: {tk_a}
Model B: {tk_b}
Model C(avg): {tk_c}
""")


 
Top K=5 accuracy score:
----------------------
Model A: 0.9613746369796708
Model B: 0.965053242981607
Model C(avg): 0.9706679574056147
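
For reference, the top-k accuracy is the fraction of samples whose true class is among the k classes with the highest predicted probability. The sketch below computes it with NumPy on made-up data; `top_k_accuracy` is a hypothetical helper and its handling of ties may differ from `top_k_accuracy_score`.

```python
import numpy as np


def top_k_accuracy(y_true_idx, probs, k=5):
    """Fraction of rows whose true class index is among the k largest scores."""
    top_k = np.argsort(-probs, axis=1)[:, :k]  # indices of the k best classes
    hits = (top_k == np.asarray(y_true_idx)[:, None]).any(axis=1)
    return hits.mean()


probs = np.array([[0.50, 0.30, 0.20],
                  [0.10, 0.20, 0.70],
                  [0.40, 0.35, 0.25]])
y_true_idx = [1, 0, 2]  # true class index of each row

print(top_k_accuracy(y_true_idx, probs, k=1))  # 0.0
print(top_k_accuracy(y_true_idx, probs, k=2))  # one hit out of three
print(top_k_accuracy(y_true_idx, probs, k=3))  # 1.0
```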



In [44]:
y_pred_prob_C.shape

(10330, 213)

### 5.1. Getting predictions from our first combined model

In [46]:
# all models were trained on the same labels, so it does not matter which model's classes we take
labels = logreg_BL1.classes_


**TO-REVISE** Extracting categories for the API

In [60]:
def get_feat_max(cat_prob, prod_idx, max_k_feat, classes):
    """Given an array of predicted class probabilities, return a dictionary with the
    decoded names of the k most probable classes for the product at prod_idx."""
    # indexes of the k most probable predicted categories, most probable first
    most_prob_cat_idx = np.argsort(-cat_prob[prod_idx])[:max_k_feat]
    name_cat_max = [classes[idx] for idx in most_prob_cat_idx]

    # keys are the ranks '1'..'k'; decode_id_path is imported from utils.decoder
    dict_max_feat = {}
    for rank, cat_id in enumerate(name_cat_max, start=1):
        dict_max_feat[str(rank)] = decode_id_path(cat_id)

    return dict_max_feat

In [None]:
 # Not normalizing gives the number of "correctly" classified samples
# top_k_accuracy_score(y_test, y_pred_test1_p, k=5, normalize=False)

In [61]:
#Model A
prob_cat_max_a = np.sort(-y_pred_a_prob[0])[:5]
most_prob_cat_idx_a = np.argsort(-y_pred_a_prob[0])[:5]
print(most_prob_cat_idx_a)
print(prob_cat_max_a)

[101  92  16  39  60]
[-0.75610682 -0.0212806  -0.00728905 -0.00705707 -0.00622776]
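
The probabilities printed above are negative because `np.sort` is applied to the negated array, which is a common trick to sort in descending order. Indexing the original array with the result of `np.argsort` gives the same ranking with the original (positive) values. A small sketch with made-up probabilities:

```python
import numpy as np

p = np.array([0.02, 0.75, 0.01, 0.15, 0.07])  # made-up probabilities

negated_sorted = np.sort(-p)[:3]  # descending order, but the values are negated
top_idx = np.argsort(-p)[:3]      # class indices, most probable first
top_prob = p[top_idx]             # matching probabilities with the original sign

print(negated_sorted)  # [-0.75 -0.15 -0.07]
print(top_idx)         # [1 3 4]
print(top_prob)        # [0.75 0.15 0.07]
```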


In [64]:
#Model B
prob_cat_max_b = np.sort(-y_pred_b_prob[0])[:5]
most_prob_cat_idx_b = np.argsort(-y_pred_b_prob[0])[:5]
print(most_prob_cat_idx_b)
print(prob_cat_max_b)

[101  39  16  13  92]
[-0.8483222  -0.04935319 -0.01089907 -0.00571699 -0.00557107]


In [63]:
most_prob_cat_idx_C = np.argsort(-y_pred_prob_C[0])[:5]
print(most_prob_cat_idx_C)

[101  39  92  16  13]


In [65]:
# categories model A
dict_a = get_feat_max(cat_prob= y_pred_a_prob,
                      prod_idx= 0,
                      max_k_feat=5,
                      classes= labels)
dict_a

{'1': ['Musical Instruments', 'Keyboards'],
 '2': ['other'],
 '3': ['Musical Instruments', 'Musical Instrument Accessories'],
 '4': ['Computers & Tablets',
  'Computer Accessories & Peripherals',
  'Mice & Keyboards',
  'Computer Keyboards'],
 '5': ['Cell Phones',
  'Cell Phone Accessories',
  'Cell Phone Batteries & Power']}

In [66]:
### categories model B
dict_b = get_feat_max(cat_prob= y_pred_b_prob,
                      prod_idx= 0,
                      max_k_feat=5,
                      classes= labels)
dict_b

{'1': ['Musical Instruments', 'Keyboards'],
 '2': ['Computers & Tablets',
  'Computer Accessories & Peripherals',
  'Mice & Keyboards',
  'Computer Keyboards'],
 '3': ['Musical Instruments', 'Musical Instrument Accessories'],
 '4': ['Musical Instruments'],
 '5': ['other']}

In [67]:
#categories model C
dict_c = get_feat_max(cat_prob= y_pred_prob_C,
                      prod_idx= 0,
                      max_k_feat=5,
                      classes= labels)
dict_c

{'1': ['Musical Instruments', 'Keyboards'],
 '2': ['Computers & Tablets',
  'Computer Accessories & Peripherals',
  'Mice & Keyboards',
  'Computer Keyboards'],
 '3': ['other'],
 '4': ['Musical Instruments', 'Musical Instrument Accessories'],
 '5': ['Musical Instruments']}

## 6. Ensembled Model B: max of predicted probabilities given by previous models

The second way to combine the models is to take, for each label, the maximum of the predicted probabilities.

As before, every model's probability gets the same weight, that is:

$$ w_{n} + w_{d} + w_{i} = 1 $$
with $$ w_{n} = w_{d} = w_{i} $$

where $w_{n}$, $w_{d}$ and $w_{i}$ are the weights applied to the per-sample probabilities given by the name model (BL0), the name-and-description model (BL1) and the image model, respectively. In this second model we do not take the average but the maximum of the weighted probabilities assigned by the models:

$$ P(x) = max(p_{n}(x)  w_{n}  , p_{d}(x)  w_{d} , p_{i}(x)  w_{i})$$ 

In [None]:
# Elementwise maximum across models; the built-in max() cannot compare whole NumPy rows
y_pred_prob_C = np.maximum.reduce([y_pred_a_prob, y_pred_b_prob, y_pred_prob_C])
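
Python's built-in `max()` raises a `ValueError` when given whole NumPy arrays with more than one element, so the per-class maximum has to be taken elementwise. The sketch below uses `np.maximum.reduce` on made-up probabilities. The resulting rows no longer sum to 1; the renormalisation step is an optional assumption on our side and does not change the ranking of the classes.

```python
import numpy as np

# One sample, three classes, three hypothetical models
p_n = np.array([[0.7, 0.2, 0.1]])
p_d = np.array([[0.5, 0.4, 0.1]])
p_i = np.array([[0.3, 0.3, 0.4]])

# Largest probability assigned to each class by any of the models
max_scores = np.maximum.reduce([p_n, p_d, p_i])
print(max_scores)  # [[0.7 0.4 0.4]]

# Optional: rescale each row so it can be read as a distribution again
renormalised = max_scores / max_scores.sum(axis=1, keepdims=True)
```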

In [None]:
tk_c = top_k_accuracy_score(y_test, y_pred_prob_C, k=5)
tk_a = top_k_accuracy_score(y_test,y_pred_a_prob, k=5)
tk_b = top_k_accuracy_score(y_test,y_pred_b_prob, k=5)
print(f""" 
Top K=5 accuracy score:
----------------------
Model A: {tk_a}
Model B: {tk_b}
Model C(max): {tk_c}
""")

**TO-REVISE** **DIRECT COMPARISON BETWEEN NLP MODELS**
Because of this difference in size, it is not possible to directly compare the probabilities predicted by the two kinds of models (NLP and image). However, given that our main focus is on NLP models, we can make this comparison between the two NLP models:

### 6.1. Getting predictions from our second combined model

**TO-REVISE** Extracting categories for the API

In [None]:
#Use get_feat_max

## 7. Ensembled Model C: weighting of predicted probabilities given by previous models based on their score

The third way to combine the models is to weight the predicted probability for each label by the accuracy score of the model that produced it.

In this case the weight applied to each model's probability is given by the following formulas:

$$ w_{n} = \frac{sc_{n}}{sc_{n} + sc_{d} + sc_{i}}$$

$$ w_{d} = \frac{sc_{d}}{sc_{n} + sc_{d} + sc_{i}}$$

$$ w_{i} = \frac{sc_{i}}{sc_{n} + sc_{d} + sc_{i}}$$


where $sc_{n}$, $sc_{d}$ and $sc_{i}$ correspond to the accuracy of the name model (BL0), the name-and-description model (BL1) and the image model, respectively.

In this third ensembled model we do not take the average but the maximum of the probabilities assigned by the models, each scaled by its weight:

$$ P(x) = max(p_{n}(x)  w_{n}  , p_{d}(x)  w_{d} , p_{i}(x)  w_{i})$$ 



In [None]:
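# Accuracy of each model (scalars). The y_pred_* names are placeholders:
# they must hold the accuracy scores before this cell can run.
acc_name = y_pred_name
acc_desc = y_pred_desc
acc_img = y_pred_img

# Normalise the scores so that the three weights sum to 1
total = acc_name + acc_desc + acc_img
wn = acc_name / total
wd = acc_desc / total
wi = acc_img / total

# Elementwise maximum of the weighted probabilities (the built-in max() fails on arrays)
y_pred_comb_c = np.maximum.reduce([y_pred_a_prob * wn, y_pred_b_prob * wd, y_pred_prob_C * wi])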
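
The accuracy-based weighting can be sketched end to end on made-up probabilities. The first two accuracies are rounded from the BL0 and BL1 results above; the image-model accuracy (0.60) is a hypothetical value chosen only for illustration.

```python
import numpy as np

# Accuracies: BL0 and BL1 rounded from the results above, image model hypothetical
acc = np.array([0.8192, 0.8207, 0.60])
weights = acc / acc.sum()  # w_n, w_d, w_i, summing to 1

# One sample, three classes, one probability row per model (made-up values)
probs = np.stack([
    np.array([[0.7, 0.2, 0.1]]),  # name model
    np.array([[0.5, 0.4, 0.1]]),  # name-and-description model
    np.array([[0.3, 0.3, 0.4]]),  # image model
])                                # shape: (n_models, n_samples, n_classes)

# Scale every model's probabilities by its weight, then take the maximum per class
weighted = probs * weights[:, None, None]
combined = weighted.max(axis=0)

print(weights)
print(combined.argmax(axis=1))  # index of the winning class per sample
```

With scores this close, the two NLP models receive almost the same weight, so the weighting mainly reduces the influence of the weaker model.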

**TO-REVISE** **DIRECT COMPARISON BETWEEN NLP MODELS**
Because of this difference in size, it is not possible to directly compare the probabilities predicted by the two kinds of models (NLP and image). However, given that our main focus is on NLP models, we can make this comparison between the two NLP models:

### 7.1. Getting predictions from our third combined model

**TO-REVISE** Extracting categories for the API

In [49]:
name_sample = "Casio - Portable Keyboard with 61 Touch-Sensitive Keys - Black/Silver "
descr_sample = "CASIO Portable Keyboard with 61 Touch-Sensitive Keys: MIDI and USB connectivity; 600 AHL keyboard voices; 180 rhythms; 152 songs; auto accompaniment"
true_label_sample = 'Keyboards'


-----------------------------------------------------------------------------------
## 8. Concluding remarks

**TO-REVISE**
SELECTED COMBINED MODEL 

### 8.1. Create a class for combined model
This class contains the following methods:

- `predict`
- `predict_proba`
- `get_best_five`

This class is used to get predictions and to evaluate the model (REFERENCE NOTEBOOK).
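
A minimal sketch of what such a class could look like, assuming the averaging strategy of Ensembled Model A. The class name `AveragedEnsemble`, its constructor arguments and the `_Stub` demo model are hypothetical; only the three method names come from the list above.

```python
import numpy as np


class AveragedEnsemble:
    """Hypothetical wrapper that averages the probabilities of fitted models."""

    def __init__(self, models, classes):
        self.models = models  # objects exposing predict_proba(X)
        self.classes_ = np.asarray(classes)

    def predict_proba(self, inputs):
        # inputs: one feature matrix per model, in the same order as self.models
        probs = [m.predict_proba(X) for m, X in zip(self.models, inputs)]
        return np.mean(probs, axis=0)

    def predict(self, inputs):
        return self.classes_[np.argmax(self.predict_proba(inputs), axis=1)]

    def get_best_five(self, inputs, k=5):
        # labels of the k most probable classes per sample, best first
        order = np.argsort(-self.predict_proba(inputs), axis=1)[:, :k]
        return self.classes_[order]


class _Stub:
    """Stand-in model returning fixed probabilities, used only for this demo."""

    def __init__(self, probs):
        self.probs = np.asarray(probs)

    def predict_proba(self, X):
        return self.probs


ens = AveragedEnsemble(
    [_Stub([[0.7, 0.2, 0.1]]), _Stub([[0.5, 0.4, 0.1]])],
    classes=["a", "b", "c"],
)
print(ens.predict([None, None]))             # ['a']
print(ens.get_best_five([None, None], k=2))  # [['a' 'b']]
```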

**Ensembled Model II**

In [37]:
a_list = ["algo1", "algo2", "other", "algo3", "algo5", "algo6" ]

In [38]:
dict_no_other = {}

In [40]:
a_list

['algo1', 'algo2', 'algo3', 'algo5', 'algo6']

In [41]:
dict_no_other

{'0': 'algo1', '1': 'algo2', '2': 'algo3', '3': 'algo5', '4': 'algo6'}
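
The cell that produced the two outputs above is not shown. One possible way to obtain them, offered only as an assumption about what was run, is to drop the `'other'` entry and re-key the remaining items by position:

```python
a_list = ["algo1", "algo2", "other", "algo3", "algo5", "algo6"]

# Drop the catch-all 'other' entry while keeping the original order
a_list = [item for item in a_list if item != "other"]

# Key the remaining items by their new position, stored as strings
dict_no_other = {str(i): item for i, item in enumerate(a_list)}

print(a_list)
print(dict_no_other)
```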