# Name and descriptions

In [62]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
import time
from model.text_normalizer import normalize_corpus, stopword_list
from model import evaluation
from model.utils import decoder
from scripts.build_df import build_df
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scripts import tree_utils
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from joblib import dump, load

%load_ext autoreload
%autoreload 2



In later training runs you can skip the normalization step and load the already-normalized data directly:

```python
df = pd.read_csv('data/normalized_data.csv')

name = df['name'].apply(str)
description = df['description'].apply(str)
name_and_description = df['name_and_description'].apply(str)

# Select X depending on which data you are going to use to train your model
X = name
```

## Labels and features selection
`X` depends on whether we use `name`, `description`, or `name_and_description` as the feature.

**Features:**

In [63]:
df = pd.read_csv('data/normalized_data.csv')

In [64]:
name = df['name'].apply(str)
description = df['description'].apply(str)
name_and_description = df['name_and_description'].apply(str)
X = name_and_description

In [65]:
X.shape

(51646,)

In [66]:
X.head()

0    duracel aaa batteri 4pack compat select electr...
1    duracel aa 15v coppertop batteri 4pack longlas...
2    duracel aa batteri 8pack compat select electro...
3    energ max batteri aa 4pack 4pack aa alkalin ba...
4    duracel c batteri 4pack compat select electron...
Name: name_and_description, dtype: object

**Labels:**

The `build_df()` function returns a new dataset with a custom leaf (label) for each product, chosen according to the minimum number of products required per category (`threshold`).

Call `build_df()` to extract the labels:

In [67]:
cat = build_df(json_path='data/products.json', 
             threshold=100, 
             preprocessed_csv='data/normalized_data.csv'
            ) 

In [68]:
y = cat['leaf']
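
The effect of a minimum-products threshold can be illustrated with a small sketch. The helper `collapse_rare_labels` below is hypothetical and is not the project's `build_df()`; it only shows one way labels with fewer than `threshold` products could be mapped to a catch-all label. The real function may behave differently, for example by rolling products up to a parent category.

```python
import pandas as pd


def collapse_rare_labels(labels: pd.Series, threshold: int, fallback: str = "other") -> pd.Series:
    """Replace labels that occur fewer than `threshold` times with `fallback`.

    Hypothetical helper for illustration only.
    """
    # Map every row to the frequency of its own label.
    counts = labels.map(labels.value_counts())
    # Keep frequent labels, overwrite the rare ones.
    return labels.where(counts >= threshold, fallback)


toy_labels = pd.Series(["tv", "tv", "tv", "cable", "cable", "drone"])
collapsed = collapse_rare_labels(toy_labels, threshold=2)
print(collapsed.tolist())  # ['tv', 'tv', 'tv', 'cable', 'cable', 'other']
```

A higher threshold gives fewer, better-populated classes, at the cost of coarser predictions.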

We recreate the hierarchical structure of the categories with our `make_tree()` function.

The nodes are extracted from the same dataframe generated by `build_df()`.

In [69]:
tree_dict = tree_utils.make_tree(cat, cat['category'], 'Categories', display_tree= True)

Categories
├── pcmcat312300050015
│   ├── pcmcat248700050021
│   │   ├── pcmcat303600050001
│   │   └── pcmcat179100050006
│   │       ├── pcmcat179200050003
│   │       ├── pcmcat179200050008
│   │       │   └── pcmcat748300322875
│   │       └── pcmcat179200050013
│   ├── abcat0802000
│   │   ├── abcat0811011
│   │   └── abcat0802001
│   │       └── pcmcat159300050002
│   ├── abcat0805000
│   │   └── abcat0511001
│   │       └── pcmcat266500050030
│   ├── pcmcat275600050000
│   │   └── abcat0807000
│   │       ├── abcat0807001
│   │       ├── pcmcat335400050008
│   │       └── abcat0807009
│   ├── abcat0809000
│   │   ├── abcat0809004
│   │   └── abcat0809002
│   ├── pcmcat249700050006
│   │   ├── pcmcat219100050010
│   │   ├── pcmcat286300050020
│   │   └── pcmcat272800050000
│   ├── pcmcat254000050002
│   │   └── pcmcat308100050020
│   │       └── pcmcat340500050007
│   └── pcmcat341100050005
│       └── pcmcat253700050018
│           └── pcmcat248300050003
├── other
├── abcat03000

**IMPORTANT**:
- Generate the labels and the tree in the **same step**. Otherwise `get_performance()` cannot compute the distance between predicted and true categories.

- `make_tree()` prints the tree when `display_tree=True`. With `display_tree=False` it only builds the tree structure and the dictionary of nodes, without printing anything.

In [70]:
tree_dict2 = tree_utils.make_tree(cat, cat['category'], 'Categories', display_tree= False)
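
To make the idea of a distance between predicted and true categories concrete, here is a minimal sketch. It assumes the distance is the number of edges on the path between two nodes; the exact definition used inside `get_performance()` is not shown in this notebook and may differ. The `parent` mapping and both helper functions are hypothetical, built by hand from a few node IDs in the tree printed above.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges on the path between nodes `a` and `b`."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:  # first common ancestor
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not share a root")


# Hand-written child -> parent mapping (illustrative subset of the tree).
parent = {
    "pcmcat312300050015": "Categories",
    "abcat0802000": "pcmcat312300050015",
    "abcat0811011": "abcat0802000",
    "abcat0802001": "abcat0802000",
    "abcat0809000": "pcmcat312300050015",
    "abcat0809004": "abcat0809000",
}

print(tree_distance("abcat0811011", "abcat0811011", parent))  # 0, same node
print(tree_distance("abcat0811011", "abcat0802001", parent))  # 2, siblings
print(tree_distance("abcat0811011", "abcat0809004", parent))  # 4, cousins
```

Under this definition a prediction in a sibling category is penalized less than one in a distant branch, which plain accuracy cannot express.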

## Train/test split

In [71]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20, 
    random_state=42,
    stratify = y
)
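
As a quick illustration of what `stratify=y` does, the sketch below splits a toy, imbalanced label array (not the product data) and checks that the minority class keeps the same share in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80% class "a", 20% class "b".
y_toy = np.array(["a"] * 80 + ["b"] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy,
    test_size=0.20,
    random_state=42,
    stratify=y_toy,
)

share_b_train = float(np.mean(y_tr_toy == "b"))
share_b_test = float(np.mean(y_te_toy == "b"))
print(share_b_train, share_b_test)  # both 0.2
```

With many small categories, stratification helps ensure each label appears in the test set in roughly the proportion it has overall.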

## Feature engineering
Try different values for `max_features` and `ngram_range` in TF-IDF. Also experiment with and without IDF weighting (`use_idf`) and with different `min_df` and `max_df` values.

In [72]:

tfid_vectorizer = TfidfVectorizer(max_features=5000,
                                  ngram_range=(1, 2),
                                  use_idf=False,
                                  min_df=1,
                                  norm='l2',
                                  smooth_idf=True
                                 ) 
X_train = tfid_vectorizer.fit_transform(X_train)
X_test = tfid_vectorizer.transform(X_test)
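
The sketch below shows how these `TfidfVectorizer` options change the resulting features. The three toy documents are made up in the style of the normalized product text and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "duracel aa batteri 4pack",
    "duracel aaa batteri 4pack",
    "energ max batteri aa 8pack",
]

# ngram_range controls whether word pairs become features too.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(unigrams.vocabulary_), len(uni_bigrams.vocabulary_))  # 8 17

# min_df=2 drops terms that appear in fewer than two documents.
pruned = TfidfVectorizer(min_df=2).fit(docs)
print(sorted(pruned.vocabulary_))  # ['4pack', 'aa', 'batteri', 'duracel']

# use_idf=False keeps only L2-normalized term frequencies.
no_idf = TfidfVectorizer(use_idf=False, norm="l2").fit(docs)
row = no_idf.transform([docs[0]]).toarray()[0]
print(row.round(2))  # the four terms of the first document each get 0.5
```

Note that `smooth_idf` only matters when `use_idf=True`, so with the settings used above it has no effect.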

## Modeling
In this sample notebook we train a logistic regression classifier first, then compare it with tree-based models.

### 1. Logistic Regression

In [73]:
logreg = LogisticRegression(max_iter=2000, 
                            n_jobs=-1, 
                            multi_class='multinomial')

In [74]:
logreg.fit(X_train, y_train)

### Evaluation

In [75]:
y_pred_test = logreg.predict(X_test)
y_pred_test_p = logreg.predict_proba(X_test)

y_pred_tr = logreg.predict(X_train)
y_pred_tr_p = logreg.predict_proba(X_train)

evaluation.get_performance(model=logreg,
                           pred_labels=y_pred_test, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer,
                           average='micro',
                           probs=y_pred_test_p,
                           tree= tree_dict)

Model Performance metrics:
------------------------------
Accuracy: 0.8300096805421103
Precision: 0.8300096805421103
Recall: 0.8300096805421103
F1 Score: 0.8300096805421103
Average distance between nodes categories: 0.376379477250726
Top 5 Score: 0.9688286544046466

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.84      1.00      0.91        47
                  A/V Cables & Connectors       0.70      0.84      0.76        90
                  Action Camcorder Mounts       0.63      0.79      0.70        28
           Activity Trackers & Pedometers       0.95      0.95      0.95        39
              Adapters, Cables & Chargers       0.69      0.76      0.72        71
                         Air Conditioners       0.96      0.96      0.96        28
             Air Purifier Filters & Parts       1.00      0.86      0.92        21
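
Accuracy, precision, recall and F1 are identical in the summary because the metrics are micro-averaged (`average='micro'`). In single-label multiclass classification, micro-averaged precision, recall and F1 all reduce to accuracy. The toy labels below are made up to show this, together with the macro average, which weights every class equally and can differ.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true_toy = ["tv", "tv", "tv", "tv", "cable", "drone"]
y_pred_toy = ["tv", "tv", "tv", "cable", "tv", "drone"]

acc = accuracy_score(y_true_toy, y_pred_toy)
micro_p = precision_score(y_true_toy, y_pred_toy, average="micro")
micro_r = recall_score(y_true_toy, y_pred_toy, average="micro")
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")

# The three micro-averaged scores equal accuracy; macro F1 is lower here
# because the small "cable" class is never predicted correctly.
print(acc, micro_p, micro_r, micro_f1, macro_f1)
```

If per-class behavior matters more than the overall hit rate, `average='macro'` or the per-class rows of the classification report are more informative than the micro-averaged summary.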
        


In [22]:
evaluation.get_performance(model=logreg,
                           pred_labels=y_pred_test, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer,
                           average='micro',
                           probs=y_pred_test_p,
                           tree= tree_dict)
# Settings for this run: 2000 iterations, max_features=3000, 3-grams

Model Performance metrics:
------------------------------
Accuracy: 0.8197483059051307
Precision: 0.8197483059051307
Recall: 0.8197483059051307
F1 Score: 0.8197483059051306
Average distance between nodes categories: 0.39661181026137465
Top 5 Score: 0.9655372700871249

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.87      1.00      0.93        47
                  A/V Cables & Connectors       0.69      0.84      0.76        90
                  Action Camcorder Mounts       0.67      0.79      0.72        28
           Activity Trackers & Pedometers       0.97      0.95      0.96        39
              Adapters, Cables & Chargers       0.66      0.75      0.70        71
                         Air Conditioners       0.96      0.96      0.96        28
             Air Purifier Filters & Parts       1.00      0.86      0.92        21
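
The "Top 5 Score" is presumably the share of samples whose true label is among the five classes with the highest predicted probability; how `get_performance()` computes it is not shown here, so treat that as an assumption. scikit-learn offers `top_k_accuracy_score` for this kind of metric. The sketch uses made-up probabilities and `k=2` to keep the example small.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

classes = np.array(["cable", "drone", "tv"])  # must be sorted, like model.classes_
y_true_toy = np.array(["cable", "drone", "tv", "tv"])
probs_toy = np.array([
    [0.6, 0.3, 0.1],  # true "cable": ranked 1st
    [0.5, 0.4, 0.1],  # true "drone": ranked 2nd
    [0.5, 0.3, 0.2],  # true "tv":    ranked 3rd
    [0.1, 0.2, 0.7],  # true "tv":    ranked 1st
])

top1 = top_k_accuracy_score(y_true_toy, probs_toy, k=1, labels=classes)
top2 = top_k_accuracy_score(y_true_toy, probs_toy, k=2, labels=classes)
print(top1, top2)  # 0.5 0.75
```

A large gap between top-1 and top-k scores suggests the model often ranks the right category near the top even when its first choice is wrong.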
      

### 2. Decision Tree

In [24]:
clf = DecisionTreeClassifier(random_state=42)

In [37]:
clf.fit(X_train, y_train)

In [40]:
y_pred_test = clf.predict(X_test)
y_pred_test_p = clf.predict_proba(X_test)

y_pred_tr = clf.predict(X_train)
y_pred_tr_p = clf.predict_proba(X_train)

evaluation.get_performance(model=clf,
                           pred_labels=y_pred_tr, 
                           true_labels=y_train,
                           vectorizer=tfid_vectorizer,
                           average='micro',
                           probs=y_pred_tr_p,
                           tree= tree_dict)

Model Performance metrics:
------------------------------
Accuracy: 0.9972891857875883
Precision: 0.9972891857875883
Recall: 0.9972891857875883
F1 Score: 0.9972891857875883
Average distance between nodes categories: 0.005566850614773937
Top 5 Score: 1.0

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       1.00      1.00      1.00       190
                  A/V Cables & Connectors       1.00      1.00      1.00       361
                  Action Camcorder Mounts       1.00      1.00      1.00       114
           Activity Trackers & Pedometers       1.00      1.00      1.00       154
              Adapters, Cables & Chargers       1.00      1.00      1.00       284
                         Air Conditioners       1.00      1.00      1.00       114
             Air Purifier Filters & Parts       1.00      1.00      1.00        83
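
Note that the report above is computed on the training split (`y_pred_tr` against `y_train`), so a near-perfect score mostly shows that an unconstrained tree can memorize its training data. The sketch below uses synthetic data from `make_classification` (not the product data) to show how training and test accuracy can diverge for a `DecisionTreeClassifier` with default settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with 10% label noise.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    n_classes=3, flip_y=0.1, random_state=42,
)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr_syn, y_tr_syn)
train_acc = tree.score(X_tr_syn, y_tr_syn)
test_acc = tree.score(X_te_syn, y_te_syn)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

To compare the decision tree fairly with the logistic regression results, evaluate it on `y_pred_test` and `y_test` as well.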
                    

### 3. LightGBM

In [42]:
from lightgbm import LGBMClassifier

In [43]:
lgbm = LGBMClassifier(objective='multiclass',
                      random_state=42)

In [44]:
lgbm.fit(X_train, y_train)

In [48]:
y_pred_test = lgbm.predict(X_test)
y_pred_test_p = lgbm.predict_proba(X_test)

y_pred_tr = lgbm.predict(X_train)
y_pred_tr_p = lgbm.predict_proba(X_train)

evaluation.get_performance(model=lgbm,
                           pred_labels=y_pred_test, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer,
                           average='micro',
                           probs=y_pred_test_p,
                           tree= tree_dict)

  _warn_prf(average, modifier, msg_start, len(result))


Model Performance metrics:
------------------------------
Accuracy: 0.04385285575992256
Precision: 0.04385285575992256
Recall: 0.04385285575992256
F1 Score: 0.04385285575992256
Average distance between nodes categories: 2.9200387221684414
Top 5 Score: 0.05972894482090997

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.00      0.00      0.00        47
                  A/V Cables & Connectors       0.00      0.00      0.00        90
                  Action Camcorder Mounts       0.00      0.00      0.00        28
           Activity Trackers & Pedometers       0.00      0.00      0.00        39
              Adapters, Cables & Chargers       0.00      0.00      0.00        71
                         Air Conditioners       0.00      0.00      0.00        28
             Air Purifier Filters & Parts       0.00      0.00      0.00        21
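
The `LGBMClassifier` scores above are far below the logistic regression results, and the notebook does not state why. One cheap sanity check, sketched here as a suggestion rather than something the original run did, is to compare against a trivial baseline such as scikit-learn's `DummyClassifier`. The labels below are synthetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(42)
y_toy = rng.choice(["cable", "drone", "tv"], size=300, p=[0.6, 0.3, 0.1])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# The baseline accuracy equals the share of the largest class.
_, class_counts = np.unique(y_toy, return_counts=True)
largest_share = class_counts.max() / len(y_toy)
print(baseline_acc, largest_share)
```

A model that scores close to, or below, such a baseline usually points to a training or configuration problem rather than a genuinely hard task.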
  

## Post-prediction EDA

In [64]:
df_last_result = pd.read_csv('/home/app/src/model/experiments/exp2022-12-14 17:53:58.108102/labels.csv')

In [65]:
df_last_result.head()

| Unnamed: 0 | pred_cat | true_cat | pred_cat_dec | true_cat_dec | dist |
|---|---|---|---|---|---|
| 0 | pcmcat151600050037 | pcmcat151600050037 | Keyboards | Keyboards | 0 |
| 1 | pcmcat367400050002 | abcat0912008 | Coffee, Tea & Espresso | Coffee Pods | 2 |
| 2 | abcat0507000 | abcat0507000 | Computer Cards & Components | Computer Cards & Components | 0 |
| 3 | pcmcat183800050006 | pcmcat183800050006 | Laptop Batteries | Laptop Batteries | 0 |
| 4 | pcmcat152100050038 | pcmcat152100050020 | Microphones | Recording Equipment | 3 |


In [66]:
df_last_result['dist'].value_counts()

0    8516
1     594
2     553
3     404
4     229
5      34
Name: dist, dtype: int64
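
These counts can be condensed into two numbers comparable to the "Accuracy" and "Average distance" lines of the reports: the share of exact matches (`dist == 0`) and the mean distance. The sketch below re-enters the counts shown above by hand. They belong to the saved experiment, so they need not match any particular run in this notebook exactly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
dist_counts = pd.Series({0: 8516, 1: 594, 2: 553, 3: 404, 4: 229, 5: 34})

n_samples = int(dist_counts.sum())
exact_share = dist_counts.loc[0] / n_samples
mean_dist = (dist_counts.index.to_numpy() * dist_counts.to_numpy()).sum() / n_samples

print(n_samples)              # 10330
print(round(exact_share, 3))  # 0.824
print(round(mean_dist, 3))    # 0.387
```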

In [57]:
evaluation.get_performance(model=logreg,
                           pred_labels=y_pred_test, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer,
                           average='micro',
                           tree= tree_dict)
# Settings for this run: 3000 iterations, max_features=2500

Model Performance metrics:
------------------------------
Accuracy: 0.8175217812197483
Precision: 0.8175217812197483
Recall: 0.8175217812197483
F1 Score: 0.8175217812197483
Average distance between nodes categories: 0.40203291384317524

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.85      1.00      0.92        47
                  A/V Cables & Connectors       0.69      0.82      0.75        90
                  Action Camcorder Mounts       0.70      0.75      0.72        28
           Activity Trackers & Pedometers       0.95      0.95      0.95        39
              Adapters, Cables & Chargers       0.65      0.75      0.69        71
                         Air Conditioners       0.96      0.96      0.96        28
             Air Purifier Filters & Parts       1.00      0.86      0.92        21

Logistic regression trained on names alone (`X = name`):

In [28]:
evaluation.get_performance(model=logreg,
                           pred_labels=y_pred_test, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer,
                           average='micro',
                           tree= tree_dict)

Model Performance metrics:
------------------------------
Accuracy: 0.7820909970958374
Precision: 0.7820909970958374
Recall: 0.7820909970958374
F1 Score: 0.7820909970958374
Average distance between nodes categories: 0.4915779283639884

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.85      1.00      0.92        47
                  A/V Cables & Connectors       0.67      0.78      0.72        90
                  Action Camcorder Mounts       0.52      0.57      0.54        28
           Activity Trackers & Pedometers       0.89      0.85      0.87        39
              Adapters, Cables & Chargers       0.63      0.73      0.68        71
                         Air Conditioners       0.96      0.96      0.96        28
             Air Purifier Filters & Parts       1.00      0.76      0.86        21