# Probability

In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
import time
from model.text_normalizer import normalize_corpus, stopword_list
from model import evaluation
from model.utils import decoder
from scripts.build_df import build_df
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scripts import tree_utils
from sklearn.metrics import top_k_accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
from joblib import dump, load

%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package stopwords to /home/app/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In future training runs you can skip the normalization step and load the already-normalized data directly:

```python

df = pd.read_csv('data/normalized_data.csv')

name = df['name'].apply(str)
description = df['description'].apply(str)
name_and_description = df['name_and_description'].apply(str)

# Select X depending on which data you are going to use to train your model
X = name
```

## Labels and features selection
`X` depends on which feature we choose: `name`, `description` or `name_and_description`.

**Features:**

In [2]:
df = pd.read_csv('data/normalized_data.csv')

In [3]:
name = df['name'].apply(str)
description = df['description'].apply(str)
name_and_description = df['name_and_description'].apply(str)
X1 = name

In [4]:
X1.head()

0                 duracel aaa batteri 4pack
1    duracel aa 15v coppertop batteri 4pack
2                  duracel aa batteri 8pack
3                energ max batteri aa 4pack
4                   duracel c batteri 4pack
Name: name, dtype: object

**Labels**

The `build_df()` function returns a new dataset with a custom leaf (label) for each product, determined by the minimum number of products required per category (`threshold`).

Call `build_df()` to extract the labels:

In [5]:
cat = build_df(json_path='data/products.json', 
             threshold=100, 
             preprocessed_csv='data/normalized_data.csv'
            ) 

In [6]:
y = cat['leaf']
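
To illustrate what a minimum-products threshold does to the labels, here is a minimal sketch. It is not the implementation of `build_df()`, whose internals are not shown here and may differ (for example, it may fall back to a parent category instead of a single catch-all). The helper `collapse_rare_labels` and the toy labels are hypothetical.

```python
import pandas as pd

# Toy data: hypothetical category labels, one per product.
labels = pd.Series(["tv", "tv", "tv", "cable", "cable", "mount"])

def collapse_rare_labels(labels: pd.Series, threshold: int, fallback: str = "other") -> pd.Series:
    """Replace labels that occur fewer than `threshold` times with `fallback`."""
    counts = labels.value_counts()
    frequent = counts[counts >= threshold].index
    # Keep frequent labels, overwrite everything else with the fallback label.
    return labels.where(labels.isin(frequent), fallback)

collapsed = collapse_rare_labels(labels, threshold=2)
print(collapsed.tolist())
# ['tv', 'tv', 'tv', 'cable', 'cable', 'other']
```

A higher threshold gives fewer, better-populated classes, at the cost of coarser labels.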

We recreate the hierarchical structure of our categories with our `make_tree()` function.

The nodes are extracted from the same dataframe generated by `build_df()`.

In [7]:
tree_dict = tree_utils.make_tree(cat, cat['category'], 'Categories', display_tree= True)

Categories
├── pcmcat312300050015
│   ├── pcmcat248700050021
│   │   ├── pcmcat303600050001
│   │   └── pcmcat179100050006
│   │       ├── pcmcat179200050003
│   │       ├── pcmcat179200050008
│   │       │   └── pcmcat748300322875
│   │       └── pcmcat179200050013
│   ├── abcat0802000
│   │   ├── abcat0811011
│   │   └── abcat0802001
│   │       └── pcmcat159300050002
│   ├── abcat0805000
│   │   └── abcat0511001
│   │       └── pcmcat266500050030
│   ├── pcmcat275600050000
│   │   └── abcat0807000
│   │       ├── abcat0807001
│   │       ├── pcmcat335400050008
│   │       └── abcat0807009
│   ├── abcat0809000
│   │   ├── abcat0809004
│   │   └── abcat0809002
│   ├── pcmcat249700050006
│   │   ├── pcmcat219100050010
│   │   ├── pcmcat286300050020
│   │   └── pcmcat272800050000
│   ├── pcmcat254000050002
│   │   └── pcmcat308100050020
│   │       └── pcmcat340500050007
│   └── pcmcat341100050005
│       └── pcmcat253700050018
│           └── pcmcat248300050003
├── other
├── abcat03000

**IMPORTANT**:
- Generate the labels and the tree in the **same step**. Otherwise, `get_performance()` cannot compute the distance between predicted and true categories.

- `make_tree()` prints the tree if you set `display_tree=True`. With `display_tree=False` it only builds the tree structure and the dictionary of nodes, without printing anything.

In [8]:
tree_dict2 = tree_utils.make_tree(cat, cat['category'], 'Categories', display_tree= False)
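
The tree is needed because the evaluation reports a distance between the predicted and the true category. How `get_performance()` measures that distance is not shown here. One common definition, assumed in this sketch, is the number of edges on the path between the two nodes. The `parent` mapping and both helper functions are hypothetical and unrelated to `tree_utils`.

```python
# Hypothetical child -> parent mapping for a tiny category tree.
parent = {
    "audio": "root",
    "video": "root",
    "headphones": "audio",
    "speakers": "audio",
    "tvs": "video",
}

def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def tree_distance(a, b, parent):
    """Number of edges on the path between nodes `a` and `b`."""
    path_a = path_to_root(a, parent)
    path_b = path_to_root(b, parent)
    depth_in_a = {node: depth for depth, node in enumerate(path_a)}
    # The first ancestor of `b` that also lies on `a`'s path is the lowest common ancestor.
    for steps_b, node in enumerate(path_b):
        if node in depth_in_a:
            return depth_in_a[node] + steps_b
    raise ValueError("nodes are not in the same tree")

print(tree_distance("headphones", "headphones", parent))  # 0
print(tree_distance("headphones", "speakers", parent))    # 2
print(tree_distance("headphones", "tvs", parent))         # 4
```

Under this definition a correct prediction has distance 0, and a wrong prediction that lands in a sibling category is penalized less than one in a distant branch.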

## Train/test split

In [8]:
X1_train, X1_test, y_train, y_test = train_test_split(
    X1, y,
    test_size=0.20, 
    random_state=42,
    stratify = y
)

In [9]:
X1_train.head(5)

7029                conair suprem 2in1 hot air brush white
26164    hp slimlin desktop intel pentium 4gb memori 50...
46217    mb quart discu 1200w class sq ab bridgeabl 2ch...
13187                  sabr window glass alarm 2pack white
41483                   elit beat agent preown nintendo ds
Name: name, dtype: object
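
The `stratify=y` argument keeps the class proportions of `y` in both splits, which matters when categories differ widely in size. A small sketch on toy data (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 samples of class "a", 20 of class "b".
y_toy = np.array(["a"] * 80 + ["b"] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# Share of class "b" in each split; both stay at the original 20%.
train_share_b = np.mean(y_tr == "b")
test_share_b = np.mean(y_te == "b")
print(train_share_b, test_share_b)
```

Stratified splitting needs at least two samples per class, which a minimum-products threshold on the labels helps to guarantee.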

## Feature engineering
Try different values for `max_features` and `ngram_range` in TF-IDF. Also experiment with and without IDF weighting (`use_idf`), and with different `min_df` and `max_df` values.

In [10]:
tfid_vectorizer = TfidfVectorizer(max_features=2500,
                                  ngram_range=(1, 2),
                                  use_idf=False,
                                  min_df=1,
                                  norm='l2',
                                  smooth_idf=True  # no effect while use_idf=False
                                 ) 
X_train1 = tfid_vectorizer.fit_transform(X1_train)
X_test1 = tfid_vectorizer.transform(X1_test)
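
To get a feel for what `ngram_range` and `max_features` change, the following sketch fits three vectorizers on a toy corpus and compares vocabulary sizes. The corpus is made up, loosely modeled on the normalized product names shown earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-normalized product names.
corpus = [
    "duracel aaa batteri 4pack",
    "duracel aa batteri 8pack",
    "energ max batteri aa 4pack",
]

unigrams = TfidfVectorizer(ngram_range=(1, 1), use_idf=False).fit(corpus)
uni_bigrams = TfidfVectorizer(ngram_range=(1, 2), use_idf=False).fit(corpus)
capped = TfidfVectorizer(ngram_range=(1, 2), use_idf=False, max_features=5).fit(corpus)

print(len(unigrams.vocabulary_))     # 8 distinct single tokens
print(len(uni_bigrams.vocabulary_))  # 18: the 8 unigrams plus 10 bigrams
print(len(capped.vocabulary_))       # 5: capped by max_features

# With norm='l2' (the default), every row vector has unit length.
row_norms = np.linalg.norm(uni_bigrams.transform(corpus).toarray(), axis=1)
print(np.allclose(row_norms, 1.0))   # True
```

Adding bigrams more than doubles the vocabulary even on three short names, which is why `max_features` is worth tuning together with `ngram_range`.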

### Exploration of categories with the highest probability 
 

In [12]:
logreg1 = LogisticRegression(max_iter=500, 
                            n_jobs=-1, 
                            multi_class='multinomial',
                            random_state=42)

In [13]:
logreg1.fit(X_train1, y_train)

**Obtaining predictions**

In [14]:
y_pred_test1 = logreg1.predict(X_test1)

In [14]:
# Predicted category at row 0
y_pred_test1[0]

'pcmcat151600050037'

In [24]:
# Probabilities of each category
y_pred_test1_p = logreg1.predict_proba(X_test1)

# To see the array of probabilities for the first row:
# print(y_pred_test1_p[0])
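
As a quick sanity check on what `predict_proba()` returns, this sketch fits a small model on random toy data (not the product dataset) and inspects the result: one row per sample, one column per class in the order of `classes_`, and rows that sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-class problem with two random numeric features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = np.repeat(["a", "b", "c"], 20)

clf = LogisticRegression(max_iter=500).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

print(proba.shape)     # (60, 3): (n_samples, n_classes)
print(clf.classes_)    # column order of `proba`
print(np.allclose(proba.sum(axis=1), 1.0))  # each row sums to 1
```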

In [27]:
from model import evaluation

In [28]:
evaluation.get_performance(model=logreg1,
                           pred_labels=y_pred_test1, 
                           true_labels=y_test,
                           vectorizer=tfid_vectorizer,
                           probs=y_pred_test1_p,
                           average='micro',
                           tree= tree_dict)

Model Performance metrics:
------------------------------
Accuracy: 0.8096805421103582
Precision: 0.8096805421103582
Recall: 0.8096805421103582
F1 Score: 0.8096805421103582
Average distance between nodes categories: 0.425459825750242
Top 5 Score: 0.9526621490803485

Model Classification report:
------------------------------
                                           precision    recall  f1-score   support

                      3D Printer Filament       0.85      1.00      0.92        47
                  A/V Cables & Connectors       0.64      0.80      0.71        90
                  Action Camcorder Mounts       0.58      0.54      0.56        28
           Activity Trackers & Pedometers       0.92      0.85      0.88        39
              Adapters, Cables & Chargers       0.64      0.70      0.67        71
                         Air Conditioners       1.00      0.96      0.98        28
             Air Purifier Filters & Parts       0.89      0.81      0.85        21
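
Accuracy, precision, recall and F1 are identical in the metrics above. That is expected with `average='micro'` in a single-label multiclass problem: every misclassified sample counts once as a false positive and once as a false negative, so all four metrics reduce to the share of correct predictions. A small sketch with made-up labels, using scikit-learn's metric functions directly rather than `get_performance()`:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true and predicted labels: 4 of 6 predictions are correct.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(acc, micro_p, micro_r, micro_f1)  # all four are equal (4/6)
print(macro_f1)                         # macro averaging gives a different value
```

Macro averaging weights every class equally, so it can be more informative when small categories matter as much as large ones.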
        

We can look for the index of the highest value:

In [16]:
# index of the category with the highest probability 
idx_max_prob = np.argmax(y_pred_test1_p[0])
print(idx_max_prob)

101


If we pass the index to the array of classes, we obtain the name of the category with the highest probability. It matches the category predicted by `logreg1.predict()`:

In [17]:
print(logreg1.classes_[idx_max_prob])  # name obtained by indexing classes_ with the argmax of predict_proba()
print(y_pred_test1[0])  # category name obtained by predict()

pcmcat151600050037
pcmcat151600050037
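
The same check can be run for every row at once instead of only row 0. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in for the product data; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)
# For each row, take the column with the highest probability and map it to its class label.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

matches = np.array_equal(labels_from_proba, clf.predict(X_toy))
print(matches)
```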


Next step: obtain the class indexes of the five most probable categories:

In [18]:
most_prob_cat_idx = np.argsort(-y_pred_test1_p[0])[:5]
print(most_prob_cat_idx)


[101  92  39  16  60]


Now we pass those indexes to `classes_` and save the names of the categories with the highest probabilities in a list :)

In [19]:
name_cat_most_prob= []
for idx in most_prob_cat_idx:
    nm_cat = logreg1.classes_[idx]
    name_cat_most_prob.append(nm_cat)

print(name_cat_most_prob)

['pcmcat151600050037', 'other', 'abcat0513004', 'abcat0208024', 'abcat0811004']
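
The loop above handles a single row. For all rows at once, `np.argsort` can sort along `axis=1`, and indexing the class array with the resulting index matrix returns the label names directly. A sketch with made-up class names and probabilities:

```python
import numpy as np

# Hypothetical class names and probabilities for two samples (each row sums to 1).
classes = np.array(["cables", "cameras", "laptops", "tvs"])
proba = np.array([
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.10, 0.30, 0.20],
])

k = 2
# Sort each row by descending probability and keep the first k column indexes.
top_k_idx = np.argsort(-proba, axis=1)[:, :k]
# Indexing with a 2-D integer array returns a 2-D array of labels.
top_k_labels = classes[top_k_idx]
print(top_k_labels)
# [['cameras' 'tvs']
#  ['cables' 'laptops']]
```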


scikit-learn's `top_k_accuracy_score` summarizes this idea over a whole dataset: it measures how often the true label is among the `k` classes with the highest scores. The example from its documentation:

```python
>>> import numpy as np
>>> from sklearn.metrics import top_k_accuracy_score
>>> y_true = np.array([0, 1, 2, 2])
>>> y_score = np.array([[0.5, 0.2, 0.2],  # 0 is in top 2
...                     [0.3, 0.4, 0.2],  # 1 is in top 2
...                     [0.2, 0.4, 0.3],  # 2 is in top 2
...                     [0.7, 0.2, 0.1]]) # 2 isn't in top 2
>>> top_k_accuracy_score(y_true, y_score, k=2)
0.75
>>> # Not normalizing gives the number of "correctly" classified samples
>>> top_k_accuracy_score(y_true, y_score, k=2, normalize=False)
3
```

In [19]:
top_k_accuracy_score(y_test, y_pred_test1_p, k=5, normalize=False)

9841

In [20]:

top_k_accuracy_score(y_test, y_pred_test1_p, k=5)

0.9526621490803485
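
To see what this score measures, it can be recomputed by hand: take the `k` highest-scoring columns per row and check whether the true class index is among them. The sketch below does this on the small integer-labeled example from the scikit-learn documentation, not on the product data.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2],
                    [0.3, 0.4, 0.2],
                    [0.2, 0.4, 0.3],
                    [0.7, 0.2, 0.1]])

k = 2
# Column indexes of the k highest scores in each row.
top_k_idx = np.argsort(-y_score, axis=1)[:, :k]
# A row is a hit if its true class index appears among those k columns.
hits = (top_k_idx == y_true[:, None]).any(axis=1)

manual = hits.mean()
print(manual)                                       # 0.75
print(top_k_accuracy_score(y_true, y_score, k=k))   # 0.75
```

With string labels, the columns of the score matrix must follow the sorted label order, which is how `classes_` is ordered. If the evaluated labels do not contain every class, `top_k_accuracy_score` needs the full list through its `labels` parameter, for example `labels=logreg1.classes_`.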