## 5 - Is it possible to identify inconsistent cases using NCM classification?

### Applying SMOTE to handle class imbalance

- Target: CÓDIGO NCM/SH (NCM/SH code)

- Text: DESCRIÇÃO DO PRODUTO/SERVIÇO (product/service description)

### Library Imports

In [1]:
import pandas as pd
import numpy as np
import random
import seaborn as sns
import datetime
import os
from sklearn.metrics import classification_report

import tensorflow_addons as tfa
import keras_tuner as kt
from tensorflow import keras
import tensorflow as tf

from classes import Preprocessing, Model, Lstm

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

 The versions of TensorFlow you are currently using is 2.12.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://gith

### Data Loading

In [2]:
raw_df = pd.read_csv('data/processed_nfe/nfe_100000.csv')

### Preprocessing

Type casting

In [3]:
df = Preprocessing.define_types(raw_df)
df = Preprocessing.filter_event_authorized(df)

Column definition

In [4]:
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df = df.copy()
# The NCM chapter is the first two characters of the code
df['CAPÍTULO NCM'] = df['CÓDIGO NCM/SH'].astype(str).str[:2]
df = df[['DESCRIÇÃO DO PRODUTO/SERVIÇO', 'CAPÍTULO NCM']]
df = df.rename(columns={'DESCRIÇÃO DO PRODUTO/SERVIÇO': 'DESCRICAO'})

df.head()

| | DESCRICAO | CAPÍTULO NCM |
| --- | --- | --- |
| 0 | CEBOLA KG | 70 |
| 2 | 1020-1200 #FRESA INICIAL ACCOLADE TMFZ | 90 |
| 3 | NAN COMFOR 2 400G - NESTLE | 19 |
| 4 | OLIGONUCLEOTIDEOS - IDT | 29 |
| 5 | CARTUCHO DE TONER COMPATÍVEL SAMSUNG CTL 406Y ... | 84 |
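
In the preview above, `CEBOLA KG` (onions) lands in chapter 70, whereas the NCM places fresh onions under heading 0703, that is, chapter 07. One plausible cause, which the notebook does not confirm, is that the code column was parsed as a number and lost its leading zero. A minimal sketch of a zero-padded extraction follows; `ncm_chapter` is a hypothetical helper, not part of the `Preprocessing` class.

```python
def ncm_chapter(code) -> str:
    """Return the 2-digit chapter of an NCM code, restoring lost leading zeros.

    Hypothetical helper: assumes NCM codes have 8 digits.
    """
    digits = str(code).strip()
    # Drop a trailing ".0" left behind when the column was parsed as float
    if digits.endswith(".0"):
        digits = digits[:-2]
    # Left-pad to 8 digits so that 7031019 is read as 07031019
    return digits.zfill(8)[:2]


print(ncm_chapter(7031019))     # 07
print(ncm_chapter("84439933"))  # 84
```

With pandas, the equivalent would be `df['CÓDIGO NCM/SH'].astype(str).str.zfill(8).str[:2]`, assuming the column holds integer-like values.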


In [5]:
ncms_list = df['CAPÍTULO NCM'].value_counts()[0:50].index.tolist()
df = df[df['CAPÍTULO NCM'].isin(ncms_list)]
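
The cell above keeps only the 50 most frequent chapters, so `NUM_LABELS` is capped at 50. The same filtering logic can be sketched without pandas for illustration; the function name and the toy rows are hypothetical.

```python
from collections import Counter


def keep_top_k(rows, k):
    """Keep only rows whose label is among the k most frequent labels.

    rows: list of (description, label) tuples.
    """
    counts = Counter(label for _, label in rows)
    top_labels = {label for label, _ in counts.most_common(k)}
    return [row for row in rows if row[1] in top_labels]


rows = [("a", "84"), ("b", "84"), ("c", "90"), ("d", "90"), ("e", "19")]
print(keep_top_k(rows, 2))  # rows labelled "19" are dropped
```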

Apply text preprocessing to the 'DESCRICAO' column

In [6]:
df, corpus_desc = Preprocessing.apply_preprocessing(df)

df.head()

| | DESCRICAO | CAPÍTULO NCM |
| --- | --- | --- |
| 0 | [cebola, kg] | 70 |
| 2 | [fresa, inicial, accolade, tmfz] | 90 |
| 3 | [nan, comfor, QUANTITY, nestle] | 19 |
| 4 | [oligonucleotideos, idt] | 29 |
| 5 | [cartucho, toner, compatível, samsung, ctl, y,... | 84 |


In [7]:
df_train, df_val, df_test = Preprocessing().split_dataset(df,['DESCRICAO'],'CAPÍTULO NCM')

train: 70%
val: 10%
test: 20%
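
The internals of `Preprocessing.split_dataset` are not shown. Assuming it performs a stratified 70/10/20 split (an assumption; the checks in the following cells only confirm that all 50 chapters appear in both train and validation), the idea can be sketched in plain Python. `stratified_split` and its parameters are illustrative names.

```python
import random
from collections import defaultdict


def stratified_split(labels, percents=(70, 10, 20), seed=42):
    """Split row indices into train/val/test, preserving class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


labels = ["84"] * 20 + ["90"] * 10
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 21 3 6
```

Splitting within each class is what guarantees that rare chapters are represented in every partition, as long as each class has enough rows.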


In [8]:
len(df_train['CAPÍTULO NCM'].unique())

50

In [9]:
len(df_val['CAPÍTULO NCM'].unique())

50

In [10]:
mean_sequence_length, max_sequence_length = Preprocessing.get_sequences_details(df_train)

print(f'Mean sequence length: {mean_sequence_length}')
print(f'Max sequence length: {max_sequence_length}')

Mean sequence length: 4.781898377414373
Max sequence length: 25


In [11]:
MAX_SEQUENCE_LENGTH = max_sequence_length
NUM_LABELS = len(ncms_list)

VOCAB_SIZE, X_train_padded, X_val_padded, X_test_padded = Preprocessing.adapt_X_for_input_layer(df_train['DESCRICAO'].astype(str), df_val['DESCRICAO'].astype(str), df_test['DESCRICAO'].astype(str), MAX_SEQUENCE_LENGTH)

print('Number of labels:', NUM_LABELS)
print('Training features shape:', X_train_padded.shape)
print('Validation features shape:', X_val_padded.shape)
print('Test features shape:', X_test_padded.shape)

X_train_padded

Number of labels: 50
Training features shape: (62801, 25)
Validation features shape: (8972, 25)
Test features shape: (17944, 25)


array([[    0,     0,     0, ...,     4,     1,    43],
       [    0,     0,     0, ...,   906,   404,  2521],
       [    0,     0,     0, ...,     1,   567, 12366],
       ...,
       [    0,     0,     0, ...,   166,  2693,     8],
       [    0,     0,     0, ...,     0,   167,   896],
       [    0,     0,     0, ...,     0,    95,   447]], dtype=int32)
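
The leading zeros in the array above indicate that shorter sequences are left-padded up to `MAX_SEQUENCE_LENGTH`. How `Preprocessing.adapt_X_for_input_layer` does this is not shown; the sketch below mimics the default behavior of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`) in plain Python. `pad_pre` is an illustrative name.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length.

    Assumes maxlen > 0.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_pre([[4, 1, 43], [1, 2, 3, 4, 5, 6]], maxlen=5))
# [[0, 0, 4, 1, 43], [2, 3, 4, 5, 6]]
```

Since the mean sequence length is about 4.8 tokens and the padded length is 25, most positions in each row are padding.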

In [12]:
# Resample all classes but the majority class
X_train_padded, y_train_smote = Preprocessing.smote(X_train_padded, df_train['CAPÍTULO NCM'])

pd.DataFrame(y_train_smote).value_counts()

CAPÍTULO NCM
10              10031
76              10031
49              10031
61              10031
62              10031
63              10031
68              10031
69              10031
70              10031
71              10031
73              10031
74              10031
80              10031
11              10031
81              10031
82              10031
83              10031
84              10031
85              10031
87              10031
88              10031
90              10031
94              10031
95              10031
48              10031
44              10031
42              10031
40              10031
15              10031
16              10031
17              10031
18              10031
19              10031
20              10031
21              10031
22              10031
23              10031
25              10031
27              10031
28              10031
29              10031
30              10031
31              10031
32              10031
33              100
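
The body of `Preprocessing.smote` is not shown; a common choice is `SMOTE` from imbalanced-learn, but that is an assumption. The core step of SMOTE is to create a synthetic sample on the line segment between a minority-class point and one of its nearest neighbours of the same class. A dependency-free sketch, with hypothetical function names:

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_oversample(minority, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between neighbours.

    minority: list of feature vectors of one class (at least 2 points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: squared_distance(x, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(smote_oversample(minority, n_new=3, k=2))
```

Here the features are padded token ids, so an interpolated row mixes vocabulary indices and is not guaranteed to correspond to real words. This is worth keeping in mind when reading the results below.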

In [13]:
y_train_cat, y_val_cat, y_test_cat = Preprocessing.adapt_y_for_input_layer(y_train_smote, df_val['CAPÍTULO NCM'], df_test['CAPÍTULO NCM'])

print('Training labels shape:', y_train_cat.shape)
print('Validation labels shape:', y_val_cat.shape)
print('Test labels shape:', y_test_cat.shape)

Training labels shape: (501550, 50)
Validation labels shape: (8972, 50)
Test labels shape: (17944, 50)
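
The label arrays have 50 columns, one per chapter, which is consistent with one-hot encoding. Assuming the classes are ordered by sorted label (an assumption; the ordering used by `Preprocessing.adapt_y_for_input_layer` is not shown), the encoding and the index-to-chapter mapping needed to read the classification reports can be sketched as:

```python
def one_hot(labels, classes=None):
    """One-hot encode labels; returns (rows, classes).

    classes[i] is the label represented by column i.
    """
    classes = sorted(set(labels)) if classes is None else list(classes)
    index = {label: i for i, label in enumerate(classes)}
    rows = [
        [1 if index[label] == col else 0 for col in range(len(classes))]
        for label in labels
    ]
    return rows, classes


rows, classes = one_hot(["70", "19", "70"])
print(classes)  # ['19', '70']
print(rows)     # [[0, 1], [1, 0], [0, 1]]
```

The validation and test sets should be encoded with the `classes` list fitted on the training labels, so that column `i` means the same chapter everywhere.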


In [14]:
METRICS = [
      'accuracy'
]

In [15]:
tuner = kt.RandomSearch(
    hypermodel=Lstm(VOCAB_SIZE, MAX_SEQUENCE_LENGTH, NUM_LABELS, METRICS),
    objective='accuracy',  # NOTE: training accuracy; 'val_accuracy' would select on the validation data passed to search()
    max_trials=1,
    executions_per_trial=1,
    overwrite=True,
    directory="models/hyperparameters_search",
    project_name="lstm_smote",
    seed=SEED
)

# search_space_summary() prints the summary itself and returns None
tuner.search_space_summary()

Search space summary
Default search space size: 5
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 96, 'step': 8, 'sampling': 'linear'}
activation1 (Choice)
{'default': 'relu', 'conditions': [], 'values': ['relu', 'sigmoid'], 'ordered': False}
rate (Float)
{'default': 0.1, 'conditions': [], 'min_value': 0.1, 'max_value': 0.2, 'step': 0.1, 'sampling': 'linear'}
activation2 (Choice)
{'default': 'softmax', 'conditions': [], 'values': ['softmax'], 'ordered': False}
learning_rate (Float)
{'default': 0.0001, 'conditions': [], 'min_value': 0.0001, 'max_value': 0.005, 'step': None, 'sampling': 'log'}
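
To make the search space concrete, the sketch below draws one random configuration from the ranges listed in the summary. It only illustrates what a single random-search trial samples; it is not Keras Tuner's own sampler, and `sample_hyperparameters` is a hypothetical name.

```python
import math
import random


def sample_hyperparameters(rng):
    """Draw one configuration from the search space printed above."""
    return {
        # Int: 16..96 in steps of 8
        "units": rng.randrange(16, 96 + 1, 8),
        "activation1": rng.choice(["relu", "sigmoid"]),
        # Float with step 0.1 between 0.1 and 0.2
        "rate": rng.choice([0.1, 0.2]),
        "activation2": "softmax",
        # Log sampling: uniform in log-space between 1e-4 and 5e-3
        "learning_rate": math.exp(rng.uniform(math.log(1e-4), math.log(5e-3))),
    }


print(sample_hyperparameters(random.Random(42)))
```

With `max_trials=1`, only one such configuration is ever evaluated, so the "best" hyperparameters are simply those of the single trial.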


In [16]:
logdir = os.path.join("models/logs/lstm_smote/", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = keras.callbacks.TensorBoard(logdir, histogram_freq=1)
earlystopping_callback = keras.callbacks.EarlyStopping('val_loss', mode='min', verbose=1, patience=5)

callbacks_list = [earlystopping_callback, tensorboard_callback]
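
The early-stopping rule configured above (monitor `val_loss`, `mode='min'`, `patience=5`) stops training once the validation loss has failed to improve for five consecutive epochs. A plain-Python sketch of that rule, assuming the default `min_delta` of 0; `early_stopping_epoch` is an illustrative name:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 0.5]
print(early_stopping_epoch(losses))  # 6
```

If a trial runs for only 4 epochs, as the `epochs` value reported later suggests, a patience of 5 can never be reached and the callback has no effect.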

In [17]:
tuner.search(X_train_padded, y_train_cat,
             validation_data=(X_val_padded, y_val_cat),
             callbacks=callbacks_list)

Trial 1 Complete [01h 16m 09s]
accuracy: 0.40655967593193054

Best accuracy So Far: 0.40655967593193054
Total elapsed time: 01h 16m 09s
INFO:tensorflow:Oracle triggered exit


In [18]:
best_model = tuner.get_best_models()[0]
Model.save(best_model, 'saved_models/lstm')

best_hps = tuner.get_best_hyperparameters()[0]
print(best_hps.values)

{'units': 72, 'activation1': 'relu', 'rate': 0.1, 'activation2': 'softmax', 'learning_rate': 0.0032295411136862955, 'batch_size': 8, 'epochs': 4}


In [24]:
hypermodel = Model.recover('saved_models/lstm')

EPOCHS = 4
BATCH_SIZE = 8
LEARNING_RATE = 0.0032295411136862955
LOSS = 'categorical_crossentropy'

hypermodel.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
                    loss=LOSS, 
                    metrics=METRICS) 

In [20]:
eval_result_val = hypermodel.evaluate(X_val_padded, y_val_cat)

y_probabilities_val = hypermodel.predict(X_val_padded)
y_pred_val = np.argmax(y_probabilities_val, axis=1)
y_val = np.argmax(y_val_cat, axis=1)

print('\nValidation')
print(classification_report(y_val, y_pred_val, zero_division=1))


Validation
              precision    recall  f1-score   support

           0       0.40      0.59      0.48        29
           1       0.71      0.63      0.67        57
           2       0.31      0.71      0.43        21
           3       0.50      0.85      0.63        73
           4       0.53      0.46      0.49        39
           5       0.38      0.43      0.40        14
           6       0.85      0.68      0.75       194
           7       0.68      0.63      0.65       201
           8       0.69      0.65      0.67       193
           9       0.77      0.71      0.74       125
          10       0.05      0.44      0.09        18
          11       0.63      0.69      0.66        49
          12       0.89      0.86      0.87       288
          13       0.79      0.67      0.72       112
          14       0.09      0.23      0.13        40
          15       0.56      0.67      0.61       321
          16       0.86      0.92      0.89        13
          17   
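
Returning to the question in the section title, one way to use the classifier for spotting inconsistent records is to flag rows where the model confidently predicts a chapter different from the declared one. This is a sketch under assumptions: the 0.9 threshold and the name `flag_inconsistent` are hypothetical, and the notebook does not state how inconsistencies are ultimately defined.

```python
def flag_inconsistent(probabilities, declared, threshold=0.9):
    """Flag rows whose confident prediction disagrees with the declared class.

    probabilities: per-row class probabilities (e.g. softmax outputs).
    declared: per-row declared class index.
    Returns (row, declared, predicted, confidence) tuples.
    """
    flagged = []
    for row, (probs, label) in enumerate(zip(probabilities, declared)):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if predicted != label and probs[predicted] >= threshold:
            flagged.append((row, label, predicted, probs[predicted]))
    return flagged


probs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
declared = [1, 1, 1]
print(flag_inconsistent(probs, declared))  # [(0, 1, 0, 0.95)]
```

Classes with low precision in the report would produce many false alarms under this rule, so the threshold is best chosen per class or validated on labelled examples.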

In [21]:
eval_result_test = hypermodel.evaluate(X_test_padded, y_test_cat)

y_probabilities_test = hypermodel.predict(X_test_padded)
y_pred_test = np.argmax(y_probabilities_test, axis=1)
y_test = np.argmax(y_test_cat, axis=1)

print('\nTest')
print(classification_report(y_test, y_pred_test, zero_division=1))

 61/561 [==>...........................] - ETA: 1s - loss: 1.5060 - accuracy: 0.6829


Test
              precision    recall  f1-score   support

           0       0.39      0.59      0.47        64
           1       0.74      0.61      0.67       109
           2       0.45      0.77      0.56        65
           3       0.56      0.85      0.68       164
           4       0.67      0.70      0.68        57
           5       0.52      0.73      0.61        33
           6       0.85      0.66      0.74       348
           7       0.68      0.59      0.63       367
           8       0.66      0.69      0.68       355
           9       0.70      0.75      0.73       216
          10       0.05      0.57      0.10        28
          11       0.56      0.68      0.61        78
          12       0.88      0.86      0.87       576
          13       0.73      0.70      0.72       181
          14       0.16      0.29      0.21        95
          15       0.60      0.70      0.65       762
          16       0.80      0.62      0.70        26
          17       0.