# Chatbot Orchestrator - Stage 3
-----------------------------------
## Obtain the Base Classifiers

1. Run a *Grid Search* with *Cross Validation* to find the best parameterization for each base classifier. These classifiers will then be combined in an *ensemble* scheme using a **VotingClassifier**.
2. Persist the base classifiers.
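
As a rough illustration of the ensemble idea in step 1, the sketch below combines a few already-parameterized classifiers in a scikit-learn `VotingClassifier`. The toy questions, labels, and hyperparameter values are invented for the example and are not the project's data. Hard voting is used here only because `LinearSVC` does not expose `predict_proba`.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical): two "bots", four questions each.
questions = [
    "how do I enlist in the army", "military enlistment deadline",
    "where is the enlistment office", "enlistment certificate lost",
    "covid symptoms fever", "covid vaccine schedule",
    "covid test result", "covid isolation period",
]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

# Base classifiers, as they might come out of separate grid searches.
base_clfs = [
    ("logreg", LogisticRegression(C=5, max_iter=1000)),
    ("linearsvc", LinearSVC(C=1)),
    ("sgd", SGDClassifier(loss="modified_huber", random_state=0)),
]

ensemble = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2))),
    # Hard voting: each base classifier casts one vote per sample.
    ("voting", VotingClassifier(estimators=base_clfs, voting="hard")),
])
ensemble.fit(questions, labels)
prediction = ensemble.predict(["covid vaccine"])
print(prediction)
```

Soft voting (averaging predicted probabilities) would require every base classifier to provide `predict_proba`, for example by wrapping `LinearSVC` in `CalibratedClassifierCV`.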

In [1]:
%load_ext autoreload
%autoreload 2

## Libraries Used

In [2]:
import pandas as pd
import numpy as np
import csv
import codecs
import os
import glob
import pickle
import re
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, make_scorer

from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

from sklearn.ensemble import VotingClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import json

In [3]:
import orquestrador_funcoes_gerais as ofg

## Settings

In [4]:
cfg = ofg.carregar_configuracoes()

Conferir as configurações antes de prosseguir
nome_arquivo_configuracao: config.json
------------------------------------------------------------------------------------------------------------------------
aplicar_stemmer: False
------------------------------------------------------------------------------------------------------------------------
bots: [{'bot_id': 1, 'nome': 'Alistamento Militar', 'arquivo': 'skill-alistamento-militar.json'}, {'bot_id': 2, 'nome': 'COVID', 'arquivo': 'skill-covid.json'}, {'bot_id': 3, 'nome': 'Login Único', 'arquivo': 'skill-login-unico.json'}, {'bot_id': 4, 'nome': 'IRPF 2020', 'arquivo': 'skill-perguntao-irpf-2020.json'}, {'bot_id': 5, 'nome': 'PGMEI', 'arquivo': 'skill-pgmei.json'}, {'bot_id': 6, 'nome': 'Selo Turismo Responsável', 'arquivo': 'skill-poc-selo-turismo-responsavel.json'}, {'bot_id': 7, 'nome': 'Cadastur', 'arquivo': 'skill-cadastur.json'}, {'bot_id': 8, 'nome': 'Tuberculose', 'arquivo': 'skill-tuberculose.json'}]
---------------------

In [5]:
bots=cfg['bots']

In [6]:
print('ATENÇÃO!!! Aplicação de Stemmer =', cfg['aplicar_stemmer'])

ATENÇÃO!!! Aplicação de Stemmer = False


## Load Processed Data

In [7]:
arquivo_treino_testes_processado = os.path.join(os.getcwd(),  cfg['diretorio_dados'], cfg['arquivo_treino_testes_processado']) 
df = pd.read_csv(arquivo_treino_testes_processado, index_col=None, engine='python', sep =',', encoding="utf-8")
print('Total de registros carregados:',len(df), 'de', cfg['arquivo_treino_testes_processado'])
df.tail(-1)

Total de registros carregados: 2631 de treino_testes_processado.csv


Unnamed: 0,bot_id,pergunta
1,8,aperto mao transmite tuberculose
2,6,diaria tera valor maior aderir selo turismo re...
3,3,preciso conta acesso login unico
4,3,resolver problema cpf invalido
5,4,perda total carro declarar recebimento seguro
...,...,...
2626,4,filho dependente
2627,4,contribuinte obrigado preenchimento numero recibo
2628,5,preciso imprimir guia microempreendedor indivi...
2629,7,quer dizer cnae


In [8]:
mask = (df['pergunta'].str.len() < 4)
df.loc[mask]

Unnamed: 0,bot_id,pergunta


## Load the Vectorizer

In [9]:
arquivo_vetorizador = os.path.join(os.getcwd(), cfg['diretorio_modelos'], cfg['arquivo_vetorizador'])
print('Carregando Vetorizador --->',arquivo_vetorizador,'\n')
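try:
    # The context manager closes the file even if unpickling fails
    with open(arquivo_vetorizador, 'rb') as file:
        vectorizer = pickle.load(file)
except Exception as e:
    print('Erro no carregamento do vetorizador',arquivo_vetorizador,'-->',str(e))
    raise  # nothing below can run without the vectorizer
print(vectorizer)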

Carregando Vetorizador ---> E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\modelos\vetorizador.pkl 

TfidfVectorizer(ngram_range=[1, 2], smooth_idf=False, sublinear_tf=True)
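
To make the printed settings concrete, the sketch below fits a fresh `TfidfVectorizer` with the same three non-default parameters on three questions copied from the data preview. It is only an illustration: the vectorizer actually used in this notebook is the pickled one, whose vocabulary comes from the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three already-preprocessed questions taken from the data preview.
corpus = [
    "preciso conta acesso login unico",
    "resolver problema cpf invalido",
    "filho dependente",
]

# Same non-default settings as the printed vectorizer:
# unigrams + bigrams, unsmoothed IDF, and 1 + log(tf) term-frequency scaling.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), smooth_idf=False, sublinear_tf=True)
X = toy_vectorizer.fit_transform(corpus)

print(X.shape)  # (number of documents, number of unigram + bigram features)
print(sorted(toy_vectorizer.vocabulary_)[:5])  # a few of the learned n-grams
```

Each row of `X` is L2-normalized by default, so questions of different lengths end up on a comparable scale.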


# Obtain the Base Classifier Parameterization with Grid Search

Search for the best parameters for (some of) the base classifiers that will make up the *VotingClassifier*.

After each *Grid Search*, the best classifier found is stored. These classifiers form the base of the VotingClassifier.

Since the plan is not a single multiclass Voting but several class-specific Votings, the grid searches would ideally also be run per class, yielding base-classifier parameters tuned for each class. Because that would add considerable complexity to the solution, it is left for a later iteration.
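
The helper `ofg.executa_grid_search` used in the cells that follow is not shown in this notebook. Judging only by how it is called and by its output (parameter names prefixed with `classifier__`, 3 folds, a results table sorted by rank with an `index` column), it presumably resembles the sketch below. Everything in it is an assumption made for illustration, including the function name `run_grid_search`, the macro-F1 scorer, and the toy data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def run_grid_search(param_grid, clf, X, y, vectorizer, cv=3, n_jobs=-1):
    """Hypothetical stand-in for ofg.executa_grid_search."""
    # The step name "classifier" is what makes "classifier__C" etc. resolve.
    pipeline = Pipeline([("vectorizer", vectorizer), ("classifier", clf)])
    search = GridSearchCV(
        pipeline,
        param_grid,
        scoring=make_scorer(f1_score, average="macro"),  # assumed metric
        cv=cv,
        n_jobs=n_jobs,
    )
    search.fit(X, y)
    # Best candidates first; reset_index keeps the original candidate number.
    results = (
        pd.DataFrame(search.cv_results_)
        .sort_values("rank_test_score")
        .reset_index()
    )
    return search.best_estimator_, results


# Toy usage with invented data: 2 classes, 6 questions each.
X_toy = ["login senha conta", "cpf invalido login", "acesso conta login",
         "senha login unico", "conta acesso cpf", "login unico acesso",
         "imposto renda declarar", "dependente imposto renda",
         "recibo declarar renda", "declarar imposto recibo",
         "renda dependente recibo", "imposto recibo dependente"]
y_toy = [3] * 6 + [4] * 6
grid = {"classifier__C": [0.1, 1, 5]}
best, results = run_grid_search(
    grid, LogisticRegression(max_iter=1000), X_toy, y_toy,
    TfidfVectorizer(), n_jobs=1,
)
print(best["classifier"])
print(results[["param_classifier__C", "mean_test_score", "rank_test_score"]])
```

Returning the whole fitted pipeline lets the caller pull out just the tuned classifier with `best["classifier"]`, which is how the cells in this notebook use the result.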

In [10]:
bot_id_gs = 0 # If 0, use all classes (multiclass); otherwise, one-vs-rest for that bot_id.
if bot_id_gs != 0:
    df['classe'] = df['bot_id'].apply(lambda x : 1 if x == bot_id_gs else 0)
    y_GS = df['classe'].tolist()
else:
    y_GS = df['bot_id'].tolist()

X_GS = df['pergunta'].tolist()
clfs_base = []
scores={}
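
To illustrate the per-class branch of the cell above: setting `bot_id_gs` to a specific bot turns the multiclass target into a binary one (1 for that bot, 0 for every other). A minimal sketch on a five-row frame copied from the data preview, using a vectorized comparison as an equivalent alternative to `apply`:

```python
import pandas as pd

# Five rows copied from the data preview.
toy = pd.DataFrame({
    "bot_id": [8, 3, 3, 4, 4],
    "pergunta": [
        "aperto mao transmite tuberculose",
        "preciso conta acesso login unico",
        "resolver problema cpf invalido",
        "perda total carro declarar recebimento seguro",
        "filho dependente",
    ],
})

target_bot = 3  # plays the role of bot_id_gs
# One-vs-rest target: True/False comparison cast to 1/0.
toy["classe"] = (toy["bot_id"] == target_bot).astype(int)
y_binary = toy["classe"].tolist()
print(y_binary)  # [0, 1, 1, 0, 0]
```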

### Grid Search: *Logistic Regression*

Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [11]:
param_grid = {'classifier__solver' : ['liblinear', 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'classifier__penalty' : ['l1', 'l2'],
              'classifier__C' : [0.001, 0.01, 0.1, 0.5, 1, 1.2, 1.5, 2, 3.5, 5]
            }
clf = LogisticRegression(random_state=cfg['random_state'], max_iter=1000, class_weight='balanced')
estimator, results = ofg.executa_grid_search(param_grid, clf, X_GS,  y_GS, vectorizer)
clfs_base.append(estimator['classifier'])
scores[estimator['classifier'].__class__.__name__] = results.iloc[0]['mean_test_score']
results.head(10)

Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   38.3s finished


LogisticRegression - Média Score: 0.8805013839861138 
Params: {'classifier__C': 5, 'classifier__penalty': 'l2', 'classifier__solver': 'saga'}


Unnamed: 0,index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__C,param_classifier__penalty,param_classifier__solver,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,119,0.697456,0.581413,0.013002,0.001080523,5.0,l2,saga,"{'classifier__C': 5, 'classifier__penalty': 'l...",0.877199,0.869896,0.89441,0.880501,0.010277,1
1,118,0.15336,0.006237,0.013002,0.000408316,5.0,l2,sag,"{'classifier__C': 5, 'classifier__penalty': 'l...",0.873395,0.869896,0.89441,0.879233,0.010826,2
2,116,0.602272,0.023157,0.014003,0.0004084134,5.0,l2,lbfgs,"{'classifier__C': 5, 'classifier__penalty': 'l...",0.873395,0.869896,0.89441,0.879233,0.010826,2
3,115,0.243543,0.009654,0.014002,1.94668e-07,5.0,l2,newton-cg,"{'classifier__C': 5, 'classifier__penalty': 'l...",0.873395,0.869896,0.89441,0.879233,0.010826,2
4,95,1.851157,0.01145,0.013669,0.000471539,2.0,l2,saga,"{'classifier__C': 2, 'classifier__penalty': 'l...",0.881671,0.865585,0.889277,0.878844,0.009877,5
5,103,0.24571,0.007984,0.015003,0.002160569,3.5,l2,newton-cg,"{'classifier__C': 3.5, 'classifier__penalty': ...",0.874574,0.874266,0.886258,0.878366,0.005582,6
6,104,0.630277,0.007331,0.015503,0.002483449,3.5,l2,lbfgs,"{'classifier__C': 3.5, 'classifier__penalty': ...",0.874574,0.874266,0.886258,0.878366,0.005582,6
7,106,0.181865,0.04864,0.015003,0.001080266,3.5,l2,sag,"{'classifier__C': 3.5, 'classifier__penalty': ...",0.874574,0.874266,0.886258,0.878366,0.005582,6
8,91,0.243876,0.012143,0.013835,0.00085004,2.0,l2,newton-cg,"{'classifier__C': 2, 'classifier__penalty': 'l...",0.868497,0.869122,0.886649,0.874756,0.008413,9
9,94,0.28805,0.071974,0.014169,0.0008497906,2.0,l2,sag,"{'classifier__C': 2, 'classifier__penalty': 'l...",0.868497,0.869122,0.886649,0.874756,0.008413,9


### Grid Search: *Linear SVC*

Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [12]:
param_grid = {'classifier__loss': ['squared_hinge', 'hinge'], 
              'classifier__C': [0.001, 0.01, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.7, 1, 2, 5, 10],
              'classifier__dual': [True, False],
              'classifier__penalty': ['l1', 'l2']  
             }
clf = LinearSVC(random_state=cfg['random_state'], class_weight='balanced', max_iter=2000)
estimator, results = ofg.executa_grid_search(param_grid, clf, X_GS,  y_GS, vectorizer)
clfs_base.append(estimator['classifier'])
scores[estimator['classifier'].__class__.__name__] = results.iloc[0]['mean_test_score']
results.head(10)

Fitting 3 folds for each of 104 candidates, totalling 312 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done 160 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 281 out of 312 | elapsed:    2.2s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 312 out of 312 | elapsed:    2.3s finished


LinearSVC - Média Score: 0.8877512938837833 
Params: {'classifier__C': 5, 'classifier__dual': True, 'classifier__loss': 'hinge', 'classifier__penalty': 'l2'}




Unnamed: 0,index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__C,param_classifier__dual,param_classifier__loss,param_classifier__penalty,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,91,0.141191,0.089046,0.012669,0.000624,5.0,True,hinge,l2,"{'classifier__C': 5, 'classifier__dual': True,...",0.88161,0.879158,0.902486,0.887751,0.010467,1
1,73,0.046342,0.004715,0.008668,0.000235,1.0,True,squared_hinge,l2,"{'classifier__C': 1, 'classifier__dual': True,...",0.880693,0.882585,0.895152,0.886143,0.006417,2
2,77,0.05501,0.003895,0.010335,0.001546,1.0,False,squared_hinge,l2,"{'classifier__C': 1, 'classifier__dual': False...",0.880693,0.882585,0.895152,0.886143,0.006417,2
3,65,0.040174,0.002249,0.009502,0.000707,0.7,True,squared_hinge,l2,"{'classifier__C': 0.7, 'classifier__dual': Tru...",0.884875,0.876052,0.895981,0.885636,0.008154,4
4,69,0.044674,0.002461,0.009668,0.001027,0.7,False,squared_hinge,l2,"{'classifier__C': 0.7, 'classifier__dual': Fal...",0.884875,0.876052,0.895981,0.885636,0.008154,4
5,99,0.139191,0.068958,0.010168,0.002014,10.0,True,hinge,l2,"{'classifier__C': 10, 'classifier__dual': True...",0.882414,0.872581,0.898924,0.88464,0.010869,6
6,85,0.062177,0.00741,0.012669,0.001178,2.0,False,squared_hinge,l2,"{'classifier__C': 2, 'classifier__dual': False...",0.878786,0.879438,0.895204,0.884476,0.007591,7
7,81,0.072346,0.005922,0.011835,0.002393,2.0,True,squared_hinge,l2,"{'classifier__C': 2, 'classifier__dual': True,...",0.878786,0.879438,0.895204,0.884476,0.007591,7
8,83,0.108019,0.051284,0.013836,0.000236,2.0,True,hinge,l2,"{'classifier__C': 2, 'classifier__dual': True,...",0.875943,0.878917,0.898257,0.884372,0.009893,9
9,61,0.042674,0.00165,0.008835,0.000236,0.5,False,squared_hinge,l2,"{'classifier__C': 0.5, 'classifier__dual': Fal...",0.8836,0.87712,0.888723,0.883148,0.004747,10


### Grid Search: *SVC*

Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [13]:
param_grid = {'classifier__C':[0.01, 0.1, 0.5, 0.75, 1, 1.5, 2, 10], 
              'classifier__gamma':['scale', 0.1, 0.5, 1, 1.5, 2, 2.5, 5],
              'classifier__kernel':['precomputed','rbf','poly','sigmoid'], #,'linear'],
             }
clf = SVC(random_state=cfg['random_state'], class_weight='balanced', probability=True)
estimator, results = ofg.executa_grid_search(param_grid, clf, X_GS,  y_GS, vectorizer)
clfs_base.append(estimator['classifier'])
scores[estimator['classifier'].__class__.__name__] = results.iloc[0]['mean_test_score']
results.head(10)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Fitting 3 folds for each of 256 candidates, totalling 768 fits


[Parallel(n_jobs=-1)]: Done 128 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 288 tasks      | elapsed:   39.3s
[Parallel(n_jobs=-1)]: Done 512 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 768 out of 768 | elapsed:  1.7min finished


SVC - Média Score: 0.8810979465853578 
Params: {'classifier__C': 0.75, 'classifier__gamma': 1.5, 'classifier__kernel': 'sigmoid'}


Unnamed: 0,index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__C,param_classifier__gamma,param_classifier__kernel,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,115,1.850324,0.047791,0.078014,1.123916e-07,0.75,1.5,sigmoid,"{'classifier__C': 0.75, 'classifier__gamma': 1...",0.866716,0.880398,0.89618,0.881098,0.012039,1
1,119,1.645121,0.038309,0.074346,0.001545937,0.75,2,sigmoid,"{'classifier__C': 0.75, 'classifier__gamma': 2...",0.864925,0.873067,0.898899,0.878964,0.014483,2
2,179,1.766643,0.041857,0.07518,0.001247277,1.5,1.5,sigmoid,"{'classifier__C': 1.5, 'classifier__gamma': 1....",0.869253,0.870801,0.895167,0.878407,0.011868,3
3,151,1.576943,0.028711,0.071179,0.001027612,1.0,2,sigmoid,"{'classifier__C': 1, 'classifier__gamma': 2, '...",0.865692,0.866574,0.899755,0.87734,0.015854,4
4,147,1.828153,0.016355,0.081848,0.003009687,1.0,1.5,sigmoid,"{'classifier__C': 1, 'classifier__gamma': 1.5,...",0.863249,0.871352,0.896384,0.876995,0.014103,5
5,183,1.532935,0.041851,0.072679,0.001650246,1.5,2,sigmoid,"{'classifier__C': 1.5, 'classifier__gamma': 2,...",0.863812,0.870734,0.893651,0.876066,0.012752,6
6,215,1.483426,0.019942,0.070012,0.001080376,2.0,2,sigmoid,"{'classifier__C': 2, 'classifier__gamma': 2, '...",0.867216,0.86637,0.89153,0.875039,0.011666,7
7,123,1.510598,0.022794,0.071179,0.0004715952,0.75,2.5,sigmoid,"{'classifier__C': 0.75, 'classifier__gamma': 2...",0.858788,0.87075,0.895044,0.874861,0.015084,8
8,211,1.718467,0.021499,0.07318,0.001433802,2.0,1.5,sigmoid,"{'classifier__C': 2, 'classifier__gamma': 1.5,...",0.866111,0.865745,0.891598,0.874484,0.012102,9
9,99,2.082864,0.03573,0.083681,0.001027689,0.75,scale,sigmoid,"{'classifier__C': 0.75, 'classifier__gamma': '...",0.865426,0.872747,0.884901,0.874358,0.008032,10


### Grid Search: *SGD*

Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

In [14]:
param_grid = {'classifier__max_iter': [10,20,35,50,100],
              'classifier__loss':['hinge','log','squared_hinge','modified_huber','perceptron'],
              'classifier__penalty':['l2', 'l1', 'elasticnet'],
              'classifier__alpha': [0.0001, 0.0005, 0.0007, 0.0009, 0.001, 0.0015]
             }
clf = SGDClassifier(n_jobs=-1, random_state=cfg['random_state'], class_weight='balanced')
estimator, results = ofg.executa_grid_search(param_grid, clf, X_GS,  y_GS, vectorizer)
clfs_base.append(estimator['classifier'])
scores[estimator['classifier'].__class__.__name__] = results.iloc[0]['mean_test_score']
results.head(10)

Fitting 3 folds for each of 450 candidates, totalling 1350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done 160 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 480 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 928 tasks      | elapsed:    9.1s


SGDClassifier - Média Score: 0.8878559780805455 
Params: {'classifier__alpha': 0.0005, 'classifier__loss': 'modified_huber', 'classifier__max_iter': 10, 'classifier__penalty': 'l2'}


[Parallel(n_jobs=-1)]: Done 1350 out of 1350 | elapsed:   13.0s finished


Unnamed: 0,index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__alpha,param_classifier__loss,param_classifier__max_iter,param_classifier__penalty,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,120,0.143858,0.00085,0.009502,0.000408,0.0005,modified_huber,10,l2,"{'classifier__alpha': 0.0005, 'classifier__los...",0.884961,0.87813,0.900476,0.887856,0.00935,1
1,0,0.135691,0.003966,0.009502,0.000409,0.0001,hinge,10,l2,"{'classifier__alpha': 0.0001, 'classifier__los...",0.880294,0.874776,0.905853,0.886974,0.013538,2
2,195,0.137691,0.006343,0.010668,0.002014,0.0007,modified_huber,10,l2,"{'classifier__alpha': 0.0007, 'classifier__los...",0.884875,0.877513,0.898287,0.886892,0.0086,3
3,207,0.137191,0.005483,0.012169,0.002249,0.0007,modified_huber,100,l2,"{'classifier__alpha': 0.0007, 'classifier__los...",0.884875,0.878806,0.895621,0.886434,0.006952,4
4,198,0.13519,0.007193,0.012002,0.002121,0.0007,modified_huber,20,l2,"{'classifier__alpha': 0.0007, 'classifier__los...",0.884875,0.878806,0.895621,0.886434,0.006952,4
5,201,0.142358,0.003473,0.009335,0.000236,0.0007,modified_huber,35,l2,"{'classifier__alpha': 0.0007, 'classifier__los...",0.884875,0.878806,0.895621,0.886434,0.006952,4
6,204,0.13219,0.005908,0.010669,0.002015,0.0007,modified_huber,50,l2,"{'classifier__alpha': 0.0007, 'classifier__los...",0.884875,0.878806,0.895621,0.886434,0.006952,4
7,123,0.138191,0.006343,0.010001,0.00108,0.0005,modified_huber,20,l2,"{'classifier__alpha': 0.0005, 'classifier__los...",0.883214,0.880172,0.895621,0.886336,0.006682,8
8,126,0.140858,0.00165,0.010669,0.002014,0.0005,modified_huber,35,l2,"{'classifier__alpha': 0.0005, 'classifier__los...",0.883214,0.880172,0.895621,0.886336,0.006682,8
9,129,0.13069,0.002393,0.010668,0.00165,0.0005,modified_huber,50,l2,"{'classifier__alpha': 0.0005, 'classifier__los...",0.883214,0.880172,0.895621,0.886336,0.006682,8


### Grid Search: *Random Forest*

Parameters: https://scikit-learn.org/0.15/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [15]:
param_grid = {'classifier__max_depth': [80, 100, 120, 140, 160], 
              'classifier__n_estimators': [1200, 1400, 1600, 1800],
              'classifier__min_samples_split': [2, 3, 4, 5, 6],
              'classifier__max_features':['auto','log2']            
}

clf = RandomForestClassifier(n_jobs=-1, random_state=cfg['random_state'], criterion='entropy')
estimator, results = ofg.executa_grid_search(param_grid, clf, X_GS,  y_GS, vectorizer)
clfs_base.append(estimator['classifier'])
scores[estimator['classifier'].__class__.__name__] = results.iloc[0]['mean_test_score']
results.head(10)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 480 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:  6.8min finished


RandomForestClassifier - Média Score: 0.8081470154783211 
Params: {'classifier__max_depth': 160, 'classifier__max_features': 'auto', 'classifier__min_samples_split': 2, 'classifier__n_estimators': 1600}


Unnamed: 0,index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__max_depth,param_classifier__max_features,param_classifier__min_samples_split,param_classifier__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,162,13.353836,1.10737,3.226231,0.301545,160,auto,2,1600,"{'classifier__max_depth': 160, 'classifier__ma...",0.809718,0.801435,0.813288,0.808147,0.004965,1
1,165,12.11312,1.823867,1.530601,0.247271,160,auto,3,1400,"{'classifier__max_depth': 160, 'classifier__ma...",0.811544,0.801435,0.811116,0.808032,0.004668,2
2,163,15.128481,1.507766,2.638962,0.796485,160,auto,2,1800,"{'classifier__max_depth': 160, 'classifier__ma...",0.809465,0.801435,0.811562,0.807487,0.004365,3
3,172,8.343127,0.883915,1.291893,0.323504,160,auto,5,1200,"{'classifier__max_depth': 160, 'classifier__ma...",0.813567,0.798969,0.809681,0.807406,0.006173,4
4,177,10.321639,0.998908,1.578943,0.366848,160,auto,6,1400,"{'classifier__max_depth': 160, 'classifier__ma...",0.810784,0.798969,0.811987,0.807247,0.005874,5
5,178,9.605681,1.582045,1.96301,0.070498,160,auto,6,1600,"{'classifier__max_depth': 160, 'classifier__ma...",0.810784,0.798969,0.811987,0.807247,0.005874,5
6,161,10.371481,1.360962,3.910184,1.009719,160,auto,2,1400,"{'classifier__max_depth': 160, 'classifier__ma...",0.809465,0.799964,0.812278,0.807236,0.005268,7
7,160,7.038232,0.73235,1.868827,0.311594,160,auto,2,1200,"{'classifier__max_depth': 160, 'classifier__ma...",0.808744,0.799964,0.812977,0.807228,0.005419,8
8,169,10.373982,0.695007,1.779978,0.261812,160,auto,4,1400,"{'classifier__max_depth': 160, 'classifier__ma...",0.811544,0.798969,0.810758,0.80709,0.005751,9
9,120,7.186424,0.948393,1.98068,0.49342,140,auto,2,1200,"{'classifier__max_depth': 140, 'classifier__ma...",0.811504,0.799466,0.809691,0.806887,0.005299,10


## Review of the Processing and Persistence of the Base Classifiers Obtained

In [16]:
print('Classificadores Base obtidos:')
clfs_base

Classificadores Base obtidos:


[LogisticRegression(C=5, class_weight='balanced', max_iter=1000,
                    random_state=112020, solver='saga'),
 LinearSVC(C=5, class_weight='balanced', loss='hinge', max_iter=2000,
           random_state=112020),
 SVC(C=0.75, class_weight='balanced', gamma=1.5, kernel='sigmoid',
     probability=True, random_state=112020),
 SGDClassifier(alpha=0.0005, class_weight='balanced', loss='modified_huber',
               max_iter=10, n_jobs=-1, random_state=112020),
 RandomForestClassifier(criterion='entropy', max_depth=160, n_estimators=1600,
                        n_jobs=-1, random_state=112020)]

In [17]:
print('Scores Multiclasse na Base de Treino/Testes')
print('-'*50)
for score in scores:
    print('%-25s' % score,scores[score])

Scores Multiclasse na Base de Treino/Testes
--------------------------------------------------
LogisticRegression        0.8805013839861138
LinearSVC                 0.8877512938837833
SVC                       0.8810979465853578
SGDClassifier             0.8878559780805455
RandomForestClassifier    0.8081470154783211


In [18]:
# Persist the scores to a JSON file
info = {'scores':scores}
arquivo_informacoes = os.path.join(os.getcwd(), cfg['diretorio_dados'], cfg['arquivo_informacoes'])
with open(arquivo_informacoes, 'w') as fp:
    json.dump(info, fp, indent=2)
print('Informações atualizadas em', arquivo_informacoes)

Informações atualizadas em E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\dados\info.json


In [19]:
# Persist the classifiers
print('%-25s' % 'Classificador', 'Arquivo')
print('-'*120)
for clf in clfs_base:    
    arquivo_classificador_base = cfg['padrao_arquivo_classificador_base'].replace('%classe%',clf.__class__.__name__.lower())
    arquivo_classificador_base = os.path.join(os.getcwd(), cfg['diretorio_modelos'], arquivo_classificador_base)
    with open(arquivo_classificador_base, "wb") as clf_file:
        pickle.dump(clf, clf_file)
    print('%-25s' % clf.__class__.__name__, arquivo_classificador_base)

Classificador             Arquivo
------------------------------------------------------------------------------------------------------------------------
LogisticRegression        E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\modelos\clf_base_logisticregression.pkl
LinearSVC                 E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\modelos\clf_base_linearsvc.pkl
SVC                       E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\modelos\clf_base_svc.pkl
SGDClassifier             E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\modelos\clf_base_sgdclassifier.pkl
RandomForestClassifier    E:\DataScience\PUC\TCC\tcc_orquestrador_bots_final\modelos\clf_base_randomforestclassifier.pkl
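
A later stage would presumably read these files back. The sketch below shows one way to do that round trip with `pickle` and `glob`. The temporary directory, the `clf_base_*.pkl` pattern (inferred from the file names listed above), and the two unfitted classifiers are illustrative stand-ins, not the project's actual loading code.

```python
import glob
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

models_dir = tempfile.mkdtemp()  # stand-in for cfg['diretorio_modelos']

# Save two classifiers using the same naming scheme as the listing above.
for clf in [LogisticRegression(C=5), LinearSVC(C=5, loss="hinge")]:
    name = "clf_base_%s.pkl" % clf.__class__.__name__.lower()
    with open(os.path.join(models_dir, name), "wb") as fh:
        pickle.dump(clf, fh)

# Load every base classifier back, keyed by class name.
loaded = {}
for path in sorted(glob.glob(os.path.join(models_dir, "clf_base_*.pkl"))):
    with open(path, "rb") as fh:
        clf = pickle.load(fh)
    loaded[clf.__class__.__name__] = clf

print(sorted(loaded))
```

Pickled estimators are tied to the scikit-learn version that wrote them, so the loading environment should match the one used here, and only trusted pickle files should ever be loaded.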


In [20]:
print('Fim da etapa 3!')

Fim da etapa 3!
