# Model training

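In this section I train several models to find the one that gives the best result. I considered 5 models for the study (decision tree, random forest, k-nearest neighbours, SVM with an RBF kernel and SVM with a polynomial kernel). Each model was trained both without and with transformations applied to the data, following the stages below.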

Stages:

   **Manual training of a simple model**: in this stage I train the models manually, without any transformation. The goal is to understand the models and to establish bounds for the hyperparameter optimization.
    
   **Hyperparameter optimization:** optimizes the hyperparameters of the models. No data transformations are considered in this stage.
   
   **Training models with data transformations:** using the hyperparameters obtained in the previous stage, I trained the models with several transformations applied to the training data.



Sequence for training each model:

- Split the data
- Apply a transformation (final stage only)
- Train the model
- Test the model
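
The four steps above can be sketched in a few lines. This is a minimal illustration, not the code used in this notebook: it assumes a synthetic dataset from `make_classification` (4 classes and 10 numeric features are assumptions) and an arbitrary choice of `StandardScaler` plus `KNeighborsClassifier`. The names `X_demo`, `y_demo`, `pipe` and `acc_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 4 classes, 10 features)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=6, n_classes=4,
                                     random_state=0)

# 1. split the data
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=0)

# 2. and 3. the pipeline fits the transformation on the training split only,
#           then trains the model on the transformed training data
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)

# 4. test the model (the test split is transformed with the fitted scaler)
acc_demo = accuracy_score(y_te, pipe.predict(X_te))
print(f"accuracy: {acc_demo:.3f}")
```

Wrapping the transformation and the model in one pipeline is a design choice that keeps the test split out of the fitting step.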

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# transformations
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import Normalizer

# data splitting and hyperparameter search
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

# to suppress warning messages
import warnings

import time

In [6]:
# read the data
df = pd.read_csv(".//dados//dados_originais//dados_treinamento.csv", sep=',')
df.head(3)

Unnamed: 0,age,gender,height_cm,weight_kg,body fat_%,diastolic,systolic,gripForce,sit and bend forward_cm,sit-ups counts,broad jump_cm,class
0,24.0,M,167.3,56.1,15.6,76.0,130.0,44.6,23.1,44.0,208.0,C
1,54.0,F,161.0,51.86,29.9,88.0,154.0,22.1,23.0,36.0,148.0,A
2,34.0,M,171.7,74.22,22.6,86.0,138.0,56.1,11.4,38.0,229.0,C


In [7]:
# Map the 'gender' variable to integer values
gender_map = {'M': 0, 'F': 1}
var = 'gender'
df[var] = df[var].map(gender_map)

In [8]:
# select the training variables and the target
var_train=[var for var in df.columns if var != 'class'] 

X = df[var_train]   
y = df['class'] 

# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.1,
                                                    random_state=0)

# Hyperparameter optimization

## Decision tree

In [125]:
# set up the model
tree = DecisionTreeClassifier(criterion="entropy",
                             random_state=0,
                             min_impurity_decrease=0
                             )

# determine the hyperparameter space
param_grid = dict(
    max_depth=[8,10,12,14,16],
    min_samples_split=[8,12,14,16,18,20,22,24],
    min_samples_leaf=[3,5,7],    
    )

# set up the search
search = GridSearchCV(tree, param_grid, scoring='accuracy',
                      cv=3, refit=True)

# find best hyperparameters
search.fit(X_train, y_train)
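
Once a search has been fitted, the winning configuration can be read directly from the search object. The sketch below is illustrative only: it assumes a small synthetic dataset and a reduced grid so that it runs quickly, and the names `X_demo`, `y_demo`, `demo_search` and `best_tree` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=5, random_state=0)

# Reduced grid: 2 x 2 = 4 candidate configurations
demo_search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid=dict(max_depth=[2, 4], min_samples_leaf=[3, 5]),
    scoring="accuracy", cv=3, refit=True)
demo_search.fit(X_demo, y_demo)

# Best configuration and its mean cross-validated accuracy
print(demo_search.best_params_)
print(round(demo_search.best_score_, 4))

# With refit=True the best configuration is refitted on all the data
best_tree = demo_search.best_estimator_
```

`best_params_` reports only the configuration with the highest mean score. The cells that follow also look at the spread across folds, which `cv_results_` exposes.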

In [130]:
# Select the desired results

df_results = pd.DataFrame(search.cv_results_)

columns=['mean_test_score','std_test_score', 'param_max_depth',
     'param_min_samples_leaf','param_min_samples_split', 
     'split0_test_score','split1_test_score','split2_test_score']

columns_rename=['mean_score','std_score', 'max_depth',
     'min_samples_leaf','min_samples_split', 
     'split0_score','split1_score','split2_score']

df_results = df_results[columns]
df_results.columns=columns_rename

# sort the results
df_results.sort_values(by='mean_score',
                       ascending=False,
                       inplace=True)

df_results.reset_index(inplace=True, drop=True)
df_results.head(5)

Unnamed: 0,mean_score,std_score,max_depth,min_samples_leaf,min_samples_split,split0_score,split1_score,split2_score
0,0.669862,0.009002,12,3,18,0.681969,0.660398,0.66722
1,0.66977,0.009083,12,3,20,0.682522,0.662058,0.66473
2,0.669401,0.010877,12,3,24,0.684735,0.660675,0.662794
3,0.669217,0.009484,10,3,20,0.680863,0.657633,0.669156
4,0.669125,0.010302,12,3,22,0.683628,0.660675,0.663071


In [132]:
# select the best results (copy to avoid modifying a view of df_results)
df_results_sel=df_results[df_results['mean_score']> 0.668].copy()

# sort the results by ascending 'std_score'
df_results_sel.sort_values(by='std_score',
                           ascending=True,
                           inplace=True)

df_results_sel.reset_index(inplace=True, drop=True)
df_results_sel.head(3)

Unnamed: 0,mean_score,std_score,max_depth,min_samples_leaf,min_samples_split,split0_score,split1_score,split2_score
0,0.668203,0.006436,10,5,18,0.676162,0.660398,0.66805
1,0.668295,0.007113,10,5,20,0.676991,0.659569,0.668326
2,0.668018,0.007194,12,5,20,0.678097,0.661781,0.664177
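
The selection rule used above (keep the configurations whose mean score is above a threshold, then prefer the ones with the smallest spread across folds) can be wrapped in a small helper. This is only a sketch: the function name `most_stable` is hypothetical, and the example rows are copied from the decision-tree tables above.

```python
import pandas as pd


def most_stable(results, min_mean, n=3):
    """Keep rows with mean_score above min_mean, ordered by ascending std_score."""
    selected = results[results["mean_score"] > min_mean].copy()
    return selected.sort_values("std_score").reset_index(drop=True).head(n)


# A few rows taken from the decision-tree search results
demo_results = pd.DataFrame({
    "mean_score": [0.669862, 0.669770, 0.668203, 0.668295],
    "std_score": [0.009002, 0.009083, 0.006436, 0.007113],
    "max_depth": [12, 12, 10, 10],
    "min_samples_leaf": [3, 3, 5, 5],
    "min_samples_split": [18, 20, 18, 20],
})

stable = most_stable(demo_results, min_mean=0.668)
print(stable)
```

The first row of `stable` is the configuration with the lowest `std_score`, which here matches the decision-tree settings used later in the `classificadores` dictionary.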


## Random forest

In [134]:
# set up the model
forest= RandomForestClassifier(bootstrap=True,
                               max_samples=None,
                               random_state=0)

# determine the hyperparameter space
param_grid = dict(
    max_depth=[12,14,16,18,20],
    max_features=[5,6,8],
    n_estimators=[40,50,60,80],    
    )

# set up the search
search = GridSearchCV(forest, param_grid, scoring='accuracy',
                      cv=3, refit=True)

# find best hyperparameters
search.fit(X_train, y_train)

In [138]:
# Select the desired results

df_results = pd.DataFrame(search.cv_results_)

columns=['mean_test_score', 'std_test_score',  'param_max_depth',
         'param_max_features', 'param_n_estimators', 
         'split0_test_score', 'split1_test_score', 'split2_test_score']

columns_rename=['mean_score', 'std_score',  'max_depth',
             'max_features', 'n_estimators',
             'split0_score', 'split1_score', 'split2_score']

df_results = df_results[columns]
df_results.columns=columns_rename

# sort the results
df_results.sort_values(by='mean_score',
                       ascending=False,
                       inplace=True)

df_results.reset_index(inplace=True, drop=True)
df_results.head(5)

Unnamed: 0,mean_score,std_score,max_depth,max_features,n_estimators,split0_score,split1_score,split2_score
0,0.736886,0.001571,16,6,80,0.738385,0.737555,0.734716
1,0.736148,0.003937,14,5,60,0.741704,0.733684,0.733057
2,0.735964,0.004122,14,5,80,0.741427,0.731471,0.734993
3,0.734581,0.004328,16,6,60,0.740044,0.734237,0.729461
4,0.734488,0.00479,14,5,50,0.740044,0.735066,0.728354


In [140]:
# select the best results (copy to avoid modifying a view of df_results)
df_results_sel=df_results[df_results['mean_score']> 0.732].copy()

# sort the results by ascending 'std_score'
df_results_sel.sort_values(by='std_score',
                           ascending=True,
                           inplace=True)

df_results_sel.reset_index(inplace=True, drop=True)
df_results_sel.head(3)

Unnamed: 0,mean_score,std_score,max_depth,max_features,n_estimators,split0_score,split1_score,split2_score
0,0.73329,0.000825,14,8,80,0.733407,0.734237,0.732227
1,0.73329,0.00094,18,5,80,0.734513,0.733131,0.732227
2,0.733382,0.001042,14,6,80,0.73479,0.732301,0.733057


## Models with the data transformations

In [141]:
transformacoes={
    'sem_transformacao': [],     # data without transformation
    'yeo_johnson': [PowerTransformer(method="yeo-johnson")],
    'quantile': [QuantileTransformer(output_distribution="normal")],
    'standart': [StandardScaler()],
    'normalization': [Normalizer()],
}
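
One detail worth keeping in mind when comparing these transformations: `StandardScaler`, `PowerTransformer` and `QuantileTransformer` work column by column (per feature), whereas `Normalizer` rescales each row (each sample) to unit norm. The sketch below illustrates the difference on two made-up samples; the values and the names `demo`, `by_column` and `by_row` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Two made-up samples with (height_cm, weight_kg); illustrative values only
demo = np.array([[160.0, 50.0],
                 [180.0, 90.0]])

# Column-wise: every feature ends up with mean 0 and standard deviation 1
by_column = StandardScaler().fit_transform(demo)

# Row-wise: every sample ends up with Euclidean (L2) norm 1
by_row = Normalizer().fit_transform(demo)

print(by_column.mean(axis=0), by_column.std(axis=0))
print(np.linalg.norm(by_row, axis=1))
```

Because `Normalizer` mixes the features of a sample together, it changes the data in a different way from the other three options, which may be relevant when interpreting the `normalization` results.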

In [142]:
classificadores={'decisionTree':DecisionTreeClassifier(
                               criterion="entropy",
                               max_depth=10,
                               min_samples_leaf=5,
                               min_samples_split=18,                               
                               min_impurity_decrease=0,
                               random_state=0),
                
                'randomForest':RandomForestClassifier(                             
                             max_depth=14,
                             max_features=8,
                             n_estimators=80,
                             bootstrap=True,
                             max_samples=None,
                             random_state=0),
                 
                 'kNearestNeighbour':KNeighborsClassifier(
                     n_neighbors=50,
                     weights='distance'
                     ),
                                                          
                 'rbf_SVM':svm.SVC(kernel="rbf",random_state=0,
                        gamma=2.9,                
                        max_iter=1000,
                        C=0.1),
                                                          
                 'poly3_SVM':svm.SVC(kernel="poly",
                                     degree=5,
                                     random_state=0,
                                     gamma=2.9,                
                                     max_iter=5000,
                                     C=0.5)
                
                }
                

In [143]:
# columns that will be transformed
listaVar=['age', 'height_cm', 'weight_kg',
 'body fat_%', 'diastolic', 'systolic',
 'gripForce', 'sit and bend forward_cm',
 'sit-ups counts', 'broad jump_cm']

In [144]:
warnings.filterwarnings("ignore")

dict_metricas={}

for key in transformacoes.keys():

    print("--- Treinamento para dados com a transformação: "+key)
    t_ini=time.time()

    # copy the data
    X_train_temp = X_train.copy()
    X_test_temp = X_test.copy()

    # fit the transformation on the training data, then apply it to both splits
    if transformacoes[key]:
        dados_train = X_train_temp[listaVar].values
        dados_test = X_test_temp[listaVar].values

        obj = transformacoes[key][0].fit(dados_train)

        X_train_temp[listaVar] = obj.transform(dados_train)
        X_test_temp[listaVar] = obj.transform(dados_test)

    # train and test the models
    metricas_modelo=[]
    for nome_modelo, clf in classificadores.items():

        clf.fit(X_train_temp,y_train)

        y_hat = clf.predict(X_test_temp)
        metricas_modelo.append(metrics.accuracy_score(y_test, y_hat))

        print(nome_modelo+" treinado!")

    # store the results
    dict_metricas[key]=metricas_modelo
    print("(tempo: "+str(time.time()-t_ini)+" seg.)\n")

    

--- Treinamento para dados com a transformação: sem_transformacao
decisionTree treinado!
randomForest treinado!
kNearestNeighbour treinado!
rbf_SVM treinado!
poly3_SVM treinado!
(tempo: 11.042866945266724 seg.)

--- Treinamento para dados com a transformação: yeo_johnson
decisionTree treinado!
randomForest treinado!
kNearestNeighbour treinado!
rbf_SVM treinado!
poly3_SVM treinado!
(tempo: 13.331109285354614 seg.)

--- Treinamento para dados com a transformação: quantile
decisionTree treinado!
randomForest treinado!
kNearestNeighbour treinado!
rbf_SVM treinado!
poly3_SVM treinado!
(tempo: 13.092884302139282 seg.)

--- Treinamento para dados com a transformação: standart
decisionTree treinado!
randomForest treinado!
kNearestNeighbour treinado!
rbf_SVM treinado!
poly3_SVM treinado!
(tempo: 12.39039158821106 seg.)

--- Treinamento para dados com a transformação: normalization
decisionTree treinado!
randomForest treinado!
kNearestNeighbour treinado!
rbf_SVM treinado!
poly3_SVM treinado!
(te

In [145]:
# present the results
df=pd.DataFrame(dict_metricas)
df.index=list(classificadores.keys())
df

Unnamed: 0,sem_transformacao,yeo_johnson,quantile,standart,normalization
decisionTree,0.660033,0.660033,0.659204,0.659204,0.631841
randomForest,0.72471,0.727197,0.725539,0.725539,0.679934
kNearestNeighbour,0.58126,0.605307,0.604478,0.613599,0.58209
rbf_SVM,0.50995,0.547264,0.537313,0.560531,0.328358
poly3_SVM,0.334992,0.373964,0.338308,0.404643,0.523217
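
To read the table above more easily, the best transformation for each model can be extracted with `idxmax`. The sketch below re-enters the accuracies from the table by hand; the names `results` and `summary` are hypothetical. When two columns tie, `idxmax` reports the first one.

```python
import pandas as pd

# Test accuracies copied from the table above
results = pd.DataFrame(
    {
        "sem_transformacao": [0.660033, 0.724710, 0.581260, 0.509950, 0.334992],
        "yeo_johnson": [0.660033, 0.727197, 0.605307, 0.547264, 0.373964],
        "quantile": [0.659204, 0.725539, 0.604478, 0.537313, 0.338308],
        "standart": [0.659204, 0.725539, 0.613599, 0.560531, 0.404643],
        "normalization": [0.631841, 0.679934, 0.582090, 0.328358, 0.523217],
    },
    index=["decisionTree", "randomForest", "kNearestNeighbour",
           "rbf_SVM", "poly3_SVM"],
)

# Best column (transformation) and best value per row (model)
summary = pd.DataFrame({
    "best_transformation": results.idxmax(axis=1),
    "best_accuracy": results.max(axis=1),
})
print(summary.sort_values("best_accuracy", ascending=False))
```

In this table the random forest has the highest accuracy for every transformation, and its best value (0.727197) is obtained with `yeo_johnson`.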


In [23]:
#=====================================
# Do not run and do not delete
#=====================================

# previous result, before the hyperparameters were chosen by optimization

# present the results
df=pd.DataFrame(dict_metricas)
df.index=list(classificadores.keys())
df

Unnamed: 0,sem_transformacao,yeo_johnson,quantile,standart,normalization
decisionTree,0.624378,0.621891,0.621061,0.624378,0.597015
randomForest,0.718905,0.721393,0.719735,0.720564,0.675788
kNearestNeighbour,0.576285,0.600332,0.596186,0.614428,0.58126
rbf_SVM,0.50995,0.547264,0.537313,0.560531,0.328358
poly3_SVM,0.334992,0.373964,0.338308,0.404643,0.523217
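
The change between the two runs can be computed directly from the two tables. The sketch below uses only the `sem_transformacao` column for the three models whose values differ between the runs; the names `before`, `after` and `gain` are hypothetical.

```python
import pandas as pd

models = ["decisionTree", "randomForest", "kNearestNeighbour"]

# 'sem_transformacao' accuracies copied from the two tables
before = pd.Series([0.624378, 0.718905, 0.576285], index=models)
after = pd.Series([0.660033, 0.724710, 0.581260], index=models)

# Absolute change in test accuracy between the two runs
gain = (after - before).round(6)
print(gain)
```

For the data without transformation, the largest change is for the decision tree (about 0.036), while the other two models change by less than 0.01.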
