## Supplementary Material 4: Preprocessing + MLP (Neural Networks)

This example demonstrates some necessary preprocessing techniques and how the MLP (Multilayer Perceptron) classification algorithm works.

In [1]:
# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np

In this example, the main goal is to classify whether a patient has heart disease, using the Heart Disease dataset. The dataset has 13 attributes (7 numeric and 6 categorical) and 1 binary class, where 1 indicates the presence of heart disease and 0 its absence.

Download the dataset from Material A

In [2]:
# na_values: treat the value '?' (missing data) as NaN
dataset = pd.read_csv('clv_heart_disease.data', sep=',', index_col=0, na_values='?')
dataset

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,male,Typical Angina,145.0,233.0,133.051002,Left ventricular hypertrophy,150.0,no,2.3,Downsloping,0.0,Fixed defect,0
1,67.0,male,Asymptomatic,160.0,286.0,106.803350,Left ventricular hypertrophy,108.0,yes,1.5,Flat,3.0,Normal,1
2,67.0,male,Asymptomatic,120.0,229.0,105.341447,Left ventricular hypertrophy,129.0,yes,2.6,Flat,2.0,Reversable defect,1
3,37.0,male,Non-anginal pain,130.0,250.0,102.194825,Normal,187.0,no,3.5,Downsloping,0.0,Normal,0
4,41.0,female,Atypical Angine,130.0,204.0,110.926185,Left ventricular hypertrophy,172.0,no,1.4,Upsloping,0.0,Normal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,male,Typical Angina,110.0,264.0,114.129098,Normal,132.0,no,1.2,Flat,0.0,Reversable defect,1
299,68.0,male,Asymptomatic,144.0,193.0,131.664016,Normal,141.0,no,3.4,Flat,2.0,Reversable defect,1
300,57.0,male,Asymptomatic,130.0,131.0,109.655227,Normal,115.0,yes,1.2,Flat,1.0,Reversable defect,1
301,57.0,female,Atypical Angine,130.0,236.0,105.952547,Left ventricular hypertrophy,174.0,no,0.0,Flat,1.0,Normal,1


In [3]:
# show information about the columns
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    object 
 2   cp        303 non-null    object 
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    object 
 7   thalach   303 non-null    float64
 8   exang     303 non-null    object 
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    object 
 11  ca        299 non-null    float64
 12  thal      301 non-null    object 
 13  num       303 non-null    int64  
dtypes: float64(7), int64(1), object(6)
memory usage: 35.5+ KB


Note that the attributes 'ca' and 'thal' have missing data, that is, examples with no value at all.

Removing examples with missing attributes makes the most sense when most of their attributes are absent. When that is not the case, there are techniques for replacing the missing data, for example:
    <ul>
     <li>Replacement with the attribute's mean (for numeric attributes).
         <ul><li>
         https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html</li></ul>
     </li>
     <li>Replacement with the most frequent value (for categorical attributes).
         <ul><li>
         https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html</li></ul>
     </li>
     <li>Replacement based on the values of the K nearest neighbors (for categorical and numeric attributes).
         <ul>
         <li>If the attribute is numeric, the mean of the K nearest neighbors is used.</li>
         <li>If the attribute is categorical, the most frequent value among the K nearest neighbors is used.</li>
         <li>scikit-learn's KNNImputer operates on numeric data and fills values with the neighbors' mean, so categorical attributes must be encoded as numbers before it can be applied.</li>
         <li>https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html</li></ul>
     </li>
    </ul>
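
The first two strategies in the list can be sketched with `SimpleImputer`. This is a minimal illustration on a toy DataFrame with made-up values (the names `toy`, `ca_filled`, and `thal_filled` are assumptions for this sketch, not part of the original notebook, which drops the rows instead):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one missing numeric entry and one missing categorical entry
toy = pd.DataFrame({
    'ca': [0.0, 3.0, np.nan, 1.0],
    'thal': pd.Series(['Normal', np.nan, 'Normal', 'Fixed defect'], dtype=object),
})

# Numeric attribute: replace NaN with the column mean
num_imputer = SimpleImputer(strategy='mean')
ca_filled = num_imputer.fit_transform(toy[['ca']])

# Categorical attribute: replace NaN with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
thal_filled = cat_imputer.fit_transform(toy[['thal']])

print(ca_filled.ravel())
print(thal_filled.ravel())
```

Here the missing 'ca' entry receives the mean of the three observed values, and the missing 'thal' entry receives 'Normal', the most frequent category.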

For this project, we recommend using datasets with few or no missing values. If that is not possible, the examples with missing data can be dropped, although this is not always ideal practice. For this project, both dropping those examples and using the techniques described above are allowed.

In [4]:
# drop the rows with missing data
dataset = dataset.dropna(axis=0)
dataset

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,male,Typical Angina,145.0,233.0,133.051002,Left ventricular hypertrophy,150.0,no,2.3,Downsloping,0.0,Fixed defect,0
1,67.0,male,Asymptomatic,160.0,286.0,106.803350,Left ventricular hypertrophy,108.0,yes,1.5,Flat,3.0,Normal,1
2,67.0,male,Asymptomatic,120.0,229.0,105.341447,Left ventricular hypertrophy,129.0,yes,2.6,Flat,2.0,Reversable defect,1
3,37.0,male,Non-anginal pain,130.0,250.0,102.194825,Normal,187.0,no,3.5,Downsloping,0.0,Normal,0
4,41.0,female,Atypical Angine,130.0,204.0,110.926185,Left ventricular hypertrophy,172.0,no,1.4,Upsloping,0.0,Normal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57.0,female,Asymptomatic,140.0,241.0,101.072354,Normal,123.0,yes,0.2,Flat,0.0,Reversable defect,1
298,45.0,male,Typical Angina,110.0,264.0,114.129098,Normal,132.0,no,1.2,Flat,0.0,Reversable defect,1
299,68.0,male,Asymptomatic,144.0,193.0,131.664016,Normal,141.0,no,3.4,Flat,2.0,Reversable defect,1
300,57.0,male,Asymptomatic,130.0,131.0,109.655227,Normal,115.0,yes,1.2,Flat,1.0,Reversable defect,1


Note that 6 rows were removed from the dataset (4 with a missing 'ca' value and 2 with a missing 'thal' value), leaving 297 of the original 303 examples.

Most classification algorithms work only with numeric data, so categorical data (for example, a Sex attribute with the values Female and Male) must be converted to numeric form. To do this, each categorical value becomes a new attribute. For example, the value Male becomes a new attribute equal to 1 if the person is male and 0 otherwise, and the same is done for every value of every categorical attribute (one-hot encoding). pandas provides the get_dummies function for this, as shown below. scikit-learn also offers techniques for the same operation, but this project uses get_dummies from pandas.

Important: check that numeric data in the DataFrame is stored with a numeric dtype and categorical data as category or object, because get_dummies acts according to the dtype of each column (attribute).

In [5]:
dataset = pd.get_dummies(dataset)
dataset

Unnamed: 0,age,trestbps,chol,fbs,thalach,oldpeak,ca,num,sex_female,sex_male,...,restecg_Normal,restecg_ST-T wave abnormality,exang_no,exang_yes,slope_Downsloping,slope_Flat,slope_Upsloping,thal_Fixed defect,thal_Normal,thal_Reversable defect
0,63.0,145.0,233.0,133.051002,150.0,2.3,0.0,0,0,1,...,0,0,1,0,1,0,0,1,0,0
1,67.0,160.0,286.0,106.803350,108.0,1.5,3.0,1,0,1,...,0,0,0,1,0,1,0,0,1,0
2,67.0,120.0,229.0,105.341447,129.0,2.6,2.0,1,0,1,...,0,0,0,1,0,1,0,0,0,1
3,37.0,130.0,250.0,102.194825,187.0,3.5,0.0,0,0,1,...,1,0,1,0,1,0,0,0,1,0
4,41.0,130.0,204.0,110.926185,172.0,1.4,0.0,0,1,0,...,0,0,1,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57.0,140.0,241.0,101.072354,123.0,0.2,0.0,1,1,0,...,1,0,0,1,0,1,0,0,0,1
298,45.0,110.0,264.0,114.129098,132.0,1.2,0.0,1,0,1,...,1,0,1,0,0,1,0,0,0,1
299,68.0,144.0,193.0,131.664016,141.0,3.4,2.0,1,0,1,...,1,0,1,0,0,1,0,0,0,1
300,57.0,130.0,131.0,109.655227,115.0,1.2,1.0,1,0,1,...,1,0,0,1,0,1,0,0,0,1


Note that the number of attributes increased in the table above: every value of each non-numeric attribute became an attribute of its own.
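
The dtype caveat mentioned earlier can be illustrated with a small sketch. The toy DataFrame below is hypothetical: it assumes a categorical attribute ('cp') that happens to be stored as integer codes, which get_dummies would leave untouched unless the column is first converted to the category dtype.

```python
import pandas as pd

# Toy data (hypothetical values); 'cp' holds category codes stored as integers
toy = pd.DataFrame({
    'age': [63.0, 41.0, 57.0],
    'sex': ['male', 'female', 'male'],
    'cp': [1, 2, 4],
})

# Without this conversion, 'cp' would be treated as a numeric attribute
toy['cp'] = toy['cp'].astype('category')

# dtype=int requests 0/1 integers for the new columns
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```

The numeric column 'age' is kept as is, while 'sex' and 'cp' are each expanded into one column per distinct value.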

Now we separate the independent variables (the set X) from the dependent variable y (here, the column 'num').

In [6]:
# X: every column except the class column 'num'
X = dataset.drop(columns='num')
# y: the class column as a 1-D NumPy array
y = dataset['num'].to_numpy()

It is important to put the attributes on the same scale so that no attribute has more influence than the others on model training merely because of its range of values.
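
Min-max scaling maps each value x of an attribute to (x - min) / (max - min), where min and max are computed over that attribute. The sketch below applies the formula by hand to a made-up column (the values are an assumption for illustration) and compares the result with scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One hypothetical attribute with three examples
values = np.array([[20.0], [30.0], [60.0]])

# Manual min-max scaling, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same operation with scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.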

In [7]:
from sklearn.preprocessing import MinMaxScaler
# rescale the values to the range [0, 1] using each attribute's minimum and maximum
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
X

Unnamed: 0,age,trestbps,chol,fbs,thalach,oldpeak,ca,sex_female,sex_male,cp_Asymptomatic,...,restecg_Normal,restecg_ST-T wave abnormality,exang_no,exang_yes,slope_Downsloping,slope_Flat,slope_Upsloping,thal_Fixed defect,thal_Normal,thal_Reversable defect
0,0.708333,0.481132,0.244292,0.845242,0.603053,0.370968,0.000000,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.791667,0.622642,0.365297,0.260186,0.282443,0.241935,1.000000,0.0,1.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.791667,0.245283,0.235160,0.227601,0.442748,0.419355,0.666667,0.0,1.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.166667,0.339623,0.283105,0.157463,0.885496,0.564516,0.000000,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.250000,0.339623,0.178082,0.352084,0.770992,0.225806,0.000000,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,0.583333,0.433962,0.262557,0.132443,0.396947,0.032258,0.000000,1.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
293,0.333333,0.150943,0.315068,0.423476,0.465649,0.193548,0.000000,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
294,0.812500,0.471698,0.152968,0.814327,0.534351,0.548387,0.666667,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
295,0.583333,0.339623,0.011416,0.323754,0.335878,0.193548,0.333333,0.0,1.0,1.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


In [8]:
# import the Multilayer Perceptron (neural network) classification algorithm
from sklearn.neural_network import MLPClassifier
# import GridSearchCV
from sklearn.model_selection import GridSearchCV

# In the MLP, one parameter to tune is the number of neurons in the hidden layers,
# given by hidden_layer_sizes with one entry per hidden layer.
# For example:
# (5) - one hidden layer with five neurons. Python reads (5) as the integer 5,
#       which scikit-learn accepts; the explicit one-element tuple is (5,)
# (8, 5) - eight neurons in the first hidden layer and five in the second.
# For this project, testing at most two hidden layers is advisable:
# the more layers and neurons, the longer the algorithm takes to run.
parameters = {'hidden_layer_sizes' : [(5), (8), (15), (5, 3), (8, 5), (10, 5)],
              'max_iter' : [3000], 'random_state' : [42]}

# define the classification algorithm to be used
mlp = MLPClassifier()
gs_mlp = GridSearchCV(mlp, parameters, cv=5, scoring='accuracy')
# the grid search trains every model in the parameter grid above
gs_mlp.fit(X, y)

GridSearchCV(cv=5, estimator=MLPClassifier(),
             param_grid={'hidden_layer_sizes': [5, 8, 15, (5, 3), (8, 5),
                                                (10, 5)],
                         'max_iter': [3000], 'random_state': [42]},
             scoring='accuracy')
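
One caveat about the flow above: the scaler was fitted on the whole dataset before cross-validation, so the validation folds also influenced the scaling. A common alternative, sketched here as an assumption rather than as the notebook's method, is to place the scaler and the classifier in a scikit-learn Pipeline so that scaling is refitted on the training part of each fold. The data is a synthetic stand-in generated with make_classification, and the names `pipe`, `param_grid`, and `gs_pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the Heart Disease dataset)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Scaling and classification are chained, so each CV split fits the scaler
# only on its own training portion
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('mlp', MLPClassifier(max_iter=500, random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'mlp__hidden_layer_sizes': [(5,), (8, 5)]}

gs_pipe = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
gs_pipe.fit(X_demo, y_demo)
print(gs_pipe.best_params_)
```

With a small grid and short training runs, scikit-learn may emit convergence warnings; they do not stop the search.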

In [9]:
view = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(gs_mlp.cv_results_)
results[view].sort_values(by='rank_test_score')

Unnamed: 0,params,mean_test_score,std_test_score,rank_test_score
1,"{'hidden_layer_sizes': 8, 'max_iter': 3000, 'r...",0.844915,0.043885,1
0,"{'hidden_layer_sizes': 5, 'max_iter': 3000, 'r...",0.834802,0.035322,2
2,"{'hidden_layer_sizes': 15, 'max_iter': 3000, '...",0.824689,0.032412,3
4,"{'hidden_layer_sizes': (8, 5), 'max_iter': 300...",0.821243,0.044549,4
5,"{'hidden_layer_sizes': (10, 5), 'max_iter': 30...",0.81452,0.042491,5
3,"{'hidden_layer_sizes': (5, 3), 'max_iter': 300...",0.767627,0.029079,6


To make the results reproducible, it is important to set a seed in random_state; if the value is left random, the results will vary slightly between runs.
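
The effect of the seed can be checked with a small sketch. It trains the same MLP twice with the same random_state and once with a different one, then compares the learned first-layer weights. The synthetic data and the helper `fit_weights` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the Heart Disease dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)

def fit_weights(seed):
    """Train a small MLP with the given seed and return its first-layer weights."""
    model = MLPClassifier(hidden_layer_sizes=(5,), max_iter=200, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.coefs_[0]

# Same seed: the weight initialization and shuffling repeat exactly
same_seed = np.allclose(fit_weights(42), fit_weights(42))
# Different seed: the network starts from different weights
different_seed = np.allclose(fit_weights(42), fit_weights(7))

print(same_seed, different_seed)
```

After a grid search such as the one above, the selected configuration and its mean cross-validated accuracy are available as `gs_mlp.best_params_` and `gs_mlp.best_score_`.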