# Practical Project #5

PP5 is a [competition](https://www.kaggle.com/c/combatendo-o-cncer-de-mama) among the students of the course, hosted on the [Kaggle](https://www.kaggle.com) platform.

In this practical project, teams must identify the role of biological markers in the presence or absence of breast cancer. Breast cancer is the most common type of cancer among women worldwide and in Brazil, after non-melanoma skin cancer, accounting for about 28% of new cases each year. Breast cancer also affects men, but this is rare, representing only 1% of all cases of the disease. For 2018, 60 thousand new cases of the disease were estimated, according to [INCA](http://www2.inca.gov.br/wps/wcm/connect/tiposdecancer/site/home/mama).

Researchers at the University of Coimbra obtained 10 quantitative predictors, corresponding to anthropometric data of patients and to parameters that can be gathered in routine blood tests. If intelligent models based on these predictors are accurate, these biomarkers have the potential to be used as an indicator of breast cancer. Read more at [UCI Breast Cancer Coimbra](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra).

- Student: Jean Phelipe de Oliveira Lima. Enrollment number: 1615080096

## Libraries

In [1]:
import pandas as pd
import numpy as np
from math import ceil
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score
import warnings
warnings.filterwarnings('ignore')

## Breast Cancer Coimbra Dataset - Training set

In [2]:
dataset_with_id = pd.read_csv('train.csv')
dataset = dataset_with_id.drop(['id'], axis=1)
dataset.head()

|   | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 | Classification |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47 | 22.03 | 84 | 2.869 | 0.59 | 26.65 | 38.04 | 3.32 | 191.72 | 1 |
| 1 | 75 | 30.48 | 152 | 7.01 | 2.6283 | 50.53 | 10.06 | 11.73 | 99.45 | 2 |
| 2 | 25 | 22.86 | 82 | 4.09 | 0.8273 | 20.45 | 23.67 | 5.14 | 313.73 | 1 |
| 3 | 54 | 24.2188 | 86 | 3.73 | 0.7913 | 8.6874 | 3.7052 | 10.3446 | 635.049 | 2 |
| 4 | 69 | 35.0927 | 101 | 5.646 | 1.4066 | 83.4821 | 6.797 | 82.1 | 263.499 | 1 |


In [3]:
def piramide_geometrica(ni, no, alfa):
    # Number of hidden neurons: alfa * sqrt(ni * no), rounded up
    nh = alfa*((ni*no)**(1/2))
    return ceil(nh)
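
The function above matches what is usually called the geometric pyramid rule for sizing hidden layers. Assuming that reading is right, with $N_i$ input neurons, $N_o$ output neurons and a scaling factor $\alpha$, the number of hidden neurons is:

$$
N_h = \left\lceil \alpha \sqrt{N_i \, N_o} \right\rceil
$$

For the values used in this notebook ($N_i = 9$, $N_o = 1$), $\alpha = 2$ gives $\lceil 2 \cdot 3 \rceil = 6$ and $\alpha = 3$ gives $\lceil 3 \cdot 3 \rceil = 9$.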

In [4]:
def hidden_layers(layers, nh):
    # Append every way of splitting nh neurons across two hidden layers: (i, nh - i)
    for i in range(1, nh):
        neurons_layers = (i, nh-i)
        layers.append(neurons_layers)
    return layers

In [5]:
num_in = 9
num_out = 1
alpha = [2, 3]
layers = []

for i in range(len(alpha)):
    nh = piramide_geometrica(num_in, num_out, alpha[i])
    print('For α = %.1f, Nh = %d'%(alpha[i],nh))
    hidden_layers(layers, nh)  # appends every two-layer split of nh neurons to 'layers'
    
print()
print('Hidden layer distributions:\n')
for i in layers:
    print(i)

For α = 2.0, Nh = 6
For α = 3.0, Nh = 9

Hidden layer distributions:

(1, 5)
(2, 4)
(3, 3)
(4, 2)
(5, 1)
(1, 8)
(2, 7)
(3, 6)
(4, 5)
(5, 4)
(6, 3)
(7, 2)
(8, 1)


### Training Multilayer Perceptron Artificial Neural Networks

- Grid search is used to find the best hyperparameters for the network.
- Cross-validation with k = 5 folds.
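
As a minimal sketch of the two points above, the snippet below runs a 5-fold grid search over a deliberately tiny grid and reads back the winning combination. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `demo_grid` and `demo_search` are assumptions made for illustration; they are not part of the project data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the training data (assumption: 9 predictors, 2 classes).
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)

# A deliberately tiny grid so the sketch runs quickly.
demo_grid = {'hidden_layer_sizes': [(2, 4), (3, 3)]}

demo_search = GridSearchCV(
    MLPClassifier(solver='lbfgs', max_iter=200, random_state=0),
    demo_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

# Best combination and its mean accuracy over the 5 validation folds.
print(demo_search.best_params_)
print(demo_search.best_score_)
```

`cv_results_` holds one row per grid combination, which is what the results table further down is built from.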

In [6]:
parameters = {'solver': ['lbfgs'], 
              'activation': ['identity'],
              'hidden_layer_sizes': layers,
              'max_iter':[1000],
              # batch_size only affects stochastic solvers; 'lbfgs' does not use mini-batches
              'batch_size': [16, 32]}

# Grid search with cross-validation, k = 5 folds
gs = GridSearchCV(MLPClassifier(), 
                  parameters, 
                  cv=5, 
                  scoring='accuracy')

In [7]:
# Predictor attributes
x = dataset.drop(['Classification'], axis = 1) 

# Target attribute
y = dataset.Classification 

In [8]:
gs.fit(x, y)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'solver': ['lbfgs'], 'activation': ['identity'], 'hidden_layer_sizes': [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1), (1, 8), (2, 7), (3, 6), (4, 5), (5, 4), (6, 3), (7, 2), (8, 1)], 'max_iter': [1000], 'batch_size': [16, 32]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

### GridSearch results

#### Top 7 ANNs from the search

In [9]:
results = pd.DataFrame(gs.cv_results_)
analysis_dict = {}

analysis_dict['hidden_layer_sizes'] = results['param_hidden_layer_sizes']
analysis_dict['activation'] = results['param_activation']
analysis_dict['max_iter'] = results['param_max_iter']
analysis_dict['batch_size'] = results['param_batch_size']
analysis_dict['mean_test_accuracy'] = results['mean_test_score']

analysis_dataset = pd.DataFrame(analysis_dict)
top7 = analysis_dataset.sort_values('mean_test_accuracy', ascending=False).head(7)
top7

|   | hidden_layer_sizes | activation | max_iter | batch_size | mean_test_accuracy |
|---|---|---|---|---|---|
| 19 | (2, 7) | identity | 1000 | 32 | 0.717391 |
| 18 | (1, 8) | identity | 1000 | 32 | 0.673913 |
| 7 | (3, 6) | identity | 1000 | 16 | 0.673913 |
| 22 | (5, 4) | identity | 1000 | 32 | 0.663043 |
| 13 | (1, 5) | identity | 1000 | 32 | 0.652174 |
| 23 | (6, 3) | identity | 1000 | 32 | 0.652174 |
| 16 | (4, 2) | identity | 1000 | 32 | 0.652174 |


## Solution #1

In the first solution, each input is submitted to the 7 best neural networks found by the grid search.
The final answer is the class that appears most often among the outputs of the 7 networks.
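
The voting rule can be sketched in isolation as follows. This is only an illustration: `majority_vote` and `example_outputs` are hypothetical names, and the list of outputs is made up rather than taken from the trained networks.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label that occurs most often in `predictions`."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of 7 networks for a single patient.
example_outputs = [1, 2, 2, 1, 2, 2, 1]
print(majority_vote(example_outputs))  # 2 (four votes against three)
```

With an odd number of voters and two classes, a tie cannot occur, which is one practical reason to use 7 networks rather than an even number.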

In [38]:
top7_matrix=[]
for i in top7.index:
    top7_matrix.append(list(top7.loc[i]))

mlps = []
for i in top7_matrix:
    mlps.append(MLPClassifier(hidden_layer_sizes = i[0], 
                     activation = i[1], 
                     max_iter=i[2], 
                     solver = 'lbfgs',
                     batch_size = i[3]))
    
for i in range(len(mlps)):
    mlps[i].fit(x, y)

In [12]:
# Function that finds which output occurs most often
def vote(classes, predict):
    winner = classes[0]
    for i in classes:
        if predict.count(i)> predict.count(winner):
            winner = i
    return winner

In [13]:
# Function that defines Solution #1
def predict_mlps_winner(data, mlps):
    predicts = []
    for i in range(len(mlps)):
        # predict returns a one-element array; keep only the scalar label
        predicts.append(mlps[i].predict([data])[0])
    return vote([1,2], predicts)

### Test set

In [14]:
testes_with_id = pd.read_csv('test.csv')
testes = testes_with_id.drop(['id'], axis = 1)
testes.head()

|   | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 22.6562 | 92 | 3.482 | 0.7902 | 9.8648 | 11.2362 | 10.6955 | 703.973 |
| 1 | 29 | 23.01 | 82 | 5.663 | 1.1454 | 35.59 | 26.72 | 4.58 | 174.8 |
| 2 | 75 | 25.7 | 94 | 8.079 | 1.8733 | 65.926 | 3.7412 | 4.4968 | 206.802 |
| 3 | 44 | 27.8876 | 99 | 9.208 | 2.2486 | 12.6757 | 5.4782 | 23.0331 | 407.206 |
| 4 | 75 | 23.0 | 83 | 4.952 | 1.0138 | 17.127 | 11.579 | 7.0913 | 318.302 |


### Tests for Solution #1

In [39]:
results = []
for i in range(len(testes)):
    results.append(predict_mlps_winner(testes.loc[i], mlps))

In [16]:
dict_results = {'id': list(testes_with_id.id),'Classification': results}
submission = pd.DataFrame(dict_results)
submission.head()

|   | id | Classification |
|---|---|---|
| 0 | 100 | 2 |
| 1 | 78 | 1 |
| 2 | 77 | 1 |
| 3 | 113 | 2 |
| 4 | 86 | 1 |


In [17]:
submission.to_csv('submission.csv', index=False)

## Solution #2

This solution implements an *ensemble*. The outputs of the 7 best ANNs found by the grid search are fed to another neural network, which decides the final answer.
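
A compact sketch of this two-level arrangement is shown below. It uses synthetic data and only three small base networks to stay fast; `X_demo`, `y_demo`, `base_models`, `meta_X` and `meta_model` are illustrative names and assumptions, not the notebook's own objects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the training data (assumption, not the Coimbra data).
X_demo, y_demo = make_classification(n_samples=60, n_features=9, random_state=0)

# Three small base networks instead of seven, to keep the sketch fast.
base_models = []
for sizes in [(2, 4), (3, 3), (4, 2)]:
    model = MLPClassifier(hidden_layer_sizes=sizes, solver='lbfgs',
                          max_iter=200, random_state=0)
    base_models.append(model.fit(X_demo, y_demo))

# One column per base network: its predicted label for every training row.
meta_X = np.column_stack([m.predict(X_demo) for m in base_models])

# The second-level network learns to map the base outputs to the final label.
meta_model = MLPClassifier(hidden_layer_sizes=(6,), max_iter=500, random_state=0)
meta_model.fit(meta_X, y_demo)

final_labels = meta_model.predict(meta_X)
print(meta_X.shape, final_labels.shape)  # (60, 3) (60,)
```

Here the second-level network is trained on predictions the base networks made for their own training rows. A common alternative is to build `meta_X` from out-of-fold predictions (for example with `sklearn.model_selection.cross_val_predict`), which tends to give a less optimistic picture.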

### New training set
Outputs of the 7 best ANNs for each instance of the original dataset.

In [22]:
new_dataset = []
for i in range(len(mlps)):
    new_dataset.append(mlps[i].predict(x))
new_dataset.append(np.asarray(y))

In [23]:
new_dataset = np.asarray(new_dataset)
new_dataset = pd.DataFrame(new_dataset.transpose(), columns=[1,2,3,4,5,6,7,'Classification'])
new_dataset.head()

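|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Classification |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 |
| 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 2 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 |
| 3 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 2 |
| 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |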


In [24]:
# Function that defines Solution #2
def ensemble_winner(data, mlps, ensemble):
    predicts = []
    for i in range(len(mlps)):
        predicts.append(mlps[i].predict([data]))
    predicts = np.asarray(predicts)
    return ensemble.predict(predicts.transpose())

### Training - *ensemble*

In [25]:
ensemble = MLPClassifier(hidden_layer_sizes=(14,7),
                         activation= 'relu',
                         batch_size = 3,
                         max_iter = 1000,
                         learning_rate = 'constant',
                         learning_rate_init = 0.0005)

# Predictor attributes of the new dataset
new_x = new_dataset.drop(['Classification'], axis=1)
# Target attribute of the new dataset
new_y = new_dataset.Classification

# Train the ensemble
ensemble.fit(new_x, new_y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size=3, beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(14, 7), learning_rate='constant',
       learning_rate_init=0.0005, max_iter=1000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

### Tests for Solution #2

In [42]:
ensemble_results = []

for i in range(len(testes)):
    ensemble_results.append(ensemble_winner(testes.loc[i], mlps, ensemble))

In [43]:
results = []
for i in ensemble_results:
    results.append(i[0])
dict_results = {'id': list(testes_with_id.id),'Classification': results}
submission = pd.DataFrame(dict_results)
submission.head()

|   | id | Classification |
|---|---|---|
| 0 | 100 | 2 |
| 1 | 78 | 1 |
| 2 | 77 | 1 |
| 3 | 113 | 2 |
| 4 | 86 | 1 |


In [28]:
submission.to_csv('submission.csv', index=False)

## Solution #3

This solution is also an ensemble. This time, however, the 7 best neural networks found by the grid search provide their prediction probabilities (*predict_proba*) as input to a new ANN, which decides the final answer.
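
The construction of the probability features can be sketched compactly as below. This is an illustrative alternative under stated assumptions: `proba_frame` is a hypothetical helper, the data is synthetic, and only two base networks are trained; the column names simply mirror the `rnaN_probM` pattern used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def proba_frame(models, data):
    """Place each model's predict_proba output side by side in a DataFrame."""
    columns = [f'rna{i}_prob{c}'
               for i in range(1, len(models) + 1)
               for c in (1, 2)]
    stacked = np.hstack([m.predict_proba(data) for m in models])
    return pd.DataFrame(stacked, columns=columns)

# Synthetic stand-in for the training data (assumption, not the Coimbra data).
X_demo, y_demo = make_classification(n_samples=60, n_features=9, random_state=0)

demo_models = []
for sizes in [(2, 4), (3, 3)]:
    model = MLPClassifier(hidden_layer_sizes=sizes, solver='lbfgs',
                          max_iter=200, random_state=0)
    demo_models.append(model.fit(X_demo, y_demo))

demo_proba = proba_frame(demo_models, X_demo)
print(list(demo_proba.columns))
print(demo_proba.shape)
```

Because the same helper can be called on the training and the test predictors, the two feature tables are guaranteed to share column names and order.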

### Building the new dataset

In [44]:
def proba_into_list(proba, proba_list_1, proba_list_2):
    for i in range(len(proba)):
        proba_list_1.append(proba[i][0])
        proba_list_2.append(proba[i][1])

rna1_prob1 = []
rna1_prob2 = []
rna2_prob1 = []
rna2_prob2 = []
rna3_prob1 = []
rna3_prob2 = []
rna4_prob1 = []
rna4_prob2 = []
rna5_prob1 = []
rna5_prob2 = []
rna6_prob1 = []
rna6_prob2 = []
rna7_prob1 = []
rna7_prob2 = []

proba = mlps[0].predict_proba(x)
proba_into_list(proba, rna1_prob1, rna1_prob2)
proba = mlps[1].predict_proba(x)
proba_into_list(proba, rna2_prob1, rna2_prob2)
proba = mlps[2].predict_proba(x)
proba_into_list(proba, rna3_prob1, rna3_prob2)
proba = mlps[3].predict_proba(x)
proba_into_list(proba, rna4_prob1, rna4_prob2)
proba = mlps[4].predict_proba(x)
proba_into_list(proba, rna5_prob1, rna5_prob2)
proba = mlps[5].predict_proba(x)
proba_into_list(proba, rna6_prob1, rna6_prob2)
proba = mlps[6].predict_proba(x)
proba_into_list(proba, rna7_prob1, rna7_prob2)

    
classification = np.asarray(y)

proba_dataset = {'rna1_prob1':rna1_prob1, 
                 'rna1_prob2':rna1_prob2, 
                 'rna2_prob1':rna2_prob1,
                 'rna2_prob2':rna2_prob2,
                 'rna3_prob1':rna3_prob1,
                 'rna3_prob2':rna3_prob2,
                 'rna4_prob1':rna4_prob1, 
                 'rna4_prob2':rna4_prob2, 
                 'rna5_prob1':rna5_prob1,
                 'rna5_prob2':rna5_prob2,
                 'rna6_prob1':rna6_prob1,
                 'rna6_prob2':rna6_prob2,
                 'rna7_prob1':rna7_prob1, 
                 'rna7_prob2':rna7_prob2, 
                 'Classification':classification}

proba_dataset = pd.DataFrame(proba_dataset)
proba_dataset.head(5)

|   | rna1_prob1 | rna1_prob2 | rna2_prob1 | rna2_prob2 | rna3_prob1 | rna3_prob2 | rna4_prob1 | rna4_prob2 | rna5_prob1 | rna5_prob2 | rna6_prob1 | rna6_prob2 | rna7_prob1 | rna7_prob2 | Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.686067 | 0.313933 | 0.477146 | 0.522854 | 1.0 | 2.912713e-15 | 0.69447 | 0.30553 | 0.705509 | 0.294491 | 0.675457 | 0.324543 | 0.69805 | 0.30195 | 1 |
| 1 | 0.033584 | 0.966416 | 0.477712 | 0.522288 | 1.0 | 7.408375999999999e-44 | 0.031571 | 0.968429 | 0.032888 | 0.967112 | 0.031898 | 0.968102 | 0.031097 | 0.968903 | 2 |
| 2 | 0.613089 | 0.386911 | 0.471049 | 0.528951 | 1e-05 | 0.9999896 | 0.618909 | 0.381091 | 0.624653 | 0.375347 | 0.600535 | 0.399465 | 0.619025 | 0.380975 | 1 |
| 3 | 0.568511 | 0.431489 | 0.4536 | 0.5464 | 0.0 | 1.0 | 0.571764 | 0.428236 | 0.570457 | 0.429543 | 0.573582 | 0.426418 | 0.559524 | 0.440476 | 2 |
| 4 | 0.183284 | 0.816716 | 0.470458 | 0.529542 | 1.0 | 2.252961e-12 | 0.157231 | 0.842769 | 0.16542 | 0.83458 | 0.173593 | 0.826407 | 0.147002 | 0.852998 | 1 |


### Test set

In [45]:
rna1_prob1 = []
rna1_prob2 = []
rna2_prob1 = []
rna2_prob2 = []
rna3_prob1 = []
rna3_prob2 = []
rna4_prob1 = []
rna4_prob2 = []
rna5_prob1 = []
rna5_prob2 = []
rna6_prob1 = []
rna6_prob2 = []
rna7_prob1 = []
rna7_prob2 = []


proba = mlps[0].predict_proba(testes)
proba_into_list(proba, rna1_prob1, rna1_prob2)
proba = mlps[1].predict_proba(testes)
proba_into_list(proba, rna2_prob1, rna2_prob2)
proba = mlps[2].predict_proba(testes)
proba_into_list(proba, rna3_prob1, rna3_prob2)
proba = mlps[3].predict_proba(testes)
proba_into_list(proba, rna4_prob1, rna4_prob2)
proba = mlps[4].predict_proba(testes)
proba_into_list(proba, rna5_prob1, rna5_prob2)
proba = mlps[5].predict_proba(testes)
proba_into_list(proba, rna6_prob1, rna6_prob2)
proba = mlps[6].predict_proba(testes)
proba_into_list(proba, rna7_prob1, rna7_prob2)
    
test_proba = {'rna1_prob1':rna1_prob1, 
                 'rna1_prob2':rna1_prob2, 
                 'rna2_prob1':rna2_prob1,
                 'rna2_prob2':rna2_prob2,
                 'rna3_prob1':rna3_prob1,
                 'rna3_prob2':rna3_prob2,
                 'rna4_prob1':rna4_prob1, 
                 'rna4_prob2':rna4_prob2, 
                 'rna5_prob1':rna5_prob1,
                 'rna5_prob2':rna5_prob2,
                 'rna6_prob1':rna6_prob1,
                 'rna6_prob2':rna6_prob2,
                 'rna7_prob1':rna7_prob1, 
                 'rna7_prob2':rna7_prob2, 
}

test_proba = pd.DataFrame(test_proba)
test_proba.head(5)

|   | rna1_prob1 | rna1_prob2 | rna2_prob1 | rna2_prob2 | rna3_prob1 | rna3_prob2 | rna4_prob1 | rna4_prob2 | rna5_prob1 | rna5_prob2 | rna6_prob1 | rna6_prob2 | rna7_prob1 | rna7_prob2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.424908 | 0.575092 | 0.449497 | 0.550503 | 0.0 | 1.0 | 0.428411 | 0.571589 | 0.431871 | 0.568129 | 0.42774 | 0.57226 | 0.412313 | 0.587687 |
| 1 | 0.660227 | 0.339773 | 0.478272 | 0.521728 | 1.0 | 2.424616e-14 | 0.665328 | 0.334672 | 0.670174 | 0.329826 | 0.637744 | 0.362256 | 0.666276 | 0.333724 |
| 2 | 0.697612 | 0.302388 | 0.475358 | 0.524642 | 1.0 | 9.448002e-24 | 0.696919 | 0.303081 | 0.692082 | 0.307918 | 0.662993 | 0.337007 | 0.685389 | 0.314611 |
| 3 | 0.242636 | 0.757364 | 0.464623 | 0.535377 | 2.642331e-14 | 1.0 | 0.238947 | 0.761053 | 0.2384 | 0.7616 | 0.253605 | 0.746395 | 0.237756 | 0.762244 |
| 4 | 0.724633 | 0.275367 | 0.470275 | 0.529725 | 0.8136776 | 0.1863224 | 0.73135 | 0.26865 | 0.730083 | 0.269917 | 0.728275 | 0.271725 | 0.724326 | 0.275674 |


### Training

In [46]:
proba_ensemble = MLPClassifier(hidden_layer_sizes=(13,20),
                         activation= 'relu',
                         batch_size = 3,
                         max_iter = 10000,
                         learning_rate = 'constant',
                         learning_rate_init = 0.00005)

# Predictor attributes of the new dataset
x_proba = proba_dataset.drop(['Classification'], axis=1)
# Target attribute of the new dataset
y_proba = proba_dataset.Classification

# Train the ensemble
proba_ensemble.fit(x_proba, y_proba)

MLPClassifier(activation='relu', alpha=0.0001, batch_size=3, beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(13, 20), learning_rate='constant',
       learning_rate_init=5e-05, max_iter=10000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

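### Tests

In [47]:
# Final predictions of the probability-based ensemble on the test set
proba_ensemble_results = proba_ensemble.predict(test_proba)
dict_results = {'id': list(testes_with_id.id),'Classification': proba_ensemble_results}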
submission = pd.DataFrame(dict_results)
submission.head()

|   | id | Classification |
|---|---|---|
| 0 | 100 | 2 |
| 1 | 78 | 1 |
| 2 | 77 | 1 |
| 3 | 113 | 2 |
| 4 | 86 | 1 |


In [99]:
submission.to_csv('submission.csv', index=False)