### KNN

- For each train/test split produced by k-fold cross-validation with 5 folds (k=5):
    - Train the algorithm.
    - Run the tests.
- Compute the mean of the accuracies obtained across the train/test splits.
- Build the confusion matrix from the accumulated results (the sum of the confusion matrices generated for each train/test split).
- Compute precision and recall from the resulting confusion matrix.
- Repeat the procedure above for k (number of neighbors) = 3, 5, 7, and 9.
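
For reference, the three metrics in the list above can be written out as follows. This is a notational sketch only: $a_i$ (the accuracy on split $i$) is a symbol introduced here for illustration, while $TP$, $FP$, and $FN$ are the true positive, false positive, and false negative counts read from the accumulated confusion matrix.

$$
\bar{a} = \frac{1}{5}\sum_{i=1}^{5} a_i,
\qquad
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}
$$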

#### Step 1

Import the packages used throughout the algorithm.

In [1]:
# Import modules
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

#### Step 2

Load the dataset. Set the `use_local` variable to True to read the dataset from a local directory, or to False to read it directly from GitHub (the repository must be public).

In [2]:
use_local = True
heart_disease_path = '/media/LexisNexis/IFES-Heart-Attack-Prediction/heart.csv'
heart_disease_url = 'https://raw.githubusercontent.com/objetovazio/IFES-Heart-Attack-Prediction/master/heart.csv'

In [3]:
# Loading heart disease dataset
heart_disease_ds = pd.read_csv(heart_disease_path) if use_local else pd.read_csv(heart_disease_url)


# Check file head
heart_disease_ds.head()

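| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |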


#### Step 3

Next, the features and the target are separated from the dataset and stored in the variables `X` and `y`.

In [4]:
# Load features on 'X' and Target on 'y'
X,y = heart_disease_ds.drop('target', axis=1), heart_disease_ds.target

Next, the `KFold` splitter used for cross-validation is instantiated, along with the number of folds and the list of neighbor counts to test.

In [5]:
k_folds=5
kf = KFold(n_splits=k_folds)
n_neighbors_list = [3, 5, 7, 9]
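
`KFold(n_splits=k_folds)` as instantiated above does not shuffle, so each test fold is a contiguous block of rows in file order. The sketch below illustrates what that means when rows are grouped by label. The toy arrays and the helper `positive_share_per_fold` are hypothetical, and whether `heart.csv` is actually ordered by `target` is an assumption that should be checked before drawing conclusions. `StratifiedKFold` appears only for comparison; it is not what this notebook uses.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target sorted by class (assumption for illustration only):
# ten positives followed by ten negatives.
y_toy = np.array([1] * 10 + [0] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)


def positive_share_per_fold(splitter):
    """Return the fraction of positive labels in each test fold."""
    return [
        float(y_toy[test_idx].mean())
        for _, test_idx in splitter.split(X_toy, y_toy)
    ]


# Sequential folds: test folds are contiguous blocks of the sorted labels.
sequential = positive_share_per_fold(KFold(n_splits=5))

# Stratified, shuffled folds: each test fold keeps the overall class ratio.
stratified = positive_share_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold (no shuffle):", sequential)
print("StratifiedKFold   :", stratified)
```

With sorted labels, the unshuffled folds are dominated by a single class, while the stratified folds stay balanced. If the real data turns out to be ordered this way, passing `shuffle=True` (with a fixed `random_state`) to `KFold` would be one option to consider.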

Next, a `for` loop iterates over the list of neighbor counts to test.

Inside the loop, the model is trained with the given number of neighbors, predictions are made, and the results are printed.
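
The loop below sums one confusion matrix per fold. As a cross-check on that idea, the sketch here compares the summed matrix with the matrix built from pooled out-of-fold predictions obtained through `cross_val_predict`. It runs on synthetic data from `make_classification` (a stand-in, not the heart disease dataset), and the names `X_demo`, `y_demo`, and `kf_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (stand-in for illustration only)
X_demo, y_demo = make_classification(n_samples=103, n_features=6, random_state=0)
kf_demo = KFold(n_splits=5)

# Manual approach: train per fold, then sum the per-fold confusion matrices
cm_sum = np.zeros((2, 2), dtype=int)
fold_accuracies = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    fold_accuracies.append(model.score(X_demo[test_idx], y_demo[test_idx]))
    cm_sum += confusion_matrix(y_demo[test_idx], preds, labels=[1, 0])

# Pooled approach: one out-of-fold prediction per sample, one confusion matrix
oof_preds = cross_val_predict(
    KNeighborsClassifier(n_neighbors=3), X_demo, y_demo, cv=kf_demo
)
cm_pooled = confusion_matrix(y_demo, oof_preds, labels=[1, 0])

print("Summed per-fold matrix:\n", cm_sum)
print("Pooled matrix:\n", cm_pooled)
print("Mean fold accuracy:", np.mean(fold_accuracies))
```

Both matrices should match because every sample lands in exactly one test fold, so summing per-fold counts and counting over pooled predictions tally the same predictions.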

In [8]:
# Iterate on the n_neighbors_list
for n_neighbors in n_neighbors_list:
    
    sum_accuracy = 0
    accuracy_alg = []
    predictions = []
    confusion_matrix_sum = [[0, 0], [0, 0]]

    # Iterate using KFold `k` times on dataset
    for train_indexes, test_indexes in kf.split(X):
        # Get train and test data from this fold (KFold yields positional indexes)
        X_train, y_train = X.iloc[train_indexes], y.iloc[train_indexes]
        X_test, y_test = X.iloc[test_indexes], y.iloc[test_indexes]

        # Create Classifier
        knn_model = KNeighborsClassifier(n_neighbors=n_neighbors)  

        # Fits train data
        knn_model.fit(X_train, y_train)

        # Get prediction
        y_preds = knn_model.predict(X_test)
        predictions.append(y_preds)

        # Get accuracy
        accuracy = knn_model.score(X_test, y_test)
        accuracy_alg.append(accuracy)    
        sum_accuracy = sum_accuracy + accuracy

        # Generate Confusion Matrix and Sum
        cm = confusion_matrix(y_test, y_preds, labels=[1, 0])
        confusion_matrix_sum = np.add(confusion_matrix_sum, cm)
    #end for

    # Create a confusion matrix with labels
    cmtx = pd.DataFrame(
        confusion_matrix_sum, 
        index=['true:1', 'true:0'], 
        columns=['pred:1', 'pred:0']
    )

    # Name each cell of the confusion matrix so the calculations are easier to read
    tp = confusion_matrix_sum[0][0]
    fp = confusion_matrix_sum[1][0]
    fn = confusion_matrix_sum[0][1]
    tn = confusion_matrix_sum[1][1]

    # Calculate results
    avg_accuracy = sum_accuracy / k_folds 
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    print('Results with KNN using %d neighbors:\n' % (n_neighbors))
    print('Confusion Matrix:')
    print(cmtx,'\n')

    print('Accuracy Average: %f%%' % (avg_accuracy * 100))
    print('Precision: %f%%' % (precision * 100))
    print('Recall: %f%%' % (recall * 100))
    print('###################################################################################################\n')
#End for

Results with KNN using 3 neighbors:

Confusion Matrix:
        pred:1  pred:0
true:1      95      70
true:0      82      56 

Accuracy Average: 49.754098%
Precision: 53.672316%
Recall: 57.575758%
###################################################################################################
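
As a sanity check on the figures reported above for 3 neighbors, precision and recall can be recomputed directly from the printed confusion matrix. This is a minimal sketch; `cm_k3` and `pooled_accuracy` are names introduced here for illustration.

```python
import numpy as np

# Accumulated confusion matrix printed above for 3 neighbors.
# Rows: true 1, true 0. Columns: predicted 1, predicted 0 (labels=[1, 0]).
cm_k3 = np.array([[95, 70],
                  [82, 56]])

tp, fn = cm_k3[0]
fp, tn = cm_k3[1]

precision = tp / (tp + fp)                  # 95 / 177
recall = tp / (tp + fn)                     # 95 / 165
pooled_accuracy = (tp + tn) / cm_k3.sum()   # 151 / 303

print(f"Precision: {precision:.6%}")
print(f"Recall: {recall:.6%}")
print(f"Pooled accuracy: {pooled_accuracy:.6%}")
```

The pooled accuracy (about 49.83%) differs slightly from the reported average of 49.754098%. The reported value is the mean of the five per-fold accuracies, and 303 rows do not divide evenly into 5 folds, so the folds carry slightly different weights in that mean.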

Results with KNN using 5 neighbors:

Confusion Matrix:
        pred:1  pred:0
true:1      98      67
true:0      85      53 

Accuracy Average: 49.737705%
Precision: 53.551913%
Recall: 59.393939%
###################################################################################################

Results with KNN using 7 neighbors:

Confusion Matrix:
        pred:1  pred:0
true:1     100      65
true:0      89      49 

Accuracy Average: 49.065574%
Precision: 52.910053%
Recall: 60.606061%
###################################################################################################

Results with KNN using 9 neighbors:

Confusion Matrix:
        pred:1  pred:0
true:1     102      63
true:0  