# Combined model

In this notebook we combine the optimized versions of the three models we've created: the predictions of our k-Nearest Neighbor, Decision Tree and Neural Network models are merged into one new prediction.

In [None]:
# Install nbimporter to be able to import functions from other notebooks
%pip install nbimporter

In [74]:
from data_processing import prepare_data, split_data, one_hot_encode
import numpy as np
import nbimporter
from helper_functions import get_metrics

# Import functions for k-nearest neighbors
from kNN_die_wel_opent import split_datatypes, train_and_predict
from oversampling import smote_loop

# Import functions for decision tree
from Decision_tree import resampled_forest

# Load the data normalized
data = prepare_data('healthcare-dataset-stroke-data.csv', one_hot = False, binary = False, normalize = True)

# Load the data one-hot encoded and not normalized
data_one_hot = prepare_data('healthcare-dataset-stroke-data.csv', one_hot = True, binary = False, normalize = False)

# Split the normalized data into training, testing and validation data
train_data, test_data, val_data, train_labels, test_labels, val_labels = split_data(data, (0.6, 0.2, 0.2))

# Split the one-hot encoded data into training, testing and validation data
train_hot, test_hot, val_hot, train_labels_hot, test_labels_hot, val_labels_hot = split_data(data_one_hot, (0.6, 0.2, 0.2))

### k-Nearest Neighbor
The k-Nearest Neighbor model with the best balanced accuracy was trained on only the numeric data, oversampled with a ratio of 0.6.

In [95]:
# One-hot encode the test data, since smote_loop does the same to the training data
test_data_hot = one_hot_encode(test_data)

# Split test data into numeric and binary data
test_num, test_bin = split_datatypes(test_data_hot)

# Get the oversampled data with an oversampling ratio of 0.6
data_list, labels_list, ratio_list = smote_loop(train_data, train_labels, 0.6, 0.7, 0.1)
train_num, train_bin = split_datatypes(data_list[0])

# Predictions using model trained on numerical, oversampled data and euclidean distance metric and 5 neighbors
predict_train_kNN, predict_test_kNN = train_and_predict(train_num, labels_list[0], test_num, 5, "distance", 
                                                          metric='euclidean')
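
`train_and_predict` lives in another notebook, so its internals are not shown here. As a rough illustration of what a 5-neighbor, distance-weighted kNN vote with the Euclidean metric generally does, here is a minimal NumPy sketch. The function name `knn_predict_weighted` and the toy data are hypothetical, and the small epsilon used to avoid division by zero is a simplification rather than the behavior of any particular library.

```python
import numpy as np

def knn_predict_weighted(train_X, train_y, test_X, k=5):
    """Hypothetical sketch: distance-weighted kNN for binary labels (0/1)."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    test_X = np.asarray(test_X, dtype=float)

    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)

    # Indices, distances and labels of the k nearest training points per test point
    nearest = np.argsort(dists, axis=1)[:, :k]
    near_d = np.take_along_axis(dists, nearest, axis=1)
    near_y = train_y[nearest]

    # Inverse-distance weights; the epsilon is an assumption to avoid dividing by zero
    weights = 1.0 / (near_d + 1e-12)

    # Weighted vote: the class with the larger summed weight wins
    pos = (weights * (near_y == 1)).sum(axis=1)
    neg = (weights * (near_y == 0)).sum(axis=1)
    return (pos > neg).astype(int)

# Toy data (not from the stroke dataset): two well-separated clusters
toy_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_test = [[0.2, 0.1], [5.5, 5.2]]
toy_pred = knn_predict_weighted(toy_train, toy_labels, toy_test, k=5)
print(toy_pred)
```

With distance weighting, nearby neighbors dominate the vote even when `k` is large enough to pull in points from the other class.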


### Decision tree
Across runs, the optimal number of splits was usually around 17, so that value is used here.

In [96]:
# Create a tuple of the data that gets accepted by the forest function
data_DT = (train_hot, train_labels_hot, test_hot, test_labels_hot)

# Train the forest on the training data and return the predicted labels for the training and testing data
predict_train_DT, predict_test_DT = resampled_forest(data_DT, 17)


# Combining the models
The models can be combined in different ways. Because the individual models initially predicted too few strokes, a logical OR is a reasonable first choice: a stroke is predicted whenever at least one model predicts it. We will also try to build and train a neural network on the outputs of the three models.
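
To see why an OR combination helps when too few positives are predicted, consider the toy sketch below. The arrays and the helper names `sensitivity` and `specificity` are made up for illustration and are not taken from the stroke data. Because OR can only add positive predictions, sensitivity can never drop below that of either model, while specificity can never rise above it.

```python
import numpy as np

# Hypothetical labels and predictions from two models (1 = stroke)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred_a = np.array([0, 1, 0, 0, 1, 0, 0, 1])
pred_b = np.array([0, 0, 1, 0, 0, 1, 0, 1])

# Logical OR: positive if at least one model predicts positive
pred_or = np.logical_or(pred_a, pred_b).astype(int)

def sensitivity(y, p):
    """Fraction of true positives that were predicted positive."""
    return ((y == 1) & (p == 1)).sum() / (y == 1).sum()

def specificity(y, p):
    """Fraction of true negatives that were predicted negative."""
    return ((y == 0) & (p == 0)).sum() / (y == 0).sum()

for name, pred in [("A", pred_a), ("B", pred_b), ("A or B", pred_or)]:
    print(name, sensitivity(y_true, pred), specificity(y_true, pred))
```

In this toy case each model alone catches half of the positives, while the OR combination catches three quarters at the cost of more false positives.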

In [97]:
# Combine the predictions with a logical OR: predict a stroke if either model predicts one
predict_combined_2 = np.any([predict_test_kNN, predict_test_DT], axis=0)

print('The accuracy using only k-Nearest Neighbors: ')
test_acc, test_balacc = get_metrics(test_labels, predict_test_kNN, verbose = True)
print('The accuracy using only Random Forest: ')
test_acc, test_balacc = get_metrics(test_labels, predict_test_DT, verbose = True)
print('The accuracy using the combined predictions: ')
test_acc, test_balacc = get_metrics(test_labels, predict_combined_2, verbose = True)

The accuracy using only k-Nearest Neighbors: 
accuracy: 83.4638 % 

balanced accuracy: 68.5412 %
sensitivity: 0.5200
specificity: 0.8508 

confusion matrix: 
[[827 145]
 [ 24  26]] 

[["True Negative", "False Positive"] 
 ["False Negative", "True Positive"]] 

The accuracy using only Random Forest: 
accuracy: 73.5812 % 

balanced accuracy: 75.6770 %
sensitivity: 0.7800
specificity: 0.7335 

confusion matrix: 
[[713 259]
 [ 11  39]] 

[["True Negative", "False Positive"] 
 ["False Negative", "True Positive"]] 

The accuracy using the combined predictions: 
accuracy: 71.0372 % 

balanced accuracy: 76.2366 %
sensitivity: 0.8200
specificity: 0.7047 

confusion matrix: 
[[685 287]
 [  9  41]] 

[["True Negative", "False Positive"] 
 ["False Negative", "True Positive"]] 
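
The reported metrics can be re-derived from the printed confusion matrices. `get_metrics` is defined in `helper_functions` and is not shown here, so `metrics_from_confusion` below is a hypothetical stand-in that assumes the `[[TN, FP], [FN, TP]]` layout printed above.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Hypothetical helper: derive metrics from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Confusion matrices copied from the output above
results = {
    "kNN": metrics_from_confusion([[827, 145], [24, 26]]),
    "Random Forest": metrics_from_confusion([[713, 259], [11, 39]]),
    "Combined (OR)": metrics_from_confusion([[685, 287], [9, 41]]),
}
for name, m in results.items():
    print(f"{name}: balanced accuracy {m['balanced_accuracy']:.4%}, "
          f"sensitivity {m['sensitivity']:.4f}, specificity {m['specificity']:.4f}")
```

The OR combination has the highest sensitivity (0.82) and the lowest specificity (0.70) of the three, which is the trade-off expected from adding positive predictions.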

