# Code Implementation

We have defined two functions in `ml_knn.py`:

- `get_folds` generates the Multi-Label Stratified K-Folds required for our cross-validation.

- `log_loss_score` fits a `KNeighborsClassifier` on the training folds and predicts probabilities for the *out-of-fold* samples. It returns the accumulated log loss, which acts as the "fitness" score for the genetic algorithm.
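
To make the out-of-fold scoring idea concrete, here is a minimal sketch. It is not the code in `ml_knn.py`: the function name `oof_log_loss`, its parameters, and the use of plain `KFold` (standing in for multi-label stratified folds) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def oof_log_loss(X, Y, mask, n_splits=5, n_neighbors=5, seed=0):
    """Out-of-fold log loss of a KNN that only sees the columns where mask is True.

    Hypothetical helper: plain KFold is a stand-in for multi-label stratification.
    Y must be a 2-D binary array with at least two label columns.
    """
    X_sub = X[:, mask]
    oof = np.zeros(Y.shape, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X_sub):
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(X_sub[train_idx], Y[train_idx])
        # For multi-output targets, predict_proba returns one array per label.
        probas = clf.predict_proba(X_sub[test_idx])
        for j, p in enumerate(probas):
            classes = clf.classes_[j]
            positive = np.flatnonzero(classes == 1)
            if positive.size:  # label may have no positives in this training fold
                oof[test_idx, j] = p[:, positive[0]]
    # Score every (sample, label) pair as one binary problem.
    return log_loss(Y.ravel(), oof.ravel(), labels=[0, 1])


# Toy data only; the shapes are arbitrary.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 6))
Y_demo = (rng.random((60, 3)) < 0.3).astype(int)
mask_demo = np.array([True, False, True, True, False, True])
score = oof_log_loss(X_demo, Y_demo, mask_demo, n_splits=3, n_neighbors=3)
```

Every sample is predicted exactly once by a model that never saw it during fitting, so the score is less optimistic than an in-sample loss would be.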

In [1]:
import os
import numpy as np
import pandas as pd
from ml_knn import GeneticAlgorithm, EnsembleClassifier

In [2]:
fnames = os.listdir('feature_subspaces')
population = np.array([np.load(os.path.join('feature_subspaces', f)) for f in fnames])
population.shape

(206, 875)

In [3]:
X = pd.read_csv('train_features.csv', index_col=0)
y = pd.read_csv('train_targets_scored.csv')
X.shape, y.shape

((23814, 875), (23814, 207))

In [4]:
X = X.replace({
    'cp_type': {'trt_cp': -1.0, 'ctl_vehicle': 1.0},
    'cp_time': {24: -1.0, 48: 0.0, 72: 1.0},
    'cp_dose': {'D1': -1.0, 'D2': 1.0}
})

In [5]:
# model = GeneticAlgorithm(population)
# model.fit(X, y, generations=20)
# np.save('final_population.npy', model.population)
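
The internals of `GeneticAlgorithm` are not shown in this notebook. As a rough illustration of what one generation over boolean feature masks could look like, the sketch below uses truncation selection, single-point crossover, and bit-flip mutation. The function `next_generation` and all of its parameters are hypothetical, and the real class may use different operators.

```python
import numpy as np


def next_generation(population, fitness, rng, mutation_rate=0.01):
    """One illustrative GA step; lower fitness (log loss) is better.

    Assumes at least four individuals so that two distinct parents exist.
    """
    n, n_features = population.shape
    order = np.argsort(fitness)              # best (lowest loss) first
    parents = population[order[: n // 2]]    # truncation selection: keep the better half
    children = []
    while len(parents) + len(children) < n:
        a, b = parents[rng.choice(len(parents), size=2, replace=False)]
        cut = rng.integers(1, n_features)    # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < mutation_rate
        children.append(np.where(flip, ~child, child))  # bit-flip mutation
    return np.vstack([parents] + children)


rng = np.random.default_rng(0)
toy_population = rng.random((6, 10)) < 0.5   # 6 masks over 10 features
toy_fitness = rng.random(6)                  # placeholder scores, not real log losses
new_population = next_generation(toy_population, toy_fitness, rng)
```

In a full run, the fitness of each mask would come from the out-of-fold log loss, and this step would be repeated for the requested number of generations.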

In [6]:
feature_subspaces = np.load('final_population.npy')
feature_subspaces

array([[ True,  True,  True, ..., False,  True, False],
       [ True, False, False, ..., False,  True, False],
       [ True,  True,  True, ..., False,  True, False],
       ...,
       [False, False,  True, ...,  True,  True,  True],
       [ True,  True, False, ...,  True, False,  True],
       [ True, False,  True, ..., False,  True, False]])
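
Each row above is a boolean mask, presumably with one entry per feature column. The toy sketch below shows how such a mask restricts a feature matrix to a subspace. The array names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 8))       # 5 samples, 8 features
masks = rng.random((3, 8)) < 0.5      # 3 candidate feature subspaces

shapes = []
for mask in masks:
    X_sub = X_toy[:, mask]            # keep only the columns where mask is True
    shapes.append(X_sub.shape)
    print(int(mask.sum()), X_sub.shape)
```

For a pandas `DataFrame`, the equivalent selection is `X.loc[:, mask]`.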

In [27]:
ensemble_clf = EnsembleClassifier(feature_subspaces)
ensemble_clf.fit(X, y, solution_per_population=8)

In [28]:
weights = ensemble_clf.weights
prediction_probas = ensemble_clf.prediction_probas

In [29]:
np.save('prediction_probas.npy', prediction_probas)
np.save('weights.npy', weights)

In [30]:
best_weight = weights[0]
best_weight.shape

(103,)

In [31]:
best_weight.sum()

1.0

In [32]:
prediction_probas.shape

(103, 23814, 206)

In [68]:
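# Weighted average over the first axis (103 ensemble members):
# (103, 23814, 206) combined with weights of shape (103,) -> (23814, 206)
final_predictions = prediction_probas.T.dot(best_weight).T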
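
The double transpose can be hard to read, so here is a small check on toy arrays that the expression is a weighted average over the model axis. The names `P` and `w` and the shapes are placeholders, and `np.tensordot` and `np.einsum` are shown only as equivalent spellings.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 6, 3))             # (n_models, n_samples, n_labels)
w = rng.random(4)
w /= w.sum()                          # weights sum to 1, as in the notebook

blended = P.T.dot(w).T                # form used in the notebook
alt = np.tensordot(w, P, axes=1)      # contracts the model axis directly
ein = np.einsum('m,msl->sl', w, P)    # same contraction, spelled out

print(blended.shape)
```

Because the weights are non-negative and sum to 1 in this toy case, every blended value stays within the range of the individual model probabilities.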

In [75]:
from sklearn.metrics import log_loss

# Drop the first (non-target) column of y, then score all sample/label pairs at once
log_loss(y.iloc[:, 1:].to_numpy().ravel(), final_predictions.ravel())

0.018739821884494166
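
Flattening both arrays turns the multi-label problem into one large binary problem, so the score is the mean binary cross-entropy over every sample/label pair. A minimal NumPy sketch of that computation follows. The helper name `binary_log_loss` and the clipping constant `eps` are assumptions, and scikit-learn's own clipping may differ between versions.

```python
import numpy as np


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy over all flattened (sample, label) pairs."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    # Clip to avoid log(0) when a predicted probability is exactly 0 or 1.
    p = np.clip(np.asarray(y_prob, dtype=float).ravel(), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


toy_true = np.array([[1, 0], [0, 1]])
toy_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
toy_loss = binary_log_loss(toy_true, toy_prob)
print(toy_loss)
```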