# Predicting diabetes 
This notebook uses the morpher toolkit to train and compare several models for diabetes prediction on the Pima Indians dataset (source: UCI repository).

In [None]:
import morpher
from morpher.jobs import *
from morpher.plots import *
from morpher.metrics import *
from morpher.config import (
    imputers,
    algorithms,
    explainers,
    selectors,
)

### Basic definitions
Now define the setup for this classification problem: the filename, the target column, and the fraction of rows held out for testing. Note that the input dataset must be numeric and the target variable binary.

In [None]:
filename = 'diabetes.csv'
target = 'diabetes'
test_size = 0.2
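
The numeric-input and binary-target requirement can be checked before the pipeline runs. The cell below is a minimal, dependency-free sketch of such a check; `check_dataset`, the column names, and the sample rows are made up for illustration and are not part of the toolkit. It treats empty cells as missing values to be imputed later and accepts only numeric literals.

```python
import csv
import io


def check_dataset(csv_text, target):
    """Return a list of problems; an empty list means the checks passed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["dataset is empty"]
    if target not in rows[0]:
        return [f"target column {target!r} not found"]

    problems = []
    for column in rows[0]:
        for row in rows:
            value = row[column]
            if value == "":  # missing value, left for the imputation step
                continue
            try:
                float(value)
            except ValueError:
                problems.append(
                    f"column {column!r} has non-numeric value {value!r}"
                )
                break  # one report per column is enough

    # The target must only take the values 0 and 1.
    target_values = {row[target] for row in rows if row[target] != ""}
    if not target_values <= {"0", "1"}:
        problems.append(f"target {target!r} is not binary: {sorted(target_values)}")
    return problems


# Hypothetical sample: one missing BMI value and one non-numeric glucose value.
sample = "glucose,bmi,diabetes\n148,33.6,1\n85,,0\nhigh,26.6,0\n"
problems = check_dataset(sample, "diabetes")
print(problems)
```

An empty result means the file is safe to pass on; anything else should be fixed or encoded before loading.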

### Loading and imputing data 
Load the dataset, impute missing values with the mean imputer, and split it into training and test sets. Features should be numeric or boolean, and the target variable should be numeric, e.g., 0 for 'no' and 1 for 'yes'.

In [None]:
data = Load().execute(filename=filename)
data = data.drop('patient id', axis=1)  # remove the patient ID column

data,_ = Impute().execute(data)

train, test = Split().execute(
    data, test_size=test_size
)
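
To make the two steps above concrete, here is a small standard-library sketch of mean imputation followed by a hold-out split. It illustrates the idea only and is not the toolkit's implementation; `mean_impute`, `holdout_split`, and the toy rows are hypothetical.

```python
import math
import random
from statistics import mean


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [
        mean(r[j] for r in rows if not is_missing(r[j])) for j in range(n_cols)
    ]
    return [
        [col_means[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction `test_size` for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: two columns, one missing value in each.
rows = [[1.0, 10.0], [None, 20.0], [3.0, float("nan")], [5.0, 30.0], [7.0, 40.0]]
imputed = mean_impute(rows)
train_rows, test_rows = holdout_split(imputed, test_size=0.2)
print(imputed[1], imputed[2], len(train_rows), len(test_rows))
```

In this sketch the column means are computed on all rows before splitting, which mirrors the order of the cell above.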

### Select best features
Check which features are most relevant using the F-test. `selection_method` accepts any of the selection methods available in the toolkit.

In [None]:
train, selected_features = Select().execute(
    train,
    selection_method=selectors.F_TEST,
    top=3,
    target=target
)
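
For intuition about what an F-test ranking does, the sketch below computes the one-way ANOVA F-statistic of each feature against a binary target and keeps the highest-scoring features. It assumes the selector is based on this statistic, which the notebook does not spell out; `f_statistic`, the feature names, and the values are made up for illustration.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F: between-class variance over within-class variance.

    Assumes at least two classes are present in `labels`.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    grand_mean = mean(values)
    ss_between = 0.0
    ss_within = 0.0
    for group in groups.values():
        group_mean = mean(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in group)

    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:  # classes perfectly separated or feature constant
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data: three features for six patients, three per class.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "glucose": [85.0, 90.0, 95.0, 140.0, 150.0, 160.0],
    "bmi": [22.0, 30.0, 26.0, 28.0, 35.0, 33.0],
    "noise": [1.0, 5.0, 3.0, 2.0, 4.0, 3.0],
}
scores = {name: f_statistic(values, labels) for name, values in features.items()}
top_features = sorted(scores, key=scores.get, reverse=True)[:2]
print(scores, top_features)
```

A feature whose class means are far apart relative to its spread within each class gets a high score, so `glucose` ranks first here and `noise` scores zero.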

### Training different models
Now train models using a decision tree, a random forest, and gradient-boosted decision trees:

In [None]:
models = Train().execute(
    train,
    target=target,
    algorithms=[algorithms.DT, algorithms.RF, algorithms.GBDT],
    verbose=True
)

### Evaluate the models
Now evaluate the trained models on the test set obtained previously. The ROC curve is plotted in the next section.

In [None]:
test = test[selected_features + [target]]  # keep only the selected features and the target
results = Evaluate().execute(
    test,
    target=target,
    models=models
)
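
A common summary of test-set discrimination is the area under the ROC curve (AUC). The sketch below computes it from its probabilistic definition: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The structure of `results` is not shown in this notebook, so the example uses made-up labels and scores, and `roc_auc` is a hypothetical helper.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:  # ties count as half a correctly ranked pair
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy predictions for four patients.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.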

### Discrimination and clinical usefulness (decision curve)
Use the curves below to assess how well each model discriminates and whether it is clinically useful in a given threshold range.

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Area under the ROC curve
plot_roc(results, ax=axs[0])

# Decision curve
plot_dc(results, ax=axs[1])

plt.tight_layout()
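
A decision curve plots net benefit against the threshold probability. As a rough guide to reading it, the sketch below computes net benefit at a single threshold with the usual definition, true positives per patient minus false positives per patient weighted by the odds of the threshold, together with the "treat all" reference line. How `plot_dc` computes its curve is not shown here, so this is an assumption about the standard method; the function names and data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1 - threshold)  # weight of a false positive
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: 10 patients, 3 with the condition.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.1, 0.05, 0.2, 0.4]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(nb_model, nb_all)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all line and zero, which is the net benefit of treating no one.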

### Explain the models
Now explain the models using model feature contributions, LIME, and mimic learning, and plot the explanations for the random forest (RF).

In [None]:

explanations = Explain().execute(
    train,
    models=models,
    explainers=[explainers.FEAT_CONTRIB, explainers.LIME, explainers.MIMIC],
    target=target,
    exp_args={'test': test}
)

plot_explanation_heatmap(explanations[algorithms.RF])
