# Predicting Group_ID in Control, ASD, Conduct Problems and Depression cohorts using HRV measures

### Notebook automatically generated from your model

Model XGBoost, trained on 2019-04-26 00:19:18.

#### Generated on 2019-05-02 03:17:06.691733

#### Warning

The goal of this notebook is to provide an easily readable and explainable code that reproduces the main steps
of training the model. It is not complete: some of the preprocessing done by the DSS visual machine learning is not
replicated in this notebook. This notebook will not give the same results and model performance as the DSS visual machine
learning model.

Let's start by importing the required libraries:

In [None]:
import sys
import dataiku
import numpy as np
import pandas as pd
import sklearn as sk
import dataiku.core.pandasutils as pdu
from dataiku.doctor.preprocessing import PCA
from collections import defaultdict, Counter

And tune pandas display options:

In [None]:
pd.set_option('display.width', 3000)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

#### Importing base data

The first step is to get our machine learning dataset:

In [None]:
# We apply the preparation that you defined. You should not modify this.
preparation_steps = []
preparation_output_schema = {u'userModified': False, u'columns': [{u'type': u'bigint', u'name': u'Group_ID'}, {u'type': u'double', u'name': u'fet_SDNN'}, {u'type': u'double', u'name': u'fet_RMSSD'}, {u'type': u'double', u'name': u'fet_SD1'}, {u'type': u'double', u'name': u'fet_SD2'}, {u'type': u'double', u'name': u'fet_Sample Entropy'}, {u'type': u'double', u'name': u'fet_Fuzzy Entropy'}, {u'type': u'double', u'name': u'fet_moment coeff of skewness'}, {u'type': u'double', u'name': u'fet_mode skewness'}, {u'type': u'double', u'name': u'fet_median skewness'}, {u'type': u'double', u'name': u'fet_LF power'}, {u'type': u'double', u'name': u'fet_HF power'}, {u'type': u'double', u'name': u'fet_RR'}, {u'type': u'double', u'name': u'fet_DET'}, {u'type': u'double', u'name': u'fet_ENTR'}, {u'type': u'double', u'name': u'fet_L'}, {u'type': u'double', u'name': u'PRSA_fetal'}]}

ml_dataset_handle = dataiku.Dataset('ASD_plus_three_groups_UPDATED_ASD_code_vs_rest')
ml_dataset_handle.set_preparation_steps(preparation_steps, preparation_output_schema)
%time ml_dataset = ml_dataset_handle.get_dataframe(limit = 100000)

print ('Base data has %i rows and %i columns' % (ml_dataset.shape[0], ml_dataset.shape[1]))
# First five records
ml_dataset.head(5)

#### Initial data management

The preprocessing aims at making the dataset compatible with modeling.
At the end of this step, we will have a matrix of float numbers, with no missing values.
We'll use the features and the preprocessing steps defined in Models.

Let's only keep selected features

In [None]:
ml_dataset = ml_dataset[[u'fet_SDNN', u'fet_Fuzzy Entropy', u'fet_ENTR', u'fet_SD2', u'fet_median skewness', u'PRSA_fetal', u'fet_mode skewness', u'fet_L', u'Group_ID', u'fet_RR', u'fet_moment coeff of skewness', u'fet_SD1', u'fet_RMSSD', u'fet_LF power', u'fet_Sample Entropy', u'fet_DET', u'fet_HF power']]

Let's first coerce categorical columns into unicode, numerical features into floats.

In [None]:
# astype('unicode') does not work as expected

def coerce_to_unicode(x):
    if sys.version_info < (3, 0):
        if isinstance(x, str):
            return unicode(x,'utf-8')
        else:
            return unicode(x)
    else:
        return str(x)


categorical_features = []
numerical_features = [u'fet_SDNN', u'fet_Fuzzy Entropy', u'fet_ENTR', u'fet_SD2', u'fet_median skewness', u'PRSA_fetal', u'fet_mode skewness', u'fet_L', u'fet_RR', u'fet_moment coeff of skewness', u'fet_SD1', u'fet_RMSSD', u'fet_LF power', u'fet_Sample Entropy', u'fet_DET', u'fet_HF power']
text_features = []
from dataiku.doctor.utils import datetime_to_epoch
for feature in categorical_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in text_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in numerical_features:
    if ml_dataset[feature].dtype == np.dtype('M8[ns]'):
        ml_dataset[feature] = datetime_to_epoch(ml_dataset[feature])
    else:
        ml_dataset[feature] = ml_dataset[feature].astype('double')

We are now going to handle the target variable and store it in a new variable:

In [None]:
target_map = {u'1': 2, u'3': 0, u'2': 1}
ml_dataset['__target__'] = ml_dataset['Group_ID'].map(str).map(target_map)
del ml_dataset['Group_ID']


# Remove rows for which the target is unknown.
ml_dataset = ml_dataset[~ml_dataset['__target__'].isnull()]
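
To see what the two-step mapping in the cell above does, here is a minimal sketch on a toy column. The toy values, including the label `4` that is absent from `target_map`, are assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy Group_ID column; the value 4 is a made-up label that is absent from target_map.
group_id = pd.Series([1, 2, 3, 4], name='Group_ID')
target_map = {'1': 2, '3': 0, '2': 1}

# Same two-step mapping as above: cast to str, then look up the class index.
target = group_id.map(str).map(target_map)

# Labels missing from target_map come out as NaN and are filtered out.
kept = target[~target.isnull()]
print(kept.astype(int).tolist())  # [2, 1, 0]
```

Any `Group_ID` value that is not a key of `target_map` is therefore dropped from the dataset before the split.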

#### Cross-validation strategy

The dataset needs to be split into two new sets: one that will be used for training the model (train set)
and another that will be used to test its generalization capability (test set).

This is a simple hold-out split rather than a k-fold cross-validation; `prop=0.8` sets the proportion of rows used for training.

In [None]:
train, test = pdu.split_train_valid(ml_dataset, prop=0.8)
print ('Train data has %i rows and %i columns' % (train.shape[0], train.shape[1]))
print ('Test data has %i rows and %i columns' % (test.shape[0], test.shape[1]))
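
The internals of `pdu.split_train_valid` are not shown in this notebook. As a rough stand-in, a random hold-out split can be sketched with plain pandas; the toy frame, the variable names and the seed below are assumptions, and the rows selected will not match those chosen by the Dataiku helper.

```python
import pandas as pd

# Toy frame standing in for ml_dataset (hypothetical values).
toy = pd.DataFrame({'x': range(10), '__target__': [0, 1] * 5})

# Shuffle and keep 80% of the rows for training (fixed seed for repeatability).
train_toy = toy.sample(frac=0.8, random_state=1337)

# The remaining rows form the test set.
test_toy = toy.drop(train_toy.index)

print(len(train_toy), len(test_toy))  # 8 2
```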

#### Features preprocessing

The first thing to do at the features level is to handle the missing values.
Let's reuse the settings defined in the model

In [None]:
drop_rows_when_missing = []
impute_when_missing = [{'impute_with': u'MEAN', 'feature': u'fet_SDNN'}, {'impute_with': u'MEAN', 'feature': u'fet_Fuzzy Entropy'}, {'impute_with': u'MEAN', 'feature': u'fet_ENTR'}, {'impute_with': u'MEAN', 'feature': u'fet_SD2'}, {'impute_with': u'MEAN', 'feature': u'fet_median skewness'}, {'impute_with': u'MEAN', 'feature': u'PRSA_fetal'}, {'impute_with': u'MEAN', 'feature': u'fet_mode skewness'}, {'impute_with': u'MEAN', 'feature': u'fet_L'}, {'impute_with': u'MEAN', 'feature': u'fet_RR'}, {'impute_with': u'MEAN', 'feature': u'fet_moment coeff of skewness'}, {'impute_with': u'MEAN', 'feature': u'fet_SD1'}, {'impute_with': u'MEAN', 'feature': u'fet_RMSSD'}, {'impute_with': u'MEDIAN', 'feature': u'fet_LF power'}, {'impute_with': u'MEAN', 'feature': u'fet_Sample Entropy'}, {'impute_with': u'MEAN', 'feature': u'fet_DET'}, {'impute_with': u'MEAN', 'feature': u'fet_HF power'}]

# Features for which we drop rows with missing values
for feature in drop_rows_when_missing:
    train = train[train[feature].notnull()]
    test = test[test[feature].notnull()]
    print ('Dropped missing records in %s' % feature)

# Features for which we impute missing values
for feature in impute_when_missing:
    if feature['impute_with'] == 'MEAN':
        v = train[feature['feature']].mean()
    elif feature['impute_with'] == 'MEDIAN':
        v = train[feature['feature']].median()
    elif feature['impute_with'] == 'CREATE_CATEGORY':
        v = 'NULL_CATEGORY'
    elif feature['impute_with'] == 'MODE':
        v = train[feature['feature']].value_counts().index[0]
    elif feature['impute_with'] == 'CONSTANT':
        v = feature['value']
    train[feature['feature']] = train[feature['feature']].fillna(v)
    test[feature['feature']] = test[feature['feature']].fillna(v)
    print ('Imputed missing values in feature %s with value %s' % (feature['feature'], coerce_to_unicode(v)))
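
Note that the fill value is computed on the train set only and then reused for the test set, so no information from the test set leaks into preprocessing. A minimal sketch on toy data (the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({'fet_SDNN': [1.0, np.nan, 3.0]})
test_toy = pd.DataFrame({'fet_SDNN': [np.nan, 10.0]})

# Mean of the train column; pandas skips NaN by default, so this is 2.0.
fill_value = train_toy['fet_SDNN'].mean()

# The same train-derived value fills the gaps in both sets.
train_toy['fet_SDNN'] = train_toy['fet_SDNN'].fillna(fill_value)
test_toy['fet_SDNN'] = test_toy['fet_SDNN'].fillna(fill_value)

print(test_toy['fet_SDNN'].tolist())  # [2.0, 10.0]
```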

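We would normally handle the categorical features next (still using the settings defined in Models), but this model has none (`categorical_features` is empty), so there is nothing to encode.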

Let's rescale numerical features

In [None]:
rescale_features = {u'fet_RMSSD': u'AVGSTD', u'fet_SDNN': u'AVGSTD', u'fet_Fuzzy Entropy': u'AVGSTD', u'fet_ENTR': u'AVGSTD', u'fet_RR': u'AVGSTD', u'fet_median skewness': u'AVGSTD', u'PRSA_fetal': u'AVGSTD', u'fet_mode skewness': u'AVGSTD', u'fet_L': u'AVGSTD', u'fet_moment coeff of skewness': u'AVGSTD', u'fet_SD1': u'AVGSTD', u'fet_SD2': u'AVGSTD', u'fet_LF power': u'AVGSTD', u'fet_Sample Entropy': u'AVGSTD', u'fet_DET': u'AVGSTD', u'fet_HF power': u'AVGSTD'}
for (feature_name, rescale_method) in rescale_features.items():
    if rescale_method == 'MINMAX':
        _min = train[feature_name].min()
        _max = train[feature_name].max()
        scale = _max - _min
        shift = _min
    else:
        shift = train[feature_name].mean()
        scale = train[feature_name].std()
    if scale == 0.:
        del train[feature_name]
        del test[feature_name]
        print ('Feature %s was dropped because it has no variance' % feature_name)
    else:
        print ('Rescaled %s' % feature_name)
        train[feature_name] = (train[feature_name] - shift).astype(np.float64) / scale
        test[feature_name] = (test[feature_name] - shift).astype(np.float64) / scale
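
The `AVGSTD` branch above is a standardization: subtract the train mean, then divide by the train standard deviation. A small worked example on toy values (assumed for illustration) shows that the test set is scaled with the train statistics, and that pandas `std()` uses the sample standard deviation (`ddof=1`):

```python
import pandas as pd

train_toy = pd.Series([1.0, 2.0, 3.0])
test_toy = pd.Series([2.0, 5.0])

shift = train_toy.mean()  # 2.0
scale = train_toy.std()   # sample standard deviation (ddof=1): 1.0

# Both sets are rescaled with the statistics of the train set.
train_scaled = (train_toy - shift) / scale
test_scaled = (test_toy - shift) / scale

print(train_scaled.tolist())  # [-1.0, 0.0, 1.0]
print(test_scaled.tolist())   # [0.0, 3.0]
```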

#### Modeling

Before actually creating our model, we need to split the datasets into their features and labels parts:

In [None]:
train_X = train.drop('__target__', axis=1)
test_X = test.drop('__target__', axis=1)

train_Y = np.array(train['__target__'])
test_Y = np.array(test['__target__'])

Now we can finally create our model !

In [None]:
import xgboost as xgb
clf = xgb.XGBClassifier(
                    max_depth=3,
                    learning_rate=0.2,
                    gamma=0.0,
                    min_child_weight=0.0,
                    max_delta_step=0.0,
                    subsample=1.0,
                    colsample_bytree=1.0,
                    colsample_bylevel=1.0,
                    reg_alpha=0.0,
                    reg_lambda=1.0,
                    n_estimators=1,
                    silent=0,
                    nthread=4,
                    scale_pos_weight=1.0,
                    base_score=0.5,
                    seed=1337,
                    missing=None,
                  )

... And train it

In [None]:
%time clf.fit(train_X, train_Y)

Build up our result dataset

The model is now trained; we can apply it to our test set:

In [None]:
%time _predictions = clf.predict(test_X)
%time _probas = clf.predict_proba(test_X)
predictions = pd.Series(data=_predictions, index=test_X.index, name='predicted_value')
cols = [
    u'probability_of_value_%s' % label
    for (_, label) in sorted([(int(target_map[label]), label) for label in target_map])
]
probabilities = pd.DataFrame(data=_probas, index=test_X.index, columns=cols)

# Build scored dataset
results_test = test_X.join(predictions, how='left')
results_test = results_test.join(probabilities, how='left')
results_test = results_test.join(test['__target__'], how='left')
results_test = results_test.rename(columns= {'__target__': 'Group_ID'})
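
The probability columns are named after the original `Group_ID` labels, ordered by the class index each label was mapped to. Running the same list comprehension in isolation makes the resulting order explicit:

```python
target_map = {'1': 2, '3': 0, '2': 1}

# Sort the (class index, original label) pairs by class index.
ordered = sorted((int(idx), label) for label, idx in target_map.items())
cols = ['probability_of_value_%s' % label for (_, label) in ordered]

print(cols)
# ['probability_of_value_3', 'probability_of_value_2', 'probability_of_value_1']
```

Column 0 of `_probas` therefore holds the probability of the original label `3`, column 1 that of label `2`, and column 2 that of label `1`.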

#### Results

You can measure the model's performance on the test set with a multiclass ROC AUC:

In [None]:
from dataiku.doctor.utils.metrics import mroc_auc_score
test_Y_ser = pd.Series(test_Y)
print ('AUC value:', mroc_auc_score(test_Y_ser, _probas))
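
The implementation of `mroc_auc_score` is not shown in this notebook. One common way to extend AUC to several classes is to compute a one-vs-rest AUC per class and average them; the sketch below illustrates that idea with NumPy only. The helper names `binary_auc` and `one_vs_rest_auc` are hypothetical, and the averaging scheme is an assumption that may differ from what Dataiku computes.

```python
import numpy as np

def binary_auc(is_positive, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), over all pos/neg pairs."""
    is_positive = np.asarray(is_positive, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[is_positive]
    neg = scores[~is_positive]
    # Pairwise differences; assumes at least one positive and one negative row.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / float(diff.size)

def one_vs_rest_auc(y_true, probas):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    y_true = np.asarray(y_true)
    probas = np.asarray(probas, dtype=float)
    aucs = [binary_auc(y_true == k, probas[:, k]) for k in range(probas.shape[1])]
    return float(np.mean(aucs))

y_toy = [0, 1, 2, 0, 1, 2]
# Probabilities that always rank the true class highest.
perfect = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]] * 2)
# Probabilities that carry no information at all.
uniform = np.full((6, 3), 1.0 / 3.0)

print(one_vs_rest_auc(y_toy, perfect))  # 1.0
print(one_vs_rest_auc(y_toy, uniform))  # 0.5
```

A value close to 0.5 means the predicted probabilities rank the classes no better than chance, while 1.0 means every class is perfectly separated from the others.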

We can also view the predictions directly.
Since scikit-learn only predicts numericals, the labels have been mapped to 0,1,2 ...
We need to 'reverse' the mapping to display the initial labels.

In [None]:
inv_map = { target_map[label] : label for label in target_map}
predictions.map(inv_map)
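
As a quick check of the reverse mapping, here is the same dictionary inversion applied to a toy series of predicted class indices (the predicted values are made up for illustration):

```python
import pandas as pd

target_map = {'1': 2, '3': 0, '2': 1}

# Invert the mapping: class index -> original Group_ID label.
inv_map = {idx: label for label, idx in target_map.items()}

predicted = pd.Series([0, 2, 1, 0], name='predicted_value')
labels = predicted.map(inv_map)

print(labels.tolist())  # ['3', '1', '2', '3']
```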

That's it. It's now up to you to tune your preprocessing, your algo, and your analysis !
