# MLBox

#### Author's description:

MLBox is a powerful Automated Machine Learning python library. It provides the following features:

* Fast reading and distributed data preprocessing/cleaning/formatting
* Highly robust feature selection and leak detection
* Accurate hyper-parameter optimization in high-dimensional space
* State-of-the art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,...)
* Prediction with models interpretation

#### Useful links:

[home](https://pypi.org/project/mlbox/),
[tutorial](https://www.analyticsvidhya.com/blog/2017/07/mlbox-library-automated-machine-learning/),
[manual](https://mlbox.readthedocs.io/en/latest/),
[git](https://github.com/AxeldeRomblay/MLBox),
[more examples](https://mlbox.readthedocs.io/en/latest/introduction.html)

## Install and import

In [25]:
!pip install jsonschema==2.6 --user
!pip install mlbox==0.8.2 --user



#### MLBox main package contains 3 sub-packages: preprocessing, optimisation and prediction. They are aimed, respectively, at reading and preprocessing data, testing or optimising a wide range of learners, and predicting the target on a test dataset.

In [26]:
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import numpy as np
import pandas as pd
import sklearn
import subprocess

In [27]:
!rm -rf ../results/joblib

## A few pointers to keep in mind

#### Importing data
MLBox seems to prefer csv files. Otherwise you have to build your own dictionary. The dictionary structure is not overly complicated, but it introduces another chance for syntax or type errors. It might be wise to just use csv if saving and loading as csv is not too expensive.
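
For reference, the dictionary that `Reader.train_test_split()` returns appears to hold three entries: `train`, `test` and `target`. The sketch below builds a structure of that shape by hand from pandas DataFrames. The key names are an assumption based on my reading of the MLBox documentation, and `build_mlbox_dict` and the toy data are hypothetical.

```python
import pandas as pd


def build_mlbox_dict(train_df, test_df, target_name):
    """Assemble a dict shaped like the one Reader.train_test_split() returns.

    Assumption: MLBox expects the keys 'train', 'test' and 'target'.
    """
    if target_name not in train_df.columns:
        raise ValueError(f"'{target_name}' must be a column of the training set")
    return {
        # features only: the target is stored separately
        "train": train_df.drop(columns=[target_name]),
        # drop the target from the test set if it happens to be present
        "test": test_df.drop(columns=[target_name], errors="ignore"),
        "target": train_df[target_name],
    }


# toy data, only to show the resulting structure
toy_train = pd.DataFrame({"age": [63, 41, 57], "chol": [233, 204, 192], "target": [1, 0, 1]})
toy_test = pd.DataFrame({"age": [52, 60], "chol": [212, 258]})

toy_dict = build_mlbox_dict(toy_train, toy_test, "target")
print(sorted(toy_dict))
```

A hand-built dictionary skips the cleaning and target encoding that the Reader performs, which is one more reason to prefer the csv route when it is cheap enough.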

#### Documentation
MLBox documentation is high-level, so implementing things in practice is more difficult than it suggests. I could not find anything on the deep learning features.

## Heart Disease

#### A note on importing data
csv files for the two datasets in this project are saved at **/mnt/data/raw/**

#### A note on the train & test function
MLBox's **Reader** has a method called **train_test_split()**. It does not behave like the scikit-learn function of the same name, and it can take a little getting used to. It helps to imagine that the authors of MLBox built it as a tool for Kaggle competitions: the training set needs to have **y** in it and the test set should not. You're on your own for measuring accuracy against the test set, as it is assumed you'll find out the real answers later from an external source that is not part of the MLBox flow.

#### A note on categorical fields
MLBox tries to infer which columns are categorical. From what I can tell, it only looks at data type when doing so. This is a little annoying. Below, I had to take the extra step of mapping numeric values to text for each of the numeric/categorical columns so that MLBox will treat them as categorical.

#### Load the heart disease dataset

The raw data can be found in the project files at /mnt/data/raw/heart.csv

Attribute documentation:

    age: age in years
    sex: sex (1 = male; 0 = female)
    cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
    trestbps: resting blood pressure (in mm Hg on admission to the hospital)
    chol: serum cholesterol in mg/dl
    fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
    thalach: maximum heart rate achieved
    exang: exercise induced angina (1 = yes; 0 = no)
    oldpeak: ST depression induced by exercise relative to rest
    slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
    ca: number of major vessels (0-3) colored by fluoroscopy
    thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    target: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing

In [28]:
# column names
names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang', \
         'oldpeak','slope','ca','thal','target']

# load data from Domino project directory
hd_data = pd.read_csv("/mnt/data/raw/heart.csv", header=None, names=names)

In [29]:
# in case some data comes in as string, convert to numeric and coerce errors to NaN
for col in hd_data.columns:  # Iterate over columns
    hd_data[col] = pd.to_numeric(hd_data[col], errors='coerce')
    
# drop nulls
hd_data.dropna(inplace=True)

In [30]:
# function to force non-numeric data for categorical columns
def force_non_numeric(data, cols):
    for c in cols:
        data[c] = 'text_' + data[c].map(str)  
    return data

In [31]:
cat_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
hd_data = force_non_numeric(hd_data, cat_cols)
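
A quick sanity check of what the prefixing does, as a sketch on hypothetical toy data rather than the real file: the column stops being numeric, which is the property MLBox seems to key on.

```python
import pandas as pd

# toy frame standing in for the heart disease data
toy = pd.DataFrame({"cp": [1, 2, 4], "age": [63, 41, 57]})
print(pd.api.types.is_numeric_dtype(toy["cp"]))   # numeric before the conversion

# same idea as force_non_numeric(), applied to a single column
toy["cp"] = "text_" + toy["cp"].map(str)
print(toy["cp"].tolist())
print(pd.api.types.is_numeric_dtype(toy["cp"]))   # no longer numeric

# a float column keeps its decimal point in the label
float_labels = ("text_" + pd.Series([1.0, 2.0]).map(str)).tolist()
print(float_labels)
```

If a column was read as float, the labels come out as `text_1.0` rather than `text_1`. That is harmless here, since the labels only need to be consistent between train and test.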

In [32]:
# create MLBox random samples for train and test
hd_data_train = hd_data.sample(frac=0.7, replace=False, random_state=1)
hd_data_test = hd_data[~hd_data.isin(hd_data_train)].dropna()
hd_data_test_wo_target = hd_data_test.drop('target', axis=1)

hd_data_train.to_csv('/mnt/data/processed/hd_data_train.csv', index=False)
hd_data_test.to_csv('/mnt/data/processed/hd_data_test.csv', index=False)
hd_data_test_wo_target.to_csv('/mnt/data/processed/hd_data_test_wo_target.csv', index=False)
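
The `isin()` / `dropna()` idiom above works because `sample()` keeps the original index labels. An arguably more direct alternative is to drop the sampled index from the full frame. This is a sketch on hypothetical toy data, and it assumes the index labels are unique.

```python
import pandas as pd

# toy frame standing in for hd_data
toy = pd.DataFrame({"age": range(10), "target": [0, 1] * 5})

# 70% of the rows, sampled without replacement; index labels are preserved
toy_train = toy.sample(frac=0.7, replace=False, random_state=1)

# every row whose index label was not sampled becomes the test set
toy_test = toy.drop(toy_train.index)
toy_test_wo_target = toy_test.drop(columns="target")

print(len(toy_train), len(toy_test))
print(toy_test.dtypes.to_dict())
```

Dropping by index also leaves the column dtypes untouched, whereas masking with a boolean frame introduces NaN values first and can turn integer columns into floats.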

In [33]:
# the list of paths to your train datasets and test datasets
paths_hd = ["/mnt/data/processed/hd_data_train.csv", \
         "/mnt/data/processed/hd_data_test_wo_target.csv"]

# the name of the target you try to predict (classification or regression)
target_hd = "target"

#### Process the data

Pass the training set (with the target) and the test set (without the target) to the **train_test_split()** function. This automatically cleans both data sets.

Use **to_path** to keep your outputs organized. In this project I want everything in the results directory, so I use **/mnt/results**.

Note that after adding text to the numeric/categorical columns, MLBox now recognizes them as categorical.

In [34]:
# to read and preprocess your files
mlb_data_hd = Reader(sep=",", to_path = '/mnt/results').train_test_split(paths_hd, target_hd)


reading csv : hd_data_train.csv ...
cleaning data ...
CPU time: 0.042601823806762695 seconds

reading csv : hd_data_test_wo_target.csv ...
cleaning data ...
CPU time: 0.03788352012634277 seconds

> Number of common features : 13

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 8
> Number of training samples : 211
> Number of test samples : 91

> You have no missing values on train set...

> Task : classification
1.0    109
0.0    102
Name: target, dtype: int64

encoding target ...


#### Last processing note

After building the dictionary, we process the data with **Drift_thresholder**, which is meant to automatically drop ids and [drifting variables](https://github.com/AxeldeRomblay/MLBox/blob/master/docs/webinars/features.pdf) between the train and test datasets. In practice, I have found that it does not drop ids: the source code only seems to detect drift, and a randomly generated id field shows none.
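
Because `Drift_thresholder` reports `Deleted variables : []` even when an `id` column is present (as in the breast cancer data later on), one option is to remove identifier columns yourself before writing the csv files. A minimal sketch, in which `drop_id_columns` is a hypothetical helper and not part of MLBox:

```python
import pandas as pd


def drop_id_columns(df, id_cols=("id",)):
    """Return a copy of df without the listed identifier columns.

    Columns that are not present are skipped, so the same call works for
    datasets with and without an id field.
    """
    present = [c for c in id_cols if c in df.columns]
    return df.drop(columns=present)


# made-up rows, only to show the effect
toy = pd.DataFrame({"id": [101, 102, 103],
                    "radius_mean": [12.5, 14.0, 13.2],
                    "diagnosis": ["B", "M", "B"]})

print(drop_id_columns(toy).columns.tolist())
```

The helper returns a new frame and leaves its input alone, so it can be applied to the train and test samples just before the `to_csv()` calls.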

In [35]:
# drop IDs and useless columns
mlb_data_hd = Drift_thresholder(to_path='/mnt/results').fit_transform(mlb_data_hd)


computing drifts ...
CPU time: 0.1304154396057129 seconds

> Top 10 drifts

('age', 0.14152992087364535)
('thalach', 0.13771835705387803)
('thal', 0.08920237769704542)
('restecg', 0.0744704437239303)
('ca', 0.06710837228884792)
('sex', 0.046450382176387084)
('trestbps', 0.03950679497028964)
('oldpeak', 0.03157616833990917)
('cp', 0.027520736487102404)
('fbs', 0.023051030639217762)

> Deleted variables : []
> Drift coefficients dumped into directory : /mnt/results


#### Build the modeling routine

#### Defining the search criteria

MLBox gives you good control over the modeling algorithms and parameter settings to try.

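You define a search-space dictionary and pass it to the **optimise()** method of the **Optimiser** class, which returns the best set of hyper-parameters it found.

Then you pass those hyper-parameters and the data dictionary to the **fit_predict()** method of the **Predictor** class.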

In [36]:
space = {

        'ne__numerical_strategy' : {"space" : [0, 'mean']},

        'ce__strategy' : {"space" : ["label_encoding", "random_projection", \
                                     "entity_embedding"]},

        'fs__strategy' : {"space" : ["variance", "rf_feature_importance"]},
        'fs__threshold': {"search" : "choice", "space" : [0.1, 0.2, 0.3]},

        'est__strategy' : {"space" : ["LightGBM", "RandomForest", "ExtraTrees",\
                                      "Linear"]},
        'est__max_depth' : {"search" : "choice", "space" : [5,10,20]},
        'est__subsample' : {"search" : "uniform", "space" : [0.6,0.7]}

        }
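
The keys of the space dictionary follow a `step__parameter` pattern. Judging by the optimiser log, the prefixes `ne`, `ce`, `fs` and `est` correspond to the NA encoder, the categorical encoder, the feature selector and the estimator. The sketch below groups a space by step to make that structure visible. `group_space_by_step` and `STEP_NAMES` are hypothetical helpers written for this note, not MLBox functions.

```python
from collections import defaultdict

# pipeline steps, named after the labels printed in the optimiser log
STEP_NAMES = {
    "ne": "NA encoder",
    "ce": "categorical encoder",
    "fs": "feature selector",
    "est": "estimator",
}


def group_space_by_step(space):
    """Group 'step__parameter' keys of a search space by pipeline step."""
    grouped = defaultdict(dict)
    for key, spec in space.items():
        step, sep, param = key.partition("__")
        if not sep or step not in STEP_NAMES:
            raise KeyError(f"unexpected search-space key: {key!r}")
        grouped[STEP_NAMES[step]][param] = spec["space"]
    return dict(grouped)


example_space = {
    "ne__numerical_strategy": {"space": [0, "mean"]},
    "fs__threshold": {"search": "choice", "space": [0.1, 0.2, 0.3]},
    "est__max_depth": {"search": "choice", "space": [5, 10, 20]},
}

for step, params in group_space_by_step(example_space).items():
    print(step, params)
```

Running a check like this before a long optimisation is a cheap way to catch a mistyped prefix.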

In [37]:
%%time

best_hd = Optimiser(to_path = '../results').optimise(space, mlb_data_hd, max_evals = 10)

##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}   
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]

MEAN SCORE : neg_log_loss = -0.49375991086701737    
VARIANCE : 0.08285445396939733 (fold 1 = -0.41090545689762004, fold 2 = -0.5766143648364147)
CPU time: 0.8617739677429199 seconds                
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}                                
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.2}              
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.5896209935209158                               
VARIANCE : 0.1377512083396423 (fold

MEAN SCORE : neg_log_loss = -0.5518733951240642                               
VARIANCE : 0.06004658203846455 (fold 1 = -0.49182681308559967, fold 2 = -0.6119199771625288)
CPU time: 3.0330495834350586 seconds                                          
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}   
>>> CA ENCODER :{'strategy': 'random_projection'}                             
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.1}              
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 20, 'subsample': 0.6045062493390837, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 
MEAN SCORE : neg_log_loss = -0.7073377054062613                               
VARIANCE : 0.34851575890520126 (fold 1 = -0.35882194650106003, fold 2 = -1.0558534643114625)
CPU time: 0.214324951171875 seconds                                           
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}                             
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.1}              
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_st

MEAN SCORE : neg_log_loss = -0.4832465820704197                               
VARIANCE : 0.07856052659436771 (fold 1 = -0.404686055476052, fold 2 = -0.5618071086647874)
CPU time: 1.0462853908538818 seconds                                          
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}  
>>> CA ENCODER :{'strategy': 'label_encoding'}                               
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}             
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.5711533744167593                      

MEAN SCORE : neg_log_loss = -0.5896614808877094                              
VARIANCE : 0.13779169469416513 (fold 1 = -0.45186978619354423, fold 2 = -0.7274531755818745)
CPU time: 0.3496098518371582 seconds                                         
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}  
>>> CA ENCODER :{'strategy': 'label_encoding'}                               
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 5, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, '
MEAN SCORE : neg_log_loss = -0.4580266598958518                              
VARIANCE : 0.03461159117412102 (fold 1 = -0.42341506872173074, fold 2 = -0.4926382510699728)
CPU time: 1.1533076763153076 seconds                                         
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}  
>>> CA ENCODER :{'strategy': 'label_encoding'}                               
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 

MEAN SCORE : neg_log_loss = -0.45851525299008816                             
VARIANCE : 0.04524559855903626 (fold 1 = -0.4132696544310519, fold 2 = -0.5037608515491244)
CPU time: 1.0636494159698486 seconds                                         
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}                               
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 10, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state':

MEAN SCORE : neg_log_loss = -0.4821777973260123                              
VARIANCE : 0.04563708147096665 (fold 1 = -0.43654071585504567, fold 2 = -0.527814878796979)
CPU time: 1.110327959060669 seconds                                          
100%|██████████| 10/10 [00:09<00:00,  1.10it/s, best loss: 0.4580266598958518]


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BEST HYPER-PARAMETERS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

{'ce__strategy': 'label_encoding', 'est__max_depth': 5, 'est__strategy': 'ExtraTrees', 'est__subsample': 0.6726522327502583, 'fs__strategy': 'rf_feature_importance', 'fs__threshold': 0.1, 'ne__numerical_strategy': 0}
CPU times: user 8.15 s, sys: 308 ms, to
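
A note on reading this log. The `neg_log_loss` score is log loss with the sign flipped so that higher is better, and the `best loss` in the progress bar is the best mean score with the sign flipped back. The figure labelled `VARIANCE` matches the standard deviation of the two fold scores in this log, not their variance. The sketch below recomputes both numbers for the best trial using only the standard library. `binary_log_loss` is a hypothetical helper that illustrates the metric and is not MLBox's implementation.

```python
import math
import statistics


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels given P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# fold scores copied from the best trial (ExtraTrees, max_depth 5) in the log
fold_scores = [-0.42341506872173074, -0.4926382510699728]

mean_score = statistics.mean(fold_scores)
spread = statistics.pstdev(fold_scores)  # population standard deviation

print(mean_score)
print(spread)
print(binary_log_loss([1, 0], [0.9, 0.2]))
```

With only two folds the spread is simply half the gap between the fold scores, so it is a rough indication of stability at best.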

In [38]:
Predictor(to_path='/mnt/results').fit_predict(best_hd,mlb_data_hd)


fitting the pipeline ...


CPU time: 0.9488317966461182 seconds

> Feature importances dumped into directory : /mnt/results

predicting ...
CPU time: 0.11291813850402832 seconds

> Overview on predictions : 

        0.0       1.0  target_predicted
0  0.333017  0.666983                 1
1  0.069111  0.930889                 1
2  0.260245  0.739755                 1
3  0.161627  0.838373                 1
4  0.392079  0.607921                 1
5  0.277344  0.722656                 1
6  0.379687  0.620313                 1
7  0.093433  0.906567                 1
8  0.312530  0.687470                 1
9  0.450790  0.549210                 1

dumping predictions into directory : /mnt/results ...


<mlbox.prediction.predictor.Predictor at 0x7fa06573ebe0>
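
In the overview above, `target_predicted` looks like the class with the largest predicted probability. That is my reading of the output, not something I verified in the MLBox source. Since the probabilities are saved as well, you can apply your own decision threshold afterwards. The sketch below shows both on the first three rows of the overview. `most_likely_class` and `predict_with_threshold` are hypothetical helpers.

```python
# first three rows of the overview: P(class 0.0) and P(class 1.0)
rows = [(0.333017, 0.666983), (0.069111, 0.930889), (0.260245, 0.739755)]
class_labels = [0, 1]


def most_likely_class(probabilities, labels):
    """Return the label whose probability is largest."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best]


def predict_with_threshold(p_positive, threshold=0.5):
    """Predict 1 only when P(class 1) reaches the threshold."""
    return 1 if p_positive >= threshold else 0


argmax_predictions = [most_likely_class(r, class_labels) for r in rows]
strict_predictions = [predict_with_threshold(r[1], threshold=0.7) for r in rows]

print(argmax_predictions)
print(strict_predictions)
```

Raising the threshold trades recall for precision on the positive class, which may matter more than raw accuracy for a diagnosis task.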

## Breast Cancer

#### Load the breast cancer dataset

In [39]:
'''
Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)
'''

#column names
names = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', \
         'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', \
         'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', \
         'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', \
         'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', \
         'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', \
         'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', \
         'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst']

#load data from Domino project directory
bc_data = pd.read_csv("/mnt/data/raw/breast_cancer.csv", index_col=False, header=0, names=names)

#create MLBox random samples for train and test
bc_data_train = bc_data.sample(frac=0.7, replace=False, random_state=1)
bc_data_test = bc_data[~bc_data.isin(bc_data_train)].dropna()
bc_data_test_wo_target = bc_data_test.drop('diagnosis', axis=1)

bc_data_train.to_csv('/mnt/data/processed/bc_data_train.csv', index=False)
bc_data_test.to_csv('/mnt/data/processed/bc_data_test.csv', index=False)
bc_data_test_wo_target.to_csv('/mnt/data/processed/bc_data_test_wo_target.csv', index=False)

In [40]:
# the list of paths to your train datasets and test datasets
paths_bc = ["/mnt/data/processed/bc_data_train.csv", \
         "/mnt/data/processed/bc_data_test_wo_target.csv"]

# the name of the target you try to predict (classification or regression)
target_bc = "diagnosis"

#### Process the data

In [41]:
# to read and preprocess your files
mlb_data_bc = Reader(sep=",", to_path = '/mnt/results').train_test_split(paths_bc, target_bc)


reading csv : bc_data_train.csv ...
cleaning data ...
CPU time: 0.07143115997314453 seconds

reading csv : bc_data_test_wo_target.csv ...
cleaning data ...
CPU time: 0.07071280479431152 seconds

> Number of common features : 31

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 0
> Number of numerical features: 31
> Number of training samples : 398
> Number of test samples : 171

> You have no missing values on train set...

> Task : classification
B    249
M    149
Name: diagnosis, dtype: int64

encoding target ...


In [42]:
# drop IDs and useless columns
mlb_data_bc = Drift_thresholder(to_path='/mnt/results').fit_transform(mlb_data_bc)


computing drifts ...
CPU time: 0.2409067153930664 seconds

> Top 10 drifts

('concave_points_se', 0.12935058328578597)
('texture_worst', 0.10813953488372086)
('smoothness_worst', 0.10493782180395828)
('concavity_se', 0.10426929448886013)
('smoothness_mean', 0.09663330331548314)
('symmetry_worst', 0.07588524015425979)
('compactness_worst', 0.07550199698904914)
('radius_se', 0.07485890464635059)
('area_worst', 0.07297396696203329)
('smoothness_se', 0.06312650805326214)

> Deleted variables : []
> Drift coefficients dumped into directory : /mnt/results


#### Optimise the space and fit the model

In [43]:
%%time

best_bc = Optimiser(to_path = '/mnt/results').optimise(space, mlb_data_bc, max_evals = 10)

##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'entity_embedding'}    
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 5, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]

MEAN SCORE : neg_log_loss = -0.16601760645184827    
VARIANCE : 0.0005732844143375299 (fold 1 = -0.16544432203751072, fold 2 = -0.16659089086618578)
CPU time: 0.9426901340484619 seconds                
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}                                
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}              
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 10, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
 10%|

MEAN SCORE : neg_log_loss = -0.15384259712537718                              
VARIANCE : 0.010918148696575411 (fold 1 = -0.14292444842880175, fold 2 = -0.16476074582195258)
CPU time: 0.8942985534667969 seconds                                          
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}                                
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.2}              
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_

MEAN SCORE : neg_log_loss = -0.15344310487726132                              
VARIANCE : 0.014118581089442568 (fold 1 = -0.13932452378781876, fold 2 = -0.1675616859667039)
CPU time: 0.8772115707397461 seconds                                          
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}   
>>> CA ENCODER :{'strategy': 'random_projection'}                             
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2} 
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 5, 'subsample': 0.6214889635325321, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 

MEAN SCORE : neg_log_loss = -0.1527134976930898                               
VARIANCE : 0.009199758701414928 (fold 1 = -0.14351373899167488, fold 2 = -0.16191325639450474)
CPU time: 1.1181492805480957 seconds                                          
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}  
>>> CA ENCODER :{'strategy': 'random_projection'}                            
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 10, 'subsample': 0.698105747330585, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': No

MEAN SCORE : neg_log_loss = -0.304993852096757                               
VARIANCE : 0.06357587958766733 (fold 1 = -0.24141797250908964, fold 2 = -0.3685697316844243)
CPU time: 0.3453526496887207 seconds                                         
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'entity_embedding'}                             
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.6698380210671552                   


In [44]:
Predictor(to_path='/mnt/results').fit_predict(best_bc,mlb_data_bc)


fitting the pipeline ...


CPU time: 0.9251341819763184 seconds

> Feature importances dumped into directory : /mnt/results

predicting ...
CPU time: 0.10654902458190918 seconds

> Overview on predictions : 

        B       M diagnosis_predicted
0  0.0000  1.0000                   M
1  0.0950  0.9050                   M
2  0.2100  0.7900                   M
3  0.0100  0.9900                   M
4  0.9800  0.0200                   B
5  0.0625  0.9375                   M
6  0.0025  0.9975                   M
7  0.0225  0.9775                   M
8  0.0000  1.0000                   M
9  0.2825  0.7175                   M

dumping predictions into directory : /mnt/results ...


<mlbox.prediction.predictor.Predictor at 0x7f9fceac2630>

## Print Accuracy and Save to Domino Stats File

Saving stats to this file [allows Domino to track and trend them in the Experiment Manager](https://support.dominodatalab.com/hc/en-us/articles/204348169-Diagnostic-statistics-with-dominostats-json) when this notebook is run as a batch or scheduled job.

In [45]:
# this predictions file is the output of the Predictor's fit_predict() call above
bc_pred = pd.read_csv('/mnt/results/diagnosis_predictions.csv')
y_bc_pred = bc_pred['diagnosis_predicted']

# these are the answers from the file stored in the project
bc_test = pd.read_csv('/mnt/data/processed/bc_data_test.csv')
y_bc_test = bc_test['diagnosis']

# this predictions file is the output of the Predictor's fit_predict() call above
hd_pred = pd.read_csv('/mnt/results/target_predictions.csv')
y_hd_pred = hd_pred['target_predicted']

# these are the answers from the file stored in the project
hd_test = pd.read_csv('/mnt/data/processed/hd_data_test.csv')
y_hd_test = hd_test['target']

In [46]:
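# import the metric explicitly: a bare "import sklearn" does not guarantee
# that the sklearn.metrics submodule has been loaded
from sklearn.metrics import accuracy_score

hd_acc = accuracy_score(y_hd_test, y_hd_pred)
bc_acc = accuracy_score(y_bc_test, y_bc_pred)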

print('bc ', bc_acc)
print('hd ', hd_acc)

bc  0.9707602339181286
hd  0.9010989010989011
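
Accuracy here is just the fraction of test rows where the prediction equals the true label. The sketch below spells that out with a hypothetical `accuracy` helper and backs out the number of correct predictions from the scores above, using the test set sizes reported by the Reader (91 and 171 rows).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# made-up labels, only to show the calculation
toy_acc = accuracy(["M", "B", "B", "M"], ["M", "B", "M", "M"])
print(toy_acc)

# number of correct predictions implied by the printed scores
hd_correct = round(0.9010989010989011 * 91)
bc_correct = round(0.9707602339181286 * 171)
print(hd_correct, "of 91 heart disease rows")
print(bc_correct, "of 171 breast cancer rows")
```

Note that the comparison relies on the predictions file keeping the same row order as the test csv. I am assuming that is the case.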


#### Save to Domino

In [47]:
import json
with open('/mnt/dominostats.json', 'w') as f:
    f.write(json.dumps( {"HD_ACC": hd_acc, "BC_ACC": bc_acc}))