## AutoPrognosis API Tutorial

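A demonstration of AutoPrognosis (AP) functionality and usage.

This tutorial shows how to use [AutoPrognosis](https://arxiv.org/abs/1802.07207). The cells below load a dataset stored under `cardio_data/`; a commented-out alternative uses scikit-learn's breast cancer dataset.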

See [installation instructions](../../doc/install.md) to install the dependencies.

Import the dependencies and load the dataset:

In [1]:
import pandas as pd
import numpy as np
import initpath_ap
initpath_ap.init_sys_path()
import utilmlab

from sklearn.datasets import load_breast_cancer

#df = load_breast_cancer()
#X_ = pd.DataFrame(df.data)
#Y_ = pd.DataFrame(df.target)



In [2]:
# Load features and outcomes, both indexed by the 'eid' column
X_ = pd.read_feather('cardio_data/trained_df_noNan')
X_.set_index('eid', inplace=True)
Y_ = pd.read_feather('cardio_data/trained_df_noNan_outcome')
Y_.set_index('eid', inplace=True)
#Y_=Y_['outcome']

In [3]:
# Build a small random subset: join features and outcome, shuffle the rows, keep the first 20,000
df_all = X_.join(Y_)
df_all = df_all.reindex(np.random.permutation(df_all.index))
df_all = df_all[:20000]
#df_all.to_csv('cardio_data/small_cardio_data.csv')
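
The cell above shuffles with `np.random.permutation`, which draws from NumPy's global random state, so each run can select a different subset. A minimal sketch of a reproducible alternative follows, assuming a fixed seed passed to `DataFrame.sample`. The toy frame, the helper name `subsample`, and the seed value are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the joined feature/outcome frame (hypothetical data).
toy = pd.DataFrame(
    {"age-0": np.arange(100), "outcome": np.arange(100) % 2},
    index=pd.Index(np.arange(1000, 1100), name="eid"),
)


def subsample(df, n, seed=0):
    """Return up to n rows drawn without replacement, in shuffled order."""
    return df.sample(n=min(n, len(df)), random_state=seed)


small = subsample(toy, n=20)
again = subsample(toy, n=20)

# The same seed selects the same rows in the same order.
print(small.index.equals(again.index))
```

With the real data, the equivalent call would be `subsample(df_all, n=20000)`.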

In [8]:
# Keep seven features plus the outcome column
df_all = df_all[['gender', 'age-0', 'average-sys-0', 'history-of-diabetes', 'hypertention-medication-0', 'smoker',
                 'average-BMI-0', 'outcome']]
#df_all.to_csv('cardio_data/small_cardio_data_7_feature.csv')

In [None]:
# Split into features (X_) and outcome (Y_)
X_ = df_all.drop(columns=['outcome'])
Y_ = df_all[['outcome']]

## Import the AutoPrognosis library

In [4]:
import model

## Run the model from the command line

--it : total number of iterations for each fold of the n-fold cross-validation

--cv : n for n-fold cross-validation

--nstage : size of the pipeline: 0: auto (selects imputation when missing data is detected),
        1: classifiers only,
        2: feature processing + classifier,
        3: imputers + feature processors + classifier

--ensemble : include ensembles when fitting

--modelindexes : indexes of the models to consider:

1 Random Forest,
2 Gradient Boosting, 
3 XGBoost, 
4 Adaboost, 
5 Bagging, 
6 Bernoulli Naive Bayes, 
7 Gauss Naive Bayes, 
8 Multinomial Naive Bayes, 
9 Logistic Regression, 
10 Perceptron, 
11 Decision Trees, 
12 QDA, 
13 LDA, 
14 KNN, 
15 Linear SVM, 
16 Neural Network
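
The options above can be assembled programmatically. The following is a minimal sketch, not part of the AutoPrognosis code base: `build_command` is a hypothetical helper, the file paths are placeholders, and the bracketed comma-separated form for several indexes is an assumption that mirrors the `[0]` form used in this tutorial's command.

```python
import shlex

# Index-to-name table copied from the list above.
MODEL_INDEXES = {
    1: "Random Forest", 2: "Gradient Boosting", 3: "XGBoost", 4: "Adaboost",
    5: "Bagging", 6: "Bernoulli Naive Bayes", 7: "Gauss Naive Bayes",
    8: "Multinomial Naive Bayes", 9: "Logistic Regression", 10: "Perceptron",
    11: "Decision Trees", 12: "QDA", 13: "LDA", 14: "KNN", 15: "Linear SVM",
    16: "Neural Network",
}


def build_command(input_csv, target, output_dir, it=10, cv=2, nstage=1,
                  ensemble=1, model_indexes=(0,)):
    """Assemble the argument list for autoprognosis.py (hypothetical helper)."""
    # 0 is the value used in this tutorial's command. It is not in the table
    # above, so it is passed through without a name check.
    unknown = [i for i in model_indexes if i != 0 and i not in MODEL_INDEXES]
    if unknown:
        raise ValueError(f"unknown model indexes: {unknown}")
    # Assumption: several indexes are written as a bracketed list, e.g. [1,3].
    indexes = "[" + ",".join(str(i) for i in model_indexes) + "]"
    return [
        "python3", "autoprognosis.py",
        "-i", input_csv, "--target", target, "-o", output_dir,
        "--it", str(it), "--cv", str(cv), "--nstage", str(nstage),
        "--ensemble", str(ensemble), "--modelindexes", indexes,
    ]


cmd = build_command("data.csv", "outcome", "outputs")
# shlex.join quotes the bracketed value so a shell does not treat it as a glob.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` instead of pasting a string into a shell sidesteps quoting issues altogether.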

In [10]:
!python3 autoprognosis.py -i ../../../AutoPrognosisThings/cardio_data/small_cardio_data_7_feature.csv \
--target outcome -o ../../../AutoPrognosisThings/outputs --it 10 --cv 2 --nstage 1 --ensemble 1 --modelindexes [0]

['R[write to console]: Loading required package: missForest',
 '',
 'R[write to console]: Loading required package: randomForest',
 '',
 'R[write to console]: randomForest 4.6-14',
 '',
 'R[write to console]: Type rfNews() to see new features/changes/bug fixes.',
 '',
 'R[write to console]: Loading required package: foreach',
 '',
 'R[write to console]: Loading required package: itertools',
 '',
 'R[write to console]: Loading required package: iterators',
 '',
 'R[write to console]: Loading required package: softImpute',
 '',
 'R[write to console]: Loading required package: Matrix',
 '',
 'R[write to console]: Loaded softImpute 1.4',
 '',
 '',
 'Iteration number: 1 8s (8s) (78s), Current pipelines:  [[[ Random Forest ]]], [[[ Gradient Boosting ]]], [[[ XGBoost ]]], BO objective: 0.0',
 'Iteration number: 2 10s (5s) (50s), Current pipelines:  [[[ Random Forest ]]], [[[ Gradient Boosting ]]], [[[ XGBoost ]]], BO objective: -1.0',
 'Iteration number: 3 13s (4s) (42s), Current pipelines:  

In [11]:
!python3 autoprognosis_report.py -i ../../../AutoPrognosisThings/outputs

Score

classifier      aucroc 0.723
classifier      aucprc 0.068
ensemble        aucroc 0.727
ensemble        aucprc 0.068
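
The scores above report two metrics, `aucroc` and `aucprc`, that sit on different scales: a classifier that ranks at random has an expected AUROC of 0.5, while its average precision is close to the fraction of positive samples. The pure-Python sketch below shows how the two quantities are commonly computed. It is an illustration on made-up labels and scores, not the code used by `autoprognosis_report.py`, and it assumes there are no tied scores when computing average precision.

```python
def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical labels and scores: 2 positives among 10 samples.
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.3, 0.05, 0.15, 0.25]

print("aucroc:", auroc(y_true, scores))
print("aucprc:", average_precision(y_true, scores))
print("positive fraction:", sum(y_true) / len(y_true))
```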

Report

best score single pipeline (while fitting)    0.725
model_names_single_pipeline                   [ Gradient Boosting ]
best ensemble score (while fittng)            0.738
ensemble_pipelines                            ['[ Gradient Boosting ]', '[ XGBoost ]', '[ XGBoost ]']
ensemble_pipelines_weight                     [0.2413297261706806, 0.38940120006055917, 0.3692690737687602]
optimisation_metric                           aucroc
hyperparameter_properties                     [{'name': 'Gradient Boosting', 'hyperparameters': {'model': "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n              learning_rate=0.5, loss='deviance', max_depth=1,\n              max_features=None, max_leaf_nodes=None,\n              min_impurity_decrease=0.0, min_impurity_split=None,\n              min_samples_leaf=1, min_samples_split=2,\n    
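
The report lists one weight per ensemble member. The tutorial does not show how these weights are applied, so the sketch below assumes the common scheme in which the ensemble probability is the weighted average of the members' probabilities. The weights are copied from the report above; the member probabilities and the helper name `ensemble_probability` are hypothetical.

```python
# Weights copied from 'ensemble_pipelines_weight' in the report above.
weights = [0.2413297261706806, 0.38940120006055917, 0.3692690737687602]


def ensemble_probability(member_probs, weights):
    """Weighted average of per-model probabilities for a single sample."""
    if len(member_probs) != len(weights):
        raise ValueError("need exactly one probability per ensemble member")
    total = sum(weights)  # normalise in case the weights do not sum to 1
    return sum(w * p for w, p in zip(weights, member_probs)) / total


# Hypothetical positive-class probabilities from the three members.
member_probs = [0.02, 0.05, 0.03]
combined = ensemble_probability(member_probs, weights)
print(f"sum of weights: {sum(weights):.6f}")
print(f"ensemble probability: {combined:.4f}")
```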

## Run the model from Python

In [5]:
metric = 'aucprc'
acquisition_type = 'MPI'  # LCB is the default and preferred option but generates excessive warnings; MPI is a good compromise.
# I changed kernel_freq=100 and Gibbs_iter=100
AP_mdl = model.AutoPrognosis_Classifier(
    metric=metric, CV=5, num_iter=3, kernel_freq=10, ensemble=False,
    ensemble_size=3, Gibbs_iter=100, burn_in=50, num_components=3, 
    acquisition_type=acquisition_type, is_nan=False, use_imputer=False, use_preprocessor=False)
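
The comment in the cell above names two Bayesian-optimisation acquisition functions: LCB (lower confidence bound) and MPI (maximum probability of improvement). The sketch below gives their textbook forms for a minimisation problem, given a surrogate model's predicted mean and standard deviation at a candidate point. It is not the code AutoPrognosis runs; the function names and the `kappa` and `jitter` defaults are assumptions chosen for illustration.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lcb(mean, std, kappa=2.0):
    """Lower confidence bound: smaller values mark more promising candidates."""
    return mean - kappa * std


def mpi(mean, std, best, jitter=0.01):
    """Probability that the candidate beats the best (lowest) value so far."""
    if std <= 0.0:
        # No predictive uncertainty: improvement is either certain or impossible.
        return 1.0 if mean < best - jitter else 0.0
    return normal_cdf((best - jitter - mean) / std)


# Two hypothetical candidates: (predicted mean, predicted standard deviation).
candidates = {"A": (-0.70, 0.05), "B": (-0.60, 0.30)}
best_so_far = -0.72

for name, (mean, std) in candidates.items():
    print(name, "LCB:", round(lcb(mean, std), 3),
          "MPI:", round(mpi(mean, std, best_so_far), 3))
```

LCB rewards uncertainty directly through the `kappa * std` term, while MPI only asks how likely any improvement is, however small.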

In [6]:
AP_mdl.fit(X_, Y_)

R[write to console]: Loading required package: missForest

R[write to console]: Loading required package: randomForest

R[write to console]: randomForest 4.6-14

R[write to console]: Type rfNews() to see new features/changes/bug fixes.

R[write to console]: Loading required package: foreach

R[write to console]: Loading required package: itertools

R[write to console]: Loading required package: iterators

R[write to console]: Loading required package: softImpute

R[write to console]: Loading required package: Matrix

R[write to console]: Loaded softImpute 1.4




[ Gradient Boosting ]
[ MultinomialNaiveBayes ]
[ LinearSVM ]



[ NeuralNet ]
[ MultinomialNaiveBayes ]


Iteration number: 1 43s (43s) (129s), Current pipelines:  [[[ NeuralNet ]]], [[[ MultinomialNaiveBayes ]]], [[[ DecisionTrees ]]], BO objective: 0.0


[ DecisionTrees ]
[ Gradient Boosting ]
[ AdaBoost ]
[ LDA ]


Iteration number: 2 90s (45s) (135s), Current pipelines:  [[[ Gradient Boosting ]]], [[[ AdaBoost ]]], [[[ LDA ]]], BO objective: -1.0000000000000004


[ XGBoost ]
[ Bagging ]
[ LDA ]


Iteration number: 3 146s (49s) (146s), Current pipelines:  [[[ XGBoost ]]], [[[ Bagging ]]], [[[ LDA ]]], BO objective: -1.4142135623730951





**The best model is: **[ LDA ]

[{'name': 'initial', 'aucprc': 0.057172614799005705},
 {'aucprc': 0.042970511167461346,
  'aucroc': 0.6587761425272423,
  'name': '[ NeuralNet ]',
  'cv': 5,
  'iter': 0,
  'component_idx': 0,
  'hyperparameter_properties': [{'name': 'NeuralNet',
    'hyperparameters': {'model': "MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,\n       beta_2=0.999, early_stopping=False, epsilon=1e-08,\n       hidden_layer_sizes=(50, 50), learning_rate='constant',\n       learning_rate_init=0.001, max_iter=200, momentum=0.9,\n       nesterovs_momentum=True, power_t=0.5, random_state=None,\n       shuffle=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,\n       verbose=False, warm_start=False)"}}],
  'model': '<pipelines.basePipeline.basePipeline object at 0x1a25fe9550>'},
 {'aucprc': 0.026046956029114793,
  'aucroc': 0.48341496651951343,
  'name': '[ MultinomialNaiveBayes ]',
  'cv': 5,
  'iter': 0,
  'component_idx': 1,
  'hyperparameter_properties': [{'name': 'M

## Computing model predictions

The output is a tuple: the first element holds the predictions of a single model, the second element holds the predictions of the ensemble.

In [7]:
AP_mdl.predict(X_)

(array([[0.98759929, 0.01240071],
        [0.9912874 , 0.0087126 ],
        [0.96629033, 0.03370967],
        ...,
        [0.95964161, 0.04035839],
        [0.96429803, 0.03570197],
        [0.98756443, 0.01243557]]),
 array([[0.98759929, 0.01240071],
        [0.9912874 , 0.0087126 ],
        [0.96629033, 0.03370967],
        ...,
        [0.95964161, 0.04035839],
        [0.96429803, 0.03570197],
        [0.98756443, 0.01243557]]))
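
The call returns a tuple of two arrays, each with two columns per sample. A minimal sketch of unpacking that tuple and turning probabilities into labels follows. It uses three rows copied from the output above and assumes that column 1 holds the probability of the positive class (outcome = 1), as in scikit-learn's `predict_proba`; the threshold value is hypothetical.

```python
import numpy as np

# Three rows copied from the output above; the real arrays hold one row per sample.
single = np.array([[0.98759929, 0.01240071],
                   [0.9912874, 0.0087126],
                   [0.96629033, 0.03370967]])
ensemble = single.copy()
prediction = (single, ensemble)  # same structure as the tuple shown above

single_probs, ensemble_probs = prediction

# Assumption: column 1 is the positive class (outcome = 1).
positive = ensemble_probs[:, 1]

# Hypothetical decision threshold; choose it to suit the application.
threshold = 0.03
labels = (positive >= threshold).astype(int)

print(positive)
print(labels)
```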

## Compute performance via multi-fold cross-validation

In [12]:
model.evaluate_ens(X_, Y_, AP_mdl, n_folds=5, visualize=True)
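
The internals of `model.evaluate_ens` are not shown in this tutorial. As a point of comparison, the sketch below runs a stratified 5-fold cross-validation with scikit-learn and reports the same two metrics, `aucroc` and `aucprc`. The synthetic data, the logistic-regression stand-in model, and the helper name `cross_validate_scores` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data with seven features (hypothetical).
X, y = make_classification(n_samples=500, n_features=7, weights=[0.9, 0.1],
                           random_state=0)


def cross_validate_scores(X, y, n_folds=5, seed=0):
    """Fit on each training fold and score the held-out fold."""
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        scores.append({
            "aucroc": roc_auc_score(y[test_idx], prob),
            "aucprc": average_precision_score(y[test_idx], prob),
        })
    return scores


scores = cross_validate_scores(X, y)
for name in ("aucroc", "aucprc"):
    values = [fold[name] for fold in scores]
    print(f"{name}: {np.mean(values):.3f} +/- {np.std(values):.3f}")
```

Stratified folds keep the share of positive samples roughly equal across folds, which matters when the outcome is rare.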

## Visualize data...

In [13]:
AP_mdl.visualize_data(X_)

## Visualize the model...

In [12]:
AP_mdl.APReport()

***Ensemble Report***

**----------------------**

**Rank0:   [ XGBoost ],   Ensemble weight: 0.337141036259983**

**----------------------**

{'model_list': [<models.classifiers.XGboost object at 0x1a3466ae90>], 'explained': '[ *GBoost is an open-source software library which provides the gradient boosting framework for C++, Java, Python, R, and Julia.* ]', 'image_name': None, 'classes': None, 'num_stages': 1, 'pipeline_stages': ['classifier'], 'name': '[ XGBoost ]', 'analysis_mode': None, 'analysis_type': None}


**_____________________________________________**

[ *GBoost is an open-source software library which provides the gradient boosting framework for C++, Java, Python, R, and Julia.* ]

**Rank1:   [ AdaBoost ],   Ensemble weight: 0.33191778877869144**

**----------------------**

{'model_list': [<models.classifiers.Adaboost object at 0x1a34668790>], 'explained': "[ *AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.* ]", 'image_name': None, 'classes': None, 'num_stages': 1, 'pipeline_stages': ['classifier'], 'name': '[ AdaBoost ]', 'analysis_mode': None, 'analysis_type': None}


**_____________________________________________**

[ *AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.* ]

**Rank2:   [ XGBoost ],   Ensemble weight: 0.33094117496132563**

**----------------------**

{'model_list': [<models.classifiers.XGboost object at 0x1a376f1e90>], 'explained': '[ *GBoost is an open-source software library which provides the gradient boosting framework for C++, Java, Python, R, and Julia.* ]', 'image_name': None, 'classes': None, 'num_stages': 1, 'pipeline_stages': ['classifier'], 'name': '[ XGBoost ]', 'analysis_mode': None, 'analysis_type': None}


**_____________________________________________**

[ *GBoost is an open-source software library which provides the gradient boosting framework for C++, Java, Python, R, and Julia.* ]

**----------------------**

***Kernel Report***

**Component 0**

**Members: ['XGBoost', 'Gradient Boosting', 'Random Forest', 'Neural Network']**

  Mat52.       |               value  |  constraints  |  priors
  variance     |  0.9999990030869095  |      +ve      |        
  lengthscale  |  0.9031481051397683  |      +ve      |        


**Component 1**

**Members: ['Multinomial Naive Bayes', 'Bernoulli Naive Bayes', 'Bagging', 'Adaboost']**

  Mat52.       |               value  |  constraints  |  priors
  variance     |  0.9720888366936934  |      +ve      |        
  lengthscale  |   5.127609133763431  |      +ve      |        


**Component 2**

**Members: ['Linear SVM', 'KNN', 'Decision Trees', 'Perceptron', 'Logistic Regression', 'Gauss Naive Bayes', 'QDA', 'LDA']**

  Mat52.       |               value  |  constraints  |  priors
  variance     |   46.03231114719721  |      +ve      |        
  lengthscale  |  21.029530765403038  |      +ve      |        


{'best_score_single_pipeline': 0.06294085111764766,
 'model_names_single_pipeline': '[ XGBoost ]',
 'ensemble_score': 0.06402736930011581,
 'ensemble_pipelines': ['[ XGBoost ]', '[ AdaBoost ]', '[ XGBoost ]'],
 'ensemble_pipelines_weight': [0.337141036259983,
  0.33191778877869144,
  0.33094117496132563],
 'optimisation_metric': 'aucprc',
 'hyperparameter_properties': [{'name': 'XGBoost',
   'hyperparameters': {'model': "XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,\n       colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n       importance_type='gain', interaction_constraints=None,\n       learning_rate=0.06145542076570746, max_delta_step=0, max_depth=2,\n       min_child_weight=1, missing=nan, monotone_constraints=None,\n       n_estimators=253, n_jobs=0, num_parallel_tree=1,\n       objective='binary:logistic', random_state=0, reg_alpha=0,\n       reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,\n       validate_parameters=False, verbosi