## AutoPrognosis API Tutorial

A demonstration of AutoPrognosis (AP) functionality and operation.

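This tutorial shows how to use [AutoPrognosis](https://arxiv.org/abs/1802.07207). The command-line example uses a small cardio dataset stored as a CSV file, and the Python API example uses the breast cancer dataset bundled with scikit-learn.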

See [installation instructions](../../doc/install.md) to install the dependencies.

Set up the import path and import the helper modules:

In [19]:
import pandas as pd
import initpath_ap
initpath_ap.init_sys_path()
import utilmlab

## Import the AutoPrognosis library

In [20]:
import model

## Run the model from command line

--it : total number of iterations for each fold of the n-fold cross-validation

--cv : n for n-fold cross validation

--nstage : size of the pipeline:
        0: auto (adds imputation when missing data is detected),
        1: classifiers only,
        2: feature processing + classifier,
        3: imputers + feature processors + classifier
        
--ensemble : include ensembles when fitting. Setting it to 0 currently raises an assertion error; this needs to be investigated.

--modelindexes : list of indexes of the models to include:

1 Random Forest,
2 Gradient Boosting, 
3 XGBoost, 
4 Adaboost, 
5 Bagging, 
6 Bernoulli Naive Bayes, 
7 Gauss Naive Bayes, 
8 Multinomial Naive Bayes, 
9 Logistic Regression, 
10 Perceptron, 
11 Decision Trees, 
12 QDA, 
13 LDA, 
14 KNN, 
15 Linear SVM, 
16 Neural Network
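
With this many flags, it can be easier to assemble the command as a Python list than to maintain backslash continuations by hand. The following is a minimal sketch using only the standard library. The helper `build_autoprognosis_command` is hypothetical (it is not part of AutoPrognosis); the flag names and example values are the ones used in this tutorial.

```python
import shlex


def build_autoprognosis_command(input_csv, target, output_dir, iterations,
                                cv_folds, nstage, model_indexes, num_components):
    """Assemble the argument list for autoprognosis.py (hypothetical helper)."""
    return [
        "python3", "autoprognosis.py",
        "-i", input_csv,
        "--target", target,
        "-o", output_dir,
        "--it", str(iterations),
        "--cv", str(cv_folds),
        "--nstage", str(nstage),
        # --modelindexes takes several values, one argument per index
        "--modelindexes", *[str(i) for i in model_indexes],
        "--num_components", str(num_components),
    ]


cmd = build_autoprognosis_command(
    input_csv="../../../AutoPrognosisThings/cardio_data/small_cardio_data_7_feature.csv",
    target="outcome",
    output_dir="../../../AutoPrognosisThings/outputs",
    iterations=3,
    cv_folds=2,
    nstage=1,
    model_indexes=[0, 1],
    num_components=1,
)

# Print a shell-safe, single-line version of the command.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to launch the run from Python, which avoids shell quoting and line-continuation problems.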

In [7]:
!python3 autoprognosis.py \
-i ../../../AutoPrognosisThings/cardio_data/small_cardio_data_7_feature.csv \
--target outcome \
-o ../../../AutoPrognosisThings/outputs \
--it 3 \
--cv 2 \
--nstage 1 \
--modelindexes 0 1 \
--num_components 1

R[write to console]: Loading required package: missForest

R[write to console]: Loading required package: randomForest

R[write to console]: randomForest 4.6-14

R[write to console]: Type rfNews() to see new features/changes/bug fixes.

R[write to console]: Loading required package: foreach

R[write to console]: Loading required package: itertools

R[write to console]: Loading required package: iterators

R[write to console]: Loading required package: softImpute

R[write to console]: Loading required package: Matrix

R[write to console]: Loaded softImpute 1.4


[ Random Forest ]
[ Gradient Boosting ]
Iteration number: 1 1s (1s) (4s), Current pipelines:  [[[ Gradient Boosting ]]], BO objective: 0.0
[ Random Forest ]
Iteration number: 2 5s (2s) (7s), Current pipelines:  [[[ Random Forest ]]], BO objective: -1.0
[ Random Forest ]
Iteration number:

In [8]:
!python3 autoprognosis_report.py -i ../../../AutoPrognosisThings/outputs

Score

classifier      aucroc 0.678
classifier      aucprc 0.051
ensemble        aucroc 0.679
ensemble        aucprc 0.053

Report

best score single pipeline (while fitting)    0.676
model_names_single_pipeline                   [ Random Forest ]
best ensemble score (while fittng)            0.697
ensemble_pipelines                            ['[ Random Forest ]', '[ Random Forest ]', '[ Random Forest ]']
ensemble_pipelines_weight                     [0.4670247717753407, 0.2970553824277082, 0.23591984579695116]
optimisation_metric                           aucroc
hyperparameter_properties                     [{'name': 'Random Forest', 'hyperparameters': {'model': "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',\n            max_depth=None, max_features='auto', max_leaf_nodes=None,\n            min_impurity_decrease=0.0, min_impurity_split=None,\n            min_samples_leaf=1, min_samples_split=2,\n            min_weight_fraction_leaf=0.0,
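
The `ensemble_pipelines_weight` values in the report sum to 1, which suggests that each pipeline contributes to the ensemble in proportion to its weight. The sketch below shows one way such weights could be applied, assuming the ensemble score is a weighted average of the per-pipeline probabilities. That assumption is not confirmed by the report, and the per-pipeline probabilities used here are made-up illustrative values.

```python
# Weights copied from the ensemble_pipelines_weight line of the report.
weights = [0.4670247717753407, 0.2970553824277082, 0.23591984579695116]


def weighted_average(probabilities, weights):
    """Combine per-pipeline probabilities for one sample into a single score."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one probability per weight")
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


# Hypothetical predicted probabilities from the three pipelines for one sample.
pipeline_probs = [0.9, 0.6, 0.3]
ensemble_prob = weighted_average(pipeline_probs, weights)
print(round(ensemble_prob, 3))
```

Dividing by the sum of the weights keeps the result a valid probability even if the weights are not perfectly normalized.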

## Run the model with a few iterations

In [None]:
from sklearn.datasets import load_breast_cancer

# load_breast_cancer() returns a Bunch with .data (features) and .target (labels)
data = load_breast_cancer()
X_ = pd.DataFrame(data.data)
Y_ = pd.DataFrame(data.target)

In [None]:
metric = 'aucprc'
# LCB is the default and preferred acquisition type, but it generates
# excessive warnings; MPI is a good compromise.
acquisition_type = 'MPI'
AP_mdl = model.AutoPrognosis_Classifier(
    metric=metric, CV=5, num_iter=3, kernel_freq=100, ensemble=True,
    ensemble_size=3, Gibbs_iter=100, burn_in=50, num_components=3,
    acquisition_type=acquisition_type)

AP_mdl.fit(X_, Y_)

[ mean, Gradient Boosting ]
[ most_frequent, MultinomialNaiveBayes ]
[ median, LinearSVM ]




[ mean, XGBoost ]
[ mean, BernoullinNaiveBayes ]
[ most_frequent, LinearSVM ]


Iteration number: 1 3s (3s) (8s), Current pipelines:  [[[ median, XGBoost ]]], [[[ mean, BernoullinNaiveBayes ]]], [[[ median, LinearSVM ]]], BO objective: -0.9891936728238395


[ median, XGBoost ]
[ median, BernoullinNaiveBayes ]
[ median, QDA ]


Iteration number: 2 5s (3s) (8s), Current pipelines:  [[[ mean, XGBoost ]]], [[[ mean, BernoullinNaiveBayes ]]], [[[ median, QDA ]]], BO objective: -0.9999999999999997


[ missForest, Random Forest ]
[ mean, Bagging ]
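
The two metric names used in this tutorial, `aucroc` and `aucprc`, refer to the area under the ROC curve and the area under the precision-recall curve. As a rough illustration of what they measure, here is a pure-Python sketch. It is not AutoPrognosis's own implementation, which may differ in details such as tie handling or interpolation, and the labels and scores are toy values.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve as average precision."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return precision_sum / hits


# Toy example: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc_value = auroc(labels, scores)
prc_value = average_precision(labels, scores)
print(roc_value, prc_value)
```

On imbalanced data the precision-recall metric is usually much lower than the ROC metric, which is consistent with the `aucroc` and `aucprc` values in the report for the cardio data.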


## Computing model predictions

##### ~~First element in the output is the predictions of a single model, the second element is the prediction of the ensemble~~

In [None]:
AP_mdl.predict(X_)

## Compute performance via multi-fold cross-validation

In [None]:
model.evaluate_ens(X_, Y_, AP_mdl, n_folds=5, visualize=True)
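
The internals of `evaluate_ens` are not shown in this tutorial. Assuming `n_folds=5` follows the usual k-fold scheme, the data is split into five parts and each part is held out once for testing while the rest is used for fitting. The sketch below illustrates only that splitting step with the standard library; `k_fold_indices` is a hypothetical helper, and the real function may shuffle or stratify differently.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for plain k-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Every sample appears in exactly one test fold, so averaging the per-fold scores uses each sample for evaluation once.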

## Visualize data...

In [None]:
AP_mdl.visualize_data(X_)

## Visualize the model...

In [None]:
AP_mdl.APReport()