# AutoML with Tabular data - Inference

In machine learning, *inference* refers to the process of using a trained predictor to make predictions on data with unknown labels.  First let's produce such a predictor, utilizing bagging/stacking to boost its accuracy:

In [1]:
from autogluon import TabularPrediction as task
from IPython.display import display
import numpy as np
import pprint

subsample_size = 1000 # experiment with larger values to try AutoGluon with larger datasets 
time_limits = 120 # experiment with larger values to get a better sense of the achievable accuracy

train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/diabetes/train.csv')
train_data = train_data.head(subsample_size) # subsample data for faster demo
label_column = 'readmitted'

predictor = task.fit(train_data=train_data, label=label_column, auto_stack=True, time_limits=time_limits)

test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/diabetes/test.csv')
test_data = test_data.head(subsample_size) # subsample data for faster demo
y_test = test_data[label_column] # ground-truth target values
test_data_nolab = test_data.drop(labels=[label_column],axis=1) # delete label column to prove we're not cheating

y_pred = predictor.predict(test_data_nolab)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/diabetes/train.csv | Columns = 47 / 47 | Rows = 61059 -> 61059
No output_directory specified. Models will be saved in: AutogluonModels/ag-20200801_082821/
Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to AutogluonModels/ag-20200801_082821/
AutoGluon Version:  0.0.13b20200731
Train Data Rows:    1000
Train Data Columns: 47
Preprocessing data ...
Here are the 3 unique label values in your data:  ['NO', '>30', '<30']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Train Data Class Count: 3
Feature Generator processed 1000 data points with 33 features
Original Features (raw dtypes):
	object features: 25
	float64 features: 1
	int64 features: 7
Original Features (inferred dtypes):
	object featur

	Ran out of time, early stopping on iteration 89. Best iteration is:
	[7]	train_set's multi_error: 0.215556	valid_set's multi_error: 0.39
	Ran out of time, early stopping on iteration 83. Best iteration is:
	[16]	train_set's multi_error: 0.114444	valid_set's multi_error: 0.39
	Ran out of time, early stopping on iteration 85. Best iteration is:
	[22]	train_set's multi_error: 0.0466667	valid_set's multi_error: 0.45
	Ran out of time, early stopping on iteration 106. Best iteration is:
	[3]	train_set's multi_error: 0.287778	valid_set's multi_error: 0.45
	Ran out of time, early stopping on iteration 111. Best iteration is:
	[11]	train_set's multi_error: 0.17	valid_set's multi_error: 0.46
	0.574	 = Validation accuracy score
	14.71s	 = Training runtime
	0.41s	 = Validation runtime
Fitting model: CatboostClassifier_STACKER_l1 ... Training model for up to 0.32s of the 0.31s of remaining time.
	Time limit exceeded... Skipping CatboostClassifier_STACKER_l1.
Fitting model: NeuralNetClassifier_STAC

While the above predictions are produced by the model AutoGluon believes to be the most accurate, recall we can also evaluate the predictive performance of every model AutoGluon has trained:

In [2]:
test_perf = predictor.leaderboard(test_data)
display(test_perf)


| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMClassifier_STACKER_l1 | 0.551 | 0.574 | 9.478982 | 8.36081 | 62.9059 | 0.498886 | 0.41034 | 14.70801 | 1 | True |
| 1 | weighted_ensemble_k0_l2 | 0.551 | 0.574 | 9.481465 | 8.363896 | 63.292353 | 0.002483 | 0.003086 | 0.386453 | 2 | True |
| 2 | CatboostClassifier_STACKER_l0 | 0.55 | 0.582 | 0.113835 | 0.385099 | 13.741464 | 0.113835 | 0.385099 | 13.741464 | 0 | True |
| 3 | RandomForestClassifierGini_STACKER_l1 | 0.542 | 0.553 | 10.614846 | 9.339597 | 58.561704 | 1.63475 | 1.389127 | 10.363814 | 1 | True |
| 4 | RandomForestClassifierGini_STACKER_l0 | 0.541 | 0.554 | 1.283559 | 1.277415 | 6.900042 | 1.283559 | 1.277415 | 6.900042 | 0 | True |
| 5 | ExtraTreesClassifierEntr_STACKER_l0 | 0.54 | 0.527 | 1.700516 | 1.275319 | 6.131904 | 1.700516 | 1.275319 | 6.131904 | 0 | True |
| 6 | ExtraTreesClassifierGini_STACKER_l0 | 0.538 | 0.509 | 1.646748 | 1.295418 | 6.231692 | 1.646748 | 1.295418 | 6.231692 | 0 | True |
| 7 | RandomForestClassifierEntr_STACKER_l0 | 0.538 | 0.553 | 1.682435 | 1.279882 | 7.926347 | 1.682435 | 1.279882 | 7.926347 | 0 | True |
| 8 | RandomForestClassifierEntr_STACKER_l1 | 0.538 | 0.555 | 10.943728 | 9.287851 | 58.306758 | 1.963632 | 1.337381 | 10.108868 | 1 | True |
| 9 | ExtraTreesClassifierEntr_STACKER_l1 | 0.535 | 0.538 | 10.471045 | 9.220299 | 54.190841 | 1.490949 | 1.269828 | 5.992951 | 1 | True |


We can see that different models not only differ in predictive accuracy, but also take differing amounts of time to compute predictions (`pred_time_test`). Low-latency inference may be important for ML deployed in settings where decisions must be made in real time, one datapoint at a time. While model ensembles are typically the most accurate, they generally also have higher latency and a larger memory footprint than individual models.
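
One way to act on this trade-off is to pick the fastest model whose test accuracy is within a small tolerance of the best one. The sketch below does this in plain Python on a few rows copied from the leaderboard above. The helper name `fastest_within_tolerance` and the tolerance value are illustrative assumptions, not part of AutoGluon.

```python
# Rows copied from the leaderboard above: (model, score_test, pred_time_test in seconds).
leaderboard_rows = [
    ("LightGBMClassifier_STACKER_l1", 0.551, 9.478982),
    ("weighted_ensemble_k0_l2", 0.551, 9.481465),
    ("CatboostClassifier_STACKER_l0", 0.550, 0.113835),
    ("RandomForestClassifierGini_STACKER_l1", 0.542, 10.614846),
    ("RandomForestClassifierGini_STACKER_l0", 0.541, 1.283559),
]


def fastest_within_tolerance(rows, tolerance):
    """Return the fastest row whose score is within `tolerance` of the best score."""
    best_score = max(score for _, score, _ in rows)
    candidates = [row for row in rows if best_score - row[1] <= tolerance]
    return min(candidates, key=lambda row: row[2])


# Hypothetical tolerance: accept up to 0.005 lower test accuracy for faster predictions.
chosen = fastest_within_tolerance(leaderboard_rows, tolerance=0.005)
print(chosen[0])
```

With these rows, the choice falls on a level-0 model that is far faster than the stacked models while scoring almost the same on the test data.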

When we call `predict()`, AutoGluon automatically predicts with the model that displayed the best validation performance (this is typically the weighted ensemble at the topmost stack layer). Recall that we can instead request predictions from a specific model as follows:

In [3]:
predictor.predict(test_data[:20], model='LightGBMClassifier_STACKER_l0')

array(['NO', 'NO', 'NO', 'NO', 'NO', '>30', '>30', '>30', 'NO', 'NO',
       'NO', 'NO', '>30', 'NO', 'NO', 'NO', '>30', 'NO', '>30', 'NO'],
      dtype=object)

In [4]:
gbm_stacker = predictor._trainer.load_model('LightGBMClassifier_STACKER_l0')
display(gbm_stacker.get_info())

{'name': 'LightGBMClassifier_STACKER_l0',
 'model_type': 'StackerEnsembleModel',
 'problem_type': 'multiclass',
 'eval_metric': 'accuracy',
 'stopping_metric': 'accuracy',
 'fit_time': 7.051509618759155,
 'predict_time': 0.3066377639770508,
 'val_score': 0.585,
 'hyperparameters': {'max_models': 25, 'max_models_per_type': 5},
 'hyperparameters_fit': {},
 'hyperparameters_nondefault': [],
 'memory_size': 2744,
 'bagged_info': {'child_type': 'LGBModel',
  'num_child_models': 10,
  'child_model_names': ['LightGBMClassifier_fold_0',
   'LightGBMClassifier_fold_1',
   'LightGBMClassifier_fold_2',
   'LightGBMClassifier_fold_3',
   'LightGBMClassifier_fold_4',
   'LightGBMClassifier_fold_5',
   'LightGBMClassifier_fold_6',
   'LightGBMClassifier_fold_7',
   'LightGBMClassifier_fold_8',
   'LightGBMClassifier_fold_9'],
  '_n_repeats': 1,
  '_k_per_n_repeat': [10],
  '_random_state': 0,
  'low_memory': True,
  'bagged_mode': True,
  'max_memory_size': 2654226,
  'min_memory_size': 727045},
 's

In this case, the **LightGBMClassifier_STACKER_l0** model is a bagged ensemble (since we are using stacking), which, recall, actually consists of 10 different LightGBM models trained on different train/validation folds. We can collapse this bag of 10 models into a single **LightGBMClassifier** model that is fit to the full dataset. Since this model has no validation data to gauge performance after each boosting round, it will early-stop based on the average of the early-stopping points previously used by the 10 LightGBM models in the bag (each of which was trained with validation-based early-stopping).
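
As a rough illustration of that stopping rule, the sketch below averages per-fold early-stopping rounds to obtain a fixed number of boosting rounds for the refit model. The fold values are made up for illustration and the rounding behaviour is an assumption; AutoGluon's internal implementation may differ.

```python
from statistics import mean

# Hypothetical best iterations found by validation-based early stopping in each of 10 folds.
best_iterations_per_fold = [7, 16, 22, 3, 11, 9, 14, 5, 18, 15]


def rounds_for_refit(best_iterations):
    """Average the per-fold stopping points; always use at least one boosting round."""
    return max(1, round(mean(best_iterations)))


num_boost_round = rounds_for_refit(best_iterations_per_fold)
print(num_boost_round)  # 12
```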

In [5]:
refit_model_map = predictor.refit_full(model='LightGBMClassifier_STACKER_l0')
print("Name of refit-full model corresponding to the previous bagged ensemble:")
display(refit_model_map)

Fitting model: LightGBMClassifier_FULL_STACKER_l0 ...
	0.35s	 = Training runtime


Name of refit-full model corresponding to the previous bagged ensemble:


{'LightGBMClassifier_STACKER_l0': 'LightGBMClassifier_FULL_STACKER_l0'}

In [6]:
single_gbm_preds = predictor.predict(test_data[:20], model='LightGBMClassifier_FULL_STACKER_l0')
print(single_gbm_preds)

single_gbm = predictor._trainer.load_model('LightGBMClassifier_FULL_STACKER_l0')
display(single_gbm.get_info())

['NO' 'NO' 'NO' 'NO' 'NO' '>30' '>30' '>30' 'NO' 'NO' 'NO' 'NO' '>30' 'NO'
 'NO' 'NO' '>30' 'NO' '>30' 'NO']


{'name': 'LightGBMClassifier_FULL_STACKER_l0',
 'model_type': 'StackerEnsembleModel',
 'problem_type': 'multiclass',
 'eval_metric': 'accuracy',
 'stopping_metric': 'accuracy',
 'fit_time': 0.3497910499572754,
 'predict_time': None,
 'val_score': None,
 'hyperparameters': {'max_models': 25, 'max_models_per_type': 5},
 'hyperparameters_fit': {},
 'hyperparameters_nondefault': [],
 'memory_size': 2393,
 'bagged_info': {'child_type': 'LGBModel',
  'num_child_models': 1,
  'child_model_names': ['LightGBMClassifier_FULL'],
  '_n_repeats': 1,
  '_k_per_n_repeat': [1],
  '_random_state': 0,
  'low_memory': True,
  'bagged_mode': False,
  'max_memory_size': 271369,
  'min_memory_size': 271369},
 'stacker_info': {'num_base_models': 0,
  'base_model_names': [],
  'use_orig_features': True},
 'children_info': {'LightGBMClassifier_FULL': {'name': 'LightGBMClassifier_FULL',
   'model_type': 'LGBModel',
   'problem_type': 'multiclass',
   'eval_metric': 'accuracy',
   'stopping_metric': 'accuracy',


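This single GBM model requires roughly 10x less memory than its bagged counterpart (compare the `max_memory_size` values reported by `get_info()` above: 2654226 vs. 271369) and can produce predictions roughly 10x faster as well, since only one LightGBM model is evaluated instead of 10.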

## Model Distillation 

While computationally favorable, single individual models will usually have lower accuracy than weighted/stacked/bagged ensembles. *Model Distillation* offers one way to retain the computational benefits of a single model, while enjoying some of the accuracy boost that comes with ensembling. The idea is to train the individual model (which we can call the *student*) to mimic the predictions of the full stack ensemble (the *teacher*). Rather than fitting the individual model to the training data target-values, one can fit this student model to an alternative dataset whose target values are the predictions from the ensemble teacher ([Bucila et al., 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf)).
In classification tasks, the teacher provides its predicted class-probabilities as targets for the student model, which may encode richer information about class-similarities or label-noise. We can also blend the teacher predictions with the original target-values from the training data for further improvement ([Hinton et al., 2014](https://arxiv.org/abs/1503.02531)).
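
A minimal sketch of that blending idea: the student's training target is a convex combination of the teacher's predicted class-probabilities and the one-hot encoded true label. The mixing weight `alpha` and the example probabilities are illustrative assumptions; this is not necessarily how AutoGluon forms its targets.

```python
CLASSES = ["NO", ">30", "<30"]


def blended_target(teacher_probs, true_label, alpha=0.5):
    """Mix teacher class-probabilities with the one-hot true label.

    alpha=1.0 uses only the teacher's soft predictions; alpha=0.0 uses only the hard label.
    """
    one_hot = [1.0 if cls == true_label else 0.0 for cls in CLASSES]
    return [alpha * p + (1.0 - alpha) * y for p, y in zip(teacher_probs, one_hot)]


# Hypothetical teacher output for one training row whose true label is '>30'.
teacher_probs = [0.6, 0.3, 0.1]
target = blended_target(teacher_probs, true_label=">30", alpha=0.5)
print(target)
```

The blended target still sums to one, so it can be used wherever a probability vector is expected.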

Distillation in AutoGluon follows the strategy proposed in ([Fakoor et al., 2020](https://arxiv.org/abs/2006.14284)), which we call *augmented distillation* and depict below.

<img src="files/images/distillfigure.png" width="700" height="400">

The overall augmented distillation procedure involves 4 steps:

1. **Fit** stack ensemble to the original training data via: `fit(train_data, label_column, auto_stack=True)`

2. Use the training data features to **generate** a synthetic dataset of features (we call this the *augmented data*). The augmented data should ideally follow a similar underlying feature distribution as the training data, and can either be generated via simple feature-permutations/perturbations of the training data as in ([Bucila et al, 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf)) or via a generative model trained on the training data features as in ([Fakoor et al, 2020](https://arxiv.org/abs/2006.14284)).

3. Use the stack-ensemble to **predict** on each augmented datapoint and treat these as the corresponding target values. For classification tasks, these are predicted class-probabilities rather than predicted class-labels.

4. Merge the original training data and the augmented data into one large dataset and **fit** the single student model to this dataset.
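
The four steps can be sketched end-to-end on toy data. Everything here is a stand-in chosen for illustration: `toy_teacher_proba` replaces the fitted stack ensemble, the augmentation resamples each feature column independently (in the spirit of the simple permutation approach, not the generative model of Fakoor et al.), the original rows keep their one-hot true labels (an assumption), and the final student fit is left out. None of these names are AutoGluon APIs.

```python
import random


def toy_teacher_proba(row):
    """Stand-in for the stack ensemble (step 1): returns [P(class 0), P(class 1)]."""
    p1 = min(1.0, max(0.0, 0.1 * row[0] + 0.05 * row[1]))
    return [1.0 - p1, p1]


def augment(train_rows, num_new, rng):
    """Step 2: build synthetic rows by sampling each feature from its own column."""
    columns = list(zip(*train_rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(num_new)]


rng = random.Random(0)
train_rows = [(1, 2), (3, 4), (5, 6), (7, 8)]
train_labels = [0, 0, 1, 1]

# Step 2: generate several times as many synthetic rows as training rows.
augmented_rows = augment(train_rows, num_new=5 * len(train_rows), rng=rng)

# Step 3: label the synthetic rows with the teacher's predicted probabilities.
augmented_targets = [toy_teacher_proba(row) for row in augmented_rows]

# Step 4: merge original data (one-hot labels, an assumption) with the augmented data.
original_targets = [[1.0 - y, float(y)] for y in train_labels]
distill_rows = train_rows + augmented_rows
distill_targets = original_targets + augmented_targets

print(len(distill_rows), "rows available to fit the student")
```

A student model would then be fit to `distill_rows` and `distill_targets` with a loss that accepts probability vectors as targets.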

In step 2, we typically generate 5-10x as many augmented datapoints as there are training samples, so the student model can better mimic the teacher by being trained to replicate its predictions at a vast number of points in the feature-space. In AutoGluon, this is done as follows:

In [7]:
student_name = predictor.distill(time_limits=time_limits, hyperparameters={'GBM':{}}, verbosity=3)
print(f"Name of distillated model: {student_name[0]}")

Distilling with teacher_preds=soft, augment_method=spunge ...
SPUNGE: Augmenting training data with 4000 synthetic samples for distillation...
Distilling with each of these student models: ['LightGBMClassifier_DSTL']
Fitting model: LightGBMClassifier_DSTL ... Training model for up to 120.0s of the 120.0s of remaining time.


[50]	train_set's soft_log_loss: -0.821886	valid_set's soft_log_loss: -0.866148
[100]	train_set's soft_log_loss: -0.810469	valid_set's soft_log_loss: -0.853696
[150]	train_set's soft_log_loss: -0.806772	valid_set's soft_log_loss: -0.847813
[200]	train_set's soft_log_loss: -0.804357	valid_set's soft_log_loss: -0.844715
[250]	train_set's soft_log_loss: -0.802466	valid_set's soft_log_loss: -0.844028
[300]	train_set's soft_log_loss: -0.800868	valid_set's soft_log_loss: -0.841656
[350]	train_set's soft_log_loss: -0.79947	valid_set's soft_log_loss: -0.840694
[400]	train_set's soft_log_loss: -0.798206	valid_set's soft_log_loss: -0.839248
[450]	train_set's soft_log_loss: -0.79706	valid_set's soft_log_loss: -0.837473
[500]	train_set's soft_log_loss: -0.795989	valid_set's soft_log_loss: -0.836912
[550]	train_set's soft_log_loss: -0.795029	valid_set's soft_log_loss: -0.836592
[600]	train_set's soft_log_loss: -0.794121	valid_set's soft_log_loss: -0.835849
[650]	train_set's soft_log_loss: -0.793275	

	15.52s	 = Training runtime
	0.12s	 = Validation runtime
	0.66	 = Validation accuracy score


Name of distillated model: LightGBMClassifier_DSTL


Above, `hyperparameters` allows us to specify which types of models to consider as students for distillation, as well as their hyperparameter-values. For instance, if we want really small student models, we can specify hyperparameter-values that limit model size. We can predict with the resulting distilled model or compare its accuracy with the ensemble models previously trained during `fit()`:

In [8]:
preds_student = predictor.predict(test_data[:20], model=student_name[0])
print(preds_student)

perf = predictor.leaderboard(test_data, silent=True)
display(perf)

['NO' 'NO' 'NO' 'NO' '>30' '>30' '>30' 'NO' '>30' '>30' 'NO' 'NO' 'NO'
 'NO' 'NO' 'NO' 'NO' '>30' '>30' 'NO']


| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMClassifier_STACKER_l1 | 0.551 | 0.574 | 10.829331 | 8.36081 | 62.9059 | 0.386916 | 0.41034 | 14.70801 | 1 | True |
| 1 | weighted_ensemble_k0_l2 | 0.551 | 0.574 | 10.833116 | 8.363896 | 63.292353 | 0.003785 | 0.003086 | 0.386453 | 2 | True |
| 2 | CatboostClassifier_STACKER_l0 | 0.55 | 0.582 | 0.178802 | 0.385099 | 13.741464 | 0.178802 | 0.385099 | 13.741464 | 0 | True |
| 3 | RandomForestClassifierGini_STACKER_l1 | 0.542 | 0.553 | 11.78719 | 9.339597 | 58.561704 | 1.344775 | 1.389127 | 10.363814 | 1 | True |
| 4 | RandomForestClassifierGini_STACKER_l0 | 0.541 | 0.554 | 1.697313 | 1.277415 | 6.900042 | 1.697313 | 1.277415 | 6.900042 | 0 | True |
| 5 | ExtraTreesClassifierEntr_STACKER_l0 | 0.54 | 0.527 | 1.969712 | 1.275319 | 6.131904 | 1.969712 | 1.275319 | 6.131904 | 0 | True |
| 6 | LightGBMClassifier_FULL_STACKER_l0 | 0.539 | | 0.044287 | | 0.349791 | 0.044287 | | 0.349791 | 0 | True |
| 7 | RandomForestClassifierEntr_STACKER_l0 | 0.538 | 0.553 | 1.640646 | 1.279882 | 7.926347 | 1.640646 | 1.279882 | 7.926347 | 0 | True |
| 8 | ExtraTreesClassifierGini_STACKER_l0 | 0.538 | 0.509 | 2.380385 | 1.295418 | 6.231692 | 2.380385 | 1.295418 | 6.231692 | 0 | True |
| 9 | RandomForestClassifierEntr_STACKER_l1 | 0.538 | 0.555 | 11.862324 | 9.287851 | 58.306758 | 1.419909 | 1.337381 | 10.108868 | 1 | True |


## Understanding the importance of each Feature

Often, high accuracy on some test data may not be enough for us to fully trust that our trained predictor will lead to the right decisions when deployed. 
We may also want to understand which features the predictor relies on to make these accurate predictions. It is not uncommon that a dataset contains spurious correlations due to how it was collected. For example, radiologists discovered one accurate ML classifier was basing its cancer-diagnoses on the spurious presence of a ruler in tumor images ([Patel, 2017](https://www.thedailybeast.com/why-doctors-arent-afraid-of-better-more-efficient-ai-diagnosing-cancer)).

One way to quantify how much each feature individually contributes to an (already-trained) predictor's accuracy is via the method of *permutation-shuffling* ([Parr, 2018](https://explained.ai/rf-importance/)). Here, we randomly permute one column's values (corresponding to measurements of one feature) across the rows of the dataset, and measure the resulting drop in predictive performance when the predictor is asked to predict on the resulting data. The size of this drop is considered the feature's importance-score. This way of scoring feature-importance is often more trustworthy than alternatives that make many approximations or are based on oft-unrealistic assumptions. For our trained AutoGluon stack-ensemble predictor, we can compute these scores like this:

In [9]:
feature_importances = predictor.feature_importance(test_data)
print("Feature importance scores:")
display(feature_importances)

Computing raw permutation importance for 46 features on weighted_ensemble_k0_l1 ...
	25.7s	= Expected runtime
	25.41s	= Actual runtime


Feature importance scores:


diag_1                      0.029
medical_specialty           0.012
number_inpatient            0.010
diag_2                      0.006
num_procedures              0.006
num_medications             0.003
metformin                   0.002
admission_type_id           0.002
gender                      0.001
glipizide                   0.001
admission_source_id         0.001
number_emergency            0.000
chlorpropamide              0.000
payer_code                  0.000
max_glu_serum               0.000
A1Cresult                   0.000
weight                      0.000
nateglinide                 0.000
number_outpatient           0.000
diabetesMed                 0.000
examide                     0.000
troglitazone                0.000
metformin-pioglitazone      0.000
metformin-rosiglitazone     0.000
glimepiride-pioglitazone    0.000
glipizide-metformin         0.000
glyburide-metformin         0.000
citoglipton                 0.000
acetohexamide               0.000
tolazamide    

The top features in this list contribute most to AutoGluon's accuracy (for predicting when/if a patient will be readmitted to the hospital). Features with non-positive importance scores hardly contribute to the predictor's accuracy, at least not on an individual basis; these features may theoretically still provide useful predictive signal through interaction effects with other features' values.
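
To make the permutation-shuffling procedure concrete, here is a small self-contained sketch on synthetic data. The toy predictor, the data, and the function names are assumptions made for illustration; AutoGluon's own `feature_importance()` may differ in details such as the metric used or how shuffles are repeated.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature_idx, rng):
    """Drop in accuracy after shuffling one feature column across rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled_col = [r[feature_idx] for r in rows]
    rng.shuffle(shuffled_col)
    permuted = [
        r[:feature_idx] + (v,) + r[feature_idx + 1:]
        for r, v in zip(rows, shuffled_col)
    ]
    return baseline - accuracy(predict, permuted, labels)


def toy_predict(row):
    """Stand-in predictor that looks only at feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)


rng = random.Random(0)
rows = [(rng.random(), rng.random()) for _ in range(200)]
labels = [toy_predict(r) for r in rows]

for idx in range(2):
    score = permutation_importance(toy_predict, rows, labels, idx, rng)
    print(f"feature {idx}: importance = {score:.3f}")
```

Shuffling the feature the predictor depends on lowers accuracy noticeably, while shuffling the ignored feature leaves it unchanged, which corresponds to an importance score of zero.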

If reducing inference latency matters to us, we can also train a more-efficient predictor that only operates on a small subset of the most useful features:

In [10]:
feature_subset = list(feature_importances[feature_importances > 0].index) + [label_column]
train_data_small = train_data[feature_subset]
display(train_data_small)

predictor_small = task.fit(train_data=train_data_small, label=label_column, time_limits=time_limits)
predictor_small.leaderboard(test_data[feature_subset])

Unnamed: 0,diag_1,medical_specialty,number_inpatient,diag_2,num_procedures,num_medications,metformin,admission_type_id,gender,glipizide,admission_source_id,readmitted
0,250.83,Pediatrics-Endocrinology,0,?,0,1,No,"""6""",Female,No,"""1""",NO
1,276,?,0,250.01,0,18,No,"""1""",Female,No,"""7""",>30
2,648,?,1,250,5,13,No,"""1""",Female,Steady,"""7""",NO
3,8,?,0,250.43,1,16,No,"""1""",Male,No,"""7""",NO
4,197,?,0,157,0,8,No,"""1""",Male,Steady,"""7""",NO
...,...,...,...,...,...,...,...,...,...,...,...,...
995,38,InternalMedicine,1,788,1,19,No,"""1""",Male,No,"""7""",<30
996,250.13,Pediatrics-CriticalCare,0,?,0,8,No,"""2""",Male,No,"""1""",NO
997,414,Cardiology,0,411,0,16,No,"""1""",Male,No,"""7""",>30
998,410,Cardiology,0,285,5,24,No,"""1""",Female,No,"""7""",>30


No output_directory specified. Models will be saved in: AutogluonModels/ag-20200801_083345/
Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to AutogluonModels/ag-20200801_083345/
AutoGluon Version:  0.0.13b20200731
Train Data Rows:    1000
Train Data Columns: 12
Preprocessing data ...
Here are the 3 unique label values in your data:  ['NO', '>30', '<30']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Train Data Class Count: 3
Feature Generator processed 1000 data points with 11 features
Original Features (raw dtypes):
	object features: 8
	int64 features: 3
Original Features (inferred dtypes):
	object features: 8
	int features: 3
Generated Features (special dtypes):
Processed Features (raw dtypes):
	int features: 3
	category features: 8
Processed Featu


| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMClassifier | 0.546 | 0.62 | 0.03311 | 0.013984 | 2.231284 | 0.03311 | 0.013984 | 2.231284 | 0 | True |
| 1 | CatboostClassifier | 0.533 | 0.635 | 0.012914 | 0.009126 | 2.005353 | 0.012914 | 0.009126 | 2.005353 | 0 | True |
| 2 | weighted_ensemble_k0_l1 | 0.533 | 0.635 | 0.018172 | 0.010346 | 2.424684 | 0.005258 | 0.00122 | 0.419331 | 1 | True |
| 3 | RandomForestClassifierEntr | 0.519 | 0.525 | 0.126994 | 0.119581 | 1.0556 | 0.126994 | 0.119581 | 1.0556 | 0 | True |
| 4 | RandomForestClassifierGini | 0.513 | 0.535 | 0.127619 | 0.123259 | 0.893335 | 0.127619 | 0.123259 | 0.893335 | 0 | True |
| 5 | LightGBMClassifierCustom | 0.484 | 0.59 | 0.162309 | 0.018341 | 2.712698 | 0.162309 | 0.018341 | 2.712698 | 0 | True |
| 6 | ExtraTreesClassifierGini | 0.483 | 0.525 | 0.163947 | 0.132366 | 0.842483 | 0.163947 | 0.132366 | 0.842483 | 0 | True |
| 7 | ExtraTreesClassifierEntr | 0.482 | 0.52 | 0.176688 | 0.116681 | 0.776156 | 0.176688 | 0.116681 | 0.776156 | 0 | True |
| 8 | KNeighborsClassifierUnif | 0.456 | 0.495 | 0.107943 | 0.107785 | 0.003167 | 0.107943 | 0.107785 | 0.003167 | 0 | True |
| 9 | KNeighborsClassifierDist | 0.436 | 0.455 | 0.110001 | 0.114218 | 0.003697 | 0.110001 | 0.114218 | 0.003697 | 0 | True |


## Practical ML Deployment with AutoGluon

Here's a practical guide on how to use AutoGluon to easily turn raw tabular data into deployed models for applications where inference-latency/costs matter. For demonstration, we'll consider a setting where the predictions will be made on one datapoint at a time (a common pattern in settings where inference-latency really matters). For such settings, here is how we recommend you train with AutoGluon:

In [11]:
model_folder = 'agModels'
predictor = task.fit(train_data=train_data, label=label_column, time_limits=time_limits, output_directory=model_folder,
                     presets=['good_quality_faster_inference_only_refit', 'optimize_for_deployment'])

Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to agModels/
AutoGluon Version:  0.0.13b20200731
Train Data Rows:    1000
Train Data Columns: 47
Preprocessing data ...
Here are the 3 unique label values in your data:  ['NO', '>30', '<30']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Train Data Class Count: 3
Feature Generator processed 1000 data points with 33 features
Original Features (raw dtypes):
	object features: 25
	float64 features: 1
	int64 features: 7
Original Features (inferred dtypes):
	object features: 25
	float features: 1
	int features: 7
Generated Features (special dtypes):
Processed Features (raw dtypes):
	float features: 1
	int features: 7
	category features: 25
Processed Features:
	float features: 1
	int features: 7
	category featu

Deleting model weighted_ensemble_k0_l1. All files under agModels/models/weighted_ensemble_k0_l1/ will be removed.
Deleting model RandomForestClassifierGini_STACKER_l1. All files under agModels/models/RandomForestClassifierGini_STACKER_l1/ will be removed.
Deleting model RandomForestClassifierEntr_STACKER_l1. All files under agModels/models/RandomForestClassifierEntr_STACKER_l1/ will be removed.
Deleting model ExtraTreesClassifierGini_STACKER_l1. All files under agModels/models/ExtraTreesClassifierGini_STACKER_l1/ will be removed.
Deleting model ExtraTreesClassifierEntr_STACKER_l1. All files under agModels/models/ExtraTreesClassifierEntr_STACKER_l1/ will be removed.
Deleting model LightGBMClassifier_STACKER_l1. All files under agModels/models/LightGBMClassifier_STACKER_l1/ will be removed.
Deleting model CatboostClassifier_STACKER_l1. All files under agModels/models/CatboostClassifier_STACKER_l1/ will be removed.
Deleting model weighted_ensemble_k0_l2. All files under agModels/models/we

Note that AutoGluon will save its models to the **agModels/** folder, which we can then easily move to another server and reload the predictor for deployments. We used the [`presets` argument](https://autogluon.mxnet.io/api/autogluon.task.html#autogluon.task.TabularPrediction.fit) which offers an easy interface to tell AutoGluon whether you care more about accuracy above all else, or whether computational considerations like inference-latency/memory-footprint matter in your applications.
For this demonstration, we specified two presets:
- **good_quality_faster_inference_only_refit**: tells AutoGluon we care about fast inference
- **optimize_for_deployment**: deletes all additional files beyond the key model-files needed to produce predictions. With this preset, some functionality in the `Predictor` may no longer be available, but the key `load()`, `predict()`, `predict_proba()` methods will still work.

If you require even lower latency, we recommend additionally specifying the [`fit()` argument](https://autogluon.mxnet.io/api/autogluon.task.html#autogluon.task.TabularPrediction.fit) `hyperparameters = 'light'` or `hyperparameters = 'very_light'`. AutoGluon will then train models with hyperparameter-settings that improve their efficiency. You may also specify such hyperparameter-settings manually, or try [`distill()`](https://autogluon.mxnet.io/api/autogluon.task.html#autogluon.task.tabular_prediction.TabularPredictor.distill).

Here's how one can deploy the trained predictor on a new server:

In [12]:
predictor = task.load(model_folder) # unnecessary here, just demonstrates how one would reload the predictor in practice
predictor._learner.persist_trainer()
predictor.info()

{'path': 'agModels/',
 'label': 'readmitted',
 'time_fit_preprocessing': 0.22698187828063965,
 'time_fit_training': None,
 'time_fit_total': None,
 'time_limit': 120,
 'random_seed': 0,
 'version': '0.0.13b20200731',
 'time_train_start': 1596270850.380117,
 'num_rows_train': 1000,
 'num_cols_train': 33,
 'num_classes': 3,
 'problem_type': 'multiclass',
 'eval_metric': 'accuracy',
 'stopping_metric': 'accuracy',
 'best_model': 'CatboostClassifier_FULL_STACKER_l1',
 'best_model_score_val': None,
 'best_model_stack_level': 1,
 'num_models_trained': 6,
 'num_bagging_folds': 10,
 'max_stack_level': 2,
 'max_core_stack_level': 1,
 'model_stack_info': defaultdict(<function autogluon.utils.tabular.ml.utils.dd_list()>,
             {'core': defaultdict(list, {0: [], 1: []}),
              'aux1': defaultdict(list, {1: [], 2: []}),
              'refit_single_full': defaultdict(list,
                          {0: ['RandomForestClassifierGini_FULL_STACKER_l0',
                            'RandomF

The `persist_trainer()` function above loads all models required for prediction from disk into memory. Doing this deserialization in advance and only one time is crucial for efficient online inference. We're now ready to efficiently predict on one datapoint at a time:

In [13]:
num_test = 30
preds = np.array(['']*num_test, dtype='object')

for i in range(num_test):
    datapoint = test_data_nolab.iloc[[i]]
    pred_numpy = predictor.predict(datapoint)
    preds[i] = pred_numpy[0]

display(datapoint)
perf = predictor.evaluate_predictions(y_test[:num_test], preds, auxiliary_metrics=True)
print("Predictions: ", preds)

Unnamed: 0,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,...,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed
29,Female,[70-80),?,"""1""","""1""","""7""",5.3,MC,?,37,...,No,No,No,No,No,No,No,No,No,Yes


Evaluation: accuracy on test data: 0.43333333333333335
Evaluations on test data:
{
    "accuracy": 0.43333333333333335,
    "accuracy_score": 0.43333333333333335,
    "balanced_accuracy_score": 0.28698752228163993,
    "matthews_corrcoef": -0.10475656017578482
}
Detailed (per-class) classification report:
{
    "<30": {
        "precision": 0.0,
        "recall": 0.0,
        "f1-score": 0.0,
        "support": 2
    },
    ">30": {
        "precision": 0.25,
        "recall": 0.2727272727272727,
        "f1-score": 0.2608695652173913,
        "support": 11
    },
    "NO": {
        "precision": 0.5555555555555556,
        "recall": 0.5882352941176471,
        "f1-score": 0.5714285714285715,
        "support": 17
    },
    "accuracy": 0.43333333333333335,
    "macro avg": {
        "precision": 0.26851851851851855,
        "recall": 0.28698752228163993,
        "f1-score": 0.2774327122153209,
        "support": 30
    },
    "wei

Predictions:  ['NO' 'NO' 'NO' '>30' 'NO' '>30' '>30' 'NO' 'NO' 'NO' 'NO' 'NO' '>30' 'NO'
 'NO' 'NO' '>30' 'NO' '>30' '>30' 'NO' '>30' '>30' 'NO' '>30' 'NO' '>30'
 'NO' 'NO' '>30']
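
The benefit of loading models ahead of time, as `persist_trainer()` does above, can be illustrated with a small load-once cache. This is a generic sketch in which a pickled dictionary stands in for a model; `ModelCache` is a hypothetical helper, not an AutoGluon class, and it says nothing about how AutoGluon implements persistence internally.

```python
import os
import pickle
import tempfile


class ModelCache:
    """Deserialize each model file at most once and keep it in memory."""

    def __init__(self):
        self._models = {}
        self.disk_loads = 0

    def get(self, path):
        if path not in self._models:
            with open(path, "rb") as f:
                self._models[path] = pickle.load(f)
            self.disk_loads += 1
        return self._models[path]


# Stand-in "model": a lookup table saved to disk, as a trained model would be.
toy_model = {"low": "NO", "high": ">30"}
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    cache = ModelCache()
    requests = ["low", "high", "low", "low"]
    # Each request arrives on its own, but only the first one touches the disk.
    predictions = [cache.get(model_path)[x] for x in requests]

print(predictions, "disk loads:", cache.disk_loads)
```

Without such a cache, every single-datapoint request would pay the deserialization cost again, which can easily dominate the prediction time itself.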


## References

[**AutoGluon Documentation** (autogluon.mxnet.io)](https://autogluon.mxnet.io/api/autogluon.task.html)

[**Train/Deploy AutoGluon in the Cloud**](https://github.com/awslabs/autogluon/#traindeploy-autogluon-in-the-cloud)

[**Build, Train, and Deploy a Machine Learning Model with Amazon SageMaker**](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/)

Kervizic, J. [**Overview of Different Approaches to Deploying Machine Learning Models in Production**](https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-production.html). *KDNuggets*, 2019.

Fakoor et al. [**Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation**](https://arxiv.org/abs/2006.14284). *arXiv*, 2020.

Bucila et al. [**Model Compression**](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf). In: *KDD*, 2006. 

Hinton et al. [**Distilling the Knowledge in a Neural Network**](https://arxiv.org/abs/1503.02531). In: *NIPS Deep Learning Workshop*, 2014. 

Parr et al. [**Beware Default Random Forest Importances**](https://explained.ai/rf-importance/). *explained.ai*, 2018.
