# AutoGluon Tabular

## Essential

https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#

### library

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor

### data load

In [2]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


In [3]:
label = 'class'
print("Summary of class variable: \n", train_data[label].describe())

Summary of class variable: 
 count        500
unique         2
top        <=50K
freq         365
Name: class, dtype: object
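
The summary above shows that 365 of the 500 sampled rows belong to the most frequent class (`<=50K`). As a rough reference point for the accuracy figures in this notebook, here is a minimal sketch of the majority-class baseline. The helper name `majority_baseline` is illustrative and not part of AutoGluon.

```python
def majority_baseline(top_count: int, total: int) -> float:
    """Accuracy of a model that always predicts the most frequent class."""
    if total <= 0:
        raise ValueError("total must be positive")
    return top_count / total


# 'freq' and 'count' taken from the describe() output above.
baseline = majority_baseline(top_count=365, total=500)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Any model worth keeping should score clearly above this value on held-out data.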


### predictor, fitting

In [5]:
save_path = 'agModels-predictClass'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)

Beginning AutoGluon training ...
AutoGluon will save models to "agModels-predictClass/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_

	0.83	 = Validation score   (accuracy)
	0.58s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ...
	0.85	 = Validation score   (accuracy)
	0.37s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestGini ...
	0.84	 = Validation score   (accuracy)
	0.33s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: RandomForestEntr ...
	0.83	 = Validation score   (accuracy)
	0.18s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: CatBoost ...
	0.85	 = Validation score   (accuracy)
	0.7s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesGini ...
	0.82	 = Validation score   (accuracy)
	0.18s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: ExtraTreesEntr ...
	0.81	 = Validation score   (accuracy)
	0.18s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
	0.82	 = Validation score   (accuracy)
	0.53s	 = Training   runtime
	0.01s	 = Validation runti

In [6]:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]  # values to predict
test_data_nolab = test_data.drop(columns=[label])  # delete label column to prove we're not cheating
test_data_nolab.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States


### predict

In [7]:
predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data_nolab)
print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.8397993653393387
Evaluations on test data:
{
    "accuracy": 0.8397993653393387,
    "balanced_accuracy": 0.7437076677780596,
    "mcc": 0.5295565206264157,
    "f1": 0.6242496998799519,
    "precision": 0.7038440714672441,
    "recall": 0.5608283002588438
}


Predictions:  
 0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
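
The auxiliary metrics above all derive from one confusion matrix. The notebook does not print that matrix, so the counts in this sketch are an assumption: they were back-solved from the reported precision, recall and accuracy on the 9769 test rows, with `>50K` treated as the positive class.

```python
import math

# Confusion-matrix counts for the positive class (>50K).
# ASSUMPTION: not printed by the notebook; back-solved from the reported metrics.
tp, fp, fn, tn = 1300, 547, 1018, 6904

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)     # true negative rate

metrics = {
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
    "balanced_accuracy": (recall + specificity) / 2,
    "mcc": (tp * tn - fp * fn)
    / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    "f1": 2 * precision * recall / (precision + recall),
    "precision": precision,
    "recall": recall,
}

for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Balanced accuracy sits well below plain accuracy here because recall on `>50K` (about 0.56) is much lower than recall on the majority class.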


### leaderboard

In [10]:
predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,0.00816,0.002433,0.702048,0.00816,0.002433,0.702048,1,True,7
1,XGBoost,0.842461,0.85,0.015381,0.002221,0.214266,0.015381,0.002221,0.214266,1,True,11
2,RandomForestGini,0.842461,0.84,0.070275,0.014841,0.331338,0.070275,0.014841,0.331338,1,True,5
3,RandomForestEntr,0.840925,0.83,0.065891,0.014404,0.180476,0.065891,0.014404,0.180476,1,True,6
4,LightGBM,0.839799,0.85,0.014113,0.002041,0.369218,0.014113,0.002041,0.369218,1,True,4
5,WeightedEnsemble_L2,0.839799,0.85,0.01619,0.002314,0.52894,0.002077,0.000273,0.159722,2,True,14
6,LightGBMXT,0.836421,0.83,0.007182,0.002296,0.577779,0.007182,0.002296,0.577779,1,True,3
7,ExtraTreesGini,0.834374,0.82,0.067526,0.015412,0.177546,0.067526,0.015412,0.177546,1,True,8
8,ExtraTreesEntr,0.832839,0.81,0.065272,0.014964,0.176079,0.065272,0.014964,0.176079,1,True,9
9,LightGBMLarge,0.828949,0.83,0.016615,0.002345,1.342726,0.016615,0.002345,1.342726,1,True,13
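
In the leaderboard, `pred_time_test` and `fit_time` are cumulative: they include the base models an ensemble depends on. The `*_marginal` columns count only the model's own cost. The check below uses the rows above and assumes that `WeightedEnsemble_L2` relies only on `LightGBM` in this run. That is an inference from the matching numbers, not something the log states.

```python
# Values copied from the leaderboard rows above.
lightgbm = {"pred_time_test": 0.014113, "fit_time": 0.369218}
ensemble = {
    "pred_time_test": 0.01619,
    "pred_time_test_marginal": 0.002077,
    "fit_time": 0.52894,
    "fit_time_marginal": 0.159722,
}

# Cumulative cost = cost of the base model(s) + the ensemble's own marginal cost.
pred_total = lightgbm["pred_time_test"] + ensemble["pred_time_test_marginal"]
fit_total = lightgbm["fit_time"] + ensemble["fit_time_marginal"]

print(round(pred_total, 6), ensemble["pred_time_test"])
print(round(fit_total, 6), ensemble["fit_time"])
```

For a stack level 1 model the cumulative and marginal columns are identical, since it has no base models.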


### predict_proba

In [11]:
pred_probs = predictor.predict_proba(test_data_nolab)
pred_probs.head(5)

Unnamed: 0,<=50K,>50K
0,0.949797,0.050203
1,0.945973,0.054027
2,0.433299,0.566701
3,0.991393,0.008607
4,0.949908,0.050092
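
Each row of `predict_proba` sums to 1 across the classes. The sketch below assumes the predicted label is simply the most probable class, which agrees with the first five predictions printed in the `predict` section. The dictionary keys use the column names as printed; the raw labels in the data carry a leading space.

```python
# First five rows of pred_probs, copied from the output above.
pred_probs = [
    {"<=50K": 0.949797, ">50K": 0.050203},
    {"<=50K": 0.945973, ">50K": 0.054027},
    {"<=50K": 0.433299, ">50K": 0.566701},
    {"<=50K": 0.991393, ">50K": 0.008607},
    {"<=50K": 0.949908, ">50K": 0.050092},
]

# Pick the class with the highest predicted probability for each row.
labels = [max(row, key=row.get) for row in pred_probs]
print(labels)
```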


### fit_summary

`fit_summary()` reports the validation score and the fit and prediction times of every model trained during `fit()`.

In [12]:
results = predictor.fit_summary(show_plot=True)

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              LightGBM       0.85       0.002041  0.369218                0.002041           0.369218            1       True          4
1               XGBoost       0.85       0.002221  0.214266                0.002221           0.214266            1       True         11
2   WeightedEnsemble_L2       0.85       0.002314  0.528940                0.000273           0.159722            2       True         14
3              CatBoost       0.85       0.002433  0.702048                0.002433           0.702048            1       True          7
4        NeuralNetTorch       0.84       0.004084  0.614711                0.004084           0.614711            1       True         12
5      RandomForestGini       0.84       0.014841  0.331338                0.014841           0.331338        

### type check

- problem type
- feature type



In [13]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

AutoGluon infers problem type is:  binary
AutoGluon identified the following types of features:
('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']


### specific model

In [16]:
predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,0.007754,0.002433,0.702048,0.007754,0.002433,0.702048,1,True,7
1,XGBoost,0.842461,0.85,0.016167,0.002221,0.214266,0.016167,0.002221,0.214266,1,True,11
2,RandomForestGini,0.842461,0.84,0.069004,0.014841,0.331338,0.069004,0.014841,0.331338,1,True,5
3,RandomForestEntr,0.840925,0.83,0.064986,0.014404,0.180476,0.064986,0.014404,0.180476,1,True,6
4,LightGBM,0.839799,0.85,0.015015,0.002041,0.369218,0.015015,0.002041,0.369218,1,True,4
5,WeightedEnsemble_L2,0.839799,0.85,0.016686,0.002314,0.52894,0.001671,0.000273,0.159722,2,True,14
6,LightGBMXT,0.836421,0.83,0.006282,0.002296,0.577779,0.006282,0.002296,0.577779,1,True,3
7,ExtraTreesGini,0.834374,0.82,0.067059,0.015412,0.177546,0.067059,0.015412,0.177546,1,True,8
8,ExtraTreesEntr,0.832839,0.81,0.066861,0.014964,0.176079,0.066861,0.014964,0.176079,1,True,9
9,LightGBMLarge,0.828949,0.83,0.017103,0.002345,1.342726,0.017103,0.002345,1.342726,1,True,13


In [17]:
predictor.predict(test_data, model='LightGBMXT')

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

### Presets
- best_quality
- high_quality
- good_quality
- medium_quality (default)
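
A preset is passed to `fit()` through the `presets` argument. The helper below is a hypothetical wrapper, not an AutoGluon function, and its default `time_limit` of 60 seconds is an illustrative value. The list covers only the four presets named above. AutoGluon is imported lazily so the helper can be defined on a machine without it.

```python
# Preset names listed above, ordered from highest quality to fastest.
PRESETS = ["best_quality", "high_quality", "good_quality", "medium_quality"]


def fit_with_preset(train_data, label, preset="medium_quality", time_limit=60):
    """Fit a TabularPredictor with one of the presets listed above.

    Hypothetical helper for illustration; `time_limit` is in seconds.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    # Imported lazily so this function can be defined without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(label=label).fit(
        train_data, presets=preset, time_limit=time_limit
    )
```

Higher-quality presets generally need a longer `time_limit` and produce larger, slower models.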

### maximizing predictive performance

In [18]:
time_limit = 60  # for quick demonstration only, you should set this to longest time you are willing to wait (in seconds)
metric = 'roc_auc'  # specify your evaluation metric here
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')
predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_072846/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230606_072846/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Select

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost_BAG_L1,0.901743,0.895728,0.156766,0.041657,15.987669,0.156766,0.041657,15.987669,1,True,7
1,LightGBMXT_BAG_L1,0.900042,0.891385,0.290548,0.050951,2.68838,0.290548,0.050951,2.68838,1,True,3
2,WeightedEnsemble_L2,0.896692,0.905327,1.645161,0.284655,24.172973,0.002355,0.00017,0.112481,2,True,14
3,XGBoost_BAG_L1,0.893014,0.880304,0.401158,0.038149,2.568441,0.401158,0.038149,2.568441,1,True,11
4,LightGBM_BAG_L1,0.89195,0.880528,0.187948,0.04125,3.275766,0.187948,0.04125,3.275766,1,True,4
5,RandomForestEntr_BAG_L1,0.886841,0.889264,0.067833,0.036116,0.179776,0.067833,0.036116,0.179776,1,True,6
6,NeuralNetTorch_BAG_L1,0.886591,0.864576,0.474017,0.078949,9.155463,0.474017,0.078949,9.155463,1,True,12
7,RandomForestGini_BAG_L1,0.885065,0.8869,0.069668,0.036471,0.203449,0.069668,0.036471,0.203449,1,True,5
8,NeuralNetFastAI_BAG_L1,0.88279,0.891202,0.970134,0.081715,4.819302,0.970134,0.081715,4.819302,1,True,10
9,ExtraTreesEntr_BAG_L1,0.880534,0.887519,0.075892,0.036747,0.176887,0.075892,0.036747,0.176887,1,True,9
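
The scores in this leaderboard are ROC AUC values, because `eval_metric='roc_auc'` was set. ROC AUC measures how often a positive example receives a higher predicted probability than a negative one. Here is a minimal pure-Python sketch with made-up labels and scores. The data and the `roc_auc` helper are illustrative and not taken from the notebook.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Runs in O(n_pos * n_neg), fine for a small demo.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = '>50K', 0 = '<=50K'; scores are P(>50K).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.05, 0.40, 0.57, 0.35, 0.10, 0.90]
auc = roc_auc(y_true, scores)
print(f"ROC AUC: {auc:.3f}")
```

Since the metric depends only on how examples are ranked, it is computed from predicted probabilities rather than from hard class labels.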


### Regression (predicting numeric table columns)

In [19]:
age_column = 'age'
print("Summary of age variable: \n", train_data[age_column].describe())

Summary of age variable: 
 count    500.00000
mean      39.65200
std       13.52393
min       17.00000
25%       29.00000
50%       38.00000
75%       49.00000
max       85.00000
Name: age, dtype: float64


In [20]:
predictor_age = TabularPredictor(label=age_column, path="agModels-predictAge").fit(train_data, time_limit=60)
performance = predictor_age.evaluate(test_data)

Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "agModels-predictAge/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: age
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
	Label info (max, min, mean, stddev): (85, 17, 39.652, 13.52393)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    7521.87 MB
	Train Data (Original)  Memory Us

In [21]:
predictor_age.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-10.518465,-11.117906,0.173487,0.030104,1.407216,0.002694,0.000152,0.10045,2,True,12
1,ExtraTreesMSE,-10.655822,-11.370142,0.054727,0.015389,0.157286,0.054727,0.015389,0.157286,1,True,7
2,RandomForestMSE,-10.745762,-11.666954,0.057256,0.015611,0.179851,0.057256,0.015611,0.179851,1,True,5
3,CatBoost,-10.780312,-11.799279,0.009914,0.00286,0.49189,0.009914,0.00286,0.49189,1,True,6
4,LightGBMXT,-10.837373,-11.709228,0.04394,0.002778,0.211336,0.04394,0.002778,0.211336,1,True,3
5,LightGBM,-10.972156,-11.929546,0.016594,0.001602,0.156346,0.016594,0.001602,0.156346,1,True,4
6,XGBoost,-11.076006,-12.261029,0.018151,0.002217,0.236305,0.018151,0.002217,0.236305,1,True,9
7,NeuralNetTorch,-11.191017,-11.533245,0.025506,0.004868,0.632134,0.025506,0.004868,0.632134,1,True,10
8,NeuralNetFastAI,-11.38212,-12.082255,0.06136,0.003926,0.27823,0.06136,0.003926,0.27823,1,True,8
9,LightGBMLarge,-11.469922,-12.315314,0.026284,0.002131,0.537697,0.026284,0.002131,0.537697,1,True,11
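
The scores in this leaderboard are negative because AutoGluon reports every metric in a higher-is-better form. For regression the default metric is root mean squared error (RMSE), so it is shown with its sign flipped: a `score_test` of about -10.5 corresponds to an RMSE of about 10.5 years. A small sketch follows; the predictions are made up and the `rmse` helper is illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


y_true = [51, 58, 40, 37, 62]  # ages from the first rows of train_data
y_pred = [45, 50, 42, 40, 55]  # made-up predictions, for illustration only
error = rmse(y_true, y_pred)
score = -error  # higher-is-better form, as displayed in the leaderboard
print(f"RMSE = {error:.3f}, leaderboard-style score = {score:.3f}")
```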


## In depth

https://auto.gluon.ai/stable/tutorials/tabular/tabular-indepth.html


- hyperparameter_tune_kwargs
- hyperparameters
- num_stack_levels
- num_bag_folds
- num_bag_sets

### library and data

In [22]:
from autogluon.tabular import TabularDataset, TabularPredictor
import numpy as np

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
print(train_data.head())

label = 'occupation'
print("Summary of occupation column: \n", train_data['occupation'].describe())

new_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
test_data = new_data[5000:].copy()  # this should be separate data in your applications
y_test = test_data[label]
test_data_nolabel = test_data.drop(columns=[label])  # delete label column
val_data = new_data[:5000].copy()

metric = 'accuracy' # we specify eval-metric just for demo (unnecessary as it's the default)

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country   class  
6118              0             0              40   United-State

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


In [28]:
train_data[label].value_counts()

 Exec-managerial      77
 Craft-repair         67
 Sales                62
 Adm-clerical         57
 Prof-specialty       54
 Other-service        37
 Machine-op-inspct    35
 Transport-moving     27
 ?                    26
 Handlers-cleaners    24
 Tech-support         12
 Farming-fishing      11
 Protective-serv       9
 Armed-Forces          1
 Priv-house-serv       1
Name: occupation, dtype: int64

### specifying hyperparameters and tuning them

In [35]:
import autogluon.core as ag

nn_options = {  # specifies non-default hyperparameter values for neural network models
    'num_epochs': 10,  # number of training epochs (controls training time of NN models)
    'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True),  # learning rate used in training (real-valued hyperparameter searched on log-scale)
    'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),  # activation function used in NN (categorical hyperparameter, default = first entry)
    'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1),  # dropout probability (real-valued hyperparameter)
                }

gbm_options = {  # specifies non-default hyperparameter values for lightGBM gradient boosted trees
    'num_boost_round': 100,  # number of boosting rounds (controls training time of GBM models)
    'num_leaves': ag.space.Int(lower=26, upper=66, default=36),  # number of leaves in trees (integer hyperparameter)
                }

hyperparameters = {  # hyperparameters of each model type
                   'GBM': gbm_options,
                   'NN_TORCH': nn_options,  # NOTE: comment this line out if you get errors on Mac OSX
                  }  # When these keys are missing from hyperparameters dict, no models of that type are trained

time_limit = 2*60  # train various models for ~2 min
num_trials = 5  # try at most 5 different hyperparameter configurations for each type of model
search_strategy = 'auto'  # to tune hyperparameters using random search routine with a local scheduler

hyperparameter_tune_kwargs = {  # HPO is not performed unless hyperparameter_tune_kwargs is specified
    'num_trials': num_trials,
    'scheduler' : 'local',
    'searcher': search_strategy,
                }

predictor = TabularPredictor(
    label=label, 
    eval_metric=metric
    ).fit(
        train_data, 
        tuning_data=val_data,  # if omitted, AutoGluon automatically holds out validation data from the training set
        time_limit=time_limit,
        hyperparameters=hyperparameters, 
        hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
        verbosity=3
        )

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_074956/"
User Specified kwargs:
{'hyperparameter_tune_kwargs': {'num_trials': 5,
                                'scheduler': 'local',
                                'searcher': 'auto'},
 'verbosity': 3}
Full kwargs:
{'_feature_generator_kwargs': None,
 '_save_bag_folds': None,
 'ag_args': None,
 'ag_args_ensemble': None,
 'ag_args_fit': None,
 'auto_stack': False,
 'calibrate': 'auto',
 'excluded_model_types': None,
 'feature_generator': 'auto',
 'feature_prune_kwargs': None,
 'holdout_frac': None,
 'hyperparameter_tune_kwargs': {'num_trials': 5,
                                'scheduler': 'local',
                                'searcher': 'auto'},
 'keep_only_best': False,
 'name_suffix': None,
 'num_bag_folds': None,
 'num_bag_sets': None,
 'num_stack_levels': None,
 'pseudo_data': None,
 'refit_full': False,
 'save_space': False,
 'set_best_to_refit_full': False,
 'unlabeled_data': None,
 'use_bag_holdout'

  0%|          | 0/5 [00:00<?, ?it/s]

Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/dataset_train.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/dataset_val.pkl
	Fitting 100 rounds... Hyperparameters: {'learning_rate': 0.05, 'num_leaves': 36, 'feature_fraction': 1.0, 'min_data_in_leaf': 20}
Saving /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T1/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/dataset_train.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/dataset_val.pkl
	Fitting 100 rounds... Hyperparameters: {'learning_rate': 0.06994332504138305, 'num_leaves': 29, 'feature_fraction': 0.8872033759818312, 'min_data_in_leaf': 5}
Saving /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T2/model.pkl
Loading: /Users/byeongsikbu/python/autogluo
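
The `log=True` flag on the learning-rate space means candidate values are spread evenly across orders of magnitude rather than evenly across the raw range. The sketch below illustrates that idea with plain random sampling over spaces shaped like `nn_options` and `gbm_options`. It is not AutoGluon's searcher, and `sample_log_uniform` and `sample_config` are hypothetical helpers.

```python
import math
import random


def sample_log_uniform(lower, upper, rng):
    """Draw a value whose logarithm is uniform on [log(lower), log(upper)]."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_config(rng):
    """Draw one random configuration from the illustrative search spaces."""
    return {
        "learning_rate": sample_log_uniform(1e-4, 1e-2, rng),
        "activation": rng.choice(["relu", "softrelu", "tanh"]),
        "dropout_prob": rng.uniform(0.0, 0.5),
        "num_leaves": rng.randint(26, 66),  # randint bounds are inclusive
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]  # mirrors num_trials = 5
for cfg in configs:
    print(cfg)
```

With log-scale sampling, roughly half of the learning-rate draws fall below 1e-3. Uniform sampling on the raw range would put about 9% there.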

In [36]:
y_pred = predictor.predict(test_data_nolabel)
print("Predictions:  ", list(y_pred)[:5])
perf = predictor.evaluate(test_data, auxiliary_metrics=False)

Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T1/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T2/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T3/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T4/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T5/model.pkl
Loading: AutogluonModels/ag-20230606_074956/models/WeightedEnsemble_L2/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T1/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T2/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T3/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/Au

Predictions:   [' Exec-managerial', ' Craft-repair', ' Craft-repair', ' Adm-clerical', ' Sales']


Loading: AutogluonModels/ag-20230606_074956/models/WeightedEnsemble_L2/model.pkl
Evaluation: accuracy on test data: 0.3036275948836234
Evaluations on test data:
{
    "accuracy": 0.3036275948836234
}


In [37]:
results = predictor.fit_summary()

Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T1/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T2/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T3/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T4/model.pkl
Loading: /Users/byeongsikbu/python/autogluon/AutogluonModels/ag-20230606_074956/models/LightGBM/T5/model.pkl
Loading: AutogluonModels/ag-20230606_074956/models/WeightedEnsemble_L2/model.pkl


*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2   0.328481       0.180445  3.388797                0.000683           0.339182            2       True          6
1          LightGBM/T3   0.323765       0.014499  0.310731                0.014499           0.310731            1       True          3
2          LightGBM/T5   0.310847       0.017980  0.526610                0.017980           0.526610            1       True          5
3          LightGBM/T1   0.303260       0.037204  0.656409                0.037204           0.656409            1       True          1
4          LightGBM/T2   0.289932       0.030602  0.858524                0.030602           0.858524            1       True          2
5          LightGBM/T4   0.280910       0.079478  0.697340                0.079478           0.697340            1  

### Model ensembling with stacking/bagging

When bagging or stacking, it is better not to pass `tuning_data` and instead let AutoGluon choose the validation data automatically.

In [38]:
hyperparameters = {'NN_TORCH': {'num_epochs': 2}, 
                       'GBM': {'num_boost_round': 20}}
predictor = TabularPredictor(label=label, 
                             eval_metric=metric
                            ).fit(train_data,
                                  num_bag_folds=5,  # typically 5 to 10
                                  num_bag_sets=1,
                                  num_stack_levels=1,  # typically 1 to 3
                                  hyperparameters=hyperparameters,  # last argument is just for a quick demo; omit it in real applications
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_075306/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230606_075306/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass
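
With `num_bag_folds=5`, each base model is trained as five child models. Each child is fitted on four fifths of the rows and validated on the remaining fifth, so every training row receives exactly one out-of-fold prediction. `num_bag_sets` repeats the procedure with different splits. Below is a rough, stdlib-only sketch of that index bookkeeping. The `kfold_indices` helper is hypothetical and does not reproduce AutoGluon's actual splitter, which may stratify by label.

```python
import random


def kfold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and split them into n_folds (train, val) pairs."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


num_bag_folds, num_bag_sets = 5, 1
splits = kfold_indices(n_rows=500, n_folds=num_bag_folds)

# Each row is held out exactly once, so it gets exactly one out-of-fold prediction.
held_out = sorted(idx for _, val in splits for idx in val)
child_models = num_bag_folds * num_bag_sets
print(len(splits), child_models, held_out == list(range(500)))
```

With `num_stack_levels=1`, the out-of-fold predictions of the first layer become additional input features for the next layer of models. Holding out a separate `tuning_data` set is therefore unnecessary.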

In [39]:
save_path = 'agModels-predictOccupation'  # folder where to store trained models
hyperparameters = {'NN_TORCH': {'num_epochs': 2}, 'GBM': {'num_boost_round': 20}}

predictor = TabularPredictor(label=label, 
                             eval_metric=metric, 
                             path=save_path).fit(
    train_data, 
    auto_stack=True,
    time_limit=30, 
    hyperparameters=hyperparameters,  # last 2 arguments are for a quick demo; omit them in real applications
)

Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "agModels-predictOccupation/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['b

### prediction options

In [40]:
predictor = TabularPredictor.load(save_path)

In [41]:
predictor.features()

['age',
 'workclass',
 'fnlwgt',
 'education',
 'education-num',
 'marital-status',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 'class']

In [42]:
datapoint = test_data_nolabel.iloc[[0]]  # Note: .iloc[0] won't work because it returns pandas Series instead of DataFrame
print(datapoint)
predictor.predict(datapoint)

      age workclass  fnlwgt      education  education-num marital-status  \
5000   49   Private  259087   Some-college             10       Divorced   

        relationship    race      sex  capital-gain  capital-loss  \
5000   Not-in-family   White   Female             0             0   

      hours-per-week  native-country   class  
5000              40   United-States   <=50K  


5000     Exec-managerial
Name: occupation, dtype: object

In [43]:
predictor.predict_proba(datapoint)

Unnamed: 0,?,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
5000,0.070368,0.107069,0.0,0.101387,0.140451,0.05347,0.063921,0.075378,0.078306,0.0,0.087411,0.0,0.096707,0.051346,0.074186


In [44]:
predictor.get_model_best()

'WeightedEnsemble_L2'

In [45]:
predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,0.283078,0.296524,1.080664,0.197598,14.128732,1.080664,0.197598,14.128732,1,True,1
1,WeightedEnsemble_L2,0.270287,0.323108,2.367221,0.597866,19.059679,0.001082,0.000225,0.053206,2,True,3
2,NeuralNetTorch_BAG_L1,0.129377,0.157464,1.285475,0.400043,4.877741,1.285475,0.400043,4.877741,1,True,2


In [46]:
predictor.leaderboard(extra_info=True, silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order,num_features,...,hyperparameters,hyperparameters_fit,ag_args_fit,features,compile_time,child_hyperparameters,child_hyperparameters_fit,child_ag_args_fit,ancestors,descendants
0,WeightedEnsemble_L2,0.323108,0.597866,19.059679,0.000225,0.053206,2,True,3,24,...,"{'use_orig_features': False, 'max_base_models': 25, 'max_base_models_per_type': 5, 'save_bag_folds': True}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[LightGBM_BAG_L1_1, LightGBM_BAG_L1_4, LightGBM_BAG_L1_8, LightGBM_BAG_L1_11, LightGBM_BAG_L1_6, NeuralNetTorch_BAG_L1_1, NeuralNetTorch_BAG_L1_7, LightGBM_BAG_L1_3, NeuralNetTorch_BAG_L1_3, NeuralNetTorch_BAG_L1_8, LightGBM_BAG_L1_10, NeuralNetTorch_BAG_L1_10, LightGBM_BAG_L1_2, LightGBM_BAG_L1_5, LightGBM_BAG_L1_7, NeuralNetTorch_BAG_L1_4, NeuralNetTorch_BAG_L1_2, NeuralNetTorch_BAG_L1_9, LightGBM_BAG_L1_0, NeuralNetTorch_BAG_L1_5, NeuralNetTorch_BAG_L1_6, LightGBM_BAG_L1_9, NeuralNetTorch_BAG_L1_0, NeuralNetTorch_BAG_L1_11]",,{'ensemble_size': 100},{'ensemble_size': 19},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[LightGBM_BAG_L1, NeuralNetTorch_BAG_L1]",[]
1,LightGBM_BAG_L1,0.296524,0.197598,14.128732,0.197598,14.128732,1,True,1,14,...,"{'use_orig_features': True, 'max_base_models': 25, 'max_base_models_per_type': 5, 'save_bag_folds': True}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[age, class, sex, workclass, education, race, education-num, capital-gain, capital-loss, relationship, native-country, hours-per-week, fnlwgt, marital-status]",,"{'learning_rate': 0.05, 'num_boost_round': 20}",{'num_boost_round': 12},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[WeightedEnsemble_L2]
2,NeuralNetTorch_BAG_L1,0.157464,0.400043,4.877741,0.400043,4.877741,1,True,2,14,...,"{'use_orig_features': True, 'max_base_models': 25, 'max_base_models_per_type': 5, 'save_bag_folds': True}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[age, class, sex, workclass, education, race, education-num, capital-gain, capital-loss, relationship, native-country, hours-per-week, fnlwgt, marital-status]",,"{'num_epochs': 2, 'epochs_wo_improve': 20, 'activation': 'relu', 'embedding_size_factor': 1.0, 'embed_exponent': 0.56, 'max_embedding_dim': 100, 'y_range': None, 'y_range_extend': 0.05, 'dropout_prob': 0.1, 'optimizer': 'adam', 'learning_rate': 0.0003, 'weight_decay': 1e-06, 'proc.embed_min_categories': 4, 'proc.impute_strategy': 'median', 'proc.max_category_levels': 100, 'proc.skew_threshold': 0.99, 'use_ngram_features': False, 'num_layers': 4, 'hidden_size': 128, 'max_batch_size': 512, 'use_batchnorm': False, 'loss_function': 'auto'}","{'batch_size': 32, 'num_epochs': 2}","{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': ['text_ngram', 'text_as_category'], 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[WeightedEnsemble_L2]


In [48]:
predictor.leaderboard(test_data, extra_metrics=['accuracy', 'balanced_accuracy', 'log_loss'], silent=True)

Unnamed: 0,model,score_test,accuracy,balanced_accuracy,log_loss,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,0.283078,0.283078,0.177364,-11.749224,0.296524,1.111216,0.197598,14.128732,1.111216,0.197598,14.128732,1,True,1
1,WeightedEnsemble_L2,0.270287,0.270287,0.167753,-11.728956,0.323108,2.413881,0.597866,19.059679,0.001066,0.000225,0.053206,2,True,3
2,NeuralNetTorch_BAG_L1,0.129377,0.129377,0.066667,-11.760057,0.157464,1.301599,0.400043,4.877741,1.301599,0.400043,4.877741,1,True,2


### particular model

In [51]:
predictor.get_model_names()

['LightGBM_BAG_L1', 'NeuralNetTorch_BAG_L1', 'WeightedEnsemble_L2']

In [47]:
i = 0  # index of model to use
model_to_use = predictor.get_model_names()[i]
model_pred = predictor.predict(datapoint, model=model_to_use)
print("Prediction from %s model: %s" % (model_to_use, model_pred.iloc[0]))

Prediction from LightGBM_BAG_L1 model:  Exec-managerial


In [50]:
y_pred_proba = predictor.predict_proba(test_data_nolabel)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred_proba)

Evaluation: accuracy on test data: 0.2702872719647725
Evaluations on test data:
{
    "accuracy": 0.2702872719647725,
    "balanced_accuracy": 0.1677525538139834,
    "mcc": 0.179705418939371
}


In [52]:
perf = predictor.evaluate(test_data)

Evaluation: accuracy on test data: 0.2702872719647725
Evaluations on test data:
{
    "accuracy": 0.2702872719647725,
    "balanced_accuracy": 0.1677525538139834,
    "mcc": 0.179705418939371
}
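
The gap between `accuracy` (about 0.27) and `balanced_accuracy` (about 0.17) in the output above comes from how each metric weights the classes. A minimal sketch, assuming the usual definition of balanced accuracy as the mean of per-class recall. The labels are made up for illustration, and this is not AutoGluon's own code:

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Share of rows predicted correctly, so frequent classes dominate."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical labels: a model that always predicts the majority class "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

A model that favours the common classes can therefore score noticeably lower on balanced accuracy than on plain accuracy.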


### Interpretability (feature importance)

In [53]:
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 14 features using 4637 rows with 5 shuffle sets...
	182.83s	= Expected runtime (36.57s per shuffle set)
	111.54s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
education-num,0.056071,0.002017,2.007034e-07,5,0.060224,0.051917
workclass,0.03623,0.001056,8.667111e-08,5,0.038406,0.034055
sex,0.030149,0.00135,4.814644e-07,5,0.032929,0.027369
hours-per-week,0.027949,0.003248,2.149158e-05,5,0.034636,0.021262
age,0.016606,0.005041,0.0009053658,5,0.026986,0.006225
class,0.006599,0.001945,0.0008087244,5,0.010603,0.002595
relationship,0.001423,0.001834,0.07881967,5,0.005199,-0.002352
education,0.001337,0.000894,0.01438016,5,0.003179,-0.000504
capital-loss,0.000733,0.000289,0.002391255,5,0.001329,0.000137
native-country,0.000345,0.000246,0.01745985,5,0.000851,-0.000161
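
The log above says the importance scores come from "permutation shuffling" with 5 shuffle sets. A rough sketch of that idea, assuming a toy model and toy data (the names `hours` and `noise` are hypothetical). It illustrates the technique and is not AutoGluon's implementation:

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, num_shuffle_sets=5, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(num_shuffle_sets):
        column = [r[feature] for r in rows]
        rng.shuffle(column)  # break the link between this feature and the label
        shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / len(drops)

# Hypothetical toy data: the label depends only on "hours", never on "noise".
rng = random.Random(1)
rows = [{"hours": rng.randint(0, 80), "noise": rng.random()} for _ in range(200)]
labels = [r["hours"] > 40 for r in rows]
model = lambda r: r["hours"] > 40  # stand-in for a trained predictor

importance_hours = permutation_importance(model, rows, labels, "hours")
importance_noise = permutation_importance(model, rows, labels, "noise")
print(importance_hours, importance_noise)
```

A feature the model ignores gets an importance of zero, which is why the low-importance rows in the table above sit close to zero.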


### Accelerating inference

- refit_full : -Quality, +FitTime
- persist_models : ++ MemoryUsage
- infer_limit : -Quality
- distill : -Quality, ++FitTime
- feature pruning : -Quality?, ++FitTime
- use faster hardware : +Hardware
- manual hyperparameter adjustment : -Quality, ++UserMLExpertise
- manual data processing : +++UserMLExpertise, +++UserCode

### Keeping models in memory

In [54]:
predictor.persist_models()

num_test = 20
preds = np.array(['']*num_test, dtype='object')
for i in range(num_test):
    datapoint = test_data_nolabel.iloc[[i]]
    pred_numpy = predictor.predict(datapoint, as_pandas=False)
    preds[i] = pred_numpy[0]

perf = predictor.evaluate_predictions(y_test[:num_test], preds, auxiliary_metrics=True)
print("Predictions: ", preds)

predictor.unpersist_models()  # free memory by clearing models, future predict() calls will load models from disk

Persisting 3 models in memory. Models will require 0.62% of memory.
Evaluation: accuracy on test data: 0.25
Evaluations on test data:
{
    "accuracy": 0.25,
    "balanced_accuracy": 0.3208333333333336,
    "mcc": 0.13582634199860824
}
Unpersisted 3 models: ['WeightedEnsemble_L2', 'LightGBM_BAG_L1', 'NeuralNetTorch_BAG_L1']


Predictions:  [' Exec-managerial' ' Exec-managerial' ' Craft-repair' ' Adm-clerical'
 ' ?' ' Exec-managerial' ' Exec-managerial' ' Sales' ' Exec-managerial'
 ' Adm-clerical' ' Other-service' ' Exec-managerial' ' Exec-managerial'
 ' Exec-managerial' ' Adm-clerical' ' ?' ' Craft-repair' ' Craft-repair'
 ' Exec-managerial' ' Craft-repair']


['WeightedEnsemble_L2', 'LightGBM_BAG_L1', 'NeuralNetTorch_BAG_L1']

### Inference speed as a fit constraint

In [55]:
# At most 0.05 ms per row (20000 rows per second throughput)
infer_limit = 0.00005  # time taken to predict one row (seconds)
# adhere to infer_limit with batches of size 10000 (batch-inference, easier to satisfy infer_limit)
infer_limit_batch_size = 10000  # number of rows passed through at once
# adhere to infer_limit with batches of size 1 (online-inference, much harder to satisfy infer_limit)
# infer_limit_batch_size = 1  # Note that infer_limit<0.02 when infer_limit_batch_size=1 can be difficult to satisfy.
predictor_infer_limit = TabularPredictor(label=label, eval_metric=metric).fit(
    train_data=train_data,
    time_limit=30,
    infer_limit=infer_limit,
    infer_limit_batch_size=infer_limit_batch_size,
)

# NOTE: If bagging was enabled, it is important to call refit_full at this stage.
#  infer_limit assumes that the user will call refit_full after fit.
# predictor_infer_limit.refit_full()

# NOTE: To align with inference speed calculated during fit, models must be persisted.
predictor_infer_limit.persist_models()
# Below is an optimized version that only persists the minimum required models for prediction.
# predictor_infer_limit.persist_models('best')

predictor_infer_limit.leaderboard(silent=True)


No path specified. Models will be saved in: "AutogluonModels/ag-20230606_080905/"
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "AutogluonModels/ag-20230606_080905/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['bin

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,XGBoost,0.397959,0.001756,0.863583,0.001756,0.863583,1,True,11
1,WeightedEnsemble_L2,0.397959,0.002033,0.975178,0.000277,0.111595,2,True,14
2,LightGBM,0.367347,0.003443,1.225938,0.003443,1.225938,1,True,5
3,LightGBMXT,0.357143,0.002538,0.723984,0.002538,0.723984,1,True,4
4,CatBoost,0.346939,0.002195,9.914283,0.002195,9.914283,1,True,8
5,NeuralNetTorch,0.316327,0.003835,0.895814,0.003835,0.895814,1,True,12
6,NeuralNetFastAI,0.306122,0.003783,0.302249,0.003783,0.302249,1,True,3
7,RandomForestGini,0.306122,0.024727,0.354717,0.024727,0.354717,1,True,6
8,RandomForestEntr,0.295918,0.019253,0.346944,0.019253,0.346944,1,True,7
9,ExtraTreesEntr,0.295918,0.019821,0.325546,0.019821,0.325546,1,True,10


In [56]:
test_data_batch = test_data.sample(infer_limit_batch_size, replace=True, ignore_index=True)

import time
time_start = time.time()
predictor_infer_limit.predict(test_data_batch)
time_end = time.time()

infer_time_per_row = (time_end - time_start) / len(test_data_batch)
rows_per_second = 1 / infer_time_per_row
infer_time_per_row_ratio = infer_time_per_row / infer_limit
is_constraint_satisfied = infer_time_per_row_ratio <= 1

print(f'Model is able to predict {round(rows_per_second, 1)} rows per second. (User-specified Throughput = {1 / infer_limit})')
print(f'Model uses {round(infer_time_per_row_ratio * 100, 1)}% of infer_limit time per row.')
print(f'Model satisfies inference constraint: {is_constraint_satisfied}')

Model is able to predict 220287.9 rows per second. (User-specified Throughput = 20000.0)
Model uses 9.1% of infer_limit time per row.
Model satisfies inference constraint: True
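
The comments in the fit cell note that `infer_limit_batch_size=1` is much harder to satisfy than a batch of 10000. One way to see why is a fixed-overhead cost model, sketched below. The model and its two cost constants are assumptions made for illustration, not measurements or AutoGluon internals:

```python
def time_per_row(batch_size, overhead_per_call, cost_per_row):
    """Amortized seconds per row under an assumed fixed-overhead cost model."""
    return overhead_per_call / batch_size + cost_per_row

infer_limit = 0.00005  # seconds per row, as in the fit cell above

# Hypothetical costs, chosen only for illustration.
overhead = 0.002     # fixed seconds spent per predict() call
per_row = 0.000004   # seconds of actual work per row

batch_time = time_per_row(10000, overhead, per_row)
online_time = time_per_row(1, overhead, per_row)

batch_ok = batch_time <= infer_limit
online_ok = online_time <= infer_limit
print(batch_ok, online_ok)
```

With a large batch the per-call overhead is spread across many rows. With a batch of one, every row pays it in full.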


### Using smaller ensemble or faster model for prediction

In [57]:
additional_ensembles = predictor.fit_weighted_ensemble(expand_pareto_frontier=True)
print("Alternative ensembles you can use for prediction:", additional_ensembles)

predictor.leaderboard(only_pareto_frontier=True, silent=True)

Fitting model: WeightedEnsemble_L2Best ...
	0.3231	 = Validation score   (accuracy)
	0.07s	 = Training   runtime
	0.0s	 = Validation runtime


Alternative ensembles you can use for prediction: ['WeightedEnsemble_L2Best']


Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2Best,0.323108,0.597859,19.080851,0.000218,0.074378,2,True,4
1,LightGBM_BAG_L1,0.296524,0.197598,14.128732,0.197598,14.128732,1,True,1


In [58]:
model_for_prediction = additional_ensembles[0]
predictions = predictor.predict(test_data, model=model_for_prediction)
predictor.delete_models(models_to_delete=additional_ensembles, dry_run=False)  # delete these extra models so they don't affect rest of tutorial

Deleting model WeightedEnsemble_L2Best. All files under agModels-predictOccupation/models/WeightedEnsemble_L2Best/ will be removed.


### Collapsing bagged ensembles via refit_full

In [59]:
refit_model_map = predictor.refit_full()
print("Name of each refit-full model corresponding to a previous bagged ensemble:")
print(refit_model_map)
predictor.leaderboard(test_data, silent=True)

Refitting models via `predictor.refit_full` using all of the data (combined train and validation)...
	Models trained in this way will have the suffix "_FULL" and have NaN validation score.
	This process is not bound by time_limit, but should take less time than the original `predictor.fit` call.
	To learn more, refer to the `.refit_full` method docstring which explains how "_FULL" models differ from normal models.
Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1_FULL ...
	0.12s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model: NeuralNetTorch_BAG_L1_FULL ...
	0.06s	 = Training   runtime
Fitting model: WeightedEnsemble_L2_FULL | Skipping fit via cloning parent ...
	0.05s	 = Training   runtime
Updated best model to "WeightedEnsemble_L2_FULL" (Previously "WeightedEnsemble_L2"). AutoGluon will default to using "WeightedEnsemble_L2_FULL" for predict() and predict_proba().
Refit complete, total runtime = 0.41s


Name of each refit-full model corresponding to a previous bagged ensemble:
{'LightGBM_BAG_L1': 'LightGBM_BAG_L1_FULL', 'NeuralNetTorch_BAG_L1': 'NeuralNetTorch_BAG_L1_FULL', 'WeightedEnsemble_L2': 'WeightedEnsemble_L2_FULL'}


Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,0.283078,0.296524,1.08537,0.197598,14.128732,1.08537,0.197598,14.128732,1,True,1
1,WeightedEnsemble_L2,0.270287,0.323108,2.440959,0.597866,19.059679,0.001141,0.000225,0.053206,2,True,3
2,LightGBM_BAG_L1_FULL,0.269868,,0.012544,,0.124398,0.012544,,0.124398,1,True,4
3,WeightedEnsemble_L2_FULL,0.257077,,0.033614,,0.242409,0.001189,,0.053206,2,True,6
4,NeuralNetTorch_BAG_L1_FULL,0.129377,,0.019881,,0.064805,0.019881,,0.064805,1,True,5
5,NeuralNetTorch_BAG_L1,0.129377,0.157464,1.354448,0.400043,4.877741,1.354448,0.400043,4.877741,1,True,2
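
In the table above, `pred_time_test` falls from about 1.09 s for `LightGBM_BAG_L1` to about 0.013 s for `LightGBM_BAG_L1_FULL`. A bagged model has to query every fold model at prediction time, while the refit model is a single model trained on all the data. A toy sketch of that difference, assuming a one-dimensional least-squares line as the base model (hypothetical data, not AutoGluon code):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: a line with a small alternating disturbance.
xs = list(range(20))
ys = [3 * x + 2 + 0.5 * (-1) ** x for x in xs]

# Bagging: one model per fold, each trained without its held-out rows.
k = 5
fold_models = []
for fold in range(k):
    train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
    fold_models.append(fit_line(*zip(*train)))

# Bagged inference averages k models; refit-full inference calls one model.
bagged_pred = sum(predict(m, 10.0) for m in fold_models) / k
full_model = fit_line(xs, ys)
full_pred = predict(full_model, 10.0)
print(bagged_pred, full_pred)
```

The two predictions are close, but the refit model costs one call instead of `k`. As the log notes, the `_FULL` models have no validation score because no data was held out.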


### Model distillation

The idea is to train an individual model (the student) to mimic the predictions of the full stack ensemble (the teacher).
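
A toy sketch of the teacher and student idea, under simplifying assumptions. The teacher is a hypothetical ensemble of three threshold rules, the student is a single threshold, and the student is fit by squared error against the teacher's soft predictions. AutoGluon's `distill` differs in its models, loss, and data augmentation:

```python
def teacher_proba(x):
    """Soft prediction: share of ensemble members voting for the positive class."""
    thresholds = [38, 40, 42]  # hypothetical ensemble of three threshold rules
    return sum(x > t for t in thresholds) / len(thresholds)

def student_loss(threshold, xs):
    """Mean squared gap between a single-threshold student and the teacher."""
    return sum(((x > threshold) - teacher_proba(x)) ** 2 for x in xs) / len(xs)

# The student never sees true labels, only the teacher's soft predictions.
xs = list(range(80))
student_threshold = min(range(80), key=lambda t: student_loss(t, xs))
print(student_threshold)
```

The student ends up with one rule that approximates three, which is the trade the distilled models make: cheaper inference for a small loss in fidelity to the teacher.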

In [60]:
student_models = predictor.distill(time_limit=30)  # specify much longer time limit in real applications
print(student_models)
preds_student = predictor.predict(test_data_nolabel, model=student_models[0])
print(f"predictions from {student_models[0]}:", list(preds_student)[:5])
predictor.leaderboard(test_data)

Distilling with teacher='WeightedEnsemble_L2_FULL', teacher_preds=soft, augment_method=spunge ...
SPUNGE: Augmenting training data with 1955 synthetic samples for distillation...
Distilling with each of these student models: ['LightGBM_DSTL', 'NeuralNetMXNet_DSTL', 'RandomForestMSE_DSTL', 'CatBoost_DSTL', 'NeuralNetTorch_DSTL']
Fitting 5 L1 models ...
Fitting model: LightGBM_DSTL ... Training model for up to 30.0s of the 30.0s of remaining time.
	Note: model has different eval_metric than default.
	-2.0909	 = Validation score   (-soft_log_loss)
	3.65s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetMXNet_DSTL ... Training model for up to 26.28s of the 26.28s of remaining time.
		Unable to import dependency mxnet. A quick tip is to install via `pip install mxnet --upgrade`, or `pip install mxnet_cu101 --upgrade`
Fitting model: RandomForestMSE_DSTL ... Training model for up to 26.28s of the 26.28s of remaining time.
	Note: model has different eval_metric than d

['LightGBM_DSTL', 'RandomForestMSE_DSTL', 'CatBoost_DSTL', 'WeightedEnsemble_L2_DSTL']
predictions from LightGBM_DSTL: [' Exec-managerial', ' Exec-managerial', ' Craft-repair', ' Sales', ' Exec-managerial']

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_DSTL,0.291256,0.306122,0.12691,0.005275,3.65251,0.12691,0.005275,3.65251,1,True,7
1,WeightedEnsemble_L2_DSTL,0.289788,0.346939,0.223685,0.026412,4.237754,0.001342,0.000181,0.024037,2,True,10
2,CatBoost_DSTL,0.287901,0.295918,0.035285,0.002559,25.390957,0.035285,0.002559,25.390957,1,True,9
3,RandomForestMSE_DSTL,0.284546,0.306122,0.095433,0.020956,0.561207,0.095433,0.020956,0.561207,1,True,8
4,LightGBM_BAG_L1,0.283078,0.296524,1.237438,0.197598,14.128732,1.237438,0.197598,14.128732,1,True,1
5,WeightedEnsemble_L2,0.270287,0.323108,3.146374,0.597866,19.059679,0.001375,0.000225,0.053206,2,True,3
6,LightGBM_BAG_L1_FULL,0.269868,,0.013364,,0.124398,0.013364,,0.124398,1,True,4
7,WeightedEnsemble_L2_FULL,0.257077,,0.03395,,0.242409,0.001361,,0.053206,2,True,6
8,NeuralNetTorch_BAG_L1_FULL,0.129377,,0.019225,,0.064805,0.019225,,0.064805,1,True,5
9,NeuralNetTorch_BAG_L1,0.129377,0.157464,1.907561,0.400043,4.877741,1.907561,0.400043,4.877741,1,True,2


### Faster presets or hyperparameters

In [61]:
presets = ['good_quality', 'optimize_for_deployment']
predictor_light = TabularPredictor(label=label, eval_metric=metric).fit(train_data, presets=presets, time_limit=30)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_084537/"
Presets specified: ['good_quality', 'optimize_for_deployment']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "AutogluonModels/ag-20230606_084537/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclas

Another option: use a lighter `hyperparameters` preset.

In [62]:
predictor_light = TabularPredictor(label=label, eval_metric=metric).fit(train_data, hyperparameters='very_light', time_limit=30)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_084659/"
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "AutogluonModels/ag-20230606_084659/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['bin

Another option: exclude specific model types.

In [63]:
excluded_model_types = ['KNN', 'NN_TORCH', 'custom']
predictor_light = TabularPredictor(label=label, eval_metric=metric).fit(train_data, excluded_model_types=excluded_model_types, time_limit=30)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_084724/"
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "AutogluonModels/ag-20230606_084724/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['bin

### (Advanced) Cache preprocessed data

In [64]:
# Cache the preprocessed version of test_data, since it is used repeatedly
test_data_preprocessed = predictor.transform_features(test_data)

# The following call will be faster than a normal predict call because we are skipping the preprocessing stage.
predictions = predictor.predict(test_data_preprocessed, transform_features=False)

## Feature Engineering 

https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html



- column types
- column type detection
- problem type detection
- automatic feature engineering
- numerical/categorical/datetime/text
- additional processing

In [68]:
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
import numpy as np
import random
from sklearn.datasets import make_regression
from datetime import datetime

x, y = make_regression(n_samples = 100,n_features = 5,n_targets = 1, random_state = 1)
dfx = pd.DataFrame(x, columns=['A','B','C','D','E'])
dfy = pd.DataFrame(y, columns=['label'])

# Create an integer column, a datetime column, a categorical column and a string column to demonstrate how they are processed.
dfx['B'] = (dfx['B']).astype(int)
dfx['C'] = datetime(2000,1,1) + pd.to_timedelta(dfx['C'].astype(int), unit='D')
dfx['D'] = pd.cut(dfx['D'] * 10, [-np.inf,-5,0,5,np.inf],labels=['v','w','x','y'])
dfx['E'] = pd.Series(list(' '.join(random.choice(["abc", "d", "ef", "ghi", "jkl"]) for i in range(4)) for j in range(100)))
dataset=TabularDataset(dfx)
dfx

Unnamed: 0,A,B,C,D,E
0,-0.545774,0,2000-01-01,y,d ghi ghi jkl
1,-0.468674,0,2000-01-02,x,ef ghi abc ghi
2,1.767960,0,1999-12-31,v,ef d ghi abc
3,-0.118771,1,2000-01-01,y,jkl d abc ef
4,0.630196,0,1999-12-31,w,ghi ghi abc jkl
...,...,...,...,...,...
95,-1.182318,-1,2000-01-01,v,ghi ef ef ghi
96,0.562761,0,2000-01-01,v,ghi abc ef jkl
97,-0.797270,0,2000-01-01,w,ef jkl ghi abc
98,0.502741,0,1999-12-31,y,jkl jkl jkl ef


In [70]:
dfy

Unnamed: 0,label
0,10.729289
1,94.928562
2,-64.122910
3,38.896493
4,-63.009453
...,...
95,-82.611204
96,8.499687
97,6.046040
98,75.697053


### AutoMLPipelineFeatureGenerator

In [69]:
from autogluon.features.generators import AutoMLPipelineFeatureGenerator
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    7440.82 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 4
		Reducing Vectorizer vocab size from 4 to 2 to avoid OOM error
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of 

Unnamed: 0,A,B,D,E,C,C.year,C.month,C.day,C.dayofweek,E.char_count,E.symbol_ratio.,__nlp__.abc,__nlp__.ghi,__nlp__._total_
0,-0.545774,0,3,,946684800000000000,2000,1,1,5,4,2,0,2,1
1,-0.468674,0,2,,946771200000000000,2000,1,2,6,5,1,1,2,2
2,1.767960,0,0,,946598400000000000,1999,12,31,4,3,3,1,1,2
3,-0.118771,1,3,5,946684800000000000,2000,1,1,5,3,3,1,0,1
4,0.630196,0,1,,946598400000000000,1999,12,31,4,6,0,1,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.182318,-1,0,,946684800000000000,2000,1,1,5,4,2,0,2,1
96,0.562761,0,0,,946684800000000000,2000,1,1,5,5,1,1,1,2
97,-0.797270,0,1,,946684800000000000,2000,1,1,5,5,1,1,1,2
98,0.502741,0,3,,946598400000000000,1999,12,31,4,5,1,0,0,0
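
The `C`, `C.year`, `C.month`, `C.day` and `C.dayofweek` columns above can be reproduced by hand. A minimal sketch, assuming the datetime is treated as UTC, `C` is nanoseconds since the Unix epoch, and the day of week counts from Monday = 0. The helper `datetime_features` is hypothetical, not an AutoGluon function:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def datetime_features(dt):
    """Expand a naive datetime (assumed UTC) into numeric features."""
    delta = dt.replace(tzinfo=timezone.utc) - EPOCH
    nanoseconds = (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000
    return {
        "C": nanoseconds,
        "C.year": dt.year,
        "C.month": dt.month,
        "C.day": dt.day,
        "C.dayofweek": dt.weekday(),  # Monday = 0, so Saturday = 5
    }

row0 = datetime_features(datetime(2000, 1, 1))
row2 = datetime_features(datetime(1999, 12, 31))
print(row0)
print(row2)
```

Both results agree with rows 0 and 2 of the table above.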


In [71]:
df = pd.concat([dfx, dfy], axis=1)
predictor = TabularPredictor(label='label')
predictor.fit(df, 
              hyperparameters={'GBM' : {}}, 
              feature_generator=auto_ml_pipeline_feature_generator)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_085609/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230606_085609/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    100
Train Data Columns: 5
Label Column: label
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (186.98105511749836, -267.99365510467214, 9.38193, 71.29287)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
AutoMLP

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x32d8f3df0>

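B has only a few unique values, but it is treated as numeric rather than as a string.<br>
D has been mapped to integers, but it is recognized as a categorical variable.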
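
The integer values shown for `D` are category codes, not measurements. A small sketch of that mapping, assuming codes follow the label order `['v', 'w', 'x', 'y']` passed to `pd.cut` earlier. That order is consistent with the transformed table, but the helper `encode_categories` is hypothetical:

```python
def encode_categories(values, categories):
    """Map each category label to its position in the category list."""
    code_of = {label: code for code, label in enumerate(categories)}
    return [code_of.get(v) for v in values]  # unknown labels become None

# First five values of D from the original dataframe.
d_values = ["y", "x", "v", "y", "w"]
codes = encode_categories(d_values, ["v", "w", "x", "y"])
print(codes)
```

The codes only identify categories. Their numeric order carries no meaning unless the categories themselves are ordered.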

In [72]:
print(len(set(dfx['B'])))

5


In [73]:
# Convert B to a categorical variable
dfx["B"] = dfx["B"].astype("category")
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    7428.42 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 4
		Reducing Vectorizer vocab size from 4 to 2 to avoid OOM error
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of 

Unnamed: 0,A,B,D,E,C,C.year,C.month,C.day,C.dayofweek,E.char_count,E.symbol_ratio.,__nlp__.abc,__nlp__.ghi,__nlp__._total_
0,-0.545774,1,3,,946684800000000000,2000,1,1,5,4,2,0,2,1
1,-0.468674,1,2,,946771200000000000,2000,1,2,6,5,1,1,2,2
2,1.767960,1,0,,946598400000000000,1999,12,31,4,3,3,1,1,2
3,-0.118771,2,3,5,946684800000000000,2000,1,1,5,3,3,1,0,1
4,0.630196,1,1,,946598400000000000,1999,12,31,4,6,0,1,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.182318,0,0,,946684800000000000,2000,1,1,5,4,2,0,2,1
96,0.562761,1,0,,946684800000000000,2000,1,1,5,5,1,1,1,2
97,-0.797270,1,1,,946684800000000000,2000,1,1,5,5,1,1,1,2
98,0.502741,1,3,,946598400000000000,1999,12,31,4,5,1,0,0,0


### Missing Value Handling

In [74]:
dfx.iloc[0] = np.nan
dfx.head()

Unnamed: 0,A,B,C,D,E
0,,,NaT,,
1,-0.468674,0.0,2000-01-02,x,ef ghi abc ghi
2,1.76796,0.0,1999-12-31,v,ef d ghi abc
3,-0.118771,1.0,2000-01-01,y,jkl d abc ef
4,0.630196,0.0,1999-12-31,w,ghi ghi abc jkl


In [75]:
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    7406.22 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 4
		Reducing Vectorizer vocab size from 4 to 2 to avoid OOM error
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of 

Unnamed: 0,A,B,D,E,C,C.year,C.month,C.day,C.dayofweek,E.char_count,E.word_count,E.symbol_ratio.,__nlp__.abc,__nlp__.ghi,__nlp__._total_
0,,,,,946687418181818240,2000,1,1,5,0,0,0,0,0,0
1,-0.468674,1,2,,946771200000000000,2000,1,2,6,6,1,2,1,2,2
2,1.767960,1,0,,946598400000000000,1999,12,31,4,4,1,4,1,1,2
3,-0.118771,2,3,5,946684800000000000,2000,1,1,5,4,1,4,1,0,1
4,0.630196,1,1,,946598400000000000,1999,12,31,4,7,1,1,1,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.182318,0,0,,946684800000000000,2000,1,1,5,5,1,3,0,2,1
96,0.562761,1,0,,946684800000000000,2000,1,1,5,6,1,2,1,1,2
97,-0.797270,1,1,,946684800000000000,2000,1,1,5,6,1,2,1,1,2
98,0.502741,1,3,,946598400000000000,1999,12,31,4,6,1,2,0,0,0


In the all-missing first row, A, B, D, and E stay NaN; only the datetime column C still gets a value and derived features.

### Customizing Feature Engineering

In [78]:
dfx

Unnamed: 0,A,B,C,D,E
0,,,NaT,,
1,-0.468674,0,2000-01-02,x,ef ghi abc ghi
2,1.767960,0,1999-12-31,v,ef d ghi abc
3,-0.118771,1,2000-01-01,y,jkl d abc ef
4,0.630196,0,1999-12-31,w,ghi ghi abc jkl
...,...,...,...,...,...
95,-1.182318,-1,2000-01-01,v,ghi ef ef ghi
96,0.562761,0,2000-01-01,v,ghi abc ef jkl
97,-0.797270,0,2000-01-01,w,ef jkl ghi abc
98,0.502741,0,1999-12-31,y,jkl jkl jkl ef


In [76]:
from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
from autogluon.common.features.types import R_INT, R_FLOAT
mypipeline = PipelineFeatureGenerator(
    generators=[[
        CategoryFeatureGenerator(maximum_num_cat=10),  # Overridden from default.
        IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
    ]]
)

In [77]:
mypipeline.fit_transform(X=dfx)

Fitting PipelineFeatureGenerator...
	Available Memory:                    7391.19 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Unused Original Features (Count: 1): ['C']
		These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
		Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
		These features d

Unnamed: 0,B,D,E,A
0,,,,
1,1,2,,-0.468674
2,1,0,,1.767960
3,2,3,5,-0.118771
4,1,1,,0.630196
...,...,...,...,...
95,0,0,,-1.182318
96,1,0,,0.562761
97,1,1,,-0.797270
98,1,3,,0.502741
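
One way to picture what `maximum_num_cat` does is the sketch below. It assumes the parameter keeps only the N most frequent categories and treats everything else as missing. The function `cap_categories` is a hypothetical illustration in plain Python, not AutoGluon's code, and the integer codes it assigns are not guaranteed to match AutoGluon's.

```python
from collections import Counter


def cap_categories(values, maximum_num_cat):
    """Hypothetical helper: keep the most frequent categories, map the rest to None."""
    counts = Counter(v for v in values if v is not None)
    kept = [category for category, _ in counts.most_common(maximum_num_cat)]
    # Assign integer codes to the surviving categories in sorted order.
    code_of = {category: code for code, category in enumerate(sorted(kept))}
    # dict.get returns None for rare or missing categories.
    return [code_of.get(v) for v in values]


column = ["v", "w", "v", "x", "y", "v", "w", None]
encoded = cap_categories(column, maximum_num_cat=2)
print(encoded)
```

With a cap of 2, only `v` and `w` survive here. Under this assumption, a column with many distinct values would end up mostly missing.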


# (Advanced) Tabular AutoGluon Tutorials

## Multi-Label Prediction

### Predicting Multiple Columns in a Table

https://auto.gluon.ai/stable/tutorials/tabular/advanced/tabular-multilabel.html

In [79]:
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.common.utils.utils import setup_outputdir
from autogluon.core.utils.loaders import load_pkl
from autogluon.core.utils.savers import save_pkl
import os.path

class MultilabelPredictor():
    """ Tabular Predictor for predicting multiple columns in table.
        Creates multiple TabularPredictor objects which you can also use individually.
        You can access the TabularPredictor for a particular label via: `multilabel_predictor.get_predictor(label_i)`

        Parameters
        ----------
        labels : List[str]
            The ith element of this list is the column (i.e. `label`) predicted by the ith TabularPredictor stored in this object.
        path : str, default = None
            Path to directory where models and intermediate outputs should be saved.
            If unspecified, a time-stamped folder called "AutogluonModels/ag-[TIMESTAMP]" will be created in the working directory to store all models.
            Note: To call `fit()` twice and save all results of each fit, you must specify different `path` locations or don't specify `path` at all.
            Otherwise files from first `fit()` will be overwritten by second `fit()`.
            Caution: when predicting many labels, this directory may grow large as it needs to store many TabularPredictors.
        problem_types : List[str], default = None
            The ith element is the `problem_type` for the ith TabularPredictor stored in this object.
        eval_metrics : List[str], default = None
            The ith element is the `eval_metric` for the ith TabularPredictor stored in this object.
        consider_labels_correlation : bool, default = True
            Whether the predictions of multiple labels should account for label correlations or predict each label independently of the others.
            If True, the ordering of `labels` may affect resulting accuracy as each label is predicted conditional on the previous labels appearing earlier in this list (i.e. in an auto-regressive fashion).
            Set to False if during inference you may want to individually use just the ith TabularPredictor without predicting all the other labels.
        kwargs :
            Arguments passed into the initialization of each TabularPredictor.

    """

    multi_predictor_file = 'multilabel_predictor.pkl'

    def __init__(self, labels, path=None, problem_types=None, eval_metrics=None, consider_labels_correlation=True, **kwargs):
        if len(labels) < 2:
            raise ValueError("MultilabelPredictor is only intended for predicting MULTIPLE labels (columns), use TabularPredictor for predicting one label (column).")
        if (problem_types is not None) and (len(problem_types) != len(labels)):
            raise ValueError("If provided, `problem_types` must have same length as `labels`")
        if (eval_metrics is not None) and (len(eval_metrics) != len(labels)):
            raise ValueError("If provided, `eval_metrics` must have same length as `labels`")
        self.path = setup_outputdir(path, warn_if_exist=False)
        self.labels = labels
        self.consider_labels_correlation = consider_labels_correlation
        self.predictors = {}  # key = label, value = TabularPredictor or str path to the TabularPredictor for this label
        if eval_metrics is None:
            self.eval_metrics = {}
        else:
            self.eval_metrics = {labels[i] : eval_metrics[i] for i in range(len(labels))}
        problem_type = None
        eval_metric = None
        for i in range(len(labels)):
            label = labels[i]
            path_i = self.path + "Predictor_" + label
            if problem_types is not None:
                problem_type = problem_types[i]
            if eval_metrics is not None:
                eval_metric = eval_metrics[i]
            self.predictors[label] = TabularPredictor(label=label, problem_type=problem_type, eval_metric=eval_metric, path=path_i, **kwargs)

    def fit(self, train_data, tuning_data=None, **kwargs):
        """ Fits a separate TabularPredictor to predict each of the labels.

            Parameters
            ----------
            train_data, tuning_data : str or autogluon.tabular.TabularDataset or pd.DataFrame
                See documentation for `TabularPredictor.fit()`.
            kwargs :
                Arguments passed into the `fit()` call for each TabularPredictor.
        """
        if isinstance(train_data, str):
            train_data = TabularDataset(train_data)
        if tuning_data is not None and isinstance(tuning_data, str):
            tuning_data = TabularDataset(tuning_data)
        train_data_og = train_data.copy()
        if tuning_data is not None:
            tuning_data_og = tuning_data.copy()
        else:
            tuning_data_og = None
        save_metrics = len(self.eval_metrics) == 0
        for i in range(len(self.labels)):
            label = self.labels[i]
            predictor = self.get_predictor(label)
            if not self.consider_labels_correlation:
                labels_to_drop = [l for l in self.labels if l != label]
            else:
                labels_to_drop = [self.labels[j] for j in range(i+1, len(self.labels))]
            train_data = train_data_og.drop(labels_to_drop, axis=1)
            if tuning_data is not None:
                tuning_data = tuning_data_og.drop(labels_to_drop, axis=1)
            print(f"Fitting TabularPredictor for label: {label} ...")
            predictor.fit(train_data=train_data, tuning_data=tuning_data, **kwargs)
            self.predictors[label] = predictor.path
            if save_metrics:
                self.eval_metrics[label] = predictor.eval_metric
        self.save()

    def predict(self, data, **kwargs):
        """ Returns DataFrame with label columns containing predictions for each label.

            Parameters
            ----------
            data : str or autogluon.tabular.TabularDataset or pd.DataFrame
                Data to make predictions for. If label columns are present in this data, they will be ignored. See documentation for `TabularPredictor.predict()`.
            kwargs :
                Arguments passed into the predict() call for each TabularPredictor.
        """
        return self._predict(data, as_proba=False, **kwargs)

    def predict_proba(self, data, **kwargs):
        """ Returns dict where each key is a label and the corresponding value is the `predict_proba()` output for just that label.

            Parameters
            ----------
            data : str or autogluon.tabular.TabularDataset or pd.DataFrame
                Data to make predictions for. See documentation for `TabularPredictor.predict()` and `TabularPredictor.predict_proba()`.
            kwargs :
                Arguments passed into the `predict_proba()` call for each TabularPredictor (also passed into a `predict()` call).
        """
        return self._predict(data, as_proba=True, **kwargs)

    def evaluate(self, data, **kwargs):
        """ Returns dict where each key is a label and the corresponding value is the `evaluate()` output for just that label.

            Parameters
            ----------
            data : str or autogluon.tabular.TabularDataset or pd.DataFrame
                Data to evaluate predictions of all labels for, must contain all labels as columns. See documentation for `TabularPredictor.evaluate()`.
            kwargs :
                Arguments passed into the `evaluate()` call for each TabularPredictor (also passed into the `predict()` call).
        """
        data = self._get_data(data)
        eval_dict = {}
        for label in self.labels:
            print(f"Evaluating TabularPredictor for label: {label} ...")
            predictor = self.get_predictor(label)
            eval_dict[label] = predictor.evaluate(data, **kwargs)
            if self.consider_labels_correlation:
                data[label] = predictor.predict(data, **kwargs)
        return eval_dict

    def save(self):
        """ Save MultilabelPredictor to disk. """
        for label in self.labels:
            if not isinstance(self.predictors[label], str):
                self.predictors[label] = self.predictors[label].path
        save_pkl.save(path=self.path+self.multi_predictor_file, object=self)
        print(f"MultilabelPredictor saved to disk. Load with: MultilabelPredictor.load('{self.path}')")

    @classmethod
    def load(cls, path):
        """ Load MultilabelPredictor from disk `path` previously specified when creating this MultilabelPredictor. """
        path = os.path.expanduser(path)
        if path[-1] != os.path.sep:
            path = path + os.path.sep
        return load_pkl.load(path=path+cls.multi_predictor_file)

    def get_predictor(self, label):
        """ Returns TabularPredictor which is used to predict this label. """
        predictor = self.predictors[label]
        if isinstance(predictor, str):
            return TabularPredictor.load(path=predictor)
        return predictor

    def _get_data(self, data):
        if isinstance(data, str):
            return TabularDataset(data)
        return data.copy()

    def _predict(self, data, as_proba=False, **kwargs):
        data = self._get_data(data)
        if as_proba:
            predproba_dict = {}
        for label in self.labels:
            print(f"Predicting with TabularPredictor for label: {label} ...")
            predictor = self.get_predictor(label)
            if as_proba:
                predproba_dict[label] = predictor.predict_proba(data, as_multiclass=True, **kwargs)
            data[label] = predictor.predict(data, **kwargs)
        if not as_proba:
            return data[self.labels]
        else:
            return predproba_dict
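
The effect of `consider_labels_correlation` is easiest to see by isolating the column-dropping logic from `fit()`. The sketch below restates that logic as a standalone function. The name `labels_to_drop_for` is made up for illustration.

```python
def labels_to_drop_for(labels, i, consider_labels_correlation=True):
    """Label columns hidden from the i-th predictor during training."""
    if not consider_labels_correlation:
        # Independent mode: every other label is dropped.
        return [l for j, l in enumerate(labels) if j != i]
    # Chained mode: only labels that come later in the list are dropped,
    # so earlier labels remain available as input features.
    return labels[i + 1:]


labels = ['education-num', 'education', 'class']
for i, label in enumerate(labels):
    print(label, '-> chained drops:', labels_to_drop_for(labels, i),
          '| independent drops:', labels_to_drop_for(labels, i, False))
```

In chained mode the last label (`class`) is trained with both earlier labels as features, which is why the order of `labels` can affect accuracy.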

### Training

In [80]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


Specify `labels`, `problem_types`, and `eval_metrics`.

In [81]:
labels = ['education-num','education','class']  # which columns to predict based on the others
problem_types = ['regression','multiclass','binary']  # type of each prediction problem (optional)
eval_metrics = ['mean_absolute_error','accuracy','accuracy']  # metrics used to evaluate predictions for each label (optional)
save_path = 'agModels-predictEducationClass'  # specifies folder to store trained models (optional)

time_limit = 5  # how many seconds to train the TabularPredictor for each label, set much larger in your applications!
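
The three lists are matched by position, so the i-th problem type and metric belong to the i-th label. A small sketch of that pairing, with the same length check that `MultilabelPredictor.__init__` performs (the helper `per_label_config` is hypothetical):

```python
def per_label_config(labels, problem_types, eval_metrics):
    """Hypothetical helper: pair each label with its problem type and metric."""
    if not (len(labels) == len(problem_types) == len(eval_metrics)):
        raise ValueError("`problem_types` and `eval_metrics` must have the same length as `labels`")
    return {
        label: {'problem_type': problem_type, 'eval_metric': eval_metric}
        for label, problem_type, eval_metric in zip(labels, problem_types, eval_metrics)
    }


config = per_label_config(
    ['education-num', 'education', 'class'],
    ['regression', 'multiclass', 'binary'],
    ['mean_absolute_error', 'accuracy', 'accuracy'],
)
for label, settings in config.items():
    print(label, settings)
```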

In [82]:
multi_predictor = MultilabelPredictor(labels=labels, problem_types=problem_types, eval_metrics=eval_metrics, path=save_path)
multi_predictor.fit(train_data, time_limit=time_limit)

Beginning AutoGluon training ... Time limit = 5s
AutoGluon will save models to "agModels-predictEducationClass/Predictor_education-num/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 12
Label Column: education-num
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    7408.97 MB
	Train Data (Original)  Memory Usage: 0.26 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeat

Fitting TabularPredictor for label: education-num ...


	-1.7808	 = Validation score   (-mean_absolute_error)
	0.22s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 4.71s of the 4.71s of remaining time.
	-1.7854	 = Validation score   (-mean_absolute_error)
	0.2s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 4.51s of the 4.5s of remaining time.
	-1.7079	 = Validation score   (-mean_absolute_error)
	0.18s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 4.3s of the 4.3s of remaining time.
	-1.7377	 = Validation score   (-mean_absolute_error)
	0.77s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 3.53s of the 3.53s of remaining time.
	-1.8167	 = Validation score   (-mean_absolute_error)
	0.15s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 3.35s of th

Fitting TabularPredictor for label: education ...


	0.8163	 = Validation score   (accuracy)
	0.3s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 4.63s of the 4.63s of remaining time.
	0.9694	 = Validation score   (accuracy)
	1.0s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 3.57s of the 3.57s of remaining time.
	1.0	 = Validation score   (accuracy)
	0.51s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 3.05s of the 3.05s of remaining time.
	0.9082	 = Validation score   (accuracy)
	0.24s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 2.77s of the 2.77s of remaining time.
	0.898	 = Validation score   (accuracy)
	0.23s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 2.52s of the 2.52s of remaining time.
	Ran out of time, early stopping on itera

Fitting TabularPredictor for label: class ...


	0.83	 = Validation score   (accuracy)
	0.16s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 4.78s of the 4.78s of remaining time.
	0.85	 = Validation score   (accuracy)
	0.19s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 4.59s of the 4.59s of remaining time.
	0.84	 = Validation score   (accuracy)
	0.21s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 4.36s of the 4.35s of remaining time.
	0.83	 = Validation score   (accuracy)
	0.18s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 4.15s of the 4.15s of remaining time.
	0.85	 = Validation score   (accuracy)
	0.56s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 3.59s of the 3.59s of remaining time.
	0.82	 = Validation score   (accuracy)
	0

MultilabelPredictor saved to disk. Load with: MultilabelPredictor.load('agModels-predictEducationClass/')


### Inference and Evaluation

In [83]:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
test_data = test_data.sample(n=subsample_size, random_state=0)
test_data_nolab = test_data.drop(columns=labels)  # unnecessary, just to demonstrate we're not cheating here
test_data_nolab.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
5454,41,Self-emp-not-inc,408498,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States
6111,39,Private,746786,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,55,United-States
5282,50,Private,62593,Married-civ-spouse,Farming-fishing,Husband,Asian-Pac-Islander,Male,0,0,40,United-States
3046,31,Private,248178,Married-civ-spouse,Other-service,Husband,Black,Male,0,0,35,United-States
2162,43,State-gov,52849,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States


In [84]:
multi_predictor = MultilabelPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained multilabel predictor from file

predictions = multi_predictor.predict(test_data_nolab)
print("Predictions:  \n", predictions)

Predicting with TabularPredictor for label: education-num ...
Predicting with TabularPredictor for label: education ...
Predicting with TabularPredictor for label: class ...
Predictions:  
       education-num      education   class
5454      10.314816   Some-college    >50K
6111      13.340745      Bachelors    >50K
5282       9.820933        HS-grad   <=50K
3046       9.638832        HS-grad   <=50K
2162      12.913060        HS-grad    >50K
...             ...            ...     ...
6965       9.444245        HS-grad    >50K
4762       8.696204           11th   <=50K
234       10.413758   Some-college   <=50K
6291      10.391630   Some-college   <=50K
9575      10.274948   Some-college    >50K

[500 rows x 3 columns]


In [85]:
evaluations = multi_predictor.evaluate(test_data)
print(evaluations)
print("Evaluated using metrics:", multi_predictor.eval_metrics)

Evaluation: mean_absolute_error on test data: -1.6336145429611206
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_absolute_error": -1.6336145429611206,
    "root_mean_squared_error": -2.198722579088362,
    "mean_squared_error": -4.834380979792979,
    "r2": 0.3748938437982332,
    "pearsonr": 0.6188337473117967,
    "median_absolute_error": -1.224937915802002
}
Evaluation: accuracy on test data: 0.218
Evaluations on test data:
{
    "accuracy": 0.218,
    "balanced_accuracy": 0.08682331905790826,
    "mcc": 0.03233013068324609
}
Evaluation: accuracy on test data: 0.84
Evaluations on test data:
{
    "accuracy": 0.84,
    "balanced_accuracy": 0.7303746421780648,
    "mcc": 0.5471381005232028,
    "roc_auc": 0.85346538791032,
    "f1": 0.6190476190476191,
    "precision": 0.8024691358024691,
    "recall": 0.5038759689922481
}


Evaluating TabularPredictor for label: education-num ...
Evaluating TabularPredictor for label: education ...
Evaluating TabularPredictor for label: class ...
{'education-num': {'mean_absolute_error': -1.6336145429611206, 'root_mean_squared_error': -2.198722579088362, 'mean_squared_error': -4.834380979792979, 'r2': 0.3748938437982332, 'pearsonr': 0.6188337473117967, 'median_absolute_error': -1.224937915802002}, 'education': {'accuracy': 0.218, 'balanced_accuracy': 0.08682331905790826, 'mcc': 0.03233013068324609}, 'class': {'accuracy': 0.84, 'balanced_accuracy': 0.7303746421780648, 'mcc': 0.5471381005232028, 'roc_auc': 0.85346538791032, 'f1': 0.6190476190476191, 'precision': 0.8024691358024691, 'recall': 0.5038759689922481}}
Evaluated using metrics: {'education-num': 'mean_absolute_error', 'education': 'accuracy', 'class': 'accuracy'}
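
As the log notes, scores are reported so that higher is always better, which is why the error metrics above are negative. A minimal sketch of converting them back to ordinary metric values follows. The set `LOWER_IS_BETTER` is written out by hand here as an assumption, not read from AutoGluon.

```python
# Scores copied from the `education-num` evaluation output above.
reported = {
    'mean_absolute_error': -1.6336145429611206,
    'root_mean_squared_error': -2.198722579088362,
    'mean_squared_error': -4.834380979792979,
    'r2': 0.3748938437982332,
}

# Assumption: these are the metrics whose sign was flipped for reporting.
LOWER_IS_BETTER = {
    'mean_absolute_error',
    'root_mean_squared_error',
    'mean_squared_error',
    'median_absolute_error',
}

raw_values = {
    name: (-score if name in LOWER_IS_BETTER else score)
    for name, score in reported.items()
}
print(raw_values)
```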


### Accessing the TabularPredictor for One Label

In [90]:
multi_predictor.get_predictor('education-num').leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,-1.620802,0.024199,1.473809,0.000142,0.070705,2,True,12
1,XGBoost,-1.631029,0.001897,0.203823,0.001897,0.203823,1,True,9
2,RandomForestMSE,-1.7079,0.016527,0.182843,0.016527,0.182843,1,True,5
3,CatBoost,-1.737659,0.001799,0.770317,0.001799,0.770317,1,True,6
4,NeuralNetTorch,-1.745082,0.003626,0.801264,0.003626,0.801264,1,True,10
5,LightGBMXT,-1.780762,0.002007,0.215174,0.002007,0.215174,1,True,3
6,LightGBM,-1.785416,0.001803,0.20082,0.001803,0.20082,1,True,4
7,ExtraTreesMSE,-1.816733,0.014317,0.152974,0.014317,0.152974,1,True,7
8,LightGBMLarge,-1.892534,0.001783,0.338026,0.001783,0.338026,1,True,11
9,NeuralNetFastAI,-1.929672,0.00366,0.277434,0.00366,0.277434,1,True,8


In [91]:
multi_predictor.get_predictor('education').leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,1.0,0.001956,2.505106,0.001956,2.505106,1,True,8
1,LightGBM,1.0,0.002955,0.512164,0.002955,0.512164,1,True,5
2,WeightedEnsemble_L2,1.0,0.003144,0.602399,0.000189,0.090235,2,True,9
3,LightGBMXT,0.969388,0.004205,1.00471,0.004205,1.00471,1,True,4
4,RandomForestGini,0.908163,0.018387,0.244683,0.018387,0.244683,1,True,6
5,RandomForestEntr,0.897959,0.017997,0.22914,0.017997,0.22914,1,True,7
6,NeuralNetFastAI,0.816327,0.003726,0.301394,0.003726,0.301394,1,True,3
7,KNeighborsUnif,0.265306,0.005394,0.004069,0.005394,0.004069,1,True,1
8,KNeighborsDist,0.234694,0.002665,0.003351,0.002665,0.003351,1,True,2


In [86]:
predictor_class = multi_predictor.get_predictor('class')
predictor_class.leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.85,0.001981,0.555203,0.001981,0.555203,1,True,7
1,XGBoost,0.85,0.002079,0.125008,0.002079,0.125008,1,True,11
2,LightGBM,0.85,0.002153,0.186465,0.002153,0.186465,1,True,4
3,WeightedEnsemble_L2,0.85,0.00242,0.346392,0.000267,0.159927,2,True,14
4,NeuralNetTorch,0.84,0.003904,0.791422,0.003904,0.791422,1,True,12
5,RandomForestGini,0.84,0.015179,0.211425,0.015179,0.211425,1,True,5
6,LightGBMLarge,0.83,0.002223,0.372655,0.002223,0.372655,1,True,13
7,LightGBMXT,0.83,0.00241,0.156718,0.00241,0.156718,1,True,3
8,RandomForestEntr,0.83,0.014635,0.180579,0.014635,0.180579,1,True,6
9,NeuralNetFastAI,0.82,0.004322,0.295996,0.004322,0.295996,1,True,10


## Training models with GPU support

https://auto.gluon.ai/stable/tutorials/tabular/advanced/tabular-gpu.html

In [92]:
predictor = TabularPredictor(label=label).fit(
    train_data,
    num_gpus=1,  # Grant 1 gpu for the entire Tabular Predictor
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_090837/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230606_090837/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass

In [None]:
hyperparameters = {
    'GBM': [
        {'ag_args_fit': {'num_gpus': 0}},  # Train with CPU
        {'ag_args_fit': {'num_gpus': 1}}   # Train with GPU. This amount needs to be <= total num_gpus granted to TabularPredictor
    ]
}
predictor = TabularPredictor(label=label).fit(
    train_data, 
    num_gpus=1,
    hyperparameters=hyperparameters, 
)
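
Since a per-model `num_gpus` must not exceed the total granted to the predictor, it can help to check a `hyperparameters` dict before training. The function below is a hypothetical pre-flight check in plain Python. It is not part of AutoGluon.

```python
def find_gpu_overcommit(hyperparameters, total_gpus):
    """Hypothetical check: list model configs that request more GPUs than granted."""
    offenders = []
    for model_key, configs in hyperparameters.items():
        # A model may map to a single config dict or to a list of them.
        if isinstance(configs, dict):
            configs = [configs]
        for index, config in enumerate(configs):
            requested = config.get('ag_args_fit', {}).get('num_gpus', 0)
            if requested > total_gpus:
                offenders.append((model_key, index, requested))
    return offenders


hyperparameters = {
    'GBM': [
        {'ag_args_fit': {'num_gpus': 0}},
        {'ag_args_fit': {'num_gpus': 1}},
    ]
}
ok = find_gpu_overcommit(hyperparameters, total_gpus=1)
too_many = find_gpu_overcommit({'GBM': [{'ag_args_fit': {'num_gpus': 2}}]}, total_gpus=1)
print(ok, too_many)
```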

### advanced resource allocation 

In [None]:
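predictor = TabularPredictor(label=label).fit(
    train_data,
    num_cpus=32,  # total CPUs granted to the predictor
    num_gpus=4,   # total GPUs granted to the predictor
    hyperparameters={
        'NN_TORCH': {},
    },
    num_bag_folds=2,
    ag_args_ensemble={
        'ag_args_fit': {  # resources for each bagged model
            'num_cpus': 10,
            'num_gpus': 2,
        }
    },
    ag_args_fit={  # resources for each individual model (one fold)
        'num_cpus': 4,
        'num_gpus': 0.5,
    },
    hyperparameter_tune_kwargs={
        'searcher': 'random',
        'scheduler': 'local',
        'num_trials': 2,
    },
)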
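
The numbers in this call can be read as nested budgets: 32 CPUs / 4 GPUs overall, 10 CPUs / 2 GPUs per bagged model, and 4 CPUs / 0.5 GPU per fold. The sketch below works through the resulting parallelism. It assumes jobs are packed simply by dividing each budget, which is a simplification and may not match the actual scheduler.

```python
def parallel_slots(cpu_budget, gpu_budget, cpu_each, gpu_each, wanted):
    """How many jobs fit in the budget at once, capped by how many are wanted."""
    by_cpu = int(cpu_budget // cpu_each)
    by_gpu = int(gpu_budget // gpu_each)
    return min(by_cpu, by_gpu, wanted)


# Bagged models (one per HPO trial) that fit into the predictor's total budget.
trials_in_parallel = parallel_slots(32, 4, 10, 2, wanted=2)
# Folds that fit into one bagged model's budget.
folds_in_parallel = parallel_slots(10, 2, 4, 0.5, wanted=2)

cpus_in_use = trials_in_parallel * folds_in_parallel * 4
gpus_in_use = trials_in_parallel * folds_in_parallel * 0.5
print(trials_in_parallel, folds_in_parallel, cpus_in_use, gpus_in_use)
```

Under these assumptions, 2 trials run side by side with 2 folds each, using 16 of the 32 CPUs and 2 of the 4 GPUs.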

## Adding a custom metric to AutoGluon

https://auto.gluon.ai/stable/tutorials/tabular/advanced/tabular-custom-metric.html

In [93]:
import numpy as np

y_true = np.random.randint(low=0, high=2, size=10)
y_pred = np.random.randint(low=0, high=2, size=10)

print(f'y_true: {y_true}')
print(f'y_pred: {y_pred}')

y_true: [0 1 1 0 1 1 1 1 1 1]
y_pred: [1 0 0 1 0 0 0 0 0 1]


In [94]:
import sklearn.metrics

sklearn.metrics.accuracy_score(y_true, y_pred)

0.1
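
The 0.1 can be checked by hand: accuracy is the fraction of positions where prediction and truth agree. A plain-Python check using the arrays printed above:

```python
# Arrays copied from the printed output above.
y_true = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

# Count positions where the prediction equals the true label.
matches = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = matches / len(y_true)
print(matches, accuracy)
```

Only the last position matches, so 1 of 10 predictions is correct.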

In [95]:
from autogluon.core.metrics import make_scorer

ag_accuracy_scorer = make_scorer(name='accuracy',
                                 score_func=sklearn.metrics.accuracy_score,
                                 optimum=1,
                                 greater_is_better=True)

### Custom Mean Squared Error metric

In [96]:
y_true = np.random.rand(10)
y_pred = np.random.rand(10)

print(f'y_true: {y_true}')
print(f'y_pred: {y_pred}')

y_true: [0.79172504 0.52889492 0.56804456 0.92559664 0.07103606 0.0871293
 0.0202184  0.83261985 0.77815675 0.87001215]
y_pred: [0.97861834 0.79915856 0.46147936 0.78052918 0.11827443 0.63992102
 0.14335329 0.94466892 0.52184832 0.41466194]


In [97]:
sklearn.metrics.mean_squared_error(y_true, y_pred)

0.07489374242602942

In [98]:
ag_mean_squared_error_scorer = make_scorer(name='mean_squared_error',
                                           score_func=sklearn.metrics.mean_squared_error,
                                           optimum=0,
                                           greater_is_better=False)

### Defining a custom metric function

In [99]:
def mse_func(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return ((y_true - y_pred) ** 2).mean()

mse_func(y_true, y_pred)

0.07489374242602942

In [100]:
ag_mean_squared_error_custom_scorer = make_scorer(name='mean_squared_error',
                                                  score_func=mse_func,
                                                  optimum=0,
                                                  greater_is_better=False)
ag_mean_squared_error_custom_scorer(y_true, y_pred)

-0.07489374242602942
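
The scorer returns the negated MSE because `greater_is_better=False` flips the sign so that higher scores are always better. The class below is a hypothetical stand-in that imitates only this sign flip. It is not how `make_scorer` is implemented.

```python
class SignedScorer:
    """Hypothetical stand-in: wraps a metric so that higher scores are always better."""

    def __init__(self, name, score_func, greater_is_better):
        self.name = name
        self.score_func = score_func
        self.sign = 1 if greater_is_better else -1

    def __call__(self, y_true, y_pred):
        return self.sign * self.score_func(y_true, y_pred)


def mse(y_true, y_pred):
    # Mean of squared differences, written without NumPy.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


scorer = SignedScorer('mean_squared_error', mse, greater_is_better=False)
raw = mse([1.0, 2.0], [1.0, 4.0])
score = scorer([1.0, 2.0], [1.0, 4.0])
print(raw, score)
```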

### Using Custom Metric in TabularPredictor

In [101]:
from autogluon.tabular import TabularDataset

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
label = 'class'  # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0)  # subsample dataset for faster demo

train_data.head(5)

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


In [102]:
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label=label).fit(train_data, hyperparameters='toy')

predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_091158/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230606_091158/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    1000
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which labe

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,XGBoost,0.847784,0.84,0.016339,0.002301,0.034474,0.016339,0.002301,0.034474,1,True,3
1,CatBoost,0.842768,0.86,0.00855,0.002246,0.019817,0.00855,0.002246,0.019817,1,True,2
2,WeightedEnsemble_L2,0.842768,0.86,0.009629,0.002559,0.07571,0.001079,0.000314,0.055892,2,True,5
3,NeuralNetTorch,0.831917,0.815,0.052639,0.006049,0.260721,0.052639,0.006049,0.260721,1,True,4
4,LightGBM,0.78094,0.77,0.006437,0.002364,0.115073,0.006437,0.002364,0.115073,1,True,1


In [104]:
predictor.leaderboard(test_data, extra_metrics=[ag_accuracy_scorer], silent=True)

Unnamed: 0,model,score_test,accuracy,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,XGBoost,0.847784,0.847784,0.84,0.019465,0.002301,0.034474,0.019465,0.002301,0.034474,1,True,3
1,CatBoost,0.842768,0.842768,0.86,0.008188,0.002246,0.019817,0.008188,0.002246,0.019817,1,True,2
2,WeightedEnsemble_L2,0.842768,0.842768,0.86,0.009149,0.002559,0.07571,0.000961,0.000314,0.055892,2,True,5
3,NeuralNetTorch,0.831917,0.831917,0.815,0.052139,0.006049,0.260721,0.052139,0.006049,0.260721,1,True,4
4,LightGBM,0.78094,0.78094,0.77,0.005682,0.002364,0.115073,0.005682,0.002364,0.115073,1,True,1


## Adding a custom model to AutoGluon

https://auto.gluon.ai/stable/tutorials/tabular/advanced/tabular-custom-model-advanced.html

### load data

In [105]:
from autogluon.tabular import TabularDataset

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
label = 'class'  # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0)  # subsample for faster demo

train_data.head(5)

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


### Force features to not be dropped in model-specific preprocessing

In [106]:
from autogluon.core.models import AbstractModel

class DummyModel(AbstractModel):
    def _fit(self, X, **kwargs):
        print(f'Before {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        X = self.preprocess(X)
        print(f'After  {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        print(X.head(5))

class DummyModelKeepUnique(DummyModel):
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        extra_auxiliary_params = dict(
            drop_unique=False,  # Whether to drop features that have only 1 unique value, default is True
        )
        default_auxiliary_params.update(extra_auxiliary_params)
        return default_auxiliary_params

In [107]:
# WARNING: To use this in practice, you must put this code in a separate Python file
#  from the main process and import it, or else it will not be serializable.
from autogluon.features import BulkFeatureGenerator, AutoMLPipelineFeatureGenerator, IdentityFeatureGenerator


class CustomFeatureGeneratorWithUserOverride(BulkFeatureGenerator):
    def __init__(self, automl_generator_kwargs: dict = None, **kwargs):
        generators = self._get_default_generators(automl_generator_kwargs=automl_generator_kwargs)
        super().__init__(generators=generators, **kwargs)

    def _get_default_generators(self, automl_generator_kwargs: dict = None):
        if automl_generator_kwargs is None:
            automl_generator_kwargs = dict()

        generators = [
            [
                # Preprocessing logic that handles normal features
                AutoMLPipelineFeatureGenerator(banned_feature_special_types=['user_override'], **automl_generator_kwargs),

                # Preprocessing logic for special features the user wishes to treat separately; here we simply skip preprocessing for them.
                IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['user_override'])),
            ],
        ]
        return generators

In [108]:
# add a useless dummy feature to show that it is not dropped in preprocessing
train_data['dummy_feature'] = 'dummy value'
test_data['dummy_feature'] = 'dummy value'

from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)

print('Before inserting overrides:')
print(feature_metadata)

feature_metadata = feature_metadata.add_special_types(
    {
        'age': ['user_override'],
        'native-country': ['user_override'],
        'dummy_feature': ['user_override'],
    }
)

print('After inserting overrides:')
print(feature_metadata)

Before inserting overrides:
('int', [])    :  6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 10 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
After inserting overrides:
('int', [])                   : 5 | ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
('int', ['user_override'])    : 1 | ['age']
('object', [])                : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('object', ['user_override']) : 2 | ['native-country', 'dummy_feature']
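The regrouping printed above can be mimicked with a plain dictionary: each feature is keyed by its raw dtype plus its special types. This is only a sketch of the bookkeeping, not AutoGluon's implementation; `group_features` is a hypothetical helper, and the column names are taken from the `train_data` header shown earlier.

```python
from collections import defaultdict

# Raw dtype per column (6 int, 10 object, matching the "Before" output).
raw_types = {
    "age": "int", "fnlwgt": "int", "education-num": "int",
    "capital-gain": "int", "capital-loss": "int", "hours-per-week": "int",
    "workclass": "object", "education": "object", "marital-status": "object",
    "occupation": "object", "relationship": "object", "race": "object",
    "sex": "object", "native-country": "object", "class": "object",
    "dummy_feature": "object",
}

# Same overrides as passed to add_special_types above.
special_types = {
    "age": ["user_override"],
    "native-country": ["user_override"],
    "dummy_feature": ["user_override"],
}


def group_features(raw_types, special_types):
    """Group feature names by (raw dtype, tuple of special types)."""
    groups = defaultdict(list)
    for name, raw in raw_types.items():
        key = (raw, tuple(special_types.get(name, [])))
        groups[key].append(name)
    return dict(groups)


groups = group_features(raw_types, special_types)
for (raw, special), names in sorted(groups.items()):
    print(f"({raw!r}, {list(special)}): {len(names)} | {names}")
```

A feature generator can then select one group and skip another, which is what `banned_feature_special_types` and `required_special_types` do in the generator defined earlier.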


### put it all together

In [109]:
# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]

# preprocess the label column, as done in the prior custom model tutorial
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to convert labels to floats/integers for model training and inference; it can also inverse_transform predictions back to the original labels.
problem_type = infer_problem_type(y=y)  # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_preprocessed = label_cleaner.transform(y)
y_test_preprocessed = label_cleaner.transform(y_test)

# Make sure to specify your custom feature metadata to the feature generator
my_custom_feature_generator = CustomFeatureGeneratorWithUserOverride(feature_metadata_in=feature_metadata)

X_preprocessed = my_custom_feature_generator.fit_transform(X)
X_test_preprocessed = my_custom_feature_generator.transform(X_test)


AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Fitting CustomFeatureGeneratorWithUserOverride...
	Stage 1 Generators:
		Fitting AutoMLPipelineFeatureGenerator...
			Available Memory:                    7377.36 MB
			Train Data (Original)  Memory Usage: 0.66 MB (0.0% of available memory)
			Stage 1 Generators:
				Fitting AsTypeFeatureGenerator...
					Note: Convert

In [110]:
print(list(X_preprocessed.columns))
X_preprocessed.head(5)

['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']


Unnamed: 0,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,age,native-country,dummy_feature
6118,39264,10,0,0,0,40,3,14,1,4,5,4,51,United-States,dummy value
23204,51662,6,0,0,0,8,3,0,1,8,5,4,58,United-States,dummy value
29590,326310,10,1,0,0,44,3,14,1,3,0,4,40,United-States,dummy value
18116,222450,9,1,0,2339,40,3,11,3,12,1,4,37,El-Salvador,dummy value
33964,109190,13,1,15024,0,40,3,9,1,4,0,4,62,United-States,dummy value


In [111]:
dummy_model = DummyModel()
dummy_model.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)


No path specified. Models will be saved in: "AutogluonModels/ag-20230606_091456/DummyModel/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Model DummyModel's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.


Before DummyModel Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModel Preprocessing (14 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial     

<__main__.DummyModel at 0x32b581240>

In [112]:
dummy_model_keep_unique = DummyModelKeepUnique()
dummy_model_keep_unique.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_091504/DummyModelKeepUnique/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Model DummyModelKeepUnique's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.


Before DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Marr

<__main__.DummyModelKeepUnique at 0x32b448100>
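The only difference between the two runs is `dummy_feature`: it holds a single value, so the default `drop_unique=True` removes it during model-specific preprocessing (15 features become 14). The effect can be sketched without AutoGluon; `drop_constant_columns` is a hypothetical helper working on a dict of columns, not the library's code.

```python
def drop_constant_columns(columns, drop_unique=True):
    """Return the columns to keep; with drop_unique, skip those with one distinct value."""
    if not drop_unique:
        return dict(columns)
    return {name: values for name, values in columns.items() if len(set(values)) > 1}


# A few values copied from the first five rows shown earlier.
columns = {
    "age": [51, 58, 40, 37, 62],
    "education-num": [10, 6, 10, 9, 13],
    "dummy_feature": ["dummy value"] * 5,
}

kept_default = list(drop_constant_columns(columns))
kept_all = list(drop_constant_columns(columns, drop_unique=False))
print(kept_default)  # dummy_feature is dropped
print(kept_all)      # all three columns are kept
```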

### Keeping Features via TabularPredictor

In [113]:
from autogluon.tabular import TabularPredictor

feature_generator = CustomFeatureGeneratorWithUserOverride()
predictor = TabularPredictor(label=label)
predictor.fit(
    train_data=train_data,
    feature_metadata=feature_metadata,  # feature metadata with your overrides
    feature_generator=feature_generator,  # your custom feature generator that handles the overrides
    hyperparameters={
        'GBM': {},  # Can fit your custom model alongside default models
        DummyModel: {},  # Will drop dummy_feature
        DummyModelKeepUnique: {},  # Will not drop dummy_feature
        # DummyModel: {'ag_args_fit': {'drop_unique': False}},  # Another way to get the same result as using DummyModelKeepUnique
    }
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230606_091523/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230606_091523/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    1000
Train Data Columns: 15
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which labe

Before DummyModel Preprocessing (15 features):
	['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']
After  DummyModel Preprocessing (14 features):
	['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country']
       fnlwgt  education-num  sex  capital-gain  capital-loss  hours-per-week  \
20453  369027              9    1             0             0              37   
9855   102766             10    1             0             0              40   
21144  141645             10    0             0             0              40   
22391   64520              9    1             0             0              55   
16793  193820             14    0             0             0              40   

      workclass 

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x32b580af0>

## Deployment Optimization

https://auto.gluon.ai/stable/tutorials/tabular/advanced/tabular-deployment.html

In [114]:
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
label = 'class'
subsample_size = 500  # subsample the data for a faster demo; try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


In [115]:
save_path = 'agModels-predictClass-deployment'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)

Beginning AutoGluon training ...
AutoGluon will save models to "agModels-predictClass-deployment/"
AutoGluon Version:  0.7.0
Python Version:     3.10.9
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set th

In [116]:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]  # values to predict
test_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States,<=50K
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States,<=50K
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States,>50K
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States,<=50K
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States,<=50K


In [117]:
predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data)
y_pred

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

In [120]:
predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,0.009499,0.001961,0.604667,0.009499,0.001961,0.604667,1,True,7
1,XGBoost,0.842461,0.85,0.019269,0.002214,0.127821,0.019269,0.002214,0.127821,1,True,11
2,RandomForestGini,0.842461,0.84,0.085714,0.016358,0.220086,0.085714,0.016358,0.220086,1,True,5
3,RandomForestEntr,0.840925,0.83,0.087561,0.016415,0.201911,0.087561,0.016415,0.201911,1,True,6
4,LightGBM,0.839799,0.85,0.012322,0.002591,0.289338,0.012322,0.002591,0.289338,1,True,4
5,WeightedEnsemble_L2,0.839799,0.85,0.014513,0.002865,0.456426,0.002191,0.000274,0.167088,2,True,14
6,LightGBMXT,0.836421,0.83,0.008205,0.003748,0.169393,0.008205,0.003748,0.169393,1,True,3
7,ExtraTreesGini,0.834374,0.82,0.089456,0.016779,0.178069,0.089456,0.016779,0.178069,1,True,8
8,ExtraTreesEntr,0.832839,0.81,0.089109,0.0162,0.194355,0.089109,0.0162,0.194355,1,True,9
9,LightGBMLarge,0.828949,0.83,0.01673,0.002476,0.519117,0.01673,0.002476,0.519117,1,True,13


### Snapshot a Predictor with .clone()

In [121]:
save_path_clone = save_path + '-clone'
# will return the path to the cloned predictor, identical to save_path_clone
path_clone = predictor.clone(path=save_path_clone)

Cloned TabularPredictor located in 'agModels-predictClass-deployment/' to 'agModels-predictClass-deployment-clone'.
	To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone")


In [122]:
predictor_clone = TabularPredictor.load(path=path_clone)

In [123]:
y_pred_clone = predictor_clone.predict(test_data)
y_pred_clone

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

In [124]:
y_pred.equals(y_pred_clone)

True

In [125]:
predictor_clone.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,0.008821,0.001961,0.604667,0.008821,0.001961,0.604667,1,True,7
1,XGBoost,0.842461,0.85,0.018567,0.002214,0.127821,0.018567,0.002214,0.127821,1,True,11
2,RandomForestGini,0.842461,0.84,0.098795,0.016358,0.220086,0.098795,0.016358,0.220086,1,True,5
3,RandomForestEntr,0.840925,0.83,0.072902,0.016415,0.201911,0.072902,0.016415,0.201911,1,True,6
4,LightGBM,0.839799,0.85,0.011799,0.002591,0.289338,0.011799,0.002591,0.289338,1,True,4
5,WeightedEnsemble_L2,0.839799,0.85,0.014085,0.002865,0.456426,0.002286,0.000274,0.167088,2,True,14
6,LightGBMXT,0.836421,0.83,0.005686,0.003748,0.169393,0.005686,0.003748,0.169393,1,True,3
7,ExtraTreesGini,0.834374,0.82,0.103416,0.016779,0.178069,0.103416,0.016779,0.178069,1,True,8
8,ExtraTreesEntr,0.832839,0.81,0.095408,0.0162,0.194355,0.095408,0.0162,0.194355,1,True,9
9,LightGBMLarge,0.828949,0.83,0.015563,0.002476,0.519117,0.015563,0.002476,0.519117,1,True,13


### extra logic with the clone, such as refit_full

In [126]:
predictor_clone.refit_full()

predictor_clone.leaderboard(test_data, silent=True)

Refitting models via `predictor.refit_full` using all of the data (combined train and validation)...
	Models trained in this way will have the suffix "_FULL" and have NaN validation score.
	This process is not bound by time_limit, but should take less time than the original `predictor.fit` call.
	To learn more, refer to the `.refit_full` method docstring which explains how "_FULL" models differ from normal models.
Fitting 1 L1 models ...
Fitting model: KNeighborsUnif_FULL ...
	0.01s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model: KNeighborsDist_FULL ...
	0.0s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model: LightGBMXT_FULL ...
	0.13s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model: LightGBM_FULL ...
	0.14s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model: RandomForestGini_FULL ...
	0.22s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model: RandomForestEntr_FULL ...
	0.18s	 = Training   runtime
Fitting 1 L1 models ...
Fitting model

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,XGBoost_FULL,0.844303,,0.014524,,0.037782,0.014524,,0.037782,1,True,25
1,CatBoost_FULL,0.84287,,0.00753,,0.015589,0.00753,,0.015589,1,True,21
2,CatBoost,0.842461,0.85,0.007651,0.001961,0.604667,0.007651,0.001961,0.604667,1,True,7
3,XGBoost,0.842461,0.85,0.016113,0.002214,0.127821,0.016113,0.002214,0.127821,1,True,11
4,RandomForestGini,0.842461,0.84,0.066654,0.016358,0.220086,0.066654,0.016358,0.220086,1,True,5
5,RandomForestEntr,0.840925,0.83,0.06392,0.016415,0.201911,0.06392,0.016415,0.201911,1,True,6
6,LightGBM_FULL,0.840823,,0.014992,,0.135045,0.014992,,0.135045,1,True,18
7,WeightedEnsemble_L2_FULL,0.840823,,0.016161,,0.302133,0.001169,,0.167088,2,True,28
8,RandomForestGini_FULL,0.840618,,0.066347,,0.220644,0.066347,,0.220644,1,True,19
9,LightGBM,0.839799,0.85,0.013693,0.002591,0.289338,0.013693,0.002591,0.289338,1,True,4


In [127]:
predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,0.008566,0.001961,0.604667,0.008566,0.001961,0.604667,1,True,7
1,XGBoost,0.842461,0.85,0.020118,0.002214,0.127821,0.020118,0.002214,0.127821,1,True,11
2,RandomForestGini,0.842461,0.84,0.099076,0.016358,0.220086,0.099076,0.016358,0.220086,1,True,5
3,RandomForestEntr,0.840925,0.83,0.081861,0.016415,0.201911,0.081861,0.016415,0.201911,1,True,6
4,LightGBM,0.839799,0.85,0.01293,0.002591,0.289338,0.01293,0.002591,0.289338,1,True,4
5,WeightedEnsemble_L2,0.839799,0.85,0.014183,0.002865,0.456426,0.001253,0.000274,0.167088,2,True,14
6,LightGBMXT,0.836421,0.83,0.011955,0.003748,0.169393,0.011955,0.003748,0.169393,1,True,3
7,ExtraTreesGini,0.834374,0.82,0.078354,0.016779,0.178069,0.078354,0.016779,0.178069,1,True,8
8,ExtraTreesEntr,0.832839,0.81,0.084106,0.0162,0.194355,0.084106,0.0162,0.194355,1,True,9
9,LightGBMLarge,0.828949,0.83,0.017703,0.002476,0.519117,0.017703,0.002476,0.519117,1,True,13


`predictor_clone` has changed, but the original `predictor` is unchanged.
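One way to confirm this from the leaderboards is to look for the `_FULL` suffix that `refit_full` adds to model names. The sketch below uses only the model names visible in the two tables above (this export shows ten rows per table); `refit_models` is a hypothetical helper.

```python
# Model names copied from the leaderboard rows shown above.
clone_models = [
    "XGBoost_FULL", "CatBoost_FULL", "CatBoost", "XGBoost", "RandomForestGini",
    "RandomForestEntr", "LightGBM_FULL", "WeightedEnsemble_L2_FULL",
    "RandomForestGini_FULL", "LightGBM",
]
original_models = [
    "CatBoost", "XGBoost", "RandomForestGini", "RandomForestEntr", "LightGBM",
    "WeightedEnsemble_L2", "LightGBMXT", "ExtraTreesGini", "ExtraTreesEntr",
    "LightGBMLarge",
]


def refit_models(model_names):
    """Return the names that carry the `_FULL` suffix added by refit_full."""
    return [name for name in model_names if name.endswith("_FULL")]


print("clone:   ", refit_models(clone_models))
print("original:", refit_models(original_models))
```

With live predictors, the same filter could be applied to the `model` column of each leaderboard DataFrame.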

### Snapshot a deployment optimized Predictor via .clone_for_deployment()

In [128]:
save_path_clone_opt = save_path + '-clone-opt'
# will return the path to the cloned predictor, identical to save_path_clone_opt
path_clone_opt = predictor.clone_for_deployment(path=save_path_clone_opt)

Cloned TabularPredictor located in 'agModels-predictClass-deployment/' to 'agModels-predictClass-deployment-clone-opt'.
	To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone-opt")
Clone: Keeping minimum set of models required to predict with best model 'WeightedEnsemble_L2'...
Deleting model KNeighborsUnif. All files under agModels-predictClass-deployment-clone-opt/models/KNeighborsUnif/ will be removed.
Deleting model KNeighborsDist. All files under agModels-predictClass-deployment-clone-opt/models/KNeighborsDist/ will be removed.
Deleting model LightGBMXT. All files under agModels-predictClass-deployment-clone-opt/models/LightGBMXT/ will be removed.
Deleting model RandomForestGini. All files under agModels-predictClass-deployment-clone-opt/models/RandomForestGini/ will be removed.
Deleting model RandomForestEntr. All files under agModels-predictClass-deployment-clone-opt/models/RandomForestEntr/ will be removed.
Deleting m

In [129]:
predictor_clone_opt = TabularPredictor.load(path=path_clone_opt)

In [130]:
# Keep the models in memory so they are not reloaded on every predict call
predictor_clone_opt.persist_models()

Persisting 2 models in memory. Models will require 0.0% of memory.


['WeightedEnsemble_L2', 'LightGBM']

In [131]:
y_pred_clone_opt = predictor_clone_opt.predict(test_data)
y_pred_clone_opt

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

In [132]:
y_pred.equals(y_pred_clone_opt)

True

In [133]:
predictor_clone_opt.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM,0.839799,0.85,0.012381,0.002591,0.289338,0.012381,0.002591,0.289338,1,True,1
1,WeightedEnsemble_L2,0.839799,0.85,0.012951,0.002865,0.456426,0.00057,0.000274,0.167088,2,True,2


In [134]:
size_original = predictor.get_size_disk()
size_opt = predictor_clone_opt.get_size_disk()
print(f'Size Original:  {size_original} bytes')
print(f'Size Optimized: {size_opt} bytes')
print(f'Optimized predictor achieved a {round((1 - (size_opt/size_original)) * 100, 1)}% reduction in disk usage.')

Size Original:  16850372 bytes
Size Optimized: 179266 bytes
Optimized predictor achieved a 98.9% reduction in disk usage.


In [135]:
predictor.get_size_disk_per_file()

models/ExtraTreesGini/model.pkl                        4559843
models/ExtraTreesEntr/model.pkl                        4531477
models/RandomForestGini/model.pkl                      3075072
models/RandomForestEntr/model.pkl                      2951194
models/LightGBMLarge/model.pkl                          470889
models/XGBoost/xgb.ubj                                  454178
models/NeuralNetTorch/model.pkl                         252021
models/NeuralNetFastAI/model-internals.pkl              167374
models/LightGBM/model.pkl                               146038
models/LightGBMXT/model.pkl                              42071
models/KNeighborsDist/model.pkl                          39986
models/KNeighborsUnif/model.pkl                          39985
utils/data/X.pkl                                         27655
models/CatBoost/model.pkl                                21714
metadata.json                                            10979
learner.pkl                                            

In [136]:
predictor_clone_opt.get_size_disk_per_file()

models/LightGBM/model.pkl               146065
metadata.json                            10979
learner.pkl                              10728
models/WeightedEnsemble_L2/model.pkl      8285
models/trainer.pkl                        2462
predictor.pkl                              742
__version__                                  5
Name: size, dtype: int64
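As a quick consistency check, the per-file sizes listed for the optimized clone add up to the total reported by `get_size_disk()` earlier. All numbers below are copied from the outputs above.

```python
# Per-file sizes (bytes) of the optimized clone, copied from the output above.
file_sizes = {
    "models/LightGBM/model.pkl": 146065,
    "metadata.json": 10979,
    "learner.pkl": 10728,
    "models/WeightedEnsemble_L2/model.pkl": 8285,
    "models/trainer.pkl": 2462,
    "predictor.pkl": 742,
    "__version__": 5,
}

size_opt = sum(file_sizes.values())
size_original = 16850372  # total reported for the original predictor

reduction = round((1 - size_opt / size_original) * 100, 1)
largest = max(file_sizes, key=file_sizes.get)

print(f"Optimized total: {size_opt} bytes")
print(f"Reduction: {reduction}%")
print(f"Largest file: {largest} ({file_sizes[largest] / size_opt:.0%} of the clone)")
```

The LightGBM model file dominates what remains, so any further size reduction would have to come from that model.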

### Compile models for maximized inference speed

In [137]:
predictor_clone_opt.compile_models()

Compiling 2 Models ...
Skipping compilation for WeightedEnsemble_L2 ... (No config specified)
Skipping compilation for LightGBM ... (No config specified)
Finished compiling models, total runtime = 0s.


In [138]:
y_pred_compile_opt = predictor_clone_opt.predict(test_data)
y_pred_compile_opt

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object