<p style="padding: 10px; border: 1px solid black;">
<img src="./../../images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>


# <a name="0">Code Walkthrough & Advanced AutoGluon Features</a>


This notebook shows how to use AutoGluon `TabularPredictor` to solve two machine learning tasks: a __regression task__ (book price prediction) and a __multiclass classification task__ (occupation prediction). 

<a href="#01">Part I - Solution Walkthrough & Discussions</a> covers a basic solution for the Book Price regression problem from the *MLU-DAY-ONE-ML-Hands-On.ipynb* notebook.

<a href="#02">Part II - Advanced AutoGluon Features</a> dives deeper into more advanced AutoGluon features, solving a multiclass classification task of predicting the occupation of individuals using US census data.

1. <a href="#1">ML Problem Description</a>
2. <a href="#2">Loading the Data</a>
3. <a href="#5">Model Training with AutoGluon</a>
    * Specifying performance metric
    * Specifying settings for TabularPredictor
    * Specifying hyperparameters and tuning them
    
4. <a href="#7">Model ensembling with stacking/bagging</a>
5. <a href="#8">Prediction options (inference)</a>
6. <a href="#10">Selecting individual models for predictions</a>
7. <a href="#11">Interpretability: Feature importance</a>
8. <a href="#12">Inference Speed: Model distillation</a>
    * Training student models
    * Excluding models
    


__Jupyter notebooks environment__:

* Jupyter notebooks allow creating and sharing documents that contain both code and rich text cells. If you are not familiar with Jupyter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
* This is a quick-start demo to bring you up to speed on coding and experimenting with machine learning. Move through the notebook __from top to bottom__. 
* Run each code cell to see its output. To run a cell, click within the cell and press __Shift+Enter__, or click __Run__ from the top of the page menu. 
* A `[*]` symbol next to the cell indicates the code is still running. A `[#]` symbol, where # is an integer, indicates it is finished.
* Beware, __some code cells might take longer to run__, sometimes 5-10 minutes (depending on the task, installing packages and libraries, training models, etc.)

Let's start by loading some libraries and packages!

In [1]:
%%capture
!pip install -q autogluon==0.6.2

In [2]:
# Load in libraries
import pandas as pd
# Importing the libraries needed to work with our Tabular dataset.
from autogluon.tabular import TabularPredictor, TabularDataset
# Additional library for tuning
import autogluon.core as ag

---
# <a name="01">Part I - Walkthrough & Discussions</a>
(<a href="#0">Go to top</a>)

Now that you have finished your hands-on activity, let's walk through the code you have used and discuss it. <br/>

In [3]:
# Loading the train and test datasets
df_train = TabularDataset("../../data/training.csv")
df_test = TabularDataset("../../data/mlu-leaderboard-test.csv")

# Train a model with AutoGluon on the train dataset
# Set the training time to a minute here (60 seconds), for fast experimentation
predictor = TabularPredictor(label="Price", eval_metric="mean_squared_error").fit(
    train_data=df_train, time_limit=60
)

# Make predictions on the test dataset with the AutoGluon model
predictions = predictor.predict(df_test)

# Creating a new dataframe for the MLU Leaderboard submission
submission = df_test[["ID"]].copy(deep=True)

# Creating label column from price prediction list
submission["Price"] = predictions

# Saving the dataframe as a csv file for MLU Leaderboard submission
# index=False prevents printing the row IDs as separate values
submission.to_csv(
    "../../data/predictions/Solution-Demo.csv",
    index=False,
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230207_190023/"
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20230207_190023/"
AutoGluon Version:  0.6.2
Python Version:     3.9.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri Oct 14 01:16:24 UTC 2022
Train Data Rows:    5051
Train Data Columns: 9
Label Column: Price
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (4.149249912590282, 1.414973347970818, 2.60147, 0.33003)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory

---

# <a name="02">Part II - Advanced AutoGluon Features</a>
(<a href="#0">Go to top</a>)

---
## <a name="1">ML Problem Description</a>

Predict the occupation of individuals using census data. 
> This is a __multiclass classification__ task (15 distinct classes). <br>

For the advanced feature demonstration we use a new dataset: Census data. In this particular dataset, each row corresponds to an individual person, and the columns contain various demographic characteristics collected for the census.

We predict the occupation of an individual, which is a multiclass classification problem. AutoGluon’s `TabularPredictor` and `TabularDataset` were already imported at the top of the notebook, so we start by loading the data from an S3 bucket.

___
## <a name="2">Loading the data</a>
(<a href="#0">Go to top</a>)


In [4]:
# Load in the dataset
train_data = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")

# Let's load the test data
test_data = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")

# Subsample the training data for a faster demo; try setting this to a much larger value
subsample_size = 1000

train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


___
## <a name="5">Model Training with AutoGluon</a>
(<a href="#0">Go to top</a>)


### Specifying performance metric

In [5]:
# Specify eval_metric explicitly for the demo (optional, since accuracy is the default for classification)
metric = "accuracy"

The full list of AutoGluon classification metrics can be found here:

`'accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score'`
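
To see why the choice of metric matters on an imbalanced multiclass problem, here is a minimal sketch that computes `accuracy` and `balanced_accuracy` by hand. The labels and helper functions are made up for illustration and are not part of AutoGluon; balanced accuracy is taken here as the unweighted mean of per-class recall.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: "Sales" dominates, the other two classes are rare.
y_true = ["Sales"] * 8 + ["Tech-support", "Armed-Forces"]
# A lazy model that always predicts the majority class.
y_pred = ["Sales"] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.3f}, balanced_accuracy = {bal_acc:.3f}")
```

A large gap between the two numbers suggests a model that does well on frequent classes and poorly on rare ones.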

### Specifying settings for TabularPredictor

In [6]:
# Train various models for ~2 min
time_limit = 2 * 60

### Specifying hyperparameters and tuning them

In [7]:
# Set Neural Net options
# Specifies non-default hyperparameter values for neural network models
nn_options = {
    # number of training epochs (controls training time of NN models)
    "num_epochs": 10,
    # learning rate used in training (real-valued hyperparameter searched on log-scale)
    "learning_rate": ag.space.Real(1e-4, 1e-2, default=5e-4, log=True),
    # activation function used in NN (categorical hyperparameter, default = first entry)
    "activation": ag.space.Categorical("relu", "softrelu", "tanh"),
    # dropout probability (real-valued hyperparameter)
    "dropout_prob": ag.space.Real(0.0, 0.5, default=0.1),
}

# Set GBM options
# Specifies non-default hyperparameter values for lightGBM gradient boosted trees
gbm_options = {
    # number of boosting rounds (controls training time of GBM models)
    "num_boost_round": 100,
    # number of leaves in trees (integer hyperparameter)
    "num_leaves": ag.space.Int(lower=26, upper=66, default=36),
}

# Add both NN and GBM options into a hyperparameter dictionary
# hyperparameters of each model type
# When these keys are missing from the hyperparameters dict, no models of that type are trained
hyperparameters = {
    "GBM": gbm_options,
    "NN_TORCH": nn_options,
}

# "auto" lets AutoGluon pick the search strategy used to explore hyperparameter combinations
search_strategy = "auto"

# Number of hyperparameter configurations (trials) to try
num_trials = 5

# HPO is not performed unless hyperparameter_tune_kwargs is specified
hyperparameter_tune_kwargs = {
    "num_trials": num_trials,
    "scheduler": "local",
    "searcher": search_strategy,
}
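
To build intuition for what a hyperparameter search does with ranges like the ones above, here is a minimal plain-Python sketch of random search, one simple strategy for drawing trial configurations. It is not AutoGluon's searcher: the helper names `sample_log_uniform` and `sample_config` are hypothetical, and only the ranges are copied from the cell above.

```python
import math
import random


def sample_log_uniform(rng, lower, upper):
    """Draw a value whose logarithm is uniform on [log(lower), log(upper)]."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_config(rng):
    """Draw one hypothetical trial configuration per model type."""
    return {
        "NN_TORCH": {
            # Log scale spreads draws evenly across orders of magnitude
            "learning_rate": sample_log_uniform(rng, 1e-4, 1e-2),
            "activation": rng.choice(["relu", "softrelu", "tanh"]),
            "dropout_prob": rng.uniform(0.0, 0.5),
        },
        "GBM": {
            # randint is inclusive on both ends
            "num_leaves": rng.randint(26, 66),
        },
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
trials = [sample_config(rng) for _ in range(5)]  # mirrors num_trials = 5
for i, config in enumerate(trials, start=1):
    print(i, config)
```

Each trial would then be trained and scored on validation data, and the best-scoring configuration kept.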

### Train & Tune Model

In [8]:
predictor = TabularPredictor(label="occupation", eval_metric=metric).fit(
    train_data,
    time_limit=time_limit,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
)

Fitted model: NeuralNetTorch/d3eb103e ...
	0.355	 = Validation score   (accuracy)
	2.39s	 = Training   runtime
	0.02s	 = Validation runtime
Fitted model: NeuralNetTorch/d526198a ...
	0.32	 = Validation score   (accuracy)
	2.96s	 = Training   runtime
	0.02s	 = Validation runtime
Fitted model: NeuralNetTorch/d528a1fa ...
	0.36	 = Validation score   (accuracy)
	2.97s	 = Training   runtime
	0.02s	 = Validation runtime
Fitted model: NeuralNetTorch/d52be89c ...
	0.35	 = Validation score   (accuracy)
	2.73s	 = Training   runtime
	0.02s	 = Validation runtime
Fitted model: NeuralNetTorch/d53001ac ...
	0.35	 = Validation score   (accuracy)
	1.54s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 119.9s of the 100.99s of remaining time.
	0.405	 = Validation score   (accuracy)
	0.28s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 19.31s ... Best model: "WeightedEnsemble_L2"
TabularPredi

Use the following to view a summary of what happened during `fit()`. Because hyperparameter tuning was enabled, the summary also shows details of the tuning process for each type of model:

In [9]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                      model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       WeightedEnsemble_L2      0.405       0.052138  6.178108                0.000443           0.280164            2       True         11
1               LightGBM/T3      0.375       0.004916  0.403158                0.004916           0.403158            1       True          3
2               LightGBM/T5      0.375       0.006534  0.592385                0.006534           0.592385            1       True          5
3               LightGBM/T1      0.370       0.004292  0.633591                0.004292           0.633591            1       True          1
4               LightGBM/T4      0.360       0.009535  0.657517                0.009535           0.657517            1       True          4
5   NeuralNetTorch/d528a1fa      0.360       0.019321  2.972190                0.01932

{'model_types': {'LightGBM/T1': 'LGBModel',
  'LightGBM/T2': 'LGBModel',
  'LightGBM/T3': 'LGBModel',
  'LightGBM/T4': 'LGBModel',
  'LightGBM/T5': 'LGBModel',
  'NeuralNetTorch/d3eb103e': 'TabularNeuralNetTorchModel',
  'NeuralNetTorch/d526198a': 'TabularNeuralNetTorchModel',
  'NeuralNetTorch/d528a1fa': 'TabularNeuralNetTorchModel',
  'NeuralNetTorch/d52be89c': 'TabularNeuralNetTorchModel',
  'NeuralNetTorch/d53001ac': 'TabularNeuralNetTorchModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'LightGBM/T1': 0.37,
  'LightGBM/T2': 0.355,
  'LightGBM/T3': 0.375,
  'LightGBM/T4': 0.36,
  'LightGBM/T5': 0.375,
  'NeuralNetTorch/d3eb103e': 0.355,
  'NeuralNetTorch/d526198a': 0.32,
  'NeuralNetTorch/d528a1fa': 0.36,
  'NeuralNetTorch/d52be89c': 0.35,
  'NeuralNetTorch/d53001ac': 0.35,
  'WeightedEnsemble_L2': 0.405},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'LightGBM/T1': '/home/ec2-user/SageMaker/MLA-DAY1-Course/notebooks/demo/AutogluonModels/ag-

In the above example, the predictive performance may be poor because we are using few training data points and small ranges for hyperparameters to ensure quick run times. You can call `fit()` multiple times while modifying these settings to better understand how these choices affect performance outcomes. For example: you can increase `subsample_size` to train using a larger dataset, increase the `num_epochs` and `num_boost_round` hyperparameters, and increase the `time_limit` (which you should do for all code in these tutorials). To see more detailed output during the execution of `fit()`, you can also pass in the argument: `verbosity = 3`.

___
## <a name="7">Model ensembling with stacking/bagging</a>
(<a href="#0">Go to top</a>)

Beyond hyperparameter-tuning with a correctly-specified evaluation metric, there are two other methods to boost predictive performance:
- bagging and 
- stack-ensembling

You’ll often see performance improve if you specify `num_bag_folds = 5-10`, `num_stack_levels = 1-3` in the call to `fit()`. Beware that doing this will increase training times and memory/disk usage.
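
The idea behind k-fold bagging can be sketched in plain Python. This is a conceptual illustration with a stand-in "mean predictor" and made-up targets, not AutoGluon's implementation: the data is split into k folds, one model is trained per fold on the remaining rows, and every row receives an out-of-fold prediction from a model that never saw it.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean_model(targets):
    """Stand-in 'model': always predicts the mean of its training targets."""
    mean = sum(targets) / len(targets)
    return lambda: mean


# Toy regression targets (made up for illustration).
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 2.0]
folds = k_fold_indices(len(y), k=5)

oof_pred = [None] * len(y)  # out-of-fold predictions, one per row
fold_models = []
for held_out in folds:
    held_out_set = set(held_out)
    train_targets = [y[i] for i in range(len(y)) if i not in held_out_set]
    model = fit_mean_model(train_targets)
    fold_models.append(model)
    for i in held_out:
        oof_pred[i] = model()  # predicted by a model that never saw row i

# At inference time the bagged 'model' averages its k fold models.
bagged_prediction = sum(m() for m in fold_models) / len(fold_models)
print(oof_pred)
print(bagged_prediction)
```

In a stack ensemble, out-of-fold predictions such as `oof_pred` are what the next layer of models can train on, which is also why bagging with k folds costs roughly k times the training effort.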



In [10]:
predictor = TabularPredictor(label="occupation", eval_metric=metric).fit(
    train_data,
    num_bag_folds=5,
    num_bag_sets=1,
    num_stack_levels=1
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230207_190851/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230207_190851/"
AutoGluon Version:  0.6.2
Python Version:     3.9.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri Oct 14 01:16:24 UTC 2022
Train Data Rows:    1000
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Fraction of data from classes with at least 10 

You should not provide `tuning_data` when stacking/bagging, and instead provide all your available data as `train_data` (which AutoGluon will split in more intelligent ways). Parameter `num_bag_sets` controls how many times the K-fold bagging process is repeated to further reduce variance (increasing this may further boost accuracy but will substantially increase training times, inference latency, and memory/disk usage). Rather than searching for good bagging/stacking values yourself, you can specify `auto_stack` and AutoGluon will select them for you:

In [11]:
# Folder where to store trained models
save_path = "agModels-predictOccupation"

predictor = TabularPredictor(label="occupation", eval_metric=metric, path=save_path).fit(
    train_data,
    auto_stack=True,
    time_limit=30
)

Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "agModels-predictOccupation/"
AutoGluon Version:  0.6.2
Python Version:     3.9.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri Oct 14 01:16:24 UTC 2022
Train Data Rows:    1000
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Fraction of data from classe

Stacking/bagging will often produce higher accuracy than hyperparameter-tuning, but you may try combining both techniques (note: specifying `presets='best_quality'` in `fit()` simply sets `auto_stack=True`).

___
## <a name="8">Prediction options (inference)</a>
(<a href="#0">Go to top</a>)

Even if you’ve started a new Python session since last calling `fit()`, you can still load a previously trained predictor from disk:

In [12]:
# `predictor.path` is another way to get the relative path needed to load the predictor later.
predictor = TabularPredictor.load(save_path)

Above, `save_path` is the same folder previously passed to `TabularPredictor`, in which all the trained models have been saved. You can easily train models on one machine and deploy them on another. Simply copy the `save_path` folder to the new machine and specify its new path in `TabularPredictor.load()`.

We can make a prediction on an individual example rather than on a full dataset:

In [13]:
# Select one datapoint to make a prediction
datapoint = test_data.iloc[[0]] # Note: .iloc[0] won't work because it returns pandas Series instead of DataFrame

predictor.predict(datapoint)

0     Other-service
Name: occupation, dtype: object

To output predicted class probabilities instead of predicted classes, you can use:



In [14]:
# Returns a DataFrame that shows which probability corresponds to which class
predictor.predict_proba(datapoint)

Unnamed: 0,?,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
0,0.038548,0.234544,0.0,0.050518,0.06077,0.008894,0.073112,0.048278,0.298018,0.0,0.049102,0.004876,0.085449,0.021308,0.026581
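
As a quick sanity check, the predicted class should be the one with the highest probability. The sketch below copies the values from the output above into a plain dictionary and takes the argmax; it ignores ties, which is an assumption made for simplicity.

```python
# Class probabilities copied from the predict_proba output above.
proba = {
    "?": 0.038548,
    "Adm-clerical": 0.234544,
    "Armed-Forces": 0.0,
    "Craft-repair": 0.050518,
    "Exec-managerial": 0.06077,
    "Farming-fishing": 0.008894,
    "Handlers-cleaners": 0.073112,
    "Machine-op-inspct": 0.048278,
    "Other-service": 0.298018,
    "Priv-house-serv": 0.0,
    "Prof-specialty": 0.049102,
    "Protective-serv": 0.004876,
    "Sales": 0.085449,
    "Tech-support": 0.021308,
    "Transport-moving": 0.026581,
}

# The class with the largest probability
predicted_class = max(proba, key=proba.get)
# Probabilities over all classes should sum to (approximately) one
total = sum(proba.values())
print(predicted_class, round(total, 4))
```

The top class here has a probability below 0.3, so the prediction for this row is far from certain even though a single label is returned by `predict()`.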


By default, `predict()` and `predict_proba()` will utilize the model that AutoGluon thinks is most accurate, which is usually an ensemble of many individual models. Here’s how to see which model this corresponds to:

In [15]:
predictor.get_model_best()

'WeightedEnsemble_L2'

___
## <a name="10">Selecting individual models for predictions</a>
(<a href="#0">Go to top</a>)

We can specify a particular model to use for predictions (e.g., to reduce inference latency). Note that a ‘model’ in AutoGluon may refer to, for example, a single Neural Network, a bagged ensemble of many Neural Network copies trained on different training/validation splits, a weighted ensemble that aggregates the predictions of many other models, or a stacked model that operates on predictions output by other models. This is akin to viewing a RandomForest as one ‘model’ when it is in fact an ensemble of many decision trees.
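
To make the notion of a weighted ensemble concrete, here is a minimal sketch that averages the class probabilities of two base models using fixed weights. The model names, probabilities, and weights are all made up for illustration; a real ensemble would learn its weights from validation data, and this is not AutoGluon's implementation.

```python
def weighted_ensemble(model_probas, weights):
    """Combine per-model class probabilities with a weighted average."""
    total_weight = sum(weights.values())
    classes = next(iter(model_probas.values())).keys()
    return {
        c: sum(weights[m] * model_probas[m][c] for m in model_probas) / total_weight
        for c in classes
    }


# Hypothetical base-model outputs for one row (made-up numbers).
model_probas = {
    "model_a": {"Sales": 0.6, "Adm-clerical": 0.3, "Other-service": 0.1},
    "model_b": {"Sales": 0.2, "Adm-clerical": 0.5, "Other-service": 0.3},
}
# Hypothetical ensemble weights.
weights = {"model_a": 0.75, "model_b": 0.25}

combined = weighted_ensemble(model_probas, weights)
best = max(combined, key=combined.get)
print(combined, best)
```

Predicting with the ensemble requires running every base model it depends on, which is why a single base model can be noticeably faster at inference time.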


Here’s how to specify a particular model to use for prediction instead of AutoGluon’s default model-choice:

In [16]:
# index of model to use
i = 0
model_to_use = predictor.get_model_names()[i]
model_pred = predictor.predict(datapoint, model=model_to_use)
print(f"Prediction from {model_to_use} model: {model_pred.iloc[0]}")

Prediction from KNeighborsUnif_BAG_L1 model:  Adm-clerical


We can easily access information about the trained predictor or a particular model:

In [17]:
all_models = predictor.get_model_names()
model_to_use = all_models[i]
specific_model = predictor._trainer.load_model(model_to_use)

# Objects defined below are dicts with information (not printed here as they are quite large):
model_info = specific_model.get_info()
predictor_information = predictor.info()

Since the label column remains in the `test_data` DataFrame, we can evaluate the predictor on it directly:

In [18]:
predictor.evaluate(test_data)

Evaluation: accuracy on test data: 0.35438632408639575
Evaluations on test data:
{
    "accuracy": 0.35438632408639575,
    "balanced_accuracy": 0.2380254766456843,
    "mcc": 0.27835195719142447
}


{'accuracy': 0.35438632408639575,
 'balanced_accuracy': 0.2380254766456843,
 'mcc': 0.27835195719142447}

___
## <a name="11">Interpretability: Feature importance</a>
(<a href="#0">Go to top</a>)

To better understand our trained predictor, we can estimate the overall importance of each feature:

In [19]:
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 14 features using 5000 rows with 5 shuffle sets...
	58.83s	= Expected runtime (11.77s per shuffle set)
	49.9s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
workclass,0.07032,0.002331,1.445937e-07,5,0.075119,0.065521
sex,0.06248,0.005137,5.43572e-06,5,0.073058,0.051902
education-num,0.05004,0.006349,3.043899e-05,5,0.063112,0.036968
hours-per-week,0.02208,0.006774,0.0009419456,5,0.036029,0.008131
class,0.02004,0.003122,6.846919e-05,5,0.026469,0.013611
education,0.01792,0.003087,0.0001016721,5,0.024277,0.011563
age,0.00752,0.003466,0.004164539,5,0.014656,0.000384
relationship,0.00144,0.002609,0.1423639,5,0.006812,-0.003932
fnlwgt,0.00116,0.001519,0.08147191,5,0.004288,-0.001968
race,0.00048,0.001361,0.2372046,5,0.003282,-0.002322


Computed via permutation-shuffling, these feature importance scores quantify the drop in predictive performance (of the already trained predictor) when one column’s values are randomly shuffled across rows. The top features in this list contribute most to AutoGluon’s accuracy. Features with a non-positive importance score hardly contribute to the predictor’s accuracy, or may even be actively harmful to include in the data (consider removing these features from your data and calling `fit` again). These scores facilitate interpretability of the predictor’s global behavior (which features it relies on for all predictions) rather than local explanations that only rationalize one particular prediction.
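
The permutation-shuffling idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (a toy dataset, a hand-written stand-in for a trained model, and accuracy as the metric), not AutoGluon's implementation; all names below are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, col, n_shuffles=5, seed=0):
    """Mean drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_shuffles):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)
        shuffled_rows = [
            r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled_col)
        ]
        drops.append(baseline - accuracy(model, shuffled_rows, labels))
    return sum(drops) / len(drops)


# Toy data: column 0 determines the label, column 1 is pure noise.
data_rng = random.Random(42)
rows = [(data_rng.randint(0, 1), data_rng.random()) for _ in range(200)]
labels = [r[0] for r in rows]


def model(row):
    """Stand-in for an already trained predictor: it only looks at column 0."""
    return row[0]


imp_signal = permutation_importance(model, rows, labels, col=0)
imp_noise = permutation_importance(model, rows, labels, col=1)
print(f"column 0: {imp_signal:.3f}, column 1: {imp_noise:.3f}")
```

Shuffling the informative column costs a large share of the accuracy, while shuffling the ignored column costs nothing. Note that the model is never retrained; only its inputs are perturbed.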


___
## <a name="12">Inference Speed: Model distillation</a>
(<a href="#0">Go to top</a>)

While computationally-favorable, single individual models will usually have lower accuracy than weighted/stacked/bagged ensembles. Model Distillation offers one way to retain the computational benefits of a single model, while enjoying some of the accuracy-boost that comes with ensembling. The idea is to train the individual model (which we can call the student) to mimic the predictions of the full stack ensemble (the teacher). Like `refit_full()`, the `distill()` function will produce additional models we can opt to use for prediction.
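
One way to quantify how well a student mimics its teacher is a cross-entropy between the teacher's predicted probabilities (soft labels) and the student's. The sketch below uses made-up probabilities for a single row and a hypothetical helper `soft_log_loss`; it only illustrates the idea and is not AutoGluon's distillation code.

```python
import math


def soft_log_loss(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy of student probabilities against teacher soft labels."""
    return -sum(
        t * math.log(max(s, eps)) for t, s in zip(teacher_probs, student_probs)
    )


# Made-up probabilities over three classes for a single row.
teacher = [0.7, 0.2, 0.1]            # soft labels from the ensemble (teacher)
close_student = [0.65, 0.25, 0.10]   # student that mimics the teacher well
far_student = [0.10, 0.10, 0.80]     # student that disagrees with the teacher

loss_close = soft_log_loss(teacher, close_student)
loss_far = soft_log_loss(teacher, far_student)
loss_exact = soft_log_loss(teacher, teacher)  # lowest achievable value
print(f"exact: {loss_exact:.3f}, close: {loss_close:.3f}, far: {loss_far:.3f}")
```

Unlike hard labels, the soft labels also tell the student how confident the teacher was and which other classes it considered plausible.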

### Training student models

In [20]:
# Specify much longer time limit in real applications
student_models = predictor.distill(time_limit=30)
student_models

Distilling with teacher='WeightedEnsemble_L2', teacher_preds=soft, augment_method=spunge ...
SPUNGE: Augmenting training data with 3980 synthetic samples for distillation...
Distilling with each of these student models: ['LightGBM_DSTL', 'NeuralNetMXNet_DSTL', 'RandomForestMSE_DSTL', 'CatBoost_DSTL', 'NeuralNetTorch_DSTL']
Fitting 5 L1 models ...
Fitting model: LightGBM_DSTL ... Training model for up to 30.0s of the 30.0s of remaining time.


[1000]	valid_set's soft_log_loss: -1.67598
[2000]	valid_set's soft_log_loss: -1.67213


	Ran out of time, early stopping on iteration 2666. Best iteration is:
	[2112]	valid_set's soft_log_loss: -1.67157
	Note: model has different eval_metric than default.
	-1.6716	 = Validation score   (-soft_log_loss)
	33.73s	 = Training   runtime
	0.4s	 = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Distilling with each of these student models: ['WeightedEnsemble_L2_DSTL']
Fitting model: WeightedEnsemble_L2_DSTL ... Training model for up to 30.0s of the -7.34s of remaining time.
	Note: model has different eval_metric than default.
	-1.6716	 = Validation score   (-soft_log_loss)
	0.0s	 = Training   runtime
	0.0s	 = Validation runtime
Distilled model leaderboard:
                      model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0             LightGBM_DSTL      0.425       0.397912  33.728100                0.397912          33.728100            1       True          8
1  WeightedEnsemble_L2_DSTL  

['LightGBM_DSTL', 'WeightedEnsemble_L2_DSTL']

In [21]:
preds_student = predictor.predict(test_data, model=student_models[0])
print(f"predictions from {student_models[0]}: {list(preds_student)[:5]}")

predictions from LightGBM_DSTL: [' Adm-clerical', ' Farming-fishing', ' Sales', ' Sales', ' Handlers-cleaners']


### Excluding models

Finally, you may also exclude specific unwieldy models from being trained at all. Below we exclude models that tend to be slower (K Nearest Neighbors, Neural Network, models with custom larger-than-default hyperparameters):

In [22]:
excluded_model_types = ["KNN", "NN", "custom"]
predictor_light = TabularPredictor(label="occupation", eval_metric=metric).fit(
    train_data, excluded_model_types=excluded_model_types, time_limit=30
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230207_193310/"
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "AutogluonModels/ag-20230207_193310/"
AutoGluon Version:  0.6.2
Python Version:     3.9.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri Oct 14 01:16:24 UTC 2022
Train Data Rows:    1000
Train Data Columns: 14
Label Column: occupation
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	First 10 (of 15) unique label values:  [' Exec-managerial', ' Other-service', ' Craft-repair', ' Sales', ' Prof-specialty', ' Protective-serv', ' ?', ' Adm-clerical', ' Machine-op-inspct', ' Tech-support']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Fraction of data from classes 

___
## <a name="13">Before You Go</a>
(<a href="#0">Go to top</a>)

After you are done with this demo, clean up the model artifacts by executing the cell below.

__It is always good practice to clean everything when you are done, preventing the disk from getting full.__

In [23]:
!rm -r AutogluonModels
!rm -r agModels-predictOccupation
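
If you prefer to clean up from Python (for example, so that a missing folder does not raise an error), a small helper along these lines may work. The function name `remove_dirs` is hypothetical, and the demonstration deliberately runs on a throwaway temporary directory rather than on the real model folders.

```python
import shutil
import tempfile
from pathlib import Path


def remove_dirs(paths):
    """Delete each directory tree if it exists; return the ones removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


# Demonstrate on a scratch directory so nothing real is deleted.
scratch = Path(tempfile.mkdtemp())
(scratch / "AutogluonModels").mkdir()
removed = remove_dirs([scratch / "AutogluonModels", scratch / "does-not-exist"])
still_there = (scratch / "AutogluonModels").exists()
shutil.rmtree(scratch)
print(removed)
```

In the notebook, the equivalent call would be `remove_dirs(["AutogluonModels", "agModels-predictOccupation"])`.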


# Thank you!