- Title: AutoML on Tabular Data Using AutoGluon
- Slug: python-automl-autogluon
- Date: 2020-02-04
- Category: Ai
- Tags: Ai, Python, data science, machine learning, AutoML, AutoGluon
- Author: Ben Du
- Modified: 2020-02-04


In [1]:
import autogluon as ag
from autogluon import TabularPrediction as task
import pandas as pd
from sklearn.model_selection import train_test_split

Note: download the dataset `legendu/avg_score_after_round3_features` from Kaggle
before running the cells below.

In [3]:
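# Read the Parquet dataset directory downloaded from Kaggle.
df = pd.read_parquet("avg_score_after_round3_features/")
# Drop the first 4 columns, then hold out 40% of the rows for testing.
train, test = train_test_split(df.iloc[:, 4:], test_size=0.4, random_state=119)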

In [4]:
train.shape

(553608, 35)

In [5]:
test.shape

(369072, 35)

In [6]:
train_data = task.Dataset(df=train)
test_data = task.Dataset(df=test)

In [None]:
model = task.fit(
    train_data=train_data,
    output_directory="auto_gluon",
    label="avg_score_after_round3"
)

Beginning AutoGluon training ...
AutoGluon will save models to auto_gluon/
Train Data Rows:    553608
Train Data Columns: 35
Preprocessing data ...
Here are the first 10 unique label values in your data:  [ 5.41576689  0.59051321  3.5933087   0.22835347  2.58887605  0.03626061
 15.30469829  4.38147371  4.57907499  9.85850969]
AutoGluon infers your prediction problem is: regression  (because dtype of label-column == float and many unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Feature Generator processed 553608 data points with 34 features
Original Features:
	int features: 11
	float features: 23
Generated Features:
	int features: 0
All Features:
	int features: 11
	float features: 23
	Data preprocessing and feature engineering runtime = 1.72s ...
AutoGluon will gauge predictive performance using evaluation metric: root_mean_squared_error
To change thi
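
The log above says that models are scored with `root_mean_squared_error`. As a reminder of what that metric measures, here is a minimal pure-Python sketch. The function name `rmse` and the sample numbers are made up for illustration; this is not AutoGluon's own implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only (not taken from the dataset).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones, and the result is in the same units as the label.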

In [13]:
model.fit_summary()

*** Summary of fit() ***
Number of models trained: 9
Types of models trained: 
{'LGBModel', 'CatboostModel', 'KNNModel', 'WeightedEnsembleModel', 'TabularNeuralNetModel', 'RFModel'}
Validation performance of individual models: {'RandomForestRegressorMSE': -1.0657452978721853, 'ExtraTreesRegressorMSE': -0.9910711261197889, 'KNeighborsRegressorUnif': -1.7777866717734974, 'KNeighborsRegressorDist': -1.5954465432398592, 'LightGBMRegressor': -1.06333423182554, 'CatboostRegressor': -1.0040615551029053, 'NeuralNetRegressor': -0.9160444564621223, 'LightGBMRegressorCustom': -1.038906098211013, 'weighted_ensemble_k0_l1': -0.8812506880621533}
Best model (based on validation performance): weighted_ensemble_k0_l1
Hyperparameter-tuning used: False
Bagging used: False 
Stack-ensembling used: False 
User-specified hyperparameters:
{'NN': {'num_epochs': 500}, 'GBM': {'num_boost_round': 10000}, 'CAT': {'iterations': 10000}, 'RF': {'n_estimators': 300}, 'XT': {'n_estimators': 300}, 'KNN': {}, 'custom': [


{'model_types': {'RandomForestRegressorMSE': 'RFModel',
  'ExtraTreesRegressorMSE': 'RFModel',
  'KNeighborsRegressorUnif': 'KNNModel',
  'KNeighborsRegressorDist': 'KNNModel',
  'LightGBMRegressor': 'LGBModel',
  'CatboostRegressor': 'CatboostModel',
  'NeuralNetRegressor': 'TabularNeuralNetModel',
  'LightGBMRegressorCustom': 'LGBModel',
  'weighted_ensemble_k0_l1': 'WeightedEnsembleModel'},
 'model_performance': {'RandomForestRegressorMSE': -1.0657452978721853,
  'ExtraTreesRegressorMSE': -0.9910711261197889,
  'KNeighborsRegressorUnif': -1.7777866717734974,
  'KNeighborsRegressorDist': -1.5954465432398592,
  'LightGBMRegressor': -1.06333423182554,
  'CatboostRegressor': -1.0040615551029053,
  'NeuralNetRegressor': -0.9160444564621223,
  'LightGBMRegressorCustom': -1.038906098211013,
  'weighted_ensemble_k0_l1': -0.8812506880621533},
 'model_best': 'weighted_ensemble_k0_l1',
 'model_paths': {'RandomForestRegressorMSE': 'auto_gluon/models/RandomForestRegressorMSE/',
  'ExtraTreesRegr

In [15]:
model.leaderboard()

                      model  score_val     fit_time  pred_time_val  stack_level
8   weighted_ensemble_k0_l1  -0.881251     0.795549       0.001379            1
6        NeuralNetRegressor  -0.916044  4635.101431       4.679487            0
1    ExtraTreesRegressorMSE  -0.991071   100.983680       0.738732            0
5         CatboostRegressor  -1.004062   799.261748       0.049409            0
7   LightGBMRegressorCustom  -1.038906    21.870562       0.037082            0
4         LightGBMRegressor  -1.063334    13.707395       0.027841            0
0  RandomForestRegressorMSE  -1.065745   152.927213       0.554473            0
3   KNeighborsRegressorDist  -1.595447    19.212829       0.134977            0
2   KNeighborsRegressorUnif  -1.777787    19.484992       0.154668            0
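
AutoGluon reports scores so that higher is better, which is why `score_val` is negative here: it is the negated RMSE. The sketch below copies the leaderboard rows shown above into plain Python, flips the sign, and prints each model's gap to the best RMSE next to its fit time. The variable names are illustrative, and the same thing could presumably be done on the object returned by `model.leaderboard()`.

```python
# Rows copied from the leaderboard above: (model, score_val, fit_time in seconds).
leaderboard = [
    ("weighted_ensemble_k0_l1", -0.881251, 0.795549),
    ("NeuralNetRegressor", -0.916044, 4635.101431),
    ("ExtraTreesRegressorMSE", -0.991071, 100.983680),
    ("CatboostRegressor", -1.004062, 799.261748),
    ("LightGBMRegressorCustom", -1.038906, 21.870562),
    ("LightGBMRegressor", -1.063334, 13.707395),
    ("RandomForestRegressorMSE", -1.065745, 152.927213),
    ("KNeighborsRegressorDist", -1.595447, 19.212829),
    ("KNeighborsRegressorUnif", -1.777787, 19.484992),
]

# score_val is the negated RMSE, so flip the sign to get the error itself.
rows = [(name, -score, fit_time) for name, score, fit_time in leaderboard]
rows.sort(key=lambda row: row[1])  # lowest RMSE first

best_rmse = rows[0][1]
for name, error, fit_time in rows:
    gap_pct = 100 * (error - best_rmse) / best_rmse
    print(f"{name:26s} RMSE={error:.4f}  +{gap_pct:5.1f}% vs best  fit_time={fit_time:8.1f}s")
```

Read this way, the table shows a trade-off: the neural network is the best single model but takes by far the longest to fit, while the two LightGBM models fit in well under a minute.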


In [16]:
dir(model)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_createResults',
 '_format_results',
 '_learner',
 '_save_model',
 '_save_results',
 '_summarize',
 '_trainer',
 'class_labels',
 'eval_metric',
 'evaluate',
 'evaluate_predictions',
 'feature_types',
 'fit_summary',
 'label_column',
 'leaderboard',
 'load',
 'model_names',
 'model_performance',
 'output_directory',
 'predict',
 'predict_proba',
 'problem_type',
 'save']

## Load Saved Models

The trained models are automatically saved to disk
and can be loaded back into memory by passing the output directory to `task.load`.

In [10]:
model2 = task.load("auto_gluon")

In [11]:
model2.leaderboard()

                      model  score_val     fit_time  pred_time_val  stack_level
8   weighted_ensemble_k0_l1  -0.881251     0.795549       0.001379            1
6        NeuralNetRegressor  -0.916044  4635.101431       4.679487            0
1    ExtraTreesRegressorMSE  -0.991071   100.983680       0.738732            0
5         CatboostRegressor  -1.004062   799.261748       0.049409            0
7   LightGBMRegressorCustom  -1.038906    21.870562       0.037082            0
4         LightGBMRegressor  -1.063334    13.707395       0.027841            0
0  RandomForestRegressorMSE  -1.065745   152.927213       0.554473            0
3   KNeighborsRegressorDist  -1.595447    19.212829       0.134977            0
2   KNeighborsRegressorUnif  -1.777787    19.484992       0.154668            0


## Further Research

ExtraTreesRegressorMSE and RandomForestRegressorMSE produce surprisingly large models on disk:
20G and 13G of the 32G total, while every other model is well under 1G.
The cause still needs to be investigated.
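
For readers who prefer to stay in Python, or who have no Unix shell, the sketch below is a rough portable equivalent of the `du` command used in the next cell. The helper names `dir_size_bytes`, `human` and `report` are made up for this example. It sums apparent file sizes, so the numbers can differ slightly from `du`, which reports disk blocks.

```python
import os


def dir_size_bytes(path):
    """Total apparent size of all regular files under path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks to avoid double counting
                total += os.path.getsize(file_path)
    return total


def human(n_bytes):
    """Format a byte count with binary units, similar in spirit to `du -h`."""
    for unit in ("B", "K", "M", "G", "T"):
        if n_bytes < 1024 or unit == "T":
            return f"{n_bytes:.1f}{unit}"
        n_bytes /= 1024


def report(models_dir):
    """Print the size of each immediate subdirectory, largest first."""
    sizes = {
        entry.name: dir_size_bytes(entry.path)
        for entry in os.scandir(models_dir)
        if entry.is_dir()
    }
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{human(size):>8}  {name}")
    return sizes


# Only runs if the AutoGluon output directory from this notebook exists.
if os.path.isdir("auto_gluon/models"):
    report("auto_gluon/models")
```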

In [20]:
!du -lhd 1 auto_gluon/models

1.3M	auto_gluon/models/LightGBMRegressor
100K	auto_gluon/models/weighted_ensemble_k0_l1
20G	auto_gluon/models/ExtraTreesRegressorMSE
311M	auto_gluon/models/KNeighborsRegressorDist
311M	auto_gluon/models/KNeighborsRegressorUnif
13G	auto_gluon/models/RandomForestRegressorMSE
4.3M	auto_gluon/models/LightGBMRegressorCustom
3.9M	auto_gluon/models/NeuralNetRegressor
1.8M	auto_gluon/models/CatboostRegressor
32G	auto_gluon/models
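
One plausible explanation, not verified here, is that the forests are made of fully grown trees. This assumes that the RF and XT models wrap scikit-learn forests, whose trees are unpruned by default, so a tree fit on continuous labels can end up with roughly one leaf per distinct training row. The back-of-the-envelope sketch below estimates the resulting size. The 64 bytes per node is an assumption rather than a measured value, and the row count ignores any validation rows AutoGluon holds out, so treat the result as an order-of-magnitude check only.

```python
import math

TRAIN_ROWS = 553_608   # from train.shape above
N_TREES = 300          # 'n_estimators': 300 in the fit summary above
BYTES_PER_NODE = 64    # ASSUMPTION: rough storage cost of one tree node


def forest_size_gib(rows_per_tree, n_trees=N_TREES, bytes_per_node=BYTES_PER_NODE):
    """Rough size in GiB of a forest of fully grown binary trees.

    Assumes one leaf per distinct training row, so a tree with
    `rows_per_tree` leaves has 2 * rows_per_tree - 1 nodes.
    """
    nodes_per_tree = 2 * rows_per_tree - 1
    return n_trees * nodes_per_tree * bytes_per_node / 1024**3


# Extra-trees: assumed to train every tree on all rows (no bootstrap).
extra_trees_gib = forest_size_gib(TRAIN_ROWS)

# Random forest: assumed to bootstrap, which leaves about
# 1 - 1/e (roughly 63%) distinct rows per tree.
distinct_rows = round(TRAIN_ROWS * (1 - math.exp(-1)))
random_forest_gib = forest_size_gib(distinct_rows)

print(f"ExtraTrees   estimate: {extra_trees_gib:.1f} GiB")
print(f"RandomForest estimate: {random_forest_gib:.1f} GiB")
```

Under these assumptions the estimates come out at roughly 20 GiB and 12 to 13 GiB, the same order of magnitude as the sizes `du` reports. If this is indeed the cause, limiting tree growth (for example through a maximum depth or a minimum leaf size) would be the natural thing to try next.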
