AutoGluon is an auto-ml package, developed by J Mueller, X Shi, A Smola:

Mueller, Jonas, Xingjian Shi, and Alexander Smola. "Faster, Simpler, More Accurate: Practical Automated Machine Learning with Tabular, Text, and Image Data." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

For tabular data, AutoGluon can produce models to predict the values in one column based on the values in the other columns. With just a single call to fit(), you can achieve high accuracy in standard supervised learning tasks (both classification and regression).

In the economy of a competition it can help you to create benchmarks, get insights on models' workings and accelerate your experimentations.

In [1]:
!pip install autogluon 

Collecting autogluon
  Downloading autogluon-0.3.1-py3-none-any.whl (9.9 kB)
Collecting autogluon.extra==0.3.1
  Downloading autogluon.extra-0.3.1-py3-none-any.whl (28 kB)
Collecting autogluon.vision==0.3.1
  Downloading autogluon.vision-0.3.1-py3-none-any.whl (38 kB)
Collecting autogluon.text==0.3.1
  Downloading autogluon.text-0.3.1-py3-none-any.whl (52 kB)
[K     |████████████████████████████████| 52 kB 764 kB/s 
[?25hCollecting autogluon.features==0.3.1
  Downloading autogluon.features-0.3.1-py3-none-any.whl (56 kB)
[K     |████████████████████████████████| 56 kB 1.7 MB/s 
[?25hCollecting autogluon.mxnet==0.3.1
  Downloading autogluon.mxnet-0.3.1-py3-none-any.whl (33 kB)
Collecting autogluon.core==0.3.1
  Downloading autogluon.core-0.3.1-py3-none-any.whl (352 kB)
[K     |████████████████████████████████| 352 kB 1.3 MB/s 
[?25hCollecting autogluon.tabular[all]==0.3.1
  Downloading autogluon.tabular-0.3.1-py3-none-any.whl (273 kB)
[K     |███████████████████

In [2]:
!pip install scikit-learn -U

Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
[K     |████████████████████████████████| 22.3 MB 17 kB/s 
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.23.2
    Uninstalling scikit-learn-0.23.2:
      Successfully uninstalled scikit-learn-0.23.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fancyimpute 0.5.5 requires tensorflow, which is not installed.
pyldavis 3.3.1 requires numpy>=1.20.0, but you have numpy 1.19.5 which is incompatible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.2 which is incompatible.
hypertools 0.7.0 requires scikit-learn!=0.22,<0.24,>=0.19.1, but you have scikit-learn 0.24.2 which is incompatible.
allennlp 2.5.0 requires torch<1.9.0,>=1.6.0, but you have 

In [3]:
# Importing core libraries
import numpy as np
import pandas as pd
import gc
from scipy.stats import skew

# Importing AutoGluon
from autogluon.tabular import TabularDataset, TabularPredictor

# Scikit Learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [4]:
# Derived from the original script https://www.kaggle.com/gemartin/load-data-reduce-memory-usage 
# by Guillaume Martin

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [5]:
# Loading data 
X_train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv").set_index('id')
X_test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv").set_index('id')

y_train = X_train['claim']
X_train = X_train.drop('claim', axis='columns')

In [6]:
def get_stats_per_row(data):
    data['mv_row'] = data.isna().sum(axis=1)
    data['min_row'] = data.min(axis=1)
    data['std_row'] = data.std(axis=1)
    return data

def impute_skewed_features(data):
    skewed_feat = data.skew()
    skewed_feat = [*skewed_feat[abs(skewed_feat.values) > 1].index]

    for feat in skewed_feat:
        median = data[feat].median()
        data[feat] = data[feat].fillna(median)
        
    return data

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

X_train = pd.DataFrame(pipeline.fit_transform(impute_skewed_features(get_stats_per_row(X_train))),
                       columns=X_train.columns,
                       index=X_train.index)
X_test = pd.DataFrame(pipeline.transform(impute_skewed_features(get_stats_per_row(X_test))),
                      columns=X_test.columns,
                      index=X_test.index)

In [7]:
# Adding t-SNE and UMAP projections
prj_train = pd.read_csv("../input/really-not-missing-at-random/train.csv")
prj_test = pd.read_csv("../input/really-not-missing-at-random/test.csv")

projections = ['t_sne_0', 't_sne_1', 't_umap_0', 't_umap_1']
X_train[projections] = prj_train[projections]
X_test[projections] = prj_test[projections]

In [8]:
X_train['claim'] = y_train

In [9]:
### REDUCE MEMORY USAGE
X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)
gc.collect()

Mem. usage decreased to 464.99 Mb (49.9% reduction)
Mem. usage decreased to 246.60 Mb (48.0% reduction)


0

In [10]:
VALIDATION = False
if VALIDATION is True:
    X_train, X_val = train_test_split(X_train, test_size=int(len(X_train) * 0.2), random_state=42)
    train_data = TabularDataset(X_train)
    val_data = TabularDataset(X_val)
else:
    train_data = TabularDataset(X_train)
    val_data = TabularDataset(X_train.iloc[:100_000, :])

SUBSAMPLE = False
if SUBSAMPLE is True:
    subsample_size = 10_000  # subsample subset of data for faster demo, try setting this to much larger values
    train_data = train_data.sample(n=subsample_size, random_state=0)
    
train_data.head()

Unnamed: 0_level_0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f117,f118,mv_row,min_row,std_row,t_sne_0,t_sne_1,t_umap_0,t_umap_1,claim
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.425521,-2.357891,-0.637206,-0.866657,-0.109034,-4.832443,-1.173641,-0.603397,-0.596871,-0.516828,...,-1.219964,1.134424,-0.44442,0.200841,-0.774747,92.911469,47.219639,-1341.255127,-3143.275391,1
1,0.247576,-0.323982,1.223569,0.361863,1.073953,-0.363575,0.079829,-0.74659,0.899454,0.469668,...,-0.6702443,-0.676798,-0.937798,0.200841,-0.458618,175.811325,-53.179821,783321.625,-117230.25,0
2,2.032347,-2.43568,-0.48896,0.341193,1.072427,0.116178,0.534916,-0.044075,-0.763516,1.056879,...,-0.03865338,-0.372563,1.529094,0.128938,-0.935267,27.879652,77.636978,-62.38726,-14.770458,1
3,1.438349,-2.337605,-0.508914,-0.829607,1.488535,3.590249,-1.191501,-0.339152,-0.735281,-0.529158,...,0.2970431,-0.1062,0.048959,0.050703,-0.911519,-66.903397,-83.568672,-1466.017578,1198.011353,1
4,0.602308,1.076218,-0.648438,0.463365,0.277665,-0.16039,0.725214,-0.905498,0.052478,-0.511066,...,2.905263e-16,-0.807808,3.009229,0.186324,-0.779948,-17.888206,-82.21907,17.318653,2.251678,1


In [11]:
label = 'claim'
print("Summary of target variable: \n", train_data[label].describe())

Summary of target variable: 
 count    957919.000000
mean          0.498492
std           0.499998
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           1.000000
Name: claim, dtype: float64


In [12]:
!mkdir agModels

In [13]:
lgbm1_params = {
    'metric' : 'auc',
    'max_depth' : 3,
    'num_leaves' : 7,
    'n_estimators' : 5000,
    'colsample_bytree' : 0.3,
    'subsample' : 0.5,
    'reg_alpha' : 18,
    'reg_lambda' : 17,
    'learning_rate' : 0.095,
    'device' : 'gpu',
    'objective' : 'binary'
}

lgbm2_params = {
    'metric' : 'auc',
    'objective': 'binary',
    'n_estimators': 10000,
    'learning_rate': 0.095,
    'subsample': 0.6,
    'subsample_freq': 1,
    'colsample_bytree': 0.4,
    'reg_alpha': 10.0,
    'reg_lambda': 1e-1,
    'min_child_weight': 256,
    'min_child_samples': 20,
    'device' : 'gpu',
    'max_depth' : 3,
    'num_leaves' : 7
}

lgbm3_params = {
    'metric' : 'auc',
    'objective' : 'binary',
    'device_type': 'gpu', 
    'n_estimators': 10000, 
    'learning_rate': 0.12230165751633416, 
    'num_leaves': 1400, 
    'max_depth': 8, 
    'min_child_samples': 3100, 
    'reg_alpha': 10, 
    'reg_lambda': 65, 
    'min_split_gain': 5.157818977461183, 
    'subsample': 0.5, 
    'subsample_freq': 1, 
    'colsample_bytree': 0.2
}

catb1_params = {
    'eval_metric' : 'AUC',
    'iterations': 15585, 
    'objective': 'CrossEntropy',
    'bootstrap_type': 'Bernoulli', 
    'od_wait': 1144, 
    'learning_rate': 0.023575206684596582, 
    'reg_lambda': 36.30433203563295, 
    'random_strength': 43.75597655616195, 
    'depth': 7, 
    'min_data_in_leaf': 11, 
    'leaf_estimation_iterations': 1, 
    'subsample': 0.8227911142845009,
    'task_type' : 'GPU',
    'devices' : '0',
    'verbose' : 0
}

catb2_params = {
    'eval_metric' : 'AUC',
    'depth' : 5,
    'grow_policy' : 'SymmetricTree',
    'l2_leaf_reg' : 3.0,
    'random_strength' : 1.0,
    'learning_rate' : 0.1,
    'iterations' : 10000,
    'loss_function' : 'CrossEntropy',
    'task_type' : 'GPU',
    'devices' : '0',
    'verbose' : 0
}

xgb1_params = {
    'eval_metric' : 'auc',
    'lambda': 0.004562711234493688, 
    'alpha': 7.268146704546314, 
    'colsample_bytree': 0.6468987558386358, 
    'colsample_bynode': 0.29113878257290376, 
    'colsample_bylevel': 0.8915913499148167, 
    'subsample': 0.37130229826185135, 
    'learning_rate': 0.021671163563123198, 
    'grow_policy': 'lossguide', 
    'max_depth': 18, 
    'min_child_weight': 215, 
    'max_bin': 272,
    'n_estimators': 10000,
    'random_state': 0,
    'use_label_encoder': False,
    'objective': 'binary:logistic',
    'tree_method': 'gpu_hist',
    'gpu_id': 0,
    'predictor': 'gpu_predictor'
}

xgb2_params = dict(
    eval_metric='auc',
    max_depth=3,
    subsample=0.5,
    colsample_bytree=0.5,
    learning_rate=0.01187431306013263,
    n_estimators=10000,
    n_jobs=-1,
    use_label_encoder=False,
    objective='binary:logistic',
    tree_method='gpu_hist',
    gpu_id=0,
    predictor='gpu_predictor'
)

In [14]:
save_path = 'agModels'  # specifies folder to store trained models
presets='best_quality'
metric = 'roc_auc'
hours = 5.0
predictor = (TabularPredictor(label=label, eval_metric=metric, path=save_path)
             .fit(train_data,
                  excluded_model_types = ['KNN', 'XT' ,'RF', 'NN', 'FASTAI'],
                  hyperparameters = {'GBM': lgbm1_params, 
                                     'CAT': catb1_params,
                                     'XGB': xgb1_params
                                    },
                  presets=presets,
                  time_limit= int(60 * 60 * hours))
            )



[1000]	train_set's auc: 0.823275	train_set's binary_logloss: 0.503282	valid_set's auc: 0.812654	valid_set's binary_logloss: 0.509796
[2000]	train_set's auc: 0.829856	train_set's binary_logloss: 0.49866	valid_set's auc: 0.813526	valid_set's binary_logloss: 0.509372
[3000]	train_set's auc: 0.835348	train_set's binary_logloss: 0.494632	valid_set's auc: 0.813718	valid_set's binary_logloss: 0.509345




[1000]	train_set's auc: 0.823154	train_set's binary_logloss: 0.503347	valid_set's auc: 0.81294	valid_set's binary_logloss: 0.509856
[2000]	train_set's auc: 0.829756	train_set's binary_logloss: 0.498737	valid_set's auc: 0.813819	valid_set's binary_logloss: 0.509265
[3000]	train_set's auc: 0.835276	train_set's binary_logloss: 0.494745	valid_set's auc: 0.814138	valid_set's binary_logloss: 0.50906




[1000]	train_set's auc: 0.822995	train_set's binary_logloss: 0.503551	valid_set's auc: 0.814278	valid_set's binary_logloss: 0.507837
[2000]	train_set's auc: 0.829529	train_set's binary_logloss: 0.498951	valid_set's auc: 0.815382	valid_set's binary_logloss: 0.50713
[3000]	train_set's auc: 0.835108	train_set's binary_logloss: 0.494935	valid_set's auc: 0.815828	valid_set's binary_logloss: 0.506938




[1000]	train_set's auc: 0.822862	train_set's binary_logloss: 0.503771	valid_set's auc: 0.816362	valid_set's binary_logloss: 0.505576
[2000]	train_set's auc: 0.829459	train_set's binary_logloss: 0.499158	valid_set's auc: 0.817272	valid_set's binary_logloss: 0.505004
[3000]	train_set's auc: 0.835045	train_set's binary_logloss: 0.495139	valid_set's auc: 0.8173	valid_set's binary_logloss: 0.505032




[1000]	train_set's auc: 0.823019	train_set's binary_logloss: 0.503435	valid_set's auc: 0.81446	valid_set's binary_logloss: 0.508645
[2000]	train_set's auc: 0.829668	train_set's binary_logloss: 0.498805	valid_set's auc: 0.81552	valid_set's binary_logloss: 0.508078




[1000]	train_set's auc: 0.823018	train_set's binary_logloss: 0.503542	valid_set's auc: 0.813994	valid_set's binary_logloss: 0.507823
[2000]	train_set's auc: 0.829624	train_set's binary_logloss: 0.498917	valid_set's auc: 0.815382	valid_set's binary_logloss: 0.507016
[3000]	train_set's auc: 0.835151	train_set's binary_logloss: 0.494898	valid_set's auc: 0.815606	valid_set's binary_logloss: 0.506898




[1000]	train_set's auc: 0.822905	train_set's binary_logloss: 0.50359	valid_set's auc: 0.81473	valid_set's binary_logloss: 0.507498
[2000]	train_set's auc: 0.829506	train_set's binary_logloss: 0.498967	valid_set's auc: 0.816061	valid_set's binary_logloss: 0.506662
[3000]	train_set's auc: 0.835085	train_set's binary_logloss: 0.494948	valid_set's auc: 0.816308	valid_set's binary_logloss: 0.506551




[1000]	train_set's auc: 0.822878	train_set's binary_logloss: 0.503586	valid_set's auc: 0.815305	valid_set's binary_logloss: 0.507747
[2000]	train_set's auc: 0.829512	train_set's binary_logloss: 0.498949	valid_set's auc: 0.816748	valid_set's binary_logloss: 0.506909
[3000]	train_set's auc: 0.835035	train_set's binary_logloss: 0.494955	valid_set's auc: 0.81696	valid_set's binary_logloss: 0.506824




[1000]	train_set's auc: 0.822987	train_set's binary_logloss: 0.503533	valid_set's auc: 0.814668	valid_set's binary_logloss: 0.508109
[2000]	train_set's auc: 0.829619	train_set's binary_logloss: 0.498903	valid_set's auc: 0.815555	valid_set's binary_logloss: 0.507491
[3000]	train_set's auc: 0.835194	train_set's binary_logloss: 0.494868	valid_set's auc: 0.815678	valid_set's binary_logloss: 0.507378




[1000]	train_set's auc: 0.82286	train_set's binary_logloss: 0.503608	valid_set's auc: 0.815624	valid_set's binary_logloss: 0.507564
[2000]	train_set's auc: 0.829451	train_set's binary_logloss: 0.498972	valid_set's auc: 0.816545	valid_set's binary_logloss: 0.50691
[3000]	train_set's auc: 0.83497	train_set's binary_logloss: 0.494998	valid_set's auc: 0.816728	valid_set's binary_logloss: 0.506792




In [15]:
results = predictor.fit_summary(show_plot=True)

*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val  pred_time_val      fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2   0.815874     132.899522   7941.810154                0.360934         112.146400            2       True          4
1      LightGBM_BAG_L1   0.815837     115.511845   6958.331681              115.511845        6958.331681            1       True          1
2  WeightedEnsemble_L3   0.815292     282.798018  15083.651847                0.346916         110.916133            3       True          8
3      LightGBM_BAG_L2   0.815266     187.799331  12572.767612               55.260743        4743.103859            2       True          5
4       XGBoost_BAG_L2   0.815224     227.190359  10229.631854               94.651771        2399.968101            2       True          7
5       XGBoost_BAG_L1   0.813024      15.951144    585.739703               15.951144      

In [16]:
leaderboard = predictor.leaderboard(val_data)

                 model  score_test  score_val  pred_time_test  pred_time_val      fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       XGBoost_BAG_L1    0.845674   0.813024       19.146997      15.951144    585.739703                19.146997               15.951144         585.739703            1       True          3
1       XGBoost_BAG_L2    0.843570   0.815224      255.985978     227.190359  10229.631854               106.221442               94.651771        2399.968101            2       True          7
2  WeightedEnsemble_L3    0.840043   0.815292      311.138590     282.798018  15083.651847                 0.004601                0.346916         110.916133            3       True          8
3      LightGBM_BAG_L2    0.837402   0.815266      204.912546     187.799331  12572.767612                55.148011               55.260743        4743.103859            2       True          5
4  WeightedEnsemble_L2    0.83

In [17]:
test_data = TabularDataset(X_test)
test_preds = predictor.predict_proba(test_data)

In [18]:
# Predicting and submission
submission = pd.DataFrame({'id':X_test.index, 
                           'claim': test_preds.iloc[:,1].ravel()})

submission.to_csv("submission.csv", index=False)

In [19]:
submission

Unnamed: 0,id,claim
0,957919,0.566013
1,957920,0.217300
2,957921,0.616263
3,957922,0.220584
4,957923,0.244773
...,...,...
493469,1451388,0.790373
493470,1451389,0.204020
493471,1451390,0.731978
493472,1451391,0.236312
