## AutoGluon Sample

In [1]:
# ! pip install --upgrade "mxnet<2.0.0"
# ! pip install autogluon.tabular
# ! pip install autogluon.core

In [2]:
import autogluon.core as ag
from autogluon.tabular import TabularPrediction as task

In [3]:
train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample the data for a faster demo; try much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
print(train_data.head())

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country   class  
6118              0             0              40   United-State

In [5]:
label_column = 'class'
print("Summary of class variable: \n", train_data[label_column].describe())

Summary of class variable: 
 count        500
unique         2
top        <=50K
freq         365
Name: class, dtype: object
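The summary above reports that the most frequent class, ` <=50K`, covers 365 of the 500 sampled rows. As a rough reference point for the accuracy figures later in the notebook, the following sketch computes the accuracy of a trivial model that always predicts the majority class. The variable names are illustrative and not part of AutoGluon.

```python
# Counts copied from the describe() output above.
n_rows = 500      # 'count'
n_majority = 365  # 'freq' of the top class ' <=50K'

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = n_majority / n_rows
minority_share = 1 - majority_baseline

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"minority-class share:             {minority_share:.3f}")
```

Any trained model should be judged against this baseline rather than against 50%, since the classes are imbalanced.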


In [6]:
dir = 'agModels-predictClass'  # folder where trained models are stored (note: this name shadows Python's built-in dir())
predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)

Beginning AutoGluon training ...
AutoGluon will save models to agModels-predictClass/
AutoGluon Version:  0.0.15b20201017
Train Data Rows:    500
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  <=50K, class 0 =  >50K
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Note: NumExpr detected 36 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
	Available Memory:                    59441.4 MB
	Train Data (Original)  Memory Usage: 0.3 MB (0.0% of available memory)
	Inferring data type of each feature based on column values
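The log above notes that NumExpr detected 36 cores but capped itself at 8 threads because `NUMEXPR_MAX_THREADS` was not set. A minimal sketch of setting that variable explicitly is shown here. The value `8` is only an illustration, and the assignment is assumed to run before NumExpr is first imported (for example, at the top of the notebook).

```python
import os

# Assumption: "8" is an illustrative value. setdefault() keeps any value that
# is already present in the environment. This needs to run before NumExpr
# is first imported, because the limit is read at import time.
os.environ.setdefault("NUMEXPR_MAX_THREADS", "8")

print("NUMEXPR_MAX_THREADS =", os.environ["NUMEXPR_MAX_THREADS"])
```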

In [7]:
test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label_column]  # values to predict
test_data_nolab = test_data.drop(labels=[label_column], axis=1)  # delete label column to prove we're not cheating
print(test_data_nolab.head())

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


   age          workclass  fnlwgt      education  education-num  \
0   31            Private  169085           11th              7   
1   17   Self-emp-not-inc  226203           12th              8   
2   47            Private   54260      Assoc-voc             11   
3   21            Private  176262   Some-college             10   
4   17            Private  241185           12th              8   

        marital-status        occupation relationship    race      sex  \
0   Married-civ-spouse             Sales         Wife   White   Female   
1        Never-married             Sales    Own-child   White     Male   
2   Married-civ-spouse   Exec-managerial      Husband   White     Male   
3        Never-married   Exec-managerial    Own-child   White   Female   
4        Never-married    Prof-specialty    Own-child   White     Male   

   capital-gain  capital-loss  hours-per-week  native-country  
0             0             0              20   United-States  
1             0         

In [8]:
predictor = task.load(dir)  # optional: demonstrates how to load a previously trained predictor from disk

y_pred = predictor.predict(test_data_nolab)
print("Predictions:  ", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.8367284266557478


Predictions:   [' <=50K' ' <=50K' ' <=50K' ... ' <=50K' ' <=50K' ' <=50K']


Evaluations on test data:
{
    "accuracy": 0.8367284266557478,
    "accuracy_score": 0.8367284266557478,
    "balanced_accuracy_score": 0.7332244231481168,
    "matthews_corrcoef": 0.5159814299085932,
    "f1_score": 0.8367284266557478
}
Detailed (per-class) classification report:
{
    " <=50K": {
        "precision": 0.8657257057207095,
        "recall": 0.9302107099718159,
        "f1-score": 0.8968105065666041,
        "support": 7451
    },
    " >50K": {
        "precision": 0.7050482132728304,
        "recall": 0.5362381363244176,
        "f1-score": 0.6091644204851752,
        "support": 2318
    },
    "accuracy": 0.8367284266557478,
    "macro avg": {
        "precision": 0.78538695949677,
        "recall": 0.7332244231481168,
        "f1-score": 0.7529874635258896,
        "support": 9769
    },
    "weighted avg": {
        "precision": 0.8275999582036471,
        "recall": 0.8367284266557478,
        "f1-score": 0.8285574993461361,
        "support": 9769
    }
}
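The aggregate rows of the report above can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted averages of precision and recall using only the numbers printed in the report; the dictionary layout and variable names are illustrative. It also shows that `balanced_accuracy_score` matches the macro-averaged recall. Note that the summary's `f1_score` equals the accuracy, which is what a micro-averaged F1 would give for single-label classification; that reading is an inference from the numbers, not something the output states.

```python
# Per-class figures copied from the classification report above.
report = {
    " <=50K": {"precision": 0.8657257057207095, "recall": 0.9302107099718159, "support": 7451},
    " >50K": {"precision": 0.7050482132728304, "recall": 0.5362381363244176, "support": 2318},
}

total = sum(c["support"] for c in report.values())

# Macro average: unweighted mean over the classes.
macro_precision = sum(c["precision"] for c in report.values()) / len(report)
macro_recall = sum(c["recall"] for c in report.values()) / len(report)

# Weighted average: mean over the classes, weighted by class support.
weighted_precision = sum(c["precision"] * c["support"] for c in report.values()) / total
weighted_recall = sum(c["recall"] * c["support"] for c in report.values()) / total

# Accuracy of always predicting the larger class on the test set.
majority_baseline = max(c["support"] for c in report.values()) / total

print(f"macro precision     : {macro_precision:.6f}")
print(f"macro recall        : {macro_recall:.6f}  (same as balanced accuracy)")
print(f"weighted precision  : {weighted_precision:.6f}")
print(f"weighted recall     : {weighted_recall:.6f}  (same as accuracy)")
print(f"majority baseline   : {majority_baseline:.6f}")
```

The weighted recall equals the overall accuracy because each class's recall times its support is the number of correctly predicted rows in that class.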


In [9]:
predictor.leaderboard(test_data, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CatboostClassifier | 0.844815 | 0.91 | 0.022211 | 0.013326 | 1.274192 | 0.022211 | 0.013326 | 1.274192 | 0 | True | 9 |
| 1 | LightGBMClassifierXT | 0.84154 | 0.91 | 0.030065 | 0.011482 | 0.141894 | 0.030065 | 0.011482 | 0.141894 | 0 | True | 8 |
| 2 | weighted_ensemble_k0_l1 | 0.836728 | 0.93 | 0.155992 | 0.123657 | 0.822825 | 0.002759 | 0.000857 | 0.255284 | 1 | True | 12 |
| 3 | LightGBMClassifier | 0.833657 | 0.87 | 0.025897 | 0.015923 | 0.201689 | 0.025897 | 0.015923 | 0.201689 | 0 | True | 7 |
| 4 | RandomForestClassifierGini | 0.832531 | 0.88 | 0.11969 | 0.113348 | 0.53511 | 0.11969 | 0.113348 | 0.53511 | 0 | True | 1 |
| 5 | RandomForestClassifierEntr | 0.829051 | 0.88 | 0.218787 | 0.111537 | 0.529194 | 0.218787 | 0.111537 | 0.529194 | 0 | True | 2 |
| 6 | ExtraTreesClassifierEntr | 0.820145 | 0.87 | 0.120492 | 0.111588 | 0.423593 | 0.120492 | 0.111588 | 0.423593 | 0 | True | 4 |
| 7 | LightGBMClassifierCustom | 0.819224 | 0.82 | 0.040693 | 0.011746 | 0.600136 | 0.040693 | 0.011746 | 0.600136 | 0 | True | 11 |
| 8 | ExtraTreesClassifierGini | 0.819224 | 0.87 | 0.123168 | 0.111317 | 0.425647 | 0.123168 | 0.111317 | 0.425647 | 0 | True | 3 |
| 9 | NeuralNetClassifier | 0.794964 | 0.87 | 1.094387 | 0.045435 | 7.307641 | 1.094387 | 0.045435 | 7.307641 | 0 | True | 10 |
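One way to read the leaderboard above is to compare each model's validation score with its test score. The sketch below does this in plain Python using the `score_test` and `score_val` values copied from the leaderboard; the tuple layout and variable names are illustrative. The gaps likely reflect a small, noisy validation split given the 500-row training sample, though that is an interpretation rather than something the output states.

```python
# (model, score_test, score_val) copied from the leaderboard above.
rows = [
    ("CatboostClassifier", 0.844815, 0.91),
    ("LightGBMClassifierXT", 0.84154, 0.91),
    ("weighted_ensemble_k0_l1", 0.836728, 0.93),
    ("LightGBMClassifier", 0.833657, 0.87),
    ("RandomForestClassifierGini", 0.832531, 0.88),
    ("RandomForestClassifierEntr", 0.829051, 0.88),
    ("ExtraTreesClassifierEntr", 0.820145, 0.87),
    ("LightGBMClassifierCustom", 0.819224, 0.82),
    ("ExtraTreesClassifierGini", 0.819224, 0.87),
    ("NeuralNetClassifier", 0.794964, 0.87),
]

# Best model according to each score.
best_on_test = max(rows, key=lambda r: r[1])
best_on_val = max(rows, key=lambda r: r[2])

# Validation-minus-test gap per model, largest first.
gaps = sorted(((val - test, model) for model, test, val in rows), reverse=True)

print("best on test      :", best_on_test[0])
print("best on validation:", best_on_val[0])
for gap, model in gaps:
    print(f"{model:28s} val - test = {gap:+.4f}")
```

The model ranked first on validation is not the one ranked first on test here, so the validation ranking should be treated as approximate for a sample this small.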
