# Using AutoGluon for AutoML




- AutoGluon is an open-source AutoML package developed by AWS.
- It automates machine learning workflows, with an emphasis on deep learning.
- Project details: [https://autogluon.mxnet.io/](https://autogluon.mxnet.io/)
- For object detection it uses YOLO v3 with transfer learning.
- It automates tasks such as image classification and object detection.
- It works with text and tabular data as well.
- It uses the MXNet framework for deep learning.
- It can also run Neural Architecture Search.

# Install AutoGluon

- Run the cells below in the order shown.
- Do not reorder them, because later installs depend on the earlier steps.

In [0]:
! nvidia-smi

In [0]:
pip uninstall -y mkl

In [0]:
pip install --upgrade mxnet

In [0]:
pip install autogluon

In [0]:
pip install -U ipykernel

- Restart the Colab runtime, then execute the remaining cells.
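After the restart, it can help to confirm that the new kernel actually sees the installed packages before importing them. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of AutoGluon.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two packages installed in the cells above.
for name in ("mxnet", "autogluon"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

If either package is reported as missing, rerunning the install cells in order and restarting again is the first thing to try.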

# Time to use it

In [0]:
import autogluon as ag

In [0]:
import pandas as pd
import numpy as np
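import os
import urllib.request  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded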

In [0]:
from autogluon import TabularPrediction as task

In [0]:
BASE_DIR = '/tmp'
OUTPUT_FILE = os.path.join(BASE_DIR, 'churn_data.csv')

In [0]:
churn_data = urllib.request.urlretrieve('https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv', OUTPUT_FILE)

In [0]:
churn_master_df = pd.read_csv(OUTPUT_FILE)

In [0]:
size = int(0.8 * len(churn_master_df))
train_df = churn_master_df[:size]
test_df = churn_master_df[size:]
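The slice above takes the first 80% of rows in file order, which is only safe if the file is not sorted in any meaningful way. A shuffled split that keeps the `Churn` proportions similar in both parts can be sketched as follows. This is an illustration under stated assumptions: the helper `stratified_split_indices` and the toy labels are hypothetical, and only the standard library is used.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, train_frac=0.8, seed=0):
    """Split row positions into train/test, keeping label proportions similar."""
    rng = random.Random(seed)

    # Group row positions by their label value.
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    # Shuffle each group and cut it at the same fraction.
    train_idx, test_idx = [], []
    for positions in by_label.values():
        rng.shuffle(positions)
        cut = int(train_frac * len(positions))
        train_idx.extend(positions[:cut])
        test_idx.extend(positions[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels standing in for the Churn column (illustrative values only).
toy_labels = ["No"] * 70 + ["Yes"] * 30
train_idx, test_idx = stratified_split_indices(toy_labels)
print(len(train_idx), len(test_idx))
```

With the real data, the labels would be `churn_master_df['Churn']`, and the returned positions can be applied with `churn_master_df.iloc[train_idx]` and `churn_master_df.iloc[test_idx]`.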

In [10]:
test_df.shape

(1409, 21)

In [11]:
train_df.shape

(5634, 21)

In [12]:
train_df.head(3)

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |


# Use the dataset object to train and predict

- `task.Dataset` can take a pandas DataFrame, a CSV file, or a dict.
- The call below creates an AutoGluon dataset from each DataFrame.

In [0]:
train_data = task.Dataset(df=train_df)
test_data = task.Dataset(df=test_df)

In [0]:
train_data.head(3)

In [0]:
train_data.describe()

In [27]:
pred_column = 'Churn'
train_data[pred_column].describe()

count     5634
unique       2
top         No
freq      4146
Name: Churn, dtype: object
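The `describe()` output shows that 4146 of the 5634 training labels are `No`, so the classes are imbalanced. A useful reference point for any accuracy reported later is the score of a model that always predicts the majority class. The small helper below is a hypothetical illustration of that arithmetic, using only the two counts shown above.

```python
def majority_baseline_accuracy(majority_count: int, total_count: int) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return majority_count / total_count


# Counts taken from the describe() output: freq = 4146, count = 5634.
baseline = majority_baseline_accuracy(4146, 5634)
print(f"{baseline:.4f}")  # prints 0.7359
```

This baseline is computed on the training labels; the test split may have a slightly different class balance.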

In [20]:
predictor = task.fit(train_data=train_data, label=pred_column, eval_metric='accuracy')

No output_directory specified. Models will be saved in: AutogluonModels/ag-20200325_104102/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200325_104102/
Train Data Rows:    5634
Train Data Columns: 21
Preprocessing data ...
Here are the first 10 unique label values in your data:  ['No' 'Yes']
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = Yes, class 0 = No
Feature Generator processed 5634 data points with 20 features
Original Features:
	object features: 17
	int features: 2
	float features: 1
Generated Features:
	int features: 0
All Features:
	object features: 17
	int features: 2
	float features: 1
	Data preprocessing and feature engineering runtime = 0.19s ...
AutoGluon will gauge predictive perfo

In [0]:
y_test = test_data[pred_column]
test_data.drop(labels=[pred_column], axis=1, inplace=True)

In [29]:
test_data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5634 | 2320-JRSDE | Female | 0 | Yes | Yes | 1 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Yes | Electronic check | 19.9 | 19.9 |
| 5635 | 2087-QAREY | Female | 0 | Yes | No | 22 | Yes | No | DSL | No | Yes | Yes | No | No | No | Month-to-month | Yes | Mailed check | 54.7 | 1178.75 |
| 5636 | 0601-WZHJF | Male | 0 | Yes | No | 14 | No | No phone service | DSL | No | No | No | No | Yes | Yes | Month-to-month | No | Electronic check | 46.35 | 667.7 |
| 5637 | 4423-JWZJN | Male | 0 | Yes | Yes | 64 | Yes | Yes | Fiber optic | No | No | Yes | No | No | Yes | One year | No | Credit card (automatic) | 90.25 | 5629.15 |
| 5638 | 5143-WMWOG | Male | 0 | No | No | 1 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | No | Electronic check | 19.95 | 19.95 |


In [0]:
y_pred = predictor.predict(test_data)
performance = predictor.evaluate_predictions(y_test, y_pred, auxiliary_metrics=True)
print("Evaluation: ", performance)
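To make the reported metrics easier to interpret, the sketch below computes accuracy, precision, and recall by hand from true and predicted labels. It is an illustration only: `binary_metrics` is a hypothetical helper, the toy labels are invented, and the choice of `Yes` as the positive class is an assumption based on the class mapping printed in the training log.

```python
def binary_metrics(y_true, y_pred, positive="Yes"):
    """Compute accuracy, precision, and recall for a binary label column."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented labels, only to exercise the function.
toy_true = ["No", "No", "Yes", "Yes", "No", "Yes"]
toy_pred = ["No", "Yes", "Yes", "No", "No", "Yes"]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

On an imbalanced label such as `Churn`, precision and recall for the `Yes` class usually say more about the model than accuracy alone.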

In [34]:
print(predictor.problem_type)
print(predictor.feature_types)

binary
{'nlp': [], 'vectorizers': [], 'object': ['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges'], 'int': ['SeniorCitizen', 'tenure'], 'float': ['MonthlyCharges']}


In [0]:
# customerID should have been dropped: AutoGluon does not perform feature selection here, so the identifier column is treated as a feature.
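Since `customerID` shows up under the `object` features, identifier columns have to be removed by hand before training. One simple heuristic is to flag string columns whose values never repeat. The sketch below is hypothetical (the helper `identifier_like_columns` is not an AutoGluon function) and works on plain dictionaries so that it stays self-contained; the sample rows reuse a few values from the `train_df.head(3)` output.

```python
def identifier_like_columns(rows, label):
    """Flag string columns whose values never repeat (likely row identifiers)."""
    if not rows:
        return []
    flagged = []
    for column in rows[0]:
        if column == label:
            continue
        values = [row[column] for row in rows]
        # Only consider string columns; numeric columns can be unique by chance.
        all_strings = all(isinstance(value, str) for value in values)
        if all_strings and len(set(values)) == len(values):
            flagged.append(column)
    return flagged


# A few columns from the first rows of train_df shown earlier.
toy_rows = [
    {"customerID": "7590-VHVEG", "gender": "Female", "tenure": 1, "Churn": "No"},
    {"customerID": "5575-GNVDE", "gender": "Male", "tenure": 34, "Churn": "No"},
    {"customerID": "3668-QPYBK", "gender": "Male", "tenure": 2, "Churn": "Yes"},
]
print(identifier_like_columns(toy_rows, label="Churn"))  # prints ['customerID']
```

With the real DataFrame, the flagged columns could be removed with `train_df.drop(columns=[...])` before building the `task.Dataset`. The heuristic is only reliable on a reasonably large sample, because short samples can make ordinary columns look unique.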

In [36]:
predictor.predict_proba(test_data)

array([0.27933532, 0.28120423, 0.39816636, ..., 0.35828002, 0.66690673,
       0.11514709])
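The array above holds one probability per test row. Assuming these are probabilities of the positive class (`class 1 = Yes` in the training log), they can be turned into labels with a decision threshold. The helper below is a hypothetical sketch; the six values are the ones visible in the truncated output.

```python
def probabilities_to_labels(probabilities, threshold=0.5, positive="Yes", negative="No"):
    """Map positive-class probabilities to labels using a decision threshold."""
    return [positive if p >= threshold else negative for p in probabilities]


# The six probabilities visible in the output above.
shown = [0.27933532, 0.28120423, 0.39816636, 0.35828002, 0.66690673, 0.11514709]

default_labels = probabilities_to_labels(shown)
print(default_labels)  # prints ['No', 'No', 'No', 'No', 'Yes', 'No']

# A lower threshold flags more customers as likely churners.
eager_labels = probabilities_to_labels(shown, threshold=0.35)
print(eager_labels)  # prints ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
```

Lowering the threshold trades precision for recall, which can be worthwhile when missing a churner costs more than contacting a customer who would have stayed.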

In [37]:
predictor.leaderboard()

                         model  score_val   fit_time  pred_time_val  stack_level
10     weighted_ensemble_k0_l1   0.799645   0.548092       0.001605            1
7           CatboostClassifier   0.799645   3.282355       0.032724            0
6           LightGBMClassifier   0.789007   0.791672       0.022463            0
0   RandomForestClassifierGini   0.781915   1.535514       0.118640            0
1   RandomForestClassifierEntr   0.778369   2.026696       0.124512            0
8          NeuralNetClassifier   0.778369  19.149632       0.255416            0
9     LightGBMClassifierCustom   0.778369   1.117550       0.026408            0
4     KNeighborsClassifierUnif   0.773050   0.049556       0.123809            0
2     ExtraTreesClassifierGini   0.767730   1.340321       0.120793            0
3     ExtraTreesClassifierEntr   0.767730   1.425996       0.119651            0
5     KNeighborsClassifierDist   0.741135   0.019241       0.113218            0
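The top row of the leaderboard is a weighted ensemble at stack level 1, built on top of the level-0 models. As a rough illustration of the general idea of weighted averaging (this is not AutoGluon's actual implementation, and the weights and probabilities below are invented), per-model probabilities can be blended like this:

```python
def weighted_average_proba(model_probas, weights):
    """Blend per-model positive-class probabilities using normalized weights."""
    total_weight = sum(weights.values())
    n_rows = len(next(iter(model_probas.values())))
    combined = [0.0] * n_rows
    for name, probas in model_probas.items():
        share = weights.get(name, 0.0) / total_weight
        for i, p in enumerate(probas):
            combined[i] += share * p
    return combined


# Invented probabilities for three rows from two base models.
toy_probas = {
    "CatboostClassifier": [0.2, 0.7, 0.6],
    "LightGBMClassifier": [0.4, 0.5, 0.8],
}
# Invented weights; a real ensemble would learn these on validation data.
toy_weights = {"CatboostClassifier": 3.0, "LightGBMClassifier": 1.0}

blended = weighted_average_proba(toy_probas, toy_weights)
print([round(p, 2) for p in blended])
```

In the leaderboard above the ensemble's `score_val` equals that of `CatboostClassifier`, which would be consistent with most of the weight going to that single model, although the output shown does not confirm the weights.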




In [0]:
predictor.fit_summary()

# Customizing hyperparameter tuning

In [0]:
hp_tune = True

rf_options = {
    'n_estimators' : 100,
}

gbm_options = {
    'num_boost_rounds' : 100,
    'num_leaves' : ag.space.Int(lower=6, upper=20, default=8)
}

hyperparameters = {'RF' : rf_options, 'GBM' : gbm_options}

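time_limits = 2 * 60  # total training budget in seconds (2 minutes)
num_trials = 1  # hyperparameter configurations to try; raise this if time allows
search_strategy = "skopt"  # Bayesian optimization via scikit-optimize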
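The `num_leaves` entry defines a range of integers rather than a single value, and each tuning trial evaluates one configuration drawn from that range. The sketch below illustrates the idea with plain random search, which is a simpler technique than the Bayesian optimization selected by `"skopt"`. Everything here is a stand-in: `IntRange` and `propose_configs` are hypothetical, the inclusive upper bound and the choice to try the default first are assumptions of this sketch, and none of it reflects how `ag.space.Int` is implemented.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    """Stand-in for an integer search space (both bounds inclusive here)."""
    lower: int
    upper: int
    default: int

    def sample(self, rng: random.Random) -> int:
        return rng.randint(self.lower, self.upper)


def propose_configs(space, fixed, num_trials, seed=0):
    """Return one configuration per trial: the default first, then random draws."""
    rng = random.Random(seed)
    configs = []
    for trial in range(num_trials):
        num_leaves = space.default if trial == 0 else space.sample(rng)
        configs.append({**fixed, "num_leaves": num_leaves})
    return configs


configs = propose_configs(
    IntRange(lower=6, upper=20, default=8),
    fixed={"num_boost_rounds": 100},
    num_trials=4,
)
for config in configs:
    print(config)
```

Under this sketch, a single trial would only ever evaluate the default value, which is one reason a larger `num_trials` is worth the extra time when the budget allows.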


In [0]:
train_data = task.Dataset(df=train_df)
test_data = task.Dataset(df=test_df)

In [42]:
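# Caution: passing test_data as tuning_data lets the test split drive model selection,
# so the validation scores below are not an unbiased hold-out estimate.
predictor = task.fit(train_data, tuning_data=test_data, label=pred_column,
                     time_limits=time_limits, num_trials=num_trials,
                     hyperparameter_tune=hp_tune, hyperparameters=hyperparameters,
                     search_strategy=search_strategy, nthreads_per_trial=1, ngpus_per_trial=1)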

No output_directory specified. Models will be saved in: AutogluonModels/ag-20200325_110458/
Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to AutogluonModels/ag-20200325_110458/
Train Data Rows:    5634
Train Data Columns: 21
Tuning Data Rows:    1409
Tuning Data Columns: 21
Preprocessing data ...
Here are the first 10 unique label values in your data:  ['No' 'Yes']
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = Yes, class 0 = No
Feature Generator processed 7043 data points with 20 features
Original Features:
	object features: 17
	int features: 2
	float features: 1
Generated Features:
	int features: 0
All Features:
	object features: 17
	int features: 2
	float features: 1
	Data preprocessing and feature e
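The log reports 1409 tuning rows and 7043 processed data points, which means the test split was used for tuning. To keep the test split untouched, a tuning split can be carved out of the training rows instead. The helper below is a hypothetical sketch using only the standard library; the 20% fraction is an arbitrary choice for illustration.

```python
import random


def carve_tuning_split(n_rows, tune_frac=0.2, seed=0):
    """Shuffle row positions and reserve a fraction of them for tuning."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_tune = int(tune_frac * n_rows)
    tune_idx = sorted(positions[:n_tune])
    fit_idx = sorted(positions[n_tune:])
    return fit_idx, tune_idx


# 5634 is the number of training rows reported in the log.
fit_idx, tune_idx = carve_tuning_split(5634)
print(len(fit_idx), len(tune_idx))  # prints 4508 1126
```

The two index lists could then be applied as `train_df.iloc[fit_idx]` and `train_df.iloc[tune_idx]`, wrapped in `task.Dataset`, and passed as `train_data` and `tuning_data`, leaving `test_df` for the final evaluation only.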

In [0]:
predictor.fit_summary()

In [44]:
predictor.leaderboard()

                        model  score_val  fit_time  pred_time_val  stack_level
3     weighted_ensemble_k0_l1   0.813343  0.285369       0.001193            1
2          LightGBMClassifier   0.804826  1.044784       0.034340            0
0  RandomForestClassifierGini   0.801278  0.567397       0.117556            0
1  RandomForestClassifierEntr   0.794890  0.754569       0.118744            0


