# About

This notebook shows how to run automated exploratory data analysis (EDA) with pandas_profiling and AutoML with AutoGluon.

Although these packages can produce a well-performing model with very few lines of code, they generate their results in a "black box" fashion. It is still critical to learn EDA and machine learning algorithms and pipelines, as we did in this tutorial, to better understand the results, try other or better tuning methods, and develop custom models that these packages do not cover.

Therefore, a good workflow is to use these packages for a quick overview of the data and models, then drill down into the parts that are of special interest or show the greatest potential for better performance.

In this example, the best-performing model found by AutoGluon is XGBoost, which is not part of the scikit-learn package, with an accuracy of 0.8324. Our best manually tuned decision tree reaches an accuracy of 0.8258, which would rank #3, behind the ExtraTrees classifier (accuracy 0.8268) that we did not try.
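
The ranking claim can be checked with a few lines of plain Python. This is only a sketch over the three accuracies quoted above, not the full AutoGluon leaderboard; the dictionary names and the `rank_models` helper are hypothetical and introduced purely for illustration.

```python
# Accuracies quoted in the text; the model labels are shorthand.
autogluon_scores = {"XGBoost": 0.8324, "ExtraTrees": 0.8268}
manual_scores = {"DecisionTree (manually tuned)": 0.8258}


def rank_models(*score_tables):
    """Merge score tables and return (rank, name, accuracy) tuples, best first."""
    merged = {}
    for table in score_tables:
        merged.update(table)
    # Higher accuracy is better, so sort in descending order.
    ordered = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, acc) for rank, (name, acc) in enumerate(ordered, start=1)]


ranking = rank_models(autogluon_scores, manual_scores)
for rank, name, acc in ranking:
    print(f"#{rank} {name}: {acc:.4f}")
```

The gaps are small (less than 0.01 between first and third place), so a different random split could plausibly reorder these models.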


In [1]:
# import packages
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from autogluon.tabular import TabularDataset, TabularPredictor

In [2]:
# read csv data into pandas dataframe
df = pd.read_csv('titanic.csv')

In [3]:
# generate pandas profiling report
profile = ProfileReport(df, title="Titanic Pandas Profiling Report")
#profile = ProfileReport(df, title="Titanic Pandas Profiling Report", minimal=True)  # this option turns off many expensive calculations for large datasets

In [4]:
# show report in notebook
profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 52/52 [00:02<00:00, 18.55it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.23s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]


In [5]:
# this step is optional - I keep it here to be consistent with the tutorial
# dropping unimportant features, such as passenger id, name, ticket number and cabin number
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [6]:
# Split the data into a training set and a test set for autogluon
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
train_data

index,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,0,1,male,45.5,0,0,28.5000,S
733,0,2,male,23.0,0,0,13.0000,S
382,0,3,male,32.0,0,0,7.9250,S
704,0,3,male,26.0,1,0,7.8542,S
813,0,3,female,6.0,4,2,31.2750,S
...,...,...,...,...,...,...,...,...
106,1,3,female,21.0,0,0,7.6500,S
270,0,1,male,,0,0,31.0000,S
860,0,3,male,41.0,2,0,14.1083,S
435,1,1,female,14.0,1,2,120.0000,S
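
The row counts produced by the split can be worked out by hand. The sketch below assumes the CSV holds 891 rows (the size of the commonly distributed Titanic training file, which this notebook does not state) and that, as I understand scikit-learn's behaviour, a fractional `test_size` is rounded up for the test set while the training set receives the remainder. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    The test count is rounded up and the training set gets the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 891 rows is an assumption about titanic.csv, not a value taken from the notebook.
n_train, n_test = split_sizes(891, 0.2)
print(n_train, n_test)
```

Under these assumptions the training set has 712 rows, which agrees with the `Train Data Rows: 712` line that AutoGluon prints during fitting.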


In [7]:
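from autogluon.tabular import TabularDataset, TabularPredictor  # already imported above; repeated so this cell runs on its own
# train AutoGluon with default settings to predict the 'Survived' column
predictor = TabularPredictor(label='Survived').fit(train_data)
# score every trained model on the held-out test set
predictor.leaderboard(test_data, silent=True)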

No path specified. Models will be saved in: "AutogluonModels/ag-20220306_151219/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220306_151219/"
AutoGluon Version:  0.3.2b20220304
Python Version:     3.9.7
Operating System:   Darwin
Train Data Rows:    712
Train Data Columns: 7
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    2669.28 MB
	Train Data (Original)  Memory Usage: 0.11 MB (0.0% of available memory)
	Inferring data type of e

In [8]:
test_data

index,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
709,1,3,male,,1,1,15.2458,C
439,0,2,male,31.0,0,0,10.5000,S
840,0,3,male,20.0,0,0,7.9250,S
720,1,2,female,6.0,0,1,33.0000,S
39,1,3,female,14.0,1,0,11.2417,C
...,...,...,...,...,...,...,...,...
433,0,3,male,17.0,0,0,7.1250,S
773,0,3,male,,0,0,7.2250,C
25,1,3,female,38.0,1,5,31.3875,S
84,1,2,female,17.0,0,0,10.5000,S


In [9]:
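# build a one-row dataframe describing a made-up passenger (same feature columns as the training data)
passenger1 = pd.DataFrame(
    {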
        'Pclass': [3],
        'Sex': ['male'], 
        'Age': [23],
        'SibSp': [0],
        'Parch': [0],
        'Fare': [5.5],
        'Embarked': ['C'],
    }
)

In [10]:
# predict survival for a single passenger
predictor.predict(passenger1)  # uses the best model by default

0    0
Name: Survived, dtype: int64

In [11]:
# predict with a specific model instead of the best one
predictor.predict(passenger1, model='RandomForestEntr')

0    0
Name: Survived, dtype: int64

In [12]:
# try a multiclass classification problem with a different metric
time_limit = 60  # for quick demonstration only; set this to the longest time you are willing to wait (in seconds)
label = 'Embarked'
metric = 'log_loss'  # evaluation metric; log loss is computed from predicted class probabilities, lower is better
presets = 'best_quality'  # lets AutoGluon automatically build powerful model ensembles based on stacking/bagging
predictor = TabularPredictor(label=label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets=presets)
predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20220306_151224/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20220306_151224/"
AutoGluon Version:  0.3.2b20220304
Python Version:     3.9.7
Operating System:   Darwin
Train Data Rows:    712
Train Data Columns: 7
Label Column: Embarked
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	3 unique label values:  ['S', 'C', 'Q']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    2930.93 MB
	Train Data (Original)  Memory Usage: 0.08 MB (0.0% of available memo

index,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestEntr_BAG_L1,-0.553501,-0.50619,0.040934,0.045619,0.262088,0.040934,0.045619,0.262088,1,True,4
1,WeightedEnsemble_L2,-0.573909,-0.438607,0.078602,0.099204,0.743418,0.001286,0.001891,0.223457,2,True,8
2,RandomForestGini_BAG_L1,-0.617655,-0.490167,0.03328,0.046448,0.254864,0.03328,0.046448,0.254864,1,True,3
3,NeuralNetTorch_BAG_L1,-0.679332,-0.620793,0.197261,0.128002,48.810168,0.197261,0.128002,48.810168,1,True,7
4,ExtraTreesEntr_BAG_L1,-0.760222,-0.61305,0.045835,0.046264,0.240349,0.045835,0.046264,0.240349,1,True,6
5,ExtraTreesGini_BAG_L1,-0.765335,-0.603309,0.043582,0.047592,0.262632,0.043582,0.047592,0.262632,1,True,5
6,KNeighborsUnif_BAG_L1,-1.921021,-1.222711,0.003976,0.010264,0.003205,0.003976,0.010264,0.003205,1,True,1
7,KNeighborsDist_BAG_L1,-1.982404,-1.210855,0.003102,0.005246,0.003009,0.003102,0.005246,0.003009,1,True,2
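
The `score_test` and `score_val` values above are negative because AutoGluon reports metrics in a "higher is better" form, so a lower-is-better metric such as log loss appears with its sign flipped. The sketch below computes multiclass log loss by hand to make that concrete. The probabilities are made up for illustration, and `multiclass_log_loss` is a hypothetical helper rather than part of AutoGluon.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability that the model assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip the probability to avoid log(0) for over-confident wrong predictions.
        p = min(max(probs[label], eps), 1.0)
        total += -math.log(p)
    return total / len(true_labels)


# Made-up predictions for three passengers over the ports S, C and Q.
y_true = ["S", "C", "S"]
y_prob = [
    {"S": 0.8, "C": 0.1, "Q": 0.1},  # confident and correct: small penalty
    {"S": 0.6, "C": 0.3, "Q": 0.1},  # true class got only 0.3: large penalty
    {"S": 0.5, "C": 0.3, "Q": 0.2},
]

loss = multiclass_log_loss(y_true, y_prob)
leaderboard_style_score = -loss  # sign flipped so that higher is better
print(round(loss, 4), round(leaderboard_style_score, 4))
```

Read this way, the top row of the leaderboard (`score_test` of -0.553501) corresponds to a test log loss of about 0.55, and the two k-nearest-neighbour models at the bottom have a log loss close to 2.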


In [13]:
# try a regression problem
time_limit = 60  # for quick demonstration only; set this to the longest time you are willing to wait (in seconds)
label = 'Fare'
metric = 'root_mean_squared_error'  # RMSE is the default metric for regression problems
presets = 'best_quality'  # lets AutoGluon automatically build powerful model ensembles based on stacking/bagging
predictor = TabularPredictor(label=label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets=presets)
predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20220306_151315/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20220306_151315/"
AutoGluon Version:  0.3.2b20220304
Python Version:     3.9.7
Operating System:   Darwin
Train Data Rows:    712
Train Data Columns: 7
Label Column: Fare
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (512.3292, 0.0, 32.58628, 51.96953)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    3195.6 MB
	Train Data (Original)  Memory

index,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,NeuralNetTorch_BAG_L1,-24.900763,-36.763231,0.179309,0.124621,48.826595,0.179309,0.124621,48.826595,1,True,5
1,WeightedEnsemble_L2,-29.976564,-35.447741,0.258655,0.219891,49.316774,0.001299,0.000224,0.059086,2,True,6
2,KNeighborsUnif_BAG_L1,-38.556477,-45.824363,0.002675,0.00842,0.002391,0.002675,0.00842,0.002391,1,True,1
3,KNeighborsDist_BAG_L1,-38.734514,-45.349967,0.002993,0.007712,0.002277,0.002993,0.007712,0.002277,1,True,2
4,RandomForestMSE_BAG_L1,-39.324223,-37.75783,0.034183,0.043929,0.203133,0.034183,0.043929,0.203133,1,True,3
5,ExtraTreesMSE_BAG_L1,-45.306049,-38.905662,0.040871,0.043405,0.225684,0.040871,0.043405,0.225684,1,True,4
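
To help interpret the regression scores, here is a minimal sketch of RMSE computed by hand, together with the RMSE of a naive baseline that always predicts the mean fare. The fares are hypothetical and `rmse` is an illustrative helper, not an AutoGluon function. As with log loss, the leaderboard shows RMSE with a negative sign so that higher is better.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical fares and model predictions, for illustration only.
actual_fares = [7.25, 71.0, 8.05, 53.1]
predicted_fares = [10.25, 61.0, 8.05, 48.1]

model_rmse = rmse(actual_fares, predicted_fares)

# Baseline: always predict the mean fare. Its RMSE equals the standard
# deviation of the fares, which gives a yardstick for the model's error.
mean_fare = sum(actual_fares) / len(actual_fares)
baseline_rmse = rmse(actual_fares, [mean_fare] * len(actual_fares))

print(round(model_rmse, 3), round(baseline_rmse, 3))
```

The same yardstick applies to the run above: the training log reports a label standard deviation of about 51.97, and the best test RMSE on the leaderboard is about 24.9, roughly half of that. The comparison is loose, since the standard deviation was measured on the training rows and the score on the test rows.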
