# About

This notebook shows how to run automated EDA with pandas_profiling and AutoML with AutoGluon.

Although these packages can produce a well-performing model with very few lines of code, the results are generated in a "black box" fashion. It is still critical to learn EDA and machine learning algorithms and pipelines, as we did in this tutorial, in order to understand the results, try other or better tuning methods, and develop custom models that these packages do not cover.

Therefore, a good workflow is to use these packages to get a quick overview of the data and models, then drill down into the parts that are of special interest or show the greatest potential for better performance.

In this example, the best-performing model found by AutoGluon is XGBoost, which is not part of the scikit-learn package, with an accuracy of 0.8324. Our best manually tuned decision tree reaches an accuracy of 0.8258, which would rank #3, behind the ExtraTrees classifier (accuracy 0.8268) that we did not try.


In [1]:
# import packages
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from autogluon.tabular import TabularDataset, TabularPredictor

In [2]:
# read csv data into pandas dataframe
df = pd.read_csv('titanic.csv')

In [3]:
# generate pandas profiling report
profile = ProfileReport(df, title="Titanic Pandas Profiling Report")
#profile = ProfileReport(df, title="Titanic Pandas Profiling Report", minimal=True)  # this option turns off many expensive calculations for large datasets

In [4]:
# show report in notebook
profile.to_widgets()

Summarize dataset: 100%|██████████| 25/25 [00:10<00:00,  2.31it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:08<00:00,  8.41s/it]


VBox(children=(Tab(children=(Tab(children=(GridBox(children=(VBox(children=(GridspecLayout(children=(HTML(valu…
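
The profile report bundles checks that can also be run by hand when one column deserves a closer look. The following is a minimal sketch of that drill-down, assuming a small made-up sample with Titanic-like column names (the values are illustrative and are not taken from `titanic.csv`):

```python
import pandas as pd

# Hypothetical mini-sample with Titanic-like columns; values are illustrative only
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 11.13],
})

# Fraction of missing values per column
missing_share = sample.isna().mean()

# Summary statistics for the numeric columns
numeric_summary = sample.describe()

# Survival rate per category of a single feature
survival_by_sex = sample.groupby("Sex")["Survived"].mean()

print(missing_share)
print(numeric_summary)
print(survival_by_sex)
```

The same three calls should work on the full `df` from the cell above, which makes it easy to cross-check any number shown in the report.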

In [5]:
# this step is optional - I keep it here to be consistent with the tutorial
# dropping unimportant features, such as passenger id, name, ticket number and cabin number
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
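
Because the drop above mutates `df` in place, running the cell a second time raises a `KeyError` for the columns that are already gone. One possible alternative, sketched here on a hypothetical two-row frame, returns a new frame and skips missing labels:

```python
import pandas as pd

# Hypothetical two-row frame that reuses some of the Titanic column names
raw = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],
    "Pclass": [3, 1],
    "Survived": [0, 1],
})

# Columns to discard; "Ticket" and "Cabin" are absent here on purpose
to_drop = ["PassengerId", "Name", "Ticket", "Cabin"]

# errors="ignore" skips labels that are missing instead of raising KeyError,
# and the call returns a new frame, leaving `raw` untouched
features = raw.drop(columns=to_drop, errors="ignore")

print(list(features.columns))
```

Keeping the original frame around also makes it possible to bring a dropped column back later without rereading the CSV.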

In [6]:
# Split the data into a training set and a test set for autogluon
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
train_data

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,0,1,male,45.5,0,0,28.5000,S
733,0,2,male,23.0,0,0,13.0000,S
382,0,3,male,32.0,0,0,7.9250,S
704,0,3,male,26.0,1,0,7.8542,S
813,0,3,female,6.0,4,2,31.2750,S
...,...,...,...,...,...,...,...,...
106,1,3,female,21.0,0,0,7.6500,S
270,0,1,male,,0,0,31.0000,S
860,0,3,male,41.0,2,0,14.1083,S
435,1,1,female,14.0,1,2,120.0000,S
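
The split above is purely random, so the share of survivors may differ slightly between the two partitions. A variant that the notebook did not run, shown here as a sketch on a hypothetical toy frame, passes `stratify` to keep the class ratio the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced label column: 70 zeros and 30 ones
toy = pd.DataFrame({"x": range(100), "Survived": [0] * 70 + [1] * 30})

# stratify keeps the 70/30 class ratio in both partitions
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Survived"]
)

print(train_part["Survived"].mean(), test_part["Survived"].mean())
```

Note that switching to a stratified split would change which rows land in the test set, so scores would no longer be directly comparable with the leaderboards in this notebook.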


In [7]:
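from autogluon.tabular import TabularDataset, TabularPredictor
# train on the training set; AutoGluon infers the problem type (binary here) from the label column
predictor = TabularPredictor(label='Survived').fit(train_data)
# score every trained model on the held-out test set
leaderboard = predictor.leaderboard(test_data)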

No path specified. Models will be saved in: "AutogluonModels/ag-20210712_200740/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20210712_200740/"
AutoGluon Version:  0.2.0
Train Data Rows:    712
Train Data Columns: 7
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1711.77 MB
	Train Data (Original)  Memory Usage: 0.11 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of t

                  model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0        ExtraTreesEntr    0.826816   0.734266        0.092608       0.063413   0.656698                 0.092608                0.063413           0.656698            1       True          9
1      RandomForestGini    0.826816   0.769231        0.103510       0.105403   0.740338                 0.103510                0.105403           0.740338            1       True          5
2      RandomForestEntr    0.826816   0.755245        0.110604       0.089262   0.847276                 0.110604                0.089262           0.847276            1       True          6
3        ExtraTreesGini    0.821229   0.741259        0.092251       0.069837   0.715123                 0.092251                0.069837           0.715123            1       True          8
4            LightGBMXT    0.810056   0.
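
To drill down into a model family from the leaderboard, it can be rebuilt directly in scikit-learn. The sketch below assumes that `ExtraTreesEntr` corresponds roughly to an extra-trees classifier with the entropy split criterion; that mapping, the hyperparameters, and the synthetic data are all assumptions for illustration, not AutoGluon's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the Titanic frame would first need its
# categorical columns encoded and its missing ages imputed
X, y = make_classification(n_samples=500, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Extremely randomized trees with the entropy split criterion (assumed settings)
clf = ExtraTreesClassifier(n_estimators=300, criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

# Accuracy on the held-out part, the same metric as score_test above
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {test_accuracy:.4f}")
```

From here the usual manual tuning applies, for example a grid search over `max_depth` or `min_samples_leaf`.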

In [8]:
test_data

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
709,1,3,male,,1,1,15.2458,C
439,0,2,male,31.0,0,0,10.5000,S
840,0,3,male,20.0,0,0,7.9250,S
720,1,2,female,6.0,0,1,33.0000,S
39,1,3,female,14.0,1,0,11.2417,C
...,...,...,...,...,...,...,...,...
433,0,3,male,17.0,0,0,7.1250,S
773,0,3,male,,0,0,7.2250,C
25,1,3,female,38.0,1,5,31.3875,S
84,1,2,female,17.0,0,0,10.5000,S


In [9]:
passenger1 = pd.DataFrame(
    {   
        'Pclass': [3],
        'Sex': ['male'], 
        'Age': [23],
        'SibSp': [0],
        'Parch': [0],
        'Fare': [5.5],
        'Embarked': ['C'],
    }
)
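
A hand-built row like this needs the same feature columns as the training data. A small check, sketched with a hypothetical helper named `missing_features` (not part of pandas or AutoGluon), lists any feature that was left out:

```python
import pandas as pd

def missing_features(train_frame, new_rows, label):
    """Return training feature names that are absent from new_rows."""
    expected = [c for c in train_frame.columns if c != label]
    return [c for c in expected if c not in new_rows.columns]

# Hypothetical frames that mirror the notebook's column names
train_like = pd.DataFrame(columns=["Survived", "Pclass", "Sex", "Age",
                                   "SibSp", "Parch", "Fare", "Embarked"])
row = pd.DataFrame({"Pclass": [3], "Sex": ["male"], "Age": [23]})

# Features the incomplete row still lacks
print(missing_features(train_like, row, label="Survived"))
```

In the notebook, the call would presumably be `missing_features(train_data, passenger1, label='Survived')`, which should return an empty list for the row defined above.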

In [10]:
# predict for a single passenger
predictor.predict(passenger1)  # uses the best model by default



0    0
Name: Survived, dtype: int64

In [11]:
# predict using a specific model instead of the best one
predictor.predict(passenger1, model='RandomForestEntr')



0    0
Name: Survived, dtype: int64
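
The two cells above query one model at a time. To compare several models side by side, the calls can be wrapped in a loop. This is a sketch only: `predict_with_models` is a hypothetical helper, and `StubPredictor` is a stand-in that lets the example run without AutoGluon installed. The only thing it assumes about the real predictor is the `predict(data, model=...)` call used above.

```python
import pandas as pd

def predict_with_models(predictor, rows, model_names):
    """Collect one prediction column per model name."""
    return pd.DataFrame(
        {name: predictor.predict(rows, model=name) for name in model_names}
    )

class StubPredictor:
    """Stand-in so the sketch runs without AutoGluon; always predicts 0."""
    def predict(self, rows, model=None):
        return pd.Series([0] * len(rows), index=rows.index, name="Survived")

rows = pd.DataFrame({"Pclass": [3], "Sex": ["male"]})
comparison = predict_with_models(
    StubPredictor(), rows, ["ExtraTreesEntr", "RandomForestEntr"]
)
print(comparison)
```

With the real `predictor` and `passenger1`, the model names could be taken from the `model` column of the leaderboard.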

In [12]:
# try a multiclass classification with a different metric
time_limit = 60  # for quick demonstration only; set this to the longest time (in seconds) you are willing to wait
label = 'Embarked'
metric = 'log_loss'  # evaluation metric computed from predicted probabilities; lower is better (AutoGluon reports it negated, so higher leaderboard scores are better)
presets = 'best_quality'  # lets AutoGluon build stronger model ensembles using stacking/bagging
predictor = TabularPredictor(label=label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets=presets)
predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20210712_200809/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20210712_200809/"
AutoGluon Version:  0.2.0
Train Data Rows:    712
Train Data Columns: 7
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == object).
	3 unique label values:  ['S', 'C', 'Q']
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1991.96 MB
	Train Data (Original)  Memory Usage: 0.08 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manuall

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,-0.496261,-0.458405,0.173804,0.056279,8.672698,0.173804,0.056279,8.672698,1,True,5
1,WeightedEnsemble_L2,-0.500019,-0.43733,0.264145,0.090508,9.605694,0.003934,0.001473,0.783969,2,True,7
2,LightGBMXT_BAG_L1,-0.578152,-0.551534,0.382732,0.10305,40.647673,0.382732,0.10305,40.647673,1,True,4
3,RandomForestGini_BAG_L1,-0.620727,-0.623831,0.046487,0.027569,0.146362,0.046487,0.027569,0.146362,1,True,6
4,NeuralNetFastAI_BAG_L1,-0.709866,-0.668783,0.707407,0.192682,9.112587,0.707407,0.192682,9.112587,1,True,3
5,KNeighborsUnif_BAG_L1,-1.909044,-1.304484,0.007542,0.008119,0.00987,0.007542,0.008119,0.00987,1,True,1
6,KNeighborsDist_BAG_L1,-1.953941,-1.315919,0.03992,0.005187,0.002665,0.03992,0.005187,0.002665,1,True,2


In [13]:
# try a regression problem
time_limit = 60  # for quick demonstration only; set this to the longest time (in seconds) you are willing to wait
label = 'Fare'
metric = 'root_mean_squared_error'  # RMSE is the default metric for regression problems
presets = 'best_quality'  # lets AutoGluon build stronger model ensembles using stacking/bagging
predictor = TabularPredictor(label=label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets=presets)
predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20210712_200912/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20210712_200912/"
AutoGluon Version:  0.2.0
Train Data Rows:    712
Train Data Columns: 7
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (512.3292, 0.0, 32.58628, 51.96953)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    2247.04 MB
	Train Data (Original)  Memory Usage: 0.11 MB (0.0% of available memory)
	Inferring data type of each feature based on colum

[1000]	train_set's rmse: 24.1369	valid_set's rmse: 38.9681
[2000]	train_set's rmse: 20.4021	valid_set's rmse: 37.4501
[3000]	train_set's rmse: 18.3425	valid_set's rmse: 37.0559
[4000]	train_set's rmse: 16.9427	valid_set's rmse: 36.8768
[5000]	train_set's rmse: 15.8747	valid_set's rmse: 36.7854
[1000]	train_set's rmse: 22.1435	valid_set's rmse: 43.202


	-34.3276	 = Validation root_mean_squared_error score
	27.38s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 31.9s of the 31.89s of remaining time.
	Ran out of time, early stopping on iteration 371. Best iteration is:
	[257]	train_set's rmse: 25.2884	valid_set's rmse: 27.4423
	Ran out of time, early stopping on iteration 136. Best iteration is:
	[136]	train_set's rmse: 27.0032	valid_set's rmse: 45.0637
	Ran out of time, early stopping on iteration 97. Best iteration is:
	[95]	train_set's rmse: 28.6812	valid_set's rmse: 41.2727
	Ran out of time, early stopping on iteration 180. Best iteration is:
	[139]	train_set's rmse: 28.9203	valid_set's rmse: 21.1583
	Ran out of time, early stopping on iteration 303. Best iteration is:
	[246]	train_set's rmse: 22.2556	valid_set's rmse: 42.6277
	-33.8552	 = Validation root_mean_squared_error score
	30.16s	 = Training runtime
	0.14s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,-35.332268,-33.855177,0.125364,0.1355,30.155718,0.125364,0.1355,30.155718,1,True,4
1,LightGBMXT_BAG_L1,-35.335685,-34.327552,0.311582,0.101881,27.380675,0.311582,0.101881,27.380675,1,True,3
2,WeightedEnsemble_L2,-35.337402,-33.621813,0.558324,0.424035,58.577053,0.002495,0.003437,0.366925,2,True,7
3,RandomForestMSE_BAG_L1,-39.355784,-37.670003,0.094323,0.150841,0.811818,0.094323,0.150841,0.811818,1,True,5
4,KNeighborsUnif_BAG_L1,-39.573354,-42.442024,0.009333,0.020595,0.00782,0.009333,0.020595,0.00782,1,True,1
5,KNeighborsDist_BAG_L1,-40.940658,-43.999949,0.024063,0.027958,0.003279,0.024063,0.027958,0.003279,1,True,2
6,ExtraTreesMSE_BAG_L1,-45.746227,-38.246714,0.118884,0.183216,0.673735,0.118884,0.183216,0.673735,1,True,6
