# automl-gs Hello World

#### by Max Woolf (@minimaxir)

This notebook shows how automl-gs works, with very little effort on your part!

(Note: this notebook assumes you have installed automl-gs, TensorFlow, and xgboost on the system.)

In [1]:
from automl_gs import automl_grid_search
import pandas as pd

For this Hello World, we'll use the [Titanic dataset](http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html), which is small and good for making sure everything works. We'll try to predict `Survived`, i.e. whether the person in a given row survived, and select the model that has the best `accuracy`. (A typical model with substantial preprocessing gets about 80% accuracy on this problem.)

We'll download the dataset first; automl-gs requires the CSV to be on the local system.

In [2]:
df = pd.read_csv('http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
df.to_csv('titanic.csv', index=False)
df.head(10)

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |
| 5 | 0 | 3 | Mr. James Moran | male | 27.0 | 0 | 0 | 8.4583 |
| 6 | 0 | 1 | Mr. Timothy J McCarthy | male | 54.0 | 0 | 0 | 51.8625 |
| 7 | 0 | 3 | Master. Gosta Leonard Palsson | male | 2.0 | 3 | 1 | 21.075 |
| 8 | 1 | 3 | Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson | female | 27.0 | 0 | 2 | 11.1333 |
| 9 | 1 | 2 | Mrs. Nicholas (Adele Achem) Nasser | female | 14.0 | 1 | 0 | 30.0708 |


Now we can run automl-gs with just one quick command! By default, automl-gs will use TensorFlow.

automl-gs tries to automatically infer the problem type and the data types of the columns. During training, progress bars show how far along both the overall experiment and the current trial are, along with the elapsed time and the ETA for each.
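To give a feel for what inferring column types can look like, here is a minimal sketch of one possible heuristic. It is not automl-gs's actual logic: the function name `guess_field_types` and the `max_categories` threshold are hypothetical and chosen only for illustration. It labels each column `categorical`, `numeric`, or `ignore`, the same three labels that appear in the field specifications printed by the run below.

```python
import pandas as pd


def guess_field_types(df: pd.DataFrame, target: str, max_categories: int = 20) -> dict:
    """Guess a coarse type for every non-target column (hypothetical heuristic)."""
    types = {}
    for col in df.columns:
        if col == target:
            continue  # the target is predicted, not used as an input
        if df[col].nunique() <= max_categories:
            types[col] = "categorical"  # few distinct values
        elif pd.api.types.is_numeric_dtype(df[col]):
            types[col] = "numeric"      # many distinct numbers
        else:
            types[col] = "ignore"       # high-cardinality text, e.g. names
    return types


# First six rows of the preview above (names copied as displayed).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Name": [
        "Mr. Owen Harris Braund",
        "Mrs. John Bradley (Florence Briggs Thayer) Cum...",
        "Miss. Laina Heikkinen",
        "Mrs. Jacques Heath (Lily May Peel) Futrelle",
        "Mr. William Henry Allen",
        "Mr. James Moran",
    ],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# A tiny threshold, because the sample only has six rows.
inferred = guess_field_types(sample, target="Survived", max_categories=3)
for field, kind in inferred.items():
    print(f"{field}: {kind}")
```

The threshold is the main design choice in a heuristic like this: too low and small integer columns get treated as numeric, too high and continuous columns get one-hot encoded.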

In [3]:
automl_grid_search('titanic.csv', 'Survived')

Solving a binary_classification problem, maximizing accuracy using tensorflow.

Modeling with field specifications:
Pclass: categorical
Name: ignore
Sex: categorical
Age: numeric
Siblings/Spouses Aboard: categorical
Parents/Children Aboard: categorical
Fare: numeric




Metrics:
trial_id: ec32902c-7915-44ec-abfd-820d2c483e41
epoch: 20
time_completed: 2019-03-25 00:13:56
log_loss: 0.6927925439809592
accuracy: 0.6142322097378277
auc: 0.5
precision: 0.6142322097378277
recall: 0.6142322097378277
f1: 0.6142322097378277


Metrics:
trial_id: a604947c-8d0c-4cf0-bfee-dd47470e8033
epoch: 20
time_completed: 2019-03-25 00:14:09
log_loss: 0.4649398927524518
accuracy: 0.7865168539325843
auc: 0.8289426947667535
precision: 0.7865168539325843
recall: 0.7865168539325843
f1: 0.7865168539325842


Metrics:
trial_id: 38a1d264-ec71-4204-bd58-7c1a3849aaf5
epoch: 20
time_completed: 2019-03-25 00:15:01
log_loss: 0.5828870279550524
accuracy: 0.8052434456928839
auc: 0.8241475254558371
precision: 0.8052434456928839
recall: 0.8052434456928839
f1: 0.8052434456928839


About 80.5% accuracy: not bad.
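To put an accuracy figure in context, it can help to compare it with the accuracy of always predicting the most common label. The sketch below is an illustration, not part of automl-gs; the helper name `majority_baseline_accuracy` is hypothetical. It uses only the ten `Survived` values shown in the preview earlier, so the number it prints describes those ten rows, not the full dataset.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return float(labels.value_counts(normalize=True).max())


# `Survived` values from the ten preview rows shown earlier in this notebook.
preview_labels = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])
baseline = majority_baseline_accuracy(preview_labels)
print(f"baseline accuracy on the preview rows: {baseline:.2f}")
```

For the full file, the same function could be applied to `pd.read_csv('titanic.csv')['Survived']`. The first trial above, with an AUC of 0.5, looks like it may be a constant predictor of this kind, although that is an inference from the printed metrics, not something automl-gs reports.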

The model files are saved in a timestamped folder in the same directory, and `automl_results.csv` contains the results from all of the training runs.
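Once the results file exists, picking out the best trial takes a few lines of pandas. The sketch below assumes the CSV columns use the same names as the printed metrics (`trial_id`, `accuracy`, `log_loss`); that is an assumption, so check the header of your own `automl_results.csv`. To keep the example self-contained, it builds a stand-in DataFrame from the three trials printed above; `best_trial` is a hypothetical helper.

```python
import pandas as pd


def best_trial(results: pd.DataFrame, metric: str = "accuracy") -> pd.Series:
    """Return the row with the highest value of `metric`."""
    return results.loc[results[metric].idxmax()]


# Stand-in for pd.read_csv('automl_results.csv'), built from the metrics
# printed above. The column names are assumed to match the printed names.
results = pd.DataFrame({
    "trial_id": [
        "ec32902c-7915-44ec-abfd-820d2c483e41",
        "a604947c-8d0c-4cf0-bfee-dd47470e8033",
        "38a1d264-ec71-4204-bd58-7c1a3849aaf5",
    ],
    "log_loss": [0.6927925439809592, 0.4649398927524518, 0.5828870279550524],
    "accuracy": [0.6142322097378277, 0.7865168539325843, 0.8052434456928839],
})

best = best_trial(results)
print(best["trial_id"], round(best["accuracy"], 3))
```

For a metric where lower is better, such as `log_loss`, use `idxmin()` in place of `idxmax()`.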

You can use another framework such as `xgboost`, or change the number of trials or epochs, by passing the corresponding arguments to `automl_grid_search`.

xgboost runs substantially faster than TensorFlow, but may not be as robust on more complicated datasets.

In [4]:
automl_grid_search('titanic.csv', 'Survived', framework='xgboost', num_epochs=50)

Solving a binary_classification problem, maximizing accuracy using xgboost.

Modeling with field specifications:
Pclass: categorical
Name: ignore
Sex: categorical
Age: numeric
Siblings/Spouses Aboard: categorical
Parents/Children Aboard: categorical
Fare: numeric




Metrics:
trial_id: 97c0ebe9-d03b-43f9-befc-08bfaf0b2c14
epoch: 50
time_completed: 2019-03-25 00:25:03
log_loss: 0.6393894917063052
accuracy: 0.6891385767790262
auc: 0.6897643855079327
precision: 0.6891385767790262
recall: 0.6891385767790262
f1: 0.6891385767790262


Metrics:
trial_id: d77ce1bb-51b0-459a-807a-284a93c1fc66
epoch: 50
time_completed: 2019-03-25 00:25:04
log_loss: 0.5660474079378536
accuracy: 0.7940074906367042
auc: 0.8357506511958324
precision: 0.7940074906367042
recall: 0.7940074906367042
f1: 0.7940074906367042


Metrics:
trial_id: d50fd10e-93b0-4f32-a8f8-cf4e4869ec01
epoch: 50
time_completed: 2019-03-25 00:25:09
log_loss: 0.62283251035526
accuracy: 0.797752808988764
auc: 0.8229931328439498
precision: 0.797752808988764
recall: 0.797752808988764
f1: 0.797752808988764


Metrics:
trial_id: a66fefdd-f0b2-4370-b5d1-9db585bb5f71
epoch: 50
time_completed: 2019-03-25 00:25:14
log_loss: 0.4226235784171672
accuracy: 0.8389513108614233
auc: 0.853658536585366
precision: 0.83895131086

Meanwhile, xgboost got 83.9% accuracy on the same dataset in about a quarter of the time, even after massively increasing the number of epochs.

Although automl-gs automates much of the option selection, you're still encouraged to try multiple options.

# MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.