# automl-gs Hello World

#### by Max Woolf (@minimaxir)

This notebook walks through a minimal example of how automl-gs works, with very little effort required!

(Note: this notebook assumes you have installed automl-gs, TensorFlow, and xgboost on the system.)

In [2]:
from automl_gs import automl_grid_search
import pandas as pd

ModuleNotFoundError: No module named 'automl_gs'

For this Hello World, we'll use the [Titanic dataset](http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html), which is small and well suited to checking that everything works. We'll try to predict `Survived`, i.e. whether the person in a given row survived, and select the model that has the best `accuracy`. (A typical model with substantial preprocessing gets about 80% accuracy on this problem.)

We'll download the dataset first; automl-gs requires the CSV to be on the local system.

In [2]:
df = pd.read_csv('http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
df.to_csv('titanic.csv', index=False)
df.head(10)

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |
| 5 | 0 | 3 | Mr. James Moran | male | 27.0 | 0 | 0 | 8.4583 |
| 6 | 0 | 1 | Mr. Timothy J McCarthy | male | 54.0 | 0 | 0 | 51.8625 |
| 7 | 0 | 3 | Master. Gosta Leonard Palsson | male | 2.0 | 3 | 1 | 21.075 |
| 8 | 1 | 3 | Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson | female | 27.0 | 0 | 2 | 11.1333 |
| 9 | 1 | 2 | Mrs. Nicholas (Adele Achem) Nasser | female | 14.0 | 1 | 0 | 30.0708 |
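Before training anything, it can help to know what accuracy a trivial model would reach, so the numbers reported later have a reference point. The sketch below computes a majority-class baseline with pandas. The function name `majority_baseline_accuracy` and the tiny `demo` frame are illustrative assumptions, not part of automl-gs.

```python
import pandas as pd


def majority_baseline_accuracy(df: pd.DataFrame, target: str) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    # value_counts(normalize=True) gives the share of each class;
    # the largest share is what a constant predictor would score.
    return float(df[target].value_counts(normalize=True).max())


# Tiny stand-in frame (made-up values) so the sketch runs on its own.
# In the notebook you would pass the Titanic `df` instead.
demo = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})
baseline = majority_baseline_accuracy(demo, "Survived")
print(f"Majority-class baseline: {baseline:.1%}")
```

Any trial whose accuracy sits at or near this baseline has probably learned to predict a single class.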


Now we can run automl-gs with just one quick command! By default, automl-gs will use TensorFlow.

automl-gs tries to automatically infer the problem type and the data types of the columns. During training, progress bars show how far the overall experiment and the current trial have progressed, along with the elapsed time and the ETA until completion for each.
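To give a feel for what column type inference can look like, here is a rough sketch of one possible heuristic. It is not automl-gs's actual logic: the function `guess_field_types`, the `max_categories` threshold, and the rules themselves are assumptions made purely for illustration. The demo frame reuses a few columns from the first six rows shown above.

```python
import pandas as pd


def guess_field_types(df: pd.DataFrame, target: str, max_categories: int = 10) -> dict:
    """Hypothetical heuristic; NOT the rules automl-gs really applies."""
    types = {}
    for col in df.columns:
        if col == target:
            continue  # the target is what we predict, not an input field
        series = df[col]
        if series.nunique() <= max_categories:
            types[col] = "categorical"  # few distinct values
        elif pd.api.types.is_numeric_dtype(series):
            types[col] = "numeric"  # many distinct numeric values
        else:
            types[col] = "ignore"  # e.g. near-unique free text
    return types


demo = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Name": ["Mr. Owen Harris Braund", "Mrs. John Bradley", "Miss. Laina Heikkinen",
             "Mrs. Jacques Heath", "Mr. William Henry Allen", "Mr. James Moran"],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# A low threshold is used only because the demo frame has six rows.
guessed = guess_field_types(demo, "Survived", max_categories=3)
for field, kind in guessed.items():
    print(f"{field}: {kind}")
```

Whatever heuristic a tool uses, it is worth reading the printed field specifications before trusting the results, since a wrongly typed column can quietly hurt accuracy.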

In [3]:
automl_grid_search('titanic.csv', 'Survived')

Solving a binary_classification problem, maximizing accuracy using tensorflow.

Modeling with field specifications:
Pclass: categorical
Name: ignore
Sex: categorical
Age: numeric
Siblings/Spouses Aboard: categorical
Parents/Children Aboard: categorical
Fare: numeric




Metrics:
trial_id: 269b2f5a-5759-48eb-908a-91301498f173
epoch: 20
time_completed: 2019-03-25 14:37:26
log_loss: 0.6891973887043499
accuracy: 0.6142322097378277
auc: 0.5012431920435709
precision: 0.30711610486891383
recall: 0.5
f1: 0.3805104408352668


Metrics:
trial_id: 261256e8-1708-43d8-98de-f73ca3d5ad4f
epoch: 20
time_completed: 2019-03-25 14:37:37
log_loss: 0.6832971124166853
accuracy: 0.6404494382022472
auc: 0.7806950035519773
precision: 0.7603359173126615
recall: 0.5357861709685058
f1: 0.4576844955991876


Metrics:
trial_id: 3a693105-3ff9-42be-802d-6a9147b6638b
epoch: 20
time_completed: 2019-03-25 14:38:14
log_loss: 0.5836427935611889
accuracy: 0.700374531835206
auc: 0.7412384560738811
precision: 0.6830808080808081
recall: 0.6802628463177836
f1: 0.6814982703089585


Metrics:
trial_id: f3f24805-79f0-4feb-bdff-1050d912df71
epoch: 20
time_completed: 2019-03-25 14:39:11
log_loss: 0.578730115506533
accuracy: 0.704119850187266
auc: 0.7311449206725077
precision: 0.6875487900078064
reca

About 79.4% accuracy: not bad.

The model files are saved in a time-stamped folder in the same directory, and `automl_results.csv` has the results from all the training.
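As a quick way to inspect those results, the sketch below loads a results CSV with pandas and picks the row with the highest value of a chosen metric. It assumes the file has columns named like the fields in the printed metrics (`trial_id`, `epoch`, `accuracy`); the helper `best_trial` and the in-memory demo data are hypothetical.

```python
import io

import pandas as pd


def best_trial(results_csv, metric: str = "accuracy") -> pd.Series:
    """Return the row with the highest value of `metric`.

    Assumes the CSV has a column named after the metric.
    """
    results = pd.read_csv(results_csv)
    return results.loc[results[metric].idxmax()]


# Made-up rows standing in for automl_results.csv so the sketch runs alone.
# In the notebook you would call best_trial("automl_results.csv") instead.
demo_csv = io.StringIO(
    "trial_id,epoch,accuracy\n"
    "trial-a,20,0.61\n"
    "trial-b,20,0.70\n"
)
best = best_trial(demo_csv)
print(best["trial_id"], best["accuracy"])
```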

You can use another framework like `xgboost`, or change the number of trials/epochs, by passing the corresponding arguments to `automl_grid_search`.

xgboost runs substantially faster than TensorFlow, but may not be as robust on more complicated datasets.

In [4]:
automl_grid_search('titanic.csv', 'Survived', framework='xgboost', num_epochs=50)

Solving a binary_classification problem, maximizing accuracy using xgboost.

Modeling with field specifications:
Pclass: categorical
Name: ignore
Sex: categorical
Age: numeric
Siblings/Spouses Aboard: categorical
Parents/Children Aboard: categorical
Fare: numeric




Metrics:
trial_id: 0bf1bc3a-e215-4685-9c1e-2f924c4dda15
epoch: 50
time_completed: 2019-03-25 14:47:29
log_loss: 0.6861143221569418
accuracy: 0.6891385767790262
auc: 0.6915699739521668
precision: 0.6718104160925875
recall: 0.6476438550793275
f1: 0.6512550161303012


Metrics:
trial_id: 6ad4efef-f02a-461e-a3f6-febb49b7eb39
epoch: 50
time_completed: 2019-03-25 14:47:31
log_loss: 0.5730137539546142
accuracy: 0.7303370786516854
auc: 0.7495560028415817
precision: 0.7193349263241736
recall: 0.6956251479990527
f1: 0.7014906832298138


Metrics:
trial_id: 758c4a9e-ce70-4a1c-95f9-bf936da4b55a
epoch: 50
time_completed: 2019-03-25 14:47:32
log_loss: 0.5655809819475095
accuracy: 0.7940074906367042
auc: 0.8300378877575183
precision: 0.8316425120772947
recall: 0.7438432394032678
f1: 0.7571643543399533


Metrics:
trial_id: 5e83896a-001d-4372-bf4a-5e4c77addea7
epoch: 50
time_completed: 2019-03-25 14:47:34
log_loss: 0.44416837770952267
accuracy: 0.8202247191011236
auc: 0.8358690504380772
precision: 0.828

Meanwhile, xgboost got 83.5% accuracy on the same dataset in about a quarter of the time, even after increasing the number of epochs from 20 to 50.

Although automl-gs automates a lot of the option selection, you are still encouraged to try multiple options.
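One way to try several configurations in a row is to loop over sets of keyword arguments. The sketch below is only an illustration: `run_option_sets` and `record_call` are hypothetical helpers, and the option sets use only the arguments already shown in this notebook (`framework` and `num_epochs`). A recording stand-in replaces the real search function so the example runs without automl-gs installed.

```python
def run_option_sets(search_fn, csv_path, target_field, option_sets):
    """Call `search_fn` once for each dict of keyword arguments."""
    for options in option_sets:
        search_fn(csv_path, target_field, **options)


option_sets = [
    {},  # library defaults (TensorFlow in this notebook)
    {"framework": "xgboost", "num_epochs": 50},
]

# Stand-in that only records its calls; in the notebook,
# pass `automl_grid_search` as `search_fn` instead.
calls = []


def record_call(csv_path, target_field, **kwargs):
    calls.append((csv_path, target_field, kwargs))


run_option_sets(record_call, "titanic.csv", "Survived", option_sets)
print(calls)
```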

# MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.