<a href="https://colab.research.google.com/github/nicholasrichers/auto-ml-neuron/blob/master/Artigo_auto_ml_neuron.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## How to build and evaluate a machine learning model in under 10 minutes with AutoML

For newcomers to data science, the term automated machine learning (**AutoML**) can sound like a threat to their entry into the field, since it hardly seems to make sense to join an area that is close to being automated. However, I would like to offer the opposite perspective and show how AutoML can **lower the barrier** to entry.


Understanding conceptually what a machine learning problem sets out to solve, without going into the details of each algorithm, is not that hard. First we receive a dataset, which usually has some problems arising from the way it was collected, such as **null values**, **non-numeric variables**, [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/) and so on.


Once the data is prepared, we pick a handful of machine learning algorithms suited to our task (classification or regression, in most cases), taking into account criteria such as the **dimensionality** of the dataset (very complex algorithms such as neural networks tend to "[memorize](https://pt.wikipedia.org/wiki/Sobreajuste)" all the answers of small datasets), **processing** time, and even whether you want the output as a **probability** (some classification algorithms, such as SVM, either do not produce one or are not reliable for that purpose).


In the end, you pick 5 or 6 of them, preferably from different families (trees, k-nearest neighbors, linear models, ensembles...), and check empirically which one performs best. But for that you need a framework to validate the results. Topics such as [cross-validation](https://pt.wikipedia.org/wiki/Valida%C3%A7%C3%A3o_cruzada#:~:text=A%20valida%C3%A7%C3%A3o%20cruzada%20%C3%A9%20uma,da%20modelagem%20%C3%A9%20a%20predi%C3%A7%C3%A3o.), [bias and variance](https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/#:~:text=Bias%20is%20the%20simplifying%20assumptions,the%20bias%20and%20the%20variance.), the choice of [metrics](https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/) and even hyperparameter optimization start to come up.
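
To make the idea of cross-validation a little more concrete, here is a minimal sketch in plain Python of how a dataset can be split into k folds, each one used once for validation while the rest is used for training. The function name `k_fold_indices` is made up for this illustration, and the split is sequential; real libraries usually shuffle or stratify the rows first.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds (train, validation) index pairs."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by n_folds
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, valid))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, n_folds=3)
for train, valid in folds:
    print("train:", train, "valid:", valid)
```

Every row ends up in exactly one validation fold, so each candidate model is always scored on data it did not see during training.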

All the **concepts** mentioned so far are often presented in **isolation**, and even out of an order that makes sense. Although none of them is hard to understand on its own, especially with the thousands of visual explanations available for each one, it can be hard to follow a project from **end to end**. All the more so if you are not familiar with the libraries of your preferred language: you will already be exhausted by the end of the exploratory analysis.


So, especially if you understand a bit of the concepts above but still stumble when implementing all these steps, and need something tangible in a short time, AutoML, and **H2O** in particular, can help you a lot.

# Installing the H2O dependencies

Note: Outside Google Colab, these commands should be run directly in the command prompt (without the leading `!`).

In [1]:
%%capture
!apt-get install default-jre
!pip install h2o

In [2]:
import h2o

> H2O must be started before use. The call below accepts several configuration options, but if you prefer you can simply run:


```
h2o.init()
```


With the options set explicitly, it looks like this:


In [3]:
h2o.init(ip = "localhost",
         port = 54321, # default port
         nthreads = -1, # use all available cores
         max_mem_size="10g", # adjust to your machine
         name = "neuron-cluster")

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.8" 2020-07-14; OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1); OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpt3e8his2
  JVM stdout: /tmp/tmpt3e8his2/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpt3e8his2/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O_cluster_uptime | 02 secs |
| H2O_cluster_timezone | Etc/UTC |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.30.0.7 |
| H2O_cluster_version_age | 17 days |
| H2O_cluster_name | neuron-cluster |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 10 Gb |
| H2O_cluster_total_cores | 2 |
| H2O_cluster_allowed_cores | 2 |


After initializing H2O, we can explore the models through its graphical interface at the link below (if the default port was kept). You can find more about it [here](https://www.youtube.com/watch?v=J6DjhhZN9-A&list=PLjdDBZW3EmXe_auwS29jLPBZ3_2PpaNiU&index=3).
 
http://localhost:54321/flow/index.html



---






# Loading the data

> The chosen [dataset](https://github.com/rodrigosantis1/backorder_prediction) was used in a Kaggle **competition** whose goal was to **predict** whether a given **product** (sku) will go **on backorder** (went_on_backorder) in a given period, given some **information** such as **lead time** and **sales forecasts**.

In [4]:
data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/product_backorders.csv"
df = h2o.import_file(data_path, destination_frame = "df")

Parse progress: |█████████████████████████████████████████████████████████| 100%


> The resulting object is an **H2OFrame**. If you are used to **pandas**, you can treat it in a similar way, with some differences in command syntax that are covered in the [documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging.html#data-manipulation).


In [5]:
type(df)

h2o.frame.H2OFrame

> Looking at the dataset below, a few points stand out:


1.   **Null values**: The *lead_time* column has 1078 null values, as shown below.
2.   **Categorical variables**: The last columns, such as *deck_risk* and *oe_constraint*, were parsed as type enum.


We know that these cases can cause problems if fed directly into a machine learning model, but we will not do any pre-processing, so we can see whether H2O is able to handle these situations.



In [6]:
df.describe()

Rows:19053
Cols:23




Unnamed: 0,sku,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,min_bank,potential_issue,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
type,int,int,int,int,int,int,int,int,int,int,int,int,enum,int,real,real,int,enum,enum,enum,enum,enum,enum
mins,1111620.0,-1440.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,-99.0,-99.0,0.0,,,,,,
mean,2059552.76056264,376.36702881436014,7.706036161335188,48.272345562378625,182.91082769117727,344.739830997743,497.79242114102766,56.11887891670605,168.53445651603428,333.53219965359773,504.2553928515193,48.84070750013117,,2.3114995013908572,-6.519833622001783,-6.05393533826694,0.8917755734005144,,,,,,
maxs,3284775.0,730722.0,52.0,170920.0,479808.0,967776.0,1418208.0,186451.0,550609.0,1136154.0,1759152.0,85584.0,,13824.0,1.0,1.0,1440.0,,,,,,
sigma,663337.6456498688,7002.071628662688,6.778665072124189,1465.9992102068293,4304.865591970626,8406.062155159243,12180.570042918358,1544.2177775482573,4581.340080221506,9294.566153218986,14184.14539565363,968.7738680675265,,110.24106014611986,25.975138766871876,25.184497150032527,23.033345417338797,,,,,,
zeros,0,1858,121,15432,12118,11136,10604,10278,8022,6864,6231,9909,,18601,474,401,18585,,,,,,
missing,0,0,1078,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1113121.0,0.0,8.0,1.0,6.0,6.0,6.0,0.0,4.0,9.0,12.0,0.0,No,1.0,0.9,0.89,0.0,No,No,No,Yes,No,Yes
1,1113268.0,0.0,8.0,0.0,2.0,3.0,4.0,1.0,2.0,3.0,3.0,0.0,No,0.0,0.96,0.97,0.0,No,No,No,Yes,No,Yes
2,1113874.0,20.0,2.0,0.0,45.0,99.0,153.0,16.0,42.0,80.0,111.0,10.0,No,0.0,0.81,0.88,0.0,No,No,No,Yes,No,Yes




---






# Testing models with AutoML

> We are now ready to build our model; all we need is to **set** a few **parameters**, as in the cell below. We can cap the number of models to train (**max_models**) and set a time limit (**max_runtime_secs**). I also reduced the number of folds (**nfolds**) to 3 to cut processing time and chose **AUC** as the leaderboard metric. In any case, all of these parameters have default values, so none of them has to be set manually.


Note: The full list of **H2OAutoML** parameters is available [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html).


In [43]:
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models = 30, # default: None (no limit on the number of models)
                max_runtime_secs=300, # default: 1 hour when no limit is given
                nfolds=3, # default: 5
                sort_metric = "AUC", # default: "AUTO" (AUC for binary classification)
                seed = 1,
                )
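
Since AUC is the metric chosen to sort the leaderboard, it may help to see what it measures. The sketch below, in plain Python, computes AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels and scores are hypothetical, and this pairwise loop is only meant to illustrate the definition; it is not how H2O computes the metric internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels (1 = went on backorder) and predicted probabilities
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.6, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why it is a convenient metric for an imbalanced target such as this one.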

> For **training**, we drop the first column (sku), which is just an identifier, from the X matrix, and the last column (went_on_backorder) is our y.

In [44]:
%%time
aml.train(x = df.columns[1:-1],
          y = "went_on_backorder", 
          training_frame = df
          )

AutoML progress: |████████████████████████████████████████████████████████| 100%
CPU times: user 25.2 s, sys: 705 ms, total: 25.9 s
Wall time: 5min 7s
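
The slice `df.columns[1:-1]` used above relies on the column order. As a small sketch in plain Python, using the column names listed in the `df.describe()` output, the same selection can be checked and also written by name, which keeps working if the columns are ever reordered. The variable `x_by_name` is introduced here only for illustration.

```python
# Column names as listed in the df.describe() output
columns = [
    "sku", "national_inv", "lead_time", "in_transit_qty",
    "forecast_3_month", "forecast_6_month", "forecast_9_month",
    "sales_1_month", "sales_3_month", "sales_6_month", "sales_9_month",
    "min_bank", "potential_issue", "pieces_past_due",
    "perf_6_month_avg", "perf_12_month_avg", "local_bo_qty",
    "deck_risk", "oe_constraint", "ppap_risk", "stop_auto_buy", "rev_stop",
    "went_on_backorder",
]

y = "went_on_backorder"
# Same slice as df.columns[1:-1]: skip the first (id) and last (target) columns
x = columns[1:-1]

# A name-based alternative that does not depend on column order
x_by_name = [c for c in columns if c not in ("sku", y)]

print(len(columns), len(x), x == x_by_name)
```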




---






## Comparing models

According to the same documentation, AutoML only works with "top of the line" algorithms, and the truth is that, **for now, you do not need to worry** about how these models work in depth.

1. Random Forest (**DRF**)
2. Extremely-Randomized Forest (**DRF/XRT**)
3. Generalized Linear Models (**GLM**)
4. XGBoost (**XGBoost**)
5. Gradient Boosting Machines (**GBM**)
6. Deep Neural Nets (**DeepLearning**)
7. **Stacked Ensembles**: One of all the models
8. **Stacked Ensembles**: Only the best models of each kind

> With *get_leaderboard()* we can view a results table. As expected, the **StackedEnsemble** models had the best performance in terms of **AUC**, but the *training_time_ms* column is also worth a look.


For comparison, in the leaderboard below the best individual model, *XGBoost_grid__1_AutoML_20200808_024528_model_6*, takes about **4x longer to train** than *XGBoost_3_AutoML_20200808_024528* (6430 ms vs 1597 ms) for an AUC of 0.946 vs 0.945, an almost **negligible** difference. Depending on your application, you may prefer one or the other.


Note: It is important to stress that the *StackedEnsemble* models use the **predictions of the other models as input**, so in practice the training time of those models should be added to their own.



In [45]:
h2o.automl.get_leaderboard(aml, extra_columns = 'training_time_ms').head(30)

| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse | training_time_ms |
|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_AutoML_20200808_024528 | 0.949843 | 0.180871 | 0.749203 | 0.161413 | 0.226127 | 0.0511335 | 2336 |
| StackedEnsemble_BestOfFamily_AutoML_20200808_024528 | 0.948128 | 0.181832 | 0.748647 | 0.166513 | 0.22677 | 0.0514246 | 1356 |
| XGBoost_grid__1_AutoML_20200808_024528_model_6 | 0.946223 | 0.177022 | 0.734937 | 0.151898 | 0.228979 | 0.0524315 | 6430 |
| XGBoost_grid__1_AutoML_20200808_024528_model_3 | 0.94588 | 0.195842 | 0.738485 | 0.161088 | 0.232142 | 0.0538901 | 5089 |
| XGBoost_3_AutoML_20200808_024528 | 0.944862 | 0.178901 | 0.727522 | 0.171679 | 0.230257 | 0.0530184 | 1597 |
| XGBoost_grid__1_AutoML_20200808_024528_model_4 | 0.944287 | 0.18157 | 0.722594 | 0.159271 | 0.231161 | 0.0534353 | 2331 |
| XGBoost_2_AutoML_20200808_024528 | 0.943624 | 0.182506 | 0.716941 | 0.172878 | 0.232402 | 0.0540107 | 2267 |
| XGBoost_1_AutoML_20200808_024528 | 0.943017 | 0.183659 | 0.719312 | 0.187809 | 0.23275 | 0.0541728 | 2076 |
| XGBoost_grid__1_AutoML_20200808_024528_model_1 | 0.942965 | 0.183016 | 0.716754 | 0.169328 | 0.232128 | 0.0538833 | 2231 |
| GBM_grid__1_AutoML_20200808_024528_model_2 | 0.942432 | 0.180284 | 0.742392 | 0.155666 | 0.228801 | 0.0523498 | 4031 |
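
The trade-off between AUC and training time can be read straight from the leaderboard above. The sketch below copies three of its rows into plain Python and compares the best individual model with the fastest of the three; the variable names are only illustrative.

```python
# (model_id, auc, training_time_ms) copied from the leaderboard above
rows = [
    ("XGBoost_grid__1_AutoML_20200808_024528_model_6", 0.946223, 6430),
    ("XGBoost_3_AutoML_20200808_024528", 0.944862, 1597),
    ("GBM_grid__1_AutoML_20200808_024528_model_2", 0.942432, 4031),
]

# Best individual model by AUC and fastest model by training time
best_id, best_auc, best_ms = max(rows, key=lambda r: r[1])
fast_id, fast_auc, fast_ms = min(rows, key=lambda r: r[2])

slowdown = best_ms / fast_ms    # how many times longer the best model takes to train
auc_gain = best_auc - fast_auc  # what that extra time buys in AUC

print(f"{best_id} vs {fast_id}")
print(f"training time: {slowdown:.2f}x, AUC gain: {auc_gain:.6f}")
```

Whether an AUC gain in the third decimal place justifies roughly four times the training time depends on the application.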




On the other hand, the DeepLearning models did not perform well, besides requiring a lot of processing time.

> Some drawbacks of this approach are:



1.  **Null values**: Some of these models, such as the tree-based ones, are robust to null values, but others, such as GLM, require some value to be filled in. It is not clear whether H2O used the same approach for every model or has a specific technique for each model family.
2.  **Categorical variables**: It is not clear whether *One-Hot-Encoding*, *Label-Encoding* or some other technique was used for this purpose.
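
For reference, this is roughly what the two manual steps would look like if we did them ourselves. The sketch below is plain Python on hypothetical toy columns loosely modelled on *lead_time* and *deck_risk*; the helper names `impute_mean` and `one_hot` are made up for the illustration, and nothing here is a claim about what H2O does internally.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """Map each value to a 0/1 indicator list, with categories in sorted order."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical toy columns, loosely modelled on lead_time and deck_risk
lead_time = [8.0, None, 2.0, 8.0]
deck_risk = ["No", "Yes", "No", "No"]

lead_time_filled = impute_mean(lead_time)
categories, deck_risk_encoded = one_hot(deck_risk)
print(lead_time_filled)
print(categories, deck_risk_encoded)
```

Mean imputation and one-hot encoding are only two of several options; the point is that with AutoML these choices are made for you, which is convenient but less transparent.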



---






## Evaluating the best model

> Once we have a model in hand, we can analyze its predictions in more depth. However, H2O outputs **a lot of information**, much of which **is not relevant** to the problem at hand, so it is important to know the **concepts** needed to interpret the model output.

In [20]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Pick the XGBoost_3 model from the leaderboard by matching its id
best = h2o.get_model([mid for mid in model_ids if "XGBoost_3_AutoML" in mid][0])

In [21]:
best

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_3_AutoML_20200808_012405


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees
0,,17.0




ModelMetricsBinomial: xgboost
** Reported on train data. **

MSE: 0.047044350738613946
RMSE: 0.21689709711891939
LogLoss: 0.1610520833471124
Mean Per-Class Error: 0.10433515122317316
AUC: 0.9577580364034688
AUCPR: 0.7922496888898677
Gini: 0.9155160728069376

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.36366494596004484: 


Unnamed: 0,Unnamed: 1,No,Yes,Error,Rate
0,No,16114.0,673.0,0.0401,(673.0/16787.0)
1,Yes,586.0,1680.0,0.2586,(586.0/2266.0)
2,Total,16700.0,2353.0,0.0661,(1259.0/19053.0)



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.363665,0.72743,190.0
1,max f2,0.134648,0.787073,283.0
2,max f0point5,0.56264,0.756103,128.0
3,max accuracy,0.449559,0.937333,162.0
4,max precision,0.975378,1.0,0.0
5,max recall,0.006401,1.0,396.0
6,max specificity,0.975378,1.0,0.0
7,max absolute_mcc,0.392434,0.690485,181.0
8,max min_per_class_accuracy,0.147417,0.892119,277.0
9,max mean_per_class_accuracy,0.134648,0.895665,283.0



Gains/Lift Table: Avg response rate: 11.89 %, avg score: 12.11 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010025,0.919607,8.23212,8.23212,0.979058,0.941859,0.979058,0.941859,0.082524,0.082524,723.212016,723.212016,0.082286
1,2,0.020049,0.887975,7.879944,8.056032,0.937173,0.904651,0.958115,0.923255,0.078994,0.161518,687.99439,705.603203,0.160565
2,3,0.030022,0.840555,7.434626,7.849621,0.884211,0.865793,0.933566,0.904168,0.074139,0.235658,643.462628,684.962103,0.233394
3,4,0.040046,0.799348,7.131569,7.669873,0.848168,0.820956,0.912189,0.883338,0.071492,0.307149,613.156934,666.987284,0.303158
4,5,0.050018,0.756747,6.549552,7.446514,0.778947,0.777581,0.885624,0.862253,0.065313,0.372462,554.955173,644.651396,0.365969
5,6,0.100037,0.465398,5.611354,6.528934,0.667366,0.60345,0.776495,0.732851,0.280671,0.653133,461.135412,552.893404,0.627756
6,7,0.150003,0.275405,3.144246,5.401494,0.37395,0.360847,0.642407,0.608937,0.157105,0.810238,214.424596,440.14942,0.749358
7,8,0.200021,0.148499,1.632234,4.458932,0.194124,0.202795,0.530307,0.507375,0.081642,0.89188,63.223351,345.893177,0.78525
8,9,0.300005,0.05266,0.635581,3.184705,0.075591,0.088657,0.378761,0.367826,0.063548,0.955428,-36.44189,218.470451,0.743895
9,10,0.39999,0.027333,0.233929,2.447107,0.027822,0.038397,0.291038,0.28548,0.023389,0.978817,-76.607085,144.710747,0.656961




ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **

MSE: 0.05324434086321282
RMSE: 0.23074735288451917
LogLoss: 0.1808088882074152
Mean Per-Class Error: 0.1210828515382838
AUC: 0.9435671100725138
AUCPR: 0.7267268029227715
Gini: 0.8871342201450276

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.31738859228789806: 


Unnamed: 0,Unnamed: 1,No,Yes,Error,Rate
0,No,15887.0,900.0,0.0536,(900.0/16787.0)
1,Yes,604.0,1662.0,0.2665,(604.0/2266.0)
2,Total,16491.0,2562.0,0.0789,(1504.0/19053.0)



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.317389,0.688484,209.0
1,max f2,0.139234,0.761272,279.0
2,max f0point5,0.562048,0.718993,134.0
3,max accuracy,0.468101,0.928095,162.0
4,max precision,0.982443,1.0,0.0
5,max recall,0.004787,1.0,396.0
6,max specificity,0.982443,1.0,0.0
7,max absolute_mcc,0.353964,0.64521,197.0
8,max min_per_class_accuracy,0.123808,0.87669,287.0
9,max mean_per_class_accuracy,0.111267,0.878917,293.0



Gains/Lift Table: Avg response rate: 11.89 %, avg score: 12.00 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010025,0.916685,7.439724,7.439724,0.884817,0.940574,0.884817,0.940574,0.074581,0.074581,643.972357,643.972357,0.07327
1,2,0.020049,0.87822,7.703856,7.57179,0.91623,0.897564,0.900524,0.919069,0.077229,0.151809,670.385577,657.178967,0.149546
2,3,0.030022,0.832516,7.036343,7.393931,0.836842,0.854885,0.879371,0.897749,0.070168,0.221977,603.634273,639.393142,0.217867
3,4,0.040046,0.7901,6.471239,7.162956,0.769634,0.811329,0.8519,0.876116,0.064872,0.286849,547.123885,616.295595,0.280118
4,5,0.050018,0.744181,6.682313,7.06713,0.794737,0.766732,0.840504,0.854308,0.066637,0.353486,568.231291,606.712995,0.344432
5,6,0.100037,0.456271,5.29373,6.18043,0.629591,0.598158,0.735047,0.726233,0.264784,0.61827,429.37303,518.043013,0.588187
6,7,0.150003,0.26162,3.03826,5.133773,0.361345,0.353814,0.610567,0.60218,0.151809,0.770079,203.826014,413.377309,0.703778
7,8,0.200021,0.146109,1.879274,4.319935,0.223505,0.197532,0.513776,0.500991,0.093998,0.864078,87.927426,331.993488,0.753695
8,9,0.300005,0.053276,0.710615,3.117039,0.084514,0.088372,0.370714,0.363476,0.07105,0.935128,-28.938502,211.703873,0.720855
9,10,0.39999,0.029697,0.300136,2.412905,0.035696,0.039916,0.28697,0.282596,0.030009,0.965137,-69.986448,141.290533,0.641434




Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid
0,accuracy,0.9212722,0.0024595337,0.918438,0.9228468,0.9225319
1,auc,0.9434563,0.00400136,0.9407884,0.9415234,0.9480572
2,aucpr,0.7275399,0.026171621,0.7407156,0.6973988,0.74450517
3,err,0.07872776,0.0024595337,0.08156196,0.077153206,0.07746811
4,err_count,500.0,15.6205,518.0,490.0,492.0
5,f0point5,0.66455394,0.008447769,0.65481985,0.6699693,0.6688726
6,f1,0.6902409,0.013196191,0.67503136,0.69864696,0.6970443
7,f2,0.7180378,0.018658346,0.6965303,0.7298895,0.7276935
8,lift_top_group,7.4007654,0.6171129,7.875744,6.703249,7.623303
9,logloss,0.18080889,0.005443733,0.18251438,0.18519567,0.1747166



See the whole table with table.as_data_frame()

Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
0,,2020-08-08 01:24:19,2.805 sec,0.0,0.5,0.693147,0.5,0.118931,1.0,0.881069
1,,2020-08-08 01:24:19,2.989 sec,5.0,0.250503,0.2488,0.940203,0.738973,7.841852,0.071852
2,,2020-08-08 01:24:20,3.155 sec,10.0,0.22737,0.186119,0.9481,0.762283,8.189245,0.072744
3,,2020-08-08 01:24:20,3.394 sec,15.0,0.218843,0.16484,0.956319,0.785701,8.23212,0.065292
4,,2020-08-08 01:24:20,3.620 sec,17.0,0.216897,0.161052,0.957758,0.79225,8.23212,0.066079



Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,national_inv,4106.75293,1.0,0.373134
1,forecast_3_month,1094.212891,0.266442,0.099419
2,forecast_6_month,904.245605,0.220185,0.082159
3,sales_1_month,824.639465,0.200801,0.074926
4,sales_3_month,788.78595,0.19207,0.071668
5,sales_6_month,677.860107,0.16506,0.06159
6,in_transit_qty,661.551208,0.161089,0.060108
7,sales_9_month,468.179199,0.114002,0.042538
8,forecast_9_month,422.820679,0.102957,0.038417
9,lead_time,270.488251,0.065864,0.024576



See the whole table with table.as_data_frame()
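
Several numbers in the output above can be rebuilt by hand, which is a good way to check that the concepts are understood. The sketch below, in plain Python, recomputes precision, recall and F1 from the two confusion matrices, Gini from the training AUC, and the lift of the top group from its response rate. The helper name `binary_metrics` is made up for this illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class of a 2x2 confusion matrix.

    tn is accepted for completeness; these three metrics do not use it.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the confusion matrices above (rows = actual, columns = predicted)
train = binary_metrics(tn=16114, fp=673, fn=586, tp=1680)
cv = binary_metrics(tn=15887, fp=900, fn=604, tp=1662)
print("train  precision=%.4f recall=%.4f f1=%.4f" % train)
print("cv     precision=%.4f recall=%.4f f1=%.4f" % cv)

# Gini is a rescaling of AUC: Gini = 2 * AUC - 1
train_auc = 0.9577580364034688
gini = 2 * train_auc - 1

# Lift of the top group = its response rate / the overall response rate
overall_rate = 2266 / 19053
top_group_lift = 0.979058 / overall_rate
print("gini=%.6f overall_rate=%.4f top_group_lift=%.3f" % (gini, overall_rate, top_group_lift))
```

Because each confusion matrix is reported at the threshold that maximizes F1, the F1 values computed here match the *max f1* rows of the corresponding metrics tables.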






---






## Saving the model

Done!

With just a few commands (and a lot of concepts!) we are ready to **serialize** our **best model** and put it into production if needed. Note that *h2o.save_model* writes H2O's binary format, while *download_mojo* exports the model as a MOJO, H2O's portable format for deployment.

In [38]:
h2o.save_model(best, path = "./model_bin")

'/content/model_bin/XGBoost_3_AutoML_20200808_012405'

If you want to download the model as a MOJO:

In [41]:
best.download_mojo(path = "./")

'/content/XGBoost_3_AutoML_20200808_012405.zip'
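
To use the saved model later, it has to be loaded back into a running H2O cluster. The sketch below wraps `h2o.load_model` and `predict` in a small helper; the name `reload_and_predict` is hypothetical, and the example assumes the binary model saved with `h2o.save_model` above and an H2O version matching the one that wrote it, since binary models are generally tied to the version that created them.

```python
def reload_and_predict(model_path, frame):
    """Load a binary model written by h2o.save_model and score an H2OFrame with it."""
    # Imported here so the helper can be defined before H2O is installed or started
    import h2o
    model = h2o.load_model(model_path)
    return model.predict(frame)


# Usage, assuming the cluster from h2o.init() is running and `df` is the H2OFrame above:
# preds = reload_and_predict("/content/model_bin/XGBoost_3_AutoML_20200808_012405", df)
```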



---






# Final remarks

We saw that H2O is a very powerful tool that can save us many hours of work, but it is still essential to have the concepts well understood in order to interpret the results reasonably.

An experienced data scientist could also carry out these tasks in a relatively short time with *sklearn* or an equivalent, while keeping more flexibility. Still, H2O can point to the best path, working in the background while you get on with your other day-to-day tasks.

Sources:

1.   https://medium.com/data-hackers/automated-machine-learning-automl-70c1eab669ad
2.   https://www.youtube.com/watch?v=cMlZqpXskWA&list=PLjdDBZW3EmXe_auwS29jLPBZ3_2PpaNiU&index=1
3.   http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html?highlight=get_leaderboard
4.   https://www.kaggle.com/sudalairajkumar/getting-started-with-h2o
5.   https://github.com/h2oai/h2o-tutorials/blob/master/h2o-world-2017/automl/Python/automl_binary_classification_product_backorders.ipynb

