# Using AutoML as a starting point

**Author:** [Marco Bertani-Økland](https://github.com/mbertani)

**Achievement:** Illustrate the use of AutoML as a starting point to explore different algorithms.

## Introduction

This notebook is based on [https://supervised.mljar.com/](https://supervised.mljar.com/).

Run the notebook, then inspect the results it writes to the `results_diabetes` folder.

Requirements:

1. Run `make venv` to make sure all required packages are installed.
2. Download the [diabetes dataset](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset) into the folder `NBD_22_workshop`.
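
The second requirement can be checked before any cell runs. The sketch below assumes the CSV path used by this notebook's loading cell; `dataset_is_available` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path

# Path used by the notebook's loading cell, relative to the notebook's folder.
DATA_PATH = Path(
    "../NBD_22_workshop/diabetes_binary_5050split_health_indicators_BRFSS2015.csv"
)


def dataset_is_available(path: Path = DATA_PATH) -> bool:
    """Return True if the dataset exists as a non-empty regular file (hypothetical helper)."""
    return path.is_file() and path.stat().st_size > 0


if not dataset_is_available():
    print(f"Dataset not found at {DATA_PATH}; download it from Kaggle first.")
```

Failing early with a clear message is easier to diagnose than a `FileNotFoundError` raised from inside `pd.read_csv`.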

# Reproducibility and code formatting

In [1]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in JupyterLab.
# (Load the formatter that matches your front end.)
%load_ext lab_black

# For automatic code formatting in the classic Jupyter Notebook.
%load_ext nb_black

# For better logging
%load_ext rich

# Analysis

In [2]:
# Imports
# -------

# System
import sys

# Logging
import logging

# Rich logging in jupyter
from rich.logging import RichHandler

FORMAT = "%(message)s"
logging.basicConfig(
    level="INFO", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)

log = logging.getLogger("rich")

# Nice logging example:
# log.error("[bold red blink]Server is shutting down![/]", extra={"markup": True})


# Other packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

In [3]:
datapath = (
    "../NBD_22_workshop/diabetes_binary_5050split_health_indicators_BRFSS2015.csv"
)
df = pd.read_csv(datapath)

In [4]:
target_column = "Diabetes_binary"
train_columns = list(df.columns)
train_columns.remove(target_column)

In [5]:
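# Hold out 25% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df[train_columns], df[target_column], test_size=0.25, random_state=42
)

# Fit AutoML with its default settings; results are written to results_diabetes.
automl = AutoML(results_path="results_diabetes")
automl.fit(X_train, y_train)

# Predict class labels for the held-out test set.
predictions = automl.predict(X_test)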

Linear algorithm was disabled.
AutoML directory: results_diabetes
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
1_Baseline logloss 0.693147 trained in 0.63 seconds
2_DecisionTree logloss 0.560716 trained in 15.12 seconds
* Step default_algorithms will try to check up to 3 models
3_Default_Xgboost logloss 0.503069 trained in 12.48 seconds
4_Default_NeuralNetwork logloss 0.510378 trained in 11.25 seconds
5_Default_RandomForest logloss 0.53522 trained in 11.37 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.502129 trained in 3.2 seconds
AutoML fit time: 64.01 seconds
AutoML best model: Ensemble
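
The log above ranks models by log loss, where lower is better. The `1_Baseline` score of 0.693147 matches ln 2, which is what a constant prediction of 0.5 scores, so it is consistent with a baseline that knows nothing beyond the balanced class split. The following is a minimal sketch of the metric in plain Python; `binary_log_loss` is an illustrative helper and the labels are made up, not taken from the dataset.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up labels; a constant prediction of 0.5 scores ln(2) on any labels.
labels = [0, 1, 0, 1, 1, 0]
constant_half = [0.5] * len(labels)
baseline = binary_log_loss(labels, constant_half)
print(round(baseline, 6))  # 0.693147

# Probabilities that lean toward the correct class give a lower (better) log loss.
informed = [0.1, 0.9, 0.2, 0.8, 0.7, 0.3]
print(binary_log_loss(labels, informed) < baseline)  # True
```

Read this way, every model below `1_Baseline` in the log has learned something from the features, and the gap to 0.693147 gives a rough sense of how much.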

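The notebook computes `predictions` for the held-out rows but does not score them. One simple option is accuracy together with confusion counts. The sketch below uses plain Python and made-up labels so that it runs on its own; `confusion_counts` and `accuracy` are hypothetical helpers, and in the notebook you would pass `y_test` and `predictions` instead. scikit-learn's `accuracy_score` and `confusion_matrix` provide the same measurements.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels standing in for y_test and predictions.
toy_true = [0, 0, 1, 1, 1, 0, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(toy_true, toy_pred)
print(accuracy(toy_true, toy_pred))  # 0.75
print(counts[(1, 1)], counts[(0, 0)], counts[(0, 1)], counts[(1, 0)])  # 3 3 1 1
```

Because the test rows were never seen during `fit`, these numbers complement the validation log loss printed by AutoML.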

# Watermark

This should be the last section of your notebook, since it watermarks your entire environment.

When committing this notebook, remember to restart the kernel, rerun the notebook, and run this cell last to watermark the environment.

In [6]:
%watermark -gb -iv -m -v

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.4.0

Compiler    : GCC 10.3.0
OS          : Linux
Release     : 5.15.0-1017-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 4
Architecture: 64bit

Git hash: 35e2f62d87aa72cb48f930e06a61764e58edbc5f

Git branch: add-automl-example

pandas : 1.4.3
logging: 0.5.1.2
sys    : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0]

