# *[Using AutoML as a starting point]*

**Author:** [Marco Bertani-Økland](https://github.com/mbertani)

**Achievement:** Illustrate the use of AutoML as a starting point to explore different algorithms.

## Introduction

This notebook is based on [https://supervised.mljar.com/](https://supervised.mljar.com/).

Run the notebook, then inspect the results written to the folder `results_diabetes`.

Requirements:

1. You must run `make venv` to verify that all packages are installed.
2. You must have downloaded the [diabetes dataset](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset) into the folder `NBD_22_workshop`.
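
A quick way to confirm the second requirement before running the rest of the notebook is to check that the data file is where the loading cell expects it. The sketch below is only an illustration: `check_dataset` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def check_dataset(path: str) -> Path:
    """Return the dataset path, or raise a clear error if the file is missing."""
    dataset = Path(path)
    if not dataset.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {dataset}. Download it before running the notebook."
        )
    return dataset
```

For example, `check_dataset("../data/train/diabetes_binary_train.csv.zip")` uses the path that the loading cell in this notebook reads; adjust it if your copy of the dataset lives elsewhere.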

# Reproducibility and code formatting

In [None]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in jupyter lab.
%load_ext lab_black

# For automatic code formatting in jupyter notebook
%load_ext nb_black

# For better logging
%load_ext rich

# Analysis

In [None]:
# Imports
# -------

# System
import sys

# Logging
import logging

# Rich logging in jupyter
from rich.logging import RichHandler

FORMAT = "%(message)s"
logging.basicConfig(
    level="INFO", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)

log = logging.getLogger("rich")

# Nice logging example:
# log.error("[bold red blink]Server is shutting down![/]", extra={"markup": True})


# Other packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from sklearn.metrics import accuracy_score

RANDOM_SEED = 42

In [None]:
datapath = "../data/train/diabetes_binary_train.csv.zip"
df = pd.read_csv(datapath, compression="zip")

In [None]:
target_column = "Diabetes_binary"
train_columns = list(df.columns)
train_columns.remove(target_column)

X_train, X_valid, y_train, y_valid = train_test_split(
    df[train_columns], df[target_column], test_size=0.2, random_state=RANDOM_SEED
)

In [None]:
automl = AutoML(
    results_path="results_diabetes",
    mode="Explain",
    ml_task="binary_classification",
    algorithms=["LightGBM", "Extra Trees", "CatBoost", "Linear", "Neural Network"],
    eval_metric="accuracy",
    total_time_limit=180,
    random_state=RANDOM_SEED,
)
model = automl.fit(X_train, y_train)

In [None]:
predictions = automl.predict(X_valid)
print(
    f"Best model accuracy score on validation set: {accuracy_score(y_valid,predictions):.3f}"
)
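
Once the fit has finished, the folder `results_diabetes` holds the per-model results. As a minimal sketch, the helper below reads a leaderboard file from a results folder using only the standard library. It assumes the run wrote a `leaderboard.csv` with `name` and `metric_value` columns; `read_leaderboard` is a hypothetical helper, so check the folder contents and adjust the names if they differ.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def read_leaderboard(results_path: str) -> List[Tuple[str, float]]:
    """Return (model name, metric value) pairs from a leaderboard CSV file."""
    # Assumption: the file name and column names below match what the
    # AutoML run wrote; adjust them if your results folder differs.
    leaderboard_file = Path(results_path) / "leaderboard.csv"
    with leaderboard_file.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Rows are returned in file order, without any ranking applied.
    return [(row["name"], float(row["metric_value"])) for row in rows]
```

Calling `read_leaderboard("results_diabetes")` after the run should then give one entry per trained model, which is a convenient starting point for comparing algorithms.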

# Improving the experimentation

In [None]:
from typing import List


def automl_pipeline(
    frame: pd.DataFrame,
    train_columns: List[str],
    target_column: str,
    results_path: str,
    random_state: int = RANDOM_SEED,
) -> AutoML:
    """Fit AutoML on a train split and log accuracy on the held-out split."""
    # Use the random_state argument (not the global seed) so callers can vary it.
    X_train, X_valid, y_train, y_valid = train_test_split(
        frame[train_columns],
        frame[target_column],
        test_size=0.2,
        random_state=random_state,
    )

    automl = AutoML(
        results_path=results_path,
        mode="Explain",
        ml_task="binary_classification",
        algorithms=["LightGBM", "Extra Trees", "CatBoost", "Linear", "Neural Network"],
        eval_metric="accuracy",
        total_time_limit=180,
        random_state=random_state,
    )

    model = automl.fit(X_train, y_train)
    predictions = automl.predict(X_valid)
    log.info(
        f"Best model accuracy_score on valid set: {accuracy_score(y_valid, predictions):.3f}"
    )
    return model


def experiment_pipeline(
    name: str, target_column: str, sorted_columns: List[str], frame: pd.DataFrame
):
    best_models = []
    for iteration in range(1, len(sorted_columns) + 1):
        experiment = f"{name}-{iteration}"
        log.info(f"Starting: {experiment}")
        train_columns = sorted_columns[0:iteration]
        log.info(f"Training on features: {train_columns}")
        automl_model = automl_pipeline(
            frame=frame,  # use the function argument, not the global `df`
            train_columns=train_columns,
            target_column=target_column,
            results_path=experiment,
        )
        best_models.append(automl_model)

    return best_models

In [None]:
best_models = experiment_pipeline(
    name="experiment",
    target_column="Diabetes_binary",
    sorted_columns=["HighBP", "GenHlth", "HighChol", "BMI", "Age", "Income"],
    frame=df,
)
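
Each iteration of `experiment_pipeline` trains on a growing prefix of `sorted_columns`, so the first run sees only the first feature and the last run sees all of them. The sketch below lists those subsets without training anything; `feature_prefixes` is a hypothetical helper added here for illustration.

```python
from typing import List


def feature_prefixes(sorted_columns: List[str]) -> List[List[str]]:
    """List the feature subsets used by each iteration, shortest first."""
    return [sorted_columns[:k] for k in range(1, len(sorted_columns) + 1)]


columns = ["HighBP", "GenHlth", "HighChol", "BMI", "Age", "Income"]
for iteration, subset in enumerate(feature_prefixes(columns), start=1):
    # The label mirrors the results folder name used by experiment_pipeline.
    print(f"experiment-{iteration}: {subset}")
```

With six columns and `total_time_limit=180` per run, the whole experiment can take on the order of 6 × 180 s, which is about 18 minutes, so it is worth checking the subsets before starting.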

In [None]:
datapath_test = "../data/train/diabetes_binary_test.csv.zip"
# Read the test file, not the training file referenced by `datapath`.
df_test = pd.read_csv(datapath_test, compression="zip")
target_column = "Diabetes_binary"
# best_models[-1] was trained on all six sorted columns, so pass exactly those features.
train_columns = ["HighBP", "GenHlth", "HighChol", "BMI", "Age", "Income"]
X_test, y_test = df_test[train_columns], df_test[target_column]

predictions = best_models[-1].predict(X_test)
log.info(
    f"Best model accuracy score on test set: {accuracy_score(y_test,predictions):.3f}"
)
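
An accuracy score is easier to judge next to a trivial baseline. The sketch below computes accuracy by hand and compares it with the accuracy of always predicting the most frequent label. The helper names and the toy labels are illustrative assumptions and are not taken from the diabetes dataset.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_baseline(y_true: Sequence[int]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Toy labels for illustration only, not taken from the diabetes dataset.
toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 1]
print(f"model accuracy:    {accuracy(toy_true, toy_pred):.3f}")
print(f"majority baseline: {majority_baseline(toy_true):.3f}")
```

On the real data, `majority_baseline(list(y_test))` gives the number that the best model has to beat for its accuracy to be meaningful.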

# Watermark

This should be the last section of your notebook, since it watermarks your whole environment.

When committing this notebook, restart the kernel, rerun the notebook, and run this cell last to watermark the environment.

In [None]:
%watermark -gb -iv -m -v
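
If the `watermark` extension is not available, a rough substitute can be built from the standard library. The sketch below records the Python version, the platform, and the installed versions of a few packages. `environment_snapshot` is a hypothetical helper, the package list is an assumption about what this notebook depends on, and it does not capture the Git information that `-gb` adds.

```python
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable


def environment_snapshot(packages: Iterable[str]) -> Dict[str, str]:
    """Collect the Python version, platform, and installed package versions."""
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep going so one missing package does not hide the others.
            snapshot[name] = "not installed"
    return snapshot


# Assumed distribution names for the packages imported in this notebook.
for key, value in environment_snapshot(
    ["pandas", "scikit-learn", "mljar-supervised"]
).items():
    print(f"{key}: {value}")
```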