# Using AutoML as a starting point

**Author:** [Marco Bertani-Økland](https://github.com/mbertani)

**Achievement:** Illustrate the use of AutoML as a starting point to explore different algorithms.

## Introduction

This notebook is based on [https://supervised.mljar.com/](https://supervised.mljar.com/).

Run the notebook and check the results produced under the folders `experiment-full` and `experiment-<index>`.

Requirements:

1. You must run `make venv` to verify that all packages are installed.
2. You must have downloaded the [diabetes dataset](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset) into the folder `NBD_22_workshop`.

# Reproducibility and code formatting

In [None]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in jupyter lab.
%load_ext lab_black

# For automatic code formatting in jupyter notebook
%load_ext nb_black

# For better logging
%load_ext rich

# Analysis

In [None]:
# Imports
# -------

# System
import sys

# Logging
import logging

# Rich logging in jupyter
from rich.logging import RichHandler

FORMAT = "%(message)s"
logging.basicConfig(
    level="INFO", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)

log = logging.getLogger("rich")

# Nice logging example:
# log.error("[bold red blink]Server is shutting down![/]", extra={"markup": True})


# Other packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from sklearn.metrics import accuracy_score

RANDOM_SEED = 42

In [None]:
# Let's load the training dataset
datapath = "../data/train/diabetes_binary_train.csv.zip"
df = pd.read_csv(datapath, compression="zip")

In [None]:
# Then we create the list of columns we will use for training
target_column = "Diabetes_binary"
train_columns = list(df.columns)
train_columns.remove(target_column)

X_train, X_valid, y_train, y_valid = train_test_split(
    df[train_columns], df[target_column], test_size=0.2, random_state=RANDOM_SEED
)

We configure the AutoML package for binary classification. The package has several [modes](https://supervised.mljar.com/features/modes/), and we use `Perform`, which is intended for real-life scenarios. We chose a few [algorithms](https://supervised.mljar.com/features/algorithms/) to start with, but have a look at the list of available ones if you want to experiment with others (beware: not all of them can be used in the binary classification setup).

We have also changed the evaluation metric from its default to `accuracy`, and set [start_random_models=5](https://supervised.mljar.com/features/automl/#not_so_random) to perform a random search over some hyper-parameters. Finally, we turn off the `train_ensemble` option, which would otherwise build an ensemble from the previously trained models.
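To give an intuition for what a random search over hyper-parameters means, here is a minimal standard-library sketch. It is not how the AutoML package implements its search: the search space, the parameter names, and the helper `sample_configurations` are hypothetical and chosen only for illustration.

```python
import random
from typing import Any, Dict, List

# Hypothetical search space; the ranges AutoML actually uses differ per algorithm.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "learning_rate": [0.025, 0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5, 6],
    "n_estimators": [100, 200, 400],
}


def sample_configurations(
    space: Dict[str, List[Any]], n_models: int, seed: int
) -> List[Dict[str, Any]]:
    """Draw `n_models` configurations, picking each value independently at random."""
    rng = random.Random(seed)  # seeded so that the draws are reproducible
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_models)
    ]


# Five random candidates, in the spirit of start_random_models=5.
for config in sample_configurations(SEARCH_SPACE, n_models=5, seed=42):
    print(config)
```

Each sampled configuration would then be trained and scored, and the best-scoring one kept. Fixing the seed makes the set of candidates repeatable between runs.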

In [None]:
# Using the AutoML package, we configure it for binary classification
automl = AutoML(
    results_path="experiment-full",
    mode="Perform",
    ml_task="binary_classification",
    algorithms=[
        "LightGBM",
        "Extra Trees",
        "CatBoost",
        "Baseline",
        "Decision Tree",
        "Neural Network",
    ],
    eval_metric="accuracy",
    start_random_models=5,
    total_time_limit=1500,
    train_ensemble=False,
    random_state=RANDOM_SEED,
)
model = automl.fit(X_train, y_train)

In [None]:
predictions = automl.predict(X_valid)
print(
    f"Best model accuracy score on validation set: {accuracy_score(y_valid, predictions):.3f}"
)

# Improving the experimentation

In the previous step, we used the full dataset. But what about using only the columns where the predictive power score was highest? We can sort those columns by their predictive power score, run the AutoML pipeline on the first column, and find the best model. Then we add the next column and repeat. At what point do we reach a performance similar to the one obtained with the full dataset?

In this way, we prune the features, which yields a simpler model with fewer data dependencies.
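The incremental procedure described above can be sketched without running AutoML at all. The snippet below is a minimal illustration: the helpers `cumulative_subsets` and `smallest_sufficient_prefix`, the tolerance, and the accuracy values are all hypothetical placeholders, not results from this notebook. It builds the growing feature subsets from a ranked column list and picks the smallest subset whose score comes within a chosen tolerance of the full-dataset score.

```python
from typing import List, Optional


def cumulative_subsets(sorted_columns: List[str]) -> List[List[str]]:
    """Return [first column], [first two columns], ... in ranking order."""
    return [sorted_columns[:k] for k in range(1, len(sorted_columns) + 1)]


def smallest_sufficient_prefix(
    scores: List[float], full_score: float, tolerance: float = 0.01
) -> Optional[int]:
    """Return the size of the first subset whose score is within `tolerance`
    of `full_score`, or None if no subset gets that close."""
    for n_features, score in enumerate(scores, start=1):
        if full_score - score <= tolerance:
            return n_features
    return None


# Ranked columns, as used for the experiments in this notebook.
ranked = ["HighBP", "GenHlth", "HighChol", "BMI", "Age", "Income"]
subsets = cumulative_subsets(ranked)

# Placeholder accuracies, NOT real results: replace them with the validation
# accuracy logged for each experiment and for the full-dataset run.
illustrative_scores = [0.70, 0.72, 0.73, 0.745, 0.748, 0.749]
illustrative_full_score = 0.75

n_needed = smallest_sufficient_prefix(illustrative_scores, illustrative_full_score)
print(subsets[n_needed - 1] if n_needed else "No subset reached the target")
```

The tolerance encodes how much accuracy we are willing to trade for a smaller feature set; a stricter value keeps more columns.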

In [None]:
from typing import List


def automl_pipeline(
    frame: pd.DataFrame,
    train_columns: List[str],
    target_column: str,
    results_path: str,
    random_search_iterations: int,
    max_total_time: int,
    random_state: int = RANDOM_SEED,
) -> AutoML:
    """Create a simple pipeline that will create the train and eval splits on the selected columns and run the auto-ml process.

    Args:
        frame (pd.DataFrame): The input dataset to be split.
        train_columns (List[str]): A list of columns to use for training.
        target_column (str): The target column to predict.
        results_path (str): A name for the folder to store the results.
        random_search_iterations (int): The number of random search hyper-params trials to run.
        max_total_time (int): The max number of seconds the AutoML run can take.
        random_state (int, optional): The random seed to fix the pseudo-random number generators. Defaults to RANDOM_SEED.

    Returns:
        AutoML: an AutoML object containing the best model for each run.
    """
    X_train, X_valid, y_train, y_valid = train_test_split(
        frame[train_columns],
        frame[target_column],
        test_size=0.2,
        random_state=random_state,
    )

    automl = AutoML(
        results_path=results_path,
        mode="Perform",
        ml_task="binary_classification",
        algorithms=[
            "Baseline",
            "Decision Tree",
            "Extra Trees",
            "LightGBM",
            "CatBoost",
            "Neural Network",
        ],
        eval_metric="accuracy",
        start_random_models=random_search_iterations,
        train_ensemble=False,
        total_time_limit=max_total_time,
        random_state=random_state,
    )

    model = automl.fit(X_train, y_train)
    predictions = automl.predict(X_valid)
    log.info(
        f"Best model accuracy score on validation set: {accuracy_score(y_valid, predictions):.3f}"
    )
    return model


def experiment_pipeline(
    name: str,
    target_column: str,
    sorted_columns: List[str],
    frame: pd.DataFrame,
    random_search_iterations: int = 5,
    max_total_time: int = 1500,
    random_state: int = RANDOM_SEED,
) -> List[AutoML]:
    """A method to run several iterations of an AutoML process.

    Args:
        name (str): The name to use as prefix for the results folder. The folders will be created using format `<name>-<index>`.
        target_column (str): The column with the feature to predict.
        sorted_columns (List[str]): A sorted list of feature columns. The process starts by running AutoML with only the first column for training, and adds the next column in each following iteration.
        frame (pd.DataFrame): The dataFrame with all the data, not split beforehand.
        random_search_iterations (int): The number of random search hyper-params trials to run.
        max_total_time (int): The max number of seconds the experiment can run for. Default is 1500 seconds (25 mins).
        random_state (int, optional): The random seed to fix the pseudo-random number generators. Defaults to RANDOM_SEED.

    Returns:
        List[AutoML]: A sorted list of AutoML objects.
    """
    best_models = []
    for iteration in range(1, len(sorted_columns) + 1):
        experiment = f"{name}-{iteration}"
        log.info(f"Starting: {experiment}")
        train_columns = sorted_columns[0:iteration]
        log.info(f"Training on features: {train_columns}")
        automl_model = automl_pipeline(
            frame=frame,
            train_columns=train_columns,
            target_column=target_column,
            results_path=experiment,
            random_search_iterations=random_search_iterations,
            max_total_time=max_total_time,
            random_state=random_state,
        )
        best_models.append(automl_model)
        log.info(f"Ending: {experiment}\n")

    return best_models

In [None]:
# Let's run the experiments.
# Grab a coffee since this will take some time.
best_models = experiment_pipeline(
    name="experiment",
    target_column="Diabetes_binary",
    sorted_columns=["HighBP", "GenHlth", "HighChol", "BMI", "Age", "Income"],
    random_search_iterations=1,
    max_total_time=150,
    frame=df,
)

In [None]:
# Load the held-out test dataset (not the training file)
datapath_test = "../data/train/diabetes_binary_test.csv.zip"
df_test = pd.read_csv(datapath_test, compression="zip")
target_column = "Diabetes_binary"
train_columns = list(df.columns)
train_columns.remove(target_column)
X_test, y_test = df_test[train_columns], df_test[target_column]

predictions = best_models[-1].predict(X_test)
log.info(
    f"Best model accuracy score on test set: {accuracy_score(y_test, predictions):.3f}"
)

# Watermark

This should be the last section of your notebook, since it watermarks your whole environment.

When committing this notebook, remember to restart the kernel, rerun the notebook, and run this cell last to watermark the environment.

In [None]:
%watermark -gb -iv -m -v