# Using AutoML as a starting point

**Author:** [Marco Bertani-Økland](https://github.com/mbertani)

**Achievement:** Illustrate the use of AutoML as a starting point to explore different algorithms.

## Introduction

This notebook is based on [https://supervised.mljar.com/](https://supervised.mljar.com/).

Run the notebook and check the results produced under the folders `experiment-full` and `experiment-1` to `experiment-6`.

Requirements:

1. You must run `make venv` to verify that all packages are installed.
2. You must have downloaded the [diabetes dataset](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset) into the folder `NBD_22_workshop`.
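
Before running the notebook, it can help to confirm that the input file is where the code expects it. The following is a minimal sketch: the helper `find_missing_files` is hypothetical (it is not part of the workshop code), and the path is the one used in the data-loading cell of this notebook, so adjust it if your layout differs.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_files(paths: Iterable[str]) -> List[str]:
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not Path(p).is_file()]


# Path taken from the data-loading cell of this notebook (an assumption
# about your local layout; change it if your data lives elsewhere).
expected = ["../data/train/diabetes_binary_train.csv.zip"]
missing = find_missing_files(expected)
if missing:
    print(f"Missing input files: {missing}")
else:
    print("All input files found.")
```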

# Reproducibility and code formatting

In [1]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in JupyterLab
# (load only the formatter extension that matches your frontend).
%load_ext lab_black

# For automatic code formatting in the classic Jupyter Notebook
%load_ext nb_black

# For better logging
%load_ext rich

# Analysis

In [2]:
# Imports
# -------

# System
import sys

# Logging
import logging

# Rich logging in jupyter
from rich.logging import RichHandler

FORMAT = "%(message)s"
logging.basicConfig(
    level="INFO", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)

log = logging.getLogger("rich")

# Nice logging example:
# log.error("[bold red blink]Server is shutting down![/]", extra={"markup": True})


# Other packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from sklearn.metrics import accuracy_score

RANDOM_SEED = 42

In [3]:
# Let's load the training dataset
datapath = "../data/train/diabetes_binary_train.csv.zip"
df = pd.read_csv(datapath, compression="zip")

In [4]:
# Then we create the list of columns we will use for training
target_column = "Diabetes_binary"
train_columns = list(df.columns)
train_columns.remove(target_column)

X_train, X_valid, y_train, y_valid = train_test_split(
    df[train_columns], df[target_column], test_size=0.2, random_state=RANDOM_SEED
)

Using the AutoML package, we configure it for binary classification. The package has several [modes](https://supervised.mljar.com/features/modes/), and we use `Perform`, which is intended for real-life scenarios. We chose some [algorithms](https://supervised.mljar.com/features/algorithms/) to start with, but have a look at the list of available ones if you want to experiment with others (beware: not all of them can be used for binary classification).

We have also changed the default metric to `accuracy` and set [start_random_models=5](https://supervised.mljar.com/features/automl/#not_so_random) to perform a random search over some hyper-parameters. We also turn off the `train_ensemble` option, which would otherwise build an ensemble from the previously trained models.
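
To make the idea of random search concrete, here is a minimal sketch in plain Python. It is not how the AutoML package implements its `not_so_random` step; the function `sample_random_configs` and the search space below are hypothetical and only illustrate drawing a fixed number of hyper-parameter settings at random with a fixed seed.

```python
import random
from typing import Any, Dict, List


def sample_random_configs(
    search_space: Dict[str, List[Any]], n_configs: int, seed: int = 42
) -> List[Dict[str, Any]]:
    """Draw `n_configs` hyper-parameter settings uniformly at random.

    Duplicates are possible, since every draw is independent.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    return [
        {name: rng.choice(values) for name, values in search_space.items()}
        for _ in range(n_configs)
    ]


# Hypothetical search space for a tree-based model (names are illustrative only).
search_space = {
    "max_depth": [2, 3, 4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "min_samples_leaf": [1, 5, 10, 50],
}

# Five candidates, mirroring start_random_models=5.
configs = sample_random_configs(search_space, n_configs=5)
for config in configs:
    print(config)
```

Each sampled configuration would then be trained and scored, and the best one kept. Fixing the seed plays the same role as `random_state` in the cell below: rerunning the search gives the same candidates.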

In [5]:
# Using the AutoML package, we configure it for binary classification
automl = AutoML(
    results_path="experiment-full",
    mode="Perform",
    ml_task="binary_classification",
    algorithms=[
        "LightGBM",
        "Extra Trees",
        "CatBoost",
        "Baseline",
        "Decision Tree",
        "Neural Network",
    ],
    eval_metric="accuracy",
    start_random_models=5,
    total_time_limit=1500,
    train_ensemble=False,
    random_state=RANDOM_SEED,
)
model = automl.fit(X_train, y_train)

AutoML directory: experiment-full
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['LightGBM', 'Extra Trees', 'CatBoost', 'Baseline', 'Decision Tree', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 4 models
1_Baseline accuracy 0.500729 trained in 9.75 seconds (1-sample predict time 0.0223 seconds)
2_DecisionTree accuracy 0.727996 trained in 59.61 seconds (1-sample predict time 0.035 seconds)
3_DecisionTree accuracy 0.700809 trained in 28.98 seconds (1-sample predict time 0.0317 seconds)
4_DecisionTree accuracy 0.700809 trained in 28.76 seconds (1-sample predict time 0.0274 seconds)
* Step default_algorithms will try to check up to 4 models
5_Default_LightGBM accuracy 0.754012 trained in 31.53 seconds (1-sample predict time 0.0342 seconds)
6_Defa

In [6]:
predictions = automl.predict(X_valid)
print(
    f"Best model accuracy score on validation set: {accuracy_score(y_valid,predictions):.3f}"
)

Best model accuracy score on validation set: 0.749
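
An accuracy of 0.749 is easier to judge against a trivial reference. The sketch below computes the accuracy of always predicting the most frequent class. It assumes that this is a fair stand-in for the `1_Baseline` entry in the log above (an assumption about the package's internals, not something this notebook confirms), and it uses toy labels rather than the diabetes data.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_class_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels, not the diabetes data: 6 positives and 4 negatives.
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(f"Majority-class accuracy: {majority_class_accuracy(toy_labels):.3f}")
```

In the notebook, the same helper could be called as `majority_class_accuracy(list(y_valid))`. A value close to 0.5, like the baseline accuracy reported in the log, would indicate roughly balanced classes.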


# Improving the experimentation

In the previous step, we used the full dataset. But what about using only the columns with the highest predictive power score? We can sort those columns by their power score ranking, run the AutoML pipeline on the first column, and find the best model. Then we add the next column and repeat. At what point do we reach a performance similar to the one obtained with the full dataset?

This way we prune the features, which gives a simpler model with fewer data dependencies.

In [7]:
from typing import List


def automl_pipeline(
    frame: pd.DataFrame,
    train_columns: List[str],
    target_column: str,
    results_path: str,
    random_search_iterations: int,
    max_total_time: int,
    random_state: int = RANDOM_SEED,
) -> AutoML:
    """Create a simple pipeline that will create the train and eval splits on the selected columns and run the auto-ml process.

    Args:
        frame (pd.DataFrame): The input dataset to be split.
        train_columns (List[str]): A list of columns to use for training.
        target_column (str): The target column to predict.
        results_path (str): A name for the folder to store the results.
        random_search_iterations (int): The number of random search hyper-params trials to run.
        max_total_time (int): The max number of seconds the AutoML run can take.
        random_state (int, optional): The random seed to fix the pseudo-random number generators. Defaults to RANDOM_SEED.

    Returns:
        AutoML: an AutoML object containing the best model for each run.
    """
    # Split once per run, honoring the seed passed by the caller
    X_train, X_valid, y_train, y_valid = train_test_split(
        frame[train_columns],
        frame[target_column],
        test_size=0.2,
        random_state=random_state,
    )

    automl = AutoML(
        results_path=results_path,
        mode="Perform",
        ml_task="binary_classification",
        algorithms=[
            "Baseline",
            "Decision Tree",
            "Extra Trees",
            "LightGBM",
            "CatBoost",
            "Neural Network",
        ],
        eval_metric="accuracy",
        start_random_models=random_search_iterations,
        train_ensemble=False,
        total_time_limit=max_total_time,
        random_state=random_state,
    )

    model = automl.fit(X_train, y_train)
    predictions = automl.predict(X_valid)
    log.info(
        f"Best model accuracy_score on valid set: {accuracy_score(y_valid,predictions):.3f}"
    )
    return model


def experiment_pipeline(
    name: str,
    target_column: str,
    sorted_columns: List[str],
    frame: pd.DataFrame,
    random_search_iterations: int = 5,
    max_total_time: int = 1500,
    random_state: int = RANDOM_SEED,
) -> List[AutoML]:
    """A method to run several iterations of an AutoML process.

    Args:
        name (str): The name to use as prefix for the results folder. The folders will be created using format `<name>-<index>`.
        target_column (str): The column with the feature to predict.
        sorted_columns (List[str]): A sorted list of feature columns. The process starts by running AutoML with only the first column for training, and adds the next column at each iteration.
        frame (pd.DataFrame): The dataFrame with all the data, not split beforehand.
        random_search_iterations (int): The number of random search hyper-params trials to run.
        max_total_time (int): The max number of seconds the experiment can run for. Default is 1500 seconds (25 mins).
        random_state (int, optional): The random seed to fix the pseudo-random number generators. Defaults to RANDOM_SEED.

    Returns:
        List[AutoML]: A sorted list of AutoML objects.
    """
    best_models = []
    for iteration in range(1, len(sorted_columns) + 1):
        experiment = f"{name}-{iteration}"
        log.info(f"Starting: {experiment}")
        train_columns = sorted_columns[0:iteration]
        log.info(f"Training on features: {train_columns}")
        automl_model = automl_pipeline(
            frame=frame,
            train_columns=train_columns,
            target_column=target_column,
            results_path=experiment,
            random_search_iterations=random_search_iterations,
            max_total_time=max_total_time,
            random_state=random_state,
        )
        best_models.append(automl_model)
        log.info(f"Ending: {experiment}\n")

    return best_models

In [8]:
# Let's run the experiments.
# Grab a coffee since this will take some time.
best_models = experiment_pipeline(
    name="experiment",
    target_column="Diabetes_binary",
    sorted_columns=["HighBP", "GenHlth", "HighChol", "BMI", "Age", "Income"],
    random_search_iterations=1,
    max_total_time=300,
    frame=df,
)

AutoML directory: experiment-1
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Extra Trees', 'LightGBM', 'CatBoost', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 2 models
1_Baseline accuracy 0.500729 trained in 12.81 seconds (1-sample predict time 0.0138 seconds)
2_DecisionTree accuracy 0.690752 trained in 50.97 seconds (1-sample predict time 0.0163 seconds)
* Step default_algorithms will try to check up to 4 models
3_Default_LightGBM accuracy 0.690752 trained in 17.11 seconds (1-sample predict time 0.0144 seconds)
4_Default_CatBoost accuracy 0.690752 trained in 15.81 seconds (1-sample predict time 0.0132 seconds)
5_Default_NeuralNetwork accuracy 0.690752 trained in 46.9 seconds (1-sample predict time 0.0416 seconds)
* Step golde

There was an error during 3_Default_LightGBM_GoldenFeatures training.
Please check experiment-1/errors.md for details.


There was an error during 4_Default_CatBoost_GoldenFeatures training.
Please check experiment-1/errors.md for details.


There was an error during 1_Baseline_GoldenFeatures training.
Please check experiment-1/errors.md for details.
Not enough time to perform features selection. Skip
Time needed for features selection ~ 54.0 seconds
Please increase total_time_limit to at least (603 seconds) to have features selection
Skip insert_random_feature because no parameters were generated.
Skip features_selection because no parameters were generated.
* Step hill_climbing_1 will try to check up to 6 models
6_DecisionTree accuracy 0.690752 trained in 34.04 seconds (1-sample predict time 0.0164 seconds)
7_LightGBM accuracy 0.690752 trained in 21.83 seconds (1-sample predict time 0.0141 seconds)
8_CatBoost accuracy 0.690752 trained in 13.71 seconds (1-sample predict time 0.0138 seconds)
9_CatBoost accuracy 0.690752 trained in 13.97 seconds (1-sample predict time 0.0123 seconds)
10_NeuralNetwork accuracy 0.690752 trained in 31.6 seconds (1-sample predict time 0.0264 seconds)
* Step hill_climbing_2 will try to check up 

AutoML directory: experiment-2
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Extra Trees', 'LightGBM', 'CatBoost', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 2 models
1_Baseline accuracy 0.500729 trained in 12.99 seconds (1-sample predict time 0.0146 seconds)
2_DecisionTree accuracy 0.712325 trained in 43.86 seconds (1-sample predict time 0.0164 seconds)
* Step default_algorithms will try to check up to 4 models
3_Default_LightGBM accuracy 0.713319 trained in 16.26 seconds (1-sample predict time 0.0163 seconds)
4_Default_CatBoost accuracy 0.712325 trained in 14.49 seconds (1-sample predict time 0.0126 seconds)
5_Default_NeuralNetwork accuracy 0.712325 trained in 40.65 seconds (1-sample predict time 0.0256 seconds)
6_Default_E

AutoML directory: experiment-3
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Extra Trees', 'LightGBM', 'CatBoost', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 2 models
1_Baseline accuracy 0.500729 trained in 13.2 seconds (1-sample predict time 0.0154 seconds)
2_DecisionTree accuracy 0.72152 trained in 49.77 seconds (1-sample predict time 0.0154 seconds)
* Step default_algorithms will try to check up to 4 models
3_Default_LightGBM accuracy 0.722293 trained in 17.6 seconds (1-sample predict time 0.0156 seconds)
4_Default_CatBoost accuracy 0.722426 trained in 15.72 seconds (1-sample predict time 0.0138 seconds)
5_Default_NeuralNetwork accuracy 0.720813 trained in 42.83 seconds (1-sample predict time 0.0269 seconds)
* Step golden_

AutoML directory: experiment-4
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Extra Trees', 'LightGBM', 'CatBoost', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 2 models
1_Baseline accuracy 0.500729 trained in 15.3 seconds (1-sample predict time 0.014 seconds)
2_DecisionTree accuracy 0.724725 trained in 52.06 seconds (1-sample predict time 0.0182 seconds)
* Step default_algorithms will try to check up to 4 models
3_Default_LightGBM accuracy 0.736484 trained in 22.9 seconds (1-sample predict time 0.0164 seconds)
4_Default_CatBoost accuracy 0.736484 trained in 17.13 seconds (1-sample predict time 0.0147 seconds)
5_Default_NeuralNetwork accuracy 0.734075 trained in 60.9 seconds (1-sample predict time 0.0545 seconds)
Skip golden_fea

AutoML directory: experiment-5
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Extra Trees', 'LightGBM', 'CatBoost', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 2 models
1_Baseline accuracy 0.500729 trained in 17.21 seconds (1-sample predict time 0.0198 seconds)
2_DecisionTree accuracy 0.727996 trained in 54.93 seconds (1-sample predict time 0.0175 seconds)
* Step default_algorithms will try to check up to 4 models
3_Default_LightGBM accuracy 0.745878 trained in 25.05 seconds (1-sample predict time 0.0172 seconds)
4_Default_CatBoost accuracy 0.747513 trained in 21.11 seconds (1-sample predict time 0.0173 seconds)
5_Default_NeuralNetwork accuracy 0.744109 trained in 69.97 seconds (1-sample predict time 0.0294 seconds)
Skip golden

AutoML directory: experiment-6
The task is binary_classification with evaluation metric accuracy
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Extra Trees', 'LightGBM', 'CatBoost', 'Neural Network']
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2']
* Step simple_algorithms will try to check up to 2 models
1_Baseline accuracy 0.500729 trained in 17.57 seconds (1-sample predict time 0.0173 seconds)
2_DecisionTree accuracy 0.727996 trained in 55.44 seconds (1-sample predict time 0.0195 seconds)
* Step default_algorithms will try to check up to 4 models
3_Default_LightGBM accuracy 0.748 trained in 25.45 seconds (1-sample predict time 0.0181 seconds)
4_Default_CatBoost accuracy 0.748619 trained in 21.52 seconds (1-sample predict time 0.0201 seconds)
5_Default_NeuralNetwork accuracy 0.747005 trained in 78.07 seconds (1-sample predict time 0.0368 seconds)
Skip golden_fe
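
The logs above can be used to answer the question of when the reduced feature set catches up with the full dataset. The sketch below is only illustrative: the helper `first_within_tolerance` is hypothetical, the tolerance of 0.01 is an arbitrary choice, and the scores are the highest accuracies visible in the truncated logs, so each run's actual best score may be higher.

```python
from typing import Optional, Sequence


def first_within_tolerance(
    scores: Sequence[float], reference: float, tolerance: float
) -> Optional[int]:
    """Return the 1-based index of the first score within `tolerance` of `reference`."""
    for n_features, score in enumerate(scores, start=1):
        if reference - score <= tolerance:
            return n_features
    return None


# Highest accuracy visible in the truncated logs for experiment-1 .. experiment-6.
visible_best = [0.690752, 0.713319, 0.722426, 0.736484, 0.747513, 0.748619]
# Highest accuracy visible in the truncated log of the full-dataset run.
full_dataset = 0.754012

n_features = first_within_tolerance(visible_best, full_dataset, tolerance=0.01)
print(f"First run within 0.01 of the full-dataset score: {n_features}")
```

With these numbers, the run with five features is the first one within 0.01 of the full-dataset score. The complete leaderboards in each `experiment-<index>` folder are the better source for a final comparison.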

In [9]:
# Load the held-out test set
datapath_test = "../data/train/diabetes_binary_test.csv.zip"
df_test = pd.read_csv(datapath_test, compression="zip")
target_column = "Diabetes_binary"
train_columns = list(df_test.columns)
train_columns.remove(target_column)
X_test, y_test = df_test[train_columns], df_test[target_column]

predictions = best_models[-1].predict(X_test)
log.info(
    f"Best model accuracy score on test set: {accuracy_score(y_test,predictions):.3f}"
)

# Watermark

This should be the last section of your notebook, since it watermarks your whole environment.

When committing this notebook, remember to restart the kernel, rerun the notebook, and run this cell last to watermark the environment.

In [10]:
%watermark -gb -iv -m -v

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.5.0

Compiler    : Clang 12.0.1 
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 12
Architecture: 64bit

Git hash: e97b29b94dd8f03e671866246316a00340217f3b

Git branch: main

sys    : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:06:49) 
[Clang 12.0.1 ]
pandas : 1.4.3
logging: 0.5.1.2

