***
# <font color=red>Building an Image Classifier using AutoMLx</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle AutoMLx Team </font></p>

***

Image Classification Demo Notebook.

Copyright © 2023, Oracle and/or its affiliates.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

## Overview of this Notebook

In this notebook, we build an image classifier with the Oracle AutoMLx tool for the public PneumoniaMNIST dataset, which is part of the MedMNIST collection. PneumoniaMNIST is a binary classification dataset (normal versus pneumonia chest X-ray images); more details about the dataset can be found at https://medmnist.com/.
We explore the options that the Oracle AutoMLx tool provides to give the user control over the AutoML training process. We then evaluate the models trained by AutoML.

---
## Prerequisites

  - Experience level: Novice (Python and Machine Learning)
  - Professional experience: Some industry experience
---

## Business Use

Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:
- Pick an appropriate model for the given dataset and prediction task at hand.
- Tune the chosen model’s hyperparameters for the given dataset.

Both steps are time-consuming and rely heavily on data science expertise. To make the problem harder, the best model and hyperparameter choices vary widely with the dataset and the prediction task, so there is no one-size-fits-all solution for achieving reasonably good model performance. Using a simple Python API, AutoMLx can quickly jump-start the data science process with an accurately tuned model for a given prediction task.

## Table of Contents

- <a href='#setup'>Setup</a>
- <a href='#load-data'>Load the PneumoniaMNIST dataset</a>
- <a href='#AutoML'>AutoML</a>
  - <a href='#provider'>Create an Instance of Oracle AutoMLx</a>
  - <a href='#default'>Train a Model using AutoML</a>
  - <a href='#analyze'>Analyze the AutoML optimization process</a>
      - <a href='#algorithm-selection'>Algorithm Selection</a>
      - <a href='#model-tuning'>Model Tuning</a>
  - <a href='#advanced'>Advanced AutoML Configuration</a>
- <a href='#ref'>References</a>

<a id='setup'></a>
## Setup

Basic setup for the Notebook.

In [None]:

%matplotlib inline
%load_ext autoreload
%autoreload 2

Load the required modules.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Settings for plots
plt.rcParams['figure.figsize'] = [4, 3]
plt.rcParams['font.size'] = 15

import automlx

<a id='load-data'></a>
## Load the PneumoniaMNIST dataset
We start by reading in the dataset from Hugging Face.

In [None]:
dataset = load_dataset("albertvillanova/medmnist-v2", "pneumoniamnist")

Let's look at a few of the values in the data.

In [None]:
dataset["train"][:5]

Display one of the images, together with its label, as an example.

In [None]:
print("Pneumonia" if dataset["train"][0]['label'] == 1 else 'Normal')
dataset["train"][0]['image']

We visualize the distribution of the target variable in the training data.

In [None]:
y_df = pd.DataFrame(dataset["train"]["label"])
y_df.columns = ['label']

fig = px.histogram(y_df["label"].apply(lambda x: "Normal" if x == 0 else "Pneumonia"), x="label", barmode="group")
fig.show()

We now separate the labels (`y`) from the images (`X`) and split the data into training (70%) and test (30%) sets. The training set is used to build a Machine Learning model with AutoML, and the test set is used to evaluate the model's performance on unseen data.

In [None]:
X = pd.DataFrame(dataset["train"]["image"], columns=['images'])
y = pd.DataFrame(dataset["train"]["label"])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# reducing the number of samples in training set to speed up the demo
X_train = X_train[:1000]
y_train = y_train[:1000]
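
If the histogram above shows that the two classes are not equally frequent, it can be worth checking that both splits keep a similar class balance. The sketch below uses synthetic labels rather than the PneumoniaMNIST data; the 3:1 class ratio and the use of the `stratify` argument are illustrative assumptions and are not part of the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary labels (illustrative only, not the real dataset)
labels = np.array([1] * 750 + [0] * 250)
features = np.arange(len(labels)).reshape(-1, 1)

# Plain random split, in the same style as the cell above
_, _, y_tr_plain, y_te_plain = train_test_split(
    features, labels, train_size=0.7, random_state=0
)

# Stratified split: each part keeps the overall class proportions
_, _, y_tr_strat, y_te_strat = train_test_split(
    features, labels, train_size=0.7, random_state=0, stratify=labels
)

parts = [
    ("plain train", y_tr_plain),
    ("plain test", y_te_plain),
    ("stratified train", y_tr_strat),
    ("stratified test", y_te_strat),
]
for name, part in parts:
    # The mean of 0/1 labels is the fraction of positive samples
    print(f"{name:>16}: positive fraction = {part.mean():.3f}")
```

With stratification, the positive fraction in both parts stays at the overall value; a plain random split only matches it approximately.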

<a id='AutoML'></a>
## AutoML

<a id='provider'></a>
### Create an instance of Oracle AutoMLx

The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset.

AutoML consists of three main steps for image classification:
- **Algorithm Selection** : Identify the right classification algorithm for a given dataset.
- **Adaptive Sampling** : Select a subset of the data samples for the model to be trained on.
- **Model Tuning** : Find the right model parameters (including model size) that maximize score for the given dataset.

All these pieces are readily combined into a simple AutoML pipeline which automates the entire Machine Learning process with minimal user input/interaction.
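
To make the three stages more concrete, here is a minimal sketch that imitates them with plain scikit-learn on synthetic tabular data. It is not how AutoMLx implements these stages internally: the candidate models, the sample sizes, the 0.01 improvement threshold, and the parameter grids are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

def cv_score(model, n_rows):
    """Mean 3-fold balanced accuracy on the first n_rows samples."""
    return cross_val_score(
        model, X_toy[:n_rows], y_toy[:n_rows], cv=3, scoring="balanced_accuracy"
    ).mean()

# Stage 1 (algorithm selection): rank the candidates cheaply on a small subset
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}
selection_scores = {name: cv_score(model, 200) for name, model in candidates.items()}
best_name = max(selection_scores, key=selection_scores.get)

# Stage 2 (sampling): grow the sample until the score stops improving noticeably
chosen_n, chosen_score = None, -np.inf
for n_rows in (150, 300, 600):
    score = cv_score(candidates[best_name], n_rows)
    if chosen_n is not None and score - chosen_score < 0.01:
        break  # the larger sample did not help enough, keep the smaller one
    chosen_n, chosen_score = n_rows, score

# Stage 3 (model tuning): search hyperparameters of the selected algorithm only
param_grids = {
    "LogisticRegression": {"C": [0.1, 1.0, 10.0]},
    "RandomForest": {"max_depth": [3, 6, None]},
}
search = GridSearchCV(
    candidates[best_name], param_grids[best_name], cv=3, scoring="balanced_accuracy"
)
search.fit(X_toy[:chosen_n], y_toy[:chosen_n])

print(f"Selected algorithm : {best_name}")
print(f"Rows used          : {chosen_n}")
print(f"Best parameters    : {search.best_params_}")
```

The point of the ordering is cost: the cheap stages narrow the choice of algorithm and data size before the expensive hyperparameter search runs.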

<a id='default'></a>
### Train a model using AutoML

The AutoMLx API is quite simple to work with. We create an instance of the pipeline. Next, the training data is passed to the `fit()` function which executes the previously mentioned steps.
Note that we decreased the number of tuning trials in Model Tuning to speed up the demo notebook.

In [None]:
est1 = automlx.Pipeline(task='classification', max_tuning_trials=10, score_metric="balanced_accuracy")
est1.fit(X_train, y_train)

The fitted pipeline (`est1`) can now be used for prediction tasks. We use the balanced accuracy metric to evaluate the performance of this model on unseen data (`X_test`).

In [None]:
y_pred = est1.predict(X_test)
score_default = balanced_accuracy_score(y_test, y_pred)

print(f'Score on test data : {score_default:.4f}')
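
Balanced accuracy is the average of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The toy labels below are made up for illustration and are unrelated to the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels: 8 samples of class 0 and 2 samples of class 1
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that misses one of the two class-1 samples
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

plain_accuracy = accuracy_score(y_true_toy, y_pred_toy)
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)
balanced = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(f"Accuracy          : {plain_accuracy:.2f}")
print(f"Recall per class  : {per_class_recall}")
print(f"Balanced accuracy : {balanced:.2f}")
```

On these toy labels, plain accuracy is 0.90 while balanced accuracy is 0.75, because the recall of the minority class is only 0.5.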

<a id='analyze'></a>
### Analyze the AutoML optimization process

During the AutoML process, a summary of the optimization process is logged. It consists of:
- Information about the training data.
- Information about the AutoML pipeline, such as:
    - The selected algorithm that was the best choice for this data;
    - The selected hyperparameters for that algorithm.

AutoMLx provides a `print_summary` API to output all the different trials performed.

In [None]:
est1.print_summary()

We also provide the capability to visualize the results of each stage of the AutoML pipeline.

<a id='algorithm-selection'></a>
#### Algorithm Selection

The plot below shows the scores predicted by Algorithm Selection for each algorithm. The horizontal line shows the average score across all algorithms. The selected algorithm is colored orange; of the remaining algorithms, those with a score at or above the mean are colored teal, and those below the mean are colored turquoise.

In [None]:
# Each trial is a row in a dataframe that contains
# Algorithm, Number of Samples, Number of Features, Hyperparameters, Score, Runtime, Memory Usage, Step as features
summary = est1.completed_trials_summary_
# Copy the filtered rows so the cleanup below does not act on a slice of the summary
trials = summary[summary["Step"].str.contains('Model Selection')].copy()
name_of_score_column = f"Score ({est1._inferred_score_metric[0].name})"
# Treat infinite scores as missing and drop trials without a usable score
trials = trials.replace([np.inf, -np.inf], np.nan)
trials = trials.dropna(subset=[name_of_score_column])
colors = []
scores = trials[name_of_score_column].tolist()
models = trials['Algorithm'].tolist()
y_margin = 0.10 * (max(scores) - min(scores))
s = pd.Series(scores, index=models).sort_values(ascending=False)
s = s.dropna()
for f in s.keys():
    if f.strip()  ==  est1.selected_model_.strip():
        colors.append('orange')
    elif s[f] >= s.mean():
        colors.append('teal')
    else:
        colors.append('turquoise')

fig, ax = plt.subplots(1)
ax.set_title("Algorithm Selection Trials")
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.set_ylabel(est1._inferred_score_metric[0].name)
s.plot.bar(ax=ax, color=colors, edgecolor='black')
ax.axhline(y=s.mean(), color='black', linewidth=0.5)
plt.show()

<a id='model-tuning'></a>
#### Model Tuning

Model Tuning is the last stage of AutoML and focuses on improving the chosen algorithm's score on the reduced dataset (after Adaptive Sampling). AutoML uses a novel algorithm to search across many hyperparameter dimensions and converges automatically when optimal hyperparameters are identified. Each trial represents a particular hyperparameter configuration for the selected model. The plot below shows the best score found so far after each trial.

In [None]:
# Each trial is a row in a dataframe that contains
# Algorithm, Number of Samples, Number of Features, Hyperparameters, Score, Runtime, Memory Usage, Step as features
summary = est1.completed_trials_summary_
# Copy the filtered rows so the cleanup below does not act on a slice of the summary
trials = summary[summary["Step"].str.contains('Model Tuning')].copy()
trials = trials.replace([np.inf, -np.inf], np.nan)
trials = trials.dropna(subset=[name_of_score_column])
# Order the trials by completion time
trials = trials.sort_values(by=['Finished'], ascending=True)
scores = trials[name_of_score_column].tolist()
score = []
score.append(scores[0])
for i in range(1,len(scores)):
    if scores[i]>= score[i-1]:
        score.append(scores[i])
    else:
        score.append(score[i-1])
y_margin = 0.10 * (max(score) - min(score))

fig, ax = plt.subplots(1)
ax.set_title("Model Tuning Trials")
ax.set_xlabel("Iteration $n$")
ax.set_ylabel(est1._inferred_score_metric[0].name)
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(score) - y_margin, max(score) + y_margin)
ax.plot(range(1, len(trials) + 1), score, linestyle=':', marker="s", color='teal', markersize=3)
plt.show()
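
The loop in the cell above builds a "best score so far" curve, which is a cumulative maximum. As a small sketch, the same curve can be computed with `np.maximum.accumulate`; the trial scores below are hypothetical values used only to show the equivalence.

```python
import numpy as np

# Hypothetical trial scores in completion order (illustrative values only)
trial_scores = [0.70, 0.65, 0.80, 0.78, 0.85]

# Explicit loop, mirroring the logic of the plotting cell
best_so_far_loop = [trial_scores[0]]
for current in trial_scores[1:]:
    best_so_far_loop.append(max(current, best_so_far_loop[-1]))

# Vectorized equivalent: cumulative maximum over the score sequence
best_so_far = np.maximum.accumulate(trial_scores)

print(best_so_far_loop)
print(best_so_far.tolist())
```

Both approaches return the same non-decreasing sequence, so a trial that scores worse than an earlier one shows up as a flat segment in the plot.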

<a id='advanced'></a>
### Advanced AutoML Configuration

To customize the Model Tuning step, you can specify the hyperparameter ranges for each model and pass them to the pipeline.

In [None]:
# Hyperparameters shared by every model in the search space
shared_hyperparameters = {
    "epochs": {'range': [1, 5], 'type': 'discrete'},
    "batch_size": {'range': [16, 32], 'type': 'discrete'},
}
# Per-model search space: a model-specific "size" plus the shared hyperparameters
search_space = {
    "ResNet": {
        "size": {'range': ["18", "101"], 'type': 'categorical'},
        **shared_hyperparameters,
    },
    "EfficientNet": {
        "size": {'range': ["b2", "b6"], 'type': 'categorical'},
        **shared_hyperparameters,
    },
}
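
One Python detail of the cell above is worth knowing: `**` unpacking is shallow, so both model entries reference the same inner dictionaries. The sketch below demonstrates this with plain dictionaries and no AutoMLx calls; `build_space` is a hypothetical helper introduced here for illustration, not part of the AutoMLx API.

```python
import copy

shared = {
    "epochs": {"range": [1, 5], "type": "discrete"},
    "batch_size": {"range": [16, 32], "type": "discrete"},
}
space = {
    "ResNet": {"size": {"range": ["18", "101"], "type": "categorical"}, **shared},
    "EfficientNet": {"size": {"range": ["b2", "b6"], "type": "categorical"}, **shared},
}

# Each model entry holds its own "size" plus the shared settings
print(sorted(space["ResNet"]))

# Shallow unpacking: both models point at the very same "epochs" dictionary
same_object = space["ResNet"]["epochs"] is space["EfficientNet"]["epochs"]
print(same_object)

def build_space(sizes_by_model):
    """Hypothetical helper: give every model its own copy of the shared block."""
    return {
        model: {
            "size": {"range": sizes, "type": "categorical"},
            **copy.deepcopy(shared),
        }
        for model, sizes in sizes_by_model.items()
    }

independent = build_space({"ResNet": ["18", "101"], "EfficientNet": ["b2", "b6"]})
# Changing one model's range no longer affects the other model or the shared block
independent["ResNet"]["epochs"]["range"] = [1, 3]
print(independent["EfficientNet"]["epochs"]["range"])
```

Sharing is harmless as long as the dictionaries are only read; per-model copies matter only if you later edit one model's ranges in place.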

You can also configure the pipeline with suitable parameters according to your needs.

In [None]:
custom_pipeline = automlx.Pipeline(
    task='classification',
    model_list=[                 # Specify the models you want the AutoML to consider
        'ResNet',
        'EfficientNet',
    ],
    n_algos_tuned=2,             # Choose how many models to tune
    adaptive_sampling=False,     # Disable or enable the Adaptive Sampling step. Defaults to `True`
    search_space=search_space,   # You can specify the hyper-parameters and ranges AutoML searches
    max_tuning_trials=2,         # The maximum number of tuning trials. Can be integer or Dict (max number for each model)
    score_metric='f1_macro',     # Any scikit-learn metric or a custom function
)

A few of the advanced settings can be passed directly to the pipeline's fit method, instead of its constructor.

In [None]:
custom_pipeline.fit(
    X_train,
    y_train,
    time_budget=20,     # Specify time budget in seconds
    cv='auto'           # Automatically pick a good cross-validation (cv) strategy for the user's dataset.
                        # Ignored if X_valid and y_valid are provided.
                        # Can also be:
                        #   - An integer (for example, to use 5-fold cross validation)
                        #   - A list of data indices to use as splits (for advanced use cases, such as time-based splitting)
)
y_pred = custom_pipeline.predict(X_test)
# Use a separate variable so the score of the default pipeline is not overwritten
score_custom = balanced_accuracy_score(y_test, y_pred)

print(f'Score on test data : {score_custom:.4f}')
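
The custom pipeline is tuned with `score_metric='f1_macro'`, while the cell above reports balanced accuracy, and the two metrics can disagree. The sketch below compares them on made-up labels (not on the pipeline's predictions) using scikit-learn's `f1_score` with `average="macro"`.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Made-up labels: 3 samples of class 0 and 7 samples of class 1
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
# A classifier that over-predicts class 1
y_pred_toy = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# F1 per class, then their unweighted mean (macro average)
f1_per_class = f1_score(y_true_toy, y_pred_toy, average=None)
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")
balanced = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(f"F1 per class      : {f1_per_class}")
print(f"Macro F1          : {macro_f1:.4f}")
print(f"Balanced accuracy : {balanced:.4f}")
```

Macro F1 combines precision and recall per class, whereas balanced accuracy uses recall only, so reporting the metric that was optimized alongside the one used for comparison gives a fuller picture.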

<a id='ref'></a>
## References
* Oracle AutoML http://www.vldb.org/pvldb/vol13/p3166-yakovlev.pdf
* scikit-learn https://scikit-learn.org/stable/
* Hugging Face https://huggingface.co/
* MedMNIST Dataset https://medmnist.com/