In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2020, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

***

# <font color="red">Introduction to ADSTuner</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

***

# Overview:

A hyperparameter is a parameter that is used to control a learning process. This is in contrast to other parameters that are learned in the training process. The process of hyperparameter optimization is to search for hyperparameter values by building many models and assessing their quality. This notebook provides an overview of the `ADSTuner` hyperparameter optimization engine. `ADSTuner` can optimize any estimator object that follows the [scikit-learn API](https://scikit-learn.org/stable/modules/classes.html).

Compatible conda pack: [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-gml-fam.htm) for CPU on Python 3.8 (version 1.0)

## Contents:

- <a href='#intro'>Introduction</a>
    - <a href='#ntrials'>Synchronous Tuning with Exit Criterion Based on Number of Trials</a>
    - <a href='#resume'>Asynchronously Tuning with Exit Criterion Based on Time Budget</a>
    - <a href='#inspect'>Inspecting the Tuning Trials</a>
- <a href='#custom'>Defining a Custom Search Space and Score</a>  
    - <a href='#search-space'>Changing the Search Space Strategy</a>
- <a href='#pipeline'>Optimizing a scikit-learn `Pipeline()`</a> 
- <a href="#reference">References</a>

---


Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle. 

You can access the `iris` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING).  

---


In [None]:
import category_encoders as ce
import logging
import numpy as np
import os
import pandas as pd
import sklearn
import time

from ads.hpo.stopping_criterion import *
from ads.hpo.distributions import *
from ads.hpo.search_cv import ADSTuner, State

from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

<a id='intro'></a>
# Introduction

Hyperparameter optimization requires a model, dataset, and an `ADSTuner` object to perform the search.

`ADSTuner()` performs a hyperparameter search using [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). You can specify the number of folds with the `cv` parameter.
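
To make the role of `cv` concrete, the following minimal sketch shows how plain k-fold cross-validation partitions sample indices so that each fold serves once as the validation set. It uses only the Python standard library. The helper name `k_fold_indices` is hypothetical, and the sketch assumes unshuffled, unstratified folds; it does not claim to reproduce the splitting scheme that `ADSTuner` uses internally.

```python
def k_fold_indices(n_samples, cv):
    """Yield (train, validation) index lists for plain, unshuffled k-fold CV."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // cv + (1 if i < n_samples % cv else 0) for i in range(cv)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=7, cv=3))
for train, validation in folds:
    print(f"train={train} validation={validation}")
```

Each candidate set of hyperparameters is scored on every validation fold and the fold scores are combined, so a larger `cv` gives a steadier estimate at the cost of more model fits per trial.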

`ADSTuner()` needs a search space in which to tune the hyperparameters, which you set with the `strategy` parameter. You can set this parameter in two ways: use a built-in default or specify the search criteria yourself. For the supported model classes, `ADSTuner` provides `perfunctory` and `detailed` search spaces that are optimized for the class of model being used. The `perfunctory` option uses a small search space so that only the most important hyperparameters are tuned. Generally, you use this option early in your search because it reduces the computational cost and lets you assess the quality of the model class. The `detailed` option instructs `ADSTuner` to cover a broad search space by tuning more hyperparameters. Typically, you use it after you have determined which class of model is best suited to the dataset and the type of problem. If you have experience with the dataset and a good idea of the best hyperparameter values, you can specify the search space explicitly by passing a dictionary that defines it to `strategy`.

The parameter `storage` takes a database URL. For example, `sqlite:////home/datascience/example.db`. When `storage` is set to the default value `None`, a new sqlite database file is created internally in the `tmp` folder with a unique name. The name format is `sqlite:////tmp/hpo_*.db`. `study_name` is the name of this study for this `ADSTuner` object. One `ADSTuner` object only has one `study_name`. However, one database file can be shared among different `ADSTuner` objects. `load_if_exists` controls whether to load an existing study from an existing database file. If `False`, it raises a `DuplicatedStudyError` when the `study_name` exists.
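
The four slashes in the example URLs can look like a typo. The sketch below shows where they come from, assuming the common convention in which the prefix `sqlite:///` is followed by the file path, so that an absolute POSIX path contributes a fourth slash. The helper `sqlite_url` is hypothetical and is not part of ADS.

```python
import posixpath


def sqlite_url(db_path):
    """Build a SQLite database URL from an absolute POSIX file path."""
    if not posixpath.isabs(db_path):
        raise ValueError("expected an absolute path such as /home/datascience/example.db")
    # "sqlite:///" is the prefix; the path's own leading "/" is the fourth slash.
    return "sqlite:///" + db_path


url = sqlite_url("/home/datascience/example.db")
print(url)
```

You could pass a URL built this way as the `storage` argument when you want the study to persist in a location you control rather than in the `tmp` folder.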

The `loglevel` parameter controls the amount of logging information displayed in the notebook.

This notebook uses the scikit-learn `SGDClassifier()` model and the iris dataset. This model object is a regularized linear model with [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) used to optimize the model parameters.

The next cell creates the `SGDClassifier()` model, initializes an `ADSTuner` object, and loads the iris data.

In [None]:
tuner = ADSTuner(SGDClassifier(), cv=3, loglevel=logging.WARNING)
X, y = load_iris(return_X_y=True)

Each model class has a set of hyperparameters that you need to optimize. The `strategy` attribute returns the strategy that is being used. This can be `perfunctory`, `detailed`, or a dictionary that defines the strategy. The `search_space()` method always returns a dictionary of the hyperparameters that are to be searched. Any hyperparameter that is required by the model, but is not listed, uses the default value that is defined by the model class. To see which search space is used for your model class when `strategy` is `perfunctory` or `detailed`, call the `search_space()` method.

The `adstuner_search_space_update.ipynb` notebook has detailed examples about how to work with and update the search space.

The next cell displays the search strategy and the search space.

In [None]:
print(f'Search Space for strategy "{tuner.strategy}" is: \n {tuner.search_space()}')

The `tune()` method starts a tuning process. It has a synchronous and an asynchronous mode, which you select with the `synchronous` parameter. When it is set to `False`, the tuning process runs asynchronously in the background, so you can continue to work in the notebook. When `synchronous` is set to `True`, the notebook is blocked until `tune()` finishes running. The `adstuner_sync_and_async.ipynb` notebook illustrates this feature in more detail.

The `ADSTuner` object needs to know when to stop tuning. The `exit_criterion` parameter accepts a list of criteria that cause the tuning to finish. If any of the criteria are met, then the tuning process stops. Valid exit criteria are:

* `NTrials(n)`: Run for `n` number of trials.
* `TimeBudget(t)`: Run for `t` seconds.
* `ScoreValue(s)`: Run until the score value exceeds `s`.

The default behavior is to run for 50 trials (`NTrials(50)`).

The stopping criteria are listed in the [`ads.hpo.stopping_criterion`](https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/ads.hpo.html#module-ads.hpo.stopping_criterion) module.
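
The "stop as soon as any criterion is met" rule can be sketched in plain Python. The function `run_trials` and its keyword arguments are hypothetical stand-ins for the three criteria listed above; they are not the ADS classes, and the trial scores are made-up numbers used only to trace the logic.

```python
import time


def run_trials(scores, max_trials=None, time_budget=None, target_score=None):
    """Consume trial scores until any configured stopping rule is met."""
    start = time.monotonic()
    best = float("-inf")
    completed = 0
    for score in scores:
        # Check every rule before starting the next trial; one hit is enough to stop.
        if max_trials is not None and completed >= max_trials:
            break
        if time_budget is not None and time.monotonic() - start >= time_budget:
            break
        if target_score is not None and best > target_score:
            break
        best = max(best, score)
        completed += 1
    return completed, best


trial_scores = [0.61, 0.74, 0.93, 0.88, 0.97, 0.90]  # illustrative values only
only_count = run_trials(trial_scores, max_trials=5)
count_or_score = run_trials(trial_scores, max_trials=5, target_score=0.9)
print(only_count, count_or_score)
```

With only the trial limit, five trials run. Adding the score target ends the search after the third trial, because the best score already exceeds the target before the fourth trial starts.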

<a id='ntrials'></a>
## Synchronous Tuning with Exit Criterion Based on Number of Trials

This section demonstrates how to perform a synchronous tuning process with the exit criteria based on the number of trials. In the next cell, the `synchronous` parameter is set to `True` and the `exit_criterion` is set to `[NTrials(5)]`. 

In [None]:
tuner.tune(X, y, exit_criterion=[NTrials(5)], synchronous=True)

You can access a summary of the trials through the attributes of the `tuner` object. The `scoring_name` attribute is a string that gives the name of the scoring metric. The `best_score` attribute gives the best score of all the completed trials. The `best_params` attribute gives the hyperparameter values that led to the best score. Hyperparameters that are not in the search space are not reported.

In [None]:
print(
    f"So far the best {tuner.scoring_name} score is {tuner.best_score} and the best hyperparameters are {tuner.best_params}"
)

You can also look at the detailed table of all the trials attempted: 

In [None]:
tuner.trials.tail()

<a id='resume'></a>
## Asynchronously Tuning with Exit Criterion Based on Time Budget

`ADSTuner()` can run in asynchronous mode when you set `synchronous=False` in the `tune()` method. This allows you to run other Python commands while the tuning process executes in the background. This section demonstrates how to run an asynchronous search for the optimal hyperparameters. It uses a stopping criterion of five seconds, which is set with the parameter `exit_criterion=[TimeBudget(5)]`.

The next cell starts an asynchronous tuning process. A loop is created that prints the best search results that have been detected so far by using the `best_score` attribute. It also displays the remaining time in the time budget by using the `time_remaining` attribute. The attribute `status` is used to exit the loop.

In [None]:
# The tune() call returns right away because the tuning runs asynchronously.
tuner.tune(exit_criterion=[TimeBudget(5)], synchronous=False)
while tuner.status == State.RUNNING:
    print(
        f"So far the best score is {tuner.best_score} and the time left is {tuner.time_remaining}"
    )
    time.sleep(1)
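
The start-then-poll pattern in the cell above can be illustrated with a toy background worker built on the standard library `threading` module. The class `BackgroundSearch` is hypothetical and its scores are random numbers; this sketch only mirrors the pattern and makes no claim about how `ADSTuner` runs trials internally.

```python
import random
import threading
import time


class BackgroundSearch:
    """Toy stand-in for an asynchronous tuner (hypothetical, not an ADS class)."""

    def __init__(self, time_budget):
        self.time_budget = time_budget
        self.best_score = None
        self.n_trials = 0
        self._deadline = None
        self._thread = None

    def start(self):
        # Launch the work in a background thread and return immediately.
        self._deadline = time.monotonic() + self.time_budget
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        rng = random.Random(0)
        while time.monotonic() < self._deadline:
            score = rng.random()  # pretend this is a cross-validated score
            if self.best_score is None or score > self.best_score:
                self.best_score = score
            self.n_trials += 1
            time.sleep(0.01)  # pretend each trial takes a little while

    @property
    def running(self):
        return self._thread is not None and self._thread.is_alive()

    @property
    def time_remaining(self):
        return max(0.0, self._deadline - time.monotonic())


search = BackgroundSearch(time_budget=0.3)
search.start()
while search.running:  # the main thread stays free to poll or do other work
    print(f"best so far: {search.best_score}, time left: {search.time_remaining:.2f}s")
    time.sleep(0.1)
print(f"finished after {search.n_trials} trials")
```

The polling loop exits on its own once the time budget is spent, which is the same role that `tuner.status` plays in the notebook cell.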

The attribute `best_index` gives you the index in the `trials` data frame where the best model is located.

In [None]:
tuner.trials.loc[tuner.best_index, :]

The attribute `n_trials` reports the number of successfully completed trials.

In [None]:
print(f"The total of trials was: {tuner.n_trials}.")
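
As a rough picture of how these summary attributes relate to the trials table, the sketch below derives a best index, a best score, and a count of completed trials from a plain list of records. The records, their field names, and their values are invented for illustration and do not reflect the actual columns of `tuner.trials`.

```python
# Hypothetical trial records; a value of None marks a trial that did not complete.
trials = [
    {"number": 0, "value": 0.71, "params": {"alpha": 0.01}},
    {"number": 1, "value": None, "params": {"alpha": 10.0}},
    {"number": 2, "value": 0.94, "params": {"alpha": 0.0001}},
    {"number": 3, "value": 0.88, "params": {"alpha": 0.001}},
]

# Only completed trials are counted and compared.
completed = [(i, t) for i, t in enumerate(trials) if t["value"] is not None]
best_index, best_trial = max(completed, key=lambda pair: pair[1]["value"])

n_completed = len(completed)
best_score = best_trial["value"]
best_params = best_trial["params"]
print(best_index, best_score, best_params, n_completed)
```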

<a id='inspect'></a>
## Inspecting the Tuning Trials

You can inspect the performance of the tuning trials using several built-in plots.

**Note**: If the tuning process is still running in the background, the plot updates in real time until the tuning process completes.

In [None]:
# tuner.tune(exit_criterion=[NTrials(5)], loglevel=logging.WARNING) # uncomment this line to see the real-time plot.
tuner.plot_best_scores()

In [None]:
tuner.plot_intermediate_scores()

In [None]:
tuner.plot_contour_scores(params=["penalty", "alpha"])

In [None]:
tuner.plot_parallel_coordinate_scores(params=["penalty", "alpha"])

In [None]:
tuner.plot_edf_scores()

In [None]:
tuner.plot_param_importance()

<a id='custom'></a>
# Defining a Custom Search Space and Score

Instead of using a `perfunctory` or `detailed` strategy, define a custom search space strategy. 

The next cell creates a `LogisticRegression()` model instance and then defines a custom search space strategy for three `LogisticRegression()` hyperparameters: `C`, `solver`, and `max_iter`.

You can define a custom `scoring` parameter (see <a href='#pipeline'>Optimizing a scikit-learn `Pipeline()`</a>), though this example uses the standard weighted average $F_1$ score, `f1_score`.

In [None]:
tuner = ADSTuner(
    LogisticRegression(),
    strategy={
        "C": LogUniformDistribution(low=1e-05, high=1),
        "solver": CategoricalDistribution(["saga"]),
        "max_iter": IntUniformDistribution(500, 2000, 50),
    },
    scoring=make_scorer(f1_score, average="weighted"),
    cv=3,
)
tuner.tune(
    X, y, exit_criterion=[NTrials(5)], synchronous=True, loglevel=logging.WARNING
)
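
To build intuition for what the three distributions in the cell above describe, the following standard-library sketch draws one candidate from each kind of range. The helper functions are hypothetical and are not the ADS distribution classes. The sketch also assumes that the third argument of `IntUniformDistribution(500, 2000, 50)` is a step size, as the call suggests.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_int_uniform(rng, low, high, step=1):
    """Draw an integer from low, low + step, ..., up to high."""
    n_steps = (high - low) // step
    return low + step * rng.randint(0, n_steps)


def sample_categorical(rng, choices):
    """Draw one item from a fixed list of choices."""
    return rng.choice(choices)


rng = random.Random(42)
candidate = {
    "C": sample_log_uniform(rng, 1e-05, 1),
    "solver": sample_categorical(rng, ["saga"]),
    "max_iter": sample_int_uniform(rng, 500, 2000, 50),
}
print(candidate)
```

A log-uniform range gives each order of magnitude between `1e-05` and `1` the same chance of being drawn, which suits a regularization strength such as `C`. A categorical distribution with a single choice effectively pins `solver` to `saga`.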

<a id='search-space'></a>
## Changing the Search Space Strategy

You can change the search space in the following three ways:

- add new hyperparameters
- remove existing hyperparameters
- modify the range of existing non-categorical hyperparameters

**Note**: You can't change the distribution of an existing hyperparameter or make any changes to a hyperparameter that is based on a categorical distribution. You need to initiate a new `ADSTuner` object for those cases. For more detailed information, review the `adstuner_search_space_update.ipynb` notebook.

The next cell switches to a `detailed` strategy. All previous values set for `C`, `solver`, and `max_iter` are kept, and `ADSTuner` infers distributions for the remaining hyperparameters. You can force an overwrite by setting `overwrite=True`. 

In [None]:
tuner.search_space(strategy="detailed")

Alternatively, you can edit a subset of the search space by changing the range.

In [None]:
tuner.search_space(strategy={"C": LogUniformDistribution(low=1e-05, high=1)})

Here's an example of using `overwrite=True` to reset to the default values for `detailed`: 

In [None]:
tuner.search_space(strategy="detailed", overwrite=True)
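
The difference between keeping and overwriting previously set values can be pictured as a dictionary merge. The function `update_search_space` below is a hypothetical sketch, and the contents of `detailed_defaults` are placeholders invented for illustration; the real `detailed` search space for `LogisticRegression()` is whatever `tuner.search_space()` reports. Whether user-defined entries that are absent from the defaults survive an overwrite is also an assumption here.

```python
def update_search_space(current, defaults, overwrite=False):
    """Merge a default search space with the values that are already set."""
    if overwrite:
        # Discard earlier choices and start again from the defaults.
        return dict(defaults)
    merged = dict(defaults)
    merged.update(current)  # previously set hyperparameters win
    return merged


# Values are shown as plain strings to keep the sketch self-contained.
current = {
    "C": "LogUniform(1e-05, 1)",
    "solver": "Categorical(['saga'])",
    "max_iter": "IntUniform(500, 2000, 50)",
}
detailed_defaults = {  # placeholder entries, not the real defaults
    "C": "<default range for C>",
    "tol": "<default range for tol>",
}

kept = update_search_space(current, detailed_defaults)
reset = update_search_space(current, detailed_defaults, overwrite=True)
print(kept)
print(reset)
```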

In [None]:
tuner.tune(
    X, y, exit_criterion=[NTrials(5)], synchronous=True, loglevel=logging.WARNING
)

<a id='pipeline'></a>
# Optimizing a scikit-learn `Pipeline()` 

The following example demonstrates how the `ADSTuner` hyperparameter optimization engine can optimize scikit-learn `Pipeline()` objects.

You create a scikit-learn `Pipeline()` model object and use `ADSTuner` to optimize its performance on the iris dataset from sklearn.

The dataset is loaded as `X` and `y`, which are the features and the target, respectively. The next cell does not call `train_test_split()`: `ADSTuner` scores each trial with cross-validation, so no separate validation split is created here.

In [None]:
X, y = load_iris(return_X_y=True)
X = pd.DataFrame(
    data=X, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"]
)
y = pd.DataFrame(data=y)

numeric_features = X.select_dtypes(
    include=["int64", "float64", "int32", "float32"]
).columns
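# Select categorical columns from the feature frame X, not from the target y.
# The iris features are all numeric, so this selection is empty.
categorical_features = X.select_dtypes(include=["object", "category", "bool"]).columns

# Flatten the single-column target to 1-D before label encoding.
y = preprocessing.LabelEncoder().fit_transform(y.values.ravel())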

num_features = len(numeric_features) + len(categorical_features)

numeric_transformer = Pipeline(
    steps=[
        ("num_imputer", SimpleImputer(strategy="median")),
        ("num_scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("cat_encoder", ce.woe.WOEEncoder()),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("feature_selection", SelectKBest(f_classif, k=int(0.9 * num_features))),
        ("classifier", LogisticRegression()),
    ]
)

You can define a custom score function. In this example, it measures how closely the predicted y-values match the true y-values by taking the (optionally weighted) fraction of exact matches, which is the classification accuracy.

In [None]:
def custom_score(y_true, y_pred, sample_weight=None):
    score = y_true == y_pred
    return np.average(score, weights=sample_weight)


score = make_scorer(custom_score)
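
To check what `custom_score` computes, the following sketch reimplements the same weighted match rate with the standard library only and evaluates it on a tiny hand-made example. The function name `weighted_match_rate` and the label values are illustrative.

```python
def weighted_match_rate(y_true, y_pred, sample_weight=None):
    """Return the weighted fraction of positions where prediction equals truth."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    matched = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return matched / sum(sample_weight)


y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]  # the third prediction is wrong

unweighted = weighted_match_rate(y_true, y_pred)
weighted = weighted_match_rate(y_true, y_pred, sample_weight=[1, 1, 2, 1])
print(unweighted, weighted)
```

Without weights, three of four predictions match, so the score is 0.75. Doubling the weight of the misclassified sample lowers the score to 3/5.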

Again, you instantiate the `ADSTuner()` object and use it to tune the pipeline on the iris dataset:

In [None]:
ads_search = ADSTuner(pipe, scoring=score, strategy="detailed", cv=2, random_state=42)

ads_search.tune(
    X=X, y=y, exit_criterion=[NTrials(20)], synchronous=True, loglevel=logging.WARNING
)

The `ads_search` tuner provides useful information about the tuning process, such as the best hyperparameter values found, the best score achieved, and the number of trials.

In [None]:
ads_search.sklearn_steps

In [None]:
ads_search.best_params

In [None]:
ads_search.best_score

In [None]:
ads_search.best_index

In [None]:
ads_search.trials.head()

In [None]:
ads_search.n_trials

<a id="reference"></a>
# References

- [`ads.hpo.stopping_criterion`](https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/ads.hpo.html#module-ads.hpo.stopping_criterion)
- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)