***
# <font color=red>Building and Evaluating a Machine Learning Model using AutoMLx</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle AutoMLx Team </font></p>

***

AutoMLx Demo Notebook.

Copyright © 2024, Oracle and/or its affiliates.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

## Overview of this Notebook

In this notebook, we build and evaluate machine learning models using Oracle AutoMLx. We start with a binary classification dataset and finish with a regression dataset.
We also explore the options that Oracle AutoMLx exposes for controlling the training procedure, and then evaluate the models trained by AutoMLx.

---
## Prerequisites

  - Experience level: Novice (Python and Machine Learning)
  - Professional experience: Some industry experience
---

## Business Use

Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:
- Preprocessing the dataset (for example, cleaning, imputing, engineering features and normalization).
- Picking an appropriate model for the given dataset and prediction task at hand.
- Tuning the chosen model’s hyperparameters for the given dataset.

All of these steps are time-consuming and rely heavily on data science expertise. To make the problem harder, the best feature subset, model, and hyperparameter choice vary widely with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. Using a simple Python API, AutoML can quickly jump-start the data science process with an accurately tuned model and appropriate features for a given prediction task.

## Table of Contents

- <a href='#train'> Train a Model using AutoMLx</a>
- <a href='#quality'> Inspect the Model's Quality</a>
- <a href='#prediction'> Make Predictions using the Model</a>
- <a href='#evaluate'> Evaluate the Quality of a Model on a New Dataset</a>
- <a href='#save'> Save a model</a>
- <a href='#load'> Load a model</a>
- <a href='#regression'> Train a Regression Model using AutoMLx</a>
- <a href='#ref'>References</a>

<a id='train'></a>
## Train a Model using AutoMLx

Here we show how easy it is to use the AutoMLx `train_model` API to quickly and automatically train a model for a
machine learning problem. We pass the data, the name of the target column to predict, and the task to the `train_model`
function. This function returns the best fully trained model that AutoML could find for the given dataset.

You can find the synthetic datasets used in this notebook at https://docs.oracle.com/en-us/iaas/tools/automlx/latest/data/

The data argument can be a string, in which case it should be the path to a CSV file that contains your dataset.
Alternatively, you can directly pass a pandas DataFrame.

The task can be either `classification` or `regression`, or we can import and use `Task.CLASSIFICATION` and `Task.REGRESSION`.

In [1]:
from automlx import train_model

model = train_model(
    data = "classification_train.csv",  # path to dataset CSV file or a pandas DataFrame
    target_to_predict = "income_group",  # name of the target column in the dataset
    task = 'classification',  # type of problem you are interested in solving
)

That's it! The model is fully trained and ready to be used to make predictions or to be deployed.

<a id='quality'></a>
## Inspect the Model's Quality

But how well can you expect your model to work? There are many ways to measure the quality of a machine learning
model. AutoML automatically calculates two different types of scores for your model. The first is the model's
training score, which tells you how well the model learned to predict the target on the data that was used for
training the model. Generally, higher scores are better; however, sometimes a model may see patterns in your data
that appeared by random chance. When this happens, your model typically won't perform well when deployed, because
those same patterns aren't likely to appear in future data that the model encounters. For this reason, AutoML
automatically reserves 20% of the training data as a stress test for your model. This data is not used to train the
model; instead, it is used to estimate the future quality of your model on new data. Both scores can be accessed
using `model.quality`.

In [2]:
model.quality

Evaluated on 2024-09-11,neg_log_loss
Measured quality on training data,-0.109292
Estimate of future quality,-0.131209


The closer the two scores are, the better. If there is a large gap between them, it may mean that the model learned to rely on spurious correlations. However, if the model quality is still good on the stress test data, then the gap may not be a cause for concern.
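To build some intuition for the `neg_log_loss` values reported above, here is a minimal sketch of how such a score can be computed for a binary problem. It assumes the metric follows the common convention (as in scikit-learn's `neg_log_loss` scorer) of negating the mean log loss so that higher is better; the helper name and the toy labels and probabilities are ours and are not part of AutoMLx.

```python
import math

def neg_log_loss(y_true, y_prob, eps=1e-15):
    """Negated mean log loss for binary labels (0/1) and predicted P(label == 1).

    Hypothetical helper for illustration; not an AutoMLx function.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1, 1]                    # toy ground truth
confident = [0.9, 0.1, 0.8, 0.95]        # probabilities close to the truth
hesitant = [0.6, 0.4, 0.55, 0.6]         # correct side of 0.5, but unsure

score_confident = neg_log_loss(labels, confident)
score_hesitant = neg_log_loss(labels, hesitant)
print(score_confident, score_hesitant)
```

Both scores are negative, and the confident predictions score closer to zero than the hesitant ones. Under this convention, a score of zero would correspond to predicting every label with complete certainty.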

<a id='prediction'></a>
## Make Predictions using the Model

We can now use this model to make predictions! The following method will return a new dataset that is identical to
the provided dataset with an additional column that contains the model's predictions.

In [3]:
data_with_prediction = model.predict("classification_train.csv")
data_with_prediction.head(2)

Unnamed: 0,age,education,sex,income_group,prediction for income_group
0,42,diploma,female,<=50k,<=50k
1,57,bachelors,male,<=50k,<=50k


We can also save the dataset with the predictions by passing the desired path of a new CSV file to the `output` parameter.

In [4]:
data_with_prediction = model.predict("classification_train.csv", output='data_with_prediction.csv')
data_with_prediction.head(2)

Unnamed: 0,age,education,sex,income_group,prediction for income_group
0,42,diploma,female,<=50k,<=50k
1,57,bachelors,male,<=50k,<=50k
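Because the returned dataset keeps the original target column next to the prediction column, a quick sanity check is to count how often the two agree. The sketch below uses only the standard library and the two rows shown above (without the index column); the `agreement_rate` helper is hypothetical, and the `prediction for <target>` column name is taken from the output above.

```python
import csv
import io

# Two rows copied from the output above. In practice, open the saved
# 'data_with_prediction.csv' file instead of this in-memory sample.
sample = io.StringIO(
    "age,education,sex,income_group,prediction for income_group\n"
    "42,diploma,female,<=50k,<=50k\n"
    "57,bachelors,male,<=50k,<=50k\n"
)

def agreement_rate(rows, target="income_group"):
    """Fraction of rows whose prediction column equals the target column."""
    rows = list(rows)
    matches = sum(row[target] == row[f"prediction for {target}"] for row in rows)
    return matches / len(rows)

rate = agreement_rate(csv.DictReader(sample))
print(f"{rate:.0%} of rows predicted correctly")
```

Keep in mind that agreement on the training file is an optimistic figure, since the model has already seen these rows.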


<a id='evaluate'></a>
## Evaluate the Quality of a Model on a New Dataset

Inspecting `model.quality` (see above) is always a good idea to ensure that the model performed well when it was
trained. However, it is never a bad idea to continue evaluating the model over time on new data as you collect it.
This can be achieved using the `evaluate_model_quality` function. We just need to pass the model and the desired
dataset to this function.

In [5]:
from automlx import evaluate_model_quality

score = evaluate_model_quality(model, "classification_test.csv")

score

Unnamed: 0,neg_log_loss
classification_test.csv,-0.159003


You can always compare the new results to `model.quality` to see if the model is still performing as well as it
was expected to when it was trained. If not, it may be time to call `train_model` again with your new data so that
the model can learn any new trends that have appeared in your dataset.
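One simple way to make this comparison concrete is to measure how far the new score falls below the estimate from training time. The sketch below is an assumption-laden illustration: `relative_drop` and the 10% tolerance are hypothetical, and the two numbers are copied from the outputs shown earlier in this notebook.

```python
def relative_drop(expected, observed):
    """Relative drop of a higher-is-better score, as a fraction of |expected|."""
    return (expected - observed) / abs(expected)

expected = -0.131209  # "Estimate of future quality" from model.quality
observed = -0.159003  # score measured on classification_test.csv

drop = relative_drop(expected, observed)

TOLERANCE = 0.10  # hypothetical threshold; pick one that suits your application
if drop > TOLERANCE:
    print(f"Score dropped by {drop:.1%}; consider retraining.")
else:
    print(f"Score within tolerance (drop of {drop:.1%}).")
```

The right threshold depends on the metric and on how costly a wrong prediction is, so treat the value here as a placeholder.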

<a id='save'></a>
## Save a model

Once we are satisfied with the results, we can save the model, using the `save` method, by passing a desired file
path.

In [6]:
model.save('model.amlx')

<a id='load'></a>
## Load a model

We can also load a saved model using the `load_model` function by providing the path to the model.

In [7]:
from automlx import load_model

loaded_model = load_model('model.amlx')

loaded_model.quality

Evaluated on 2024-09-11,neg_log_loss
Measured quality on training data,-0.109292
Estimate of future quality,-0.131209


<a id='regression'></a>
## Train a Regression Model using AutoMLx

Here we show how to train a model for a regression task. We also cover some optional arguments that can be used to
further control how AutoML works.
  - `metric`: The name of the desired scoring metric. By default, this is `auto`, which means that an appropriate
  metric is chosen based on the dataset and task.
  - `time_budget`: The maximum time budget in seconds. By default, this is `None`, which means that AutoML should run
  until it is done (that is, until it cannot find any better models for your data). Note that AutoML will sometimes
  run for longer than your requested time budget. This is to ensure that we can always return a fully trained model
  that is ready to be deployed.
  - `test_data`: Advanced users can pass in a custom dataset for stress testing the model. This will be used to estimate
  the quality of the final model on future data. If not provided, the test scores are estimated automatically by
  reserving 20% of the training data for evaluation of the final model.

In [8]:
from automlx import Task

model_regression = train_model(
    data = "regression_train.csv",
    target_to_predict = "income",
    task = Task.REGRESSION,
    metric = 'auto',
    time_budget=30,
    test_data = "regression_test.csv"
)
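When `test_data` is not provided, 20% of the training data is reserved for evaluation. The idea can be pictured with the generic hold-out sketch below. This is not AutoMLx's implementation: how it actually selects the reserved rows is not described here, and the `holdout_split` helper and its seed are assumptions made for illustration.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and reserve test_fraction of them for evaluation.

    Hypothetical helper; returns (train_rows, test_rows).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 dataset rows
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

The reserved rows never take part in training, which is what makes the score computed on them a fair estimate of quality on future data.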

In [9]:
model_regression.quality

Evaluated on 2024-09-11,neg_mean_squared_error
Measured quality on training data,-37810310.0
Estimate of future quality,-38290700.0
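Scores in squared units can be hard to read. Assuming `neg_mean_squared_error` is the negated mean squared error (the usual higher-is-better convention), taking the square root of its magnitude gives a root mean squared error (RMSE) in the same units as the `income` target. The helper below is ours, not part of AutoMLx.

```python
import math

def rmse_from_neg_mse(neg_mse):
    """Convert a neg_mean_squared_error score to RMSE in the target's units."""
    return math.sqrt(-neg_mse)

# Values copied from model_regression.quality above.
train_rmse = rmse_from_neg_mse(-37810310.0)
test_rmse = rmse_from_neg_mse(-38290700.0)

print(f"training RMSE: {train_rmse:,.0f}")
print(f"estimated future RMSE: {test_rmse:,.0f}")
```

Both values come out at a little over 6,000 income units, and they are close to each other, which matches the earlier guidance that similar training and test scores are a good sign.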


<a id='ref'></a>
## References
* Oracle AutoML http://www.vldb.org/pvldb/vol13/p3166-yakovlev.pdf