***
# ML Lifecycle with Oracle AutoML and ADS GenericModel

***

## Overview:

In this notebook, we walk through the phases of the ML lifecycle:

  - Reading a dataset
  - Understanding the data through exploratory data analysis
  - Building and training an ML model with and without AutoML
  - Evaluating the ML model
  - Preparing the model and publishing it to the Model Catalog
  - Deploying the ML model
  - Using the deployed ML model to get predictions for new data
  - Cleaning up deployed resources

<p>We also show how to do all of the above with the Oracle Accelerated Data Science (ADS) library.
    




Please select the conda environment "automlx_p38_cpu_v5" before proceeding.

## Dataset:
<p>We will be using "Heart Attack Analysis & Prediction Dataset". The link to the dataset is <br> https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/code 
<p>A copy of `heart.csv` has been downloaded and stored in the same folder as this notebook.


In [1]:
# Standard library
import io
import logging
import os
import warnings
from collections import defaultdict
from os import path
from os.path import expanduser, join
from shutil import rmtree

# Oracle AutoML
import automl
from automl import init

# Oracle Accelerated Data Science (ADS)
import ads
from ads.catalog.model import ModelCatalog
from ads.common.model import ADSModel
from ads.common.model_metadata import UseCaseType
from ads.dataset.factory import DatasetFactory
from ads.evaluations.evaluator import ADSEvaluator



A resource principal is a feature of IAM that enables resources to be authorized principal actors that can perform actions on service resources. Each resource has its own identity, and it authenticates using the certificates that are added to it. These certificates are automatically created, assigned to resources, and rotated, avoiding the need for you to upload and manage your own credentials.

In [2]:
import ads 
ads.set_auth(auth='resource_principal') 

In [3]:
# ADS version used in this notebook: 
print(ads.hello())



  O  o-o   o-o
 / \ |  \ |
o---o|   O o-o
|   ||  /     |
o   oo-o  o--o

ads v2.8.9
oci v2.112.0
ocifs v1.2.1


None


## Open and Visualize the Heart Attack Analysis Dataset using `ADS`
<p>The first step is to load the dataset. This is done with `DatasetFactory`, a class in the `ADS` library that can open datasets from many different sources.

In [4]:
# Open the CSV, mark "output" as the target column, and treat 1 as the positive class
heart_ds = DatasetFactory.open("heart.csv", target="output").set_positive_class(1)


In [5]:
heart_ds.head(10)

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,True
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,True
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,True
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,True
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,True
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,True
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,True
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,True
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,True
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,True


In [6]:
heart_ds.summary()

Unnamed: 0,Feature,Datatype
0,output,categorical/category
1,age,ordinal/int64
2,sex,categorical/category
3,cp,ordinal/int64
4,trtbps,ordinal/int64
5,chol,ordinal/int64
6,fbs,categorical/category
7,restecg,ordinal/int64
8,thalachh,ordinal/int64
9,exng,categorical/category


In [7]:
heart_ds.isna().sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

<a id='viz'></a>
### Visualize the Dataset Object

Calling the `show_in_notebook()` method on the dataset produces the following:

  - Summary: a brief description of the dataset, its shape, and a breakdown by feature type
  - Feature summary: visualizations, created from a sample of the dataset, that give an idea of the distribution of each feature
  - Correlations: a map that shows how the features (numeric and categorical) are correlated with each other
  - Data preview: the first five rows of the data
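
To make the "Correlations" item concrete, here is a minimal sketch of one common correlation measure, the Pearson coefficient, computed by hand on the `age` and `thalachh` values from the first ten rows of `heart.csv`. The `pearson` helper is hypothetical and is not part of `ADS`; the correlation methods that `ADS` uses internally may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Values taken from the first ten rows of heart.csv
age = [63, 37, 41, 56, 57, 57, 56, 44, 52, 57]
thalachh = [150, 187, 172, 178, 163, 148, 153, 173, 162, 174]

r = pearson(age, thalachh)
print(f"Pearson correlation between age and thalachh: {r:.3f}")
```

A value near -1 or 1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. Ten rows are far too few to draw conclusions about the full dataset.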



In [8]:
heart_ds.show_in_notebook()

Accordion(children=(HTML(value='<h3>Type: BinaryClassificationDataset</h3><h3>303 Rows, 14 Columns</h3><h4>Col…

In [9]:
heart_ds.show_corr(correlation_methods='all')

## Transform Data using `ADS`

<a id='trans'></a>
### Get and Apply Transformation Recommendations

`ADS` can help with feature engineering by transforming datasets. For example, it can fix class imbalance by upsampling or downsampling, which is just one of the many transforms that `ADS` can apply. The `auto_transform()` method analyzes the data and automatically applies the transformations that `ADS` expects to improve the model. Alternatively, the `suggest_recommendations()` method lets you explore the suggested transforms in the notebook's UI and select the ones that you want to apply.

All ADS datasets are immutable; any transforms that are applied result in a new dataset.
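
As an illustration of the two ideas above (upsampling to fix class imbalance, and transforms that return a new dataset instead of modifying the original), here is a minimal plain-Python sketch. The `upsample_to_balance` helper and the toy rows are hypothetical; this is not how `ADS` implements its transforms.

```python
import random
from collections import Counter


def upsample_to_balance(rows, label_key, seed=0):
    """Return a NEW list in which every class has as many rows as the largest class.

    The input list is never modified, mirroring the immutable-dataset idea.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra rows with replacement until the class reaches the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy rows for illustration only (not taken from heart.csv)
rows = [
    {"age": 60, "output": 1},
    {"age": 45, "output": 1},
    {"age": 52, "output": 1},
    {"age": 67, "output": 0},
]

balanced = upsample_to_balance(rows, "output")
print("original:", Counter(row["output"] for row in rows))
print("balanced:", Counter(row["output"] for row in balanced))
```

The original `rows` list still has four entries after the call, while `balanced` has three rows per class.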

In [10]:
heart_ds_transformed = heart_ds.auto_transform()


Next, split the dataset into training and test sets. By default, `train_test_split()` produces a 90/10 train/test split. Set the `test_size` parameter to change the size of the test set.

In [11]:
train, test = heart_ds_transformed.train_test_split()
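
For intuition about what a 90/10 split means for this dataset of 303 rows, here is a minimal index-based sketch. The `split_indices` helper is hypothetical, and rounding the test size up is an assumption chosen because it reproduces the 272 training and 31 test instances that appear in the logs later in this notebook; the actual `ADS` implementation is not shown here.

```python
import math
import random


def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices and split them into (train, test) index lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Assumption: the test set size is rounded up to a whole number of rows
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(303)
print(len(train_idx), len(test_idx))  # prints "272 31"
```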

## Train the Model using AutoML 
<a id='Engine'></a>
### Setting the engine and deprecation warnings



The AutoML pipeline provides the `init` function, which initializes the parallelization engine.

In [12]:
init(engine='local', check_deprecation_warnings=False)

INFO:automl:Running on 19dd7b37c975 4 logical cores
INFO:automl.xengine:Using Single Node XEngine with n_jobs: 2
INFO:automl.xengine:Max timeout per task is set to 1500
INFO:automl.xengine:local xengine initialization: <multiprocessing.pool.Pool state=RUN pool_size=2>


In [13]:
est1 = automl.Pipeline(task='classification')
est1.fit(train.X, train.y)

INFO:automl.interface:Using AutoML default metric of neg_log_loss
INFO:automl:Time budget of 0 using fcfs allocation strategy
INFO:automl.pipeline:cv=5; ds_valid=None; task=classification
INFO:automl.pipeline:cv: 5; ds instances: 272; ds_valid instances: None
INFO:automl.pipeline:#############################################################################
INFO:automl.pipeline:############################ AutoML Pipeline ################################
INFO:automl.pipeline:#############################################################################
INFO:automl.pipeline:
INFO:automl.pipeline:Config: {'xengine': 'local', 'xengine_opts': {'dask_scheduler': None, 'model_n_jobs': 2, 'n_jobs': 2, 'spawn_type': 'forkserver', 'exec_ctx': <automl.xengine.local.DefaultXEngine object at 0x7f4a876217f0>}, 'max_n_jobs': -1, 'max_model_n_jobs': -1, 'task': 'classification', 'scoring': ['neg_log_loss'], 'random': RandomState(MT19937) at 0x7F4A7C1B6240, 'loglevel': 30, 'data_dir': '/home/datascience

In [14]:
est1.print_summary()

0,1
Training Dataset size,"(272, 13)"
Validation Dataset size,
CV,5
Optimization Metric,neg_log_loss
Selected Features,"Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'thalachh', 'exng', 'oldpeak',  'slp', 'caa', 'thall'],  dtype='object')"
Selected Algorithm,RandomForestClassifier
Time taken,47.209
Selected Hyperparameters,"{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': 'balanced_subsample', 'criterion': 'gini', 'max_depth': None, 'max_features': 0.777777778, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 253, 'n_jobs': 4, 'oob_score': False, 'random_state': 7, 'verbose': 0, 'warm_start': False}"
AutoML version,23.2.2
Python version,"3.8.16 (default, Jun 12 2023, 18:09:05) \n[GCC 11.2.0]"


Algorithm,#Samples,#Features,Mean Validation Score,Hyperparameters,CPU Time,Memory Usage (GB)
RandomForestClassifier_AdaBoostClassifier_FS,272,11,-0.3582,"{'n_estimators': 100, 'class_weight': 'balanced', 'max_features': 0.777777778, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125}",0.8958,0.0
RandomForestClassifier_AdaBoostClassifier_FS,272,12,-0.3593,"{'n_estimators': 100, 'class_weight': 'balanced', 'max_features': 0.777777778, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125}",0.8385,0.0
RandomForestClassifier_HT,272,11,-0.3632,"{'class_weight': 'balanced_subsample', 'max_features': 0.777777778, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125, 'n_estimators': 253}",2.2035,"(0.0, None)"
RandomForestClassifier_HT,272,11,-0.3634,"{'class_weight': 'balanced_subsample', 'max_features': 0.777777778, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125, 'n_estimators': 252}",2.2764,"(0.0, None)"
RandomForestClassifier_HT,272,11,-0.3640,"{'class_weight': 'balanced', 'max_features': 0.777786868909091, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125, 'n_estimators': 253}",1.9377,"(0.0, None)"
...,...,...,...,...,...,...
RandomForestClassifier_HT,272,11,-1.5657,"{'class_weight': None, 'max_features': 0.09090909090909091, 'min_samples_leaf': 0.003676470588235294, 'min_samples_split': 0.00125, 'n_estimators': 5}",0.0590,"(0.0, None)"
RandomForestClassifier_HT,272,11,-1.5657,"{'class_weight': None, 'max_features': 0.09090909090909091, 'min_samples_leaf': 0.003676470588235294, 'min_samples_split': 0.0012599264705882352, 'n_estimators': 5}",0.0544,"(0.0, None)"
RandomForestClassifier_HT,272,11,-1.9297,"{'class_weight': None, 'max_features': 0.5491142137735302, 'min_samples_leaf': 0.003676470588235294, 'min_samples_split': 0.007352941176470588, 'n_estimators': 5}",0.0674,"(0.0, None)"
RandomForestClassifier_HT,272,11,-1.9297,"{'class_weight': None, 'max_features': 0.5491233046826212, 'min_samples_leaf': 0.003676470588235294, 'min_samples_split': 0.007352941176470588, 'n_estimators': 5}",0.0668,"(0.0, None)"
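
The scores in the table above use the optimization metric `neg_log_loss`, so values closer to zero are better: the trial at -0.3582 ranks above the one at -1.9297. The following is a minimal sketch of negative log loss for binary labels. The `neg_log_loss` helper and the toy probabilities are illustrative only, and how AutoML aggregates the score across the five cross-validation folds is not shown in this notebook.

```python
import math


def neg_log_loss(y_true, proba_positive, eps=1e-15):
    """Negative log loss for binary labels (1/0) and predicted P(class == 1).

    The result is always <= 0; values closer to 0 mean better-calibrated predictions.
    """
    total = 0.0
    for y, p in zip(y_true, proba_positive):
        # Clip probabilities so that log(0) never occurs
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Toy labels and probabilities for illustration only
y_true = [1, 0, 1, 1]
confident = neg_log_loss(y_true, [0.9, 0.1, 0.8, 0.7])
uninformed = neg_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])

print(f"confident predictions:  {confident:.4f}")
print(f"uninformed predictions: {uninformed:.4f}")
```

A model that always predicts 0.5 scores -ln(2), about -0.6931, which gives a useful baseline when reading the validation scores.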


In [15]:
from sklearn.metrics import f1_score
y_pred = est1.predict(test.X)
score_default = f1_score(test.y, y_pred, average='macro')
print(f'Score on test data : {score_default}')


INFO:automl.preprocessing:transform: After feature engineering and transformations. Updated shape : (31, 13)
INFO:automl.dataset:Train memory consumption 0.00 MB -> 0.00 MB after downcasting (0.00 secs)
INFO:automl.dataset:Train dtypes before: float64    11 -> after downcasting float32    11
Score on test data : 0.7703703703703704
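
The score above is a macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are then averaged with equal weight. The following plain-Python sketch shows the calculation on toy labels. The helper names are hypothetical; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # A class with no true or predicted members gets a score of 0
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels for illustration only
y_true = [True, True, False, False, True, False]
y_pred = [True, False, False, False, True, True]

score = macro_f1(y_true, y_pred)
print(f"macro F1: {score:.4f}")
```

Because each class contributes equally, macro averaging does not let the majority class dominate the score.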


## Create Model Artifacts for saving the Model using `ADS`
<p>The AutoML pipeline returns a fitted estimator (`est1`). The `GenericModel()` constructor takes this estimator along with the path where you want to store the model artifacts. The resulting `GenericModel` object is used to manage the deployment.
<p>To deploy the model into production, you need to prepare the model artifact, verify that the artifact works, save the model to the model catalog, and then deploy it. The `GenericModel` class provides methods for each of these tasks.
    

In [16]:
import tempfile
from ads.model import GenericModel
from ads.common.model_metadata import UseCaseType

# Temporary directory that will hold the generated model artifacts
artifact_dir = tempfile.mkdtemp()
print(f"Model artifact directory: {artifact_dir}")

# Wrap the fitted AutoML pipeline so that ADS can prepare, verify, save, and deploy it
automl_model = GenericModel(estimator=est1, artifact_dir=artifact_dir)

automl_model.prepare(inference_conda_env="automlx_p38_cpu_v1",
                     training_conda_env="automlx_p38_cpu_v1",
                     use_case_type=UseCaseType.BINARY_CLASSIFICATION,
                     X_sample=test.X,
                     force_overwrite=True)

Model artifact directory: /tmp/tmpkx5orq33

algorithm: null
artifact_dir:
  /tmp/tmpkx5orq33:
  - - .model-ignore
    - score.py
    - runtime.yaml
    - input_schema.json
    - model.pkl
framework: null
model_deployment_id: null
model_id: null

In [17]:
os.listdir(artifact_dir)

['.model-ignore', 'score.py', 'runtime.yaml', 'input_schema.json', 'model.pkl']

In [18]:
automl_model.verify(test.X.iloc[:10], auto_serialize_data=True)

Start loading model.pkl from model directory /tmp/tmpkx5orq33 ...
Model is successfully loaded.
INFO:automl.preprocessing:transform: After feature engineering and transformations. Updated shape : (10, 13)
INFO:automl.dataset:Train memory consumption 0.00 MB -> 0.00 MB after downcasting (0.00 secs)
INFO:automl.dataset:Train dtypes before: float64    11 -> after downcasting float32    11


{'prediction': [False,
  True,
  False,
  False,
  True,
  True,
  True,
  False,
  False,
  True]}
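
The log above shows the verification step loading `model.pkl` from the artifact directory and returning a dictionary with a `prediction` key. The sketch below imitates that load-then-predict flow with a toy "model" so that the shape of the round trip is easy to see. Everything in it is hypothetical: the threshold rule, the `load_model` and `predict` helpers, and the sample rows are assumptions for illustration, and the `score.py` file that `prepare()` generates is not shown in this notebook and may work differently.

```python
import os
import pickle
import tempfile

MODEL_FILE = "model.pkl"


def load_model(model_dir):
    """Deserialize the model stored in the artifact directory."""
    with open(os.path.join(model_dir, MODEL_FILE), "rb") as fh:
        return pickle.load(fh)


def predict(rows, model):
    """Score a list of feature dictionaries and wrap the result in a dict."""
    feature, threshold = model["feature"], model["threshold"]
    return {"prediction": [row[feature] > threshold for row in rows]}


# Toy stand-in for a trained model: a single threshold rule (illustration only)
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, MODEL_FILE), "wb") as fh:
    pickle.dump({"feature": "oldpeak", "threshold": 1.0}, fh)

sample = [{"oldpeak": 2.3}, {"oldpeak": 0.6}]
result = predict(sample, load_model(demo_dir))
print(result)  # prints {'prediction': [True, False]}
```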

## Save / Publish Model to the Model Catalog using `ADS`

Use a unique display name to identify the saved model.

In [None]:
model_id = automl_model.save(display_name='AutoML Model-0111-2', ignore_introspection=True)

## Deploy Model using `ADS`

In [None]:
deploy = automl_model.deploy(display_name='AutoML Model Deployment-0111-3')

The model deployment takes some time to be created. You can check its status in the console.

## Get predictions for a set of new data using `ADS`

In [None]:
print(f"Endpoint: {automl_model.model_deployment.url}")

In [None]:
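# local=True scores against the local model artifact; set local=False to call the deployed endpoint
automl_model.predict(test.X.iloc[:10], auto_serialize_data=True, local=True)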