### Regression and one-click deployment example (simplified, with new ADS functions)

In this Notebook I walk through all the steps needed to develop a **regression model** and save the model to the OCI Data Science **Model Catalog**.

After the model has been saved, I deploy it as a **REST** service.

The Notebook is based on the **California Housing Dataset**, downloaded from scikit-learn.

In this Notebook ADSTuner has been removed; training uses the best parameters found in the complete Notebook.

This Notebook uses the new ADS functionality to simplify saving the model to the Model Catalog. It requires ADS >= 2.5.9.

In [12]:
import pandas as pd
import numpy as np
from pandas.api.types import is_numeric_dtype

# the dataset used for the example
from sklearn.datasets import fetch_california_housing

from sklearn.model_selection import train_test_split

# the GBM used
import xgboost as xgb

# to save to Model catalog
import pickle
import os
from ads import set_auth
import ads

# needed for the new ADS model deployment features
from ads.model.framework.xgboost_model import XGBoostModel
from ads.common.model_metadata import UseCaseType
from ads.catalog.model import ModelCatalog

In [15]:
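# check we have the minimum ADS version (April 2022)
# compare parsed versions: a plain string comparison would rank "2.10.0" below "2.5.9"
from packaging.version import Version

assert Version(ads.__version__) >= Version("2.5.9")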
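Comparing version strings character by character can give the wrong answer once a component reaches two digits. The sketch below illustrates the difference with a hypothetical helper, `version_tuple`, which is not part of ADS and assumes purely numeric, dot-separated versions.

```python
def version_tuple(version: str) -> tuple:
    """Turn a dotted numeric version such as '2.5.9' into (2, 5, 9).

    Hypothetical helper for illustration; assumes every component is an integer.
    """
    return tuple(int(part) for part in version.split("."))


MIN_VERSION = "2.5.9"

# string comparison is lexicographic: "1" sorts before "5", so this is False
print("2.10.0" >= MIN_VERSION)

# tuple comparison is numeric, component by component, so this is True
print(version_tuple("2.10.0") >= version_tuple(MIN_VERSION))
```

Versions with suffixes (for example release candidates) would need a real version parser rather than this helper.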

### Some utility functions

In [5]:
# functions
def get_general_info(data_df):
    print(f"There are: {len(data_df.columns)} columns in the dataset")
    print()
    print(
        "The list of column names, in alphabetical order:",
        sorted(list(data_df.columns)),
    )
    print()
    print(f"There are {data_df.shape[0]} records in the dataset")
    print()
    
    return

# threshold, expressed as a fraction of the number of rows:
# a column with fewer distinct values than FRAC * n_rows is treated as categorical
FRAC = 0.1

def analyze_df(data_df):
    # it is ok to use isna, isnull is an alias of isna
    missing_val = data_df.isna().sum()

    # cardinality threshold (number of distinct values)
    THR = data_df.shape[0] * FRAC

    list_card = []
    list_cat = []
    list_dtypes = []

    for col in data_df.columns:
        # count the # of distinct values
        n_distinct = data_df[col].nunique()
        list_card.append(n_distinct)
        
        # a column is flagged as categorical based on this rule
        if n_distinct < THR:
            # categorical
            list_cat.append("Yes")
        else:
            list_cat.append("No")

        list_dtypes.append(data_df[col].dtype)

    # build the results DF
    result_df = pd.DataFrame(
        {
            "col_name": list(data_df.columns),
            "missing_vals": missing_val,
            "cardinality": list_card,
            "is_categorical": list_cat,
            "data_type": list_dtypes,
        },
        index=None,
    )

    # missing_val is indexed by column name; drop that index in favour of a plain range
    result_df.reset_index(drop=True, inplace=True)

    return result_df
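The cardinality rule used in `analyze_df` can be illustrated on plain Python lists. This is a minimal sketch, not a replacement for the function above: `is_categorical` is a hypothetical name, and it assumes the values contain no missing entries (pandas `nunique()` ignores NaN, a Python `set` does not).

```python
def is_categorical(values, frac=0.1):
    """Return True when the number of distinct values is below frac * len(values)."""
    threshold = len(values) * frac
    return len(set(values)) < threshold


# 100 rows with only 5 distinct values: 5 < 10, so the column is flagged
ages = [i % 5 for i in range(100)]

# 100 rows, all distinct: 100 is not below 10, so the column is not flagged
incomes = [i * 1.5 for i in range(100)]

print(is_categorical(ages), is_categorical(incomes))
```

The threshold scales with the number of rows, so the same column can change classification if the dataset is subsampled.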

### Load the dataset

In [3]:
# load the dataset
housing = fetch_california_housing(as_frame=True)

orig_df = housing.frame

In [4]:
orig_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


### Some EDA

In [6]:
# To keep things simple, I use all the columns as features except the target (MedHouseVal), Latitude and Longitude

TARGET = "MedHouseVal"
all_cols = list(orig_df.columns)
cols_to_drop = ['Latitude', 'Longitude']

cat_cols = ['HouseAge']

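# note: the feature names are sorted alphabetically, so their order differs from the DataFrame's
FEATURES = sorted(list(set(all_cols) - set([TARGET]) - set(cols_to_drop)))

# indices of the categorical columns (as needed by LightGBM; not used in the rest of this Notebook)
cat_columns_idxs = [i for i, col in enumerate(FEATURES) if col in cat_cols]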

FEATURES

['AveBedrms', 'AveOccup', 'AveRooms', 'HouseAge', 'MedInc', 'Population']

In [7]:
# the only important thing is that we have one categorical column: HouseAge

# we encode the categorical column as integers starting from zero
# here that is easy: the minimum is 1, so we only need to subtract 1

In [8]:
# make a copy before any changes
used_df = orig_df.copy()

used_df['HouseAge'] = used_df['HouseAge'] - 1.

used_df['HouseAge'] = used_df['HouseAge'].astype(int)
used_df['HouseAge'] = used_df['HouseAge'].astype("category")

In [9]:
# let's make a simple train/test split
X = used_df[FEATURES].values
y = used_df[TARGET].values

TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=1)

### Train with best params

In [10]:
%%time
### train with best params

params = {
    "n_estimators": 600,
    "learning_rate": 0.008,
    "max_depth": 7,
}

model = xgb.XGBRegressor(**params)

model.fit(X_train, y_train)

CPU times: user 1h 20min 12s, sys: 11.2 s, total: 1h 20min 23s
Wall time: 2min 50s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.008, max_delta_step=0,
             max_depth=7, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=600, n_jobs=32,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

### Prepare for Model Catalog

In [19]:
PATH_ARTEFACT = "./model-files"

# create the artifact directory if it does not exist yet
os.makedirs(PATH_ARTEFACT, exist_ok=True)

In [52]:
set_auth(auth='resource_principal')

xgb_model = XGBoostModel(estimator=model, artifact_dir= PATH_ARTEFACT)

In [53]:
# 1. prepare
xgb_model.prepare(
    inference_conda_env="generalml_p37_cpu_v1",
    training_conda_env="generalml_p37_cpu_v1",
    use_case_type=UseCaseType.REGRESSION,
    X_sample=X_test,
    y_sample=y_test,
    force_overwrite=True
)

In [54]:
# be aware that the XGBoost model is saved as a JSON file (model.json) in model-files

In [55]:
# 2. verify
print(xgb_model.verify(X_test[:10]))

# compare with expected values
print()
print(f"Expected: {y_test[:10]}")

Start loading model.json from model directory /home/datascience/data-science-bp/model-files ...
Model is successfully loaded.
{'prediction': [3.4311716556549072, 0.8633096814155579, 2.053156852722168, 1.4130446910858154, 3.0937297344207764, 2.868069648742676, 2.3109734058380127, 1.25899076461792, 1.4848921298980713, 1.584919810295105]}

Expected: [3.55  0.707 2.294 1.125 2.254 2.63  2.268 1.662 1.18  1.563]
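To put a number on the gap between the predictions and the expected values printed above, one option is to compute the mean absolute error and the root mean squared error over these ten rows. The sketch below copies the values from the output (predictions truncated to four decimals) and uses hypothetical helper names; ten rows are far too few for a real evaluation, which would use the whole test set.

```python
import math

# values copied from the verify() output above (predictions truncated to 4 decimals)
predicted = [3.4311, 0.8633, 2.0531, 1.4130, 3.0937, 2.8680, 2.3109, 1.2589, 1.4848, 1.5849]
expected = [3.55, 0.707, 2.294, 1.125, 2.254, 2.63, 2.268, 1.662, 1.18, 1.563]


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


mae = mean_absolute_error(expected, predicted)
rmse = root_mean_squared_error(expected, predicted)
print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}")
```

RMSE penalizes large individual errors more heavily than MAE, so a single badly predicted row (such as the fifth one here) pushes it up.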


In [56]:
# we can check the list of steps
xgb_model.summary_status()

| Step | Status | Details | Actions Needed |
|---|---|---|---|
| initiate | Done | Initiated the model | |
| prepare() | Done | Generated runtime.yaml | |
| prepare() | Done | Generated score.py | |
| prepare() | Done | Serialized model | |
| prepare() | Done | Populated metadata(Custom, Taxonomy and Provenance) | |
| verify() | Done | Local tested .predict from score.py | |
| save() | Available | Conducted Introspect Test | |
| save() | Available | Uploaded artifact to model catalog | |
| deploy() | Not Available | Deployed the model | |
| predict() | Not Available | Called deployment predict endpoint | |


In [57]:
# 3. introspect to do some checks
xgb_model.introspect()

['model.json', 'output_schema.json', 'input_schema.json', 'runtime.yaml', 'test_json_output.json', 'score.py']


| | Test key | Test name | Result | Message |
|---|---|---|---|---|
| 0 | runtime_env_path | Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_PATH is set | Passed | |
| 1 | runtime_env_python | Check that field MODEL_DEPLOYMENT.INFERENCE_PYTHON_VERSION is set to a value of 3.6 or higher | Passed | |
| 2 | runtime_path_exist | Check that the file path in MODEL_DEPLOYMENT.INFERENCE_ENV_PATH is correct. | Passed | |
| 3 | runtime_version | Check that field MODEL_ARTIFACT_VERSION is set to 3.0 | Passed | |
| 4 | runtime_yaml | Check that the file "runtime.yaml" exists and is in the top level directory of the artifact directory | Passed | |
| 5 | score_load_model | Check that load_model() is defined | Passed | |
| 6 | score_predict | Check that predict() is defined | Passed | |
| 7 | score_predict_arg | Check that all other arguments in predict() are optional and have default values | Passed | |
| 8 | score_predict_data | Check that the only required argument for predict() is named "data" | Passed | |
| 9 | score_py | Check that the file "score.py" exists and is in the top level directory of the artifact directory | Passed | |


In [58]:
# seems everything is OK

In [59]:
# 4. once any needed changes to score.py are done, save to the Model Catalog
model_id = xgb_model.save(display_name = "cal_housing_new2", description = "new way of model deployment")

Start loading model.json from model directory /home/datascience/data-science-bp/model-files ...
Model is successfully loaded.
['model.json', 'output_schema.json', 'input_schema.json', 'runtime.yaml', 'test_json_output.json', 'score.py']



artifact:/tmp/saved_model_d1041770-283b-494a-af92-d9bd5facc598.zip


In [60]:
# just to see, check again status
xgb_model.summary_status()

| Step | Status | Details | Actions Needed |
|---|---|---|---|
| initiate | Done | Initiated the model | |
| prepare() | Done | Generated runtime.yaml | |
| prepare() | Done | Generated score.py | |
| prepare() | Done | Serialized model | |
| prepare() | Done | Populated metadata(Custom, Taxonomy and Provenance) | |
| verify() | Done | Local tested .predict from score.py | |
| save() | Done | Conducted Introspect Test | |
| save() | Done | Uploaded artifact to model catalog | |
| deploy() | Available | Deployed the model | |
| predict() | Not Available | Called deployment predict endpoint | |


### Deploy the model as a REST service

In [61]:
set_auth(auth='resource_principal')

# the shape, log ids... need to be specified; it is easier to do that from the UI
xgb_model_deployment = xgb_model.deploy(display_name = "cal_xgb_deploy1")


In [65]:
xgb_model.predict(X_test[:10])

ERROR:ads:ADS Exception
Traceback (most recent call last):
  File "/home/datascience/conda/generalml_p37_cpu_v1/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-65-ab7e74306387>", line 1, in <module>
    xgb_model.predict(X_test[:10])
  File "/home/datascience/conda/generalml_p37_cpu_v1/lib/python3.7/site-packages/ads/model/generic_model.py", line 1108, in predict
    raise NotActiveDeploymentError(current_state)
ads.model.generic_model.NotActiveDeploymentError: To perform a prediction the deployed model needs to be in an active state. The current state is: UPDATING.


NotActiveDeploymentError: To perform a prediction the deployed model needs to be in an active state. The current state is: UPDATING.
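The error above is raised because `predict()` was called while the deployment was still in the UPDATING state. One way to avoid it is to poll the state and call `predict()` only once it is ACTIVE. The sketch below shows a generic polling loop. `wait_until_active` and the `get_state` callable are hypothetical: how the current state is read from ADS is not shown here, so the example simulates it with a fixed sequence of states.

```python
import time


def wait_until_active(get_state, timeout_s=600.0, poll_s=10.0, sleep=time.sleep):
    """Poll get_state() until it returns 'ACTIVE' or the timeout is exceeded.

    get_state is a hypothetical zero-argument callable that returns the
    current lifecycle state of the deployment as a string.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"deployment still in state {state} after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s


# simulated sequence of states, standing in for a real status lookup
states = iter(["UPDATING", "UPDATING", "ACTIVE"])
print(wait_until_active(lambda: next(states), poll_s=0.0))
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting; in a Notebook the default `time.sleep` would be used.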

### Customize score.py

### Model introspection

### Some tests on the code before saving to Model Catalog

### Save to Model Catalog

In [67]:
xgb_model.summary_status()

| Step | Status | Details | Actions Needed |
|---|---|---|---|
| initiate | Done | Initiated the model | |
| prepare() | Done | Generated runtime.yaml | |
| prepare() | Done | Generated score.py | |
| prepare() | Done | Serialized model | |
| prepare() | Done | Populated metadata(Custom, Taxonomy and Provenance) | |
| verify() | Done | Local tested .predict from score.py | |
| save() | Done | Conducted Introspect Test | |
| save() | Done | Uploaded artifact to model catalog | |
| deploy() | UPDATING | Deployed the model | |
| predict() | Not Available | Called deployment predict endpoint | |
