Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Model Catalog August 2021 Release - Saving an XGBoost Model with the OCI Python SDK </font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal>Oracle Cloud Infrastructure Data Science Team</font></p>

***
#  Overview
This example notebook demonstrates how to create an XGBoost binary logistic model and upload it, with metadata and schema, to the model catalog v2.0.

---

## Prerequisites:
 - Experience with specific topic: Novice
 - Professional experience: None
 
---

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, `database_name = "<database_name>"` would become `database_name = "production"`.

---

<font color=gray>Datasets are provided as a convenience. Datasets are considered third party content and are not considered materials under your agreement with Oracle applicable to the services. The [`telco churn` dataset](oracle_data/UPL.txt) is distributed under the UPL license.
</font>


Before you get started, make sure that the `OCI Python SDK` **version 2.43.2** is installed in your conda environment. To install it, run `pip install oci==2.43.2`.

### Model training and preparation 

First, you import the dataset using Pandas:

In [None]:
import pandas as pd

df_data = pd.read_csv('https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/telco_churn%2FTelco-Customer-Churn.csv')
df_data.head(10)

Next, encode the categorical variables:

In [None]:
#import Label Encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dummy_columns = [] #array for multiple value columns
for column in df_data.columns:
    if df_data[column].dtype == object and column != 'customerID':
        if df_data[column].nunique() == 2:
            #apply Label Encoder for binary ones
            df_data[column] = le.fit_transform(df_data[column]) 
        else:
            dummy_columns.append(column)
#apply get dummies for selected columns
df_data = pd.get_dummies(data = df_data,columns = dummy_columns)
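
To make the two encodings concrete, here is a minimal pure-Python sketch of what they produce on a toy column. It is an illustration only, not how `LabelEncoder` or `pd.get_dummies` are implemented, and the helper names `label_encode` and `one_hot` are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[value] for value in values]


def one_hot(values, prefix):
    """Return one indicator column per distinct value, keyed as '<prefix>_<value>'."""
    classes = sorted(set(values))
    return {f"{prefix}_{cls}": [int(value == cls) for value in values] for cls in classes}


# A two-valued column becomes a single 0/1 column.
partner_encoded = label_encode(["Yes", "No", "No", "Yes"])

# A column with more than two values becomes one indicator column per value.
contract_encoded = one_hot(["One year", "Month-to-month", "Two year"], prefix="Contract")
```

In this sketch, `"No"` maps to 0 and `"Yes"` maps to 1 because the distinct values are sorted before codes are assigned.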

Now, train an XGBoost classifier with the binary logistic objective on the dataset:

In [None]:
# install xgboost and scikit-learn if they are not available already in your conda environment 
from sklearn.model_selection import train_test_split
import xgboost as xgb
#create feature set and labels
X = df_data.drop(['Churn','customerID'],axis=1)
y = df_data.Churn
#train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)
#building the model & printing the score
xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.08, objective= 'binary:logistic',n_jobs=-1,enable_categorical=True).fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'.format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'.format(xgb_model.score(X_test[X_train.columns], y_test)))

Next, save the trained model to a temporary folder. You can specify any path you need. The model is saved as a pickle file.

## Boilerplate Code

You can download model artifact boilerplate code at: 
[https://github.com/oracle/oci-data-science-ai-samples/blob/master/model_catalog_examples/artifact_boilerplate/artifact_boilerplate.zip](https://github.com/oracle/oci-data-science-ai-samples/blob/master/model_catalog_examples/artifact_boilerplate/artifact_boilerplate.zip)

In [None]:
#!wget https://github.com/oracle/oci-data-science-ai-samples/raw/master/model_catalog_examples/artifact_boilerplate/artifact_boilerplate.zip
#!unzip -d artifact_boilerplate/ ./artifact_boilerplate.zip

In [None]:
import os
import pickle

#Replace with your own path: 
path_to_artifact = "./ms-test-artifact-0802"
os.makedirs(path_to_artifact, exist_ok=True)
# Serialize the trained model; the context manager closes the file handle.
with open(os.path.join(path_to_artifact, "churn_prediction_model.pkl"), "wb") as model_file:
    pickle.dump(xgb_model, model_file)
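
Before building the rest of the artifact, it can be worth confirming that a pickled object loads back intact. The sketch below shows the round trip with a stand-in dictionary in a temporary directory. `stand_in_model` and the file location are assumptions for illustration; in the notebook you would instead load `churn_prediction_model.pkl` from `path_to_artifact` and compare its predictions with those of `xgb_model`.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (hypothetical; any picklable object works here).
stand_in_model = {"max_depth": 5, "learning_rate": 0.08}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "churn_prediction_model.pkl")
    # Write the object to disk, then read it back from the same file.
    with open(model_path, "wb") as model_file:
        pickle.dump(stand_in_model, model_file)
    with open(model_path, "rb") as model_file:
        reloaded = pickle.load(model_file)

round_trip_ok = reloaded == stand_in_model
```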

Open the boilerplate artifact code that you downloaded above.<br>
Edit `score.py` to return the XGBoost prediction output. You can also edit the `runtime.yaml` file with your own configuration.<br>
Zip the model file along with the `score.py` and `runtime.yaml` files.<br>

In [None]:
%%writefile "{path_to_artifact}/score.py"
import json
import os
import pickle
import pandas as pd


model_name = 'churn_prediction_model.pkl'

"""
   Inference script. This script is used for prediction by scoring server when schema is known.
"""


def load_model(model_file_name=model_name):
    """
    Loads model from the serialized format

    Returns
    -------
    model:  a model instance on which predict API can be invoked
    """
    model_dir = os.path.dirname(os.path.realpath(__file__))
    contents = os.listdir(model_dir)
    if model_file_name in contents:
        with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), model_file_name), "rb") as file:
            return pickle.load(file)
    else:
        raise Exception('{0} is not found in model directory {1}'.format(model_file_name, model_dir))


def predict(data, model=load_model()):
    """
    Returns prediction given the model and data to predict

    Parameters
    ----------
    model: Model instance returned by load_model API
    data: Data format as expected by the predict API of the core estimator. For example, for scikit-learn models it could be a NumPy array, a list of lists, or a pandas DataFrame

    Returns
    -------
    predictions: Output from scoring server
        Format: {'prediction':output from model.predict method}

    """
    
    from pandas import read_json, DataFrame
    from io import StringIO
    X = read_json(StringIO(data)) if isinstance(data, str) else DataFrame.from_dict(data)
    pred = model.predict(X).tolist()
    return {'prediction': pred}


In [None]:
%%writefile "{path_to_artifact}/runtime.yaml"
MODEL_ARTIFACT_VERSION: '3.0'
MODEL_DEPLOYMENT:
  INFERENCE_CONDA_ENV:
    INFERENCE_ENV_PATH: oci://service_conda_packs@ociodscdev/service_pack/cpu/General
      Machine Learning for CPUs/1.0/mlcpuv1
    INFERENCE_ENV_SLUG: mlcpuv1
    INFERENCE_ENV_TYPE: data_science
    INFERENCE_PYTHON_VERSION: '3.6.11'

Test the artifact before saving it to the model catalog:

In [None]:
data = X_test[:5].to_json()

In [None]:
import sys

# Add the local path to your model artifact directory to the Python path.
sys.path.insert(0, f"{path_to_artifact}/")

# Import load_model() and predict(), which are defined in score.py.
from score import load_model, predict

# Load the model into memory.
loaded_model = load_model()

# data is already a JSON string (created with X_test[:5].to_json() above),
# which is the format predict() expects.
predictions_test = predict(data, loaded_model)
# Compare the predictions captured in predictions_test with what you expect for data:
predictions_test
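
`DataFrame.to_json()` uses the column orientation by default, so the payload passed to `predict()` maps each column name to an `{index: value}` object. A client without pandas could build the same shape by hand. The sketch below assumes a list of row dictionaries; the helper name `rows_to_column_json` and the two sample rows are hypothetical, and a real request needs every feature column the model was trained on.

```python
import json


def rows_to_column_json(rows):
    """Convert a list of row dicts into a column-oriented JSON string."""
    columns = {}
    for position, row in enumerate(rows):
        for name, value in row.items():
            # JSON object keys are strings, so the row position is stored as text.
            columns.setdefault(name, {})[str(position)] = value
    return json.dumps(columns)


# Hypothetical rows with only two of the feature columns, for brevity.
sample_rows = [
    {"tenure": 1, "MonthlyCharges": 29.85},
    {"tenure": 34, "MonthlyCharges": 56.95},
]
payload = rows_to_column_json(sample_rows)
```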

## Run introspection tests

Before running the introspection tests, install the required libraries: 

In [None]:
#!python3 -m pip install --user -r boilerplate-template/artifact_introspection_test/requirements.txt

Run the tests. Two files with the test results will be generated in the current working directory: `test_html_output.html` and `test_json_output.json`. 

In [None]:
!python3 boilerplate-template/artifact_introspection_test/model_artifact_validate.py --artifact {path_to_artifact}
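
Once `test_json_output.json` exists, a short script can list the checks that did not pass. This sketch assumes the file is a JSON object that maps each test name to an object with a boolean `success` field and an optional `error_msg` field. That layout is an assumption, so inspect your own file before relying on it. The sample entries are made up for illustration.

```python
def failed_checks(results):
    """Return (test name, error message) pairs for entries whose 'success' is not True."""
    return [
        (name, details.get("error_msg", ""))
        for name, details in results.items()
        if details.get("success") is not True
    ]


# Hypothetical results; in the notebook, use json.load() on test_json_output.json instead.
sample_results = {
    "score_py": {"success": True},
    "runtime_yaml": {"success": False, "error_msg": "runtime.yaml not found"},
}
failures = failed_checks(sample_results)
```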

We then create the zip archive of the model artifact: 

In [None]:
# Upload the modified score.py and runtime.yaml files from Steps 2, 3, and 4 into the directory where the model is present
import zipfile
    
def zipdir(target_zip_path, ziph, source_artifact_directory):
    ''' Creates a zip archive of a model artifact directory. 
    
    Parameters: 
    
    - target_zip_path: the base directory that the paths stored in the archive are made relative to
    - ziph: a zipfile.ZipFile object 
    - source_artifact_directory: the path to the artifact directory. 
    
    '''
    for root, dirs, files in os.walk(source_artifact_directory):
        for file in files:
            ziph.write(os.path.join(root, file), 
                       os.path.relpath(os.path.join(root,file), 
                                       os.path.join(target_zip_path,'.')))
      
# The context manager closes the archive even if zipdir() raises an error.
with zipfile.ZipFile(f'{path_to_artifact}.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir('.', zipf, f"{path_to_artifact}")
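
Before uploading, it may help to confirm that the archive actually contains the files the scoring server needs. This is a minimal sketch, assuming `score.py` and `runtime.yaml` are the required files (as in the artifact built above). `missing_artifact_files` is a hypothetical helper, and the demo archive is created in a temporary directory.

```python
import os
import tempfile
import zipfile


def missing_artifact_files(zip_path, required=("score.py", "runtime.yaml")):
    """Return the required file names that do not appear anywhere in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        present = {os.path.basename(name) for name in archive.namelist()}
    return [name for name in required if name not in present]


# Demo with a throwaway archive that is missing runtime.yaml.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo-artifact.zip")
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("demo-artifact/score.py", "# placeholder")
    missing = missing_artifact_files(demo_zip)
```

In the notebook, the equivalent call would be `missing_artifact_files(f"{path_to_artifact}.zip")`, which should return an empty list.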

The input data schema: 

In [None]:
%%writefile "./Churn_prediction_Input_schema.json"

{
  "schema": [
    {
      "description": "A unique identifier for a customer",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "freetext"
      },
      "name": "customerID",
      "required": true,
      "type": "string"
    },

  {
      "description": "Gender",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "Male,Female"
      },
      "name": "gender",
      "required": false,
      "type": "category"
    },
{
      "description": "Senior Citizen",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "0,1"
      },
      "name": "seniorcitizen",
      "required": false,
      "type": "boolean"
    },
{
      "description": "Partner",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "Yes,No"
      },
      "name": "partner",
      "required": false,
      "type": "category"
    },
{
      "description": "Dependents",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "Yes,No"
      },
      "name": "dependents",
      "required": false,
      "type": "category"
    },
{
      "description": "InternetService",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "DSL, Fiber Optic, No"
      },
      "name": "internetservice",
      "required": false,
      "type": "category"
    },
{
      "description": "Phone Service",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "Yes, No"
      },
      "name": "PhoneService",
      "required": false,
      "type": "category"
    },
{
      "description": "totalcharges",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "number"
      },
      "name": "totalcharges",
      "required": false,
      "type": "float"
    }
  ]
}

The output data schema (model predictions): 

In [None]:
%%writefile "./Churn_prediction_Output_schema.json"

{
  "predictionschema": [
    {
      "description": "Churn prediction",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "Yes,No"
      },
      "name": "churn",
      "required": true,
      "type": "category"
    }

   ]
}
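
A quick structural check can catch typos in hand-written schema files. The sketch below only verifies that each entry carries the fields used in the two schemas above (`name`, `type`, `required`, `description`, `domain`). It is not the validation that the model catalog performs, and `schema_problems` is a hypothetical helper.

```python
EXPECTED_FIELDS = ("name", "type", "required", "description", "domain")


def schema_problems(schema_doc, top_level_key):
    """Return human-readable problems found in a schema document (empty list if none)."""
    entries = schema_doc.get(top_level_key)
    if not isinstance(entries, list):
        return [f"'{top_level_key}' must be a list"]
    problems = []
    for position, entry in enumerate(entries):
        for field in EXPECTED_FIELDS:
            if field not in entry:
                problems.append(f"entry {position} is missing '{field}'")
    return problems


# Demo: the second entry has no 'type'.
demo_schema = {
    "schema": [
        {"name": "gender", "type": "category", "required": False,
         "description": "Gender", "domain": {"values": "Male,Female"}},
        {"name": "totalcharges", "required": False,
         "description": "totalcharges", "domain": {"values": "number"}},
    ]
}
problems = schema_problems(demo_schema, "schema")
```

In the notebook, you could load each file with `json.load()` and pass `"schema"` for the input file or `"predictionschema"` for the output file.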

### Creating Model and Model Metadata
Step 1: Create the provenance details with `repository_url`, `git_branch`, `git_commit`, `script_dir`, and `training_id`. The training ID is the Job Run OCID or the Notebook session OCID.<br>
Step 2: Create the defined metadata with the keys `UseCaseType`, `Framework`, `FrameworkVersion`, `Algorithm`, and `hyperparameters`.<br>
Step 3: Create the custom metadata with key values and an allowed category type.<br>
Step 4: Create the input and output schemas from the JSON files. This step is only allowed at the time of model creation.<br>
Step 5: Upload the artifact introspection test results in JSON format to the metadata key named `ArtifactTestResults`.<br>
Step 6: Create the model entry in the catalog.<br>
Step 7: Upload the model artifact.<br>

In [None]:
# Create a default config using DEFAULT profile in default location
# Refer to
# https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm#SDK_and_CLI_Configuration_File
# for more info

# Initialize service client with default config file
import json
from json import load
import oci
from oci.data_science.models import CreateModelDetails, Metadata, CreateModelProvenanceDetails, UpdateModelDetails, UpdateModelProvenanceDetails
config = oci.config.from_file()
data_science_client = oci.data_science.DataScienceClient(config=config)

# Step 1: 
provenance_details = CreateModelProvenanceDetails(repository_url="www.oracle.com",
                                                  git_branch="AI Samples",
                                                  git_commit="master",
                                                  script_dir="script",
                                                  # OCID of the ML job Run or Notebook session on which this model was
                                                  # trained
                                                  training_id=os.environ['NB_SESSION_OCID']
                                                  )
# Step 2: 
defined_metadata_list = [
    Metadata(key="UseCaseType", value="binary_classification"),
    Metadata(key="Framework", value="xgboost"),
    Metadata(key="FrameworkVersion", value="0.2.0"),
    Metadata(key="Algorithm",value="Classifier"),
    Metadata(key="hyperparameters",value="{\"max_depth\":\"5\",\"learning_rate\":\"0.08\",\"objective\":\"Binary Logistic\"}")
]

# Step 3: Adding your own custom metadata:
custom_metadata_list = [
    Metadata(key="Accuracy Limit", value="70-90%", category="Performance",
             description="Performance accuracy accepted"),
    Metadata(key="Sourcing", value="https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/telco_churn%2FTelco-Customer-Churn.csv", category="other",
             description="Source for  training data")
]

# Step 4: 
# Declare the input/output schema for our model - this is optional.
# Each schema must be a valid JSON or YAML string.
# Like the model artifact, the schema is immutable: it can only be set at model creation time and cannot be updated.
with open('Churn_prediction_Input_schema.json', 'rb') as schema_file:
    input_schema = load(schema_file)
input_schema_str = json.dumps(input_schema)
with open('Churn_prediction_Output_schema.json', 'rb') as schema_file:
    output_schema = load(schema_file)
output_schema_str = json.dumps(output_schema)

# Step 5: 
# Provide the introspection test results
with open('test_json_output.json', 'rb') as results_file:
    test_results = load(results_file)
test_results_str = json.dumps(test_results)
defined_metadata_list.extend([Metadata(key="ArtifactTestResults", value=test_results_str)])

# Step 6: 
# creating a model details object:
model_details = CreateModelDetails(
    compartment_id=os.environ["NB_SESSION_COMPARTMENT_OCID"],
    project_id=os.environ["PROJECT_OCID"],
    display_name='Churn prediction using xgboost algo',
    description='Churn prediction of Telco customers',
    custom_metadata_list=custom_metadata_list,
    defined_metadata_list=defined_metadata_list,
    input_schema=input_schema_str,
    output_schema=output_schema_str)
# creating the model object:
model = data_science_client.create_model(model_details)
# adding the provenance:
data_science_client.create_model_provenance(model.data.id, provenance_details)
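
The hand-escaped `hyperparameters` string in Step 2 is easy to get wrong. As an alternative sketch, the same value can be produced with `json.dumps` from a dictionary. The variable names here are illustrative.

```python
import json

hyperparameters = {
    "max_depth": "5",
    "learning_rate": "0.08",
    "objective": "Binary Logistic",
}
# Compact separators reproduce the string used in Step 2 exactly.
hyperparameters_str = json.dumps(hyperparameters, separators=(",", ":"))
```

The result could then be passed as `Metadata(key="hyperparameters", value=hyperparameters_str)`.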


### Uploading Model artifact
The final step is to upload the model artifact zip archive:

In [None]:
# Step 7: 
# Upload the model artifact
with open(f'{path_to_artifact}.zip','rb') as artifact_file:
    artifact_bytes = artifact_file.read()
    data_science_client.create_model_artifact(model.data.id, artifact_bytes, content_disposition='attachment; filename="ms-test-artifact.zip"')
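
The `content_disposition` value above hard-codes a file name that differs from the name of the archive created earlier. If you prefer the two to match, the header value could be derived from the artifact path. `artifact_content_disposition` is a hypothetical helper, not part of the OCI SDK.

```python
import os


def artifact_content_disposition(artifact_dir):
    """Build a Content-Disposition value named after the artifact directory's zip file."""
    zip_name = os.path.basename(os.path.normpath(artifact_dir)) + ".zip"
    return f'attachment; filename="{zip_name}"'


header = artifact_content_disposition("./ms-test-artifact-0802")
```

In the upload cell, this would be used as `content_disposition=artifact_content_disposition(path_to_artifact)`.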