# Lab 3 and Lab 4: Training and Deployment

---

## Introduction

This notebook demonstrate how to train a model and deploy it. We will go through the steps to train your model using the available data and then we'll validate the predictions using a subset of the data. Once we're done validating the data, we'll deploy the model and generate predictions.

## Index

---

1. [Prerequisites](<#Prerequisites>)
1. [Data Handling](#Data-Handling)
1. [Lab 3: Train a model using XGBoost](<#Lab-3:-Train-a-Model-using-XGBoost>)
    1. [Training: Create and fit the estimator](#Training:-Create-and-fit-the-estimator)
    1. [Deposit trained model in SageMaker Model Registry](<#Deposit-trained-model-in-SageMaker-Model-Registry>)
1. [Deploy and serve the model](<#Lab-4:-Deploy-and-serve-the-model>)
    1. [Evaluate trained model and update status in the model registry](<#Evaluate-trained-model-and-update-status-in-the-model-registry>)
    1. [Model deployment](#Model-deployment)
    1. [Create/update endpoint](#Create/Update-endpoint)
    1. [Predictor interface](#Predictor-interface)
1. [(Optional) Clean-up](<#Optional-clean-up>)


## Prerequisites

---

Install and update required libraries

Import Python libraries we are going to use in the notebook

In [1]:
# Cell 1
import json
import time
import boto3
import sagemaker
import pandas as pd

from sagemaker.xgboost.estimator import XGBoost
from scripts.inference_specification import InferenceSpecification

Set general parameters, as region and initiate boto3 and SageMaker SDK variables. You can adjust the code to use a region of your choice.

In [2]:
# Cell 2
# Set Region
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)
s3_client = boto3.client("s3", region_name=region)

# Get SageMaker client, role and session
sagemaker_boto_client = boto_session.client("sagemaker")
#sagemaker_role = sagemaker.get_execution_role()
sagemaker_role  = "arn:aws:iam::975050200450:role/fraud-detection-workshop-SageMakerExecutionRole-4rmzgJyf2wbK"
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_boto_client
)

Using AWS Region: us-east-1


Set parameterized variables:
* bucket - S3 Bucket name. You can adjust the code to use a bucket of your choice.
* Prefix - String which will be used to identify different resources.
* Training parameters:
    * `estimator_output_path` - S3 location for saving the training result (model artifacts and output files).
    * `train_instance_count` - Number of Amazon EC2 instances to use for training.
    * `train_instance_type` - Type of EC2 instance to use for training

In [3]:
# Cell 3
# Bucket 
bucket = sagemaker_session.default_bucket()
print(bucket)
prefix = "fraud-detect-demo"

# Training parameters
estimator_output_path = f"s3://{bucket}/{prefix}/training_jobs"
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"

sagemaker-us-east-1-975050200450


## Data Handling

---

In this lab, you will train your machine learning model with the training dataset. There are few ways to obtain the dataset:
1. Use the dataset you uploaded to Amazon S3 bucket in the previous Lab (Lab 2 - Store Features in Feature Store). 
2. Upload the following datasets from `data` folder to Amazon S3: `train.csv`, `test.csv`

The following code uploads the datasets from `data` folder to Amazon S3.

The code can be adjusted to use datasets created in previous labs or elsewhere.

In [4]:
# Cell 4
# Set data URI for training input
train_data_uri = f"s3://{bucket}/{prefix}/data/train/train.csv"
test_data_uri = f"s3://{bucket}/{prefix}/data/test/test.csv"

# Upload data files
s3_client.upload_file(Filename="data/train.csv", Bucket=bucket, Key=f"{prefix}/data/train/train.csv")
s3_client.upload_file(Filename="data/test.csv", Bucket=bucket, Key=f"{prefix}/data/test/test.csv")

## Lab 3: Train a Model using XGBoost

---

To train a model in SageMaker, you create a training job. The training job includes the following information:
1. The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.
2. The compute resources that you want SageMaker to use for model training. Compute resources are ML compute instances that are managed by SageMaker.
3. The URL of the S3 bucket where you want to store the output of the job.
4. The Amazon Elastic Container Registry path where the training code is stored.

Lets walkthrough how the training process works:
1. Invoke model.fit() to start the training process  
2. Process downloads your training algorithm from ECR
3. Training data from S3 is used to train the model and create the model file
4. Model file is written to S3 which will be used as input for inferencing in the Deploy process


![training-job.png](images/notebooks/training-job.png)

Once the training and test datasets have been persisted in S3, you can start training a model by defining which SageMaker Estimator you'd like to use. For this guide, you will use the [XGBoost Open Source Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/xgboost.html) to train your model. This estimator is accessed via the SageMaker SDK, but mirrors the open source version of the [XGBoost Python package](https://xgboost.readthedocs.io/en/latest/python/index.html). Any functionality provided by the XGBoost Python package can be implemented in your training script.
This estimator includes reference to the relevant training algorithm from ECR.


XGBoost is an extremely popular, open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.

### Set HyperParameters
These are the parameters which will be sent to our training script in order to train the model. Although they are all defined as "hyperparameters" here, they can encompass XGBoost's [Learning Task Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters), [Tree Booster Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster), or any other parameters you'd like to configure for XGBoost.

For this example, we will use the following hyperparameters for the XGBoost algorithm:

* `max_depth` - Controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
* `num_round` - Controls the number of boosting rounds. This is essentially the subsequent models that are trained using the residuals of previous iterations. More rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
* `eta` - Controls how aggressive each round of boosting is. Larger values lead to more conservative boosting.
* `objective` - Specifies the learning task and the corresponding learning objective. Use `binary:logistic` for binary classification task.

In [5]:
# Cell 5
hyperparameters = {
    "max_depth": "3",
    "eta": "0.2",
    "objective": "binary:logistic",
    "num_round": "100",
}

### Training: Create and fit the estimator
If you want to explore the breadth of functionality offered by the SageMaker XGBoost Framework you can read about all the configuration parameters by referencing the inheriting classes. The XGBoost class inherits from the Framework class and Framework inherits from the EstimatorBase class:
* [XGBoost Estimator documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/xgboost.html#sagemaker.xgboost.estimator.XGBoost)
* [Framework documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Framework)
* [EstimatorBase documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)

For this example, we will use the following parameters for the XGBoost estimator:
* `entry_point` - Path to the Python source file which should be executed as the entry point to training.
* `hyperparameters` - Hyperparameters that will be used for training. The hyperparameters are made accessible as a dict[str, str] to the training code on SageMaker.
* `output_path` - S3 location for saving the training result (model artifacts and output files).
* `framework_version` - XGBoost version you want to use for executing your model training code.
* `instance_type` - Type of EC2 instance to use for training.

The following code will launch the training job and store the trained model into S3, **the training process should take ~4 minutes**.

In [6]:
# Cell 6
# Creating the SageMaker Estimator object
xgb_estimator = XGBoost(
    entry_point="scripts/xgboost_starter_script.py",
    output_path=estimator_output_path,
    code_location=estimator_output_path,
    hyperparameters=hyperparameters,
    role=sagemaker_role,
    instance_count=train_instance_count,
    instance_type=train_instance_type,
    framework_version="1.0-1",
)

# Training the model
xgb_estimator.fit(inputs={"train": train_data_uri})

# Get Training job name
training_job_name = xgb_estimator.latest_training_job.job_name
print("Training job name: {}".format(training_job_name))

2024-09-10 01:15:15 Starting - Starting the training job....
2024-09-10 01:15:39 Starting - Preparing the instances for trainingProfilerReport-1725930915: InProgress
....
2024-09-10 01:16:23 Downloading - Downloading the training image....
2024-09-10 01:17:04 Training - Training image download completed. Training in progress...[2024-09-10 01:17:15.418 ip-10-0-199-134.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Invoking user training script.
INFO:sagemaker-containers:Module xgboost_starter_script does not provide a setup.py. 
Generating setup.py
INFO:sagemaker-containers:Generating setup.cfg
INFO:sagemaker-containers:Generating MANIFEST.in
INFO:sagemaker-containers:Installing module with the following command:
/miniconda3/bin/python3 -m pip install . 
Processing /opt/m

Notice that the output includes the value of `Billable seconds`, which is the amount of time you will be actually charge for.

### Deposit trained model in SageMaker Model Registry

Once a useful model has been trained and its artifacts properly associated, the next step is to save the model in a registry for future reference and possible deployment. In this section, we will see how you can package your artifacts into a ModelPackage.


#### Create Model Package Group
A Model Package Group holds multiple versions or iterations of a model. Though it is not required to create them for every model in the registry, they help organize various models which all have the same purpose and provide automatic versioning. A Model Package is a reusable model artifact abstraction that packages all ingredients necessary for inference. 

Setup the Model Package Group name according to the `prefix` variable.

In [9]:
# Cell 7
mpg_name = prefix
print(f"Model Package Group name: {mpg_name}")

Model Package Group name: fraud-detect-demo


Create a Model Package dictionary that will contain all the information on the model, the following cells will add information to it.

In [10]:
# Cell 8
mpg_input_dict = {
    "ModelPackageGroupName": mpg_name,
    "ModelPackageGroupDescription": "Insurance claim fraud detection",
}

If we already have a Model Package Group with this name, we can use it. If we don't have one, we need to create a new one.

In [11]:
# Cell 9
matching_mpg = sagemaker_boto_client.list_model_package_groups(NameContains=mpg_name)['ModelPackageGroupSummaryList']

# Check if Model Package Group already exist
if matching_mpg:
    print(f'Using existing Model Package Group: {mpg_name}')
else:
    mpg_response = sagemaker_boto_client.create_model_package_group(**mpg_input_dict)
    print(f'Create Model Package Group {mpg_name}: SUCCESSFUL')
    %store mpg_name

Using existing Model Package Group: fraud-detect-demo


Get a training job description so we can extract relevant information from it.

In [12]:
# Cell 10
training_job_info = sagemaker_boto_client.describe_training_job(
    TrainingJobName=training_job_name
)

The model package consists of an inference specification that defines the inference image to use along with an optional model weights location. `InferenceSpecification` module is implemented in `model_package_src` folder.

In [13]:
# Cell 11
mp_inference_spec = InferenceSpecification().get_inference_specification_dict(
    ecr_image=training_job_info["AlgorithmSpecification"]["TrainingImage"],
    supports_gpu=False,
    supported_content_types=["text/csv"],
    supported_mime_types=["text/csv"],
)

mp_inference_spec["InferenceSpecification"]["Containers"][0]["ModelDataUrl"] = training_job_info["ModelArtifacts"]["S3ModelArtifacts"]

Adding information to the Model Package dictionary, including inference specification, and creating the model package.

In [14]:
# Cell 12
mp_input_dict = {
    "ModelPackageGroupName": mpg_name,
    "ModelPackageDescription": "XGBoost classifier to detect insurance fraud.",
    "ModelApprovalStatus": "PendingManualApproval",
}

mp_input_dict.update(mp_inference_spec)

# Create the Model Package
mp_response = sagemaker_boto_client.create_model_package(**mp_input_dict)

Loop and wait until the model package will be added to the registry.

In [15]:
# Cell 13
mp_info = sagemaker_boto_client.describe_model_package(
    ModelPackageName=mp_response["ModelPackageArn"]
)
mp_status = mp_info["ModelPackageStatus"]

while mp_status not in ["Completed", "Failed"]:
    time.sleep(60)
    mp_info = sagemaker_boto_client.describe_model_package(
        ModelPackageName = mp_response["ModelPackageArn"]
    )
    mp_status = mp_info["ModelPackageStatus"]
    print(f"model package status: {mp_status}")
print(f"model package status: {mp_status}")

model package status: Completed


At this point you will be able to view model package in Model registry. Navigate to the model registry by clicking on the `Home` icon in the left sidebar and then to the `Model` menu.  

### Congratulations! 

You have finished Lab 3. Please return to workshop studio to understand what's next.


## Lab 4: Deploy and serve the model

---

Once training is completed, we can deploy the trained model as a real time endpoint using Amazon SageMaker hosting services. This will allow us to make predictions (or inference) from the model. The steps below will demonstrate how to: 
 - Evaluate trained model.
 - Based on the evaluation results, update the model approval status in the model registry.
 - Deploy the model as real time endpoint using SageMaker hosting services.
 - Using Dataset and Claims customer dataset prepared as part of Lab 1, run the inference against sample policy id.
 
 This lab will take approximately 10 mins to run.
 

### Evaluate trained model and update status in the model registry

When we registered the model in the model registry, the default status is "Pending Approval" status. In the real-life MLOps lifecycle, a model package gets approved after evaluation by data scientists, subject matter experts and auditors. For the purpose of this lab, we will evaluate the model with test dataset that was created during training process. 

#### Model Evaluation

The following script evaluates the trained model performance metric against a certain threshold value. We will use AUC metric as the model performance metric criteria and set the minimum threshold value as 0.7.

Create SageMaker Processing Job providing:
- The evaluation script
- The model artifact located in S3 (from previous Training job)
- The test data located in S3

The processing job will approximately take 3 mins to run.


In [20]:
# Cell 15
from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)

script_processor = ScriptProcessor(command=['python3'],
                    base_job_name=prefix,
                    image_uri=training_job_info["AlgorithmSpecification"]["TrainingImage"],
                    role=sagemaker_role,
                    instance_count=1,
                    instance_type='ml.t3.medium')
                    #instance_type='ml.m5.xlarge')

# model.tar.gz S3 location 
model_s3_uri = training_job_info['ModelArtifacts']['S3ModelArtifacts']
report_output_dir = f"s3://{bucket}/{prefix}/evaluation_jobs"

script_processor.run(
    code="./scripts/evaluate.py",
    inputs=[
        ProcessingInput(source=model_s3_uri, destination="/opt/ml/processing/model"),
        ProcessingInput(source=test_data_uri, destination="/opt/ml/processing/test"),
    ],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation", destination=report_output_dir)],
)

evaluation_job_description = script_processor.jobs[-1].describe()


Job Name:  fraud-detect-demo-2024-09-07-20-28-39-337
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/training_jobs/sagemaker-xgboost-2024-09-07-20-14-18-480/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo/data/test/test.csv', 'LocalPath': '/opt/ml/processing/test', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-975050200450/fraud-detect-demo-2024-09-07-20-28-39-337/input/code/evaluate.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': '

In [21]:
# Cell 16
# Download evaluation score from the Processing job output
eval_report = "./outputs/evaluation.json"
eval_report_dir = evaluation_job_description['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri'].replace(f"s3://{bucket}/", '')
s3_client.download_file(bucket, f"{eval_report_dir}/evaluation.json", eval_report)

with open(eval_report) as f:
    model_score_report = json.load(f)

print(model_score_report)

{'binary_classification_metrics': {'accuracy': {'value': 0.973, 'standard_deviation': 'NaN'}, 'auc': {'value': 0.8324768756423432, 'standard_deviation': 'NaN'}}}


#### Evaluate model AUC score. 
Based on the results, update model approval status for that version. Please note, we are using API's to Approve/Reject model approval status. You can also update the approval status with the SageMaker Studio UI.

In [22]:
# Cell 17
eval_auc_threshold = 0.7
model_approval_stat = "PendingManualApproval"
model_auc_eval_score = model_score_report['binary_classification_metrics']['auc']['value']

if model_auc_eval_score >= eval_auc_threshold:
    model_approval_stat = "Approved"
    print("Model approved!")
else:
    model_approval_stat = "Rejected"
    print("Model rejected!")
    print("You should re-evaliuate data or traning script to improve the model AUC")

modelPackageList=sagemaker_boto_client.list_model_packages(ModelPackageGroupName=mpg_name,SortBy="CreationTime", SortOrder="Descending")["ModelPackageSummaryList"]

#Get the model packge ARN from  Model Package. 
model_package_arn = modelPackageList[0]["ModelPackageArn"]
print(model_package_arn)

model_package_update_input_dict = {
    "ModelPackageArn" : model_package_arn,
    "ModelApprovalStatus" : model_approval_stat
}
#update the model package registry with appropriate status
model_package_update_response = sagemaker_boto_client.update_model_package(**model_package_update_input_dict)

Model approved!
arn:aws:sagemaker:us-east-1:975050200450:model-package/fraud-detect-demo/2


### Model deployment
In the previous step, the model status was adjudicated based on AUC of the test dataset. If the AUC was above the defined threshold , then the approval status for that particular version is set to "Approved". At this stage, the model is ready for deployment. Please note, we do not have explicit step to stop the model deployment if the AUC was below threshold. In the real-life MLOPS lifecycle, the model approval status is set to "Rejected" if the model performance is not per the business KPI's and you will either tweak the data processing or training script to improve the model performance. 

Once the model is approved, it is ready for deployment. We will deploy the model for real time inference using [SageMaker Hosting service](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html). In order to deploy the endpoint, we will 
- Define interface endpoint requirements.
- Create Model
- Create Endpoint config
- Create/Update endpoint using newly created EndpointConfig.
- Test the endpoint with sample data.

In [23]:
# Cell 18
# define end point hosting configuration
endpoint_name = f"{prefix}-endpoint"
endpoint_instance_count = 1
endpoint_instance_type = "ml.m4.xlarge"

#### Create Model 
The code below creates a model object from the model version by calling the create_model method. Pass the Amazon Resource Name (ARN) of the model version as part of the Containers for the model object. You can also deploy the model directly using Model package group, but the aim of deploy example is to demonstrate how you can create model from the model version and create/update existing endpoint.


In [24]:
# Cell 19
from sagemaker import ModelPackage
from time import gmtime, strftime

model = ModelPackage(role=sagemaker_role,
         model_package_arn=model_package_arn,
         sagemaker_session=sagemaker_session)
print(model)

model_name = prefix+"-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name : {}".format(model_name))
container_list = [{'ModelPackageName': model_package_arn}]


create_model_response = sagemaker_boto_client.create_model( ModelName=model_name,
                                         ExecutionRoleArn=sagemaker_role,
                                            Containers = container_list
                                       )


<sagemaker.model.ModelPackage object at 0x7f84e69402e0>
Model name : fraud-detect-demo-2024-09-07-20-36-29


#### Endpoint config
To host your model, you create an endpoint configuration with the CreateEndpointConfig API, and then create an endpoint with the CreateEndpoint API. SageMaker then deploys all of the containers that you defined for the model in the hosting environment.


In [25]:
# Cell 20
endpoint_config_name = f"{prefix}-endpoint-config"+ strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_ep_config_response = sagemaker_boto_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": endpoint_instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": endpoint_instance_count,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

#### Create/Update endpoint 
We create or update existing endpoint with the endpoint configuration that was created in the earlier stage.
The endpoint creation will approximately take 5 mins complete.

In [26]:
# Cell 21
existing_endpoints = sagemaker_boto_client.list_endpoints(
    NameContains=endpoint_name, MaxResults=30)["Endpoints"]
if not existing_endpoints:
    create_endpoint_response = sagemaker_boto_client.create_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
    )

endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_status = endpoint_info["EndpointStatus"]

while endpoint_status == "Creating":
    endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=endpoint_name)
    endpoint_status = endpoint_info["EndpointStatus"]
    print("Endpoint status:", endpoint_status)
    if endpoint_status == "Creating":
        time.sleep(60)

Endpoint status: Creating
Endpoint status: Creating
Endpoint status: Creating
Endpoint status: Creating
Endpoint status: InService


### Predictor interface
Make real-time predictions against SageMaker endpoints with Python objects. We will use data from `dataset.csv` to test predictios.

In [27]:
# Cell 22
predictor = sagemaker.predictor.Predictor(
    endpoint_name=endpoint_name, sagemaker_session=sagemaker_session
)

In [28]:
policydataset = pd.read_csv("./data/dataset.csv")
sample_policy_id = int(policydataset.sample(1)["policy_id"])

  sample_policy_id = int(policydataset.sample(1)["policy_id"])


#### Calling Predictor API
 Making inference request using Claims Customer data that is present in the CSV format. In production environment, application that calls the endpint will send the  will receive the customer data.


In [29]:
# Cell 23
# Read claims-customer.csv file in dataframe and list columns
dataset = pd.read_csv("./data/claims_customer.csv")
col_order = ["fraud"] + list(dataset.drop(["fraud", "Unnamed: 0", "policy_id"], axis=1).columns)
col_order

['fraud',
 'authorities_contacted_police',
 'injury_claim',
 'num_vehicles_involved',
 'incident_type_collision',
 'policy_liability',
 'customer_gender_male',
 'num_injuries',
 'incident_dow',
 'collision_type_front',
 'driver_relationship_self',
 'authorities_contacted_fire',
 'incident_day',
 'total_claim_amount',
 'incident_month',
 'incident_severity',
 'policy_state_or',
 'customer_gender_female',
 'collision_type_rear',
 'policy_state_id',
 'incident_type_theft',
 'policy_deductable',
 'policy_state_az',
 'policy_state_ca',
 'num_witnesses',
 'driver_relationship_spouse',
 'driver_relationship_na',
 'policy_state_nv',
 'vehicle_claim',
 'policy_annual_premium',
 'driver_relationship_other',
 'incident_hour',
 'collision_type_na',
 'driver_relationship_child',
 'num_claims_past_year',
 'months_as_customer',
 'police_report_available',
 'authorities_contacted_ambulance',
 'customer_education',
 'customer_age',
 'incident_type_breakin',
 'auto_year',
 'authorities_contacted_none',


#### Pull customer data and format the datapoint.

When a customer submits an insurance claim online for instant approval, the insurance company will need to pull customer-specific data. You can do it either using the customer data we have stored in a CSV files or an online feature store to add to the claim data. The pulled data will serve as input for a model prediction.

Then, the datapoint must match the exact input format as the model was trained–with all features in the correct order. In this example, the col_order variable was saved when you created the train and test datasets earlier in the guide.

We will get sample policy id from the dataset and fetch relevant records from the claims-customer dataset that was created as part of Lab1. You can also fetch the record from feature store and then call the endpoint.

In [30]:
# Cell 24
sample_policy_id = int(policydataset.sample(1)["policy_id"])
customer_claim_df = dataset[dataset["policy_id"] == sample_policy_id].sample(1)
blended_df = customer_claim_df.loc[:, col_order].drop("fraud", axis=1).T.reset_index()
blended_df.columns = ["FeatureName", "ValueAsString"]

data_input = ",".join([str(x) for x in blended_df["ValueAsString"]])
data_input
print(policydataset.sample(1))

      Unnamed: 0  policy_id  authorities_contacted_police  injury_claim  \
4677        4677       4141                             0           0.0   

      num_vehicles_involved  incident_type_collision  policy_liability  fraud  \
4677                      2                        1                 0      0   

      customer_gender_male  num_injuries  ...  police_report_available  \
4677                     1             0  ...                        0   

      authorities_contacted_ambulance  customer_education  customer_age  \
4677                                0                   4            24   

      incident_type_breakin  auto_year  authorities_contacted_none  \
4677                      0       2017                           1   

      num_insurers_past_5_years  collision_type_side  policy_state_wa  
4677                          1                    0                0  

[1 rows x 48 columns]


  sample_policy_id = int(policydataset.sample(1)["policy_id"])


#### Run prediction on sample claim dataset

In [31]:
# Cell 25
results = predictor.predict(data_input, initial_args={"ContentType": "text/csv"})
prediction = json.loads(results)
print(f"The probability claim from policy {int(sample_policy_id)} is fraudulent:", prediction)

The probability claim from policy 4808 is fraudulent: 0.028527986258268356


## (Optional) Clean-up

---

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [32]:
# Cell 26
predictor.delete_endpoint(delete_endpoint_config=True)

**Congratulations!** 

You have successfully completed Lab 3 & 4.