# Welcome to the SageMaker Immersion Day for ISVs



## Prerequisites
### Downloading the GitHub repository content needed for the labs
Run the following cell to download the lab artifacts to your notebook environment.

In [3]:
!git clone https://github.com/aws-samples/amazon-sagemaker-immersion-day.git

Cloning into 'amazon-sagemaker-immersion-day'...
remote: Enumerating objects: 574, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 574 (delta 17), reused 5 (delta 2), pack-reused 542
Receiving objects: 100% (574/574), 33.53 MiB | 40.07 MiB/s, done.
Resolving deltas: 100% (242/242), done.
Checking out files: 100% (96/96), done.


## SageMaker Pipelines Lab
### Overview

**Amazon SageMaker Pipelines** is a capability of Amazon SageMaker that makes it easy for data scientists and engineers to build, automate, and scale end-to-end machine learning pipelines. SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration. Three components improve the operational resilience and reproducibility of your ML workflows: pipelines, model registry, and projects. These workflow automation components enable you to scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.

SageMaker projects introduce MLOps templates that automatically provision the underlying resources needed to enable CI/CD capabilities for your ML development lifecycle. You can use a number of built-in templates or create your own [custom template](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-custom.html). You can use SageMaker Pipelines independently to create automated workflows; however, when used in combination with SageMaker projects, the additional CI/CD capabilities are provided automatically. The following screenshot shows how the three components of SageMaker Pipelines can work together in an example SageMaker project.

![Overview](img/image1.png)



This lab focuses on using an MLOps template to bootstrap your ML project and establish a CI/CD pattern from sample code. We show how to use the built-in build, train, and deploy project template as a base for a customer churn classification example. This base template enables CI/CD for training ML models, registering model artifacts to the model registry, and automating model deployment with manual approval and automated testing.

## MLOps Template for building, training and deploying models

We start by taking a detailed look at what AWS services are launched when this build, train, and deploy MLOps template is launched. Later, we discuss how to modify the skeleton for a custom use case.

In SageMaker Studio, you can now choose the **Projects** menu on the **Components and registries** menu.

![Overview](img/image12.png)

Once you choose **Projects**, click **Create project**, as shown below:

![Overview](img/image13.png)

On the projects page, you can launch a preconfigured SageMaker MLOps template. For this lab, we choose **MLOps template for model building, training, and deployment** and click **Select project template**.

![Overview](img/image14.png)

On the next page, provide a project name and a short description, then select **Create project**.

![Overview](img/image15.png)

The project will take a while to be created.

![Overview](img/image15bis.png)

Launching this template starts a model building pipeline by default, and while there is no cost for using SageMaker Pipelines itself, you will be charged for the services launched. Cost varies by Region. A single run of the model build pipeline in us-east-1 is estimated to cost less than $0.50. Models approved for deployment incur the cost of the SageMaker endpoints (test and production) for the Region using an ml.m5.large instance.

After the project is created from the MLOps template, the following architecture is deployed.

![Overview](img/image16.png)



Included in the architecture are the following AWS services and resources:

* The MLOps templates that are made available through SageMaker projects are provided via an AWS Service Catalog portfolio that automatically gets imported when a user enables projects on the Studio domain.

* Two repositories are added to AWS CodeCommit:

    * The first repository provides scaffolding code to create a multi-step model building pipeline including the following steps: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you can see in the pipeline.py file, this pipeline trains a regression model using the XGBoost algorithm on the well-known UCI Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to run the pipeline automatically.

    * The second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This repo also uses CodePipeline and CodeBuild, which run an AWS CloudFormation template to create model endpoints for staging and production.

* Two CodePipeline pipelines:

    * The ModelBuild pipeline automatically triggers and runs the pipeline from end to end whenever a new commit is made to the ModelBuild CodeCommit repository.

    * The ModelDeploy pipeline automatically triggers whenever a new model version is added to the model registry and the status is marked as Approved. Models that are registered with Pending or Rejected statuses aren’t deployed.

* An Amazon Simple Storage Service (Amazon S3) bucket is created for output model artifacts generated from the pipeline.

* SageMaker Pipelines uses the following resources:

    * A pipeline workflow that contains the directed acyclic graph (DAG) that trains and evaluates our model. Each step in the pipeline keeps track of its lineage, and intermediate steps can be cached so that the pipeline can be re-run quickly. Outside of templates, you can also create pipelines using the SDK.

    * Within SageMaker Pipelines, the SageMaker model registry tracks the model versions and respective artifacts, including the lineage and metadata for how they were created. Different model versions are grouped together under a model group, and new models registered to the registry are automatically versioned. The model registry also provides an approval workflow for model versions and supports deployment of models in different accounts. You can also use the model registry through the boto3 package.

* Two SageMaker endpoints:
    * After a model is approved in the registry, the artifact is automatically deployed to a staging endpoint followed by a manual approval step.
    * If approved, it’s deployed to a production endpoint in the same AWS account.
    
    

All SageMaker resources, such as training jobs, pipelines, models, and endpoints, as well as AWS resources listed in this lab, are automatically tagged with the project name and a unique project ID tag.
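
As a minimal sketch of the boto3 route to the model registry mentioned above, the following helper looks up the newest approved model version in a model group. It assumes `sm_client` is a client created with `boto3.client("sagemaker")` and that you supply the model group name created by your project; the function names are illustrative and are not part of the template.

```python
def newest_approved_arn(response):
    """Return the ARN of the most recently created Approved model package.

    `response` is a dict shaped like the result of list_model_packages.
    Returns None when no Approved version is present.
    """
    approved = [
        summary
        for summary in response.get("ModelPackageSummaryList", [])
        if summary.get("ModelApprovalStatus") == "Approved"
    ]
    if not approved:
        return None
    newest = max(approved, key=lambda summary: summary["CreationTime"])
    return newest["ModelPackageArn"]


def latest_approved_model_package(sm_client, model_package_group_name):
    """Query the model registry for the newest Approved version in a group."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
    )
    return newest_approved_arn(response)
```

Only the first page of results is inspected here, which is enough when the results are sorted newest first.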


# Modifying the Seed Code for Custom Use Case

After your project has been created, the architecture described earlier is deployed and the visualization of the pipeline is available on the Pipelines drop-down menu within SageMaker Studio.

To modify the sample code from this launched template, we first need to clone the CodeCommit repositories to our local SageMaker Studio instance. From the list of projects, choose the one that was just created. On the Repositories tab, you can select the hyperlinks to locally clone the CodeCommit repos.

![repos](img/image18.png)

Once both repositories have been cloned you should see the following:
![repos](img/jma2.png)

## ModelBuild Repo:

The ModelBuild repository contains the code for preprocessing, training, and evaluating the model. The sample code trains and evaluates a model on [the UCI Abalone dataset](https://archive.ics.uci.edu/ml/datasets/abalone). We can modify these files to solve our own customer churn use case. See the following code:

![code](img/image19.png)

We now need a dataset accessible to the project.

Run the following code to download a data text file and save it as a .csv in your bucket:


 

In [22]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt ./
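import boto3
import os
import sagemaker

# Key prefix under which the lab data is stored in the default SageMaker bucket
prefix = 'sagemaker/DEMO-xgboost-churn'
region = boto3.Session().region_name
default_bucket = sagemaker.session.Session().default_bucket()

# Upload the downloaded text file as RawData.csv (upload_file returns None)
boto3.Session().resource('s3')\
.Bucket(default_bucket).Object(os.path.join(prefix, 'data/RawData.csv'))\
.upload_file('./churn.txt')

# S3 URL of the uploaded dataset; used later as the pipeline's InputDataUrl
s3source = os.path.join("s3://", default_bucket, prefix, 'data/RawData.csv')
print(s3source)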

download: s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt to ./churn.txt
s3://sagemaker-us-east-1-280388799341/sagemaker/DEMO-xgboost-churn/data/RawData.csv
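
The cell above builds the S3 key and URL with `os.path.join`, which works in Studio because it runs on Linux. As a more portable alternative (a sketch only, not part of the lab code), the parts can be joined with `posixpath` so the separator is always a forward slash. The helper name `s3_uri` and the bucket name are hypothetical.

```python
import posixpath


def s3_uri(bucket, *key_parts):
    """Build an s3:// URI from a bucket name and one or more key segments."""
    key = posixpath.join(*key_parts)
    return f"s3://{bucket}/{key.lstrip('/')}"


# Example with a placeholder bucket name
uri = s3_uri("YOUR-BUCKET", "sagemaker/DEMO-xgboost-churn", "data/RawData.csv")
print(uri)  # s3://YOUR-BUCKET/sagemaker/DEMO-xgboost-churn/data/RawData.csv
```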


Navigate to the pipelines directory inside the modelbuild directory and rename the abalone directory to customer_churn (or run the following cell).


In [14]:
!cd ../customer-churn*/*modelbuild/pipelines/; mv abalone customer_churn;


mv: cannot stat 'abalone': No such file or directory
ImmersionDay-Sagemaker-ISV.ipynb  churn.txt			 img
amazon-sagemaker-immersion-day	  customer-churn-p-ibrflghsm8sb


Now open the codebuild-buildspec.yml file in the modelbuild directory and modify the run pipeline path from run-pipeline --module-name pipelines.abalone.pipeline to this:

    run-pipeline --module-name pipelines.customer_churn.pipeline \

or execute the following cell:


In [21]:
!wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/codebuild-buildspec.yml
!cd ../customer-churn*/*modelbuild/; mv ../../sagemaker-isv-immersionday/codebuild-buildspec.yml .; 

--2022-08-12 23:14:29--  https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/codebuild-buildspec.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 993 [text/plain]
Saving to: ‘codebuild-buildspec.yml’


2022-08-12 23:14:29 (31.0 MB/s) - ‘codebuild-buildspec.yml’ saved [993/993]
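
If you prefer to keep the buildspec that shipped with your project rather than download a replacement, the manual edit described above can also be scripted. The following is a minimal sketch that assumes the file contains the module name `pipelines.abalone.pipeline` exactly as quoted above; the function names are hypothetical.

```python
from pathlib import Path

OLD_MODULE = "pipelines.abalone.pipeline"
NEW_MODULE = "pipelines.customer_churn.pipeline"


def retarget_buildspec(text, old=OLD_MODULE, new=NEW_MODULE):
    """Return the buildspec text with the pipeline module name swapped."""
    if old not in text:
        raise ValueError(f"{old!r} not found; the buildspec may already be updated")
    return text.replace(old, new)


def retarget_buildspec_file(path):
    """Rewrite a buildspec file in place so it runs the customer churn pipeline."""
    buildspec = Path(path)
    buildspec.write_text(retarget_buildspec(buildspec.read_text()))
```

Calling `retarget_buildspec_file` with the path to codebuild-buildspec.yml in your cloned modelbuild directory applies the change. The explicit error keeps a second run from silently doing nothing.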



Next, replace all three files inside the pipeline directory, as shown below:

![files to replace](img/image32.png)


Replace the preprocess.py code under the customer_churn folder with the customer churn preprocessing script found in the sample repository.

Replace the pipeline.py code under the customer_churn folder with the customer churn pipeline script found in the sample repository. Be sure to replace the default value of the “InputDataUrl” parameter (line 121 of pipeline.py) with the Amazon S3 URL printed earlier, when you uploaded the dataset:


    input_data = ParameterString(
        name="InputDataUrl",
        default_value="s3://YOUR-BUCKET/sagemaker/DEMO-xgboost-churn/data/RawData.csv",
    )


The conditional step that evaluates the classification model should already look like the following:

    # Conditional step for evaluating model quality and branching execution
    cond_lte = ConditionGreaterThanOrEqualTo(
        left=JsonGet(step=step_eval, property_file=evaluation_report,
                     json_path="binary_classification_metrics.accuracy.value"),
        right=0.8,
    )

One last thing to note is that the default ModelApprovalStatus is set to PendingManualApproval. If our model reaches an accuracy of at least 80%, it’s added to the model registry, but it is not deployed until manual approval is complete.
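
To make the gate concrete, the following plain-Python sketch mirrors what the condition evaluates: it reads the accuracy from an evaluation report and compares it with the 0.8 threshold. It is an illustration only, since in the pipeline this check is performed by the condition step itself, and the function names are hypothetical.

```python
import json

ACCURACY_THRESHOLD = 0.8


def passes_quality_gate(report, threshold=ACCURACY_THRESHOLD):
    """Return True when the reported accuracy is at least the threshold."""
    accuracy = report["binary_classification_metrics"]["accuracy"]["value"]
    return accuracy >= threshold


def passes_quality_gate_json(report_text, threshold=ACCURACY_THRESHOLD):
    """Same check, starting from the JSON text of an evaluation report."""
    return passes_quality_gate(json.loads(report_text), threshold)
```

A report with an accuracy of exactly 0.8 passes, because the comparison is greater than or equal to.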

Replace the evaluate.py code with the customer churn evaluation script found in the sample repository. One piece of the code we’d like to point out is that, because we’re evaluating a classification model, we need to update the metrics we’re evaluating and associating with trained models:

    report_dict = {
        "binary_classification_metrics": {
            "accuracy": {
                "value": acc,
                "standard_deviation" : "NaN"
            },
            "auc" : {
                "value" : auc,
                "standard_deviation": "NaN"
            },
        },
    }
    evaluation_output_path = '/opt/ml/processing/evaluation/evaluation.json'
    with open(evaluation_output_path, 'w') as f:
        f.write(json.dumps(report_dict))

The JSON structure of these metrics is required to match the format of sagemaker.model_metrics for complete integration with the model registry.
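
The snippet above assumes that `acc` and `auc` have already been computed. As a dependency-free sketch of how such values could be produced from labels and predicted scores, consider the following. The 0.5 decision threshold and the function names are assumptions for illustration, and the sample script may compute its metrics differently (for example, with scikit-learn).

```python
import json


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def build_report(labels, scores):
    """Assemble the metrics in the structure shown above."""
    return {
        "binary_classification_metrics": {
            "accuracy": {"value": accuracy(labels, scores), "standard_deviation": "NaN"},
            "auc": {"value": roc_auc(labels, scores), "standard_deviation": "NaN"},
        },
    }


report = build_report([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(json.dumps(report, indent=2))
```

The pairwise AUC is quadratic in the number of examples, which is fine for a small illustration but not for a large test set.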

The following cell downloads the three scripts, substitutes your S3 URL into pipeline.py, and moves the files into the customer_churn folder:


In [70]:
!wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/preprocess.py
!wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/evaluate.py
!wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/pipeline.py
!cat pipeline.py | sed "s|s3://sm-pipelines-demo-data-123456789/churn.txt|$s3source|" > pipeline-2.py
!rm pipeline.py
!mv pipeline-2.py pipeline.py
!cd ../customer-churn-*/*modelbuild/pipelines/customer_churn;mv ~/sagemaker-isv-immersionday/*.py .

--2022-08-19 04:45:40--  https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/preprocess.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2726 (2.7K) [text/plain]
Saving to: ‘preprocess.py’


2022-08-19 04:45:40 (36.8 MB/s) - ‘preprocess.py’ saved [2726/2726]

--2022-08-19 04:45:40--  https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-immersion-day/master/ML%20Pipelines%20scripts/evaluate.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2694 (2.6K) [text/plain

## ModelDeploy Repo:

The ModelDeploy repository contains the AWS CloudFormation buildspec for the deployment pipeline. We don’t make any modifications to this code because it’s sufficient for our customer churn use case. It’s worth noting that model tests can be added to this repo to gate model deployment. The repository is laid out as follows:

    ├── build.py
    ├── buildspec.yml
    ├── endpoint-config-template.yml
    ├── prod-config.json
    ├── README.md
    ├── staging-config.json
    └── test
        ├── buildspec.yml
        └── test.py

# Triggering a pipeline run

Committing these changes to the CodeCommit repository (easily done on the Studio source control tab) triggers a new pipeline run, because an Amazon EventBridge event monitors for commits. After a few moments, we can monitor the run by choosing the pipeline inside the SageMaker project.

1. To commit the changes, navigate to the Git section on the left panel and follow the steps in the screenshot below:
    * Stage all changes.
    * Commit the changes by providing a summary, your name, and an email address.
    * Push the changes.

**Make sure you stage the Untracked changes as well.**
![files to replace](img/image24.png)


2. Navigate back to the project and select the Pipelines section.
![files to replace](img/pipelines.png)


The following screenshot shows the details of our pipeline execution.
![files to replace](img/image25.png)

3. If you double-click the running execution, the steps of the pipeline appear, and you can monitor the step that is currently running.
![files to replace](img/image26.png)

![files to replace](img/image27.png)


4. When the pipeline is complete, you can go back to the project screen and choose the Model groups tab. You can then inspect the metadata attached to the model artifacts.
![files to replace](img/image28.png)

5. If everything looks good, you can click on the Update Status tab and manually approve the model.
![files to replace](img/image29.png)

![files to replace](img/image30.png)

![files to replace](img/image30-2.png)
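
The same approval can be issued programmatically. The following is a minimal sketch that assumes `sm_client` is a client created with `boto3.client("sagemaker")` and that you already know the ARN of the model version to approve; the helper names and the default description are hypothetical.

```python
def approval_request(model_package_arn, description="Approved after manual review"):
    """Build the arguments for an update_model_package approval call."""
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": "Approved",
        "ApprovalDescription": description,
    }


def approve_model_package(sm_client, model_package_arn, description="Approved after manual review"):
    """Mark a registered model version as Approved in the model registry."""
    return sm_client.update_model_package(
        **approval_request(model_package_arn, description)
    )
```

As with the console action, changing the status to Approved is what starts the ModelDeploy pipeline.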


You can then go to **Endpoints** in the SageMaker menu.
![files to replace](img/image30-3.png)

You will see a staging endpoint being created.
![files to replace](img/image30-4.png)

After a while the endpoint will be listed with the **InService** status.
![files to replace](img/image30-5.png)
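
Instead of refreshing the console, you can poll the endpoint status from a notebook. The sketch below assumes `sm_client` is a client created with `boto3.client("sagemaker")` and that you pass the endpoint name shown in the Endpoints list; the function name, polling interval, and timeout are illustrative choices.

```python
import time

FAILED_STATUSES = {"Failed"}


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint reports InService.

    Raises RuntimeError if the endpoint fails and TimeoutError if it is
    still not ready after roughly `timeout_seconds`.
    """
    waited = 0
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status in FAILED_STATUSES:
            reason = description.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        sleep(poll_seconds)
        waited += poll_seconds
```

The boto3 SageMaker client also provides a built-in waiter, obtained with `get_waiter("endpoint_in_service")`, which may be preferable when you do not need custom handling.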

To deploy the endpoint into production, you need to put on your "DevOps Team" hat and go to CodePipeline.
![files to replace](img/image30-6.png)

Click on the modeldeploy pipeline which is currently in progress.
![files to replace](img/image30-7.png)

At the end of the DeployStaging phase, you need to manually approve the deployment.
![files to replace](img/image30-8.png)

Once it is done you will see the production endpoint being deployed in the SageMaker Endpoints.
![files to replace](img/image30-9.png)

After a while the endpoint will also be InService.
![files to replace](img/image31.png)




# Multi-tenancy

Let's create a second training dataset for our second customer:

In [72]:
prefix = 'sagemaker/DEMO-xgboost-churn-cust2'
RawData = boto3.Session().resource('s3')\
.Bucket(default_bucket).Object(os.path.join(prefix, 'data/RawData.csv'))\
.upload_file('./churn.txt')
s3sourcecust2=os.path.join("s3://",default_bucket, prefix, 'data/RawData.csv')
print(s3sourcecust2)

download: s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt to ./churn.txt
s3://sagemaker-us-east-1-280388799341/sagemaker/DEMO-xgboost-churn-cust2/data/RawData.csv


## Execute the pipeline with a different training dataset


In [74]:
client=boto3.client('sagemaker')
response = client.start_pipeline_execution(
    PipelineName='arn:aws:sagemaker:us-east-1:280388799341:pipeline/customer-churn-p-ibrflghsm8sb',
    PipelineParameters=[
        {
            'Name': 'InputDataUrl',
            'Value': s3sourcecust2
        },
    ],
    PipelineExecutionDescription='customer2'
)
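
With more than a couple of customers, the call above can be wrapped in a loop. The following sketch assumes `sm_client` is a client created with `boto3.client("sagemaker")`, that every tenant uses the same pipeline, and that only the InputDataUrl parameter differs; the helper names and the tenant-to-URL mapping are hypothetical.

```python
def execution_request(pipeline_name, tenant, input_data_url):
    """Build start_pipeline_execution arguments for one tenant's dataset."""
    return {
        "PipelineName": pipeline_name,
        "PipelineParameters": [
            {"Name": "InputDataUrl", "Value": input_data_url},
        ],
        "PipelineExecutionDescription": tenant,
    }


def start_tenant_executions(sm_client, pipeline_name, tenant_datasets):
    """Start one pipeline execution per tenant and return their ARNs.

    `tenant_datasets` maps a tenant label to the S3 URL of its training data.
    """
    execution_arns = {}
    for tenant, input_data_url in tenant_datasets.items():
        response = sm_client.start_pipeline_execution(
            **execution_request(pipeline_name, tenant, input_data_url)
        )
        execution_arns[tenant] = response["PipelineExecutionArn"]
    return execution_arns
```

Each returned ARN can then be passed to `describe_pipeline_execution` to check the status of that tenant's run.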


# Conclusion

In this lab we have walked through how a data scientist can modify a preconfigured MLOps template for their own modeling use case. Among the many benefits is that the changes to the source code can be tracked, associated metadata can be tied to trained models for deployment approval, and repeated pipeline steps can be cached for reuse. To learn more about SageMaker Pipelines, check out the website and the documentation.