# Using Neptune together with Amazon SageMaker training jobs 

<div class="alert alert-info">You can run this part of the notebook either locally or from a SageMaker notebook. It requires additional dependencies, such as the AWS CLI and Docker.</div>

This tutorial uses some code (with adaptations) from the [official AWS tutorial](https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own). We'll show how to add Neptune logging to a custom training job in SageMaker. To do this, we create a Docker container with Neptune pre-installed and adapt Amazon's code by adding Neptune logging to it.

## Docker container

Our container is a simplified version of the container in the AWS tutorial. We use the `python` base image and add `neptune-client` and `neptune-sklearn` as additional dependencies.

In [None]:
!cat Dockerfile

### Training script

The training script can be found in `decision_trees/train`. Compared to the script from the AWS tutorial, we added a few lines of code:

```diff
[...]

+ import neptune.new as neptune
+ import neptune.new.integrations.sklearn as npt_utils

[...]

def train():
+    run = neptune.init_run(
+         tags=["sagemaker"],
+         source_files=["train"],
+    )
     [...]
     
     # Now use scikit-learn's decision tree classifier to train the model.
     clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
     clf = clf.fit(train_X, train_y)

+    run["cls_summary"] = npt_utils.create_classifier_summary(
+       clf, train_X, train_X, train_y, train_y
+    )
     [...]
```

### Build the Docker image and push it to ECR

Next, we need to build the container. The following Bash script assumes that you have the [AWS CLI](https://aws.amazon.com/cli/) installed on your machine. It again uses code from the AWS tutorial. It is lengthy because it automatically creates an ECR repository and pushes the image to it.

In [None]:
!cat build_and_push.sh

In [None]:
!bash ./build_and_push.sh
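
The "create the repository if it is missing" step that the script handles through the AWS CLI can also be sketched in Python. This is a minimal sketch, not a replacement for the script: `ensure_ecr_repository` is a hypothetical helper, and it assumes you pass in a boto3 ECR client (for example `boto3.client("ecr")`). It uses the `describe_repositories` and `create_repository` client calls.

```python
def ensure_ecr_repository(ecr_client, name):
    """Create the ECR repository `name` if it does not exist yet.

    `ecr_client` is assumed to be a boto3 ECR client.
    Returns True if the repository was created, False if it already existed.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository is missing, so create it before pushing the image.
        ecr_client.create_repository(repositoryName=name)
        return True
```

With valid AWS credentials, a call could look like `ensure_ecr_repository(boto3.client("ecr"), "neptune-sagemaker-demo")`, where the repository name matches the image name used later in this notebook.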

## Start the training job

<div class="alert alert-info">This part of the notebook is to be run from an Amazon SageMaker notebook.</div>

### Obtaining the Neptune token from AWS Secrets Manager

If you store your Neptune API token and project name in AWS Secrets Manager, you can read them using the following code. You can then use the stored token and project name as the values of `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` in the cell that creates the estimator. Alternatively, you can add the code below to the `decision_trees/train` script before building the Docker image and read the secrets from the `secrets` dictionary instead of `os.environ`.

If you want to read the secrets from AWS Secrets Manager, make sure that your SageMaker notebook has a role that allows access to the secrets, in particular the `secretsmanager:GetSecretValue` permission for the appropriate secret.

In [None]:
import boto3
from botocore.exceptions import ClientError
import json

secret_name = "AmazonSageMaker-tw-neptune-v2"
region_name = "us-east-1"

# Create a Secrets Manager client
session = boto3.session.Session()
client = session.client(
    service_name='secretsmanager',
    region_name=region_name
)

get_secret_value_response = client.get_secret_value(
    SecretId=secret_name
)

# Secrets Manager has already decrypted the secret with its associated KMS key;
# here we only parse the JSON payload into a dictionary.
secrets = json.loads(get_secret_value_response["SecretString"])
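
If you read the secrets inside the training script, a small helper can prefer the `secrets` dictionary and fall back to environment variables. The following is a minimal sketch: `resolve_neptune_settings` is a hypothetical helper, and the key names `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` inside the secret are an assumption about how you stored it.

```python
import os


def resolve_neptune_settings(secrets=None, environ=None):
    """Return (api_token, project), preferring `secrets` over environment variables.

    The key names are assumed to match the environment variable names.
    """
    secrets = secrets or {}
    environ = os.environ if environ is None else environ
    values = []
    for key in ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT"):
        value = secrets.get(key) or environ.get(key)
        if not value:
            raise KeyError(f"{key} not found in secrets or environment")
        values.append(value)
    return tuple(values)
```

The result can then be passed on explicitly, for example `api_token, project = resolve_neptune_settings(secrets)` followed by `neptune.init_run(api_token=api_token, project=project, ...)`.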

In the example below, we are going to use the anonymous token `neptune.ANONYMOUS_API_TOKEN`, so we first need to install the Neptune client library to obtain the token.

In [None]:
%pip install -q -U neptune-client

## Training data

We are going to use the Iris dataset. Below, we download it from the official SageMaker repository of sample datasets, map the class names to integer labels, and move the label to the first column.

In [None]:
import boto3
import numpy as np
import pandas as pd
import os

os.makedirs("./data", exist_ok=True)

s3_client = boto3.client("s3")
s3_client.download_file(
    "sagemaker-sample-files", "datasets/tabular/iris/iris.data", "./data/iris.csv"
)

df_iris = pd.read_csv("./data/iris.csv", header=None)
df_iris[4] = df_iris[4].map({"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2})
iris = df_iris[[4, 0, 1, 2, 3]].to_numpy()
np.savetxt("./data/iris.csv", iris, delimiter=",", fmt="%1.1f, %1.3f, %1.3f, %1.3f, %1.3f")
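
Before uploading, it can help to confirm that the rewritten file has the expected layout: one label followed by four feature values per row, with no header. The check below is a minimal sketch using only the standard library. `check_training_csv` is a hypothetical helper, and the assumption that the training script expects the label in the first column is inferred from the column reordering above.

```python
import csv


def check_training_csv(path, n_features=4, labels=(0, 1, 2)):
    """Verify that each row is `label, f1, ..., fn` and return the row count."""
    n_rows = 0
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, skipinitialspace=True), 1):
            if not row:
                continue  # tolerate blank lines
            if len(row) != n_features + 1:
                raise ValueError(f"line {line_no}: expected {n_features + 1} columns")
            if float(row[0]) not in labels:
                raise ValueError(f"line {line_no}: unexpected label {row[0]!r}")
            for value in row[1:]:
                float(value)  # raises ValueError if a feature is not numeric
            n_rows += 1
    return n_rows
```

Calling `check_training_csv("./data/iris.csv")` returns the number of data rows, or raises `ValueError` on the first malformed line.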

## Training the machine learning model

The code below was taken from the official AWS tutorial. The only difference is that we are passing the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` as environment variables to the [estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [None]:
import os
from sagemaker import get_execution_role
import sagemaker as sage
import neptune.new as neptune

# S3 prefix
s3_prefix = "neptune-sagemaker-demo-data"

role = get_execution_role()

sess = sage.Session()

WORK_DIRECTORY = "data"

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=s3_prefix)

account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/neptune-sagemaker-demo:latest".format(account, region)

tree = sage.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
    environment={
        "NEPTUNE_API_TOKEN": neptune.ANONYMOUS_API_TOKEN,
        "NEPTUNE_PROJECT": "common/showroom"
    }
)

tree.fit(data_location)