# Monitoring Job Status

## DevOps Account

* Account: mlops-devops
* IAM: mlops-devops-admin

## Create Lambda function
* Name: monitor_sagemaker_job_status
* Runtime: Python 3.8
* Service role: add a policy that allows reading the secret from Secrets Manager

In [None]:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-west-2:*:secret:beta/sagemaker*"
        }
    ]
}
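The policy above can also be attached to the Lambda service role from code rather than the console. The following is a minimal sketch, assuming the policy is added as an inline policy through the IAM `put_role_policy` call; the function name `attach_secret_policy`, the policy name `read-beta-sagemaker-secret`, and the role name you pass in are hypothetical and not part of the original setup.

```python
import json

# Policy document copied from the cell above.
SECRET_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-west-2:*:secret:beta/sagemaker*",
        }
    ],
}


def attach_secret_policy(iam_client, role_name,
                         policy_name="read-beta-sagemaker-secret"):
    """Attach the policy inline to the given role.

    `iam_client` is expected to be a boto3 IAM client; `policy_name`
    is a hypothetical name chosen for this sketch.
    """
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(SECRET_READ_POLICY),
    )
    return policy_name


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   attach_secret_policy(boto3.client("iam"), "<lambda-service-role-name>")
```

Passing the client in as an argument keeps the helper easy to exercise without touching a real AWS account.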

### Lambda Python code

In [None]:
import json
import boto3

def getSecret(secret_name):
    # Fetch the secret and parse its JSON string into a dict
    secrets_client = boto3.client('secretsmanager')
    resp = secrets_client.get_secret_value(SecretId=secret_name)['SecretString']
    return json.loads(resp)

def lambda_handler(event, context):
    # Inputs passed in by the Step Function (or by the test event)
    secret_name = event['Input']['secret_name']
    job_name = event['Input']['job_name']

    # Credentials for the beta SageMaker account, stored in Secrets Manager
    secret_result = getSecret(secret_name)
    beta_access_key = secret_result['beta-sagemaker-access']
    beta_secret_key = secret_result['beta-sagemaker-secret']

    # Query the training job in the beta account
    ss_beta = boto3.Session(aws_access_key_id=beta_access_key, aws_secret_access_key=beta_secret_key)
    sm_beta = ss_beta.client('sagemaker')
    status = sm_beta.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']

    # The caller keeps polling while status is neither 'Completed' nor 'Failed'
    return {
        'job_name': job_name,
        'status': status
    }
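The handler returns the raw status and leaves the "keep polling or stop" decision to the caller. A minimal sketch of that decision as a plain Python helper follows. The names `TERMINAL_STATUSES`, `is_terminal`, and `should_keep_polling` are hypothetical. Note one assumption that goes beyond these notes: `Stopped` is treated as terminal as well, since `describe_training_job` can also report `Stopping` and `Stopped`, while the notes only check `Completed` and `Failed`.

```python
# Statuses after which a SageMaker training job no longer changes.
# "Stopped" is an addition of this sketch; the notes only check
# "Completed" and "Failed".
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def is_terminal(status):
    """Return True if the training job has reached a final status."""
    return status in TERMINAL_STATUSES


def should_keep_polling(result):
    """`result` is a dict shaped like the Lambda's return value."""
    return not is_terminal(result["status"])
```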

## Lambda test event

In [None]:
{
  "Input": {
    "secret_name": "beta/sagemaker",
    "job_name": "scikit-bring-your-own-2020-02-04-13-59-13"
  }
}
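The same test event can be sent from a script instead of the Lambda console. This is a rough sketch, assuming a boto3 Lambda client is passed in and that the function is deployed under the name given above; the helper name `invoke_monitor` is hypothetical.

```python
import json

# Test event copied from the cell above.
TEST_EVENT = {
    "Input": {
        "secret_name": "beta/sagemaker",
        "job_name": "scikit-bring-your-own-2020-02-04-13-59-13",
    }
}


def invoke_monitor(lambda_client, event=TEST_EVENT,
                   function_name="monitor_sagemaker_job_status"):
    """Invoke the function synchronously and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(event).encode("utf-8"),
    )
    # The response payload is a stream; read and decode it as JSON.
    return json.loads(response["Payload"].read())


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(invoke_monitor(boto3.client("lambda")))
```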

# Create Step Function

* Name: sm-job-status

In [None]:
{
  "Comment": "Check SageMaker Training Job Status",
  "StartAt": "sm_job",
  "States": {
    "sm_job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-west-2:*:function:monitor_sagemaker_job_status:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "ResultPath": "$.taskresult",
      "Next": "ChoiceState"
    },
    "ChoiceState": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.taskresult.Payload.status",
          "StringEquals": "Completed",
          "Next": "Done"
        },
        {
          "Variable": "$.taskresult.Payload.status",
          "StringEquals": "Failed",
          "Next": "Done"
        }
      ],
      "Default": "WaitSeconds"
    },
    "WaitSeconds": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "sm_job"
    },
    "Done": {
      "Type": "Pass",
      "End": true
    }
  }
}

![](./img/38.png)
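To make the control flow of the state machine easier to follow, here is a small local simulation of the same loop in Python. It is only an illustration, not how Step Functions runs the definition: `run_polling_loop` and its parameters are hypothetical, `check_status` stands in for the `sm_job` Lambda task, and `max_polls` is an extra safety limit that the state machine above does not have.

```python
import time


def run_polling_loop(check_status, wait_seconds=10, sleep=time.sleep,
                     max_polls=None):
    """Mimic sm_job -> ChoiceState -> WaitSeconds -> sm_job.

    Returns the final status and the number of polls performed.
    """
    polls = 0
    while True:
        status = check_status()                 # "sm_job" task
        polls += 1
        if status in ("Completed", "Failed"):   # "ChoiceState" rules
            return status, polls                # "Done"
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("job did not finish within max_polls")
        sleep(wait_seconds)                     # "WaitSeconds" (default branch)
```

Any status other than `Completed` or `Failed` falls through to the wait branch, just like the `Default` of the Choice state.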

# Start the Step Function

* Trigger the beta training job and note its job_name
* Pass the following payload as the Step Function execution input

In [None]:
{
  "secret_name": "beta/sagemaker",
  "job_name": "scikit-bring-your-own-2020-02-04-13-59-13"
}
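An execution with this payload can also be started from code. The sketch below assumes a boto3 Step Functions client is passed in and that you supply the ARN of the `sm-job-status` state machine yourself; the helper name `start_job_monitor` is hypothetical.

```python
import json


def start_job_monitor(sfn_client, state_machine_arn, secret_name, job_name):
    """Start one execution of the state machine and return its ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # The execution input must be a JSON string.
        input=json.dumps({"secret_name": secret_name, "job_name": job_name}),
    )
    return response["executionArn"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   start_job_monitor(boto3.client("stepfunctions"),
#                     "<state-machine-arn>",
#                     "beta/sagemaker",
#                     "scikit-bring-your-own-2020-02-04-13-59-13")
```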

The execution enters the wait state, keeps polling until the job completes or fails, and then finishes:

![](./img/39.png)