# Implementation with Step Functions

Now that we have built our data science project, we want to deploy it in a robust, repeatable way. To do this, we will implement the ETL with AWS Glue, then train the model and run batch transform on the input using the SageMaker integration with AWS Step Functions.

This notebook walks you through the process step by step.

First, however, you need to create or configure your own bucket. The SageMaker SDK is a good way to get started.

## 1. Uploading files to the external S3 bucket

In [2]:
import sagemaker
import boto3, os

In [3]:
account_id = boto3.client("sts").get_caller_identity()["Account"]



In [4]:
ses = sagemaker.Session()
#your_bucket = ses.default_bucket()
bucket = "fashionstore-datalake-external-" + str(account_id)

In [6]:
s3 = boto3.Session().resource('s3')
# S3 keys always use forward slashes, so they are written as plain strings instead of os.path.join
s3.Bucket(bucket).Object('data/billing/billing_sm.csv').upload_file('../data/billing_sm.csv')
s3.Bucket(bucket).Object('data/reseller/reseller_sm.csv').upload_file('../data/reseller_sm.csv')
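
Both uploads follow the same `data/<table>/<file>` layout, so each table ends up under its own prefix. A minimal sketch of that layout as a reusable helper could look like this. The names `build_upload_plan` and `upload_all` are hypothetical, and `upload_all` assumes a boto3 S3 `Bucket` resource like the one used above.

```python
import posixpath


def build_upload_plan(tables, local_dir="../data", suffix="_sm.csv"):
    """Return (local_path, s3_key) pairs, one per table.

    S3 keys always use forward slashes, so posixpath builds the key
    regardless of the operating system running the notebook.
    """
    plan = []
    for table in tables:
        filename = table + suffix
        local_path = local_dir + "/" + filename
        s3_key = posixpath.join("data", table, filename)
        plan.append((local_path, s3_key))
    return plan


def upload_all(bucket_resource, plan):
    """Upload every file in the plan (bucket_resource: assumed boto3 S3 Bucket)."""
    for local_path, s3_key in plan:
        bucket_resource.Object(s3_key).upload_file(local_path)


plan = build_upload_plan(["billing", "reseller"])
```

Calling `upload_all(boto3.resource("s3").Bucket(bucket), plan)` would then perform the same two uploads as the cell above.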

## 2. Creating a crawler in Glue (it will create 2 tables in the external schema)

To use this CSV data in a Glue ETL job, we first have to create a Glue crawler that points to the location of each file. The crawler will try to infer the data type of each column. The safest way to do this is to create one crawler per table, each pointing to a different location.

In Glue, go to the Crawlers option.
Click this <a href='https://us-east-1.console.aws.amazon.com/glue/home?region=us-east-1#/v2/data-catalog/crawlers'> link </a>.

Click Create crawler.

Enter the name data_external and click Next.

<img src='img/c1.png' style='width:500px' />

Click Add a data source.

<img src='img/c2.png' style='width:500px' />

Select the following options: <br>
Data source: S3 <br>
S3 path: **s3://fashionstore-datalake-external-YOURACCOUNTID/data/** (replace YOURACCOUNTID with your AWS account ID)

Click Add an S3 data source.

<img src='img/c3.png' style='width:300px' />

Click Next.

<img src='img/c4.png' style='width:500px' />

Select the role RoleGlue-fashionstore and click Next.

<img src='img/c5.png' style='width:500px' />

Select the schema where the tables will be created: fashion_external.

<img src='img/c6.png' style='width:500px' />

Click Create crawler.

<img src='img/c7.png' style='width:500px' />

Select the crawler you created and click Run.

<img src='img/c8.png' style='width:500px' />

<img src='img/c9.png' style='width:500px' />

After about a minute, 2 tables should have been created in the fashion_external schema.

<img src='img/c10.png' style='width:500px' />
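
The console steps above can also be scripted. The following is a rough sketch built on the boto3 Glue client calls `create_crawler`, `start_crawler`, `get_crawler`, and `get_tables`. The crawler name, role, schema, and S3 path are the ones used in this walkthrough; the function name `run_crawler` and the polling interval are illustrative assumptions. It also assumes that the `fashion_external` database and the `RoleGlue-fashionstore` role already exist.

```python
import time


def run_crawler(glue, account_id, name="data_external",
                role="RoleGlue-fashionstore", database="fashion_external",
                poll_seconds=15):
    """Create and run a Glue crawler, wait for it, and return the table names.

    `glue` is assumed to be boto3.client("glue").
    """
    s3_path = f"s3://fashionstore-datalake-external-{account_id}/data/"
    glue.create_crawler(
        Name=name,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)

    # The crawler reports READY again once the run has finished.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    tables = glue.get_tables(DatabaseName=database)["TableList"]
    return [table["Name"] for table in tables]
```

With real credentials, `run_crawler(boto3.client("glue"), account_id)` should return the names of the two tables that the crawler created.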

## 4. Create Glue Job

First, you need a role to run the Glue job. For simplicity, we will create a role with administrator access that the Glue service can assume.

In the <a href='https://console.aws.amazon.com/iam/home#/roles'> IAM Console </a>, create a new role:

* Under use case, select Glue.
* Under Policies, select AdministratorAccess.
* Name your role GlueAdmin and confirm.

<img src='img/gluerole1.png' style='width:500px'>
<img src='img/gluerole2.png' style='width:500px'>
<img src='img/gluerole3.png' style='width:500px'>
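
For reference, the same role can be sketched with boto3. The trust policy is what allows the Glue service to assume the role, and `AdministratorAccess` is the AWS managed policy selected above. The helper names `trust_policy` and `create_admin_role` are hypothetical, and `iam` is assumed to be `boto3.client("iam")`.

```python
import json

ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def trust_policy(service):
    """Build a trust policy that lets one AWS service principal assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole",
        }],
    }


def create_admin_role(iam, role_name, service):
    """Create the role, then attach administrator access in a separate call."""
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy(service)),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=ADMIN_POLICY_ARN)


glue_trust = trust_policy("glue.amazonaws.com")
```

A call such as `create_admin_role(boto3.client("iam"), "GlueAdmin", "glue.amazonaws.com")` would mirror the console steps. Administrator access keeps the workshop simple; a narrower policy would be preferable outside a sandbox account.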



Now move to the <a href='https://us-east-1.console.aws.amazon.com/gluestudio/home?region=us-east-1#/jobs'> Glue Job Console </a> and author a new job.

Select Spark script editor and click Create.

<img src='img/c11.png' style='width:700px' />

Name the job ETL_PIPELINE and paste in the contents of the script ETL_PIPELINE.py.

<img src='img/c12.png' style='width:700px' />

Click Job details, select the role RoleGlue-fashionstore, and click Save.

<img src='img/c13.png' style='width:700px' />    
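
If you prefer to create the job from code, a sketch along these lines should be close. It only builds the arguments for the boto3 Glue client call `create_job`. The job name and role come from the steps above, while the script location in S3 and the Glue version are assumptions that you would need to adapt, since the console flow pastes the script directly instead of reading it from S3.

```python
def job_definition(script_location, name="ETL_PIPELINE",
                   role="RoleGlue-fashionstore", glue_version="4.0"):
    """Build keyword arguments for glue.create_job for a Spark script job."""
    return {
        "Name": name,
        "Role": role,
        "GlueVersion": glue_version,  # assumption: adjust to the version you use
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,  # assumption: script uploaded to S3
            "PythonVersion": "3",
        },
    }


def create_job(glue, script_location):
    """Create the job; `glue` is assumed to be boto3.client("glue")."""
    return glue.create_job(**job_definition(script_location))
```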

## 5. Create the Step Function

First, you need a role that AWS Step Functions can assume, with enough permissions to create SageMaker models, use them for inference, and run Glue jobs.
Create a role that can be assumed by the Step Functions service, and then modify it to add AdministratorAccess. You can name this role StepFunctionsAdmin.


<img src='img/iamstep.png' />

Tip: In this case, the policy cannot be attached in the same step in which the role is created; add it after the role exists.





Next, go to the <a href='https://console.aws.amazon.com/states/home?region=us-east-1#/statemachines'> Step Functions </a> console and create a new state machine.

* Author with code snippets
* Standard


In the JSON definition field, you can paste the output of the following script:

In [20]:
from sagemaker import get_execution_role

your_role = get_execution_role()

In [21]:
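your_bucket = ses.default_bucket()  # assumption: the default SageMaker bucket, since the earlier assignment is commented out
with open('step_function.json', 'r') as f:
    definition = f.read().replace('your_bucket', your_bucket).replace('your_role', your_role)
print(definition)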

{
  "Comment": "Full ML Pipeline",
  "StartAt": "Start Glue Job",
  "States": {
    "Start Glue Job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "etlandpipeline"
      },
      "Next": "Train model (XGBoost)"
    },
    "Train model (XGBoost)": {
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": {
        "AlgorithmSpecification": {
          "TrainingImage": "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": {
          "S3OutputPath": "s3://sagemaker-us-east-1-646862220717/models"
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 86400
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m4.xlarge",
          "VolumeSizeInGB": 30
        },
        "RoleArn": "arn:aws:iam::646862220717:role/TeamRole",
    

Use the role that you previously created, and then create and run your state machine. Make sure the `JobName` in the definition matches the name of the Glue job you created earlier.
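
As an alternative to pasting the definition into the console, the state machine can be created from the notebook. The sketch below first runs a light sanity check on the definition and then uses the boto3 Step Functions client calls `create_state_machine` and `start_execution`. The function names are hypothetical, and the check only follows plain `Next` transitions, so it is not a full validator for the Amazon States Language.

```python
import json


def check_definition(definition):
    """Parse a definition and check that every `Next` target is a known state."""
    machine = json.loads(definition)
    states = machine["States"]
    if machine["StartAt"] not in states:
        raise ValueError(f"StartAt points to unknown state {machine['StartAt']!r}")
    for name, state in states.items():
        target = state.get("Next")
        if target is not None and target not in states:
            raise ValueError(f"{name!r} points to unknown state {target!r}")
    return machine


def create_and_start(sfn, name, definition, role_arn):
    """Create a Standard state machine and start one execution.

    `sfn` is assumed to be boto3.client("stepfunctions").
    """
    check_definition(definition)
    machine = sfn.create_state_machine(
        name=name, definition=definition, roleArn=role_arn, type="STANDARD"
    )
    run = sfn.start_execution(stateMachineArn=machine["stateMachineArn"])
    return run["executionArn"]
```

Here `role_arn` would be the ARN of the Step Functions role created earlier, and `definition` the string printed by the previous cell.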


As your process starts running and moves through each step, you will be able to follow its progress in each service's console.

Check <a href='https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=jobs'> Glue </a> for job logs and
<a href='https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs'> SageMaker </a> to see the training job, the model that you created and the batch transform process. 

After your state machine finishes executing, you should see the graph turn green:

<img src='img/step.png' style='width:500px' />
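
The same check can be done from code by polling the execution status. This is a minimal sketch using the boto3 Step Functions client call `describe_execution`; the function name `wait_for_execution` and the polling interval are illustrative, and `sfn` is assumed to be `boto3.client("stepfunctions")`.

```python
import time

# Statuses after which the execution will not change any more.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(sfn, execution_arn, poll_seconds=30):
    """Poll an execution until it reaches a terminal status and return that status."""
    while True:
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
```

A return value of `"SUCCEEDED"` corresponds to the all-green graph in the console.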

You can inspect your predictions in the predictions folder of your bucket in <a href='https://s3.console.aws.amazon.com/s3/home?region=us-east-1'>S3</a>.
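
To list the prediction files without leaving the notebook, a small helper around the boto3 S3 client call `list_objects_v2` is one option. The prefix `predictions/` is an assumption based on the folder name mentioned above, `list_predictions` is a hypothetical name, and `s3` is assumed to be `boto3.client("s3")`.

```python
def list_predictions(s3, bucket, prefix="predictions/"):
    """Return every object key under the prefix, following pagination."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3.list_objects_v2(**kwargs)
        # "Contents" is absent when the prefix holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```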