# SM00: SageMaker and Production Pipelines

Moving from a local model that trains and predicts on batch data to a production model involves many considerations. This series of posts explores how to create an MLOps-compliant production pipeline using AWS's SageMaker Studio.

SageMaker Studio is a suite of tools that helps manage the infrastructure and collaboration for a machine learning project in the AWS ecosystem. Some of the biggest advantages of SageMaker Studio include:

- Ability to spin up hardware resources as needed
- Automatic spin-down of hardware resources once the task is complete
- Ability to create a pipeline to automate the machine learning process from preprocessing data through deploying the model

## Prerequisites

For brevity, I'll assume that SageMaker Studio and an IAM role with the appropriate permissions have been set up. In a corporate/enterprise environment, these will generally be set up by an administrator or someone on the architecture team.

- For directions on setting up the SageMaker environment see [Onboard to Amazon SageMaker Domain Using Quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html)
- For directions on setting up an AWS account and IAM role see [Set Up Amazon SageMaker Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html)

The notebooks in this series *may* run on a standalone SageMaker Jupyter Notebook instance or *possibly* in a local environment where the AWS credentials are specified. However, this series is designed to take advantage of the managed infrastructure and other benefits of SageMaker Studio, so that will be the preferred environment for all posts in the series. I won't be testing or troubleshooting the code on standalone SageMaker Jupyter Notebook instances or in local environments.

## Series Guide

1. [Read from and Write to S3]()
1. [Clean Data]()
1. [ETL Pipe Foundations]()
1. [ETL (extract, transform, load) Script]()
1. [ETL Pipeline]()
1. [EDA (Exploratory Data Analysis)]()
1. [Develop Preprocessing Code]()
1. [Preprocessing Pipeline]()
1. [Train Pre-built Model]()
1. [Train Custom Model]()
1. [Inference]()
1. [Multistep Pipeline]()
1. [Custom Transformers]()
1. [Custom Transformers at Inference]()
1. [Hyperparameter Optimization]()
1. [Evaluate Model]()
1. [Register and Deploy]()
1. [Debugger]()
1. [Interpretability and Bias]()

## Best Practices for Flexibility/Automation

Based on experience, we want to keep as much related code in the same place as possible. In the past, our code has spanned different applications, EC2 instances, repos, and just about everything else you can think of. This made it extremely difficult to track down which code needed to be updated when we had to make a change.

Additionally, we want to consolidate where changes need to be made in the code. In the past, we hard-coded values into several steps of the code. In the current code, the goal is to put all hard-coded values (in this case, the column names) in the same script. Should we need to change the included columns, we only have to change the `preprocessing.py` script. *Note:* in our production workflow, the data capture pulls all data in the specified tables, regardless of whether a column is expected in the `preprocessing.py` script.

To update the workflow:

- If a new table is available:
    - Add the table to the features DAG
    - Add the columns to the `preprocessing.py` script
- If a column was added or removed:
    - Add or remove the column(s) in the `preprocessing.py` script
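
As a rough illustration of keeping the column names in one place, here is a minimal sketch of what the top of a script like `preprocessing.py` could look like. The column names, the `EXPECTED_COLUMNS` constant, and the `select_expected_columns` helper are all hypothetical and are not taken from the production code; records are plain dictionaries to keep the example self-contained.

```python
# Hypothetical column names for illustration only; in practice these would be
# the real columns, all declared once at the top of preprocessing.py.
NUMERIC_COLUMNS = ["age", "balance"]
CATEGORICAL_COLUMNS = ["region"]
EXPECTED_COLUMNS = NUMERIC_COLUMNS + CATEGORICAL_COLUMNS


def select_expected_columns(record):
    """Return only the expected columns from one captured record (a dict).

    Extra columns pulled by the data capture are ignored; a missing
    expected column raises an error so the problem surfaces early.
    """
    missing = [col for col in EXPECTED_COLUMNS if col not in record]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return {col: record[col] for col in EXPECTED_COLUMNS}


# The capture may contain columns the script does not expect.
captured = {"age": 41, "balance": 1200.5, "region": "west", "unused_flag": True}
selected = select_expected_columns(captured)
```

With this layout, adding or removing a column means editing only the lists at the top of the script; the rest of the code reads from `EXPECTED_COLUMNS`.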

## Python Scripts

All Python scripts to be run on EC2 instances in a pipeline are in the `write_scripts.ipynb` notebook. This means that any changes needed to the Python scripts can be made all at once in the same notebook. Simply run the notebook to update the `.py` scripts.

To write each script to its own file, use the built-in Jupyter cell magic `%%writefile filename.py` as the first line of the cell. Everything else in that cell is written to the named file. Other file types can be written by changing the file extension, for example `filename.txt` instead of `filename.py`.
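
Because `%%writefile` only works inside a notebook cell, the sketch below shows the cell in a comment and then a rough plain-Python equivalent of what the magic does. The script body and the scratch directory are assumptions made for the example, not part of the actual `write_scripts.ipynb` notebook.

```python
from pathlib import Path
import tempfile

# In a notebook, the cell would look like this, with the magic on its first line:
#
#   %%writefile preprocessing.py
#   COLUMNS = ["age", "balance"]
#
# The plain-Python code below does roughly the same thing outside a notebook.

script_source = 'COLUMNS = ["age", "balance"]\n'  # hypothetical script body

out_dir = Path(tempfile.mkdtemp())       # scratch directory for this demo only
script_path = out_dir / "preprocessing.py"

# Like %%writefile without the -a flag, this overwrites any existing file.
script_path.write_text(script_source)

written = script_path.read_text()
```

Re-running the cell (or this code) replaces the file's contents, which is why re-running the whole notebook refreshes every script at once.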