# SM08: Preprocessing Script

The code to preprocess the [Insurance Company Benchmark (COIL 2000) dataset](https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29) was developed in post [SM07](). This notebook turns that code into the script for the pipeline.

## Update EC2 instance

Writing the `preprocess.py` script is very similar to writing the `etl.py` script. The major difference is that I want to use a library that isn't installed by default and want to ensure package versions for several libraries. To do this I need to consider the pre-built EC2 instance configurations vs other options.

AWS provides several different pre-built EC2 instance configurations. Unfortunately, there's always one package that needs to be updated to a specific version or isn't included by default. AWS generally recommends the following two solutions:

- Use a `requirements.txt` (only available for estimator instances, not processor instances)
- Create a custom EC2 image, load it to ECR (elastic container registry), and reference it in your pipeline

When first starting out, we don't want to have to figure out how to convert a transformer to an estimator. We just want to be able to run the Python script and save the outputs to a designated S3 location. So, the `requirements.txt` is out. For information on how to do it, see the [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-third-party-libraries) section in the documentation.

The directions to create a custom EC2 image generally involve going into another system, such as the AWS CLI. Our goal is to keep as much together in SageMaker as humanly possible. This rules out creating our own image. For information on how to create an image and load it to ECR, see the [Building your own algorithm container](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.html#Building-your-own-algorithm-container) section of the documentation or [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) in the ECR User Guide. *Fair warning*: it isn't recommended to create a Docker container from within a Docker container (which is what SageMaker Studio is). To create the container, you'll need to use the AWS CLI or a SageMaker instance (not Studio).

Stack Overflow to the rescue. [This answer](https://stackoverflow.com/a/63925135) gave us the information we needed to simply install or update the specific packages we needed. The code is included directly in the Python script and is easy to use. We adapted it so that it can either install or upgrade a package as needed.

If we get to the point that we frequently need a specific configuration, we'll want to further explore creating our own image to upload to ECR.

The code to install or upgrade a package on the EC2 instance is:

In [None]:
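import subprocess
import sys

def install(package):
    # Run pip with the interpreter that is executing this script; pip's -q flag goes after "install"
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

def upgrade(package):
    # Install or upgrade a package; a pin such as 'pandas==1.3.5' selects an exact version
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "--upgrade", package])

upgrade('pandas==1.3.5')
upgrade('numpy')
install('category_encoders')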
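
As a side note on how the pip call is assembled, here is a minimal sketch that separates building the command from running it, which makes the arguments easy to inspect. The helper name `pip_install_command` and its parameters are hypothetical, not part of the original script. One detail it makes explicit: pip's quiet flag belongs after `install`, whereas a `-q` placed directly after the interpreter is read by Python itself.

```python
import sys

def pip_install_command(package, version=None, upgrade=False, quiet=True):
    """Build the argument list for running pip with the current interpreter."""
    # Pin the version only when one is given, e.g. 'pandas==1.3.5'
    requirement = f"{package}=={version}" if version else package
    command = [sys.executable, "-m", "pip", "install"]
    if quiet:
        command.append("--quiet")
    if upgrade:
        command.append("--upgrade")
    command.append(requirement)
    return command

# Build (but do not run) the commands for a pinned upgrade and a plain install
pinned = pip_install_command("pandas", version="1.3.5", upgrade=True)
plain = pip_install_command("category_encoders", quiet=False)
```

Either list can then be handed to `subprocess.check_call`, exactly as the `install` and `upgrade` functions do.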

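## Create directories

Each `ProcessingOutput` in the pipeline reads from its own subdirectory of the output path, so the script creates those directories first.

In [None]:
    # exist_ok=True creates each directory independently and skips any that already exist
    for subdir in ("train", "validate", "test", "encoder"):
        os.makedirs(os.path.join(output_path, subdir), exist_ok=True)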
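
A minimal sketch of idempotent directory creation that can be run outside SageMaker. It uses a temporary directory as a stand-in for `/opt/ml/processing/output` (an assumption for local testing), and the helper name `make_output_dirs` is hypothetical. With `exist_ok=True`, finding one directory already present does not stop the remaining ones from being created.

```python
import os
import tempfile

SUBDIRS = ("train", "validate", "test", "encoder")

def make_output_dirs(output_path, subdirs=SUBDIRS):
    """Create each output subdirectory; existing ones are left untouched."""
    created = []
    for name in subdirs:
        path = os.path.join(output_path, name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created

# Temporary directory standing in for /opt/ml/processing/output (assumption for local runs)
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "train"))  # simulate a directory that already exists
    paths = make_output_dirs(tmp)
    all_exist = all(os.path.isdir(p) for p in paths)
```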

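## Split data

Shuffle the rows with a fixed seed, then cut at the 70% and 90% marks to get a 70/20/10 split into train, validation, and test sets.

In [None]:
    train_data, validation_data, test_data = np.split(
        processed_df.sample(frac=1, random_state=1729),
        [int(0.7 * len(processed_df)), int(0.9 * len(processed_df))],
    )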
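
To see how the two cut points passed to `np.split` turn into three pieces, here is a small sketch on row positions instead of a DataFrame. The function name `split_indices` and the use of NumPy's `default_rng` are assumptions for illustration; it will not reproduce the exact shuffle that `DataFrame.sample(random_state=1729)` produces.

```python
import numpy as np

def split_indices(n_rows, train_cut=0.7, validate_cut=0.9, seed=1729):
    """Shuffle row positions, then cut them at the two cumulative marks."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # np.split cuts *before* each listed position, so two cuts give three pieces
    cuts = [int(train_cut * n_rows), int(validate_cut * n_rows)]
    return np.split(shuffled, cuts)

train_idx, validate_idx, test_idx = split_indices(100)
sizes = (len(train_idx), len(validate_idx), len(test_idx))
```

For 100 rows the pieces hold 70, 20, and 10 positions, and every row lands in exactly one of them.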

## Save encoder

Save the fitted encoder next to the data so the same transformation can be reapplied later.

In [None]:
    joblib.dump(encoder, os.path.join(output_path, 'encoder', encoder_name))
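
A minimal round-trip sketch of what saving with `joblib` gives you. A plain dictionary stands in for the fitted encoder and a temporary directory stands in for the output path; both are assumptions made so the example runs without `category_encoders` or SageMaker.

```python
import os
import tempfile

import joblib

# Plain dict standing in for the fitted encoder (assumption for illustration)
stand_in_encoder = {"cols": ["a", "b"], "use_cat_names": True}

with tempfile.TemporaryDirectory() as tmp:
    encoder_dir = os.path.join(tmp, "encoder")
    os.makedirs(encoder_dir, exist_ok=True)
    encoder_file = os.path.join(encoder_dir, "preprocessor.joblib")

    joblib.dump(stand_in_encoder, encoder_file)  # write the object to disk
    restored = joblib.load(encoder_file)         # read it back
```

Loading a real encoder works the same way, provided `category_encoders` can be imported in the environment that calls `joblib.load`.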

## Write Script

Put it all together.

In [2]:
%%writefile preprocess.py

import subprocess
import sys

def install(package):
    # pip's -q flag goes after "install"; placed right after the interpreter it is read by Python instead
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

def upgrade(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "--upgrade", package])

upgrade('pandas==1.3.5')
upgrade('numpy')
install('category_encoders')

import pandas as pd
import numpy as np
import category_encoders as ce
import joblib
import os


if __name__ == '__main__':
    input_path = '/opt/ml/processing/input'
    output_path = '/opt/ml/processing/output'
 
    # exist_ok=True creates each directory independently and skips any that already exist
    for subdir in ("train", "validate", "test", "encoder"):
        os.makedirs(os.path.join(output_path, subdir), exist_ok=True)
    
    cat_cols = ['zip_agg Customer Subtype', 'zip_agg Customer main type']

    df = pd.read_csv(os.path.join(input_path, 'full_data.csv'))
    print('Preprocessing data')
    encoder = ce.OneHotEncoder(cols=cat_cols, use_cat_names=True, handle_missing='return_nan')
    processed_df = encoder.fit_transform(df)

    # Shuffle with a fixed seed, then cut at 70% and 90% for a 70/20/10 split
    train_data, validation_data, test_data = np.split(
        processed_df.sample(frac=1, random_state=1729),
        [int(0.7 * len(processed_df)), int(0.9 * len(processed_df))],
    )
    
    print('Saving dataframe')
    train_data.to_csv(os.path.join(output_path, 'train', 'train_feats.csv'))
    validation_data.to_csv(os.path.join(output_path, 'validate', 'validate_feats.csv'))
    test_data.to_csv(os.path.join(output_path, 'test', 'test_feats.csv'))
                              
    print('Saving preprocessor joblib')
    encoder_name = 'preprocessor.joblib'
    joblib.dump(encoder, os.path.join(output_path, 'encoder', encoder_name))

Overwriting preprocess.py


## Write Pipeline

Write the pipeline that runs the `.py` script.

The foundations of pipelines are covered in [SM03](). The only new thing here is the multiple outputs: one `ProcessingOutput` for each directory the script writes to.

In [3]:
import sagemaker
import sagemaker.session

from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.functions import Join
from sagemaker.workflow.execution_variables import ExecutionVariables

from sagemaker.workflow.pipeline import Pipeline

session = sagemaker.session.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()

bucket = session.default_bucket()
prefix = '1_ins_dataset'
pipeline_name = "InsExample"  # SageMaker Pipeline name
model_package_group_name = "Insurance Co Example"  # Model name in model registry
framework_version = "0.23-1"

input_uri = f's3://{bucket}/{prefix}/clean/full_data.csv'

tags = [
    {"Key": "PLATFORM", "Value": "FO-ML"},
    {"Key": "BUSINESS_REGION", "Value": "GLOBAL"},
    {"Key": "BUSINESS_UNIT", "Value": "MOBILITY"},
    {"Key": "CLIENT", "Value": "MULTI_TENANT"}
   ]

# tags = [
#     {"Key": "DATASET", "Value": "InsCOIL"},
#     {"Key": "SOURCE", "Value": "UCI"}
#    ]

processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)

processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.t3.medium")
    
input_data = ParameterString(
    name="InputData",
    default_value=input_uri
)

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=role,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="ins-example-job"
)

step_preprocess = ProcessingStep(
    name="preprocess_data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input")
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/output/train",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    'final',
                    "train"
                ],
            ),
        ),
        ProcessingOutput(
            output_name="validate",
            source="/opt/ml/processing/output/validate",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    'final',
                    "validate"
                ],
            ),
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/output/test",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    'final',
                    "test"
                ],
            ),
        ),
        ProcessingOutput(
            output_name="encoder",
            source="/opt/ml/processing/output/encoder",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    'final',
                    'encoder'
                ],
            ),
        ),
    ],
    code="preprocess.py"
)

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        processing_instance_count,
        input_data,
    ],
    steps=[step_preprocess])

pipeline.upsert(role_arn=role, tags=tags)

pipeline.start(execution_display_name="InsPreprocess4")

The input argument instance_type of function (sagemaker.image_uris.retrieve) is a pipeline variable (<class 'sagemaker.workflow.parameters.ParameterString'>), which is not allowed. The default_value of this Parameter object will be used to override it. Please make sure the default_value is valid.


_PipelineExecution(arn='arn:aws:sagemaker:us-east-1:707031497630:pipeline/insexample/execution/5f2oz1vy9ub6', sagemaker_session=<sagemaker.session.Session object at 0x7fc80031b0d0>)
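
The four `ProcessingOutput` destinations in the pipeline differ only in their last path segment. As a rough sketch of the layout they describe, the plain-Python helper below mirrors `Join(on="/", values=[...])` with ordinary strings. The helper name `output_destination` and the bucket name `my-bucket` are hypothetical; the notebook itself uses `session.default_bucket()`.

```python
def output_destination(bucket, prefix, output_name, stage="final"):
    """Join the pieces with '/' the way Join(on='/') does for plain strings."""
    return "/".join([f"s3://{bucket}", prefix, stage, output_name])

# 'my-bucket' is a placeholder bucket name (assumption)
destinations = {
    name: output_destination("my-bucket", "1_ins_dataset", name)
    for name in ("train", "validate", "test", "encoder")
}
```

Each entry corresponds to one output of the step, for example `s3://my-bucket/1_ins_dataset/final/train` for the training split.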