## Create a file for batch prediction

Normally, this file would be the input data you want predictions for, but here we have to create it from the Iris dataset first. Let's use the already split data set (the `iris_test.csv` file in S3) to run batch prediction on.

In [None]:
import sagemaker
import pandas as pd

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
data_location = 's3://{}/{}'.format(bucket,"iris/data/iris_test.csv")
df = pd.read_csv(data_location,header=None)
df.head()

## Remove the class label from the batch data

Since we are running prediction, we would not normally have the label (the first column) beforehand. Our data set is labeled, so we remove the label column and let the model infer it.

In [None]:
batchdf = df.drop(columns=0)
batchdf.to_csv("iris_batch.csv",header=False,index=False)
input_batch = sagemaker_session.upload_data(path='iris_batch.csv', key_prefix='iris/data')
batchdf.head()

In [None]:
# this is the S3 path to the file that will be used for batch prediction.
input_batch

## Run the batch transformation

We will trigger a one-off batch transform job in SageMaker that turns the data in our S3 bucket into predictions. The results are stored in the same bucket under the key prefix `iris/batch_output`. Note that this starts an instance with the XGBoost container to infer the flower type for each input row and stops it when the job finishes, so it is a great way to save on costs when the use case does not require an online endpoint.

Typically, batch transform jobs are started from a Lambda function that uses the SageMaker or Boto3 SDK, triggered on a **schedule** or by an **event** (e.g. an object containing new data was uploaded to S3).
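To make the event-driven option more concrete, here is a minimal sketch of a Lambda handler that starts a transform job when an object lands in S3. It is one possible shape under stated assumptions, not a tested deployment: `MODEL_NAME`, `OUTPUT_PATH`, and the helper `build_transform_request` are hypothetical names introduced for illustration, and the function is assumed to be wired to an S3 "object created" notification and to have permission to call SageMaker.

```python
import time
import urllib.parse

# Hypothetical placeholders: replace with your own model name and output location.
MODEL_NAME = "my-xgboost-model"
OUTPUT_PATH = "s3://my-bucket/iris/batch_output"


def build_transform_request(bucket, key, model_name=MODEL_NAME, output_path=OUTPUT_PATH):
    """Build the keyword arguments for the SageMaker CreateTransformJob call."""
    return {
        # Job names must be unique, so a timestamp is appended (second resolution only).
        "TransformJobName": "iris-batch-{}".format(int(time.time())),
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}".format(bucket, key),
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        "TransformOutput": {"S3OutputPath": output_path},
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
    }


def lambda_handler(event, context):
    """Start one batch transform job for the object named in an S3 event."""
    import boto3  # imported here so the request builder can be tried without AWS access

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = urllib.parse.unquote_plus(record["object"]["key"])

    request = build_transform_request(bucket, key)
    boto3.client("sagemaker").create_transform_job(**request)
    return {"TransformJobName": request["TransformJobName"]}
```

The handler returns as soon as the job is created rather than waiting for it to finish, which keeps the Lambda invocation short. The settings mirror the ones used in the manual run in the next cell.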


In [None]:
# Normally you would not run batch manually, but trigger the batch prediction on an event 
# Some possible triggers would be: scheduled, a new file in S3

from sagemaker.transformer import Transformer

output_path='s3://{}/iris/batch_output'.format(bucket)

#model_name = COPY_THE_MODEL_NAME_FROM_PREVIOUS_NOTEBOOK_HERE
model_name = "xgboost-2020-02-24-08-58-21-841"

xgb_batch = Transformer(
    base_transform_job_name='iris-batch',
    model_name=model_name,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=output_path)

xgb_batch.transform(input_batch,content_type='text/csv',split_type='Line')

In [None]:
xgb_batch.wait()

In [None]:
# And now let's view our predictions
import boto3
s3_client = boto3.client('s3')
# List the result files that the transform job wrote under the output prefix
output_files = s3_client.list_objects(Bucket=bucket,
                                      Prefix='iris/batch_output/',
                                      Delimiter='/')['Contents']

# Read every result file and stack the predictions into one DataFrame
output_data = pd.concat([
    pd.read_csv('s3://{}/{}'.format(bucket, file['Key']), header=None)
    for file in output_files
])

output_data.head()
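Because the batch input was built from labeled data, the predictions can be compared with the labels we removed. The sketch below is a rough check, not part of the original workflow. It assumes the model writes one predicted class index per line (for example, an XGBoost model trained with a `multi:softmax` objective) and that the output rows are in the same order as the input rows. The helper name `accuracy` is introduced here for illustration.

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where the predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty list")
    # Cast both sides to int so that 1 and 1.0 count as the same class.
    matches = sum(1 for a, p in zip(actual, predicted) if int(a) == int(p))
    return matches / len(actual)


# Made-up values for illustration only, not real model output.
example_actual = [0, 1, 2, 1]
example_predicted = [0.0, 1.0, 1.0, 1.0]
print(accuracy(example_actual, example_predicted))  # 3 of the 4 pairs match
```

In this notebook, the call would look like `accuracy(df[0].tolist(), output_data[0].tolist())`, provided the assumptions above hold for your model. If the model outputs class probabilities instead, the comparison needs a different approach.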