## Batch Transform

Now we use "today's" features to generate predictions that the business unit will use as input for promotions.

To do this, we deploy the model produced by the best training job of the hyperparameter tuning job and use the resulting endpoint for inference.

In [1]:
import sagemaker
import boto3
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import os 
import time
from sagemaker.predictor import csv_serializer,RealTimePredictor

# take the best training job from notebook #PROD2
# best_training_job = 'hpo-invoice-pred-191009-1624-002-2086aff7'
role = sagemaker.get_execution_role()
prefix = 'predictions'

In [2]:
%store -r bucket

In [3]:
df = pd.read_csv('to_predict.csv',header=None)

In [4]:
df.shape

(1197, 24)

In [5]:
id_reseller = pd.read_csv('id_reseller_to_predict.csv',header=None)[0]

In [6]:
id_reseller.shape

(1197,)

Make sure you stored the `best_job` variable in <a href='./PROD2.ModelTrain.ipynb'>notebook 2</a>.

In [7]:
%store -r best_job

In [8]:
model = Estimator.attach(best_job)

2019-12-10 19:47:59 Starting - Preparing the instances for training
2019-12-10 19:47:59 Downloading - Downloading input data
2019-12-10 19:47:59 Training - Training image download completed. Training in progress.
2019-12-10 19:47:59 Uploading - Uploading generated training model
2019-12-10 19:47:59 Completed - Training job completed
Arguments: train
[2019-12-10:19:47:27:INFO] Running standalone xgboost training.
[2019-12-10:19:47:27:INFO] Setting up HPO optimized metric to be : mae
[2019-12-10:19:47:27:INFO] File size need to be processed in the node: 20.56mb. Available memory size in the node: 8524.61mb
[2019-12-10:19:47:27:INFO] Determined delimiter of CSV input is ','
[19:47:27] S3DistributionType set as FullyReplicated
[19:47:27] 126181x24 matrix with 3024374 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-12-10:19:47:27:INFO] Determined delimiter of CSV input is ','

Training seconds: 86
Billable seconds: 86


In [None]:
model_predictor = model.deploy(initial_instance_count=1,
                            instance_type='ml.t2.medium')

-------------------------------------------------------------------------------

In [None]:
# In case you interrupt the notebook, you can create the predictor using the endpoint name.
#model_predictor = RealTimePredictor('########')

In [None]:
model_predictor.content_type = 'text/csv'
model_predictor.serializer = csv_serializer
model_predictor.deserializer = None

In [None]:
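def predict(data, rows=500):
    # Split the input into batches of roughly `rows` rows so each request stays small.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is decoded from bytes and appended as comma-separated text.
        predictions = ','.join([predictions, model_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma left by the first join, then parse into a NumPy array.
    return np.fromstring(predictions[1:], sep=',')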

predictions = predict(df.values)
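To see how the batching behaves without calling the endpoint, here is a minimal sketch. It assumes, as the cell above does, that each response is a comma-separated string of predictions encoded as UTF-8 bytes. `fake_predict` and `predict_in_batches` are hypothetical names used only for illustration; `fake_predict` stands in for `model_predictor.predict` and simply returns the row sums.

```python
import numpy as np

def fake_predict(batch):
    # Hypothetical stand-in for model_predictor.predict (an assumption):
    # returns one value per row as comma-separated UTF-8 bytes.
    return ','.join(str(float(row.sum())) for row in batch).encode('utf-8')

def predict_in_batches(data, predict_fn, rows=500):
    # Same batch count as the notebook's predict(): floor(n / rows) + 1.
    n_batches = int(data.shape[0] / float(rows) + 1)
    parts = []
    for batch in np.array_split(data, n_batches):
        parts.append(predict_fn(batch).decode('utf-8'))
    # Join the per-batch strings and parse every value as a float.
    return np.array(','.join(parts).split(','), dtype=float)

# Same shape as the data loaded above: 1197 rows, 24 columns.
data = np.ones((1197, 24))
batch_sizes = [len(b) for b in np.array_split(data, int(1197 / 500.0 + 1))]
result = predict_in_batches(data, fake_predict)
print(batch_sizes, result.shape)
```

With 1197 rows and `rows=500`, the data is split into three batches of 399 rows each, and the result has one prediction per input row.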

In [None]:
predictions.shape

In [None]:
df_predictions = pd.DataFrame({'id_reseller': id_reseller, 'prediction': predictions})

In [None]:
df_predictions.head()

Finally, we upload the predictions to S3.

In [None]:
df_predictions.to_csv('predictions.csv',index=False)

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'predictions.csv')).upload_file('predictions.csv')
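The object key used in the upload is the prefix joined to the file name. S3 keys use forward slashes, and `os.path.join` produces them on a Linux notebook instance. As a small sketch of an alternative, `posixpath.join` builds the same key regardless of the operating system; the variable name `key` is only illustrative.

```python
import posixpath

prefix = 'predictions'

# posixpath always joins with '/', which matches the S3 key convention.
key = posixpath.join(prefix, 'predictions.csv')
print(key)
```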