# IMBA next purchase prediction

## Using XGBoost in SageMaker

---

In this example of using Amazon's SageMaker service, we will train a gradient-boosted tree model (XGBoost) and use it to productionize the predictive model.

## Step 1: Preprocess the data

In [1]:
# read the data into dataframe
import pandas as pd

bucket='imba'

# we load a smaller version of the full dataset; increase the instance size to load the full dataset instead
data_key = 'output_small/data_small.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

data = pd.read_csv(data_location)

In [2]:
# load order product train dataframe
order_product_train_key = 'data/order_products/order_products__train.csv.gz'
order_product_train_location = 's3://{}/{}'.format(bucket, order_product_train_key)

order_product_train = pd.read_csv(order_product_train_location)

In [3]:
# have a look at the data
order_product_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [4]:
# load orders dataframe
orders_key = 'data/orders/orders.csv'
orders_location = 's3://{}/{}'.format(bucket, orders_key)

orders = pd.read_csv(orders_location)
# only select train and test orders
#orders = orders[orders.eval_set != 'prior'][['user_id', 'order_id','eval_set']]

In [5]:
# attach user_id to order_product_train
order_product_train = order_product_train.merge(orders[['user_id', 'order_id']])

In [6]:
# attach eval_set to data
#data = data.merge(orders[orders.eval_set != 'prior'][['user_id', 'order_id','eval_set']])
data = data.merge(orders[orders.eval_set != 'prior'][['user_id','eval_set']])

In [7]:
# attach target variable: reordered
data = data.merge(order_product_train[['user_id', 'product_id', 'reordered']], how = 'left')
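
The left merge above keeps every user-product row in `data` and leaves `reordered` as NaN wherever the pair does not appear in the train orders. A minimal sketch of that behaviour, using made-up rows (the toy values are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy feature rows: one row per (user_id, product_id) pair. Values are made up.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "up_orders": [3, 1, 5],
})

# Toy label rows: only pairs that appear in a train order are present.
labels = pd.DataFrame({
    "user_id": [1],
    "product_id": [10],
    "reordered": [1],
})

# how='left' keeps all feature rows; unmatched pairs get NaN in 'reordered'.
merged = features.merge(labels, how="left", on=["user_id", "product_id"])
n_unlabelled = int(merged["reordered"].isna().sum())

# Treat "not in the train order" as "not reordered", then cast to int.
merged["reordered"] = merged["reordered"].fillna(0).astype(int)
print(merged)
```

This is why the notebook later fills the missing target values with 0 before training.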

In [8]:
# a few more engineered features; refer to the R code
data['prod_reorder_probability'] = data.prod_second_orders / data.prod_first_orders
data['prod_reorder_times'] = 1 + data.prod_reorders / data.prod_first_orders
data['prod_reorder_ratio'] = data.prod_reorders / data.prod_orders
data.drop(['prod_reorders', 'prod_first_orders', 'prod_second_orders'], axis=1, inplace=True)
data['user_average_basket'] = data.user_total_products / data.user_orders
data['up_order_rate'] = data.up_orders / data.user_orders
data['up_orders_since_last_order'] = data.user_orders - data.up_last_order
data['up_order_rate_since_first_order'] = data.up_orders / (data.user_orders - data.up_first_order + 1)
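
The ratios above divide by count columns, so a zero denominator would produce `inf` (or NaN for 0/0) in pandas. Whether zero counts occur in this dataset is not stated; the sketch below, on made-up values, shows one possible way to turn such results into missing values before deciding how to handle them.

```python
import pandas as pd

# Made-up counts; the second and third rows have a zero denominator.
toy = pd.DataFrame({
    "prod_reorders": [4, 0, 2],
    "prod_orders": [8, 0, 0],
})

# 4/8 -> 0.5, 0/0 -> NaN, 2/0 -> inf
ratio = toy["prod_reorders"] / toy["prod_orders"]

# Replace infinities with NaN so every undefined ratio is marked as missing.
toy["prod_reorder_ratio"] = ratio.replace([float("inf"), float("-inf")], float("nan"))
n_missing = int(toy["prod_reorder_ratio"].isna().sum())
print(toy)
```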

In [9]:
data.head()

Unnamed: 0,product_id,up_orders,user_mean_days_since_prior,user_period,user_distinct_products,user_reorder_ratio,user_total_products,up_average_cart_position,up_first_order,user_orders,...,user_id,eval_set,reordered,prod_reorder_probability,prod_reorder_times,prod_reorder_ratio,user_average_basket,up_order_rate,up_orders_since_last_order,up_order_rate_since_first_order
0,19508,6,5.969574,2943,200,0.602434,497,7.833333,11,61,...,144185,test,,0.348857,1.812378,0.448239,8.147541,0.098361,32,0.117647
1,42307,1,5.969574,2943,200,0.602434,497,3.0,55,61,...,144185,test,,0.546166,3.4241,0.707952,8.147541,0.016393,6,0.142857
2,35883,1,5.969574,2943,200,0.602434,497,8.0,52,61,...,144185,test,,0.357664,2.010219,0.502542,8.147541,0.016393,9,0.1
3,13539,1,5.969574,2943,200,0.602434,497,12.0,50,61,...,144185,test,,0.176471,1.270588,0.212963,8.147541,0.016393,11,0.083333
4,27966,3,5.969574,2943,200,0.602434,497,7.0,53,61,...,144185,test,,0.611129,4.330669,0.769089,8.147541,0.04918,0,0.333333


In [10]:
# split into training and test set, test set does not have target variable
train = data[data.eval_set == 'train'].copy()
test = data[data.eval_set == 'test'].copy()

In [11]:
# id fields won't be used in the model, so back them up and remove them from the dataframe
#test_id = test[['product_id','user_id', 'order_id', 'eval_set']]
test_id = test[['product_id','user_id', 'eval_set']]
#test.drop(['product_id','user_id', 'order_id', 'eval_set', 'reordered'], axis=1, inplace=True)
test.drop(['product_id','user_id', 'eval_set', 'reordered'], axis=1, inplace=True)

In [12]:
# convert target variable to 1/0 for training dataframe
train['reordered'] = train['reordered'].fillna(0)
train['reordered'] = train.reordered.astype(int)

In [13]:
# drop id columns as they won't be used in model
#train.drop(['eval_set', 'user_id', 'product_id', 'order_id'], axis=1, inplace=True)
train.drop(['eval_set', 'user_id', 'product_id'], axis=1, inplace=True)

In [14]:
# this is the target variable dataframe
train_y = train[['reordered']]
# this is the dataframe without target variable
train_X = train.drop(['reordered'], axis = 1)

## Step 2: Classification

Now that we have created the feature representation of our training (and testing) data, it is time to start setting up and using the XGBoost classifier provided by SageMaker.

### Writing the dataset

The XGBoost classifier that we will be using requires the dataset to be written to a file and stored using Amazon S3. To do this, we will start by splitting the training dataset into two parts, the data we will train the model with and a validation set. Then, we will write those datasets to a file and upload the files to S3.

First we split the data into a training set and a validation set. The training data is used to fit the model; the validation data is used to evaluate its performance.

In [15]:
import pandas as pd

val_X = train_X[:20000]
train_X = train_X[20000:]

val_y = train_y[:20000]
train_y = train_y[20000:]

#test_y = pd.DataFrame(test_y)
#test_X = pd.DataFrame(test_X)
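
Slicing off the first 20,000 rows gives a validation set that depends on row order. If the rows happen to be grouped (for example by user), a shuffled split may be more representative. A minimal sketch, assuming a toy DataFrame and a hypothetical `split_train_val` helper that is not part of the notebook:

```python
import pandas as pd

def split_train_val(df, n_val, seed=42):
    """Shuffle rows reproducibly, then hold out the first n_val rows for validation."""
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.iloc[n_val:], shuffled.iloc[:n_val]

# Toy data standing in for the training dataframe (values are made up).
toy = pd.DataFrame({"reordered": [0, 1] * 5, "up_orders": list(range(10))})

train_part, val_part = split_train_val(toy, n_val=3)
print(len(train_part), len(val_part))
```

Fixing the seed keeps the split reproducible between runs.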

For more information about this and other algorithms, the SageMaker developer documentation can be found on __[Amazon's website.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__

In [16]:
# First we make sure that the local directory in which we'd like to store the training and validation csv files exists.
import os
data_dir = 'data/xgboost'
os.makedirs(data_dir, exist_ok=True)

In [17]:
# First, save the test data to test.csv in the data_dir directory without label.
pd.DataFrame(test).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

# Then we save the training and validation set into local disk as csv files
pd.concat([val_y, val_X], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([train_y, train_X], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
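
For CSV training data, the built-in XGBoost algorithm expects the label in the first column and no header row, which is why the target is concatenated first and `header=False` is passed above. A small sketch with made-up rows, written to an in-memory buffer instead of a file, shows the resulting layout:

```python
import io
import pandas as pd

# Made-up target and feature frames with matching indexes.
toy_y = pd.DataFrame({"reordered": [1, 0]})
toy_X = pd.DataFrame({"up_orders": [6, 1], "up_order_rate": [0.5, 0.25]})

# Label first, then features; no header and no index column.
buffer = io.StringIO()
pd.concat([toy_y, toy_X], axis=1).to_csv(buffer, header=False, index=False)

lines = buffer.getvalue().splitlines()
print(lines)
```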

In [18]:
# To save a bit of memory we can set train_X, val_X, train_y and val_y to None.

train_X = val_X = train_y = val_y = None

### Uploading Training / Validation files to S3

Amazon's S3 service allows us to store files that can be accessed by both the built-in training models such as the XGBoost model we will be using as well as custom models such as the one we will see a little later.

In [19]:
import sagemaker

session = sagemaker.Session() # Store the current SageMaker session

# S3 prefix (which folder will we use)
prefix = 'imba-xgboost'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
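
Each `upload_data` call returns the S3 URI of the uploaded file, which is later passed to the training and transform jobs. The sketch below only mirrors the URI layout one would expect for a single file (bucket, then key prefix, then file name); `expected_s3_uri` is a hypothetical helper and the bucket name is a placeholder, since the real one comes from `session.default_bucket()`.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Hypothetical helper: build the URI layout expected for a single uploaded file."""
    # Keep only the file name, regardless of the local path separator.
    filename = local_path.replace("\\", "/").rsplit("/", 1)[-1]
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))

bucket = "my-default-bucket"  # placeholder, not a real bucket name
train_uri = expected_s3_uri(bucket, "imba-xgboost", "data/xgboost/train.csv")
print(train_uri)
```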

### Creating a tuned XGBoost model

Now that the data has been uploaded it is time to create the XGBoost model. The first step is to create an estimator object which will be used as the *base* of your hyperparameter tuning job.

In [20]:
from sagemaker import get_execution_role

# Our current execution role is required when creating the model as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()

In [21]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# As a matter of convenience, the training and inference code both use the same container.
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(session.boto_region_name, 'xgboost', '0.90-1')

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [22]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# As a matter of convenience, the training and inference code both use the same container.
import sagemaker
container = sagemaker.image_uris.retrieve('xgboost', session.boto_region_name, 'latest')
#container = get_image_uri(session.boto_region_name, 'xgboost', '0.90-1')

In [23]:
# Create a SageMaker estimator using the container location determined in the previous cell.
# We use a single training instance of type ml.m4.xlarge and write the model artifacts to
# 's3://{}/{}/output'.format(session.default_bucket(), prefix).

xgb = sagemaker.estimator.Estimator(container,                     # The location of the container we wish to use
                                    role,                          # Our current IAM role
                                    instance_count=1,              # How many compute instances
                                    instance_type='ml.m4.xlarge',  # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# Set the XGBoost hyperparameters on the xgb object. The label is binary,
# so we use the 'binary:logistic' objective.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.1,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

### Create the hyperparameter tuner

Now that the base estimator has been set up, we need to construct a hyperparameter tuner object, which we will use to ask SageMaker to run a hyperparameter tuning job.

**Note:** If you don't want the hyperparameter tuning job to take too long, make sure to not set the total number of models (jobs) too high.

In [24]:
# First, make sure to import the relevant objects used to construct the tuner
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner


# create the tuner object:

xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb, # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name = 'validation:rmse', # The metric used to compare trained models.
                                               objective_type = 'Minimize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs = 4, # The total number of models to train
                                               max_parallel_jobs = 2, # The number of models to train in parallel
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(3, 12),
                                                    'eta'      : ContinuousParameter(0.05, 0.5),
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'subsample': ContinuousParameter(0.5, 0.9),
                                                    'gamma': ContinuousParameter(0, 10),
                                               })
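
To make the search space concrete, the sketch below draws four candidate configurations from the same ranges using plain random sampling. This is only an illustration of what the ranges allow: SageMaker's tuner applies its own search strategy (Bayesian optimization by default), and the `ranges` dictionary and `sample_config` helper are assumptions introduced here, not part of the SageMaker SDK.

```python
import random

# Same bounds as the tuner above, written as (kind, low, high).
ranges = {
    "max_depth": ("int", 3, 12),
    "eta": ("float", 0.05, 0.5),
    "min_child_weight": ("int", 2, 8),
    "subsample": ("float", 0.5, 0.9),
    "gamma": ("float", 0, 10),
}

def sample_config(rng):
    """Draw one value per hyperparameter, uniformly within its bounds."""
    config = {}
    for name, (kind, low, high) in ranges.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config

rng = random.Random(0)  # fixed seed for reproducibility
candidates = [sample_config(rng) for _ in range(4)]  # mirrors max_jobs = 4
for candidate in candidates:
    print(candidate)
```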

### Fit the hyperparameter tuner

Now that the hyperparameter tuner object has been constructed, it is time to fit the various models and find the best performing model.

In [25]:
s3_input_train = sagemaker.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.TrainingInput(s3_data=val_location, content_type='csv')
xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

.....................................................................................................................................!


### Testing the model

Now that we've run our hyperparameter tuning job, it's time to see how well the best performing model actually performs. To do this we will use SageMaker's Batch Transform functionality. Batch Transform is a convenient way to perform inference on a large dataset when results are not needed in real time. That is, we don't need to use our model's results immediately and can instead perform inference on a large number of samples at once. An example of this in industry might be producing an end-of-month report. This method of inference is also useful to us because it means we can perform inference on our entire test set.

Remember that in order to create a transformer object to perform the batch transform job, we need a trained estimator object. We can do that using the `attach()` method, creating an estimator object which is attached to the best trained job.

In [26]:
# attach the model:

xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())


2021-12-05 05:00:40 Starting - Preparing the instances for training
2021-12-05 05:00:40 Downloading - Downloading input data
2021-12-05 05:00:40 Training - Training image download completed. Training in progress.
2021-12-05 05:00:40 Uploading - Uploading generated training model
2021-12-05 05:00:40 Completed - Training job completed


Now that we have an estimator object attached to the correct training job, we can proceed as we normally would and create a transformer object.

In [27]:
# Create a transformer object from the attached estimator. An instance count of 1 and an
# instance type of ml.m4.xlarge should be more than enough.
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Next we actually perform the transform job. When doing so we need to make sure to specify the type of data we are sending so that it is serialized correctly in the background. In our case we are providing our model with csv data so we specify `text/csv`. Also, if the test data that we have provided is too large to process all at once then we need to specify how the data file should be split up. Since each line is a single entry in our data set we tell SageMaker that it can split the input on each line.

In [28]:
# Start the transform job. Make sure to specify the content type and the split type of the test data.
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

.............................Arguments: serve
[2021-12-05 05:11:37 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-12-05 05:11:37 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-12-05 05:11:37 +0000] [1] [INFO] Using worker: gevent
[2021-12-05 05:11:37 +0000] [21] [INFO] Booting worker with pid: 21
[2021-12-05 05:11:38 +0000] [22] [INFO] Booting worker with pid: 22
  monkey.patch_all(subprocess=True)
[2021-12-05:05:11:38:INFO] Model loaded successfully for worker : 21
[2021-12-05 05:11:38 +0000] [23] [INFO] Booting worker with pid: 23
[2021-12-05 05:11:38 +0000] [24] [INFO] Booting worker with pid: 24
  monkey.patch_all(subprocess=True)
[2021-12-05:05:11:38:INFO] Model loaded successfully for worker : 22
  monkey.patch_all(subprocess=True)
  monkey.patch_all(subprocess=True)
[2021-12-05:05:11:38:INFO] Model loaded successfully for worker : 23
[2021-12-0

Now the transform job has executed and the result, the predicted reorder probability for each user-product pair in the test set, has been saved on S3. Since we would rather work on this file locally, we can use a bit of notebook magic to copy the file to the `data_dir`.

In [29]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-ap-southeast-2-480972076311/xgboost-2021-12-05-05-06-49-605/test.csv.out to data/xgboost/test.csv.out


In [34]:
!pwd

/home/ec2-user/SageMaker


In [35]:
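# read the predicted probabilities written by the batch transform job (one value per line)
prediction = pd.read_csv('data/xgboost/test.csv.out', header=None, names=["prob"])
# reset the indexes so that ids, features and predictions line up row by row
test_id = test_id.reset_index().drop(['index','eval_set'], axis=1)
test = test.reset_index().drop(['index'], axis=1)
# combine ids, features and predicted probabilities, then save the result locally
pd.concat([test_id, test, prediction], axis=1).to_csv('data/xgboost/test_final.csv', index=False)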
!aws s3 cp 'data/xgboost/test_final.csv' 's3://imba/model_output/test_final.csv'
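
The transform output holds one predicted probability per line. To turn these into reorder / no-reorder decisions a threshold is needed. A minimal sketch follows; the 0.5 cut-off, the `reordered_pred` column name and the sample values are assumptions for illustration only, and a different threshold may suit the data better.

```python
import io
import pandas as pd

# Stand-in for test.csv.out: one probability per line (made-up values).
sample_output = io.StringIO("0.91\n0.08\n0.47\n")
prediction = pd.read_csv(sample_output, header=None, names=["prob"])

# Assumed cut-off: predict a reorder when the probability is at least 0.5.
threshold = 0.5
prediction["reordered_pred"] = (prediction["prob"] >= threshold).astype(int)
print(prediction)
```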