## Using XGBoost for Customer Response Prediction

This notebook uses XGBoost to classify whether or not a customer will respond to an email sent out to them.

In [1]:
!pip install sagemaker==1.72.0

Collecting sagemaker==1.72.0
  Using cached sagemaker-1.72.0-py2.py3-none-any.whl
Collecting smdebug-rulesconfig==0.1.4
  Using cached smdebug_rulesconfig-0.1.4-py2.py3-none-any.whl (10 kB)
Installing collected packages: smdebug-rulesconfig, sagemaker
  Attempting uninstall: smdebug-rulesconfig
    Found existing installation: smdebug-rulesconfig 1.0.1
    Uninstalling smdebug-rulesconfig-1.0.1:
      Successfully uninstalled smdebug-rulesconfig-1.0.1
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.45.0
    Uninstalling sagemaker-2.45.0:
      Successfully uninstalled sagemaker-2.45.0
Successfully installed sagemaker-1.72.0 smdebug-rulesconfig-0.1.4
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [2]:
%matplotlib inline

import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import sklearn.model_selection

In [3]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer

session = sagemaker.Session()

role = get_execution_role()

## 1. Data Preprocessing

In [4]:
train_data = pd.read_csv('data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
test_data = pd.read_csv('data/Udacity_MAILOUT_052018_TEST.csv', sep=';')


In [5]:
train_data.RESPONSE.value_counts()

0    42430
1      532
Name: RESPONSE, dtype: int64

### Clean Data

In [6]:
object_cols = ['CAMEO_DEU_2015', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015', 'D19_LETZTER_KAUF_BRANCHE', 'EINGEFUEGT_AM', 'OST_WEST_KZ']
for each_col in object_cols:
    if each_col in train_data.columns:
        print(each_col, train_data[each_col].nunique(), " values")

CAMEO_DEU_2015 45  values
CAMEO_DEUG_2015 19  values
CAMEO_INTL_2015 43  values
D19_LETZTER_KAUF_BRANCHE 35  values
EINGEFUEGT_AM 1599  values
OST_WEST_KZ 2  values


In [7]:
train_data = train_data.drop(columns=object_cols + ['LNR'])
test_data = test_data.drop(columns=object_cols + ['LNR'])

### Replace Missing-Value Codes with NaN

In [8]:
missing_val = pd.read_csv("data/missing_value.tsv", sep="\t")
missing_val['unknown_value'] = missing_val['unknown_value'].apply(lambda x: [int(i) for i in x.split(",")])
missing_val.head()

| | column_name | unknown_value |
|---|---|---|
| 0 | AGER_TYP | [-1] |
| 1 | ALTERSKATEGORIE_GROB | [-1, 0] |
| 2 | ANREDE_KZ | [-1, 0] |
| 3 | BALLRAUM | [-1] |
| 4 | BIP_FLAG | [-1] |


In [9]:
# Replace the documented "unknown" codes with np.nan
for idx, row in missing_val.iterrows():
    col_name = row.column_name
    unk_val = row.unknown_value

    # Skip columns that are not present in the data (for example, dropped above)
    if col_name not in train_data.columns:
        continue

    train_data[col_name] = train_data[col_name].apply(lambda x: np.nan if x in unk_val else x)
    test_data[col_name] = test_data[col_name].apply(lambda x: np.nan if x in unk_val else x)
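
The two cells above parse each `unknown_value` string into a list of integer codes and then swap those codes for NaN. The same logic on a single toy column, using only the standard library, might look like the sketch below. The sample values and both helper names are made up for illustration; only the `-1,0` code format comes from the table above.

```python
import math

def parse_unknown_codes(raw: str) -> list:
    """Parse a comma-separated code string such as '-1,0' into [-1, 0]."""
    return [int(part) for part in raw.split(",")]

def mask_unknown(values, unknown_codes):
    """Replace every value that is a listed unknown code with NaN."""
    codes = set(unknown_codes)
    return [math.nan if v in codes else v for v in values]

codes = parse_unknown_codes("-1,0")
cleaned = mask_unknown([3, -1, 0, 2], codes)  # toy column values
print(codes, cleaned)
```

Values that are not listed as unknown codes pass through unchanged, which matches the `else x` branch of the notebook's lambda.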

## 2. Split the Training Data

In [10]:
X_pd = train_data.drop(columns=['RESPONSE'])
y_pd = train_data['RESPONSE']

In [11]:
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X_pd, y_pd, test_size=0.33)
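
With only about 1.2% positives, a purely random split can leave the validation set with a noticeably different share of responders than the training set. `train_test_split` accepts `stratify=` (and `random_state=` for reproducibility) to keep the class ratio the same in both parts. The notebook does not use either, so treat this as a possible refinement rather than what was run. The stdlib sketch below only illustrates what stratification does on made-up labels; it is not scikit-learn's implementation, and the function name is hypothetical.

```python
import random

def stratified_split_indices(labels, test_fraction, seed=0):
    """Split row indices so each class sends ~test_fraction of its rows to the test part."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [1] * 12 + [0] * 988  # made-up data with 1.2% positives
train_idx, test_idx = stratified_split_indices(toy_labels, test_fraction=0.33)
test_positives = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```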

In [12]:
X_test = test_data

## 3. Upload Data Files to S3

#### Save Data to Local Directory

In [13]:
data_dir = './data/for_upload'

os.makedirs(data_dir, exist_ok=True)

In [14]:
X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

pd.concat([y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'val.csv'), header=False, index=False)

X_val.to_csv(os.path.join(data_dir, 'val_for_aoc.csv'), header=False, index=False)
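
The training and validation files are written without a header or index and with the label as the first column, because that is the CSV layout the built-in SageMaker XGBoost algorithm expects for training input. A stdlib-only sketch of that layout on made-up rows (the feature values and the helper name are hypothetical):

```python
import csv
import io

def rows_to_training_csv(labels, feature_rows):
    """Serialize rows as header-less CSV with the label in the first column."""
    if len(labels) != len(feature_rows):
        raise ValueError("labels and feature_rows must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, features in zip(labels, feature_rows):
        writer.writerow([label, *features])
    return buffer.getvalue()

toy_csv = rows_to_training_csv(labels=[0, 1], feature_rows=[[2.0, 5], [1.0, 3]])
print(toy_csv)
```

The files used for batch transform (`test.csv` and `val_for_aoc.csv`) carry features only, which is why they are written from `X_test` and `X_val` without the label column.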

#### Upload Files to S3

In [15]:
prefix = 'customer_response_prediction'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'val.csv'), key_prefix=prefix)

val_for_aoc_location = session.upload_data(os.path.join(data_dir, 'val_for_aoc.csv'), key_prefix=prefix)

## 4. Train XGBoost Model

In [16]:
container = get_image_uri(session.boto_region_name, 'xgboost', '1.0-1')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


In [17]:
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


In [18]:
xgb.set_hyperparameters(max_depth=3,
                        eta=0.05,
                        gamma=5.5,
                        min_child_weight=3,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=20,
                        num_round=200,
                        scale_pos_weight=80)
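
The `scale_pos_weight=80` setting is close to the negative-to-positive ratio implied by the `RESPONSE` counts printed earlier (42430 zeros, 532 ones); the XGBoost documentation suggests that ratio as a typical value to consider for imbalanced data. A minimal sketch of the arithmetic, using the counts from the full training file as an approximation for the 67% training split (the function name is hypothetical):

```python
def imbalance_summary(n_negative: int, n_positive: int) -> dict:
    """Return the positive rate and the negative/positive ratio."""
    total = n_negative + n_positive
    return {
        "positive_rate": n_positive / total,
        "neg_pos_ratio": n_negative / n_positive,
    }

# Counts taken from train_data.RESPONSE.value_counts() earlier in the notebook.
summary = imbalance_summary(n_negative=42430, n_positive=532)
print(f"positive rate: {summary['positive_rate']:.4f}")
print(f"negatives per positive: {summary['neg_pos_ratio']:.1f}")
```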

In [None]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_val = sagemaker.s3_input(s3_data=val_location, content_type='csv')

In [95]:
xgb.fit({'train': s3_input_train, 'validation': s3_input_val})

2021-06-29 14:53:12 Starting - Starting the training job...
2021-06-29 14:53:14 Starting - Launching requested ML instances......
2021-06-29 14:54:26 Starting - Preparing the instances for training......
2021-06-29 14:55:42 Downloading - Downloading input data...
2021-06-29 14:56:05 Training - Downloading the training image.....
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[14:56:51] 28784x359 matrix with 10333456 entries loaded from /opt/ml/inpu

## 5. Create XGBoost Transformer

In [96]:
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


## 6. Model Inference on Validation Data for ROC AUC Calculation

In [108]:
xgb_transformer.transform(val_for_aoc_location, content_type='text/csv', split_type='Line')

In [110]:
xgb_transformer.wait()

[2021-06-29:15:47:52:INFO] No GPUs detected (normal if no gpus installed)
[2021-06-29:15:47:52:INFO] No GPUs detected (normal if no gpus installed)
[2021-06-29:15:47:52:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {

In [113]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-us-east-1-632144194871/sagemaker-xgboost-2021-06-29-15-42-27-790/val_for_aoc.csv.out to data/for_upload/val_for_aoc.csv.out


In [114]:
y_val_test = pd.read_csv(os.path.join(data_dir, 'val_for_aoc.csv.out'), header=None)

#### Calculate ROC AUC Score

In [116]:
from sklearn.metrics import roc_auc_score

In [120]:
"Current ROC Score: ", roc_auc_score(y_val, y_val_test)

('Current ROC Score: ', 0.6456231976992358)
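
The ROC AUC reported above can be read as the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder, so 0.5 corresponds to chance level. The sketch below computes that quantity directly from pairs on made-up data. It is an illustration of the definition, not how `roc_auc_score` is implemented, and its pairwise loop is only practical for small inputs.

```python
from itertools import product

def roc_auc_pairwise(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]            # made-up labels
toy_scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities
toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
print(toy_auc)
```

In the toy data, three of the four positive/negative pairs are ordered correctly, so the result is 0.75.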

## 7. Model Hyperparameter Tuning

In [19]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

In [21]:
xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb,
                                               objective_metric_name = 'validation:auc', 
                                               objective_type = 'Maximize', 
                                               max_jobs = 20,
                                               max_parallel_jobs = 3, 
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(3, 12),
                                                    'eta'      : ContinuousParameter(0.05, 0.5),
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'subsample': ContinuousParameter(0.5, 0.9),
                                                    'gamma': ContinuousParameter(0, 10),
                                               },
                                               strategy='Bayesian')
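
To get a feel for the search space defined above, the sketch below draws candidates from the same ranges using only the standard library. This is plain uniform random sampling, not the Bayesian search the tuner performs, and the constant and function names are hypothetical.

```python
import random

# Ranges copied from the HyperparameterTuner definition above.
INTEGER_RANGES = {"max_depth": (3, 12), "min_child_weight": (2, 8)}
CONTINUOUS_RANGES = {"eta": (0.05, 0.5), "subsample": (0.5, 0.9), "gamma": (0.0, 10.0)}

def sample_candidate(rng: random.Random) -> dict:
    """Draw one hyperparameter combination uniformly from the ranges."""
    candidate = {name: rng.randint(lo, hi) for name, (lo, hi) in INTEGER_RANGES.items()}
    candidate.update(
        {name: rng.uniform(lo, hi) for name, (lo, hi) in CONTINUOUS_RANGES.items()}
    )
    return candidate

rng = random.Random(42)
candidates = [sample_candidate(rng) for _ in range(20)]  # 20 mirrors max_jobs
print(candidates[0])
```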

In [22]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [23]:
xgb_hyperparameter_tuner.wait()

...........................................................................................................................................................................................................................................................................................................................................................................................!


## 8. Use Best Tuned Model for Batch Prediction

In [24]:
best_xgb = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


2021-06-30 01:33:27 Starting - Preparing the instances for training
2021-06-30 01:33:27 Downloading - Downloading input data
2021-06-30 01:33:27 Training - Training image download completed. Training in progress.
2021-06-30 01:33:27 Uploading - Uploading generated training model
2021-06-30 01:33:27 Completed - Training job completed
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter _tuning_objective_metric value validation:auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of

In [25]:
xgb_transformer = best_xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [26]:
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

In [27]:
xgb_transformer.wait()

................................
[2021-06-30:01:50:49:INFO] No GPUs detected (normal if no gpus installed)
[2021-06-30:01:50:49:INFO] No GPUs detected (normal if no gpus installed)
[2021-06-30:01:50:49:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }

 

In [28]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-us-east-1-632144194871/sagemaker-xgboost-210630-0106-017-24089-2021-06-30-01-45-40-851/test.csv.out to data/for_upload/test.csv.out


In [29]:
y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)

In [30]:
y_pred.head()

| | 0 |
|---|---|
| 0 | 0.687111 |
| 1 | 0.682062 |
| 2 | 0.211141 |
| 3 | 0.265858 |
| 4 | 0.249983 |


In [31]:
y_data_LNR = pd.read_csv('data/Udacity_MAILOUT_052018_TEST.csv', sep=';')[['LNR']]


In [32]:
os.makedirs('data/prediction', exist_ok=True)  # make sure the output directory exists
pd.concat([y_data_LNR, y_pred.rename(columns={0: 'RESPONSE'})], axis=1).to_csv('data/prediction/20210629_Y_PRED.csv', index=False)
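
The final cell pairs each `LNR` identifier with its predicted probability by row position, so it relies on the prediction file having the same number of rows, in the same order, as the test file. A stdlib sketch of the same pairing with an explicit length check (the ids and scores are made up, and the helper name is hypothetical):

```python
import csv
import io

def build_submission(ids, scores):
    """Return CSV text with an LNR,RESPONSE header, pairing ids and scores by position."""
    if len(ids) != len(scores):
        raise ValueError(f"row count mismatch: {len(ids)} ids vs {len(scores)} scores")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["LNR", "RESPONSE"])
    writer.writerows(zip(ids, scores))
    return buffer.getvalue()

submission = build_submission(ids=[101, 102], scores=[0.25, 0.75])
print(submission)
```

A similar check in the notebook, such as comparing `len(y_data_LNR)` with `len(y_pred)` before the `pd.concat`, would catch a truncated or partial transform output early.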