## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each row in the "MAILOUT" data files represents an individual who was targeted by a mailout campaign. Ideally, we should be able to use each individual's demographic information to decide whether it is worth including that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
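
One way to verify a model on the "TRAIN" partition is to hold out some of the labelled rows and score the predictions made for them. The sketch below assumes scikit-learn is available and runs on synthetic stand-in data. The logistic regression model, the 80/20 split, and ROC AUC as the metric are illustrative assumptions, not choices taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and the RESPONSE column.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
# Make positives fairly rare and loosely tied to the first feature.
y = (X[:, 0] + rng.normal(scale=1.0, size=2000) > 2.0).astype(int)

# stratify=y keeps the share of positives the same in both splits.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# Score with the predicted probability of the positive class.
val_scores = model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
print(f"Validation ROC AUC: {val_auc:.3f}")
```

Stratifying the split matters most when one class is much smaller than the other, because a plain random split can leave the validation set with very few positives.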

In [7]:
# Importing libraries
import os
import boto3
import sagemaker

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from utils import read_s3_files, preprocess, predict_clusters

%matplotlib inline

In [2]:
# SageMaker session and S3 client
session = sagemaker.Session()
s3_client = boto3.client('s3')

In [3]:
train, test = read_s3_files(['mailout_train.csv', 'mailout_test.csv'], s3_client)
print("Train shape:", train.shape)
print("Test shape:", test.shape)


Train shape: (42962, 367)
Test shape: (42833, 366)


In [4]:
train_y = train['RESPONSE'].copy()
train_x = preprocess(train.drop('RESPONSE', axis=1))
test = preprocess(test)

----------------PART-1(Identifying missing data and data-types)----------------
Downcasting dataframe...
Memory usage before downcasting:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(93), object(6)
memory usage: 120.0+ MB
None
Memory usage after downcasting:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: category(3), datetime64[ns](1), float32(267), int16(1), int32(1), int8(91), object(2)
memory usage: 48.8+ MB
None
Identifying and replacing values representing unknown or missing data...
Replaced X with nan in following column(s):['CAMEO_DEUG_2015']
Replaced XX with nan in following column(s):['CAMEO_INTL_2015']
Replaced -1 with nan in following column(s):['AGER_TYP', 'ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'BALLRAUM', 'CAMEO_DEUG_2015', 'EWDICHTE', 'FINANZTYP', 'FINANZ_ANLEGER', 'FINANZ_HAUSBAUE
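
The log above reports placeholder codes such as `X`, `XX` and `-1` being replaced with NaN. The `preprocess` function lives in `utils` and is not shown here, so the following is only a minimal sketch of how such a step could look in pandas. The function name `replace_unknown_codes` and the `UNKNOWN_CODES` mapping are hypothetical, and the mapping lists only a few of the columns named in the log.

```python
import pandas as pd

# Hypothetical mapping from placeholder code to the columns it applies to,
# modelled on the log output above (only a few columns shown).
UNKNOWN_CODES = {
    "X": ["CAMEO_DEUG_2015"],
    "XX": ["CAMEO_INTL_2015"],
    -1: ["AGER_TYP", "CAMEO_DEUG_2015"],
}


def replace_unknown_codes(df, codes=UNKNOWN_CODES):
    """Return a copy of df with placeholder codes replaced by NaN."""
    out = df.copy()
    for code, columns in codes.items():
        # Skip columns that are absent from this particular dataframe.
        present = [col for col in columns if col in out.columns]
        for col in present:
            # mask() sets entries to NaN wherever the condition is True.
            out[col] = out[col].mask(out[col].isin([code]))
        print(f"Replaced {code} with nan in following column(s):{present}")
    return out


demo = pd.DataFrame({
    "CAMEO_DEUG_2015": ["4", "X", "7"],
    "CAMEO_INTL_2015": ["XX", "24", "51"],
    "AGER_TYP": [-1, 2, 1],
})
cleaned = replace_unknown_codes(demo)
print(cleaned.isna().sum())
```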

In [5]:
train_x.shape

(42962, 388)

In [8]:
train_cluster = predict_clusters(train_x)
test_cluster = predict_clusters(test)

Performing Dimensionality Reduction using PCA...
Shape before PCA: (42962, 387)
Shape before PCA: (42962, 180)
Finding clusters in data...
Performing Dimensionality Reduction using PCA...
Shape before PCA: (42833, 387)
Shape before PCA: (42833, 180)
Finding clusters in data...
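
`predict_clusters` is imported from `utils` and its body is not shown. Judging by the log, it reduces the features with PCA and then assigns each row to a cluster. A rough sketch of that pattern with scikit-learn follows. It assumes standard scaling, PCA and k-means, all fitted once on reference data and then reused on new rows. The data is synthetic, and the component and cluster counts are placeholders rather than the values used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins: reference data to fit on, mailout rows to transform.
population = rng.normal(size=(500, 20))
mailout = rng.normal(size=(100, 20))

# Fit scaling, PCA and k-means once on the reference data.
reducer = make_pipeline(StandardScaler(), PCA(n_components=5, random_state=0))
population_pcs = reducer.fit_transform(population)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(population_pcs)

# Reuse the fitted objects on new rows instead of refitting them.
print("Shape before PCA:", mailout.shape)
mailout_pcs = reducer.transform(mailout)
print("Shape after PCA:", mailout_pcs.shape)

# One column per principal component plus the assigned cluster label.
result = pd.DataFrame(mailout_pcs)
result["CLUSTER"] = kmeans.predict(mailout_pcs)
print(result.head())
```

Reusing the fitted transformer and cluster model keeps cluster labels comparable between datasets. Refitting on each dataset would produce clusters that cannot be matched to one another.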


In [9]:
train_cluster

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,172,173,174,175,176,177,178,179,CLUSTER,LNR
0,-7.182226,0.187454,4.609773,5.837287,-0.496500,1.013342,-4.267402,1.315556,0.390262,-0.196290,...,0.402967,0.237453,-0.810422,-0.075919,-0.054217,-0.340420,-0.680172,1.132971,3,1763.0
1,7.725657,1.041986,-3.314461,-2.836252,-2.156061,0.516340,-3.764678,1.419750,2.031265,2.047351,...,1.221896,1.241672,0.141727,-0.292178,-0.541942,-0.778153,-0.182689,-0.370895,4,1771.0
2,1.106763,-0.302178,-2.878360,4.657115,5.467867,2.277680,3.233390,2.110274,-0.053255,-1.088111,...,0.734011,-0.626888,1.371622,0.586329,-0.803731,0.198041,0.020245,-1.409442,1,1776.0
3,-0.249254,-8.177169,11.141335,2.645266,0.534873,-2.097282,2.449938,-3.461524,1.025434,-1.240244,...,-0.268486,-0.335606,0.966945,1.043623,1.746834,0.308505,2.001338,0.206794,2,1460.0
4,-1.833353,-2.427218,6.122892,3.509244,5.905100,1.822975,0.423698,3.453430,-1.055653,1.250149,...,-1.039643,0.516221,0.360774,-1.056338,-0.681541,-0.742959,-0.065323,0.262950,2,1783.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42957,8.215550,-0.376386,6.827886,1.202471,-2.094871,1.628754,1.061350,1.421027,-0.473854,-2.721176,...,-1.809185,-0.941040,1.078960,-0.124614,0.677304,0.073045,0.715979,0.059666,4,66338.0
42958,6.438153,2.738240,-6.053588,-0.373863,4.829053,2.066686,0.044657,2.150053,0.252102,-2.837602,...,0.266430,-1.280734,1.468274,0.673154,0.003213,0.480690,-0.445532,-1.098566,4,67629.0
42959,7.486320,-3.193941,-2.265814,0.232257,-0.030057,2.947499,3.148342,0.314993,2.465614,1.588148,...,1.004895,0.129926,0.445571,-1.466482,2.360812,0.972449,0.499622,0.298071,2,68273.0
42960,3.751761,8.577937,1.755504,1.563673,-5.459938,-3.677296,1.949818,-3.803554,-0.779987,3.560894,...,-1.335688,0.268301,-0.012517,-0.467964,-0.430864,0.695971,-0.127211,0.262410,4,68581.0


In [12]:
train_cluster['RESPONSE'] = train_y
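
Assigning a Series as a new column aligns on index labels, not on row position. The assignment above therefore attaches the right label to each row only if `train_cluster` still carries the row index of `train`, which is an assumption worth checking. A small sketch with made-up values shows the difference:

```python
import pandas as pd

labels = pd.Series([0, 1, 0], index=[0, 1, 2], name="RESPONSE")

# Same index as the labels: every row gets its own label.
aligned = pd.DataFrame({"CLUSTER": [3, 4, 1]}, index=[0, 1, 2])
aligned["RESPONSE"] = labels

# Different index (for example after dropping rows): labels are matched by
# index value, and rows without a matching label become NaN.
shifted = pd.DataFrame({"CLUSTER": [3, 4, 1]}, index=[1, 2, 3])
shifted["RESPONSE"] = labels

# Positional assignment ignores the index; safe only if row order is unchanged.
shifted["RESPONSE_BY_POSITION"] = labels.to_numpy()

print(aligned)
print(shifted)
```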


In [14]:
test_cluster.shape

(42833, 182)

In [None]:
data_dir = 'Data/arvato'
train_cluster.to_pickle(os.path.join(data_dir, 'train_mailout.pkl'))
test_cluster.to_pickle(os.path.join(data_dir, 'test_mailout.pkl'))