## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [90]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

import utils as u
from time import time

# sklearn
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import GridSearchCV

## Preprocessing train data

In [2]:
mailout_train = pd.read_csv('Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
mailout_train.drop(columns='Unnamed: 0', inplace=True)




In [3]:
# Also the attribute file will come in handy for handling missing or unknown values
dictionary = pd.read_excel('DIAS Attributes - Values 2017_revised.xlsx', sheet_name='Tabelle1')
dictionary.drop(columns='Unnamed: 0', inplace=True)

In [4]:
column_dist = mailout_train.isnull().sum()
column_dist.sort_values(ascending=False, inplace=True)
outlier_columns = column_dist.index[:8]
print(column_dist[:30]/mailout_train.shape[0])

ALTER_KIND4       0.999046
ALTER_KIND3       0.995950
ALTER_KIND2       0.982403
ALTER_KIND1       0.953727
KK_KUNDENTYP      0.589265
EXTSEL992         0.371212
HH_DELTA_FLAG     0.225269
W_KEIT_KIND_HH    0.225269
KBA05_KW2         0.201294
MOBI_REGIO        0.201294
KBA05_KW3         0.201294
KBA05_MAXAH       0.201294
KBA05_MAXBJ       0.201294
KBA05_MAXHERST    0.201294
KBA05_MOTRAD      0.201294
KBA05_MAXVORB     0.201294
KBA05_MOD1        0.201294
KBA05_MOD2        0.201294
KBA05_MOD3        0.201294
KBA05_MOD4        0.201294
KBA05_MOD8        0.201294
KBA05_MOTOR       0.201294
KBA05_KW1         0.201294
KBA05_MAXSEG      0.201294
KBA05_ZUL2        0.201294
KBA05_SEG1        0.201294
KBA05_SEG7        0.201294
KBA05_ZUL1        0.201294
KBA05_VORB2       0.201294
KBA05_VORB1       0.201294
dtype: float64
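
The per-column fractions printed above can also be obtained in one step with `isnull().mean()`. The following is a minimal sketch on a toy frame; the column names, values, and the 0.5 cut-off are assumptions for illustration and are not taken from the MAILOUT data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for mailout_train (values are made up for illustration)
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],
    "B": [1.0, 2.0, np.nan, 4.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, highest first
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical threshold: flag columns that are more than half empty
sparse_columns = missing_frac[missing_frac > 0.5].index.tolist()

print(missing_frac)
print(sparse_columns)
```

Selecting by threshold rather than by a fixed count (`index[:8]`) is one possible design choice; it keeps working if the number of sparse columns changes.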


In [5]:
cat_col = u.obtain_categorical_columns(mailout_train)
cat_col

{'multi': ['LP_FAMILIE_FEIN',
  'LP_FAMILIE_GROB',
  'LP_STATUS_FEIN',
  'LP_STATUS_GROB',
  'NATIONALITAET_KZ',
  'SHOPPER_TYP',
  'TITEL_KZ',
  'VERS_TYP',
  'CJT_GESAMTTYP',
  'CAMEO_DEUG_2015',
  'FINANZTYP',
  'GEBAEUDETYP',
  'GFK_URLAUBERTYP',
  'ZABEOTYP'],
 'binary': ['OST_WEST_KZ', 'ANREDE_KZ', 'GREEN_AVANTGARDE']}

In [6]:
unknown_dict = u.create_unknown_dictionary(dictionary)

Shape of the attribute file: (2258, 4)
Missing values in the array: 11
After filling NaNs with False: 
False    2025
True      233
Name: Meaning, dtype: int64


In [7]:
# Cleaning
mailout_train, mailout_high_nas = u.clean_mailout(mailout_train, unknown_dict, outlier_columns, cat_col)

Splitting records with NAs..
 Total records: 42962
Records split by 89 missing values.
 Shape of resulting dataset: (34991,)
 Shape of high NAs dataset: (7971,)





In [71]:
# Creating X_train and Y_train
X_train = mailout_train.drop(columns=['RESPONSE', 'LNR'])
y_train = mailout_train.RESPONSE

X_train.head()

Unnamed: 0,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,...,ZABEOTYP_2,ZABEOTYP_3,ZABEOTYP_4,ZABEOTYP_5,ZABEOTYP_6,PRAEGENDE_JUGENDJAHRE_MOVEMENT,PRAEGENDE_JUGENDJAHRE_DECADE,CAMEO_INTL_2015_WEALTH,CAMEO_INTL_2015_LIFESTAGE,EINGEFUEGT_AM_YEAR
0,2.0,1.0,8.0,8.0,15.0,0.0,0.0,1.0,13.0,0.0,...,0,1,0,0,0,0.0,40.0,3.0,4.0,1992.0
1,1.0,4.0,13.0,13.0,1.0,0.0,0.0,2.0,1.0,0.0,...,0,0,0,0,0,0.0,70.0,3.0,2.0,1997.0
2,1.0,1.0,9.0,7.0,0.0,,0.0,0.0,1.0,0.0,...,0,1,0,0,0,1.0,40.0,1.0,4.0,1995.0
3,2.0,1.0,6.0,6.0,4.0,0.0,0.0,2.0,4.0,0.0,...,0,1,0,0,0,1.0,40.0,1.0,4.0,1992.0
4,2.0,1.0,9.0,9.0,53.0,0.0,0.0,1.0,44.0,0.0,...,0,1,0,0,0,0.0,50.0,4.0,1.0,1992.0


In [72]:
X_train.describe(include=['object'])

Unnamed: 0,D19_LETZTER_KAUF_BRANCHE
count,34412
unique,35
top,D19_UNBEKANNT
freq,9986


In [73]:
# The dictionary file has no entry for the 'D19_LETZTER_KAUF_BRANCHE' column, so we drop it
# (some of its values already exist as columns in the dataset)
X_train.drop(columns=['D19_LETZTER_KAUF_BRANCHE'], inplace=True)

In [None]:
# I will train three classifiers: RandomForest, AdaBoost and SGD

### Random Forest Classifier

In [116]:
transformer = Pipeline([
    ('imp', Imputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler())
])
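
`Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22. On newer versions, the same median-imputation-plus-scaling step can be sketched with `sklearn.impute.SimpleImputer`. The name `transformer_new` and the toy array are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the transformer above, using the newer imputer class
transformer_new = Pipeline([
    ("imp", SimpleImputer(missing_values=np.nan, strategy="median")),
    ("scaler", StandardScaler()),
])

# Toy data with one missing value per column (made up for illustration)
toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 30.0]])
scaled = transformer_new.fit_transform(toy)

# After imputation and scaling, each column has mean 0 and unit variance
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```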

In [111]:
# Let's create a RF Pipeline first
pipeline_rf = Pipeline([
        ('transf', transformer),
        ('clf', RandomForestClassifier())
    ])
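
When one pipeline is nested inside another, parameter names chain the step names with double underscores, which matters for the keys passed to `GridSearchCV`. A small sketch of this naming, assuming a newer scikit-learn where `SimpleImputer` replaces `Imputer`; the variables `inner` and `outer` are illustrative names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Inner preprocessing pipeline, wrapped as a single step of the outer one
inner = Pipeline([
    ("imp", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
outer = Pipeline([
    ("transf", inner),
    ("clf", RandomForestClassifier()),
])

# Parameter names are built as <step>__<substep>__<parameter>
param_names = sorted(outer.get_params().keys())
print([p for p in param_names if p.startswith("transf__imp__")])
print("clf__n_estimators" in param_names)
```

With this layout the classifier is still addressed as `clf__...`, while the imputer strategy would be tuned through `transf__imp__strategy`.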

In [75]:
# Parameters to consider
pipeline_rf.get_params()

{'memory': None,
 'steps': [('imp',
   Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0)),
  ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('clf',
   RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None, verbose=0,
               warm_start=False))],
 'imp': Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0),
 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_i

In [80]:
parameters_rf = {
    'clf__n_estimators': [10,30,50,100],
    'clf__min_samples_split': [3,4,7,10],
}

# Use GridSearchCV to find the best set of parameters, scoring with ROC AUC
# because the classes are imbalanced
cv_rf = GridSearchCV(estimator=pipeline_rf,param_grid=parameters_rf, scoring='roc_auc',
                     n_jobs=-1, verbose=2)
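
To illustrate why ROC AUC is preferred over accuracy here: with a rare positive class, always predicting the majority class yields high accuracy even though the scores carry no ranking information. A minimal sketch on synthetic labels; the 1% positive rate is an assumption for illustration, not the measured rate in the MAILOUT data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 10 positives out of 1000 (assumed rate, for illustration)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts the majority class
always_negative = np.zeros(1000, dtype=int)
accuracy = accuracy_score(y_true, always_negative)

# The same constant output used as a score cannot rank positives above negatives
auc_constant = roc_auc_score(y_true, np.zeros(1000))

print(accuracy)      # high, despite finding no customers
print(auc_constant)  # chance level
```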

In [81]:
# Training Random Forest algorithm
t0 = time()
cv_rf.fit(X_train,y_train)
print("done in %0.3fs" % (time() - t0))

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] clf__min_samples_split=3, clf__n_estimators=10 ..................
[CV] clf__min_samples_split=3, clf__n_estimators=10 ..................
[CV] clf__min_samples_split=3, clf__n_estimators=10 ..................
[CV] clf__min_samples_split=3, clf__n_estimators=30 ..................
[CV] ... clf__min_samples_split=3, clf__n_estimators=10, total=   2.8s
[CV] clf__min_samples_split=3, clf__n_estimators=30 ..................
[CV] ... clf__min_samples_split=3, clf__n_estimators=10, total=   3.1s
[CV] clf__min_samples_split=3, clf__n_estimators=30 ..................
[CV] ... clf__min_samples_split=3, clf__n_estimators=10, total=   3.1s
[CV] clf__min_samples_split=3, clf__n_estimators=50 ..................
[CV] ... clf__min_samples_split=3, clf__n_estimators=30, total=   4.8s
[CV] clf__min_samples_split=3, clf__n_estimators=50 ..................
[CV] ... clf__min_samples_split=3, clf__n_estimators=30, total=   4.8s
[CV] ... clf__mi

[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   52.7s


[CV] .. clf__min_samples_split=10, clf__n_estimators=10, total=   2.6s
[CV] clf__min_samples_split=10, clf__n_estimators=10 .................
[CV] .. clf__min_samples_split=7, clf__n_estimators=100, total=   9.0s
[CV] clf__min_samples_split=10, clf__n_estimators=10 .................
[CV] .. clf__min_samples_split=10, clf__n_estimators=10, total=   2.6s
[CV] clf__min_samples_split=10, clf__n_estimators=30 .................
[CV] .. clf__min_samples_split=10, clf__n_estimators=10, total=   2.7s
[CV] clf__min_samples_split=10, clf__n_estimators=30 .................
[CV] .. clf__min_samples_split=7, clf__n_estimators=100, total=   9.1s
[CV] clf__min_samples_split=10, clf__n_estimators=30 .................
[CV] .. clf__min_samples_split=7, clf__n_estimators=100, total=   9.4s
[CV] clf__min_samples_split=10, clf__n_estimators=50 .................
[CV] .. clf__min_samples_split=10, clf__n_estimators=30, total=   4.6s
[CV] clf__min_samples_split=10, clf__n_estimators=50 .................
[CV] .

[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  1.3min finished


done in 88.765s


In [82]:
print("\nBest Score: ", cv_rf.best_score_)
print("\nBest Parameters: ", cv_rf.best_params_)
print("\nBest Estimator: ", cv_rf.best_estimator_)


Best Score:  0.6371316457523607

Best Parameters:  {'clf__min_samples_split': 10, 'clf__n_estimators': 100}

Best Estimator:  Pipeline(memory=None,
     steps=[('imp', Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])
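
Beyond the single best score, the fitted search object keeps the per-candidate results in `cv_results_`, which helps judge how sensitive the score is to each parameter. A minimal sketch on synthetic data; the dataset, the reduced grid, and the names `search` and `results` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small imbalanced synthetic dataset standing in for X_train / y_train
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30], "min_samples_split": [3, 10]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `std_test_score` with the gaps between candidates gives a rough sense of whether the best parameters are meaningfully better or within fold-to-fold noise.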



### AdaBoost Classifier

In [83]:
# Building pipeline for AdaBoost
pipeline_ada= Pipeline([
        ('transf', transformer),
        ('clf', AdaBoostClassifier(base_estimator=DecisionTreeClassifier()))
    ])
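
Newer scikit-learn releases renamed the `base_estimator` argument of `AdaBoostClassifier` to `estimator`, which also changes the grid-search key for the tree depth. The sketch below picks whichever name the installed version accepts; the variable names are illustrative assumptions.

```python
import inspect

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Detect which keyword the installed scikit-learn version accepts
init_params = inspect.signature(AdaBoostClassifier.__init__).parameters
weak_learner_arg = "estimator" if "estimator" in init_params else "base_estimator"

# Build the classifier with a depth-1 tree as the weak learner
clf = AdaBoostClassifier(**{weak_learner_arg: DecisionTreeClassifier(max_depth=1)})

# Key to use inside a pipeline grid would be 'clf__' + depth_key
depth_key = f"{weak_learner_arg}__max_depth"
print(depth_key, depth_key in clf.get_params())
```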


In [84]:
pipeline_ada.get_params

<bound method Pipeline.get_params of Pipeline(memory=None,
     steps=[('imp', Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max...ne,
            splitter='best'),
          learning_rate=1.0, n_estimators=50, random_state=None))])>

In [88]:
parameters_ada = {
    'clf__base_estimator__max_depth': [1,3,5,7],
    'clf__n_estimators': [50,70,100],
    'clf__random_state': [42],
    'clf__learning_rate': [0.01, 0.001]
}
cv_ada = GridSearchCV(estimator=pipeline_ada,param_grid=parameters_ada, scoring='roc_auc', 
                      n_jobs=-1, verbose=2)

In [89]:
# Training AdaBoost
t0 = time()
cv_ada.fit(X_train,y_train)
print("done in %0.3fs" % (time() - t0))

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42 
[CV] clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42 
[CV] clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42 
[CV] clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=70, clf__random_state=42 
[CV]  clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42, total=  12.3s
[CV]  clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42, total=  12.5s
[CV] clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=70, clf__random_state=42 
[CV] clf__base_estimator__max_depth=1, clf__learning_rate=0.01, clf__n_estimators=70, clf__random_state=42 
[CV]  clf__base_estimator__max_depth=1, clf__

[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  3.3min


[CV]  clf__base_estimator__max_depth=3, clf__learning_rate=0.001, clf__n_estimators=100, clf__random_state=42, total=  43.2s
[CV] clf__base_estimator__max_depth=5, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42 
[CV]  clf__base_estimator__max_depth=3, clf__learning_rate=0.001, clf__n_estimators=100, clf__random_state=42, total=  43.4s
[CV] clf__base_estimator__max_depth=5, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42 
[CV]  clf__base_estimator__max_depth=5, clf__learning_rate=0.01, clf__n_estimators=50, clf__random_state=42, total=  37.3s
[CV] clf__base_estimator__max_depth=5, clf__learning_rate=0.01, clf__n_estimators=70, clf__random_state=42 
[CV]  clf__base_estimator__max_depth=3, clf__learning_rate=0.001, clf__n_estimators=100, clf__random_state=42, total=  43.8s
[CV] clf__base_estimator__max_depth=5, clf__learning_rate=0.01, clf__n_estimators=70, clf__random_state=42 
[CV]  clf__base_estimator__max_depth=5, clf__learning_rate=0.01, clf__

[CV]  clf__base_estimator__max_depth=7, clf__learning_rate=0.001, clf__n_estimators=100, clf__random_state=42, total= 1.7min
[CV]  clf__base_estimator__max_depth=7, clf__learning_rate=0.001, clf__n_estimators=100, clf__random_state=42, total= 1.7min
[CV]  clf__base_estimator__max_depth=7, clf__learning_rate=0.001, clf__n_estimators=100, clf__random_state=42, total= 1.6min


[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed: 13.7min finished


done in 918.798s


In [91]:
print("\nBest Score: ", cv_ada.best_score_)
print("\nBest Parameters: ", cv_ada.best_params_)
print("\nBest Estimator: ", cv_ada.best_estimator_)


Best Score:  0.7728198328124221

Best Parameters:  {'clf__base_estimator__max_depth': 7, 'clf__learning_rate': 0.01, 'clf__n_estimators': 70, 'clf__random_state': 42}

Best Estimator:  Pipeline(memory=None,
     steps=[('imp', Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max...one,
            splitter='best'),
          learning_rate=0.01, n_estimators=70, random_state=42))])


### SGD Classifier

In [92]:
# Building pipeline for SGD
pipeline_sgd= Pipeline([
    ('transf', transformer),
    ('clf', SGDClassifier())
    ])

In [93]:
pipeline_sgd.get_params

<bound method Pipeline.get_params of Pipeline(memory=None,
     steps=[('imp', Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))])>

In [103]:
parameters_sgd = {
    'clf__loss': ['hinge','perceptron', 'squared_hinge'],
    'clf__penalty': ['l1'],
    'clf__max_iter': [1500],
    'clf__tol': [None, 0.001]
}
cv_sgd = GridSearchCV(estimator=pipeline_sgd,param_grid=parameters_sgd, scoring='roc_auc', 
                      n_jobs=-1, verbose=2)
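
The losses in this grid (`hinge`, `perceptron`, `squared_hinge`) do not provide `predict_proba`, so ROC AUC has to be computed from the raw `decision_function` scores instead. A minimal sketch on synthetic data; the dataset and the `max_iter`/`tol` values are assumptions chosen to keep the example fast, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small imbalanced synthetic dataset standing in for X_train / y_train
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", penalty="l1", max_iter=1000,
                  tol=1e-3, random_state=0),
)
model.fit(X, y)

# Hinge loss gives a margin-based model without probability estimates
has_proba = hasattr(model.steps[-1][1], "predict_proba")

# Signed distances to the decision boundary are enough to rank samples
scores = model.decision_function(X)
auc = roc_auc_score(y, scores)
print(has_proba, auc)
```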

In [104]:
# Training SGD
t0 = time()
cv_sgd.fit(X_train,y_train)
print("done in %0.3fs" % (time() - t0))

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=None 
[CV] clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=None 
[CV] clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=None 
[CV] clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001 
[CV]  clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001, total=  23.1s
[CV] clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001 
[CV]  clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001, total=  21.0s
[CV] clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001 
[CV]  clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001, total=  23.0s
[CV] clf__loss=perceptron, clf__max_iter=1500, clf__penalty=l1, clf__tol=None 
[CV]  clf__loss=hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=None, total= 1.3min
[CV] clf__loss=perceptron, clf__max_iter=1500, clf__pen



[CV]  clf__loss=squared_hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001, total= 1.4min
[CV]  clf__loss=squared_hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=None, total= 1.4min
[CV]  clf__loss=squared_hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001, total=  22.5s
[CV]  clf__loss=squared_hinge, clf__max_iter=1500, clf__penalty=l1, clf__tol=0.001, total=  36.9s


[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  4.5min finished


done in 366.025s


In [105]:
print("\nBest Score: ", cv_sgd.best_score_)
print("\nBest Parameters: ", cv_sgd.best_params_)
print("\nBest Estimator: ", cv_sgd.best_estimator_)


Best Score:  0.705273802129917

Best Parameters:  {'clf__loss': 'hinge', 'clf__max_iter': 1500, 'clf__penalty': 'l1', 'clf__tol': None}

Best Estimator:  Pipeline(memory=None,
     steps=[('imp', Imputer(axis=0, copy=True, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1500, n_iter=None,
       n_jobs=1, penalty='l1', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))])


In [110]:
# Saving all three models
u.save_model(cv_rf.best_estimator_, 'rf')
u.save_model(cv_ada.best_estimator_, 'ada')
u.save_model(cv_sgd.best_estimator_, 'sgd')
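
The `u.save_model` helper lives in the `utils` module and is not shown in this notebook. A plausible minimal version based on `pickle` might look like the sketch below; `save_model_sketch`, `load_model_sketch`, the `.pkl` naming, and the dummy model are all hypothetical stand-ins, not the actual helper.

```python
import os
import pickle
import tempfile

from sklearn.dummy import DummyClassifier


def save_model_sketch(model, name, directory):
    """Hypothetical stand-in for u.save_model: pickle `model` to <directory>/<name>.pkl."""
    path = os.path.join(directory, f"{name}.pkl")
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model_sketch(path):
    """Load a model previously written by save_model_sketch."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a trivial fitted model in a temporary directory
model = DummyClassifier(strategy="prior").fit([[0], [1], [2]], [0, 0, 1])
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model_sketch(model, "dummy", tmp)
    restored = load_model_sketch(saved_path)

print(restored.predict([[5]]))
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, so recording the version alongside the file is worth considering.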