### Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# Custom cleaning functions
from utils import cleaning_functions

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

### Reading and cleaning training set

In [3]:
mailout_train = pd.read_csv('data/Udacity_MAILOUT_052018_TRAIN.csv')


Apply the cleaning steps from workbook 1 (now in `cleaning_functions.py`).

In [4]:
training_clean = cleaning_functions.clean_data(mailout_train)

Initial amount of missing values: 2217201

Reading the description of attributes table....

Missing values after including missing codes 2354411
Additional missing values: 137210

Starting the cleaning of attributes and feature engineering...


We don't need `LNR` for training; it is only needed for the test set, where it identifies each individual in the submission file.

In [5]:
training_clean.drop(['LNR'], axis = 1, inplace = True)

In [6]:
training_clean.head()

Unnamed: 0,ARBEIT,BALLRAUM,CJT_KATALOGNUTZER,CJT_TYP_1,CJT_TYP_2,CJT_TYP_3,CJT_TYP_4,CJT_TYP_5,CJT_TYP_6,D19_BANKEN_ANZ_12,...,REGIOTYP_7.0,CAMEO_DEUG_2015_1.0,CAMEO_DEUG_2015_2.0,CAMEO_DEUG_2015_3.0,CAMEO_DEUG_2015_4.0,CAMEO_DEUG_2015_5.0,CAMEO_DEUG_2015_6.0,CAMEO_DEUG_2015_7.0,CAMEO_DEUG_2015_8.0,CAMEO_DEUG_2015_9.0
0,3.0,5.0,5.0,2.0,2.0,5.0,5.0,5.0,5.0,0,...,0,0,0,0,0,1,0,0,0,0
1,2.0,5.0,2.0,2.0,2.0,4.0,3.0,5.0,4.0,1,...,0,0,0,0,0,1,0,0,0,0
2,4.0,1.0,5.0,1.0,1.0,5.0,5.0,5.0,5.0,0,...,0,0,1,0,0,0,0,0,0,0
3,4.0,2.0,5.0,2.0,2.0,5.0,5.0,5.0,4.0,0,...,0,0,1,0,0,0,0,0,0,0
4,3.0,4.0,5.0,1.0,2.0,5.0,5.0,5.0,5.0,0,...,1,0,0,0,0,0,0,1,0,0


### Scaling the training set.


In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Select the features by name so that the column labels stay aligned with the
# scaled values, wherever RESPONSE sits in the frame
features = training_clean.drop(['RESPONSE'], axis = 1)

train_scaled = scaler.fit_transform(features)

train_scaled = pd.DataFrame(train_scaled, columns = features.columns)

train_scaled.head()

Unnamed: 0,ARBEIT,BALLRAUM,CJT_KATALOGNUTZER,CJT_TYP_1,CJT_TYP_2,CJT_TYP_3,CJT_TYP_4,CJT_TYP_5,CJT_TYP_6,D19_BANKEN_ANZ_12,...,REGIOTYP_7.0,CAMEO_DEUG_2015_1.0,CAMEO_DEUG_2015_2.0,CAMEO_DEUG_2015_3.0,CAMEO_DEUG_2015_4.0,CAMEO_DEUG_2015_5.0,CAMEO_DEUG_2015_6.0,CAMEO_DEUG_2015_7.0,CAMEO_DEUG_2015_8.0,CAMEO_DEUG_2015_9.0
0,-0.220282,0.203551,0.756779,-0.39614,-0.211671,0.563449,0.580345,0.560862,0.580973,-0.224018,...,-0.286551,-0.242063,-0.350607,-0.338773,-0.368279,3.996789,-0.66823,-0.277121,-0.36333,-0.24645
1,-1.216569,0.203551,-1.329511,-0.39614,-0.211671,-0.521138,-1.309792,0.560862,-0.443101,1.978334,...,-0.286551,-0.242063,-0.350607,-0.338773,-0.368279,3.996789,-0.66823,-0.277121,-0.36333,-0.24645
2,0.776006,-1.704266,0.756779,-1.138807,-0.929186,0.563449,0.580345,0.560862,0.580973,-0.224018,...,-0.286551,-0.242063,2.852196,-0.338773,-0.368279,-0.250201,-0.66823,-0.277121,-0.36333,-0.24645
3,0.776006,-1.227312,0.756779,-0.39614,-0.211671,0.563449,0.580345,0.560862,-0.443101,-0.224018,...,-0.286551,-0.242063,2.852196,-0.338773,-0.368279,-0.250201,-0.66823,-0.277121,-0.36333,-0.24645
4,-0.220282,-0.273403,0.756779,-1.138807,-0.211671,0.563449,0.580345,0.560862,0.580973,-0.224018,...,3.489775,-0.242063,-0.350607,-0.338773,-0.368279,-0.250201,-0.66823,3.608537,-0.36333,-0.24645
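As a minimal sketch of the scaling step on a toy frame (the values and the three-column layout are made up for illustration), taking the column labels from the same frame that is passed to the scaler keeps names and values aligned even when `RESPONSE` is not the last column:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical values; the target sits in the middle on purpose.
toy = pd.DataFrame({
    "ARBEIT": [3.0, 2.0, 4.0, 4.0],
    "RESPONSE": [0, 0, 1, 0],
    "BALLRAUM": [5.0, 5.0, 1.0, 2.0],
})

# Select the features by name rather than by position.
toy_features = toy.drop(columns=["RESPONSE"])

toy_scaler = StandardScaler()
toy_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy_features),
    columns=toy_features.columns,  # labels come from the scaled frame itself
    index=toy_features.index,
)

print(toy_scaled.columns.tolist())
print(toy_scaled.round(3))
```

Each scaled column has zero mean and unit variance, and its label matches the feature it was computed from.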


## Training learning models

We will try the following models:

- SGDClassifier
- RandomForestClassifier
- XGBoost
- AdaBoostClassifier

We will tune hyperparameters to maximize the ROC AUC metric.

### Setting $X$ and $y$

Because every model is evaluated with cross-validation, we do not hold out a separate test set; all labeled rows are used for training and validation.

In [8]:
#from sklearn.model_selection import train_test_split

X = train_scaled.values
y = training_clean['RESPONSE'].values

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
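The cross-validation approach can be sketched on synthetic data. When `cv` is an integer and the estimator is a classifier, scikit-learn's search classes use stratified folds, so each fold keeps roughly the same share of the rare positive class. The data, the `LogisticRegression` stand-in, and the variable names below are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced data (hypothetical): positives are rare and
# loosely driven by the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 5))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=600) > 2.2).astype(int)

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    scoring="roc_auc",
    cv=cv,
)

print("positives:", int(y_toy.sum()), "of", len(y_toy))
print("ROC AUC per fold:", scores.round(3), "mean:", round(scores.mean(), 3))
```

Each row is scored by a model that never saw it during fitting, which is what makes the averaged score usable in place of a separate hold-out set.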

### 1) SGDClassifier

This is a linear model, so we do not expect great performance. However, it trains quickly, which makes it a convenient check that the whole workflow is wired together correctly.

In [9]:
%%time

from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# Define classifier
sgd_clf = SGDClassifier(class_weight = 'balanced', early_stopping = True, loss = 'modified_huber')

# Setting hyperparameter grid
sgd_hyperparams = {'penalty': ['l2', 'l1', 'elasticnet'], 
                   'alpha': [5e-5, 0.0001, 0.0002]}

# Define grid search
sgd_cv = GridSearchCV(sgd_clf, sgd_hyperparams, scoring = 'roc_auc', n_jobs = -1, verbose = 1, cv = 3)

# Fit training
sgd_cv.fit(X, y)

print('\nBest parameters:', sgd_cv.best_params_)
print('Best score:', sgd_cv.best_score_)

### Selecting best estimator
sgd_best = sgd_cv.best_estimator_

#y_pred_prob = sgd_best.predict_proba(X)[:,1]
#print('\nroc_auc_score: ', roc_auc_score(y, y_pred_prob))

# Confusion matrix on the training data (rows: true class, columns: predicted class)
confusion_matrix(y, sgd_best.predict(X))

Fitting 3 folds for each of 9 candidates, totalling 27 fits

Best parameters: {'alpha': 0.0001, 'penalty': 'l1'}
Best score: 0.630051361200492
Wall time: 10.8 s


array([[28325, 14105],
       [  174,   358]], dtype=int64)
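To read the confusion matrix above, it helps to turn the four counts into rates. The sketch below uses only the printed counts; in scikit-learn's layout, rows are true classes and columns are predicted classes:

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[28325, 14105],
               [174, 358]])

tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)               # share of actual responders that were flagged
precision = tp / (tp + fp)            # share of flagged individuals who responded
specificity = tn / (tn + fp)          # share of non-responders correctly left out
positive_rate = (tp + fn) / cm.sum()  # share of responders in the data

print(f"recall:        {recall:.3f}")
print(f"precision:     {precision:.3f}")
print(f"specificity:   {specificity:.3f}")
print(f"positive rate: {positive_rate:.4f}")
```

With `class_weight = 'balanced'`, the classifier flags 358 of the 532 responders, at the cost of 14105 false positives. This is why a ranking metric such as ROC AUC is more informative here than accuracy.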

### 2) RandomForest Classifier

In [10]:
from sklearn.ensemble import RandomForestClassifier


rfc_clf = RandomForestClassifier() # class_weight = 'balanced')

rf_hyperparameters = {'n_estimators': [150, 175, 185, 200], #, 250, 300],  
                      'max_depth': [ 3, 4, 5, 6, 7], 
                      'max_features': ['auto', 0.25, 0.33, 0.4],
                      'min_samples_leaf': [3, 5, 10, 12, 15]}

rfc_cv = GridSearchCV(rfc_clf, rf_hyperparameters, scoring = 'roc_auc', n_jobs = 2, verbose = 3, cv = 3)

rfc_cv.fit(X, y)


print('\nBest parameters:', rfc_cv.best_params_)
print('Best score:', rfc_cv.best_score_)

rfc_final = rfc_cv.best_estimator_

# In-sample score: the refitted model has already seen these rows, so this is optimistic
y_pred = rfc_final.predict_proba(X)[:,1]

print('roc_auc_score using metric:', roc_auc_score(y, y_pred))

Fitting 3 folds for each of 400 candidates, totalling 1200 fits

Best parameters: {'max_depth': 5, 'max_features': 0.4, 'min_samples_leaf': 15, 'n_estimators': 185}
Best score: 0.769675021887556
roc_auc_score using metric: 0.8608629161874756
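The in-sample score (0.861) is well above the cross-validated score (0.770) because the refitted forest is scored on rows it was trained on. A hedged sketch of how to get an out-of-fold estimate for a fixed set of hyperparameters, using `cross_val_predict` on synthetic data (the data and settings are illustrative assumptions, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic data (hypothetical): one noisy informative feature, seven noise features.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 8))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.5, size=500) > 2.0).astype(int)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Out-of-fold probabilities: every row is predicted by a model that did not see it.
oof = cross_val_predict(model, X_toy, y_toy, cv=cv, method="predict_proba")[:, 1]
auc_oof = roc_auc_score(y_toy, oof)

# In-sample probabilities: the model is scored on its own training rows.
model.fit(X_toy, y_toy)
auc_train = roc_auc_score(y_toy, model.predict_proba(X_toy)[:, 1])

print(f"out-of-fold ROC AUC: {auc_oof:.3f}")
print(f"in-sample ROC AUC:   {auc_train:.3f}")
```

The out-of-fold number is the one to compare across models; the in-sample number mostly reflects how flexibly the model can fit the training rows.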


In [11]:
with open('final_rfc.pkl', 'wb') as f:
    pickle.dump(rfc_final, f)

### 3) XGBoost

In [12]:
import xgboost as xgb #!pip install xgboost
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats
from scipy.stats import loguniform


xgb_clf = xgb.XGBClassifier(n_jobs = -1, objective = 'binary:logistic', eval_metric = 'auc')
                            #scale_pos_weight = 99) 

xgb_distributions = {'n_estimators':[4, 5, 6, 7, 10, 50, 100, 150, 200],  # [5, 10, 50, 100], 
                       'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                       'gamma': loguniform(1e-4, 1e0),    
                       'learning_rate': loguniform(1e-4, 1e0), 
                       'colsample_bytree':  stats.uniform(0.1, 0.9),}  # uniform(loc, scale) samples [loc, loc + scale] = [0.1, 1.0]
                       
xgb_cv = RandomizedSearchCV(xgb_clf, 
                        param_distributions = xgb_distributions, 
                        n_iter = 400,
                        scoring = 'roc_auc', 
                        n_jobs = 2, 
                        verbose = 3, 
                        cv = 3)

xgb_cv.fit(X, y)

print('\nBest parameters:', xgb_cv.best_params_)
print('Best score:', xgb_cv.best_score_)

xgb_final = xgb_cv.best_estimator_

y_pred= xgb_final.predict_proba(X)[:,1]
print('roc_auc_score using metric:', roc_auc_score(y, y_pred))

Fitting 3 folds for each of 400 candidates, totalling 1200 fits



Best parameters: {'colsample_bytree': 0.43240625240551533, 'gamma': 0.576392071568772, 'learning_rate': 0.4088291979761932, 'max_depth': 4, 'n_estimators': 7}
Best score: 0.7692127082056782
roc_auc_score using metric: 0.7860359787637843
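One detail of the search space is easy to misread: `scipy.stats.uniform(loc, scale)` covers the interval `[loc, loc + scale]`, not `[low, high]`. Since XGBoost requires `colsample_bytree` to lie in (0, 1], the second argument determines whether invalid values can be drawn. A small check of the ranges (variable names are illustrative):

```python
from scipy import stats

# uniform(loc, scale) is defined on [loc, loc + scale].
wide = stats.uniform(0.1, 1.0)     # covers [0.1, 1.1]: can exceed 1
narrow = stats.uniform(0.1, 0.9)   # covers [0.1, 1.0]

print("wide range:  ", wide.ppf(0.0), "to", wide.ppf(1.0))
print("narrow range:", narrow.ppf(0.0), "to", narrow.ppf(1.0))

# loguniform(a, b) takes the actual lower and upper bounds.
learning_rate = stats.loguniform(1e-4, 1e0)

narrow_samples = narrow.rvs(size=1000, random_state=0)
lr_samples = learning_rate.rvs(size=1000, random_state=0)

print("narrow samples:", narrow_samples.min(), "to", narrow_samples.max())
print("learning-rate samples:", lr_samples.min(), "to", lr_samples.max())
```

Candidates with `colsample_bytree` above 1 fail to fit, and a randomized search reports their score as `nan`.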


In [13]:
with open('final_xgb.pkl', 'wb') as f:
    pickle.dump(xgb_final, f)

### 4) AdaBoost

In [14]:
from sklearn.ensemble import AdaBoostClassifier

adab_clf = AdaBoostClassifier()

adab_hyperparams = {'n_estimators': [5, 10, 50, 100, 200], 
                    'learning_rate': [0.01, 0.05, 0.1, 0.2, 1]}


adab_cv = GridSearchCV(adab_clf, adab_hyperparams, scoring = 'roc_auc', n_jobs = 2, verbose = 3, cv = 3)

adab_cv.fit(X, y)

print('\nBest parameters:', adab_cv.best_params_)
print('Best score:', adab_cv.best_score_)

adab_final = adab_cv.best_estimator_

y_pred = adab_final.predict_proba(X)[:,1]
print('roc_auc_score using metric:', roc_auc_score(y, y_pred))

Fitting 3 folds for each of 25 candidates, totalling 75 fits

Best parameters: {'learning_rate': 0.05, 'n_estimators': 100}
Best score: 0.7631178622195605
roc_auc_score using metric: 0.7765788720564079


In [16]:
with open('final_adab.pkl', 'wb') as f:
    pickle.dump(adab_final, f)

## Feature selection with Boruta

In [17]:
from boruta import BorutaPy # !pip install boruta

clf = RandomForestClassifier(n_jobs=-1, max_depth = 6, class_weight = 'balanced')

trans = BorutaPy(clf, n_estimators = 'auto', random_state = 42, verbose=1, max_iter = 200)

X_filtered = trans.fit_transform(X, y)

Iteration: 1 / 200
...
Iteration: 34 / 200


BorutaPy finished running.

Iteration: 	35 / 200
Confirmed: 	6
Tentative: 	0
Rejected: 	206
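Boruta decides which features to keep by comparing each real feature's importance against shuffled "shadow" copies of the features, which by construction carry no information about the target. The sketch below shows a single round of that comparison on synthetic data. It is a simplified illustration of the idea, not BorutaPy's implementation, which repeats the comparison over many iterations and applies a statistical test before confirming or rejecting a feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): only feature 0 is related to the target.
rng = np.random.default_rng(0)
n_rows, n_features = 400, 4
X_toy = rng.normal(size=(n_rows, n_features))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

# Shadow features: each column shuffled independently, which breaks any link to y.
X_shadow = np.column_stack([rng.permutation(column) for column in X_toy.T])

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X_toy, X_shadow]), y_toy)

real_importance = forest.feature_importances_[:n_features]
shadow_importance = forest.feature_importances_[n_features:]

# A real feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_importance.max()

print("real importances:  ", real_importance.round(3))
print("best shadow:       ", round(shadow_importance.max(), 3))
print("hits in this round:", hits)
```

Features that rarely beat the best shadow feature across rounds are rejected, which is how 206 of the 212 features were discarded above.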


### Relevant features: 

In [60]:
rel = pd.DataFrame(X).loc[:, trans.support_]

list(training_clean.iloc[:, rel.columns].columns)

['D19_KONSUMTYP_MAX',
 'D19_SOZIALES',
 'HH_EINKOMMEN_SCORE',
 'RT_SCHNAEPPCHEN',
 'CJT_GESAMTTYP_6.0',
 'D19_KONSUMTYP_6.0']
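When mapping Boruta's boolean mask back to names, the positions in the mask refer to the columns of `X`, that is, the feature frame without `RESPONSE`. A minimal sketch with a toy frame and a hand-written mask standing in for `trans.support_` (both are assumptions for illustration) shows why the names should be taken from the feature frame rather than from the full frame:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): the target is not the last column.
toy = pd.DataFrame({
    "f_a": [1.0, 2.0],
    "RESPONSE": [0, 1],
    "f_b": [3.0, 4.0],
    "f_c": [5.0, 6.0],
})

# Names in the same order as the columns of the feature matrix.
feature_names = toy.drop(columns=["RESPONSE"]).columns

# Stand-in for trans.support_: one boolean per feature column.
support = np.array([False, True, True])

selected = feature_names[support].tolist()

# Positional lookup on the full frame is shifted by the target column.
shifted = toy.columns[np.flatnonzero(support)].tolist()

print("from feature frame:", selected)
print("from full frame:   ", shifted)
```

The two lookups agree only when the target column comes after every selected feature.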

### Using X_filtered

We retrain the last three algorithms (RandomForest, XGBoost, and AdaBoost) on the features selected by Boruta.

### RandomForest 

In [33]:
rfc_bor_clf = RandomForestClassifier() # class_weight = 'balanced')

rf_hyperparameters = {'n_estimators': [150, 175, 185, 200], #, 250, 300],  
                      'max_depth': [ 3, 4, 5, 6, 7], 
                      'max_features': ['auto', 0.25, 0.33, 0.4],
                      'min_samples_leaf': [3, 5, 10, 12, 15]}

rfc_bor_cv = GridSearchCV(rfc_bor_clf, rf_hyperparameters, scoring = 'roc_auc', n_jobs = 2, verbose = 3, cv = 3)

rfc_bor_cv.fit(X_filtered, y)


print('\nBest parameters:', rfc_bor_cv.best_params_)
print('Best score:', rfc_bor_cv.best_score_)

rfc_bor_final = rfc_bor_cv.best_estimator_

y_pred= rfc_bor_final.predict_proba(X_filtered)[:,1]

print('roc_auc_score using metric:', roc_auc_score(y, y_pred))

Fitting 3 folds for each of 400 candidates, totalling 1200 fits

Best parameters: {'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 5, 'n_estimators': 185}
Best score: 0.7688345183377158
roc_auc_score using metric: 0.8044315360638221


In [34]:
with open('final_bor_rfc.pkl', 'wb') as f:
    pickle.dump(rfc_bor_final, f)

### XGB

In [35]:
xgb_bor_clf = xgb.XGBClassifier(n_jobs = -1, objective = 'binary:logistic', eval_metric = 'auc')
                            #scale_pos_weight = 99) 

xgb_distributions = {'n_estimators':[4, 5, 6, 7, 10, 50, 100, 150, 200],  # [5, 10, 50, 100], 
                       'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                       'gamma': loguniform(1e-4, 1e0),    
                       'learning_rate': loguniform(1e-4, 1e0), 
                       'colsample_bytree':  stats.uniform(0.1, 0.9),}  # uniform(loc, scale) samples [loc, loc + scale] = [0.1, 1.0]
                       
xgb_bor_cv = RandomizedSearchCV(xgb_bor_clf, 
                        param_distributions = xgb_distributions, 
                        n_iter = 400,
                        scoring = 'roc_auc', 
                        n_jobs = -1, 
                        verbose = 3, 
                        cv = 3)

xgb_bor_cv.fit(X_filtered, y)

print('\nBest parameters:', xgb_bor_cv.best_params_)
print('Best score:', xgb_bor_cv.best_score_)

xgb_bor_final = xgb_bor_cv.best_estimator_

y_pred= xgb_bor_final.predict_proba(X_filtered)[:,1]
print('roc_auc_score using metric:', roc_auc_score(y, y_pred))

Fitting 3 folds for each of 400 candidates, totalling 1200 fits




Best parameters: {'colsample_bytree': 0.4241194460159503, 'gamma': 0.3655779982521269, 'learning_rate': 0.12275552548889282, 'max_depth': 2, 'n_estimators': 200}
Best score: 0.7723333062851351
roc_auc_score using metric: 0.7991307000118726


In [36]:
with open('final_bor_xgb.pkl', 'wb') as f:
    pickle.dump(xgb_bor_final, f)

### AdaBoost

In [37]:
adab_bor_clf = AdaBoostClassifier()

adab_hyperparams = {'n_estimators': [5, 10, 50, 100, 200], 
                    'learning_rate': [0.01, 0.05, 0.1, 0.2, 1]}


adab_bor_cv = GridSearchCV(adab_bor_clf, adab_hyperparams, scoring = 'roc_auc', n_jobs = 2, verbose = 3, cv = 3)

adab_bor_cv.fit(X_filtered, y)

print('\nBest parameters:', adab_bor_cv.best_params_)
print('Best score:', adab_bor_cv.best_score_)

adab_bor_final = adab_bor_cv.best_estimator_

y_pred = adab_bor_final.predict_proba(X_filtered)[:,1]
print('roc_auc_score using metric:', roc_auc_score(y, y_pred))

Fitting 3 folds for each of 25 candidates, totalling 75 fits

Best parameters: {'learning_rate': 0.2, 'n_estimators': 200}
Best score: 0.7642875996769988
roc_auc_score using metric: 0.7866541131877538


In [38]:
with open('final_bor_adab.pkl', 'wb') as f:
    pickle.dump(adab_bor_final, f)

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual is to have become a customer; this need not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy is not an appropriate performance evaluation method. Instead, the competition uses AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: what matters is that the higher values capture as many of the actual customers as possible, early in the ROC curve sweep.
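The point that the exact "RESPONSE" values matter less than their ordering can be checked directly: ROC AUC depends only on how the scores rank the individuals, so any strictly increasing transformation leaves it unchanged. A small sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.30, 0.70])

auc_raw = roc_auc_score(y_true, scores)
auc_scaled = roc_auc_score(y_true, 100 * scores + 3)  # linear rescaling
auc_log = roc_auc_score(y_true, np.log(scores))       # nonlinear but strictly increasing

print(auc_raw, auc_scaled, auc_log)
```

All three values are identical, which is why a submission does not need to contain calibrated probabilities.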

In [41]:
mailout_test = pd.read_csv('data/Udacity_MAILOUT_052018_TEST.csv')

test_clean = cleaning_functions.clean_data(mailout_test)

test_clean.set_index(['LNR'], inplace = True)

# Scaler
test_scaled = scaler.transform(test_clean)


Initial amount of missing values: 2186771

Reading the description of attributes table....

Missing values after including missing codes 2322980
Additional missing values: 136209

Starting the cleaning of attributes and feature engineering...
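`scaler.transform` expects the test frame to have the same columns, in the same order, as the frame the scaler was fitted on. Whether `clean_data` guarantees this for both partitions is not shown here, so the following is only a sketch of a defensive check on toy frames (all names and values are hypothetical); filling a missing dummy column with 0 is itself an assumption that suits one-hot columns only:

```python
import pandas as pd

# Toy stand-ins for the cleaned training features and the cleaned test set.
train_features = pd.DataFrame({
    "ARBEIT": [3.0, 2.0],
    "REGIOTYP_6.0": [0, 1],
    "REGIOTYP_7.0": [1, 0],
})
test_features = pd.DataFrame({
    "REGIOTYP_7.0": [0, 1],   # different column order
    "ARBEIT": [4.0, 1.0],     # and one dummy column never created
})

missing = train_features.columns.difference(test_features.columns)
extra = test_features.columns.difference(train_features.columns)
print("missing in test:", list(missing))
print("extra in test:  ", list(extra))

# Reorder to the training layout; absent dummy columns are filled with 0.
test_aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(test_aligned)
```

If both lists are empty and the order already matches, the `reindex` call returns the frame unchanged.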


### RandomForest submission

In [42]:
preds_rfc = rfc_final.predict_proba(test_scaled)[:,1]

preds_rfc_df = pd.DataFrame(preds_rfc, index = test_clean.index)

preds_rfc_df.reset_index(inplace = True)

preds_rfc_df.columns = ['LNR', 'RESPONSE']

print(preds_rfc_df.shape)

preds_rfc_df.to_csv('predictions/pred_rfc.csv', index = False)

preds_rfc_df.head()

(42833, 2)


Unnamed: 0,LNR,RESPONSE
0,1754,0.028906
1,1770,0.039914
2,1465,0.00252
3,1470,0.003173
4,1478,0.008785


![img](img/Submit_2.png)
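Every submission follows the same steps: pair the predicted probabilities with the `LNR` index, name the two columns, and write a CSV. A hypothetical helper, `make_submission` (not part of the notebook or of any library), could wrap those steps as follows:

```python
import numpy as np
import pandas as pd


def make_submission(probabilities, lnr_index, path=None):
    """Build the two-column submission frame and optionally write it to CSV.

    `probabilities` are the positive-class scores, `lnr_index` the matching
    LNR identifiers in the same order. Hypothetical helper for illustration.
    """
    probabilities = np.asarray(probabilities)
    lnr_values = np.asarray(lnr_index)
    if len(probabilities) != len(lnr_values):
        raise ValueError("probabilities and lnr_index must have the same length")

    submission = pd.DataFrame({"LNR": lnr_values, "RESPONSE": probabilities})
    if path is not None:
        submission.to_csv(path, index=False)
    return submission


# Usage with made-up scores and identifiers.
demo = make_submission([0.03, 0.04, 0.002], pd.Index([1754, 1770, 1465], name="LNR"))
print(demo)
```

In the notebook's terms, a call such as `make_submission(rfc_final.predict_proba(test_scaled)[:, 1], test_clean.index, 'predictions/pred_rfc.csv')` would correspond to the cell above.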

### XGB submission

In [43]:
preds_xgb = xgb_final.predict_proba(test_scaled)[:,1]

preds_xgb_df = pd.DataFrame(preds_xgb, index = test_clean.index)

preds_xgb_df.reset_index(inplace = True)

preds_xgb_df.columns = ['LNR', 'RESPONSE']

print(preds_xgb_df.shape)

preds_xgb_df.to_csv('predictions/pred_xgb.csv', index = False)

preds_xgb_df.head()

(42833, 2)


Unnamed: 0,LNR,RESPONSE
0,1754,0.051515
1,1770,0.051883
2,1465,0.028016
3,1470,0.027879
4,1478,0.027879


![img](img/Submit_1.png)

### AdaBoost submission

In [44]:
preds_adab = adab_final.predict_proba(test_scaled)[:,1]

preds_adab_df = pd.DataFrame(preds_adab, index = test_clean.index)

preds_adab_df.reset_index(inplace = True)

preds_adab_df.columns = ['LNR', 'RESPONSE']

print(preds_adab_df.shape)

preds_adab_df.to_csv('predictions/pred_adab.csv', index = False)

preds_adab_df.head()

(42833, 2)


Unnamed: 0,LNR,RESPONSE
0,1754,0.32313
1,1770,0.329434
2,1465,0.244584
3,1470,0.244584
4,1478,0.25959


![img](img/submit_3.png)

## Models trained with the selected features

In [46]:
test_reduced = trans.transform(test_scaled)

### Boruta + RandomForest [Winner]

In [47]:
rfc_bor_final

preds_bor_rfc = rfc_bor_final.predict_proba(test_reduced)[:,1]

preds_bor_rfc_df = pd.DataFrame(preds_bor_rfc, index = test_clean.index)

preds_bor_rfc_df.reset_index(inplace = True)

preds_bor_rfc_df.columns = ['LNR', 'RESPONSE']

print(preds_bor_rfc_df.shape)

preds_bor_rfc_df.to_csv('predictions/pred_bor_rfc.csv', index = False)

preds_bor_rfc_df.head()

(42833, 2)


Unnamed: 0,LNR,RESPONSE
0,1754,0.030328
1,1770,0.030687
2,1465,0.002439
3,1470,0.002938
4,1478,0.007083


![img](img/submit_4.png)

### Boruta + XGB

In [48]:
preds_xgb_bor = xgb_bor_final.predict_proba(test_reduced)[:,1]

preds_xgb_bor_df = pd.DataFrame(preds_xgb_bor, index = test_clean.index)

preds_xgb_bor_df.reset_index(inplace = True)

preds_xgb_bor_df.columns = ['LNR', 'RESPONSE']

print(preds_xgb_bor_df.shape)

preds_xgb_bor_df.to_csv('predictions/pred_bor_xgb.csv', index = False)

preds_xgb_bor_df.head()

(42833, 2)


Unnamed: 0,LNR,RESPONSE
0,1754,0.030596
1,1770,0.033635
2,1465,0.002101
3,1470,0.00261
4,1478,0.003887


![img](img/submit_5.png)

### Boruta + AdaBoost

In [49]:
preds_adab_bor = adab_bor_final.predict_proba(test_reduced)[:,1]

preds_adab_bor_df = pd.DataFrame(preds_adab_bor, index = test_clean.index)

preds_adab_bor_df.reset_index(inplace = True)

preds_adab_bor_df.columns = ['LNR', 'RESPONSE']

print(preds_adab_bor_df.shape)

preds_adab_bor_df.to_csv('predictions/pred_bor_adab.csv', index = False)

preds_adab_bor_df.head()

(42833, 2)


Unnamed: 0,LNR,RESPONSE
0,1754,0.478823
1,1770,0.478971
2,1465,0.462971
3,1470,0.462712
4,1478,0.465857


![img](img/submit_6.png)

## Final results:

![img](img/Table.png)