## Part 3: Kaggle Competition<a name="part3"></a>

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual is to become a customer; this need not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and scoring them with accuracy is not an appropriate performance evaluation method. Instead, the competition uses AUC (area under the ROC curve) to evaluate performance. The exact values of the "RESPONSE" column matter less than their ordering: higher values should capture as many of the actual customers as possible early in the ROC curve sweep.
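To make the point about ordering concrete, here is a minimal sketch in plain Python. It computes AUC as the fraction of (customer, non-customer) pairs in which the customer gets the higher score, which is equivalent to the area under the ROC curve. The helper name `roc_auc` and the toy labels and scores are assumptions made up for illustration; they are not taken from the competition data or from the models in this notebook.

```python
import math


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical, imbalanced toy data: 2 responders among 8 individuals.
labels = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.02, 0.10, 0.35, 0.05, 0.08, 0.01, 0.40, 0.03]

# AUC of the raw scores.
auc_raw = roc_auc(labels, scores)

# Strictly increasing transformations keep the ordering, so AUC is unchanged.
auc_scaled = roc_auc(labels, [100 * s + 7 for s in scores])
auc_log = roc_auc(labels, [math.log(s) for s in scores])

# Hard class predictions at a 0.5 threshold collapse every score to 0 here,
# so all pairs are tied and the ranking information is lost.
auc_hard = roc_auc(labels, [1 if s >= 0.5 else 0 for s in scores])

print(auc_raw, auc_scaled, auc_log, auc_hard)  # 0.75 0.75 0.75 0.5
```

This is why the submissions below use the positive-class column of `predict_proba` rather than hard class labels: with so few responders, thresholded predictions would discard most of the ranking that AUC rewards.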

In [None]:
# Load in the data:
#mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')

# Loading test set:
mailout_test = joblib.load('test')

mailout_test.head()

In [None]:
# Saving LNR object:
lnr = list(mailout_test.LNR.values)

# Applying data transformations on test set:
mailout_test = supervised_data_transformation(mailout_test, test_set = True)

mailout_test.head()

### 3.1 Attempt 1: Training on Unbalanced Data<a name="p1"></a>

In [None]:
# Predicting on test data:
y_gbc_pred = gbc_clf.predict_proba(mailout_test)

In [None]:
# Creating prediction dataframe:
gbc_pred_df = pd.DataFrame(columns = ['LNR', 'RESPONSE'])

# Assigning id:
gbc_pred_df['LNR'] = lnr

# Assigning predictions:
gbc_pred_df['RESPONSE'] = y_gbc_pred[:, 1]

gbc_pred_df.head()

In [None]:
# Saving person predictions csv:
gbc_pred_df.to_csv('gbc_pred.csv', header = True, index = False)

# KAGGLE SCORE: 0.79488

Judging by the Kaggle leaderboard, this first approach is an average model, placing among the **top 150**.

### 3.2 Attempt 2: Training on Balanced Data<a name="p2"></a>

In [None]:
# Predicting on test data:
y_gbc_smote_pred = gbc_smote_clf.predict_proba(mailout_test)

In [None]:
# Creating prediction dataframe:
gbc_smote_pred_df = pd.DataFrame(columns = ['LNR', 'RESPONSE'])

# Assigning id:
gbc_smote_pred_df['LNR'] = lnr

# Assigning predictions:
gbc_smote_pred_df['RESPONSE'] = y_gbc_smote_pred[:, 1]

gbc_smote_pred_df.head()

In [None]:
# Saving person predictions csv:
gbc_smote_pred_df.to_csv('gbc_smote_pred.csv', header = True, index = False)

# KAGGLE SCORE: 0.70165

### 3.3 Attempt 3: Information Level and PCA Transformation<a name="p3"></a>

In [None]:
# Predicting on test data:
y_gbc_pca_pred = gbc_pca_clf.predict_proba(mailout_test[selected_columns])

In [None]:
# Creating prediction dataframe:
gbc_pca_pred_df = pd.DataFrame(columns = ['LNR', 'RESPONSE'])

# Assigning id:
gbc_pca_pred_df['LNR'] = lnr

# Assigning predictions:
gbc_pca_pred_df['RESPONSE'] = y_gbc_pca_pred[:, 1]

gbc_pca_pred_df.head()

In [None]:
# Saving person predictions csv:
gbc_pca_pred_df.to_csv('gbc_pca_pred.csv', header = True, index = False)

# KAGGLE SCORE: 0.79168

### 3.4 Attempt 4: PCA Transformation<a name="p4"></a>

In [None]:
# Predicting on test data:
y_gbc_pca2_pred = gbc_pca2_clf.predict_proba(mailout_test)

In [None]:
# Creating prediction dataframe:
gbc_pca2_pred_df = pd.DataFrame(columns = ['LNR', 'RESPONSE'])

# Assigning id:
gbc_pca2_pred_df['LNR'] = lnr

# Assigning predictions:
gbc_pca2_pred_df['RESPONSE'] = y_gbc_pca2_pred[:, 1]

gbc_pca2_pred_df.head()

In [None]:
# Saving person predictions csv:
gbc_pca2_pred_df.to_csv('gbc_pca2_pred.csv', header = True, index = False)

# KAGGLE SCORE: 0.71402

### 3.5 Attempt 5: XGBoost Classifier and Bayesian Optimization<a name="p5"></a>

In [None]:
# Predicting on test data:
y_xgbc_bayes_pred = xgbc_bayes_clf.predict_proba(mailout_test)

In [None]:
# Creating prediction dataframe:
xgbc_bayes_pred_df = pd.DataFrame(columns = ['LNR', 'RESPONSE'])

# Assigning id:
xgbc_bayes_pred_df['LNR'] = lnr

# Assigning predictions:
xgbc_bayes_pred_df['RESPONSE'] = y_xgbc_bayes_pred[:, 1]

xgbc_bayes_pred_df.head()

In [None]:
# Saving person predictions csv:
xgbc_bayes_pred_df.to_csv('xgbc_bayes_pred.csv', header = True, index = False)

# KAGGLE SCORE: 0.80492

Although this model's `roc_auc` score is comparable to that of the first model, its predictions on the test data were a clear step forward.

This score places the model among the **top 40** of the 349 data scientists on the Kaggle leaderboard.

![best_score.png](attachment:best_score.png)

### 3.6 Attempt 6: LightGBM and Bayesian Optimization<a name="p6"></a>

In [None]:
# Predicting on test data:
y_lgbm_bayes_pred = lgbm_bayes_clf.predict_proba(mailout_test)

In [None]:
# Creating prediction dataframe:
lgbm_bayes_pred_df = pd.DataFrame(columns = ['LNR', 'RESPONSE'])

# Assigning id:
lgbm_bayes_pred_df['LNR'] = lnr

# Assigning predictions:
lgbm_bayes_pred_df['RESPONSE'] = y_lgbm_bayes_pred[:, 1]

lgbm_bayes_pred_df.head()

In [None]:
# Saving person predictions csv:
lgbm_bayes_pred_df.to_csv('lgbm_bayes_pred.csv', header = True, index = False)

# KAGGLE SCORE: 0.79743
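The six leaderboard scores recorded in the comments above can be gathered in one place for comparison. The snippet below is only a convenience sketch: the scores are copied from this notebook, while the dictionary name `kaggle_scores` and the short labels (taken from the section headings) are introduced here for illustration.

```python
# Kaggle scores as recorded in sections 3.1 to 3.6 of this notebook.
kaggle_scores = {
    "3.1 Training on Unbalanced Data": 0.79488,
    "3.2 Training on Balanced Data": 0.70165,
    "3.3 Information Level and PCA Transformation": 0.79168,
    "3.4 PCA Transformation": 0.71402,
    "3.5 XGBoost Classifier and Bayesian Optimization": 0.80492,
    "3.6 LightGBM and Bayesian Optimization": 0.79743,
}

# Sort attempts from best to worst score.
ranked = sorted(kaggle_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{score:.5f}  {name}")

best_name, best_score = ranked[0]
```

Sorted this way, the XGBoost attempt from section 3.5 comes first, followed by the LightGBM attempt and the first unbalanced-data model, while the balanced-data and PCA-only attempts score lowest.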

