# Ensembling

In this notebook, the predictions from all the models presented earlier are combined into one dataframe. A simple vote is then taken: for each test-set item, the majority prediction across the models becomes the ensembled prediction.

In [59]:
import os
import pandas as pd

# First try

Combine all the model predictions into a single dataframe.

In [60]:
folder = "csvs/"
predictions = pd.DataFrame()
column_count = 0

# Read every CSV in the folder and add its 'Survived' column as col_<n>.
for filename in os.listdir(folder):
  if filename.endswith('.csv'):
    df = pd.read_csv(os.path.join(folder, filename))

    if "PassengerId" not in predictions:
      # The first file supplies PassengerId; its predictions become col_0.
      predictions = df.rename(columns={"Survived": "col_0"})
    else:
      predictions["col_" + str(column_count)] = df['Survived']

    # Count only the files actually loaded, so non-CSV files cannot skew the vote.
    column_count += 1

predictions.head()


Unnamed: 0,PassengerId,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,892,0,0,0,0,0,0,0,0
1,893,1,1,0,1,0,0,0,1
2,894,0,0,0,0,0,0,0,0
3,895,0,0,0,0,1,1,1,0
4,896,1,1,1,1,1,0,0,1
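
The loop above assigns columns by row position, so it relies on every file listing the passengers in the same order. A minimal sketch of a more defensive variant that aligns rows on `PassengerId` instead, assuming each file has `PassengerId` and `Survived` columns as in the output shown. The helper name `combine_predictions` and the toy frames are hypothetical and only serve as illustration:

```python
import pandas as pd

def combine_predictions(frames):
    """Merge per-model prediction frames on PassengerId (hypothetical helper).

    `frames` is a list of DataFrames, each assumed to contain
    PassengerId and Survived columns.
    """
    combined = None
    for i, df in enumerate(frames):
        # Keep only the needed columns and give each model its own column name.
        part = df[["PassengerId", "Survived"]].rename(columns={"Survived": f"col_{i}"})
        combined = part if combined is None else combined.merge(part, on="PassengerId")
    return combined

# Toy data standing in for two model outputs; the second is deliberately shuffled.
model_a = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
model_b = pd.DataFrame({"PassengerId": [894, 892, 893], "Survived": [0, 0, 1]})

combined = combine_predictions([model_a, model_b])
print(combined)
```

Merging on the key costs a little more than column assignment, but a file with reordered rows no longer silently misaligns the votes.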


In [61]:
# Return the majority vote of all model predictions for one row.
# A row needs strictly more than half of the votes to be 1; ties resolve to 0.
def get_vote(row):
  if row.drop('PassengerId').sum() > column_count / 2:
    return 1
  else:
    return 0
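
The same strict-majority rule can also be written without a row-wise `apply`. The following is a sketch, not the notebook's own code: it rebuilds the five rows from the output above and selects the vote columns by their `col_` prefix, which is an assumption based on the naming used here:

```python
import pandas as pd

# The five rows shown in the output above.
predictions = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "col_0": [0, 1, 0, 0, 1],
    "col_1": [0, 1, 0, 0, 1],
    "col_2": [0, 0, 0, 0, 1],
    "col_3": [0, 1, 0, 0, 1],
    "col_4": [0, 0, 0, 1, 1],
    "col_5": [0, 0, 0, 1, 0],
    "col_6": [0, 0, 0, 1, 0],
    "col_7": [0, 1, 0, 0, 1],
})

# Select only the model columns so PassengerId never enters the sum.
vote_columns = [c for c in predictions.columns if c.startswith("col_")]

# Number of models predicting 1 for each passenger.
ones = predictions[vote_columns].sum(axis=1)

# Strict majority: more than half of the models must predict 1 (ties give 0).
vote = (ones > len(vote_columns) / 2).astype(int)
print(vote.tolist())
```

Passenger 893 receives four of eight votes, which is a tie, so the strict rule assigns 0.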

In [62]:
# Get the ensembled prediction for each testset item.
predictions['vote'] = predictions.apply(get_vote, axis=1)

In [63]:
predictions.head()

Unnamed: 0,PassengerId,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,vote
0,892,0,0,0,0,0,0,0,0,0
1,893,1,1,0,1,0,0,0,1,0
2,894,0,0,0,0,0,0,0,0,0
3,895,0,0,0,0,1,1,1,0,0
4,896,1,1,1,1,1,0,0,1,1


Separate the final ensembled predictions into a dataframe.

In [64]:
predictions.rename(columns={"vote": "Survived"}, inplace=True)
ensembled_predictions = predictions[['PassengerId', 'Survived']]
ensembled_predictions.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [65]:
ensembled_predictions.to_csv('ensemble_preds_02.csv', index=False)

Taking the predictions of all the models into account, we get an accuracy of 0.784, which is the best accuracy so far.

# Second try

Let's see whether we can improve the accuracy further by taking the majority vote of only the three best models: Support Vector Machine, Kernel SVM and Logistic Regression.

In [66]:
svm = pd.read_csv('csvs/titanic_svm_01.csv')
kernel_svm = pd.read_csv('csvs/titanic_kernel_svm_01.csv')
lgreg = pd.read_csv('csvs/titanic_logistic_regression_01.csv')

In [67]:
# Work on a copy so the original svm dataframe is left unchanged.
predictions2 = svm.copy()
predictions2.rename(columns={"Survived": "col_0"}, inplace=True)
predictions2['col_1'] = kernel_svm['Survived']
predictions2['col_2'] = lgreg['Survived']

In [69]:
predictions2.head(20)

Unnamed: 0,PassengerId,col_0,col_1,col_2
0,892,0,0,0
1,893,1,1,1
2,894,0,0,0
3,895,0,0,0
4,896,1,1,1
5,897,0,0,0
6,898,1,1,1
7,899,0,0,0
8,900,1,1,1
9,901,0,0,0


In [71]:
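# Vote over the three model columns.
# Note: with three models, a sum greater than 2 means all three predicted 1,
# so this threshold requires unanimity; a two-of-three majority would use >= 2.
def get_vote2(row):
  if row.drop('PassengerId').sum() > 2:
    return 1
  else:
    return 0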
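
To make the effect of the threshold concrete, here is a small sketch comparing the `> 2` rule used in `get_vote2` with a two-of-three rule. The vote values are hypothetical and chosen only so that the models disagree on some rows:

```python
import pandas as pd

# Hypothetical votes from three models, including rows where they disagree.
votes = pd.DataFrame({
    "col_0": [0, 1, 1, 1],
    "col_1": [0, 1, 1, 0],
    "col_2": [0, 1, 0, 0],
})

# Number of models predicting 1 in each row.
ones = votes.sum(axis=1)

unanimous = (ones > 2).astype(int)   # same threshold as get_vote2
majority = (ones >= 2).astype(int)   # at least two of the three models

print(pd.DataFrame({"ones": ones, "unanimous": unanimous, "majority": majority}))
```

The two rules differ only on rows where exactly two models predict 1, so they give identical results whenever the three models agree.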

In [72]:
predictions2['Survived'] = predictions2.apply(get_vote2, axis=1)

In [73]:
predictions2.head()

Unnamed: 0,PassengerId,col_0,col_1,col_2,Survived
0,892,0,0,0,0
1,893,1,1,1,1
2,894,0,0,0,0
3,895,0,0,0,0
4,896,1,1,1,1


In [74]:
ensemble2 = predictions2[['PassengerId', 'Survived']]
ensemble2.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [75]:
ensemble2.to_csv('best_ensemble.csv', index=False)

This method of ensembling results in an accuracy of only 0.777, lower than the 0.784 obtained in the first try.