# 6. Final Predictions <a class="anchor"  id="chapter6"></a>

The Gradient Boosting Classifier with tuned hyperparameters has the best mean cross-validation score, although the tuned SVC model trails it by less than a percentage point. I submitted all four models (Gradient Boosting with default parameters, Gradient Boosting with tuned hyperparameters, SVC with default parameters, and SVC with tuned hyperparameters) to the public leaderboard and found that SVC with default parameters actually achieved the best score on the final prediction test set, at 81.33%.

There are a few reasons why the SVC model with default parameters may have turned out better than the others. The first and most obvious is that the models are overfitting the training data. This has been apparent throughout the trial and error on this dataset, as I have consistently found cross-validation scores and held-out test set accuracy to be about 5 percentage points higher than the public leaderboard score. Tuning hyperparameters against those same cross-validation scores may make this worse, which would explain why the default settings generalize better here. It could also be the case that the prediction test set is distributed differently from the training set in terms of its features, leading to a mismatch between the training accuracy and final prediction accuracy. Finally, any accuracy measured on a finite test set is a noisy estimate, so the performance of a model can vary from one test set to another. It could be the case that SVC really does have a worse CV score than Gradient Boosting, but it just happened to perform better on the one test set that is scored on the public leaderboard.
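
To put a rough number on how much a score can move from one test set to another, here is a minimal sketch that treats each prediction as an independent coin flip (a binomial model) and computes the standard error of an accuracy estimate. The helper names `accuracy_standard_error` and `accuracy_interval` are my own, and the calculation assumes all 418 rows of the prediction test set count toward the score. If the public leaderboard scores only a subset of them, the uncertainty would be larger.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Standard error of an accuracy estimate under a simple binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def accuracy_interval(accuracy: float, n_samples: int, z: float = 1.96):
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    se = accuracy_standard_error(accuracy, n_samples)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


# 81.33% is the leaderboard score; 418 is the number of test-set predictions.
# Assumption: every one of those rows is used for scoring.
se = accuracy_standard_error(0.8133, 418)
low, high = accuracy_interval(0.8133, 418)

print(f"standard error: {se:.4f}")
print(f"approx. 95% interval: {low:.4f} to {high:.4f}")
```

Under these assumptions the standard error comes out at roughly 1.9 percentage points, so a gap of less than one percentage point between two models is well within the noise of a single test set.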

Whatever the reason, I was able to score 81.33%, which was in the top 2% of the leaderboard at the time of my submission. Not too shabby!

Happy Kaggling :) 

In [36]:
# Preparing dataset for final predictions
features = [
    'SibSp', 'Parch', 'In_WCG',
    'WCG_not_survived', 'WCG_survived',
    'Party_size', 'Is_alone', 'Fare_per_passenger', 'Title_Miss.',
    'Title_Mr.', 'Title_Mrs.', 'Title_Uncommon', 'Sex_male', 'Pclass_2',
    'Pclass_3', 'Embarked_Q', 'Embarked_S', 'Age_cat']

# Only Fare_per_passenger is scaled; every other feature is passed through as-is
num_features = ['Fare_per_passenger']
cat_features = [col for col in features if col not in num_features]

X_train = train_original[features]
y_train = train_original[['Survived']].values.ravel()
X_test = test_original[features]

# Scaling the Fare_per_passenger column to create scaled training and test sets
scaler = StandardScaler()

X_train_num = X_train.loc[:, num_features]
X_test_num = X_test.loc[:, num_features]

X_train_cat = X_train.loc[:, cat_features]
X_test_cat = X_test.loc[:, cat_features]

# Fit the scaler on the training data only, then apply it to the test data
X_train_num_scaled = scaler.fit_transform(X_train_num)
X_test_num_scaled = scaler.transform(X_test_num)

# Concatenate scaled numerical and categorical columns
X_train_scaled = np.concatenate([X_train_num_scaled, X_train_cat], axis=1)
X_test_scaled = np.concatenate([X_test_num_scaled, X_test_cat], axis=1)

In [37]:
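# SVC with default hyperparameters; random_state is ignored unless probability=True
best_model = SVC(random_state=41)
best_model.fit(X_train_scaled, y_train)
final_prediction = best_model.predict(X_test_scaled)

# Build the submission frame: one predicted label per test passenger
submission = pd.DataFrame({
    "PassengerId": test_original["PassengerId"],
    "Survived": final_prediction
})

# Sanity checks: class balance of the predictions and column dtypes
display(submission['Survived'].value_counts())
display(submission.dtypes)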

0    274
1    144
Name: Survived, dtype: int64

PassengerId    int64
Survived       int64
dtype: object

In [38]:
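# Survived is already int64 (see the dtypes above), so this cast is only a safeguard
submission['Survived'] = submission['Survived'].astype(int)
# Write the file without the DataFrame index so it contains only the two columns
submission.to_csv('submission.csv', index=False)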
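
Before uploading, it can be worth re-reading the file and checking its shape. The following is a minimal sketch of such a check using only the standard library. The function `validate_submission` is a hypothetical helper of my own, not part of pandas or Kaggle, and the rules it enforces (exactly the two columns `PassengerId` and `Survived`, unique IDs, labels of 0 or 1, an expected row count) are assumptions based on the submission built above.

```python
import csv
import io


def validate_submission(csv_text: str, expected_rows: int) -> list:
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))

    # The header must be exactly the two columns written by to_csv above
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems

    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # Each passenger should appear once
    ids = [row["PassengerId"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate PassengerId values")

    # Predictions should be binary labels
    bad_labels = sorted({row["Survived"] for row in rows} - {"0", "1"})
    if bad_labels:
        problems.append(f"non-binary Survived values: {bad_labels}")

    return problems


# Made-up two-row sample, only to show the call; not real predictions
sample = "PassengerId,Survived\n1,0\n2,1\n"
print(validate_submission(sample, expected_rows=2))
```

For the real file, the call would look like `validate_submission(open('submission.csv').read(), expected_rows=418)`, where 418 is the number of predictions counted above.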