# 🛳️ Final Submission

This notebook wraps up the Titanic survival prediction project.  
After evaluating all candidate models, we select the best-performing one, generate the final `.csv` file for submission, and summarize key takeaways from the project.

---

## 🎯 Purpose

To finalize the project by:
- Selecting the most suitable model based on evaluation metrics  
- Generating the submission file in the required format  
- Reflecting briefly on the overall process

## 📦 Dataset

Same processed dataset used in earlier notebooks:  
[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)  
via public repository: [Data Science Dojo GitHub](https://github.com/datasciencedojo/datasets)


## 📦 1. Load the Dataset & Predictions

We load the same processed dataset and reuse the train/test split  
to ensure consistency with the previous evaluation.

In [100]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load processed dataset
df = pd.read_csv("feature_engineered_titanic.csv")

# Target and features
y = df['Survived']
X_full = df.drop(columns=['Survived', 'Name', 'Ticket', 'PassengerId'])
X_safe = X_full.drop(columns=['Cabin', 'Title'])

# Shared train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_safe, y, test_size=0.2, random_state=42
)
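
Why the fixed `random_state` matters can be illustrated with a minimal sketch. It uses a small synthetic frame rather than the Titanic data (the column name `feature` is a placeholder, not one of the real features) and suggests that two calls to `train_test_split` with the same seed and test size hold out the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the processed dataset (hypothetical column, not a real feature)
toy = pd.DataFrame({"feature": range(20), "Survived": [0, 1] * 10})
X_toy = toy[["feature"]]
y_toy = toy["Survived"]

# Two independent calls with the same seed and test size
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Element 1 of each result is the held-out feature frame; compare its row labels
same_test_rows = split_a[1].index.equals(split_b[1].index)
print(same_test_rows)
```

This is why reusing `random_state=42` here should reproduce the partition from the evaluation notebook, provided the input file and row order are unchanged.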

## 🤖 2. Train the Final Model

Based on the evaluation results,  
we select `GradientBoostingClassifier` as the final model due to its consistently strong performance across all metrics, including F1 Score and AUC.

We'll now retrain it on the 80% training split and generate predictions for the held-out 20%.

In [101]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize and train the final model
final_model = GradientBoostingClassifier(random_state=42)
final_model.fit(X_train, y_train)

# Predict on the held-out split (not the Kaggle test set)
final_preds = final_model.predict(X_test)
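
The hold-out predictions can be scored with the two metrics named above, F1 and ROC AUC. The following is a sketch only: it runs on synthetic data from `make_classification`, and the `demo_*` names are placeholders. In the notebook itself the same calls would presumably take `final_model`, `X_test`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the Titanic features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = GradientBoostingClassifier(random_state=42)
demo_model.fit(X_tr, y_tr)

# F1 is computed from hard class labels,
# ROC AUC from the predicted probability of the positive class
demo_preds = demo_model.predict(X_te)
demo_proba = demo_model.predict_proba(X_te)[:, 1]

demo_f1 = f1_score(y_te, demo_preds)
demo_auc = roc_auc_score(y_te, demo_proba)
print(f"F1: {demo_f1:.3f}  ROC AUC: {demo_auc:.3f}")
```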

## 📄 3. Generate the Final Submission File (for Kaggle)

To prepare for submission, we retrain the final model using the full training set  
and generate predictions on the processed test set.

We apply the same feature engineering pipeline to the test set using a custom preprocessing function.

Before prediction, we ensure the test data:
- Has exactly the features used in training, in the same column order (columns missing from the test set are added and filled with 0)
- Contains only numeric values (with categorical features properly encoded)

The output follows Kaggle's required format:
- `PassengerId`: from the original `test.csv`
- `Survived`: predicted survival (0 or 1)

The final file will be saved as `submission.csv`.

In [102]:
def preprocess_test_for_submission(df):
    # Fill missing values
    df["Age"] = df.groupby(["Pclass", "Sex"])["Age"].transform(lambda x: x.fillna(x.median()))
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("S")

    # Title extraction & simplification
    df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    df["Title"] = df["Title"].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don',
                                       'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    df["Title"] = df["Title"].replace(['Mlle', 'Ms'], 'Miss')
    df["Title"] = df["Title"].replace('Mme', 'Mrs')

    # Family features
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

    # Additional engineered features (same as training)
    df["LowFare"] = (df["Fare"] <= 7.91).astype(int)
    df["AgeBin"] = pd.cut(df["Age"], bins=[0, 16, 32, 48, 64, 80], labels=False)
    df["ModerateFamily"] = df["FamilySize"].apply(lambda x: 1 if 2 <= x <= 4 else 0)
    df["IsCherbourg"] = (df["Embarked"] == "C").astype(int)
    df["FemaleFirstSecondClass"] = ((df["Sex"] == "female") & (df["Pclass"] != 3)).astype(int)
    df["IsChildOrElderly"] = ((df["Age"] <= 10) | (df["Age"] >= 60)).astype(int)
    df["LowFare_3rdClass"] = ((df["Pclass"] == 3) & (df["Fare"] < 7.91)).astype(int)

    # One-hot encoding: Pclass
    df = pd.get_dummies(df, columns=["Pclass"], prefix="Pclass")
    # One-hot encoding: Embarked
    df = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked")
    # One-hot encoding: Cabin first letter
    df["Cabin"] = df["Cabin"].fillna("U")
    df["Cabin"] = df["Cabin"].str[0]
    df = pd.get_dummies(df, columns=["Cabin"], prefix="Cabin")

    # Drop unnecessary columns
    df.drop(columns=["Name", "Ticket", "PassengerId"], inplace=True)

    return df

In [103]:
df_test = pd.read_csv("test.csv")
X_kaggle = preprocess_test_for_submission(df_test.copy())
X_kaggle.to_csv("feature_engineered_test.csv", index=False)
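
The two requirements listed earlier (every training feature present, only numeric values) can be checked before prediction. The helper below is a hypothetical sketch, not part of the original pipeline: `check_features` and the toy column names are assumptions chosen for illustration.

```python
import pandas as pd

def check_features(train_columns, X_new):
    """Return (missing, extra, non_numeric) column lists for X_new."""
    missing = [c for c in train_columns if c not in X_new.columns]
    extra = [c for c in X_new.columns if c not in train_columns]
    non_numeric = X_new.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    return missing, extra, non_numeric

# Toy frames with hypothetical columns
train_columns = ["Age", "Sex", "Pclass_1", "Cabin_T"]
X_new = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Pclass_1": [0, 1],
    "Title": ["Mr", "Mrs"],
})

missing, extra, non_numeric = check_features(train_columns, X_new)
print(missing, extra, non_numeric)

# Encode the text column, then align to the training columns by name:
# missing columns are added as 0, extra columns are dropped, order is fixed
X_fixed = X_new.assign(Sex=X_new["Sex"].map({"male": 0, "female": 1}))
X_fixed = X_fixed.reindex(columns=train_columns, fill_value=0)
print(check_features(train_columns, X_fixed))
```

After encoding and `reindex`, all three lists should come back empty, which is the state the model expects at prediction time.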

In [104]:
# Load full training data
df_train = pd.read_csv("feature_engineered_titanic.csv")
X_all = df_train.drop(columns=['Survived', 'Name', 'Ticket', 'PassengerId', 'Cabin', 'Title'])
y_all = df_train['Survived']

# Load Kaggle test data
df_test = pd.read_csv("test.csv")
X_kaggle = pd.read_csv("feature_engineered_test.csv") 

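# Encode the remaining object-type columns as integers.
# Note: X_all excludes 'Title', so the alignment step below drops it again;
# the mapping only matters if 'Title' is kept as a training feature.
X_kaggle["Sex"] = X_kaggle["Sex"].map({"male": 0, "female": 1}).astype(int)
title_map = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4}
X_kaggle["Title"] = X_kaggle["Title"].map(title_map).astype(int)

# Align test features to the training set by column name:
# missing columns are added as 0, extra columns are dropped, order is matched
X_kaggle = X_kaggle.reindex(columns=X_all.columns, fill_value=0)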

# Retrain final model on the full training set
final_model = GradientBoostingClassifier(random_state=42)
final_model.fit(X_all, y_all)

# Generate predictions on Kaggle test data
kaggle_preds = final_model.predict(X_kaggle)

# Create submission file
submission = pd.DataFrame({
    "PassengerId": df_test["PassengerId"],
    "Survived": kaggle_preds
})

submission.to_csv("submission.csv", index=False)
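
Before uploading, the file layout described above (`PassengerId` plus a 0/1 `Survived` column) can be verified with a small check. This is a sketch under assumptions: `validate_submission` is a hypothetical helper, and the toy IDs and predictions are made up for the example.

```python
import pandas as pd

def validate_submission(submission, passenger_ids):
    """Return True if the frame matches the expected layout, else raise ValueError."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError("expected exactly the columns PassengerId and Survived")
    if submission["PassengerId"].tolist() != list(passenger_ids):
        raise ValueError("PassengerId values differ from the test set")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")
    return True

# Toy data (illustrative IDs and predictions, not real output)
toy_ids = [892, 893, 894]
toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})

is_valid = validate_submission(toy_submission, toy_ids)
print(is_valid)
```

In the notebook, the equivalent call would presumably be `validate_submission(submission, df_test["PassengerId"])`.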

The submission was uploaded to Kaggle and received a public score of **0.79186**.  
This score reflects the performance of the final `GradientBoostingClassifier` model trained on the full training set.

### 🧠 Summary
In this notebook, I finalized the Titanic survival prediction project by:

- Selecting `GradientBoostingClassifier` as the final model based on F1 score and ROC AUC  
- Retraining the model, first on the training split and then on the full training set  
- Generating predictions and saving the result as `submission.csv` in Kaggle format

This marks the completion of the core workflow.  
Future improvements could include hyperparameter tuning, model stacking, or feature expansion.
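
As one possible starting point for the hyperparameter tuning mentioned above, here is a minimal sketch using `GridSearchCV`. It was not run as part of this project: the data is synthetic, and the grid values are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the processed Titanic features (assumption)
X_tune, y_tune = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative grid; the values are placeholders, not tuned choices
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # same metric family used for model selection
    cv=3,
)
search.fit(X_tune, y_tune)

print(search.best_params_, round(search.best_score_, 3))
```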

## 📦 Overall Summary

This notebook concludes a 7-part beginner-friendly Titanic survival prediction project.

Throughout the series, I explored:
- Data cleaning and visualization (01–02)
- Feature engineering (03–04)
- Model building and evaluation (05–06)
- Final model selection and submission preparation (07)

By completing this project, I practiced core data science skills such as:
- Exploratory data analysis (EDA)
- Feature preprocessing and transformation
- Model selection and evaluation using multiple metrics
- Real-world prediction workflow aligned with Kaggle submission format

This was my first fully documented project on GitHub,  
and it helped me build both skills and confidence in the data science workflow.
