# 🤖 Modeling

This notebook focuses on building and evaluating predictive models  
for passenger survival using the engineered Titanic dataset.

---

## 🎯 Purpose

To establish a baseline model and compare multiple classification algorithms  
using engineered features created in the previous notebook.

## 📦 Dataset

Same dataset used in previous notebooks:  
[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)  
via public repository: [Data Science Dojo GitHub](https://github.com/datasciencedojo/datasets)

## 📦 1. Load the Dataset

We begin by loading the feature-engineered dataset saved from the previous notebook.  
This dataset contains all processed and engineered features, ready for modeling.

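The estimators used in this notebook expect numeric input, so raw string columns must be dropped or encoded first.  
We prepare two versions of the feature set:

- `X_full`: includes all engineered features, including string columns that would still need encoding  
- `X_safe`: excludes object-type columns like `Cabin` and `Title`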

In [94]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the processed dataset
df = pd.read_csv("feature_engineered_titanic.csv")

# Target variable
y = df['Survived']

# Feature sets
X_full = df.drop(columns=['Survived', 'Name', 'Ticket', 'PassengerId'])  # all engineered features (string columns still need encoding)
X_safe = X_full.drop(columns=['Cabin', 'Title'])                         # numeric-only subset used to train the models below
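
One way to confirm that a feature set such as `X_safe` is numeric-only is to list any columns that are neither numeric nor boolean. The sketch below uses a small made-up frame (`toy`) in place of the real CSV; its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the engineered dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Cabin": ["Unknown", "C85", "Unknown"],
    "Title": ["Mr", "Mrs", "Miss"],
})

# Columns that are neither numeric nor boolean (here: the string columns)
non_numeric = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Dropping them leaves a numeric-only feature set, analogous to X_safe
toy_safe = toy.drop(columns=non_numeric)
remaining = toy_safe.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print("Non-numeric columns:", non_numeric)
print("Non-numeric columns left after dropping:", remaining)
```

Running the same check on the real `X_safe` would show whether any string column slipped through before a model is fitted.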

## 📈 2. Baseline Model: Logistic Regression

Logistic regression is a simple and interpretable classification algorithm,  
making it a great starting point for binary classification tasks like survival prediction.

However, it requires all input features to be numeric.  
We use the `X_safe` feature set prepared earlier, which excludes non-numeric columns.

In [95]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Use the safe feature set
X_train, X_test, y_train, y_test = train_test_split(
    X_safe, y, test_size=0.2, random_state=42
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(f"Baseline Accuracy: {accuracy_score(y_test, y_pred):.4f}")

Baseline Accuracy: 0.8045
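
An accuracy figure is easier to interpret next to a trivial reference point. The sketch below compares logistic regression against a majority-class predictor (`DummyClassifier`). It runs on synthetic data from `make_classification`, which is only a stand-in for the Titanic features, so the printed numbers will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in data (roughly 60% / 40%), an assumption for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, weights=[0.6, 0.4], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Reference point: always predict the most frequent training class
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
logreg_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

dummy_acc = accuracy_score(y_te, dummy.predict(X_te))
logreg_acc = accuracy_score(y_te, logreg_demo.predict(X_te))

print(f"Majority-class accuracy:      {dummy_acc:.4f}")
print(f"Logistic regression accuracy: {logreg_acc:.4f}")
```

On the real data, the same comparison would show how much of the baseline score comes from the features rather than from class imbalance alone.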


## 📚 3. Compare Multiple Models

In this section, we train and evaluate several classification algorithms  
to compare their predictive performance on the Titanic dataset.

In the previous section, we established a **baseline** using Logistic Regression,  
which is simple and interpretable but requires numeric inputs.

Now we move on to additional models to see if they can outperform our baseline.  
All of them are trained on the same `X_safe` train/test split, so their scores are directly comparable.

We will evaluate the following models:

- Decision Tree
- Random Forest
- Gradient Boosting
- K-Nearest Neighbors
- XGBoost

### 3-1. Decision Tree Classifier

Decision trees split the dataset based on feature values  
to make predictions through a series of learned rules.

They are intuitive and can handle both numerical and categorical variables (if encoded).

In [96]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a decision tree using the same train/test split
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Predict and evaluate
y_pred_tree = tree.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree):.4f}")

Decision Tree Accuracy: 0.7709
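
To make the "series of learned rules" concrete, a fitted tree can be printed as text with `export_text`. The sketch below uses scikit-learn's bundled iris data as a stand-in and caps the depth at 2 for readability; both choices are assumptions for illustration and differ from the unrestricted tree trained above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled example data, used here only as a stand-in for the Titanic features
iris = load_iris()

# A shallow tree keeps the printed rule set short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style rules
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Applied to the notebook's `tree`, the same call with `feature_names=list(X_train.columns)` would list the actual splits, although an unrestricted tree can produce a very long printout.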


### 3-2. Random Forest Classifier

Random Forest is an ensemble method that combines multiple decision trees  
to improve accuracy and reduce overfitting.

It works well with both numerical and encoded categorical features.

In [97]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a random forest model using the same train/test split
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")


Random Forest Accuracy: 0.7877
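
One way to see the overfitting that ensembling is meant to reduce is to compare training and test accuracy for a single tree and a forest. The sketch below does this on synthetic, noisy data from `make_classification`. The data and the `gaps` dictionary are illustrative assumptions, and the numbers will differ from the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "Single tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Store (train accuracy, test accuracy) per model
gaps = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    gaps[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={gaps[name][0]:.4f}, test={gaps[name][1]:.4f}")
```

A large difference between the two numbers for a model suggests it has memorized the training rows.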


### 3-3. Gradient Boosting Classifier

Gradient Boosting builds trees sequentially,  
with each tree correcting the errors of the previous one.

It often achieves strong performance on structured datasets.

In [98]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Train a gradient boosting model using the same train/test split
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

# Predict and evaluate
y_pred_gb = gb.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")

Gradient Boosting Accuracy: 0.8212
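
The sequential nature of boosting can be observed with `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below runs on synthetic stand-in data, so the accuracies are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_demo.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., 100 boosting stages
staged_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: {staged_acc[n_trees - 1]:.4f}")
```

Plotting `staged_acc` against the number of trees is a common way to judge whether fewer or more estimators would be worthwhile.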


### 3-4. K-Nearest Neighbors

KNN classifies each sample by the majority label among its nearest neighbors.  
It is simple but sensitive to feature scaling and works best with numeric inputs.

We use the `X_safe` feature set so that all features are numeric. Note that the features are not rescaled in this cell.

In [99]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train a KNN model on the existing split
# (X_train / X_test already come from X_safe, so no second split is needed)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knn.predict(X_test)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}")


KNN Accuracy: 0.7318
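
Because KNN relies on distances, features with large ranges can dominate the neighbor search. A minimal sketch of one common remedy, standardizing inside a pipeline, is shown below. It uses synthetic data with one artificially inflated column, so it is an illustration under stated assumptions and not a re-run of the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic mixed feature scales
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on raw features vs. KNN on standardized features
knn_raw = KNeighborsClassifier().fit(X_tr, y_tr)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_acc = knn_raw.score(X_te, y_te)
scaled_acc = knn_scaled.score(X_te, y_te)

print(f"KNN without scaling: {raw_acc:.4f}")
print(f"KNN with scaling:    {scaled_acc:.4f}")
```

Keeping the scaler inside the pipeline means it is fitted on the training rows only, which avoids leaking test-set statistics into the model.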


### 3-5. XGBoost Classifier

XGBoost is a powerful and efficient gradient boosting implementation  
that often delivers high performance on structured data.

By default, it expects numeric input features, so we again use the `X_safe` split.

In [100]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Train XGBoost on the existing split
# (X_train / X_test already come from X_safe, so no second split is needed)
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred_xgb = xgb.predict(X_test)
print(f"XGBoost Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")

XGBoost Accuracy: 0.7989


## 📊 4. Compare Model Performance

We summarize the accuracy scores of all models tested above.  
This allows us to compare their relative performance and identify which performed best on the held-out test set.

In [101]:
import pandas as pd

# Collect all model scores
model_scores = {
    "Logistic Regression": accuracy_score(y_test, y_pred),
    "Decision Tree": accuracy_score(y_test, y_pred_tree),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "Gradient Boosting": accuracy_score(y_test, y_pred_gb),
    "K-Nearest Neighbors": accuracy_score(y_test, y_pred_knn),
    "XGBoost": accuracy_score(y_test, y_pred_xgb),
}

# Create dataframe and sort by accuracy
results_df = pd.DataFrame.from_dict(model_scores, orient='index', columns=['Accuracy'])
results_df = results_df.sort_values(by='Accuracy', ascending=False)
results_df

| Model | Accuracy |
|---|---|
| Gradient Boosting | 0.821229 |
| Logistic Regression | 0.804469 |
| XGBoost | 0.798883 |
| Random Forest | 0.787709 |
| Decision Tree | 0.770950 |
| K-Nearest Neighbors | 0.731844 |
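
A single train/test split can rank models differently from one random seed to the next. As a hedged sketch of a more stable comparison, the code below scores a few models with 5-fold stratified cross-validation. It runs on synthetic stand-in data, and the model selection and the `cv_results` name are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# The same folds are used for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    rows.append({"Model": name, "Mean accuracy": scores.mean(), "Std": scores.std()})

cv_results = (
    pd.DataFrame(rows)
    .set_index("Model")
    .sort_values("Mean accuracy", ascending=False)
)
print(cv_results)
```

The standard deviation column gives a rough sense of whether the gap between two models is larger than the fold-to-fold variation.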


## 🧠 Summary

In this notebook, I built and compared several machine learning models  
to predict survival on the Titanic dataset.

Here's what I worked on:

- Loaded the processed dataset with all engineered features
- Prepared two versions of the input data:  
  `X_full` with all engineered features and `X_safe` with numeric columns only (used to train every model here)
- Started with a simple logistic regression as a baseline
- Then tested five other models:
  - Decision Tree
  - Random Forest
  - Gradient Boosting
  - K-Nearest Neighbors
  - XGBoost
- Measured and compared their accuracy on the test set

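Among them, **Gradient Boosting** performed the best with an accuracy of 0.82,  
slightly ahead of Logistic Regression (0.80) and XGBoost (0.80).  
The reported accuracies are consistent with a test set of 179 rows, so the lead over the baseline amounts to three more correct predictions and the ranking should be read as tentative.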

This gives a good sense of how different models behave on this dataset,  
and sets the stage for deeper evaluation or final model selection in the next step.
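
To put the accuracy gaps in perspective, the scores can be converted back into counts of correct predictions. The sketch below assumes the test split holds 179 rows (20% of the 891-row Titanic training file, rounded up), which is consistent with the reported accuracies; the accuracy values are copied from the results table.

```python
# Assumption: 20% of 891 rows, rounded up, gives a 179-row test set
n_test = 179

# Accuracies copied from the results table
accuracies = {
    "Gradient Boosting": 0.821229,
    "Logistic Regression": 0.804469,
    "XGBoost": 0.798883,
    "Random Forest": 0.787709,
    "Decision Tree": 0.770950,
    "K-Nearest Neighbors": 0.731844,
}

# Convert each accuracy into a number of correctly classified test rows
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

for name, count in correct.items():
    print(f"{name}: {count} / {n_test} correct")

# Gap between the best model and the logistic regression baseline
gap = correct["Gradient Boosting"] - correct["Logistic Regression"]
print(f"Gap to baseline: {gap} predictions")
```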