# Titanic - Machine Learning from Disaster

## 3. Model Building

In this notebook, we will:
1. Load the processed data.
2. Split the data into training and validation sets.
3. Build several machine learning models to predict survival.
4. Evaluate the models on the validation set using accuracy, precision, recall, F1-score, and the confusion matrix.

### 3.1. Load the Processed Data

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the processed dataset
train_data = pd.read_csv('train_processed.csv')

# Display the first few rows of the processed train data
train_data.head()


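| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | HasCabin | FamilySize | FamilySurvival | FarePerPerson | Age_Pclass | IsHighFare | Sex_male | Embarked_Q | Embarked_S | AgeGroup_Teen | AgeGroup_Young Adult | AgeGroup_Middle Age | AgeGroup_Senior |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 2 | 1 | 3.625 | 66.0 | 0 | True | False | True | False | True | False | False |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 2 | 1 | 35.64165 | 38.0 | 1 | False | False | False | False | False | True | False |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 0 | 7.925 | 78.0 | 0 | False | False | True | False | True | False | False |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 2 | 1 | 26.55 | 35.0 | 1 | False | False | True | False | True | False | False |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 0 | 8.05 | 105.0 | 0 | True | False | True | False | True | False | False |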


*Comment:* The processed dataset keeps core features such as `Pclass`, `Age`, and `Fare`, alongside engineered features such as `FamilySize`, `FamilySurvival`, `FarePerPerson`, `Age_Pclass`, and `IsHighFare`, plus one-hot encoded `Sex`, `Embarked`, and `AgeGroup` columns.


### 3.2. Splitting the Data into Training and Validation Sets

In [42]:
# Define the features (X) and target variable (y)
X = train_data.drop('Survived', axis=1)  # Features: all remaining columns, including PassengerId
y = train_data['Survived']               # Target

# Split the data into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Confirm the shapes of the split data
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((712, 19), (179, 19), (712,), (179,))
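
The split above is random and keeps `PassengerId` among the 19 feature columns. As a hedged sketch only, the snippet below shows one possible variation: dropping the row identifier and passing `stratify=` so both splits keep a similar survival rate. The `toy` DataFrame and its values are hypothetical stand-ins, not the real `train_processed.csv`, and applying this variation to the notebook would change the scores reported later.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed Titanic data (not the real file).
toy = pd.DataFrame({
    "PassengerId": range(1, 101),
    "Pclass": [1, 2, 3, 3] * 25,
    "Fare": [float(i % 50) for i in range(100)],
    "Survived": [1] * 38 + [0] * 62,
})

# Assumption: the row identifier carries no predictive signal, so drop it with the target.
X_toy = toy.drop(columns=["Survived", "PassengerId"])
y_toy = toy["Survived"]

# stratify=y_toy keeps the class proportions similar in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"train survival rate: {y_tr.mean():.3f}")
print(f"val survival rate:   {y_va.mean():.3f}")
```

Stratification matters most on small datasets, where a purely random 20% sample can drift away from the overall class balance.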

### 3.3. Building Machine Learning Models

#### 3.3.1. Logistic Regression

In [46]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialise the scaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and validation sets
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Initialise the Logistic Regression model
logreg = LogisticRegression(max_iter=1000)

# Train the model on the scaled training data
logreg.fit(X_train_scaled, y_train)

# Predict on the validation set
y_pred_logreg = logreg.predict(X_val_scaled)

# Calculate the accuracy
logreg_accuracy = accuracy_score(y_val, y_pred_logreg)
print(f"Logistic Regression Accuracy: {logreg_accuracy:.4f}")

# Define a reusable evaluation helper
def evaluate_model(model, X_val, y_val, model_name):
    """Evaluates the model and prints the classification report and confusion matrix."""
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    print(f"{model_name} Accuracy: {accuracy:.4f}")
    print(f"{model_name} Classification Report:")
    print(classification_report(y_val, y_pred))
    print(f"{model_name} Confusion Matrix:")
    print(confusion_matrix(y_val, y_pred))

# Evaluate the model
evaluate_model(logreg, X_val_scaled, y_val, "Logistic Regression")


Logistic Regression Accuracy: 0.8101
Logistic Regression Accuracy: 0.8101
Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179

Logistic Regression Confusion Matrix:
[[90 15]
 [19 55]]
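
The scale-then-fit steps in the cell above can also be bundled into a single estimator, which makes it harder to fit the scaler on validation data by accident. The following is a minimal sketch using scikit-learn's `make_pipeline`; it runs on synthetic data from `make_classification` rather than the Titanic features, and the names `logreg_pipe` and `pipe_accuracy` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption, used only for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then fits the classifier.
logreg_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the validation data before predicting.
pipe_accuracy = logreg_pipe.score(X_va, y_va)
print(f"Pipeline accuracy on synthetic data: {pipe_accuracy:.4f}")
```

A pipeline like this can be passed to a helper such as `evaluate_model` with the unscaled validation features, since scaling happens inside `predict`.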


#### 3.3.2. Random Forest Classifier

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialise Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Predict on the validation set
y_pred_rf = rf.predict(X_val)

# Calculate the accuracy
rf_accuracy = accuracy_score(y_val, y_pred_rf)
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")

# Evaluate Random Forest
evaluate_model(rf, X_val, y_val, "Random Forest")

Random Forest Accuracy: 0.8324
Random Forest Accuracy: 0.8324
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       105
           1       0.81      0.77      0.79        74

    accuracy                           0.83       179
   macro avg       0.83      0.82      0.83       179
weighted avg       0.83      0.83      0.83       179

Random Forest Confusion Matrix:
[[92 13]
 [17 57]]
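
A fitted random forest also exposes impurity-based feature importances, which can hint at which columns drive its predictions. The sketch below is illustrative only: it trains on synthetic data with made-up column names (`feature_0`, `feature_1`, ...), so the ranking it prints says nothing about the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (not the Titanic features).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ sums to 1.0; higher values mean larger impurity reductions.
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favour high-cardinality numeric columns, so they are best read as a rough guide rather than a definitive ranking.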


#### 3.3.3. Gradient Boosting Classifier

In [52]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialise Gradient Boosting model
gb = GradientBoostingClassifier(random_state=42)

# Train the model
gb.fit(X_train, y_train)

# Predict on the validation set
y_pred_gb = gb.predict(X_val)

# Calculate the accuracy
gb_accuracy = accuracy_score(y_val, y_pred_gb)
print(f"Gradient Boosting Accuracy: {gb_accuracy:.4f}")

# Evaluate Gradient Boosting
evaluate_model(gb, X_val, y_val, "Gradient Boosting")

Gradient Boosting Accuracy: 0.8324
Gradient Boosting Accuracy: 0.8324
Gradient Boosting Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       105
           1       0.84      0.73      0.78        74

    accuracy                           0.83       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.83      0.83      0.83       179

Gradient Boosting Confusion Matrix:
[[95 10]
 [20 54]]
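
Random Forest and Gradient Boosting report the same accuracy (0.8324) on this single 179-row validation split, so one split cannot separate them. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is a hedged illustration on synthetic data, not a rerun of the notebook; the `models` and `cv_results` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data stands in for the Titanic features (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Five stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

cv_results = {}
for name, model in models.items():
    # One accuracy score per fold; the spread shows how stable the estimate is.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```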


### 3.4. Evaluate the Models

#### Summary of Key Findings:
- **Gradient Boosting** ties with Random Forest for the highest validation accuracy (0.8324). It has the highest precision for survivors (0.84) but the lowest recall (0.73), missing 20 of the 74 survivors.
- **Logistic Regression** has the lowest accuracy (0.8101). It misses 19 survivors, one fewer than Gradient Boosting but two more than Random Forest.
- **Random Forest** matches Gradient Boosting's accuracy and has the best recall (0.77) and F1-score (0.79) for survivors, missing 17.

Gradient Boosting is selected for submission, although Random Forest matches its accuracy on this split and recovers more survivors, so the choice is close. Further work can focus on improving recall for survivors and on exploring more advanced techniques.
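
The comparison can be rechecked directly from the confusion matrices printed in Section 3.3. The sketch below copies those three matrices and derives accuracy, survivor precision, survivor recall, and the number of missed survivors (false negatives) for each model; the `confusion` and `summary` names are illustrative.

```python
import numpy as np

# Confusion matrices as printed in Section 3.3 (rows: actual 0/1, columns: predicted 0/1).
confusion = {
    "Logistic Regression": np.array([[90, 15], [19, 55]]),
    "Random Forest": np.array([[92, 13], [17, 57]]),
    "Gradient Boosting": np.array([[95, 10], [20, 54]]),
}

summary = {}
for name, cm in confusion.items():
    tn, fp, fn, tp = cm.ravel()
    summary[name] = {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),   # share of predicted survivors who survived
        "recall": tp / (tp + fn),      # share of actual survivors who were found
        "missed_survivors": int(fn),
    }
    m = summary[name]
    print(
        f"{name}: accuracy={m['accuracy']:.4f}, precision={m['precision']:.2f}, "
        f"recall={m['recall']:.2f}, missed={m['missed_survivors']}"
    )
```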