# 🧾 Titanic Survival Prediction Assignment
## Objective:
Predict passenger survival on the Titanic using two gradient-boosting models, LightGBM and XGBoost. The workflow covers preprocessing, model training, evaluation on a held-out validation set, and prediction on the test set.


## 1. Import Required Libraries

In [1]:

#  1. Import Required Libraries
import pandas as pd
import numpy as np
# StandardScaler is not imported: nothing in this notebook scales features
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

# ML Libraries
import lightgbm as lgb
import xgboost as xgb

# For warnings
import warnings
warnings.filterwarnings('ignore')


## 2. Load Training and Test Datasets

In [6]:

#  2. Load Training and Test Datasets
file_path_1=r"D:\Data sciences\Assignments\Assignment files\Assignment files Extracs\XGBM & LGBM\Titanic_train.csv"

file_path_2=r"D:\Data sciences\Assignments\Assignment files\Assignment files Extracs\XGBM & LGBM\Titanic_test.csv"
train = pd.read_csv(file_path_1)
test = pd.read_csv(file_path_2)

print("Train Shape:", train.shape)
print("Test Shape:", test.shape)


Train Shape: (891, 12)
Test Shape: (418, 11)


### Inference:
• The training set has 891 rows and 12 columns.

• The test set has 418 rows and 11 columns.


## 3. Data Preprocessing (Training Data)

In [7]:
#  3. Data Preprocessing (Training Data)
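# Fill missing Age and Embarked values. Assign the result back instead of
# calling fillna(inplace=True) on a column selection, which newer pandas
# versions warn about and may not apply to the original frame.
train["Age"] = train["Age"].fillna(train["Age"].median())
if "Embarked" in train.columns:
    train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
    # One-hot encode Embarked; drop_first removes the first level,
    # leaving Embarked_Q and Embarked_S
    embarked_dummies = pd.get_dummies(train["Embarked"], prefix="Embarked", drop_first=True)
    train = pd.concat([train, embarked_dummies], axis=1)
    train.drop("Embarked", axis=1, inplace=True)  # drop original after encoding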

# Encode 'Sex' column
train["Sex_male"] = train["Sex"].map({"male": 1, "female": 0})

# Features and target
features = ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_male", "Embarked_Q", "Embarked_S"]
X = train[features]
y = train["Survived"]

# Train-Validation Split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


### Inference:
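• Missing Age values are filled with the median and missing Embarked values with the mode.

• Sex is mapped to 0/1 and Embarked is one-hot encoded.

• The data is split 80/20 into training and validation sets, stratified on Survived.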
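
The preprocessing steps above can be sketched as a pair of reusable functions, so that the imputation values learned on the training data are applied unchanged to any other frame. This is a minimal sketch under stated assumptions, not the notebook's own code: `fit_stats`, `preprocess`, and the toy `demo` frame are hypothetical names and values introduced for illustration, and the Embarked levels are assumed to be "C", "Q", and "S".

```python
import pandas as pd

def fit_stats(df):
    """Learn imputation values from the training frame only."""
    return {
        "age_median": df["Age"].median(),
        "fare_median": df["Fare"].median(),
        "embarked_mode": df["Embarked"].mode()[0],
    }

def preprocess(df, stats):
    """Return a new frame with imputed and encoded columns."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(stats["age_median"])
    out["Fare"] = out["Fare"].fillna(stats["fare_median"])
    out["Embarked"] = out["Embarked"].fillna(stats["embarked_mode"])
    out["Sex_male"] = out["Sex"].map({"male": 1, "female": 0})
    # Fixed category levels (assumed C/Q/S) guarantee that Embarked_Q and
    # Embarked_S exist even when a level is absent from this frame.
    embarked = out["Embarked"].astype(pd.CategoricalDtype(["C", "Q", "S"]))
    dummies = pd.get_dummies(embarked, prefix="Embarked", drop_first=True)
    return pd.concat([out.drop(columns="Embarked"), dummies], axis=1)

# Toy data (hypothetical values, not taken from the Titanic files)
demo = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, 71.28, None],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "S"],
})
stats = fit_stats(demo)
clean = preprocess(demo, stats)
print(clean)
```

Calling `fit_stats` once on the training frame and passing the result to `preprocess` for both training and test data keeps the two frames consistent by construction.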


## 4. Train Machine Learning Models

In [8]:

# 4. Train Models on Training Data
# LightGBM
lgb_model = lgb.LGBMClassifier(random_state=42)
lgb_model.fit(X_train, y_train)
y_lgb_val = lgb_model.predict(X_val)

# XGBoost (use_label_encoder is omitted: it is deprecated in recent XGBoost releases)
xgb_model = xgb.XGBClassifier(eval_metric="logloss", random_state=42)
xgb_model.fit(X_train, y_train)
y_xgb_val = xgb_model.predict(X_val)


[LightGBM] [Info] Number of positive: 273, number of negative: 439
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000719 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 206
[LightGBM] [Info] Number of data points in the train set: 712, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383427 -> initscore=-0.475028
[LightGBM] [Info] Start training from score -0.475028


### Inference:
• Both the LightGBM and XGBoost models are trained on the training split.


## 5. Evaluate Models on Validation Set

In [9]:

#  5. Evaluate Models on Validation Set
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n {model_name} Results on Validation Set")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred))
    print("F1 Score:", f1_score(y_true, y_pred))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

evaluate_model(y_val, y_lgb_val, "LightGBM")
evaluate_model(y_val, y_xgb_val, "XGBoost")



 LightGBM Results on Validation Set
Accuracy: 0.7988826815642458
Precision: 0.7538461538461538
Recall: 0.7101449275362319
F1 Score: 0.7313432835820896

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.85      0.84       110
           1       0.75      0.71      0.73        69

    accuracy                           0.80       179
   macro avg       0.79      0.78      0.79       179
weighted avg       0.80      0.80      0.80       179

Confusion Matrix:
 [[94 16]
 [20 49]]

 XGBoost Results on Validation Set
Accuracy: 0.8044692737430168
Precision: 0.7575757575757576
Recall: 0.7246376811594203
F1 Score: 0.7407407407407407

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.85      0.84       110
           1       0.76      0.72      0.74        69

    accuracy                           0.80       179
   macro avg       0.79      0.79      0.79       179
weighted a

### Inference:
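• Both models perform similarly on the 179 validation passengers.

• XGBoost has slightly higher accuracy (0.804 vs. 0.799) and F1 score (0.741 vs. 0.731). The accuracy gap corresponds to a single additional correct prediction, so it should not be read as a meaningful difference.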
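
The reported scores follow directly from the confusion matrix. As a check, the sketch below recomputes the LightGBM metrics from the matrix printed above using plain Python. It assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes.

```python
# LightGBM confusion matrix from the validation output:
# [[94 16]
#  [20 49]]
tn, fp = 94, 16   # true class 0: predicted 0, predicted 1
fn, tp = 20, 49   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.7989
print(f"Precision: {precision:.4f}")  # 0.7538
print(f"Recall:    {recall:.4f}")     # 0.7101
print(f"F1 Score:  {f1:.4f}")         # 0.7313
```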


## 6. Preprocess Test Set

In [10]:

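#  6. Preprocess Test Set safely
# Impute with training-set statistics; assign back rather than using a chained inplace fillna
test["Age"] = test["Age"].fillna(train["Age"].median())
test["Fare"] = test["Fare"].fillna(train["Fare"].median())
test["Sex_male"] = test["Sex"].map({"male": 1, "female": 0})

# Handle Embarked safely
if "Embarked" in test.columns:
    # train["Embarked"] was dropped after encoding, so recover its most frequent
    # level from the dummy columns ("C" is assumed to be the level removed by drop_first)
    n_q, n_s = train["Embarked_Q"].sum(), train["Embarked_S"].sum()
    embarked_counts = {"C": len(train) - n_q - n_s, "Q": n_q, "S": n_s}
    embarked_mode = max(embarked_counts, key=embarked_counts.get)
    test["Embarked"] = test["Embarked"].fillna(embarked_mode)
    embarked_dummies = pd.get_dummies(test["Embarked"], prefix="Embarked", drop_first=True)
    test = pd.concat([test, embarked_dummies], axis=1)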

# Ensure all dummy columns exist
for col in ["Embarked_Q", "Embarked_S"]:
    if col not in test.columns:
        test[col] = 0

# Select features for test
X_test = test[features]

### Inference:
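• The test set goes through the same steps as the training set: imputation with training-set medians, the same Sex and Embarked encoding, and the same feature list.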
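
One way to guard against the test frame ending up with different columns than the training features is to align it to the feature list explicitly. The sketch below uses `DataFrame.reindex` with `fill_value=0` as an alternative to adding missing dummy columns in a loop. The `toy_test` frame and its values are hypothetical and only illustrate the case where one Embarked level never occurs.

```python
import pandas as pd

features = ["Pclass", "Age", "SibSp", "Parch", "Fare",
            "Sex_male", "Embarked_Q", "Embarked_S"]

# Hypothetical encoded test frame in which nobody embarked at "Q",
# so the Embarked_Q dummy column was never created.
toy_test = pd.DataFrame({
    "Pclass": [3, 1],
    "Age": [34.5, 47.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Fare": [7.83, 52.0],
    "Sex_male": [1, 0],
    "Embarked_S": [1, 1],
    "Name": ["A", "B"],   # extra column that is not a model feature
})

# reindex keeps the training column order, drops columns that are not
# features, and creates any missing feature column filled with 0.
X_toy = toy_test.reindex(columns=features, fill_value=0)
print(X_toy)
```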


## 7. Predict on Test Set

In [14]:

#  7. Predict on Test Set
test["Survived_LGB"] = lgb_model.predict(X_test)
test["Survived_XGB"] = xgb_model.predict(X_test)


## 8. Save Predictions to CSV

In [None]:

#  8. Save Predictions to CSV
submission_lgb = test[["PassengerId", "Survived_LGB"]].rename(columns={"Survived_LGB": "Survived"})
submission_xgb = test[["PassengerId", "Survived_XGB"]].rename(columns={"Survived_XGB": "Survived"})

submission_lgb.to_csv("submission_lgb.csv", index=False)
submission_xgb.to_csv("submission_xgb.csv", index=False)

print("\n Predictions saved! Files:")
print(" - submission_lgb.csv")
print(" - submission_xgb.csv")


### Inference:
• Predictions from both models are saved as separate submission files.

## 📝 Note:
The generated CSV files, submission_lgb.csv and submission_xgb.csv, are written to the current working directory of the notebook kernel, which is normally the folder containing this Jupyter Notebook. Because the file names are relative, no separate output path needs to be specified.
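
Relative file names such as these are resolved against the current working directory. A quick way to see where the files would be written, using only the standard library (a sketch; the `targets` name is introduced here for illustration):

```python
from pathlib import Path

# Resolve each relative file name to the absolute path it maps to.
# A relative path passed to DataFrame.to_csv is written to this location.
targets = [
    Path(name).resolve()
    for name in ("submission_lgb.csv", "submission_xgb.csv")
]

for target in targets:
    print(target)
```

If the files should go elsewhere, passing an explicit path to `to_csv` avoids depending on where the kernel was started.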
    
