<a href="https://www.kaggle.com/code/ishwor2048/beginner-friendly-titanic-eda-and-machine-learning?scriptVersionId=300531594" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h1><b>Titanic Dataset Hands-on Data Overview, EDA and Classification Machine Learning</b></h1>

Hello Everyone, <br>
This notebook walks through a data overview, exploratory data analysis (EDA), data visualization, data preprocessing, the machine learning life cycle, and saving a model locally. I hope you learn a ton from it. It is updated frequently with additional analysis, visualizations, comments, and models based on viewers' feedback, so please check back for the latest version.<br><br>
**If you want to learn something specific, please feel free to COMMENT and I will provide an in-depth deep dive. If you liked this work, please hit UPVOTE so that it reaches more learners.** <br><br>
For practice, I split the training data into training and validation sets so that the full preprocessing workflow is covered. For the Titanic competition itself this split is not strictly required, because Kaggle provides a separate test set; the submission section at the end trains on the full training data.
<br><br>
Please also check my YouTube channel, where I am building hands-on tutorials from this notebook:
https://www.youtube.com/@DataSpeaks4u

<h1><b>Basic Imports</b></h1>

Let's begin the project by importing the necessary libraries and modules. Python is well suited to data science because its libraries and frameworks spare you from writing code from scratch for common tasks such as data visualization, exploratory data analysis, machine learning, and deployment. Please go through the imports below, and let me know in the comments if you have any questions.

In [None]:
# =============================
# Core data and computation libs
# =============================
import numpy as np  # NumPy: fundamental package for fast numerical computing (arrays, random numbers, linear algebra)
import pandas as pd  # pandas: tabular data structures (DataFrame/Series) for data loading, cleaning, joins, and EDA

# =============================
# Visualization libraries
# =============================
import matplotlib.pyplot as plt  # Matplotlib (stateful pyplot API): low-level plotting, figure/axes control
import seaborn as sns  # Seaborn: higher-level statistical plots with nicer defaults; built on top of Matplotlib

# ===============================================================
# scikit-learn utilities: data split, preprocessing, and pipelines
# ===============================================================
# Train/validation/test splitting (train_test_split),
# StratifiedKFold for balanced class distributions across folds during CV,
# and GridSearchCV for exhaustive hyperparameter search with cross-validation.
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV

# ColumnTransformer lets you apply different preprocessing to different column subsets
# (e.g., numeric vs categorical pipelines).
from sklearn.compose import ColumnTransformer

# Pipeline chains preprocessing steps and estimator into a single object
# to ensure no data leakage and consistent application during fit/predict.
from sklearn.pipeline import Pipeline

# Common preprocessing transformers:
# - OneHotEncoder: convert categorical features to one-hot/dummy variables.
# - StandardScaler: standardize numeric features (mean=0, std=1), critical for distance-based or linear-margin models.
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# SimpleImputer: handle missing values (e.g., strategy='mean' for numeric, 'most_frequent' for categorical).
from sklearn.impute import SimpleImputer

# =======================
# scikit-learn classifiers
# =======================
# Decision Tree: non-linear, interpretable, prone to overfitting if not regularized (max_depth, min_samples_split, etc.).
from sklearn.tree import DecisionTreeClassifier

# Random Forest: ensemble of trees; reduces variance; good baseline for tabular data; handles mixed feature types well.
from sklearn.ensemble import RandomForestClassifier

# K-Nearest Neighbors: instance-based learner; sensitive to scale (hence StandardScaler); choose k via CV.
from sklearn.neighbors import KNeighborsClassifier

# Support Vector Classifier: effective in high-dimensional spaces; sensitive to feature scaling; kernels (linear/RBF/poly).
from sklearn.svm import SVC

# ==============================
# Metrics and diagnostic utilities
# ==============================
from sklearn.metrics import (
    accuracy_score,              # Overall fraction of correct predictions (may be misleading with class imbalance).
    precision_score,             # Of predicted positives, how many are correct (TP / (TP + FP)).
    recall_score,                # Of actual positives, how many were found (TP / (TP + FN)).
    f1_score,                    # Harmonic mean of precision and recall; balances both for imbalanced classes.
    roc_auc_score,               # Area under ROC curve; threshold-independent measure (binary & probability-based).
    classification_report,       # Nicely formatted precision/recall/F1/support per class.
    confusion_matrix,            # 2x2 (binary) or CxC (multi-class) matrix of predicted vs actual counts.
    RocCurveDisplay,             # Helper to plot ROC curve (TPR vs FPR across thresholds).
    PrecisionRecallDisplay       # Helper to plot Precision-Recall curve (especially informative on imbalanced data).
)

# =======================
# Persistence / I/O helper
# =======================
import joblib  # For saving/loading trained models, pipelines, and preprocessors efficiently (pickle-compatible).

# =======================
# Reproducibility controls
# =======================
RANDOM_STATE = 42  # Fixed seed value used wherever estimators/splitters accept random_state for reproducible results.
np.random.seed(RANDOM_STATE)  # Also seed NumPy's RNG when generating synthetic data or random operations outside sklearn.

# =======================
# Sanity message
# =======================
# Basic confirmation that the imports executed without error.
# Useful in notebooks or scripts to ensure environment dependencies are satisfied before proceeding.
print("Imports loaded!🤖")

<h3><b>Load the data (Kaggle Built-in Titanic Dataset)</b></h3>

Before loading the data here, make sure the Titanic dataset has been added under the "Input" section in the right-hand panel.

In [None]:
train_path = "/kaggle/input/competitions/titanic/train.csv"
test_path = "/kaggle/input/competitions/titanic/test.csv"

Load the training dataset from the path defined above and name it "df".

In [None]:
# define the df
df = pd.read_csv(train_path)

In [None]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset.") # check number of rows and columns in the dataset

In [None]:
# getting the top 5 rows of the data
df.head()

In [None]:
# Getting top 3 rows of the dataset
df.head(3)

In [None]:
# Getting last 5 rows of the dataset
df.tail()

In [None]:
# Getting last 7 rows of the training dataset
df.tail(7)

In [None]:
# Getting sample 5 rows of the dataset. This function will pick the random 5 rows. Notice the index positions
df.sample(5)


In [None]:
# Checking overall information about the data (dtypes, non-null counts, memory usage)
df.info()

In [None]:
# Getting quick statistical summary of the numerical columns
df.describe().T

In [None]:
# Getting statistical summary of categorical columns
df.describe(include="object").T

In [None]:
# Checking list of columns
df.columns.tolist()

<h3><b>Quick EDA for Sanity Check</b></h3>

In [None]:
# Checking balance of the labels
round(df["Survived"].value_counts(normalize=True) * 100, 2)

In [None]:
# Checking missing values
df.isnull().sum()

In [None]:
# Checking the share of missing values per column (top 10)
df.isna().mean().sort_values(ascending=False).head(10)

In [None]:
# Quick correlation check for numerical columns
num_cols = df.select_dtypes(include=[np.number]).columns

corr = round(df[num_cols].corr(numeric_only=True), 4)
display(corr)

<h3><b>Data Visualization for additional data exploration</b></h3>

In [None]:
# Visualizing the survival vs non-survival
ax = sns.countplot(x='Survived', data=df, hue='Survived')
plt.title("Count of Survival (Target)")

for container in ax.containers:
    ax.bar_label(container)
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()

In [None]:
# categorical pivot table: Survival based on gender
round(df[["Sex", "Survived"]].groupby(['Sex']).mean(), 3)

In [None]:
# categorical pivot table: Survival rate by Passenger class
round(df[["Pclass", "Survived"]].groupby(['Pclass']).mean(), 3)

In [None]:
# Age distribution by survival (KDE PLOT)
sns.kdeplot(data=df, x='Age', hue='Survived', fill=True, palette='viridis')
plt.title("Age density Distribution by Survival")

In [None]:
# Fare vs. age vs. survival scatter plot
sns.scatterplot(data=df, x='Age', y='Fare', hue='Survived', alpha=0.7)
plt.title("Relationship between Age, Fare and Survival")

In [None]:
# survival rate by family size
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
sns.barplot(data=df, x = 'FamilySize', y = 'Survived')
plt.title("Survival Rate by Family Size")

In [None]:
# Passenger class and sex interaction (catplot)
sns.catplot(x='Pclass', y='Survived', hue='Sex', data=df, kind='point')
plt.title("Survival Probability: Sex vs Pclass")

In [None]:
# Embarked port vs Survival vs. Pclass
sns.pointplot(data=df, x='Embarked', y='Survived', hue='Pclass')
plt.title("Survival by Port of Embarkation and Class")

In [None]:
# Fare distribution by class
sns.boxplot(data=df, x = 'Pclass', y = 'Fare')
plt.ylim(0, 300) # Zoom in to ignore extreme outliers for better view
plt.title("Fare Distribution across classes")

In [None]:
# missing value matrix
# Visualize the "emptiness" of the data 
# to see if Age or Cabin missingness is random or clustered
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title("Missing data Gap")

A violin plot combines a boxplot and a KDE. It's great for seeing whether the age distribution of survivors in 1st class differs from that in 3rd class.

In [None]:
# Violin plot of Age by Passenger class and Survived
sns.violinplot(data=df, x="Pclass", y='Age', hue="Survived", split=True)
plt.title("Age/Class Distribution by Survival")

You can extract titles (Mr, Mrs, Miss, Master, Dr) from the Name column. Visualizing survival by "Title" often reveals more than "Sex" alone (e.g., "Master" usually refers to young boys).

In [None]:
# Title extraction Analysis
# We can extract the title (Mr, Mrs, Miss, Master, Dr) from the Name column.
# Visualizing survival by "Title" often reveals more than "Sex" alone (e.g. Master usually refers to young boys)
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
sns.countplot(data=df, y='Title', hue="Survived")
plt.title("Survival Count by Title")
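
To make the extraction pattern concrete, here is a minimal sketch that applies the same regular expression with Python's built-in `re` module. The four names are assumed sample strings in the "Surname, Title. Given names" format of the Name column, and `re.search` stands in for pandas' `str.extract`, which likewise keeps the first match.

```python
import re

# Assumed sample strings in the "Surname, Title. Given names" format of the Name column
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
]

# Same pattern as above: a space, a run of letters, then a literal dot
pattern = re.compile(r" ([A-Za-z]+)\.")

# group(1) is the captured run of letters, i.e. the title without the dot
titles = [pattern.search(name).group(1) for name in names]
print(titles)
```

The leading space in the pattern is what keeps the surname (which is followed by a comma, not a dot) from being captured.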

A bird's-eye view of all numerical relationships at once.

In [None]:
# Pairplot of numerical features
sns.pairplot(df[num_cols].dropna(), hue="Survived", diag_kind='kde')

Standard analysis looks at individuals. However, families on the Titanic often lived or died together. You can identify groups by looking for people with the same Surname and Ticket number.

In [None]:
# Extract Surname
df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0])

# Create a 'FamilyGroup' identifier
df['FamilyGroup'] = df['Surname'] + "_" + df['Ticket'].str[:-1]

# Find survival rates of these groups
group_survival = df.groupby('FamilyGroup')['Survived'].mean()
sns.histplot(group_survival)
plt.title("Survival Consistency within Family Groups")

The Cabin column is 77% null, so most people drop it. However, the Letter in the cabin (A, B, C, D, E, F, G, T) represents the Deck. Decks closer to the water line had lower survival rates.

In [None]:
# Extract Deck from Cabin
df['Deck'] = df['Cabin'].str.slice(0,1)
df['Deck'] = df['Deck'].fillna('Unknown')

# Plot Deck vs Survival, ordered by vertical height of the ship
deck_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', 'Unknown']
sns.barplot(data=df, x='Deck', y='Survived', order=deck_order)
plt.title("Survival Rate by Ship Deck (Vertical Location)")

Is there a "survival ceiling" for wealth? Instead of a scatter plot, use a cumulative distribution or a "Binned" bar plot to see if paying $100 vs $500 actually changed your odds.

In [None]:
df["Fare_Bin"] = pd.qcut(df["Fare"], 5)

sns.barplot(data=df, x='Fare_Bin', y="Survived")
plt.xticks(rotation=45)
plt.title("Survival Probability by Fare Quintiles")
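
Why quantile bins rather than equal-width bins? The sketch below compares `pd.qcut` with `pd.cut` on a small, made-up set of right-skewed fares. The values are hypothetical and only meant to show how the two binning strategies behave.

```python
import pandas as pd

# Hypothetical, right-skewed fares (not taken from the Titanic data)
fares = pd.Series([7, 8, 9, 10, 13, 15, 26, 30, 80, 512], dtype=float)

# Equal-frequency bins: each quintile holds the same number of passengers
quantile_counts = pd.qcut(fares, 5).value_counts(sort=False)

# Equal-width bins: one extreme fare stretches the range, so most rows share one bin
width_counts = pd.cut(fares, 5).value_counts(sort=False)

print(quantile_counts)
print(width_counts)
```

With skewed data like fares, equal-width bins leave most bars nearly empty, which is one reason quantile bins tend to give more stable survival-rate estimates per bin.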

Some passengers traveled on the same ticket but weren't "Family" (e.g., nannies, friends, or cousins). The frequency of a ticket number tells you the total group size, which is often more accurate than SibSp + Parch.

In [None]:
df["Ticket_Freq"] = df.groupby("Ticket")["Ticket"].transform('count')
sns.pointplot(data=df, x='Ticket_Freq', y='Survived')
plt.title("Survival Rate by Group Size (based on Ticket Frequency)")
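
As a small illustration of how ticket frequency can differ from family size, the sketch below uses a made-up table. All names and ticket values are hypothetical.

```python
import pandas as pd

# Hypothetical passengers; ticket values are made up for illustration
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Ticket": ["T100", "T100", "T100", "T200", "T300", "T300"],
    "SibSp": [1, 1, 0, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
})

# transform("count") broadcasts each group's size back onto every row of that group
toy["Ticket_Freq"] = toy.groupby("Ticket")["Ticket"].transform("count")
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

print(toy[["Name", "Ticket", "FamilySize", "Ticket_Freq"]])
```

Passenger C has a family size of 1 but shares a ticket with two others, and E and F travel together without being recorded as relatives. That is the kind of group the ticket count picks up and SibSp + Parch misses.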

Many people just fill missing Age with the median. To see if the missingness itself is a signal, check if people with "Missing Age" survived at different rates than those with "Known Age."

In [None]:
df["Age_Unknown"] = df["Age"].isnull().astype(int)
sns.barplot(data=df, x="Age_Unknown", y="Survived", hue="Sex")
plt.title("Survival Rate: Known Age vs. Missing Age")

To dive deeper into the relationship between categorical variables, we need to go beyond simple correlations (which only work well for numbers). We want to see if knowing one category (like Sex) gives us significant information about another (like Survival or Pclass).

**Chi-Square Test for Independence** is a standard test for association between categorical variables. It tells you whether the relationship between two variables (e.g., Embarked and Survived) is statistically significant or plausibly just due to chance.

In [None]:
# Import necessary libraries and modules
from scipy.stats import chi2_contingency

def check_categorical_association(df, col1, col2):
    contingency_table = pd.crosstab(df[col1], df[col2])
    chi2, p, dof, ex = chi2_contingency(contingency_table)
    print(f"Relationship between {col1} and {col2}:")
    print(f"   - Chi-square Statistics: {chi2:.4f}")
    print(f"   - P-value: {p:.4e}")
    return p

# checking if port of Embarkation is related to survival
check_categorical_association(df, "Embarked", "Survived")
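
To see what the test computes under the hood, here is a minimal sketch that builds the expected counts and the Pearson chi-square statistic by hand with NumPy. The 2x2 table is hypothetical and not derived from the Titanic data.

```python
import numpy as np

# Hypothetical 2x2 contingency table (rows: two groups, columns: died / survived)
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts if rows and columns were independent
expected = row_totals @ col_totals / n

# Pearson chi-square statistic (no continuity correction)
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2_stat, 4))
```

Note that for 2x2 tables `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, so it reports a somewhat smaller statistic than this hand calculation unless `correction=False` is passed.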

While Chi-Square tells you if there is a relationship, Cramér's V tells you how strong it is (on a scale of 0 to 1). This is essentially the "correlation coefficient" for categories.

In [None]:
def cramers_v(x, y):
    # Bias-corrected Cramér's V between two categorical Series
    contingency = pd.crosstab(x, y)  # named to avoid shadowing sklearn's confusion_matrix
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    phi2 = chi2 / n
    r, k = contingency.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

# Example: How strong is the link between Pclass and Survival?
v_score = cramers_v(df['Pclass'], df['Survived'])
print(f"Cramer's V for Pclass & Survived: {v_score:.4f}")
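
Read as a formula, the function above appears to compute the bias-corrected form of Cramér's V rather than the plain $\sqrt{\chi^2 / (n \cdot \min(r-1, k-1))}$. With $r$ rows, $k$ columns, and $n$ observations in the contingency table, the steps in the code correspond to the following (the tilde symbols are notation introduced here, not names from the code):

$$
\phi^2 = \frac{\chi^2}{n}, \qquad
\tilde{\phi}^2 = \max\!\left(0,\; \phi^2 - \frac{(k-1)(r-1)}{n-1}\right)
$$

$$
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{V} = \sqrt{\frac{\tilde{\phi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
$$

The correction terms shrink as $n$ grows, so for a sample of several hundred rows the corrected and uncorrected values should be close.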

**Mutual Information (MI)** is a non-linear measure often used in feature selection. It quantifies how much knowing a feature's value reduces uncertainty about the target, and therefore how much the feature can contribute to a correct prediction.

In [None]:
from sklearn.feature_selection import mutual_info_classif

# We need to temporarily encode strings to numbers for MI
temp_df = df[['Pclass', 'Sex', 'Embarked', 'Survived']].copy()
temp_df['Sex'] = temp_df['Sex'].map({'male': 0, 'female': 1})
temp_df['Embarked'] = temp_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).fillna(-1)

X = temp_df.drop('Survived', axis=1)
y = temp_df['Survived']

mi_scores = mutual_info_classif(X, y, discrete_features=True)
mi_results = pd.Series(mi_scores, name="MI Scores", index=X.columns)
print(mi_results.sort_values(ascending=False))
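
For intuition about what an MI score means, the sketch below computes mutual information by hand from a joint count table. The counts are hypothetical. Natural logarithms are used, so the result is in nats, which matches what scikit-learn reports for discrete features.

```python
import numpy as np

# Hypothetical joint counts of a binary feature (rows) and a binary target (columns)
counts = np.array([[40, 10],
                   [10, 40]], dtype=float)

joint = counts / counts.sum()            # p(x, y)
px = joint.sum(axis=1, keepdims=True)    # p(x)
py = joint.sum(axis=0, keepdims=True)    # p(y)

# MI = sum over cells of p(x, y) * ln( p(x, y) / (p(x) * p(y)) ), in nats
mi = float((joint * np.log(joint / (px * py))).sum())
print(round(mi, 4))
```

An MI of 0 would mean the feature and target are independent; for a balanced binary target the maximum is ln 2 (about 0.693), reached when the feature determines the target exactly.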

Sometimes a raw table is hard to read. Normalizing by index (row) shows you the likelihood of survival per category.

In [None]:
# display() is used so both tables render; a notebook cell only shows its last expression
# What percentage of each Class survived?
display(pd.crosstab(df['Pclass'], df['Survived'], normalize='index'))

# Does Sex affect the Survival Rate of different Embarked ports?
display(pd.crosstab(index=[df['Sex'], df['Embarked']], columns=df['Survived'], normalize='index'))

<h3><b>Hands-on Feature Engineering</b></h3>

Creating copy of the dataset before proceeding

In [None]:
df_fe = df.copy()

In [None]:
# Quickly see the data
df_fe.head(2)

In [None]:
# Group rare titles into "Rare" for stability
rare_titles = df_fe["Title"].value_counts()
rare_titles = rare_titles[rare_titles < 10].index
df_fe["Title"] = df_fe["Title"].replace(rare_titles, "Rare")

In [None]:
df_fe["IsAlone"] = (df_fe["FamilySize"] == 1).astype(int)

In [None]:
# Deck from Cabin (first letter), missing -> "Unknown"
df_fe["Deck"] = df_fe["Cabin"].str[0].fillna("Unknown")  # missing cabins stay NaN under .str[0], then become "Unknown"

In [None]:
# Ticket group size (people sharing ticket sometimes correlate)
ticket_counts = df_fe["Ticket"].value_counts()
df_fe["TicketGroupSize"] = df_fe["Ticket"].map(ticket_counts)

In [None]:
# Fare per person (avoid divide-by-zero; FamilySize >= 1 always here)
df_fe["FarePerPerson"] = df_fe["Fare"] / df_fe["FamilySize"]

In [None]:
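# Drop columns we won't use directly (IDs, high-cardinality, leakage-ish)
# 'Name' is dropped after extracting Title; the EDA-only helpers created earlier
# (Surname, FamilyGroup, Fare_Bin, Ticket_Freq) are dropped as well so they are not
# one-hot encoded or duplicated as features
drop_cols = ["PassengerId", "Name", "Cabin", "Ticket",
             "Surname", "FamilyGroup", "Fare_Bin", "Ticket_Freq"]
df_fe = df_fe.drop(columns=drop_cols)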

Let's check what we created

In [None]:
df_fe.head(3)

<h3><b>Define Training Features (X) and Target Variable (y)</b></h3>

In [None]:
target_col = "Survived"

# if we drop the target column from all the features, it becomes training features
X = df_fe.drop(columns=[target_col])
# Now, as we have already defined target_col from the dataset, let's set that up
y = df_fe[target_col].astype(int)

In [None]:
# Identify column data types for pre-processing
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(exclude=[np.number]).columns.tolist()

In [None]:
print("Numerical Features:", numerical_features)

In [None]:
print("Categorical Features:", categorical_features)

<h3><b>Train / Validation Split (Stratified for Classification)</b></h3>

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, 
                                                 random_state=RANDOM_STATE)

In [None]:
print("Train:", X_train.shape, "Val:", X_val.shape)
print("Train target rate:", y_train.mean(), "Val target rate:", y_val.mean())
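
To illustrate what `stratify` does, here is a minimal sketch on synthetic labels. The 38/62 class mix is an assumed stand-in for an imbalanced binary target, and `X_toy` / `y_toy` are hypothetical names used only here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 38 positives out of 100, an assumed imbalanced binary target
y_toy = np.array([1] * 38 + [0] * 62)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions (almost) identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("overall:", y_toy.mean(), "train:", y_tr.mean(), "val:", y_va.mean())
```

Without `stratify`, a small validation set can end up with a noticeably different positive rate by chance, which makes validation metrics harder to compare with training metrics.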

<h3><b>Building Data Preprocessing Pipeline</b></h3>
<li> Numeric: Impute missing values + scaling the feature values
<li> Categorical: Impute missing values + one hot encoding

In [None]:
# Numeric data transformer
numeric_transformer = Pipeline(steps = [
    ("imputer", SimpleImputer(strategy="median")), 
    ("scaler", StandardScaler())
])

# Categorical data transformer
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")), 
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Now, applying categorical and numerical data transformer to the processing step
preprocessor = ColumnTransformer(
    transformers = [
        ("num", numeric_transformer, numerical_features), 
        ("cat", categorical_transformer, categorical_features)
    ]
)

print("Data Preprocessing is Performed!✅")
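
The two key behaviors of this preprocessor can be sketched in isolation: statistics are learned from the training split only, and categories unseen during fitting do not raise an error. The arrays below are hypothetical toy data, not Titanic columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical numeric column (e.g. ages) with missing values
age_train = np.array([[20.0], [30.0], [40.0], [np.nan]])
age_val = np.array([[np.nan], [80.0]])

imputer = SimpleImputer(strategy="median").fit(age_train)  # learns the median from train only
age_val_filled = imputer.transform(age_val)                # NaN in validation gets the train median

# Hypothetical categorical column (e.g. ports); "X" never appears in training
port_train = np.array([["S"], ["C"], ["Q"]], dtype=object)
port_val = np.array([["S"], ["X"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore").fit(port_train)
port_val_encoded = encoder.transform(port_val).toarray()   # unseen "X" becomes an all-zero row

print(age_val_filled.ravel())
print(encoder.categories_[0], port_val_encoded)
```

Because the validation value 80.0 never influences the learned median, nothing from the validation split leaks into preprocessing. Wrapping these steps in a `Pipeline` gives the same guarantee automatically.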

<h3><b>Defining the function to evaluate any classification machine learning model</b></h3>
This function evaluates any classification model's performance, so that we can quickly train more models and compare them in a consistent way.

In [None]:
def evaluate_classifier(model, X_tr, y_tr, X_te, y_te, model_name = "model"):
    """
    This function fits the pipeline model and prints classification metrics. 
    Also, returns a dictionary of metrics for comparison.
    """

    # Fitting training data to the model
    model.fit(X_tr, y_tr)

    # Perform predictions on the test data
    y_pred = model.predict(X_te)

    # Use class probabilities for ROC AUC when the model exposes predict_proba (SVC does only with probability=True)
    y_proba = None
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_te)[:, 1]

    # ------------------------------------
    # Model evaluation metrics
    # ------------------------------------
    acc = accuracy_score(y_te, y_pred) # calculating accuracy from actual test labels against predicted test labels
    prec = precision_score(y_te, y_pred, zero_division=0)
    rec = recall_score(y_te, y_pred, zero_division=0)
    f1 = f1_score(y_te, y_pred, zero_division=0)

    # ROC AUC score calculation if probabilities exist
    auc = roc_auc_score(y_te, y_proba) if y_proba is not None else np.nan

    # printing out all the results as part of model evaluation techniques
    print(f"\n========== {model_name} ===========")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1-score : {f1:.4f}")
    print(f"ROC AUC: {auc:.4f}" if not np.isnan(auc) else "ROC AUC: (no probabilities available)")

    # generating confusion matrix to see the True Positive, True Negative, False Positive and False Negative
    print("\nConfusion Matrix: ")
    print(confusion_matrix(y_te, y_pred))

    # Printing the per-class classification report for the evaluation set
    print("\nClassification Report: ")
    print(classification_report(y_te, y_pred, zero_division=0))

    metrics = {"Model": model_name, "Accuracy": acc, "Precision": prec, "Recall": rec, "F1-Score": f1, "ROC-AUC": auc}
    return metrics
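
To connect the printed metrics to the confusion matrix, here is a small worked example in plain Python. The four counts are hypothetical.

```python
# Hypothetical confusion-matrix counts for the positive class (Survived = 1)
tp, fp, fn, tn = 30, 10, 20, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of predicted survivors, how many actually survived
recall = tp / (tp + fn)             # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

For binary labels 0 and 1, scikit-learn's `confusion_matrix` lays the counts out as `[[TN, FP], [FN, TP]]`, so these four numbers can be read directly from the matrix that `evaluate_classifier` prints.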
    

<h2><b>Training and Evaluating Few Baseline Models</b></h2>
It is finally time to train a few baseline models and see how they perform. In later steps, we will tune these models.

In [None]:
# Beginning with very basic but powerful decision tree model
dt_clf = Pipeline(steps=[
    ("preprocess", preprocessor), 
    ("model", DecisionTreeClassifier(random_state=RANDOM_STATE))
])

In [None]:
# let's jump now to the RandomForest
rf_clf = Pipeline(steps=[
    ("preprocess", preprocessor), 
    ("model", RandomForestClassifier(
        random_state=RANDOM_STATE, 
        n_estimators=300
    ))
])

In [None]:
# K-Nearest Neighbors (KNN) -> This model needs scaling which is already done above in numeric pipeline
knn_clf = Pipeline(steps=[
    ("preprocess", preprocessor), 
    ("model", KNeighborsClassifier())
])

In [None]:
# Support Vector Classifier (probability=True enables predict_proba, which is needed for ROC AUC)
svc_clf = Pipeline(steps=[
    ("preprocess", preprocessor), 
    ("model", SVC(probability=True, random_state = RANDOM_STATE))
])

In [None]:
# Initializing empty list to store the results of model performances
results = []

In [None]:
# Appending all the results from different models to empty list "results" that we initialized above
results.append(evaluate_classifier(dt_clf, X_train, y_train, X_val, y_val, model_name="DecisionTree (Baseline)"))
results.append(evaluate_classifier(rf_clf, X_train, y_train, X_val, y_val, model_name="RandomForest (Baseline)"))
results.append(evaluate_classifier(knn_clf, X_train, y_train, X_val, y_val, model_name="KNN (Baseline)"))
results.append(evaluate_classifier(svc_clf, X_train, y_train, X_val, y_val, model_name="SVC (Baseline)"))

In [None]:
results_df = pd.DataFrame(results).sort_values(by="F1-Score", ascending=False)

In [None]:
results_df

<h2><b>Hyperparameter Tuning</b></h2>
It looks like the models learned fairly well; however, I think we can improve their performance further.
So we will perform hyperparameter tuning on each of the models individually and see how they perform.

In [None]:
# Please check back for the next step: hands-on hyperparameter tuning. In a couple of days, I will update this notebook with detailed
# hyperparameter tuning for each of these models.

# Until then, please follow and subscribe to the Data Speaks YouTube channel:
# https://www.youtube.com/@DataSpeaks4u

<h1><b>For submission only</b></h1>

This section only builds the Kaggle submission file; it is less useful for learning.

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
# 1. LOAD DATA
train = pd.read_csv('/kaggle/input/competitions/titanic/train.csv')
test = pd.read_csv('/kaggle/input/competitions/titanic/test.csv')

passenger_ids = test['PassengerId']

In [None]:
# 2. ADVANCED FEATURE ENGINEERING
def engineer_features(df):
    # Extract Title and group rare ones
    df['Title'] = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string avoids an invalid-escape warning for \.
    df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    
    # Impute Age based on Title (more accurate than global median)
    df['Age'] = df.groupby('Title')['Age'].transform(lambda x: x.fillna(x.median()))
    
    # Cabin Deck extraction (First letter of Cabin)
    df['Deck'] = df['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else 'U') # 'U' for Unknown
    
    # Family Size & Groups
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    
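    # Fare Imputation & Binning
    # Note: qcut derives quartile edges from whichever DataFrame is passed in,
    # so train and test get their own (possibly different) bin boundaries
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['FareBin'] = pd.qcut(df['Fare'], 4, labels=[0, 1, 2, 3]).astype(int)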
    
    # Cleanup
    df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)
    return df
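
The title-based age imputation inside `engineer_features` can be sketched on a tiny, made-up table. The ages below are hypothetical and chosen only to show why a per-title median can be a better guess than a global one.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; ages are made up for illustration
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Master"],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan, 6.0],
})

# A single global median sits between the two groups and fits neither
global_median = toy["Age"].median()

# Fill each missing age with the median of the passenger's own title group
toy["Age_filled"] = toy.groupby("Title")["Age"].transform(lambda s: s.fillna(s.median()))

print(global_median)
print(toy)
```

Here the missing "Mr" age is filled with 35 and the missing "Master" age with 5, whereas the global median of 18 would be a poor guess for both.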

In [None]:
train = engineer_features(train)
test = engineer_features(test)

In [None]:
# 3. ENCODING
# Use Label Encoding for categorical features
categorical = ['Sex', 'Embarked', 'Title', 'Deck']
for col in categorical:
    le = LabelEncoder()
    # Fit on combined data to ensure all labels are captured
    combined = pd.concat([train[col], test[col]], axis=0).astype(str)
    le.fit(combined)
    train[col] = le.transform(train[col].astype(str))
    test[col] = le.transform(test[col].astype(str))
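
The reason for fitting the encoder on the combined labels can be shown with a minimal sketch. The deck labels below are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical deck labels: "T" appears only in the second split
train_decks = ["A", "B", "U"]
test_decks = ["B", "T"]

# Fitting on one split alone fails when the other split contains a new label
train_only = LabelEncoder().fit(train_decks)
try:
    train_only.transform(test_decks)
    unseen_error = False
except ValueError:
    unseen_error = True

# Fitting on the union of labels avoids the error
combined = LabelEncoder().fit(train_decks + test_decks)
encoded_test = combined.transform(test_decks).tolist()

print(unseen_error, list(combined.classes_), encoded_test)
```

Fitting on the combined labels only collects category names and uses no target values. The resulting integer codes do imply an arbitrary ordering, which tree-based models generally tolerate better than linear or distance-based ones.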

In [None]:
X = train.drop('Survived', axis=1)
y = train['Survived']

print("Code executed so far!")

In [None]:
# 4. THE ENSEMBLE MODEL
# Define three strong, diverse learners
clf1 = RandomForestClassifier(n_estimators=500, max_depth=6, random_state=42)

In [None]:
clf2 = xgb.XGBClassifier(n_estimators=500, max_depth=4, learning_rate=0.01, random_state=42)

In [None]:
clf3 = lgb.LGBMClassifier(n_estimators=500, max_depth=4, learning_rate=0.01, random_state=42)

In [None]:
# Combine them using a Soft Voting Classifier
# "Soft" uses predicted probabilities rather than just "Hard" majority vote
voting_clf = VotingClassifier(
    estimators=[('rf', clf1), ('xgb', clf2), ('lgbm', clf3)],
    voting='soft'
)
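
The difference between soft and hard voting can be illustrated with a few lines of NumPy. The three probabilities are hypothetical outputs for a single passenger, not results from the models above.

```python
import numpy as np

# Hypothetical P(Survived = 1) from three models for one passenger
proba = np.array([0.9, 0.4, 0.4])

# Hard voting: each model casts a 0/1 vote at the 0.5 threshold, majority wins
hard_votes = (proba >= 0.5).astype(int)
hard_prediction = int(hard_votes.sum() > len(hard_votes) / 2)

# Soft voting: average the probabilities, then threshold the mean
soft_prediction = int(proba.mean() >= 0.5)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

In this example one very confident model outweighs two mildly doubtful ones under soft voting, while hard voting follows the 2-to-1 majority. Soft voting therefore works best when the models' probabilities are reasonably well calibrated.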

In [None]:
# 5. TRAIN AND SUBMIT
voting_clf.fit(X, y)

In [None]:
predictions = voting_clf.predict(test)

In [None]:
output = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Advanced submission file saved!")