### ✅ Feature Engineering

Takeaways from EDA:

- Target (Survived) class balance: Roughly 38% survived, 62% did not → moderate class imbalance, but not extreme.
- Age: 714 non-null, 177 missing (~20%). Open question: is the missingness random or related to other variables? __Create a dummy variable for missing Age.__
- Sex: females survive much more often than males
- Children have relatively high survival rates
- Survival doesn’t change linearly with age
- Top fare band (≈ £69–512) has survival ≈ 68%, noticeably higher than the rest.
- Passengers embarking at C (Cherbourg) tend to have higher survival (more 1st-class, high-fare passengers).
- Passengers travelling completely alone or with very large families tend to have lower survival. → __Create three categories: 'Alone' (FamilySize=1), 'SmallFamily' (FamilySize=2-4), 'LargeFamily' (FamilySize>=5)__
- Age missingness is not random: it is lowest in 2nd class, moderate in 1st class, and highest in 3rd class, where around a quarter to a third of passengers have no recorded Age.
- This suggests a link between missing Age and socio-economic status (and therefore survival), __making naive mean imputation potentially biased__.
- Female survival is high in all classes, but especially high in 1st and 2nd. → __interaction Sex×Pclass important__
- Male survival is low overall and worst for 3rd-class males. → __interaction Sex×Pclass important__
- Female children and female adults have high survival. → __interaction Age×Sex important__
- Adult males, especially in certain age bands (young/prime adults), have conspicuously low survival. → __interaction Age×Sex important__
- Within the same Pclass, higher Fare bands often correspond to better survival (e.g. more desirable cabins/locations). → __interaction Fare×Pclass important__
- __Numeric variables should be normalized__ before modeling.
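The three-way family grouping proposed in the takeaways can be sketched as follows. This is only an illustration: the feature engineering function further down creates `FamilySize` and `IsAlone` but not this category, and the names `add_family_category` and `FamilyCategory` are hypothetical.

```python
import pandas as pd


def add_family_category(df: pd.DataFrame) -> pd.DataFrame:
    """Bucket FamilySize (SibSp + Parch + 1) into Alone / SmallFamily / LargeFamily."""
    out = df.copy()
    family_size = out["SibSp"] + out["Parch"] + 1
    out["FamilyCategory"] = pd.cut(
        family_size,
        bins=[0, 1, 4, float("inf")],  # (0, 1], (1, 4], (4, inf]
        labels=["Alone", "SmallFamily", "LargeFamily"],
    )
    return out


# Toy rows with family sizes 1, 2, 5 and 7
demo = pd.DataFrame({"SibSp": [0, 1, 3, 4], "Parch": [0, 0, 1, 2]})
demo = add_family_category(demo)
print(demo)
```

Because the bin edges are fixed constants rather than learned from the data, the same function can be applied to train and test without any risk of leakage.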

### Titanic – Feature Engineering

This notebook takes the **cleaned Titanic data** and performs **model-agnostic feature engineering**.

Inputs:
- `../data/interim/train_clean.csv`
- `../data/interim/test_clean.csv`
- `../data/interim/clean_metadata.json` (for consistent categorical schemas)

Outputs:
- `../data/processed/train_features.csv`
- `../data/processed/test_features.csv`
- `../data/processed/processed_metadata.json` (for consistent categorical schemas)

The feature engineering is **model-agnostic**: all engineered features are interpretable and can be used by any model (linear, tree-based, neural, etc.).

In [31]:
import os
import json

import numpy as np
import pandas as pd

In [32]:
# Display settings
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

# Directories
DATA_DIR_INTERIM = "../data/interim"
DATA_DIR_PROCESSED = "../data/processed"
os.makedirs(DATA_DIR_PROCESSED, exist_ok=True)

# --------------------------
# INPUT FILES
# --------------------------
TRAIN_CLEAN_PATH = os.path.join(DATA_DIR_INTERIM, "train_clean.csv")
TEST_CLEAN_PATH  = os.path.join(DATA_DIR_INTERIM, "test_clean.csv")

CLEAN_META_PATH  = os.path.join(DATA_DIR_INTERIM, "clean_metadata.json")

# --------------------------
# OUTPUT FILES
# --------------------------
TRAIN_FEAT_PATH = os.path.join(DATA_DIR_PROCESSED, "train_features.csv")
TEST_FEAT_PATH  = os.path.join(DATA_DIR_PROCESSED, "test_features.csv")

PROCESSED_META_PATH = os.path.join(DATA_DIR_PROCESSED, "processed_metadata.json")

In [33]:
train = pd.read_csv(TRAIN_CLEAN_PATH)
test  = pd.read_csv(TEST_CLEAN_PATH)

# Load INPUT metadata (clean metadata!)
with open(CLEAN_META_PATH, "r") as f:
    clean_meta = json.load(f)

print("Train shape:", train.shape)
print("Test shape: ", test.shape)

train.head()

Train shape: (891, 17)
Test shape:  (418, 16)


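| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | AgeMissing | Ticket_clean | Ticket_strong | TicketPrefix | FareMissing | CabinKnown | CabinDeck | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr | 0 | A 5 21171 | Other | A | 0 | 0 | Unknown | 0 |
| 1 | 2 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs | 0 | PC 17599 | Other | PC | 0 | 1 | C | 1 |
| 2 | 3 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss | 0 | STON O2 3101282 | Other | STON | 0 | 0 | Unknown | 1 |
| 3 | 4 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs | 0 | 113803 | Other | NONE | 0 | 1 | C | 1 |
| 4 | 5 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr | 0 | 373450 | Other | NONE | 0 | 0 | Unknown | 0 |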


In [34]:
pd.set_option("display.max_colwidth", 200)

### Categorical Schemas (from clean_metadata.json)

We enforce consistent categorical dtypes for the cleaned categorical variables:

- TicketPrefix
- Title
- Embarked
- CabinDeck
- Ticket_strong

`Ticket_clean` is deliberately left as a plain string column: it is not in the enforced list below and is dropped at the end of feature engineering.

Taking the category set from the metadata means train and test share the same categories for these columns, which prevents downstream errors caused by category mismatches.
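To make the benefit concrete, here is a small sketch of what a shared schema buys when the columns are later one-hot encoded. The `schema` dict is a hypothetical stand-in shaped like one entry of `clean_metadata.json`, and `with_schema` is an illustrative helper, not the notebook's function.

```python
import pandas as pd

# Hypothetical schema, shaped like one entry of clean_metadata.json
schema = {"Embarked": ["C", "Q", "S"]}

train_demo = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_demo = pd.DataFrame({"Embarked": ["S", "S"]})  # "C" and "Q" never occur here


def with_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Cast each listed column to a categorical with a fixed category set."""
    out = df.copy()
    for col, cats in schema.items():
        out[col] = pd.Categorical(out[col], categories=cats, ordered=False)
    return out


# Without a shared schema, one-hot encoding yields different column sets
raw_cols_train = list(pd.get_dummies(train_demo).columns)
raw_cols_test = list(pd.get_dummies(test_demo).columns)

# With a shared schema, unused categories still get an (all-zero) column
fixed_cols_train = list(pd.get_dummies(with_schema(train_demo, schema)).columns)
fixed_cols_test = list(pd.get_dummies(with_schema(test_demo, schema)).columns)

print(raw_cols_train, raw_cols_test)
print(fixed_cols_train, fixed_cols_test)
```

With the schema applied, both frames expose the same encoded columns even though the test sample contains only one port.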

In [35]:
def apply_metadata_categories(df: pd.DataFrame, meta: dict) -> pd.DataFrame:
    """
    Apply fixed categorical categories from clean_metadata.json
    so train and test have consistent category schemas.
    """
    df = df.copy()

    # Columns where we actually want to enforce categories
    categorical_cols = [
        "TicketPrefix",
        "Title",
        "Embarked",
        "CabinDeck",
        "Ticket_strong",
    ]

    for col in categorical_cols:
        if col in df.columns and col in meta:
            df[col] = pd.Categorical(
                df[col],
                categories=meta[col],
                ordered=False,
            )

    return df


train = apply_metadata_categories(train, clean_meta)
test  = apply_metadata_categories(test, clean_meta)

train.dtypes.head(20)

PassengerId         int64
Pclass              int64
Sex                object
Age               float64
SibSp               int64
Parch               int64
Fare              float64
Embarked         category
Title            category
AgeMissing          int64
Ticket_clean       object
Ticket_strong    category
TicketPrefix     category
FareMissing         int64
CabinKnown          int64
CabinDeck        category
Survived            int64
dtype: object

### Feature Engineering Function

We now define a **model-agnostic** feature engineering function that:

1. Keeps core cleaned features
2. Adds domain-informed features:
   - FamilySize, IsAlone
   - LogFare, FarePerPerson
   - TicketGroupSize, TicketGroupFarePerPerson
3. Adds binned versions:
   - AgeBand, FareBand
4. Keeps the missingness indicators created during cleaning (AgeMissing, FareMissing, CabinKnown) as features
5. Adds a few high-value interaction features:
   - Sex × Pclass
   - Sex × AgeBand
   - Embarked × Pclass
   - Title × AgeBand
   - Pclass × FareBand

We learn bin edges (`AgeBand`, `FareBand`) on the training set and reuse them for the test set.
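One consequence of reusing training edges is worth keeping in mind: `pd.cut` assigns no band to values that fall outside the learned range. The sketch below uses made-up ages to show the effect, plus one possible remedy (opening the outer edges). The remedy is an assumption for illustration and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy values; 0.5 and 75.0 lie outside the training range
train_age = pd.Series([2.0, 18.0, 25.0, 40.0, 60.0])
test_age = pd.Series([0.5, 30.0, 75.0])

# Quantile edges learned on train only (three bins for brevity)
edges = np.unique(train_age.quantile([0, 1 / 3, 2 / 3, 1]).values)

# Reusing the raw edges: out-of-range test values get no band (NaN)
banded_raw = pd.cut(test_age, bins=edges, include_lowest=True)

# Possible remedy: open the outer edges so every value lands in the
# first or last band
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
banded_open = pd.cut(test_age, bins=open_edges)

print(banded_raw.isna().tolist())
print(banded_open.isna().tolist())
```

If the test set stays within the training range, both variants assign the same bands; they differ only for values beyond the outermost training edges.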

In [36]:
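def compute_bins(series: pd.Series, n_bins: int = 5):
    """
    Compute bin edges for a continuous variable.

    Uses quantile edges (roughly equal-sized bins) and falls back to
    equal-width edges when ties leave fewer than two bins.
    Returns None if no usable edges can be derived.
    """
    valid = series.dropna()
    if valid.empty:
        return None

    # Quantile edges; np.unique drops duplicate edges caused by ties
    quantiles = np.linspace(0, 1, n_bins + 1)
    edges = np.unique(valid.quantile(quantiles).values)

    # Fewer than 3 edges means fewer than 2 bins: fall back to equal width
    if edges.size < 3:
        edges = np.unique(np.linspace(valid.min(), valid.max(), n_bins + 1))

    return edges if edges.size >= 3 else None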


def engineer_features(
    df: pd.DataFrame,
    meta: dict,
    age_bins=None,
    fare_bins=None,
    is_train: bool = True,
    n_age_bins: int = 5,
    n_fare_bins: int = 5,
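):
    """
    Model-agnostic feature engineering.

    With is_train=True, bin edges are learned from df; otherwise the
    supplied age_bins / fare_bins are reused.
    Returns (engineered DataFrame, age_bins, fare_bins).
    """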
    df = df.copy()

    # Ensure metadata categories (safe to reapply)
    df = apply_metadata_categories(df, meta)

    # -------------------
    # 1. Family features
    # -------------------
    df["FamilySize"] = df.get("SibSp", 0) + df.get("Parch", 0) + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

    # -------------------
    # 2. Fare transforms
    # -------------------
    if "Fare" in df.columns:
        fare = df["Fare"].fillna(df["Fare"].median()).clip(lower=0)
        df["LogFare"] = np.log1p(fare)
        df["FarePerPerson"] = fare / df["FamilySize"].replace(0, 1)
    else:
        df["LogFare"] = np.nan
        df["FarePerPerson"] = np.nan

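    # -------------------
    # 3. Ticket group features
    # -------------------
    # NOTE: candidates are tried in order, so Ticket_strong wins when present.
    # Every row shown in the output below has Ticket_strong == "Other", so the
    # "group size" there is the size of that bucket (855), not the number of
    # passengers sharing one ticket. Put "Ticket_clean" first if per-ticket
    # groups are intended.
    ticket_col = None
    for cand in ["Ticket_strong", "Ticket_clean", "Ticket"]:
        if cand in df.columns:
            ticket_col = cand
            break

    if ticket_col:
        ticket_counts = (
            df.groupby(ticket_col, observed=True)[ticket_col]
            .transform("size")
        )
        df["TicketGroupSize"] = ticket_counts
        df["TicketGroupFarePerPerson"] = df["Fare"] / ticket_counts.replace(0, 1)
    else:
        df["TicketGroupSize"] = 1
        df["TicketGroupFarePerPerson"] = np.nan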

    # -------------------
    # 4. AgeBand
    # -------------------
    if "Age" in df.columns:
        if is_train:
            age_bins = compute_bins(df["Age"], n_bins=n_age_bins)
        df["AgeBand"] = (
            pd.cut(df["Age"], bins=age_bins, include_lowest=True)
            if age_bins is not None else
            pd.Series([None] * len(df), dtype="object")
        )

    # -------------------
    # 5. FareBand
    # -------------------
    if "Fare" in df.columns:
        if is_train:
            fare_bins = compute_bins(df["Fare"], n_bins=n_fare_bins)
        df["FareBand"] = (
            pd.cut(df["Fare"], bins=fare_bins, include_lowest=True)
            if fare_bins is not None else
            pd.Series([None] * len(df), dtype="object")
        )

    # -------------------
    # 6. Interaction features
    # -------------------
    def combine_as_str(col1, col2, new_col):
        df[new_col] = df[col1].astype(str) + "_" + df[col2].astype(str)

    combine_as_str("Sex", "Pclass", "Sex_Pclass")
    combine_as_str("Sex", "AgeBand", "Sex_AgeBand")
    combine_as_str("Embarked", "Pclass", "Embarked_Pclass")
    combine_as_str("Title", "AgeBand", "Title_AgeBand")
    combine_as_str("Pclass", "FareBand", "Pclass_FareBand")

    # -------------------
    # 7. Missingness indicators (keep if present, otherwise create as 0)
    # -------------------
    for col in ["AgeMissing", "FareMissing", "CabinKnown"]:
        df[col] = df.get(col, 0)

    # -------------------
    # 8. Drop unused or obsolete columns
    # -------------------
    DROP_COLS = [
        "Name",           # raw text, never used
        "Ticket",         # raw messy ticket string
        "Cabin",          # raw cabin string
        "Ticket_clean",   # obsolete feature; replaced by Ticket_strong + TicketPrefix
    ]

    df_fe = df.drop(columns=DROP_COLS, errors="ignore")

    return df_fe, age_bins, fare_bins

In [37]:
# Separate target from features in training data
target_col = "Survived"
y = train[target_col].copy()
X_train_base = train.drop(columns=[target_col])

# Apply feature engineering to TRAIN
X_train_fe, age_bins, fare_bins = engineer_features(
    X_train_base,
    meta=clean_meta,
    is_train=True,
    age_bins=None,
    fare_bins=None,
)

# Reattach target to FE train set (mirror interim structure)
X_train_fe[target_col] = y

# Apply feature engineering to TEST using TRAIN bins
X_test_fe, _, _ = engineer_features(
    test,
    meta=clean_meta,
    is_train=False,
    age_bins=age_bins,
    fare_bins=fare_bins,
)

print("Train FE shape:", X_train_fe.shape)
print("Test FE shape: ", X_test_fe.shape)

X_train_fe.head()

Train FE shape: (891, 29)
Test FE shape:  (418, 28)


| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | AgeMissing | Ticket_strong | TicketPrefix | FareMissing | CabinKnown | CabinDeck | FamilySize | IsAlone | LogFare | FarePerPerson | TicketGroupSize | TicketGroupFarePerPerson | AgeBand | FareBand | Sex_Pclass | Sex_AgeBand | Embarked_Pclass | Title_AgeBand | Pclass_FareBand | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr | 0 | Other | A | 0 | 0 | Unknown | 2 | 0 | 2.110213 | 3.625 | 855 | 0.00848 | (20.0, 25.0] | (-0.001, 7.854] | male_3 | male_(20.0, 25.0] | S_3 | Mr_(20.0, 25.0] | 3_(-0.001, 7.854] | 0 |
| 1 | 2 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs | 0 | Other | PC | 0 | 1 | C | 2 | 0 | 4.280593 | 35.64165 | 855 | 0.083372 | (30.0, 40.0] | (39.688, 512.329] | female_1 | female_(30.0, 40.0] | C_1 | Mrs_(30.0, 40.0] | 1_(39.688, 512.329] | 1 |
| 2 | 3 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss | 0 | Other | STON | 0 | 0 | Unknown | 1 | 1 | 2.188856 | 7.925 | 855 | 0.009269 | (25.0, 30.0] | (7.854, 10.5] | female_3 | female_(25.0, 30.0] | S_3 | Miss_(25.0, 30.0] | 3_(7.854, 10.5] | 1 |
| 3 | 4 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs | 0 | Other | NONE | 0 | 1 | C | 2 | 0 | 3.990834 | 26.55 | 855 | 0.062105 | (30.0, 40.0] | (39.688, 512.329] | female_1 | female_(30.0, 40.0] | S_1 | Mrs_(30.0, 40.0] | 1_(39.688, 512.329] | 1 |
| 4 | 5 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr | 0 | Other | NONE | 0 | 0 | Unknown | 1 | 1 | 2.202765 | 8.05 | 855 | 0.009415 | (30.0, 40.0] | (7.854, 10.5] | male_3 | male_(30.0, 40.0] | S_3 | Mr_(30.0, 40.0] | 3_(7.854, 10.5] | 0 |
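In the rows above, `TicketGroupSize` is 855 for every passenger because the grouping key is `Ticket_strong`, whose value is `Other` in all five rows. If the intent is to count passengers sharing one ticket, grouping by the cleaned ticket string is one option. The sketch below contrasts the two on toy rows (the duplicated ticket is invented for illustration); the column names `GroupSize_strong`, `GroupSize_ticket` and `FarePerTicketHolder` are hypothetical.

```python
import pandas as pd

# Toy rows; the second "113803" row is made up to form a shared ticket
demo = pd.DataFrame({
    "Ticket_clean":  ["113803", "113803", "373450", "PC 17599"],
    "Ticket_strong": ["Other", "Other", "Other", "Other"],
    "Fare":          [53.1, 53.1, 8.05, 71.2833],
})

# Grouping by the coarse label counts everyone in the "Other" bucket
demo["GroupSize_strong"] = (
    demo.groupby("Ticket_strong")["Ticket_strong"].transform("size")
)

# Grouping by the cleaned ticket string counts passengers per ticket
demo["GroupSize_ticket"] = (
    demo.groupby("Ticket_clean")["Ticket_clean"].transform("size")
)
demo["FarePerTicketHolder"] = demo["Fare"] / demo["GroupSize_ticket"]

print(demo)
```

Note that counting within a single dataset misses groups split across train and test; whether to count over both is a modeling decision this sketch leaves open.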


### Sanity Checks

We quickly inspect:

- Alignment of columns between train and test
- Missing values
- Key engineered features
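A further check that may be useful here: the interaction columns are plain strings built separately for each dataset, so the metadata schema does not cover them, and a combination that occurs only in test would show up as an unseen level in a later encoder. A minimal sketch of such a check, with the hypothetical helper name `unseen_categories` and toy data:

```python
import pandas as pd


def unseen_categories(train_df: pd.DataFrame, test_df: pd.DataFrame, columns) -> dict:
    """Return {column: sorted values that occur in test but never in train}."""
    report = {}
    for col in columns:
        seen = set(train_df[col].astype(str).unique())
        extra = set(test_df[col].astype(str).unique()) - seen
        if extra:
            report[col] = sorted(extra)
    return report


# Toy frames: "male_2" appears only in the test sample
train_demo = pd.DataFrame({"Sex_Pclass": ["male_3", "female_1", "female_3"]})
test_demo = pd.DataFrame({"Sex_Pclass": ["male_3", "male_2"]})

report = unseen_categories(train_demo, test_demo, ["Sex_Pclass"])
print(report)
```

An empty report means every test level was also seen in training for the listed columns.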

In [38]:
train_cols = set(X_train_fe.columns)
test_cols = set(X_test_fe.columns)

print("Columns only in train:", sorted(train_cols - test_cols))
print("Columns only in test :", sorted(test_cols - train_cols))

Columns only in train: ['Survived']
Columns only in test : []


In [39]:
assert train_cols - {"Survived"} == test_cols, "Train/Test feature mismatch!"

In [40]:
X_train_fe.isna().sum().sort_values(ascending=False)

PassengerId                 0
Pclass                      0
Sex                         0
Age                         0
SibSp                       0
Parch                       0
Fare                        0
Embarked                    0
Title                       0
AgeMissing                  0
Ticket_strong               0
TicketPrefix                0
FareMissing                 0
CabinKnown                  0
CabinDeck                   0
FamilySize                  0
IsAlone                     0
LogFare                     0
FarePerPerson               0
TicketGroupSize             0
TicketGroupFarePerPerson    0
AgeBand                     0
FareBand                    0
Sex_Pclass                  0
Sex_AgeBand                 0
Embarked_Pclass             0
Title_AgeBand               0
Pclass_FareBand             0
Survived                    0
dtype: int64

In [41]:
# Columns we want to inspect
cols_to_show = [
    "Survived",         # target, shown next to the features
    "Pclass", "Sex", "Age", "AgeBand",
    "Fare", "FareBand", "FamilySize", "IsAlone",
    "FarePerPerson", "TicketGroupSize",
    "CabinDeck", "CabinKnown",
    "AgeMissing", "FareMissing",
    "Sex_Pclass", "Title_AgeBand"
]

# Filter to only those columns that actually exist
cols_present = [c for c in cols_to_show if c in X_train_fe.columns]

# Display first few rows
X_train_fe[cols_present].head()

| | Survived | Pclass | Sex | Age | AgeBand | Fare | FareBand | FamilySize | IsAlone | FarePerPerson | TicketGroupSize | CabinDeck | CabinKnown | AgeMissing | FareMissing | Sex_Pclass | Title_AgeBand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | (20.0, 25.0] | 7.25 | (-0.001, 7.854] | 2 | 0 | 3.625 | 855 | Unknown | 0 | 0 | 0 | male_3 | Mr_(20.0, 25.0] |
| 1 | 1 | 1 | female | 38.0 | (30.0, 40.0] | 71.2833 | (39.688, 512.329] | 2 | 0 | 35.64165 | 855 | C | 1 | 0 | 0 | female_1 | Mrs_(30.0, 40.0] |
| 2 | 1 | 3 | female | 26.0 | (25.0, 30.0] | 7.925 | (7.854, 10.5] | 1 | 1 | 7.925 | 855 | Unknown | 0 | 0 | 0 | female_3 | Miss_(25.0, 30.0] |
| 3 | 1 | 1 | female | 35.0 | (30.0, 40.0] | 53.1 | (39.688, 512.329] | 2 | 0 | 26.55 | 855 | C | 1 | 0 | 0 | female_1 | Mrs_(30.0, 40.0] |
| 4 | 0 | 3 | male | 35.0 | (30.0, 40.0] | 8.05 | (7.854, 10.5] | 1 | 1 | 8.05 | 855 | Unknown | 0 | 0 | 0 | male_3 | Mr_(30.0, 40.0] |


In [42]:
miss_train = X_train_fe.isna().sum().rename("train_missing")
miss_test  = X_test_fe.isna().sum().rename("test_missing")

pd.concat([miss_train, miss_test], axis=1).sort_values("train_missing", ascending=False)

| | train_missing | test_missing |
|---|---|---|
| PassengerId | 0 | 0.0 |
| Pclass | 0 | 0.0 |
| Sex | 0 | 0.0 |
| Age | 0 | 0.0 |
| SibSp | 0 | 0.0 |
| Parch | 0 | 0.0 |
| Fare | 0 | 0.0 |
| Embarked | 0 | 0.0 |
| Title | 0 | 0.0 |
| AgeMissing | 0 | 0.0 |


### Save Engineered Features

We now save:

- `train_features.csv` (including `Survived`)
- `test_features.csv`

These files will be the starting point for our modeling notebooks.

In [46]:
# ============================================
# SAVE ENGINEERED TRAIN + TEST FEATURES
# ============================================

PROCESSED_DIR = "../data/processed/"

train_path = PROCESSED_DIR + "train_features.csv"
test_path  = PROCESSED_DIR + "test_features.csv"

# Save engineered features
X_train_fe.to_csv(train_path, index=False)
X_test_fe.to_csv(test_path, index=False)

print("Saved feature files:")
print(" -", train_path)
print(" -", test_path)

Saved feature files:
 - ../data/processed/train_features.csv
 - ../data/processed/test_features.csv


In [48]:
# ============================================
# SAVE PROCESSED METADATA
# ============================================

# Identify categorical columns in the final engineered train dataset
cat_cols = X_train_fe.select_dtypes(include=["object", "category"]).columns

# Convert categories to strings so they are JSON serializable
processed_meta = {}

for col in cat_cols:
    cats = X_train_fe[col].astype("category").cat.categories
    # Convert Interval objects (or any objects) to strings
    cats_as_str = [str(c) for c in cats]
    processed_meta[col] = cats_as_str

meta_path = PROCESSED_DIR + "processed_metadata.json"

with open(meta_path, "w") as f:
    json.dump(processed_meta, f, indent=4)

print("Saved processed metadata to:", meta_path)

Saved processed metadata to: ../data/processed/processed_metadata.json
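Since CSV files do not preserve categorical dtypes, a notebook that reads the feature files back will see these columns as plain strings. One way to restore the schema is to re-apply the saved category lists, sketched below. The inline CSV text and `meta` dict are toy stand-ins for `train_features.csv` and `processed_metadata.json`, assuming the metadata maps each column name to a list of category strings as written above.

```python
import io

import pandas as pd

# Toy stand-ins for train_features.csv and processed_metadata.json
csv_text = '''Embarked,AgeBand
S,"(20.0, 25.0]"
C,"(30.0, 40.0]"
'''
meta = {
    "Embarked": ["C", "Q", "S"],
    "AgeBand": ["(20.0, 25.0]", "(25.0, 30.0]", "(30.0, 40.0]"],
}

# Categorical columns come back from CSV as plain strings
df = pd.read_csv(io.StringIO(csv_text))

# Re-apply the saved category lists, keeping their stored order
for col, cats in meta.items():
    if col in df.columns:
        df[col] = pd.Categorical(df[col], categories=cats)

print(df.dtypes)
```

The bands are restored as string labels, not `Interval` objects, which is usually sufficient for encoding but does not support interval arithmetic.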


### Summary

We have created a **model-agnostic feature set** with:

- Core cleaned features
- Domain-informed engineered features:
  - FamilySize, IsAlone
  - LogFare, FarePerPerson
  - TicketGroupSize, TicketGroupFarePerPerson
- Binned variables:
  - AgeBand, FareBand
- Missingness indicators:
  - AgeMissing, FareMissing, CabinKnown
- Interaction features:
  - Sex_Pclass
  - Sex_AgeBand
  - Embarked_Pclass
  - Title_AgeBand
  - Pclass_FareBand

Next step: use `train_features.csv` and `test_features.csv` in a **modeling notebook** to train and compare different algorithms (Logistic Regression, RandomForest, Gradient Boosting, XGBoost, etc.) on a consistent feature space.
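The EDA takeaways call for normalizing numeric variables, while the saved feature files keep them on their original scale. A modeling notebook could handle this by learning the scaling statistics on train only and applying them to both sets. The sketch below is one possible approach in plain pandas; the helper names `fit_standardizer` and `apply_standardizer` are hypothetical and the values are toy data.

```python
import pandas as pd


def fit_standardizer(train_df: pd.DataFrame, columns) -> pd.DataFrame:
    """Learn per-column mean and standard deviation on training data only."""
    return train_df[columns].agg(["mean", "std"])


def apply_standardizer(df: pd.DataFrame, stats: pd.DataFrame) -> pd.DataFrame:
    """Scale the columns in `stats` to zero mean / unit variance (train scale)."""
    out = df.copy()
    for col in stats.columns:
        out[col] = (out[col] - stats.loc["mean", col]) / stats.loc["std", col]
    return out


# Toy numeric features
train_demo = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "LogFare": [2.0, 3.0, 4.0]})
test_demo = pd.DataFrame({"Age": [30.0, 50.0], "LogFare": [3.0, 5.0]})

stats = fit_standardizer(train_demo, ["Age", "LogFare"])
train_scaled = apply_standardizer(train_demo, stats)
test_scaled = apply_standardizer(test_demo, stats)

print(test_scaled)
```

Fitting on train only mirrors how the bin edges are handled in this notebook: nothing about the test distribution leaks into the transformation.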