# 🧪 Feature Engineering

This notebook focuses on transforming and enriching the Titanic dataset  
to create more informative features for machine learning models.

---

## 🎯 Purpose

To apply feature engineering techniques such as encoding, binning,  
and feature construction to improve model performance and interpretability.


## 📦 Dataset

Same dataset as the previous notebooks:  
[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)  
via public repository: [Data Science Dojo GitHub](https://github.com/datasciencedojo/datasets)


## 📦 1. Load the Dataset & Set Variables

In [130]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid')

# Load Titanic dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

### Handle Missing Values (from previous notebook)

In [131]:
# Embarked: fill with mode
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

# Cabin: extract first letter, fill missing as 'U'
df['CabinSection'] = df['Cabin'].str[0].fillna('U')

# Age: fill with median by Pclass and Sex
df['Age'] = df.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda x: x.fillna(x.median())
)
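
To make the group-wise imputation easier to follow, here is a minimal sketch on a toy frame. The values are hypothetical and not taken from the Titanic file; only the `groupby(...).transform(...)` pattern mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, np.nan, 40.0, 20.0, 22.0, np.nan],
})

# Fill each missing Age with the median of its own (Pclass, Sex) group
toy["AgeFilled"] = toy.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda s: s.fillna(s.median())
)

print(toy)
```

One caveat worth keeping in mind: if every `Age` in a group is missing, the group median is itself `NaN` and those rows stay unfilled, so a check with `isna().sum()` after imputation is a reasonable precaution.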

## 🧾 2. Encoding Categorical Variables

Most machine learning models require numeric input,<br>
so categorical variables must be converted into numerical format first.<br>
In this step, we encode the following categorical features:
- Sex (male, female)

- Embarked (C, Q, S)

- CabinSection (first letter of Cabin, including 'U' for unknown)

- Pclass (1, 2, 3 — treated as categorical)

### 2-1. Sex (Label Encoding)

<small>
Sex is converted to 0 for male and 1 for female.
</small>

In [132]:
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
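
A dictionary-based `map` silently turns any value that is not a key into `NaN`. The sketch below uses a hypothetical messy input (the `"Female"` and `None` entries are made up for illustration) to show how such cases can be counted after encoding.

```python
import pandas as pd

# Hypothetical input with an unexpected spelling and a missing value
sex = pd.Series(["male", "female", "Female", None])

# Values that are not keys of the mapping become NaN
encoded = sex.map({"male": 0, "female": 1})

# Count how many rows failed to map
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

If the count is zero, as it should be for the Titanic `Sex` column, the encoded column can safely be cast to an integer type.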

### 2-2. Embarked (One-Hot Encoding)

<small>
The embarkation port is categorized as C, Q, or S, and is split into separate dummy variables for each category.
</small>

In [133]:
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df, embarked_dummies], axis=1)
df.drop('Embarked', axis=1, inplace=True)

### 2-3. CabinSection (One-Hot Encoding)

<small>
CabinSection, the first letter of the cabin code, takes the values A to G and T, plus U for missing cabins, and is converted into dummy variables.
</small>

In [134]:
cabin_dummies = pd.get_dummies(df['CabinSection'], prefix='Cabin')
df = pd.concat([df, cabin_dummies], axis=1)
df.drop('CabinSection', axis=1, inplace=True)

### 2-4. Pclass (One-Hot Encoding)

<small>
Although Pclass is stored as an integer, it is an ordinal categorical variable. We apply one-hot encoding so that the model does not have to assume equal spacing between the classes.
</small>

In [135]:
pclass_dummies = pd.get_dummies(df['Pclass'], prefix='Pclass')
df = pd.concat([df, pclass_dummies], axis=1)
df.drop('Pclass', axis=1, inplace=True)
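
The concat-and-drop pattern used in the three cells above can also be written as a single call. This is a minimal sketch on a toy frame with hypothetical rows, not a replacement for the cells above; it relies on the `columns` and `dtype` parameters of `pd.get_dummies`.

```python
import pandas as pd

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Pclass": [3, 1, 2, 3],
})

# Encode several columns at once; the originals are dropped automatically
# and each column name is used as the dummy prefix
encoded = pd.get_dummies(toy, columns=["Embarked", "Pclass"], dtype=int)

print(encoded.columns.tolist())
```

Passing `dtype=int` keeps the dummies as 0/1 integers regardless of the pandas version. For linear models, `drop_first=True` is sometimes added to avoid perfectly collinear dummy columns; whether that helps depends on the model used later.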

## 🧱 3. Construct New Features

Feature engineering helps uncover useful patterns by transforming or combining existing data.<br>
In this section, we construct simple but meaningful features that can support model performance.

- FamilySize: Total family members aboard (SibSp + Parch + self)

- IsAlone: Whether the passenger was traveling alone

- Title: Extracted honorific from Name (e.g., Mr, Miss)

- LowFare: Indicates low-cost ticket (Fare < 50)

- AgeBin: Age grouped into bins for categorical modeling

### 3-1. FamilySize

In [136]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

Indicates the total number of family members on board, including the passenger.<br>
Having family on board may influence survival chances.

### 3-2. IsAlone

In [137]:
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

A binary indicator of whether the passenger was traveling alone.<br>
Previous analysis showed that people traveling alone tended to have lower survival rates.

### 3-3. Title (from Name)

In [138]:
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

Extracts social titles (e.g., Mr, Miss, Dr) from the passenger's name.<br>
This feature may reflect gender, age group, or even social class or occupation.
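
To see what the regular expression actually captures, here is a small sketch on a few names written in the `Surname, Title. Given names` layout used by the dataset. The pattern looks for a run of letters that follows a space and is immediately followed by a period.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the letters between a space and the first following period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(titles.tolist())
```

Because only the first match is returned, multi-word forms such as "the Countess" reduce to the single word before the period. Running `value_counts()` on the extracted column is a quick way to spot rare titles before modeling.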

### 3-4. LowFare

In [139]:
df['LowFare'] = (df['Fare'] < 50).astype(int)

Indicates whether the passenger purchased a relatively low-cost ticket.<br>
Lower fares are generally associated with third-class passengers and lower survival rates.

### 3-5. AgeBin

In [140]:
df['AgeBin'] = pd.cut(df['Age'], 5, labels=False)

Bins the continuous Age variable into equal-width intervals.<br>
This allows the model to capture nonlinear relationships between age groups and survival.
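
Equal-width bins can end up with very uneven row counts when the data is skewed. The sketch below, on hypothetical ages, compares `pd.cut` (equal width, as used above) with `pd.qcut` (equal frequency) and uses `retbins=True` to expose the bin edges. `pd.qcut` is shown only as a point of comparison, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([2, 10, 18, 25, 31, 40, 47, 55, 63, 80], dtype=float)

# Equal-width: 5 intervals of the same length between min and max
width_codes, width_edges = pd.cut(ages, 5, labels=False, retbins=True)

# Equal-frequency: 5 intervals holding roughly the same number of rows
freq_codes, freq_edges = pd.qcut(ages, 5, labels=False, retbins=True)

print("equal-width edges:", width_edges.round(1))
print("rows per equal-width bin:", width_codes.value_counts().sort_index().tolist())
print("rows per equal-frequency bin:", freq_codes.value_counts().sort_index().tolist())
```

Inspecting the edges makes the integer codes in `AgeBin` interpretable, since `labels=False` discards the interval labels.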

## 🧬 4. Advanced Feature Construction (based on EDA insights)

These features are designed based on patterns observed during exploratory data analysis (EDA).<br>
They aim to capture more nuanced relationships that may improve prediction performance.

### 4-1. ModerateFamily

In [141]:
df['ModerateFamily'] = df['FamilySize'].apply(lambda x: 1 if 2 <= x <= 4 else 0)

In our previous analysis, survival rates varied significantly with family size:

- Solo travelers (FamilySize = 1) had relatively low survival rates.

- Passengers with 2 to 4 family members had the highest survival likelihood.

- Those with larger families (5 or more) tended to have lower survival, possibly due to crowding or mobility challenges.

This feature highlights the "moderate family" group, which seems to benefit from both emotional support and manageable group size during evacuation.
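
The same flag can be computed without a row-wise lambda. This sketch uses `Series.between`, which is inclusive on both ends by default, on a few hypothetical `FamilySize` values and confirms that it agrees with the `apply` version.

```python
import pandas as pd

# Hypothetical FamilySize values
family_size = pd.Series([1, 2, 4, 5, 11])

# Vectorized version: between(2, 4) matches 2 <= x <= 4
moderate = family_size.between(2, 4).astype(int)

# Row-wise version, as in the cell above
via_apply = family_size.apply(lambda x: 1 if 2 <= x <= 4 else 0)

print(moderate.tolist())
print(moderate.tolist() == via_apply.tolist())
```

On a dataset of this size the speed difference is negligible; the vectorized form is mainly shorter and easier to read.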

### 4-2. IsCherbourg

In [142]:
df['IsCherbourg'] = (df['Embarked_C'] == 1).astype(int)

Our EDA showed that passengers who boarded from Cherbourg (C)
had a noticeably higher survival rate compared to other embarkation points (S, Q).

Although the exact reason is unclear, the trend is strong enough to suggest this location
might be associated with more favorable conditions, making it a useful signal for modeling.

This feature flags Cherbourg boarders as a potentially advantaged group.
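
One point worth checking is how this flag relates to the existing `Embarked_C` dummy. The sketch below, on hypothetical embarkation values, suggests the two columns carry identical information, so `IsCherbourg` acts as a more descriptive alias rather than a new signal.

```python
import pandas as pd

# Hypothetical embarkation ports
embarked = pd.Series(["S", "C", "Q", "C"], name="Embarked")
dummies = pd.get_dummies(embarked, prefix="Embarked", dtype=int)

# Same construction as the cell above
is_cherbourg = (dummies["Embarked_C"] == 1).astype(int)

# Compare the flag with the dummy column it was derived from
identical = bool((is_cherbourg == dummies["Embarked_C"]).all())
print(identical)
```

Tree-based models are generally unaffected by a duplicated column, while unregularized linear models can be sensitive to it, so one of the two columns may need to be dropped depending on the model chosen later.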

### 4-3. FemaleFirstSecondClass

In [143]:
df['FemaleFirstSecondClass'] = ((df['Sex'] == 1) & ((df['Pclass_1'] == 1) | (df['Pclass_2'] == 1))).astype(int)

Our EDA revealed that females in first or second class had the highest survival rates among all groups.<br>
This likely reflects the "women and children first" evacuation protocol combined with better access to lifeboats in higher classes.

This feature explicitly identifies passengers who belonged to this high-priority rescue group,<br>
which models might otherwise fail to capture through independent variables alone.
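
The flag above is an interaction term: it equals the product of the encoded `Sex` column and the sum of the two class dummies. The sketch below shows that equivalence on a toy frame with hypothetical rows. A purely additive model has no such product available unless it is supplied as a feature.

```python
import pandas as pd

# Toy frame (hypothetical rows); Sex uses the 0 = male, 1 = female encoding
toy = pd.DataFrame({
    "Sex":      [1, 1, 0, 1, 0],
    "Pclass_1": [1, 0, 1, 0, 0],
    "Pclass_2": [0, 1, 0, 0, 1],
    "Pclass_3": [0, 0, 0, 1, 0],
})

# Boolean construction, as in the cell above
flag = (
    (toy["Sex"] == 1) & ((toy["Pclass_1"] == 1) | (toy["Pclass_2"] == 1))
).astype(int)

# Equivalent arithmetic form: female AND (first OR second class)
product = toy["Sex"] * (toy["Pclass_1"] + toy["Pclass_2"])

print(flag.tolist())
print(flag.tolist() == product.tolist())
```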

### 4-4. IsChildOrElderly

In [144]:
df['IsChildOrElderly'] = ((df['Age'] < 10) | (df['Age'] >= 60)).astype(int)

In our earlier analysis, children under 10 and passengers aged 60 or older<br>
stood out with distinctly different survival patterns compared to the general population:

- Young children were more likely to survive, possibly due to rescue priority.

- Elderly passengers had much lower survival rates, potentially due to mobility or health.

This feature highlights passengers in these extreme age groups, helping the model identify vulnerability or protection factors<br> that may not be captured by raw age or age bins alone.
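
Because the two age groups move in opposite directions (children survived more often, elderly passengers less often), a single combined flag mixes two opposing effects. As an alternative sketch, not the approach taken in this notebook, the groups can be kept as two separate flags. The names `is_child` and `is_elderly` and the ages below are hypothetical.

```python
import pandas as pd

# Hypothetical ages, including the boundary values 10 and 60
age = pd.Series([4.0, 9.9, 10.0, 35.0, 59.9, 60.0, 71.0])

# Separate flags keep the two effects distinguishable
is_child = (age < 10).astype(int)
is_elderly = (age >= 60).astype(int)

# Combined flag, as in the cell above
combined = ((age < 10) | (age >= 60)).astype(int)

print(is_child.tolist())
print(is_elderly.tolist())
print(combined.tolist())
```

The two groups never overlap, so the combined flag is simply the sum of the separate ones. Whether the split helps is something to test during modeling.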

### 4-5. LowFare_3rdClass

In [145]:
df['LowFare_3rdClass'] = ((df['LowFare'] == 1) & (df['Pclass_3'] == 1)).astype(int)

During EDA, we observed that third-class passengers with low fares
tended to have the lowest survival rates across the dataset.

This feature combines two key disadvantage indicators:

- A low-cost ticket (Fare < 50)

- A third-class ticket (Pclass_3)

By creating this interaction, we help the model explicitly recognize
a group that faced particularly unfavorable conditions during the disaster.

### ✅ Final Check: Preview the Engineered Data

In [146]:
# Check total number of columns
print(f"Total columns: {df.shape[1]}")

# List all column names
print(df.columns.tolist())

# Preview selected engineered features only
df[['FamilySize', 'IsAlone', 'Title', 'LowFare', 'AgeBin',
    'ModerateFamily', 'IsCherbourg', 'FemaleFirstSecondClass',
    'IsChildOrElderly', 'LowFare_3rdClass']].head()

Total columns: 35
['PassengerId', 'Survived', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Cabin_A', 'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_T', 'Cabin_U', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'FamilySize', 'IsAlone', 'Title', 'LowFare', 'AgeBin', 'ModerateFamily', 'IsCherbourg', 'FemaleFirstSecondClass', 'IsChildOrElderly', 'LowFare_3rdClass']


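| | FamilySize | IsAlone | Title | LowFare | AgeBin | ModerateFamily | IsCherbourg | FemaleFirstSecondClass | IsChildOrElderly | LowFare_3rdClass |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | Mr | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 0 | Mrs | 0 | 2 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | Miss | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 2 | 0 | Mrs | 0 | 2 | 1 | 0 | 1 | 0 | 0 |
| 4 | 1 | 1 | Mr | 1 | 2 | 0 | 0 | 0 | 0 | 1 |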
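
Beyond eyeballing the preview, a small automated check can confirm that every binary flag really contains only 0 and 1. The helper `check_binary_flags` below is a hypothetical name introduced for this sketch, and the toy frame deliberately contains two bad columns to show what the check reports.

```python
import pandas as pd

def check_binary_flags(frame, columns):
    """Return the names of columns that contain anything other than 0 or 1."""
    bad = []
    for col in columns:
        # NaN is not in [0, 1], so missing values are reported as well
        if not frame[col].isin([0, 1]).all():
            bad.append(col)
    return bad

# Toy frame (hypothetical): LowFare has a stray 2, IsCherbourg has a missing value
toy = pd.DataFrame({
    "IsAlone": [0, 1, 1],
    "LowFare": [1, 0, 2],
    "IsCherbourg": [0, None, 1],
})

problems = check_binary_flags(toy, ["IsAlone", "LowFare", "IsCherbourg"])
print(problems)
```

Applied to the engineered frame, the same call with the list of flag columns should return an empty list.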


## 🧠 Summary

In this notebook, we focused on feature engineering techniques to enhance the predictive power of the Titanic dataset.

We completed the following steps:

- **Handled missing values** using context-aware imputation strategies
- **Encoded categorical variables**: `Sex`, `Embarked`, `CabinSection`, and `Pclass`
- **Constructed base features** like:
  - `FamilySize` and `IsAlone` for capturing social structure
  - `Title` extracted from names
  - `LowFare` and `AgeBin` to simplify numerical distributions
- **Designed advanced features** based on EDA insights:
  - `ModerateFamily`: optimal survival group by family size
  - `IsCherbourg`: flagged high-survival embarkation point
  - `FemaleFirstSecondClass`: identified priority rescue group
  - `IsChildOrElderly`: grouped vulnerable and protected age segments
  - `LowFare_3rdClass`: marked a high-risk economic segment

These transformations reflect both domain knowledge and data-driven insights,<br>preparing the dataset for effective modeling in the next stage.