# Titanic Dataset: Data Inspection and Missing Value Handling
**Author:** Apeksha Shenoy Mangalpady
**Purpose:** Load and inspect Titanic dataset, identify missing values, and prepare initial features.

### Missing Data Handling Decisions

- **Age:** The initial plan was to fill missing values with the median grouped by Pclass and Sex. The approach adopted later in this notebook uses the median age per Title instead.
- **Cabin:** Too many missing values to impute; instead, create a binary `HasCabin` feature indicating whether a cabin is recorded.
- **Embarked:** Fill missing values with the mode (most frequent port), since only 2 rows are missing.
- **Fare:** Fill the single missing value in the test set with the median, which is robust to outliers.


In [37]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys
sys.path.append(str(Path('..').resolve())) # Add parent directory to sys.path so Python can find feature_engineering.py
from feature_engineering import engineer_features

### Load Raw Data

In [38]:
# Load the original Titanic training dataset. It contains missing values and
# categorical variables that must be processed before modeling.

train_data = pd.read_csv('D:/CAPSTONE_PROJECT/Titanic_Capstone_Project/Data/train.csv')

### Initial Data Inspection

We inspect the structure of the dataset to understand:

- Data types
- Missing values
- Distribution of numerical features
- Cardinality of categorical features

This step helps determine how missing values and categorical variables should be handled.

In [39]:
# Check basic info: columns, types, non-null counts

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [40]:
# Quick look at first few rows

train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Check for missing values

In [41]:
# Count missing values per column

print("Train missing values:\n", train_data.isnull().sum())


Train missing values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


### Missing Values Analysis

- **Train set:** Age (177), Cabin (687), Embarked (2)
- **Test set:** Age (86), Fare (1), Cabin (327)

**From inspection:**

- Age contains many missing values in both train and test sets.
- Cabin has a large number of missing values.
- Embarked has only 2 missing values in the training set.
- Fare has 1 missing value in the test set.

**Handling Strategy:**

- Age: Imputed with the median age per Title (the approach adopted below); the median grouped by Pclass and Sex was the initial approach.
- Cabin: Instead of imputing, create a binary feature HasCabin.
- Embarked: Filled with the mode (most frequent value).
- Fare (test set): Filled with the median.

This ensures no missing values remain in the modeling features.
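This notebook loads only the training file, so the Fare step for the test set is not shown. A minimal sketch on toy data, where `test_toy` is a hypothetical stand-in for the test set and the fares are illustrative values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test set (illustrative fares, one missing)
test_toy = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 71.28, 512.33, np.nan]})

# Compare the two candidate fill values: the single extreme fare pulls the
# mean far upward, while the median stays at a typical value
fare_mean = test_toy["Fare"].mean()
fare_median = test_toy["Fare"].median()

# Fill the missing fare with the median
test_toy["Fare"] = test_toy["Fare"].fillna(fare_median)
```

One possible design choice is to take the median from the training set rather than from the test set itself, so that the same fill value is reused at prediction time.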


In [42]:
# Numerical features (not displayed below: a notebook cell shows only its last expression)

train_data.describe()

# Categorical features

train_data.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Dooley, Mr. Patrick",male,347082,G6,S
freq,1,577,7,4,644


### Observations
- Age has many missing values, needs imputation.
- Cabin has too many missing values to use directly; consider HasCabin feature.
- Embarked has 2 missing values; can be filled with mode.
- Fare has 1 missing value in test set; fill with median.
- Family-related columns (SibSp, Parch) can be used to create FamilySize and IsAlone.
- Name can be used to extract Title for social status feature.


### Handling missing values in column Embarked:

Only two values are missing in the training set. These are filled with the most frequent port of embarkation (mode), as this introduces minimal bias into the dataset.

Port codes:

- C = Cherbourg
- Q = Queenstown
- S = Southampton

After imputing the missing values, the Embarked column is converted into numeric format for use in the model.

In [43]:
# Fill missing values with mode (most frequent port)
train_data.fillna({'Embarked': train_data['Embarked'].mode()[0]}, inplace=True)

embarked_mapping = {'S': 1, 'C': 2, 'Q': 3}
train_data['Embarked'] = train_data['Embarked'].map(embarked_mapping)

### Handling missing values in column Age:

### Initial approach:

Age is an important feature and contains many missing values.

Instead of filling with a global median, we compute the median Age grouped by:

- Passenger Class (Pclass)
- Sex

This preserves realistic age differences across social class and gender.

In [44]:
# # Fill Age missing values using median grouped by Pclass and Sex
# train_data['Age'] = (
#     train_data['Age']
#     .fillna(
#         train_data.groupby(['Pclass', 'Sex'])['Age'].transform('median')
#     )
# )

# test_data['Age'] = (
#     test_data['Age']
#     .fillna(
#         test_data.groupby(['Pclass', 'Sex'])['Age'].transform('median')
#     )
# )
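The commented-out cell above relies on `groupby(...).transform('median')`, which returns a series aligned with the original rows so it can be passed straight to `fillna`. A small sketch of that mechanism on toy data (the rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows: one missing Age in each (Pclass, Sex) group
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, 26.0, np.nan],
})

# transform('median') broadcasts each group's median back to every row of that group
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched
toy["Age"] = toy["Age"].fillna(group_median)
```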

### Title Extraction and Age Handling Strategy Adopted:

In this section, we extract passenger titles from the Name column and use them to improve age imputation.

Titles often correlate with age groups (for example, Master typically represents young boys, while Mrs represents adult women). Instead of filling missing age values using a global median, we impute missing Age values using the median age within each Title group.

Steps performed:

1. Extract Title from Name
2. Standardize equivalent titles (e.g., Mlle → Miss)
3. Group the remaining rare titles into a single category (Rare)
4. Fill missing Age values using median age per Title

This approach provides more meaningful age estimates compared to overall median imputation.

In [45]:
# Implementation moved to feature_engineering.py
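The contents of `feature_engineering.py` are not reproduced in this notebook. As a rough sketch of the steps described above, assuming a regex-based title extraction and a hypothetical list of common titles (`common_titles`), the logic could look like this on invented toy rows:

```python
import numpy as np
import pandas as pd

# Invented rows in the dataset's "Surname, Title. Given names" format
toy = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mr. Richard",
        "Poe, Miss. Anna",
        "Low, Mlle. Claire",
        "Kay, Master. Tom",
        "Lee, Master. Sam",
        "Fox, Don. Miguel",
    ],
    "Age": [22.0, np.nan, 26.0, np.nan, 2.0, np.nan, 40.0],
})

# The title is the text between the comma and the first period
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Standardize equivalents before grouping, otherwise Mlle/Mme would end up in Rare
toy["Title"] = toy["Title"].replace({"Mlle": "Miss", "Mme": "Mrs"})

# Assumption: everything outside this list is treated as rare
common_titles = ["Mr", "Mrs", "Miss", "Master"]
toy["Title"] = toy["Title"].where(toy["Title"].isin(common_titles), "Rare")

# Fill missing ages with the median age of the passenger's Title group
toy["Age"] = toy["Age"].fillna(toy.groupby("Title")["Age"].transform("median"))
```

A title group whose ages are all missing would still be left with NaN under this scheme, so a fallback to the overall median may be needed in practice.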

### Feature Engineering Function

The `engineer_features` function creates new features and transforms existing ones to improve the dataset for modeling:

1. **Title Extraction**  
   - Extracts titles from the `Name` column (e.g., Mr, Mrs, Miss).  
   - Standardizes similar titles (`Mlle` → `Miss`, `Mme` → `Mrs`).  
   - Groups rare titles into a single category `Rare`.

2. **Age Binning**  
   - Categorizes `Age` into groups: `Child`, `Teen`, `YoungAdult`, `Adult`, `Senior`.

3. **Fare Binning**  
   - Splits `Fare` into quartiles and labels them: `Low`, `Mid`, `High`, `VeryHigh`.

4. **Family Features**  
   - `FamilySize`: total family members aboard (SibSp + Parch + 1).  
   - `IsAlone`: binary flag indicating whether the passenger is alone.

This function returns the modified DataFrame with these additional features.
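Since `engineer_features` itself is not shown here, the following is only a sketch of how the binning and family features might be built. The age bin edges match the commented-out `pd.cut` call later in this notebook; the use of `pd.qcut` for fares and the toy values are assumptions.

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    "Age": [2.0, 14.0, 22.0, 35.0, 54.0, 70.0, 28.0, 40.0],
    "Fare": [7.25, 7.925, 8.05, 21.075, 30.07, 51.86, 53.1, 71.28],
    "SibSp": [3, 1, 1, 0, 0, 0, 1, 1],
    "Parch": [1, 0, 0, 0, 0, 0, 0, 0],
})

# Age binning: intervals are right-closed, so 35 falls into YoungAdult (18, 35]
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "YoungAdult", "Adult", "Senior"],
    include_lowest=True,
)

# Fare binning: qcut derives the bin edges from the quartiles of the data
toy["FareGroup"] = pd.qcut(toy["Fare"], q=4, labels=["Low", "Mid", "High", "VeryHigh"])

# Family features: the +1 counts the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
```

Because `pd.qcut` computes its edges from whatever data it is given, the same fare can land in different groups in the train and test sets unless the training edges are reused.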

In [46]:
# Apply feature engineering to the train data
train_data = engineer_features(train_data)

print(train_data.columns)

# Fill missing Age using median age per Title 
#train_data['Age'] = train_data.groupby('Title')['Age'].transform(lambda x: x.fillna(x.median()))

# # Create AgeGroup after filling missing Age values
# train_data['AgeGroup'] = pd.cut(
#     train_data['Age'],
#     bins=[0,12,18,35,60,100],
#     labels=['Child','Teen','YoungAdult','Adult','Senior'],
#     include_lowest=True
# )

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title', 'AgeGroup',
       'FareGroup', 'FamilySize', 'IsAlone'],
      dtype='object')


### Encoding column Sex:

Machine learning models require numerical input. The Sex column is categorical (male, female), so we convert it into a numerical format.

We map:

- male → 0
- female → 1

This binary encoding preserves the information while making the feature usable for model training.

In [47]:
# Map Sex to numeric: male=0, female=1
sex_mapping = {'male': 0, 'female': 1}
train_data['Sex'] = train_data['Sex'].map(sex_mapping)

### Adding HasCabin:
Cabin information is largely missing, so instead of imputing it we create a binary feature: 1 if a Cabin value exists, else 0.

In [48]:
# Create binary feature HasCabin
train_data['HasCabin'] = train_data['Cabin'].notnull().astype(int)

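### Verify missing values are gone:

Cabin still reports 687 missing values at this point; the raw column is dropped in a later step and is represented by HasCabin.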

In [49]:
# Look for any remaining missing values after processing
print("Train missing values after processing:\n", train_data.isnull().sum())

Train missing values after processing:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Title            0
AgeGroup         0
FareGroup        0
FamilySize       0
IsAlone          0
HasCabin         0
dtype: int64
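Reading the printed counts by eye works for a dataset of this size, but the same check can be automated. A small sketch, where `columns_with_missing` is a hypothetical helper (not part of the project code) and the raw Cabin column is ignored because it is expected to contain gaps until it is dropped:

```python
import numpy as np
import pandas as pd

def columns_with_missing(df, ignore=()):
    """Return {column: missing count} for columns that still contain NaN, skipping `ignore`."""
    counts = df.drop(columns=list(ignore)).isnull().sum()
    return counts[counts > 0].to_dict()

# Toy frame: only the raw Cabin column has a gap
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Cabin": [np.nan, "C85"],
    "HasCabin": [0, 1],
})

all_missing = columns_with_missing(toy)
remaining = columns_with_missing(toy, ignore=["Cabin"])

# Fail loudly if any column other than Cabin still has missing values
assert not remaining, f"Unexpected missing values: {remaining}"
```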


In [50]:
# Check the data after feature engineering
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgeGroup,FareGroup,FamilySize,IsAlone,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1,Mr,YoungAdult,Low,2,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,2,Mrs,Adult,VeryHigh,2,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1,Miss,YoungAdult,Mid,1,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1,Mrs,YoungAdult,VeryHigh,2,0,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1,Mr,YoungAdult,Mid,1,1,0


### Drop columns not used for modeling
Columns like PassengerId, Name, Ticket, and Cabin are dropped because:
- PassengerId: unique identifier, not useful as a feature
- Name: information extracted into Title
- Ticket: high cardinality, not used in current model
- Cabin: mostly missing, represented via HasCabin instead

In [51]:
train_data.drop(
    columns=["PassengerId", "Name", "Ticket", "Cabin"],
    inplace=True
)

print(train_data.columns)

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked', 'Title', 'AgeGroup', 'FareGroup', 'FamilySize', 'IsAlone',
       'HasCabin'],
      dtype='object')


In [52]:
# Check the final processed train data after feature engineering and dropping unnecessary columns
train_data.head(10)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,AgeGroup,FareGroup,FamilySize,IsAlone,HasCabin
0,0,3,0,22.0,1,0,7.25,1,Mr,YoungAdult,Low,2,0,0
1,1,1,1,38.0,1,0,71.2833,2,Mrs,Adult,VeryHigh,2,0,1
2,1,3,1,26.0,0,0,7.925,1,Miss,YoungAdult,Mid,1,1,0
3,1,1,1,35.0,1,0,53.1,1,Mrs,YoungAdult,VeryHigh,2,0,1
4,0,3,0,35.0,0,0,8.05,1,Mr,YoungAdult,Mid,1,1,0
5,0,3,0,30.0,0,0,8.4583,3,Mr,YoungAdult,Mid,1,1,0
6,0,1,0,54.0,0,0,51.8625,1,Mr,Adult,VeryHigh,1,1,1
7,0,3,0,2.0,3,1,21.075,1,Master,Child,High,5,0,0
8,1,3,1,27.0,0,2,11.1333,1,Mrs,YoungAdult,Mid,3,0,0
9,1,2,1,14.0,1,0,30.0708,2,Mrs,Teen,High,2,0,0


In [53]:
train_data['AgeGroup'].isnull().sum()

np.int64(0)

In [54]:
# Save the processed train data for later use in modeling
train_data.to_csv('../../Data/train_processed.csv', index=False)

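### Dataset ready for modeling
Final shape: (891, 14)
All missing values are handled, Sex and Embarked are numerically encoded, and engineered features are added. Title, AgeGroup, and FareGroup still hold label values (e.g., Mr, YoungAdult, Low).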