# Titanic Dataset: Data Inspection and Missing Value Handling
**Author:** Apeksha Shenoy Mangalpady
**Purpose:** Load and inspect Titanic dataset, identify missing values, and prepare initial features.

### Missing Data Handling Decisions

- **Age:** Fill missing values using the median grouped by Pclass and Sex. This preserves the age distribution across passenger classes and sexes. In this notebook the grouped-median code is left commented out (cell 8), and Age in the train set is instead imputed by Title median (cell 29).
- **Cabin:** Too many missing values to use directly; instead, create a binary `HasCabin` feature indicating whether a cabin is recorded.
- **Embarked:** Fill missing values with the mode (most frequent port), because only 2 train rows are missing.
- **Fare:** Fill the single missing value in the test set with the median, which is robust to outliers.
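
The grouped-median rule described for Age can be illustrated on a small toy frame. This is only a sketch: the values below are made up and are not taken from the Titanic data.

```python
import pandas as pd
import numpy as np

# Toy frame with made-up values (not taken from the Titanic data)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 26.0],
})

# Median Age of each (Pclass, Sex) group, aligned to the original rows
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; observed ages stay untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Each missing age takes the median of its own group (35.0 for first-class females and 23.0 for third-class males in this toy data), rather than a single global median.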


In [1]:
import pandas as pd
import numpy as np


In [2]:
train_data = pd.read_csv('D:/CAPSTONE_PROJECT/Titanic_Capstone_Project/Data/train.csv')
test_data = pd.read_csv('D:/CAPSTONE_PROJECT/Titanic_Capstone_Project/Data/test.csv')
gender_data = pd.read_csv('D:/CAPSTONE_PROJECT/Titanic_Capstone_Project/Data/gender_submission.csv')

In [3]:
# Check basic info: columns, types, non-null counts
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

In [4]:
# Quick look at first few rows
train_data.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Check for missing values

In [5]:
# Count missing values per column
print("Train missing values:\n", train_data.isnull().sum())
print("\nTest missing values:\n", test_data.isnull().sum())


Train missing values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Test missing values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


### Missing Values Summary
- **Train set:** Age (177), Cabin (687), Embarked (2)
- **Test set:** Age (86), Fare (1), Cabin (327)
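
Raw counts are easier to judge as a share of rows. The sketch below recomputes the train-set percentages from the counts and the row total printed above; on the real frame, `train_data.isnull().mean() * 100` should give the same figures directly.

```python
import pandas as pd

# Missing-value counts and row total copied from the train output above
n_rows = 891
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Percentage of rows with a missing value, per column
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

Roughly 20% of Age and 77% of Cabin values are missing in the train set, which is why Age is imputed while Cabin is reduced to a presence flag.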


In [6]:
# Numerical features (not rendered: a notebook cell displays only its last expression)
train_data.describe()

# Categorical features
train_data.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Dooley, Mr. Patrick",male,347082,G6,S
freq,1,577,7,4,644


### Observations
- Age has many missing values (177 in train, 86 in test) and needs imputation.
- Cabin has too many missing values to use directly; consider a `HasCabin` feature.
- Embarked has 2 missing values in the train set; they can be filled with the mode.
- Fare has 1 missing value in the test set; fill it with the median.
- Family-related columns (SibSp, Parch) can be used to create FamilySize and IsAlone.
- Name can be used to extract Title as a social-status feature.
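
A minimal sketch of the family features mentioned above. The notebook's own implementation lives in `feature_engineering.engineer_features`, which is not shown here, so the two definitions below are assumptions. They are consistent with the FamilySize and IsAlone values printed for these rows later in the notebook.

```python
import pandas as pd

# SibSp/Parch values of the first five train rows shown earlier
sample = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

# Assumed definition: the passenger plus siblings/spouses plus parents/children
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Assumed definition: 1 when the passenger travels with no relatives
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)
print(sample)
```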


### Handling missing values in column Embarked:

In [7]:
# Fill missing values with mode (most frequent port)
train_data.fillna({'Embarked': train_data['Embarked'].mode()[0]}, inplace=True)

embarked_mapping = {'S': 1, 'C': 2, 'Q': 3}
train_data['Embarked'] = train_data['Embarked'].map(embarked_mapping)
test_data['Embarked'] = test_data['Embarked'].map(embarked_mapping)

### Handling missing values in column Age:

In [8]:
# # Fill Age missing values using median grouped by Pclass and Sex
# train_data['Age'] = (
#     train_data['Age']
#     .fillna(
#         train_data.groupby(['Pclass', 'Sex'])['Age'].transform('median')
#     )
# )

# test_data['Age'] = (
#     test_data['Age']
#     .fillna(
#         test_data.groupby(['Pclass', 'Sex'])['Age'].transform('median')
#     )
# )

### Encoding column Sex (no missing values):

In [9]:
# Map Sex to numeric: male=0, female=1
sex_mapping = {'male': 0, 'female': 1}
train_data['Sex'] = train_data['Sex'].map(sex_mapping)
test_data['Sex'] = test_data['Sex'].map(sex_mapping)
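
One caveat with dictionary-based `.map` encoding: any label that is not a key in the mapping silently becomes NaN. The sketch below shows this failure mode on a toy column; the label `"Male"` is a made-up example and does not occur in the Titanic data.

```python
import pandas as pd

sex_mapping = {"male": 0, "female": 1}

# Toy column with one unexpected label ("Male") to show the failure mode
sex = pd.Series(["male", "female", "Male"])
encoded = sex.map(sex_mapping)

# Labels absent from the mapping come back as NaN
unmapped = sex[encoded.isna()].unique().tolist()
print(unmapped)
```

Re-running the mapping cell on an already-encoded column has the same effect: the integers 0 and 1 are not keys in the dictionary, so every value would turn into NaN. A quick `isna()` check after mapping catches both cases.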

### Handling missing values in column Fare (test_data):

In [10]:
test_data.fillna({'Fare': test_data['Fare'].median()}, inplace=True)

### Adding HasCabin:
Cabin information is largely missing, so `HasCabin` is set to 1 if a Cabin value exists and 0 otherwise.

In [11]:
# Create binary feature HasCabin
train_data['HasCabin'] = train_data['Cabin'].notnull().astype(int)
test_data['HasCabin'] = test_data['Cabin'].notnull().astype(int)

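### Verify missing values after processing:
Embarked and Fare are now complete; Age and Cabin still contain missing values at this point.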

In [12]:
print("Train missing values after processing:\n", train_data.isnull().sum())
print("\n Test missing values after processing:\n", test_data.isnull().sum())

Train missing values after processing:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
HasCabin         0
dtype: int64

 Test missing values after processing:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
HasCabin         0
dtype: int64


In [13]:
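# Fare was already filled in cell 10, so this call changes nothing.
# Note: fillna returns a new Series; without assignment, test_data is not modified.
median_fare = test_data['Fare'].median()
test_data['Fare'].fillna(median_fare)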

0        7.8292
1        7.0000
2        9.6875
3        8.6625
4       12.2875
         ...   
413      8.0500
414    108.9000
415      7.2500
416      8.0500
417     22.3583
Name: Fare, Length: 418, dtype: float64
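
The cell above only displays the result of `fillna`; it does not store it. That is harmless here because Fare was already filled in cell 10, but the difference matters in general. A small sketch on made-up fares:

```python
import pandas as pd
import numpy as np

# Toy fares with one gap (made-up values)
df = pd.DataFrame({"Fare": [7.25, np.nan, 8.05]})
median_fare = df["Fare"].median()

# Returns a new Series; df itself is unchanged
filled = df["Fare"].fillna(median_fare)
still_missing = int(df["Fare"].isna().sum())

# Assigning the result back is what updates the frame
df["Fare"] = df["Fare"].fillna(median_fare)
print(still_missing, int(df["Fare"].isna().sum()))
```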

In [14]:
test_data.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
HasCabin         0
dtype: int64

In [15]:
# Cabin could be dropped now that HasCabin exists. The drop is left commented out
# so the original column stays available for later exploration.
# train_data.drop(columns=['Cabin'], inplace=True, errors='ignore')
# test_data.drop(columns=['Cabin'], inplace=True, errors='ignore')

In [16]:
print(train_data.columns)
print(test_data.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'HasCabin'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'HasCabin'],
      dtype='object')


In [17]:
train_data_processed = train_data.copy()
test_data_processed = test_data.copy()

In [18]:
train_data_processed.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,2,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1,0


In [19]:
test_data_processed.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,3,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,1,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,3,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,1,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,1,0


In [20]:
train_data_processed.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
HasCabin         0
dtype: int64

In [21]:
test_data_processed.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
HasCabin         0
dtype: int64

In [22]:
train_data_processed.to_csv(
    '../../Data/train_processed.csv',
    index=False
)

test_data_processed.to_csv(
    '../../Data/test_processed.csv',
    index=False
)


In [23]:
pd.read_csv('../../Data/train_processed.csv').head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,2,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1,0


In [24]:
pd.read_csv('../../Data/test_processed.csv').head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,3,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,1,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,3,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,1,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,1,0


In [25]:
import sys

# Add parent folder to sys.path so you can import your module

sys.path.append('..')  
from feature_engineering import engineer_features


In [26]:
from pathlib import Path

data_path = Path('../../Data/train_processed.csv')  
train_data = pd.read_csv(data_path)


In [27]:
# Apply it to the loaded train data
train_data = engineer_features(train_data)

print(train_data.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'HasCabin', 'AgeGroup',
       'FareGroup', 'FamilySize', 'IsAlone'],
      dtype='object')


In [28]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin,AgeGroup,FareGroup,FamilySize,IsAlone
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1,0,YoungAdult,Low,2,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,2,1,Adult,VeryHigh,2,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1,0,YoungAdult,Mid,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1,1,YoungAdult,VeryHigh,2,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1,0,YoungAdult,Mid,1,1


In [29]:
from feature_engineering import engineer_features
from pathlib import Path
import pandas as pd

# Load processed train data
train_path = Path('../../Data/train_processed.csv')
train_data = pd.read_csv(train_path)

# 1. Extract Title and impute Age by Title
train_data["Title"] = train_data["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
train_data["Title"] = train_data["Title"].replace(
    ["Lady","Countess","Capt","Col","Don","Dr","Major","Rev","Sir","Jonkheer","Dona"],
    "Rare"
)
train_data["Title"] = train_data["Title"].replace({"Mlle":"Miss","Ms":"Miss","Mme":"Mrs"})
train_data["Age"] = train_data.groupby("Title")["Age"].transform(lambda x: x.fillna(x.median()))


train_data = engineer_features(train_data)

train_data.to_csv('../../Data/train_processed.csv', index=False)


In [30]:
print(train_data[["Name", "Title", "Age"]].head(10))

                                                Name   Title   Age
0                            Braund, Mr. Owen Harris      Mr  22.0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...     Mrs  38.0
2                             Heikkinen, Miss. Laina    Miss  26.0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)     Mrs  35.0
4                           Allen, Mr. William Henry      Mr  35.0
5                                   Moran, Mr. James      Mr  30.0
6                            McCarthy, Mr. Timothy J      Mr  54.0
7                     Palsson, Master. Gosta Leonard  Master   2.0
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)     Mrs  27.0
9                Nasser, Mrs. Nicholas (Adele Achem)     Mrs  14.0


In [31]:
train_data.to_csv('../../Data/train_processed.csv', index=False)

In [32]:
test_path = Path('../../Data/test_processed.csv')  
test_data = pd.read_csv(test_path)

In [33]:
test_data = engineer_features(test_data)

In [34]:
print(test_data.columns)

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'HasCabin', 'AgeGroup',
       'FareGroup', 'FamilySize', 'IsAlone'],
      dtype='object')


In [35]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin,AgeGroup,FareGroup,FamilySize,IsAlone
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,3,0,YoungAdult,Low,1,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,1,0,Adult,Low,2,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,3,0,Senior,Mid,1,1
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,1,0,YoungAdult,Mid,1,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,1,0,YoungAdult,Mid,3,0


In [36]:
test_data.to_csv('../../Data/test_processed.csv', index=False)
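
The Title-based Age fill in cell 29 touches only the train set, so the test frame reaches `engineer_features` with its 86 missing Age values unless that function handles them internally (its code is not shown). One possible approach, sketched here as an assumption rather than the notebook's method, is to learn the per-Title medians from train and look them up for test. The helper `extract_title` is hypothetical, and the toy frames reuse a few names from the outputs above, with one test age blanked purely for illustration.

```python
import pandas as pd
import numpy as np

def extract_title(names: pd.Series) -> pd.Series:
    """Pull the honorific (letters before the first period) out of Name."""
    return names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Toy stand-ins for the train and test frames
train = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry",
             "Heikkinen, Miss. Laina"],
    "Age": [22.0, 35.0, 26.0],
})
test = pd.DataFrame({
    "Name": ["Kelly, Mr. James", "Wirz, Mr. Albert"],
    "Age": [34.5, np.nan],  # second age blanked for illustration
})

train["Title"] = extract_title(train["Name"])
test["Title"] = extract_title(test["Name"])

# Medians are learned from train only, then looked up for each test row
title_median = train.groupby("Title")["Age"].median()
test["Age"] = test["Age"].fillna(test["Title"].map(title_median))
print(test)
```

Using train-derived medians keeps the test set from influencing its own imputation. Titles that appear only in test would still map to NaN and would need a fallback such as the overall train median.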