# Handling missing data in the Titanic dataset

The Titanic dataset has some missing data. Not that much, but enough for some experimentation.

## Load data

Let's start at the beginning. The file "titanic3.xlsx" is stored in the "files" folder. Load it into a pandas dataframe.


In [None]:
#DELETE

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_excel('../files/titanic3.xlsx', engine='openpyxl')
df.head()

Which fields are good candidates for filling?

Show how many values are missing in every column.

In [None]:
#DELETE
missing_values = df.isnull().sum()
print(missing_values)

* "Cabin" has way too much missing data, so that's not a good candidate.
* "Embarked" only has two missing values. We'll fill it first.
* "home.dest" also has too many missing values to fill. It's also a text field, which makes it difficult to predict. (On the other hand, it doesn't have a lot of different values, which makes it possible again. But we'll ignore it for now.)
* A NaN in "boat" or "body" means that the person was not in a boat or that no body was recovered. Filling in this data would be a very, very bad idea.
* "age" is a good candidate. We'll be filling that in (repeatedly).
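
Raw counts are easier to judge as a share of all rows. A minimal sketch on a made-up toy frame (the column values below are invented for illustration, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "cabin": ["B5", np.nan, np.nan, np.nan],
    "embarked": ["S", "C", "S", "Q"],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of True values, so this is the percentage missing per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `df` once the real data is loaded.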


## Embarked

Embarked is a text field with one of three different values (the three ports where the Titanic took on passengers before starting across the ocean).

Fill in the missing values with the most frequent value (the mode). Mean or median is not an option here, as text doesn't have an order.
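
As a quick illustration of why the `[0]` is needed when working with the mode, here is a sketch on a small invented series of port codes (not the real column):

```python
import numpy as np
import pandas as pd

# Invented port codes with two gaps.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN and returns a Series (there can be ties),
# so take the first entry to get a single value.
most_common = ports.mode()[0]
filled = ports.fillna(most_common)

print(most_common)
print(filled.isnull().sum())
```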

In [None]:
#DELETE
mode_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(mode_embarked)

## Age with mode

Filling in with mode was easy. Let's try again for age!

In [None]:
#DELETE
mode_age = df['age'].mode()[0]
df['age'] = df['age'].fillna(mode_age)

Now draw a histogram of age.

In [None]:
#DELETE
plt.figure(figsize=(8, 5))
df['age'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

That kind of looks like an impolite gesture. Filling in with mode is a bad idea for a number of reasons:

* Artificial spike at the mode value
    * The histogram will show a large, unnatural peak at the mode value (e.g., age 24), because all missing values are replaced with one single number.
    * This distorts the distribution, making it look like a large portion of passengers were that exact age, which is not true.
* Loss of variability
    * Age is a continuous variable with a natural spread.
    * By filling with the mode, you are compressing variability and creating a misleading summary of the data.
    * This can negatively affect models, especially those that rely on the shape of the distribution (e.g. tree-based models or k-NN).
* Bias introduction
    * The mode might not represent all subgroups equally.
    * For example, 24 might be the mode because of many young third-class passengers, but missing ages could belong to older first-class passengers.
    * Filling with the mode ignores class-based, gender-based, or other contextual patterns in age.
* Misleading statistics
    * Measures like mean, median, and standard deviation will be inaccurate.
    * You're biasing the age distribution toward a single value, affecting any further analysis.

The median would be marginally better: it sits in the middle of the distribution and is less sensitive to outliers than the mean. It's still naive, though, and it still creates an artificial spike, because every missing value gets the same number.
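
The loss of variability can be checked with numbers instead of a histogram. The sketch below uses synthetic ages (random draws, not Titanic data) and removes 20% of them at random, which is an assumption made purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "ages": random draws, rounded and clipped to a plausible range.
ages = pd.Series(rng.normal(loc=30, scale=14, size=1000)).round().clip(0, 80)

# Blank out 200 randomly chosen rows.
missing_idx = rng.choice(len(ages), size=200, replace=False)
with_gaps = ages.copy()
with_gaps.iloc[missing_idx] = np.nan

# Single-value fills: every gap receives the same number.
mode_value = with_gaps.mode()[0]
median_value = with_gaps.median()
filled_mode = with_gaps.fillna(mode_value)
filled_median = with_gaps.fillna(median_value)

print("std of known ages:    ", with_gaps.std())
print("std after mode fill:  ", filled_mode.std())
print("std after median fill:", filled_median.std())
print("share of rows equal to the mode:", (filled_mode == mode_value).mean())
```

At least a fifth of all rows end up with exactly the same value, which is the spike seen in the histogram.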

Before you continue, reload the data to restore the missing values in the column.

In [None]:
df = pd.read_excel('../files/titanic3.xlsx', engine='openpyxl')

## Group-wise median or mean

A better alternative is to fill in the median based on a group. This group can be "all passengers from the third class" or "gender". Let's start with just the pclass variable.
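
To see how a group-wise fill works mechanically, here is a small sketch on an invented six-row frame. `transform('median')` returns one value per row (the median of that row's group), aligned with the original index, so it can be handed directly to `fillna`:

```python
import numpy as np
import pandas as pd

# Invented data: two classes, one missing age in each.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# One median per row, taken from the row's own pclass group.
group_median = toy.groupby("pclass")["age"].transform("median")

# fillna only touches the NaN rows; known ages stay as they are.
toy["age_filled"] = toy["age"].fillna(group_median)
print(toy)
```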

Fill in age based on pclass and show the histogram.

In [None]:
#DELETE
df['age_pclass'] = df['age'].fillna(df.groupby('pclass')['age'].transform('median'))

plt.figure(figsize=(8, 5))
df['age_pclass'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Still a big spike, which makes sense when you look at the passenger distribution.

![](../files/2025-05-14-19-52-01.png)

A lot more people are from third class, and more of those don't have a registered age either. Maybe if we fill in the age based on class and gender?

In [None]:
#DELETE
df['age_class_gender'] = df['age'].fillna(df.groupby(['pclass', 'sex'])['age'].transform('median'))

plt.figure(figsize=(8, 5))
df['age_class_gender'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution (Filled by Class and Gender)')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Better. But let's not forget title, which leads to a very nice sidestep into the wonderful world of feature engineering.

Extract the "title" from the name column. Show all the different values in the column and how many times they appear.
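
Names in this dataset follow the pattern "Surname, Title. Given names", so a regular expression can pick out the part between the comma and the first period. A small sketch on three sample strings in that format:

```python
import pandas as pd

# Sample strings in the "Surname, Title. Given names" format.
names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# ,\s*     the comma after the surname plus any spaces
# ([^.]+)  capture everything up to (not including) the next period
# \.       the period that ends the title
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```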

In [None]:
#DELETE
# Capture the text between the comma after the surname and the next period
df['title'] = df['name'].str.extract(r',\s*([^\.]+)\.', expand=False)

title_counts = df['title'].value_counts()
print(title_counts)

Mr, Miss, Mrs and Master are very good predictors: married women (mrs) are probably a bit older than unmarried women (miss), and "master" is the title given to a child.

But the other titles introduce other problems: what if the one "Dona" we have doesn't have an age? Then there is no median to fill in. And in a small group the median will basically be a random number.
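
The problem with tiny groups is easy to reproduce. In the invented frame below (the ages are made up), the only "Dona" has no age, so her group has no median and the gap survives the fill:

```python
import numpy as np
import pandas as pd

# Invented data: one large-ish group and two very small ones.
toy = pd.DataFrame({
    "title": ["Mr", "Mr", "Mr", "Dona", "Col", "Col"],
    "age": [30.0, 40.0, np.nan, np.nan, 47.0, np.nan],
})

# How many known ages does each group have, and what is its median?
stats = toy.groupby("title")["age"].agg(known="count", median="median")
print(stats)

# A group without any known age has a NaN median, so nothing gets filled.
filled = toy["age"].fillna(toy.groupby("title")["age"].transform("median"))
print("still missing:", filled.isnull().sum())
```

Note that the "Col" gap is filled with 47, the single known value in that group, which shows how arbitrary a small-group median can be.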

Combine the other titles into broader categories. Create a new column called "title_grouping" that groups all officers and royalty, and translates "Mlle" and "Ms" to "Miss" and "Mme" to "Mrs".

In [None]:
df['title_grouping'] = df['title'].replace({
    'Capt': 'Officer',
    'Col': 'Officer',
    'Major': 'Officer',
    'Dr': 'Officer',
    'Rev': 'Officer',
    'Jonkheer': 'Royalty',
    'Don': 'Royalty',
    'Sir': 'Royalty',
    'Lady': 'Royalty',
    'the Countess': 'Royalty',
    'Dona': 'Royalty',
    'Mlle': 'Miss',
    'Ms': 'Miss',
    'Mme': 'Mrs',
})

title_counts = df['title_grouping'].value_counts()
print(title_counts)

Now fill in the age based on all three values.

In [None]:
#DELETE
df['age_class_gender_title'] = df['age'].fillna(
    df.groupby(['pclass', 'sex', 'title_grouping'])['age'].transform('median')
)

plt.figure(figsize=(8, 5))
df['age_class_gender_title'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution (Filled by Class, Gender, and Title)')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Don't forget to clean up by loading the data again!

In [None]:
df = pd.read_excel('../files/titanic3.xlsx', engine='openpyxl')

## Model-based imputation

When you fill in NaN values based on other columns, there is a limit to how far you can go. Every extra column added to the mix makes the groups smaller and the median less reliable. A better approach is to build a model and predict the missing values with it.

To build a linear regressor we'll use the numerical columns we have: 'pclass', 'sibsp', 'parch' and 'fare'. The irony is that we can't use fare yet, as it has one missing value itself. Fill it in based on pclass.

In [None]:
#DELETE
df['fare'] = df['fare'].fillna(df.groupby('pclass')['fare'].transform('median'))

Next build a linear regressor based on the four features we have.

In [None]:
#DELETE
from sklearn.linear_model import LinearRegression

# Select features for regression
# (fare was filled in the previous cell, so these columns are complete)
features = ['pclass', 'sibsp', 'parch', 'fare']

# Split data into training and prediction sets
train_data = df[df['age'].notnull()]
predict_data = df[df['age'].isnull()]

X_train = train_data[features]
y_train = train_data['age']
X_predict = predict_data[features]

# Train the regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict missing ages
predicted_ages = regressor.predict(X_predict)

# Impute the missing values
df.loc[df['age'].isnull(), 'age'] = predicted_ages

And plot again?

In [None]:
#DELETE
plt.figure(figsize=(8, 5))
df['age'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Not a very good imputation. Of course, we're only using four columns and passing over very good ones like gender and title. We could encode these (ordinals as integers, nominals with one-hot encoding). A better model could also help, preferably a tree-based model that copes better with categoricals.
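
As a rough sketch of what such an encoding could look like, the example below one-hot encodes two nominal columns with `pd.get_dummies` and keeps the ordinal `pclass` as an integer. The toy rows are invented, and the resulting model is only there to show that the encoded frame is accepted by `LinearRegression`:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented rows; not taken from the Titanic file.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 2],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "title": ["Mrs", "Mr", "Mr", "Miss", "Master", "Miss"],
    "age": [48.0, 25.0, 34.0, 19.0, 4.0, 22.0],
})

# pclass is ordinal, so it stays an integer.
# sex and title are nominal, so each value becomes its own 0/1 column.
X = pd.get_dummies(
    toy[["pclass", "sex", "title"]],
    columns=["sex", "title"],
    dtype=float,
)
y = toy["age"]

model = LinearRegression().fit(X, y)
print(X.columns.tolist())
```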

This is the point where you'll need the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) from sklearn.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Reload first: the linear regression above already filled every missing age,
# so without this there would be nothing left to predict
df = pd.read_excel('../files/titanic3.xlsx', engine='openpyxl')

# Only use rows with known Age to train
age_df = df[df['age'].notna()]
predict_df = df[df['age'].isna()]

# Features to use for prediction
features = ['pclass', 'sex', 'sibsp', 'parch', 'fare', 'embarked']

# Preprocessing for numerical and categorical
numeric_features = ['sibsp', 'parch', 'fare']
categorical_features = ['pclass', 'sex', 'embarked']

numeric_transformer = SimpleImputer(strategy='median')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Define pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Fit model
pipeline.fit(age_df[features], age_df['age'])

# Predict missing ages
predicted_ages = pipeline.predict(predict_df[features])

# Fill in missing values
df.loc[df['age'].isna(), 'age'] = predicted_ages


The graph again?

In [None]:
#DELETE
plt.figure(figsize=(8, 5))
df['age'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Summary: to use non-numerical features in a model, you can:

* Encode them manually (e.g., with one-hot encoding) and keep using linear models
* Or use tree-based models like RandomForest, XGBoost, or LightGBM (which has native categorical support). They cope better with categorical features and are often more accurate for this task. Note that scikit-learn's RandomForest still needs encoded input, as in the pipeline above.