# Titanic Survival Prediction: Step-by-Step Data Wrangling for Classification

This notebook walks through data wrangling and classification to predict survival on the Titanic. We'll:
- Load train/test CSVs and inspect types/missing values
- Coerce numeric types safely and clean strings
- Impute missing values and encode categorical features
- Train baseline and improved classifiers while handling class imbalance
- Engineer features (e.g., `FamilySize`, `IsAlone`, `Title`, `FarePerPerson`)
- Track metrics across cleaning experiments and generate test predictions


# Dataset Information

| Variable | Definition | Key |
|----------|-----------|-----|
| **survival** | Survival | 0 = No, 1 = Yes |
| **pclass** | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| **sex** | Sex | `male`, `female` |
| **age** | Age in years | Continuous |
| **sibsp** | # of siblings / spouses aboard | Count |
| **parch** | # of parents / children aboard | Count |
| **ticket** | Ticket number | Text |
| **fare** | Passenger fare | Continuous |
| **cabin** | Cabin number | Text |
| **embarked** | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
# Imports and data loading
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, classification_report)

# Paths
TRAIN_PATH = 'train.csv'
TEST_PATH = 'test.csv'

# Load
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

# Separate target
y = train['Survived']
X = train.drop(columns=['Survived'])

# Check that train features and test share the same set of columns
assert set(X.columns) == set(test.columns), 'Train features and test columns should match'
print('Train shape:', train.shape, 'Test shape:', test.shape)
X.head(3)

Train shape: (891, 12) Test shape: (418, 11)


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [2]:
# Inspect dtypes and missing values
print('--- Train info ---')
print(train.info())
print('\nMissing values (train):\n', train.isna().sum())
print('\n--- Test info ---')
print(test.info())
print('\nMissing values (test):\n', test.isna().sum())

# Class distribution
print('\nClass distribution in train (Survived):')
print(y.value_counts(normalize=True).rename('proportion'))

--- Train info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

Missing values (train):
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabi

# Data type coercion for numeric columns

Real-world CSVs sometimes store numbers as strings with stray spaces. We'll safely coerce numeric types for `['Age','Fare','SibSp','Parch','Pclass']` by stripping leading and trailing whitespace and using `pd.to_numeric(errors='coerce')`, which turns any value it cannot parse into `NaN` instead of raising an error.


In [3]:
pd.to_numeric("3. 14 ", errors='coerce'), pd.to_numeric("    3.14 ", errors='coerce')


(np.float64(nan), np.float64(3.14))

In [4]:
# Coerce numeric columns for both train/test
numeric_cols = ['Age','Fare','SibSp','Parch','Pclass']
def coerce_numeric(df, cols):
    for col in cols:
        # Convert to string, strip whitespace, then coerce to numeric
        df[col] = pd.to_numeric(df[col].astype('string').str.strip(), errors='coerce')
    return df

X = coerce_numeric(X.copy(), numeric_cols)
test = coerce_numeric(test.copy(), numeric_cols)

X[numeric_cols].describe(include='all')

| | Age | Fare | SibSp | Parch | Pclass |
|---|---|---|---|---|---|
| count | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 29.699118 | 32.204208 | 0.523008 | 0.381594 | 2.308642 |
| std | 14.526497 | 49.693429 | 1.102743 | 0.806057 | 0.836071 |
| min | 0.42 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 20.125 | 7.9104 | 0.0 | 0.0 | 2.0 |
| 50% | 28.0 | 14.4542 | 0.0 | 0.0 | 3.0 |
| 75% | 38.0 | 31.0 | 1.0 | 0.0 | 3.0 |
| max | 80.0 | 512.3292 | 8.0 | 6.0 | 3.0 |
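Because `errors='coerce'` fails silently, it can help to count how many values were present in the raw data but became missing after coercion. The sketch below is one possible check on a small made-up frame; the helper name `coercion_losses` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def coercion_losses(raw: pd.DataFrame, cols):
    """Count, per column, values that were present but failed numeric parsing."""
    losses = {}
    for col in cols:
        cleaned = pd.to_numeric(
            raw[col].astype('string').str.strip(), errors='coerce'
        ).astype('float64')  # plain floats so missing values are ordinary NaN
        # Present in the raw data but NaN after coercion => unparseable entry
        losses[col] = int((raw[col].notna() & cleaned.isna()).sum())
    return pd.Series(losses, name='values_lost')

# Toy data: ' 7.25 ' parses after stripping; '3. 14' and 'unknown' do not
toy = pd.DataFrame({
    'Age': ['22', ' 38 ', 'unknown', None],
    'Fare': [' 7.25 ', '3. 14', '8.05', '71.2833'],
})
lost = coercion_losses(toy, ['Age', 'Fare'])
print(lost)
```

A non-zero count points at entries worth inspecting by hand before they are imputed away.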


# Basic imputations for missing values

We'll impute missing values to enable modeling:
- `Age`: median (optionally by `Pclass`/`Sex` later)
- `Fare`: median by `Pclass`
- `Embarked`: mode
- Add `HasCabin` as 1 if `Cabin` present else 0; optionally drop `Cabin`.
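The optional group-wise `Age` imputation mentioned in the list could look roughly like the sketch below. It uses a made-up toy frame, and the column name `AgeFilled` is an illustrative assumption; the notebook itself uses the simpler global median.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for the training features
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex':    ['female', 'female', 'female', 'male', 'male', 'male'],
    'Age':    [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each (Pclass, Sex) group, broadcast back to every row
group_median = toy.groupby(['Pclass', 'Sex'])['Age'].transform('median')

# Fill from the group median first, then fall back to the global median
toy['AgeFilled'] = toy['Age'].fillna(group_median).fillna(toy['Age'].median())
print(toy)
```

For the test set, the group medians would be looked up from the training data rather than recomputed, so that no test information shapes the imputation.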


`.mode()`: most frequent value(s). It returns a Series because ties are possible, so take `.iloc[0]` when a single value is needed.

In [5]:
s = pd.Series([1, 20, 20, 30, 30])
t = pd.Series([1, 21, 21, 21, 3])
s.mode(), t.mode()

(0    20
 1    30
 dtype: int64,
 0    21
 dtype: int64)

`.isna()`: True where a value is missing

`.notna()`: True where a value is not missing

In [6]:
s = pd.Series([10, None, 30])
s.isna(),s.notna()

(0    False
 1     True
 2    False
 dtype: bool,
 0     True
 1    False
 2     True
 dtype: bool)

`.any()`: True if any element is True (often used after a boolean mask)

In [7]:
s = pd.Series([0, 0, 1])
(s > 0).any()

np.True_

`.fillna()`: replace missing values

In [8]:
s = pd.Series([1, None, 3])
s.fillna(0)

0    1.0
1    0.0
2    3.0
dtype: float64

### `.loc` vs `.iloc`

- `.loc` selects by **label** (index/column names), inclusive of endpoints when slicing.
- `.iloc` selects by **position** (integer offsets), end-exclusive when slicing (like Python ranges).

In [9]:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cara'],
    'age': [25, 30, 27]
}, index=['a', 'b', 'c'])

# loc: by label (inclusive slice)
loc_rows = df.loc['a':'b', ['name']]

# iloc: by position (end-exclusive slice)
iloc_rows = df.iloc[0:2, 1:2]

loc_rows, iloc_rows

(    name
 a  Alice
 b    Bob,
    age
 a   25
 b   30)

In [10]:
# Create simple cleaned copies for a baseline
X_clean = X.copy()
test_clean = test.copy()

# Embarked mode (computed on the training data only, so no test-set information leaks in)
embarked_mode = X_clean['Embarked'].mode().iloc[0] if X_clean['Embarked'].notna().any() else 'S'
X_clean['Embarked'] = X_clean['Embarked'].fillna(embarked_mode)
test_clean['Embarked'] = test_clean['Embarked'].fillna(embarked_mode)

# Fare median by Pclass
fare_by_pclass = X_clean.groupby('Pclass')['Fare'].median()
def fill_fare(row):
    if pd.isna(row['Fare']):
        return fare_by_pclass.get(row['Pclass'], X_clean['Fare'].median())
    return row['Fare']
X_clean['Fare'] = X_clean.apply(fill_fare, axis=1)
test_clean['Fare'] = test_clean.apply(fill_fare, axis=1)

# Age median
age_median = X_clean['Age'].median()
X_clean['Age'] = X_clean['Age'].fillna(age_median)
test_clean['Age'] = test_clean['Age'].fillna(age_median)

# HasCabin feature
X_clean['HasCabin'] = (X_clean['Cabin'].notna()).astype(int) if 'Cabin' in X_clean.columns else 0
test_clean['HasCabin'] = (test_clean['Cabin'].notna()).astype(int) if 'Cabin' in test_clean.columns else 0
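The row-wise `apply` used for `Fare` works, but the same idea can be written without a Python-level loop. The following is a minimal sketch on made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins for `X_clean` and `test_clean`), assuming the per-class medians themselves are not missing.

```python
import numpy as np
import pandas as pd

# Toy frames (made-up values); train_toy plays the role of X_clean
train_toy = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3],
                          'Fare': [80.0, 60.0, 7.0, 8.0, 9.0]})
test_toy = pd.DataFrame({'Pclass': [3, 1, 2],
                         'Fare': [np.nan, 50.0, np.nan]})

# Medians are learned from the training frame only
fare_by_pclass = train_toy.groupby('Pclass')['Fare'].median()
global_median = train_toy['Fare'].median()

# Map each row's class to its median; classes unseen in training
# fall back to the global training median
fallback = test_toy['Pclass'].map(fare_by_pclass).fillna(global_median)
test_toy['Fare'] = test_toy['Fare'].fillna(fallback)
print(test_toy)
```

Only the missing fares are touched; existing values pass through unchanged.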

# Categorical encoding

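We treat `Pclass` as categorical along with `Sex` and `Embarked`. Numeric features are `Age`, `Fare`, `SibSp`, `Parch`, plus the binary `HasCabin` flag created above. We'll use `OneHotEncoder(handle_unknown='ignore')` for categories, so a category that appears only at prediction time is encoded as all zeros instead of raising an error.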


In [12]:
categorical_features = ['Sex','Embarked','Pclass']
numeric_features = ['Age','Fare','SibSp','Parch'] + (['HasCabin'] if 'HasCabin' in X_clean.columns else [])

# Preprocess: impute numeric (median), scale; one-hot encode categorical
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess_v1 = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

clf_lr = Pipeline(steps=[
    ('preprocess', preprocess_v1),
    ('model', LogisticRegression(max_iter=1000, random_state=42))
])

# Fit on full training data
clf_lr.fit(X_clean, y)
y_pred_train = clf_lr.predict(X_clean)

print('Training Accuracy:', accuracy_score(y, y_pred_train))


Training Accuracy: 0.8114478114478114


# Evaluation metric

We'll use accuracy to evaluate model performance:
- Accuracy: $\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$
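Accuracy is easy to read but can flatter a model when the classes are uneven. As a rough illustration using the training-set class counts (549 did not survive, 342 survived), the sketch below scores a hypothetical predictor that always answers "did not survive". The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class counts from the training set: 549 did not survive (0), 342 survived (1)
y_true = np.array([0] * 549 + [1] * 342)

# A "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)               # 549 / 891
bal_acc = balanced_accuracy_score(y_true, y_majority)  # mean recall: (1.0 + 0.0) / 2

print(f'Accuracy: {acc:.4f}')
print(f'Balanced accuracy: {bal_acc:.4f}')
```

Any real model should clear this baseline of roughly 0.62 accuracy, and balanced accuracy makes it obvious that the baseline never finds a single survivor.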


# Address class imbalance

Titanic labels are not perfectly balanced: 342 of the 891 training passengers (about 38%) survived. We’ll refit with `class_weight='balanced'`, which weights the minority class more heavily during training, and compare metrics including balanced accuracy.
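For intuition about what `class_weight='balanced'` means numerically, the weights follow `n_samples / (n_classes * count_per_class)`. The sketch below reproduces them for the training-set class counts (549 and 342) using scikit-learn's `compute_class_weight`; the variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the training set: 549 did not survive (0), 342 survived (1)
y_counts = np.array([0] * 549 + [1] * 342)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_counts)

# The same formula by hand: n_samples / (n_classes * count_per_class)
manual = len(y_counts) / (2 * np.bincount(y_counts))

print(dict(zip([0, 1], weights.round(4))))
```

Each survivor therefore counts about 1.3 times in the loss and each non-survivor about 0.81 times, which pushes the decision boundary toward predicting survival more often.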


In [13]:
# Refit LogisticRegression with class_weight='balanced'
clf_lr_bal = Pipeline(steps=[
    ('preprocess', preprocess_v1),
    ('model', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
])

# Fit on full training data
clf_lr_bal.fit(X_clean, y)
y_pred_train_bal = clf_lr_bal.predict(X_clean)

# Plain accuracy of the class-weighted model (not the balanced-accuracy metric)
print('Training Accuracy (balanced):', accuracy_score(y, y_pred_train_bal))


Training Accuracy (balanced): 0.7991021324354658


## How to properly compare the two models

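Accuracy alone hides how each class is treated, so we also look at precision, recall, and F1-score per class. Every score in this section is computed on the training data, so it is an optimistic estimate of real performance. The formulas are: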

**Precision**: Measures how many of the predicted positives are actually positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$

**Recall** (Sensitivity): Measures how many of the actual positives were correctly identified.

$$\text{Recall} = \frac{TP}{TP + FN}$$

**F1-Score**: Harmonic mean of precision and recall, balancing both metrics.

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

Where:
- **TP** (True Positives): Correctly predicted positive cases
- **FP** (False Positives): Incorrectly predicted as positive (Type I error)
- **FN** (False Negatives): Incorrectly predicted as negative (Type II error)
- **TN** (True Negatives): Correctly predicted negative cases

For the Titanic example:
- **Precision for "Survived"**: Of all passengers we predicted survived, what percentage actually survived?
- **Recall for "Survived"**: Of all passengers who actually survived, what percentage did we correctly predict?
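As a worked example of these formulas, the sketch below derives TP, FP, FN, and TN from a confusion matrix for ten made-up passengers and computes the metrics by hand. The labels are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for ten passengers (1 = survived)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'TP={tp} FP={fp} FN={fn} TN={tn}')
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
```

The hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` from scikit-learn for the positive class.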

In [19]:
# Compare both models
print("=== Without class_weight='balanced' ===")
print(f"Accuracy: {accuracy_score(y, y_pred_train):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y, y_pred_train):.4f}")
print(classification_report(y, y_pred_train, target_names=['Not Survived', 'Survived']))

print("\n=== With class_weight='balanced' ===")
print(f"Accuracy: {accuracy_score(y, y_pred_train_bal):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y, y_pred_train_bal):.4f}")
print(classification_report(y, y_pred_train_bal, target_names=['Not Survived', 'Survived']))

=== Without class_weight='balanced' ===
Accuracy: 0.8114
Balanced Accuracy: 0.7946
              precision    recall  f1-score   support

Not Survived       0.83      0.87      0.85       549
    Survived       0.77      0.72      0.75       342

    accuracy                           0.81       891
   macro avg       0.80      0.79      0.80       891
weighted avg       0.81      0.81      0.81       891


=== With class_weight='balanced' ===
Accuracy: 0.7991
Balanced Accuracy: 0.7940
              precision    recall  f1-score   support

Not Survived       0.85      0.82      0.83       549
    Survived       0.72      0.77      0.75       342

    accuracy                           0.80       891
   macro avg       0.79      0.79      0.79       891
weighted avg       0.80      0.80      0.80       891
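Both reports above score the models on the same rows they were trained on. One common way to get a less optimistic comparison is stratified k-fold cross-validation. The sketch below shows the pattern on synthetic data generated with `make_classification` (an assumption made so the example is self-contained; `X_demo`, `y_demo`, and the simplified pipeline are not from the notebook). In the notebook, `clf_lr` or `clf_lr_bal` with `X_clean` and `y` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 891 rows with roughly a 62% / 38% class split
X_demo, y_demo = make_classification(
    n_samples=891, n_features=8, n_informative=4,
    weights=[0.62, 0.38], random_state=42,
)

model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)),
])

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ['accuracy', 'balanced_accuracy', 'f1']
scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metric_names)

for name in metric_names:
    fold_scores = scores[f'test_{name}']
    print(f'{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}')
```

Because the preprocessing lives inside the pipeline, it is refit on each training fold, so the held-out fold never influences the scaler or imputers.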



# Feature engineering

We’ll add domain-inspired features:
- `FamilySize = SibSp + Parch + 1`
- `IsAlone = (FamilySize == 1)`
- `Title` extracted from `Name`
- `FarePerPerson = Fare / FamilySize`


In [None]:

def add_features(df):
    df = df.copy()
    # Family features
    df['FamilySize'] = df['SibSp'].fillna(0) + df['Parch'].fillna(0) + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    # Title from Name
    if 'Name' in df.columns:
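        # expand=False returns a Series rather than a one-column DataFrame
        df['Title'] = df['Name'].str.extract(r',\s*([^\.]+)\.', expand=False)
        # Rare title grouping (counts come from whichever frame is passed in,
        # so train and test can end up with different sets of 'Rare' titles)
        title_counts = df['Title'].value_counts()
        rare = title_counts[title_counts < 10].index
        df['Title'] = df['Title'].replace(rare, 'Rare')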
    else:
        df['Title'] = 'Unknown'
    # Fare per person
    df['FarePerPerson'] = (df['Fare'] / df['FamilySize']).replace([np.inf, -np.inf], np.nan)
    return df

X_fe = add_features(X_clean)
test_fe = add_features(test_clean)
X_fe[['FamilySize','IsAlone','Title','FarePerPerson']].head()

| | FamilySize | IsAlone | Title | FarePerPerson |
|---|---|---|---|---|
| 0 | 2 | 0 | Mr | 3.625 |
| 1 | 2 | 0 | Mrs | 35.64165 |
| 2 | 1 | 1 | Miss | 7.925 |
| 3 | 2 | 0 | Mrs | 26.55 |
| 4 | 1 | 1 | Mr | 8.05 |
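`add_features` decides which titles are rare from the frame it receives, so the train and test sets may group titles differently. One possible alternative, sketched below, learns the set of common titles from the training names and reuses it everywhere. The helper names (`extract_title`, `fit_common_titles`, `apply_titles`), the `min_count` value, and most of the sample names are hypothetical. Unlike the notebook's version, this sketch also maps names with no recognizable title to `'Rare'`.

```python
import pandas as pd

def extract_title(names: pd.Series) -> pd.Series:
    # Text between the comma and the first period,
    # e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
    return names.str.extract(r',\s*([^\.]+)\.', expand=False).str.strip()

def fit_common_titles(train_names: pd.Series, min_count: int) -> set:
    # Titles that appear at least `min_count` times in the training names
    counts = extract_title(train_names).value_counts()
    return set(counts[counts >= min_count].index)

def apply_titles(names: pd.Series, common_titles: set) -> pd.Series:
    titles = extract_title(names)
    # Anything not learned from training (including failed matches) becomes 'Rare'
    return titles.where(titles.isin(common_titles), 'Rare')

# Toy names: two come from the notebook's preview, the rest are made up
train_names = pd.Series([
    'Braund, Mr. Owen Harris', 'Smith, Mr. John', 'Heikkinen, Miss. Laina',
    'Doe, Miss. Jane', 'Roe, Col. Richard',
])
test_names = pd.Series(['Poe, Mr. Edgar', 'Moe, Dr. Alan', 'Loe, Col. Sam'])

common = fit_common_titles(train_names, min_count=2)
test_titles = apply_titles(test_names, common)
print(test_titles.tolist())
```

Here `Col` appears once in training and `Dr` never, so both fall into `'Rare'` at prediction time regardless of how often they occur in the test names.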


# Transformations and outlier handling

To reduce the right skew in fares, we’ll apply `log1p` (that is, `log(1 + x)`, which stays defined for the zero fares in the data):
- `FareLog = log1p(Fare)`
- `FarePerPersonLog = log1p(FarePerPerson)`

Optionally, bin `Age` into quantiles for robustness.
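The optional quantile binning of `Age` is not implemented in the cell that follows. A minimal sketch of one way to do it is shown here, with made-up ages and illustrative variable names: bin edges are learned from the training ages with `pd.qcut` and then reused on the test ages with `pd.cut`.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration
train_age = pd.Series([2, 18, 22, 25, 28, 31, 36, 45, 58, 80], dtype=float)
test_age = pd.Series([1, 30, 90], dtype=float)

# Learn quartile edges on the training ages only
_, edges = pd.qcut(train_age, q=4, retbins=True)
edges = edges.copy()

# Open the outer edges so test ages outside the training range still land in a bin
edges[0], edges[-1] = -np.inf, np.inf

# labels=False returns the integer bin index (0 = youngest quartile)
train_bin = pd.cut(train_age, bins=edges, labels=False)
test_bin = pd.cut(test_age, bins=edges, labels=False)

print(train_bin.value_counts().sort_index())
print(test_bin.tolist())
```

Reusing the training edges keeps the meaning of each bin identical across train and test, which recomputing quantiles on the test set would not guarantee.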


In [None]:
def add_transforms(df):
    df = df.copy()
    df['FareLog'] = np.log1p(df['Fare'])
    if 'FarePerPerson' in df.columns:
        df['FarePerPersonLog'] = np.log1p(df['FarePerPerson'].fillna(0))
    return df

X_tx = add_transforms(X_fe)
test_tx = add_transforms(test_fe)
X_tx[['Fare','FareLog','FarePerPerson','FarePerPersonLog']].describe()

| | Fare | FareLog | FarePerPerson | FarePerPersonLog |
|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 32.204208 | 2.962246 | 19.916375 | 2.565012 |
| std | 49.693429 | 0.969048 | 35.841257 | 0.85768 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.9104 | 2.187218 | 7.25 | 2.110213 |
| 50% | 14.4542 | 2.737881 | 8.3 | 2.230014 |
| 75% | 31.0 | 3.465736 | 23.666667 | 3.205453 |
| max | 512.3292 | 6.240917 | 512.3292 | 6.240917 |


# Reproducible Pipeline with ColumnTransformer

We’ll build a reusable pipeline that applies feature engineering (via `FunctionTransformer`), imputations, encoding, scaling, and `LogisticRegression(class_weight='balanced')`.


In [None]:
from sklearn.preprocessing import FunctionTransformer

def feature_engineer(df):
    return add_transforms(add_features(df))

feature_step = FunctionTransformer(feature_engineer, validate=False)

# Expanded feature sets after FE
cat_features_fe = ['Sex','Embarked','Pclass','Title']
num_features_fe = ['Age','Fare','SibSp','Parch','FamilySize','IsAlone','FarePerPerson','FareLog','FarePerPersonLog'] + (['HasCabin'] if 'HasCabin' in X_tx.columns else [])

numeric_tx = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_tx = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess_fe = ColumnTransformer(
    transformers=[
        ('num', numeric_tx, num_features_fe),
        ('cat', categorical_tx, cat_features_fe)
    ]
)

pipe_lr_bal_fe = Pipeline(steps=[
    ('features', feature_step),
    ('preprocess', preprocess_fe),
    ('model', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
])

# Fit on full training data
pipe_lr_bal_fe.fit(X_clean, y)
y_pred_lr_train = pipe_lr_bal_fe.predict(X_clean)

print('Training Accuracy (LR with FE):', accuracy_score(y, y_pred_lr_train))


Training Accuracy (LR with FE): 0.8316498316498316
