# Titanic - Machine Learning from Disaster

## 2. Feature Engineering

In this notebook, we will:
1. Handle missing values in the Titanic dataset.
2. Create new features.
3. Encode categorical features.
4. Prepare the dataset for model training.


In [547]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Display the first few rows of the dataset
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### 2.1. Handling Missing Values

**Age**: Impute missing values using the median, as it's less affected by outliers.

In [551]:
train_data.loc[:, 'Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data.loc[:, 'Age'] = test_data['Age'].fillna(test_data['Age'].median())

**Embarked**: Since there are only a few missing entries, we can replace them with the most common embarkation point (mode).

In [554]:
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
test_data['Embarked'] = test_data['Embarked'].fillna(test_data['Embarked'].mode()[0])

**Fare**: Impute missing values using the median fare to handle outliers.

In [557]:
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

**Cabin**: Since a large number of Cabin entries are missing, we create a binary feature indicating whether a passenger has cabin information or not.

In [560]:
train_data['HasCabin'] = train_data['Cabin'].notnull().astype(int)
test_data['HasCabin'] = test_data['Cabin'].notnull().astype(int)

**Key Observations:**
1. PassengerId, Survived, Pclass, SibSp, Parch: These integer-based columns remain unchanged.
2. Name, Sex, Ticket, Cabin, Embarked: These are still object (string) data types.
3. Age, Fare: Both are now fully populated and remain as float64 data types after filling missing values.
4. HasCabin: The newly created feature, which is a binary indicator of whether a passenger had cabin information, is successfully added and stored as int32.

In [563]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
 12  HasCabin     891 non-null    int32  
dtypes: float64(2), int32(1), int64(5), object(5)
memory usage: 87.1+ KB


In [565]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  HasCabin     418 non-null    int32  
dtypes: float64(2), int32(1), int64(4), object(5)
memory usage: 37.7+ KB


### 2.2. Creating New Features

**Creating FamilySize Feature**

In [569]:
# Create FamilySize feature
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1  # Adding 1 for the passenger themselves
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1

**Family Survival Impact**

In [572]:
# Family Survival
train_data['FamilySurvival'] = (train_data['FamilySize'] > 1).astype(int)
test_data['FamilySurvival'] = (test_data['FamilySize'] > 1).astype(int)

**Creating FarePerPerson Feature**

In [575]:
# Create FarePerPerson feature
train_data['FarePerPerson'] = train_data['Fare'] / train_data['FamilySize']
test_data['FarePerPerson'] = test_data['Fare'] / test_data['FamilySize']

**Interaction between Age and PClass**

In [578]:
# Interaction between Age and Pclass
train_data['Age_Pclass'] = train_data['Age'] * train_data['Pclass']
test_data['Age_Pclass'] = test_data['Age'] * train_data['Pclass']

**Group Age into Categories**

In [581]:
# Create age groups
train_data['AgeGroup'] = pd.cut(train_data['Age'], bins=[0, 12, 18, 35, 60, 100], labels=['Child', 'Teen', 'Young Adult', 'Middle Age', 'Senior'])
test_data['AgeGroup'] = pd.cut(train_data['Age'], bins=[0, 12, 18, 35, 60, 100], labels=['Child', 'Teen', 'Young Adult', 'Middle Age', 'Senior'])

**IsHighFare Feature**

In [584]:
# IsHighFare Feature
train_data['IsHighFare'] = (train_data['Fare'] > train_data['Fare'].median()).astype(int)
test_data['IsHighFare'] = (test_data['Fare'] > test_data['Fare'].median()).astype(int)

In [586]:
# Display the newly created features
train_data[['FamilySize', 'FarePerPerson', 
             'Age_Pclass', 'AgeGroup', 'FamilySurvival', 'IsHighFare']].head()


Unnamed: 0,FamilySize,FarePerPerson,Age_Pclass,AgeGroup,FamilySurvival,IsHighFare
0,2,3.625,66.0,Young Adult,1,0
1,2,35.64165,38.0,Middle Age,1,1
2,1,7.925,78.0,Young Adult,0,0
3,2,26.55,35.0,Young Adult,1,1
4,1,8.05,105.0,Young Adult,0,0


### 2.3. Encoding Categorical Features
1. Sex_male (1 if the passenger is male, 0 if female)
2. Embarked_Q, Embarked_S (1 if the passenger embarked from that port, 0 otherwise)
3. Title_Mr, Title_Mrs, etc. (one-hot encoding for the title)

In [589]:
# One-hot encode the categorical features
train_data = pd.get_dummies(train_data, columns=['Sex', 'Embarked', 'AgeGroup'], drop_first=True)
test_data = pd.get_dummies(test_data, columns=['Sex', 'Embarked', 'AgeGroup'], drop_first=True)
# Display the first few rows to confirm the encoding
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,...,FarePerPerson,Age_Pclass,IsHighFare,Sex_male,Embarked_Q,Embarked_S,AgeGroup_Teen,AgeGroup_Young Adult,AgeGroup_Middle Age,AgeGroup_Senior
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,...,3.625,66.0,0,True,False,True,False,True,False,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,...,35.64165,38.0,1,False,False,False,False,False,True,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,...,7.925,78.0,0,False,False,True,False,True,False,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,...,26.55,35.0,1,False,False,True,False,True,False,False
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,...,8.05,105.0,0,True,False,True,False,True,False,False


### 2.4 Prepare the dataset for model training.

**Drop Irrelevant Columns**: Some columns in the dataset do not contribute to the prediction task. These include:

1. PassengerId: An identifier, not useful for predictions.
2. Name: Already used to extract Title; no longer needed.
3. Ticket: This doesn't contain useful predictive information.
4. Cabin: Too many missing values; already converted into a HasCabin feature.

In [593]:
# Drop irrelevant columns
train_data.drop([ 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test_data.drop([ 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

**Check for Missing Values**: We need to ensure that there are no remaining missing values in the dataset, which could hinder model performance.

In [596]:
# Check for remaining missing values
missing_values = train_data.isnull().sum()
missing_values

PassengerId             0
Survived                0
Pclass                  0
Age                     0
SibSp                   0
Parch                   0
Fare                    0
HasCabin                0
FamilySize              0
FamilySurvival          0
FarePerPerson           0
Age_Pclass              0
IsHighFare              0
Sex_male                0
Embarked_Q              0
Embarked_S              0
AgeGroup_Teen           0
AgeGroup_Young Adult    0
AgeGroup_Middle Age     0
AgeGroup_Senior         0
dtype: int64

In [598]:
# Display the first few rows of the cleaned dataset
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,HasCabin,FamilySize,FamilySurvival,FarePerPerson,Age_Pclass,IsHighFare,Sex_male,Embarked_Q,Embarked_S,AgeGroup_Teen,AgeGroup_Young Adult,AgeGroup_Middle Age,AgeGroup_Senior
0,1,0,3,22.0,1,0,7.25,0,2,1,3.625,66.0,0,True,False,True,False,True,False,False
1,2,1,1,38.0,1,0,71.2833,1,2,1,35.64165,38.0,1,False,False,False,False,False,True,False
2,3,1,3,26.0,0,0,7.925,0,1,0,7.925,78.0,0,False,False,True,False,True,False,False
3,4,1,1,35.0,1,0,53.1,1,2,1,26.55,35.0,1,False,False,True,False,True,False,False
4,5,0,3,35.0,0,0,8.05,0,1,0,8.05,105.0,0,True,False,True,False,True,False,False


In [600]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,HasCabin,FamilySize,FamilySurvival,FarePerPerson,Age_Pclass,IsHighFare,Sex_male,Embarked_Q,Embarked_S,AgeGroup_Teen,AgeGroup_Young Adult,AgeGroup_Middle Age,AgeGroup_Senior
0,892,3,34.5,0,0,7.8292,0,1,0,7.8292,103.5,0,True,True,False,False,True,False,False
1,893,3,47.0,1,0,7.0,0,2,1,3.5,47.0,0,False,False,True,False,False,True,False
2,894,2,62.0,0,0,9.6875,0,1,0,9.6875,186.0,0,True,True,False,False,True,False,False
3,895,3,27.0,0,0,8.6625,0,1,0,8.6625,27.0,0,True,False,True,False,True,False,False
4,896,3,22.0,1,1,12.2875,0,3,1,4.095833,66.0,0,False,False,True,False,True,False,False


In [602]:
# Save the processed train and test data to CSV files
train_data.to_csv('train_processed.csv', index=False)
test_data.to_csv('test_processed.csv', index=False)
print("Processed data saved successfully.")


Processed data saved successfully.


## Summary and Next Steps

In this notebook, we:
- Handled missing values in key features such as `Age`, `Embarked`, and `Fare`.
- Created new features such as `Title` and `FamilySize`.
- Encoded categorical variables using one-hot encoding.

In the next notebook, we will move on to **Model Building** where we will train various classifiers to predict survival.
