# Feature Engineering with the Titanic dataset

This notebook contains examples of feature engineering steps using the Titanic dataset from OpenML, the details of which can be found [here](https://www.openml.org/search?type=data&sort=runs&id=40945&status=active).

## First, import libraries, and load and explore the data.

We will use the [pandas](https://pandas.pydata.org/) library to load, explore, and process the data.
We also use [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) and [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) to perform some of the feature manipulation steps.

In the OpenML Titanic dataset, missing entries are represented by a question mark ('?'). We use the `pandas.read_csv()` function to read in the data, and we pass `na_values='?'` so that these entries are parsed as missing values (NaN).
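
To see why the `na_values` argument matters, here is a minimal sketch on a small in-memory CSV. The rows are made-up illustrative values, not taken from the real dataset.

```python
import io
import pandas as pd

# Toy CSV with '?' as the missing-value marker (illustrative data only)
csv_text = "name,age,fare\nA,22,7.25\nB,?,71.28\nC,30,?\n"

# Without na_values, columns containing '?' are read as text rather than numbers
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values='?', the marker becomes NaN and the columns stay numeric
clean = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(raw.dtypes)
print(clean.dtypes)
print(clean.isnull().sum())
```

Parsing the marker at load time means later steps such as `median()` work on the numeric columns without extra conversion.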

**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Load the data, and specify that '?' represents missing values
titanic = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl', na_values='?')

# Preview the data
titanic.head()

## Check for missing data

Before we begin our feature engineering steps, one of the first things we want to do is check whether our dataset has any missing values. The following code summarizes how many missing values exist for each feature:

In [None]:
# Identify columns with missing values
missing_values = titanic.isnull().sum()
print(missing_values)
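
Raw counts can be hard to judge without knowing the number of rows, so it can also help to look at the fraction of missing values per column. The following is a small sketch on a toy DataFrame (the values and the name `summary` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries (illustrative values only)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 30.0, np.nan],
    'embarked': ['S', 'C', None, 'S'],
    'pclass': [1, 2, 3, 3],
})

# Count and fraction of missing values per column
summary = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'fraction': toy.isnull().mean(),
})

# Keep only columns that actually have missing values, worst first
summary = summary[summary['missing'] > 0].sort_values('fraction', ascending=False)
print(summary)
```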

## Feature Engineering

Next, we will perform data preparation and feature engineering steps.

### Replace missing data

There are many ways in which we can replace missing data. In this case, we will replace missing values with the `median` for numeric features, or the `mode` for categorical features.

In [None]:
# Iterate over each column
for col in titanic.columns:
    # Skip columns with no missing values
    if missing_values[col] == 0:
        continue
    
    # For numerical columns, use median to replace missing values
    if titanic[col].dtype in ['float64', 'int64']:
        titanic[col] = titanic[col].fillna(titanic[col].median()) 
    # For categorical columns, use mode to replace missing values
    else:
        titanic[col] = titanic[col].fillna(titanic[col].mode()[0])   
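
The same median/mode strategy could also be expressed with `SimpleImputer`, which is imported at the top of the notebook. This is only a sketch on toy data, with made-up values, to show the two strategies side by side:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 30.0, 40.0],
    'embarked': ['S', 'C', np.nan, 'S'],
})

# Median for the numeric column
median_imputer = SimpleImputer(strategy='median')
toy['age'] = median_imputer.fit_transform(toy[['age']]).ravel()

# Most frequent value (mode) for the categorical column
mode_imputer = SimpleImputer(strategy='most_frequent')
toy['embarked'] = mode_imputer.fit_transform(toy[['embarked']]).ravel()

print(toy)
```

One reason to prefer an imputer object in a modeling workflow is that it can be fitted on training data and then reused to transform new data with the same fill values.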


Verify that no missing values remain:

In [None]:
missing_values = titanic.isnull().sum()
print(missing_values)

## Engineering and encoding features

In this section, we explore how we can use the existing features in the dataset to create new features that could help our ML model to make better predictions.

### Create a new feature named 'Title'

A passenger’s name, ticket number, or port of embarkation is unlikely to affect the outcome directly. However, we could engineer a new feature named “Title”, extracted from the passengers’ names, which could provide information about social status, occupation, marital status, and age that might not be apparent from the other features. We could also clean up this new feature by merging similar titles such as “Miss” and “Ms”, and by grouping elevated titles as “Distinguished”. The code to do that would be as follows:

In [None]:
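# Extract the title (e.g. "Mr", "Miss") from the "name" feature
def get_title(name):
    # Names follow the pattern "Surname, Title. Given names";
    # require both separators so the split below cannot raise an IndexError
    if ',' in name and '.' in name:
        return name.split(',')[1].split('.')[0].strip()
    else:
        return 'Unknown'

# Create a new "Title" feature
titanic['Title'] = titanic['name'].apply(get_title)

# Simplify the titles, merge less common titles into the same category.
# 'the Countess' is listed as well because the extraction keeps a leading
# "the" if the raw name is written that way.
titanic['Title'] = titanic['Title'].replace(['Lady', 'Countess', 'the Countess', 'Capt', 'Col', 'Don', 'Dr',
                                             'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Distinguished')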
titanic['Title'] = titanic['Title'].replace('Mlle', 'Miss')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Mme', 'Mrs')
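
As an alternative sketch, the same extraction can be written with a vectorized regular expression. The sample names below are for illustration only (the "Dupont" entry is hypothetical), and the pattern assumes names of the form "Surname, Title. Given names":

```python
import pandas as pd

# Sample names for illustration; the last one has no title
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Dupont, Mlle. Marie',
    'No Title Here',
])

# Capture the text between the comma and the first following period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Names that do not match the pattern become 'Unknown'
titles = titles.fillna('Unknown')

# Merge equivalent titles with a single mapping
titles = titles.replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
print(titles.tolist())
```

Note that a name written as "the Countess." yields the title "the Countess", so any grouping list needs to match that exact string.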

### Create 'CabinClass', 'FamilySize', 'IsAlone', and 'FarePerPerson' features

Next, let’s consider the `fare` and `cabin` features. These are likely correlated with passenger class, but they may carry additional information. From `cabin`, we can extract a new feature named CabinClass, taken from the leading letter of the cabin value, which gives a coarser and more usable category than the full cabin string.

Let’s also represent the fare more accurately by considering that passengers may have purchased tickets as families traveling together. To do this, we first create a new feature named “FamilySize” by adding the `sibsp` and `parch` features plus 1 for the passenger themselves, and then compute “FarePerPerson” by dividing `fare` by FamilySize.

Also, whether somebody is traveling alone or with their family could affect their chances of survival. For example, family members could help each other when trying to get to the lifeboats. Let’s therefore create a feature from the FamilySize feature that identifies whether a passenger was traveling alone, with the following code:

In [None]:
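# Create "CabinClass" feature from the first character of "cabin".
# If missing cabins were already filled with the mode in the earlier step,
# the 'U' (unknown) fallback only applies when that step was skipped.
titanic['CabinClass'] = titanic['cabin'].apply(lambda x: x[0] if pd.notna(x) else 'U')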

# Create a new feature "FamilySize" as a combination of sibsp and parch
titanic['FamilySize'] = titanic['sibsp'] + titanic['parch'] + 1

# Create new feature "IsAlone" from "FamilySize"
titanic['IsAlone'] = 0
titanic.loc[titanic['FamilySize'] == 1, 'IsAlone'] = 1

# Ensure 'fare' is numeric (has no effect if it was already parsed as a float column)
titanic['fare'] = pd.to_numeric(titanic['fare'], errors='coerce')

# Create "FarePerPerson" feature
# FamilySize is at least 1 by construction; the replace() is only a defensive guard against division by zero
titanic['FarePerPerson'] = titanic['fare'] / titanic['FamilySize'].replace(0, np.nan)
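
To make the arithmetic concrete, here is a small worked sketch on three made-up passengers. It assumes, as the text above does, that the recorded fare covers the whole family, which the dataset itself does not confirm:

```python
import pandas as pd

# Three illustrative passengers (values are made up)
toy = pd.DataFrame({
    'sibsp': [0, 1, 3],
    'parch': [0, 2, 1],
    'fare': [8.0, 40.0, 25.0],
})

# Family size = siblings/spouses + parents/children + the passenger
toy['FamilySize'] = toy['sibsp'] + toy['parch'] + 1

# 1 if the passenger has no family aboard, else 0
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)

# Fare split evenly across the family
toy['FarePerPerson'] = toy['fare'] / toy['FamilySize']
print(toy)
```

The boolean comparison followed by `astype(int)` produces the same 0/1 flag as initializing the column to 0 and setting matching rows to 1.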

### Create 'AgeGroup' feature

Next, let’s consider how age affects the likelihood of survival. Very young or elderly passengers may have been less likely to survive unless they had people to help them. However, we may not need yearly or fractional-year granularity here, and grouping passengers into age groups may be more effective. We can use the following code to create a new feature named AgeGroup, which groups passengers by decade: 0-9, 10-19, 20-29, and so on.

In [None]:
# Create "AgeGroup" feature
bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']
# right=False makes each bin include its left edge, e.g. [10, 20), so age 10 is labeled '10-19'
titanic['AgeGroup'] = pd.cut(titanic['age'], bins=bins, labels=labels, right=False)
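
How `pd.cut` treats ages that fall exactly on a bin edge depends on its `right` parameter. The sketch below, on a few made-up ages, compares the default (`right=True`, bins closed on the right) with `right=False` (bins closed on the left):

```python
import numpy as np
import pandas as pd

# Illustrative ages, including values exactly on bin edges (10 and 70)
ages = pd.Series([0.5, 9.9, 10.0, 35.0, 70.0, 80.0])

bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

# Default: intervals are (0, 10], (10, 20], ... so age 10 falls in the first bin
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 10), [10, 20), ... so age 10 falls in the second bin
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'right=True': right_closed, 'right=False': left_closed}))
```

With labels such as '0-9' and '10-19', the left-closed form is the one whose bins match the label text.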

### Encode features

We also want to convert the categorical features into numerical values using one-hot encoding, because ML models typically require numeric values.

In [None]:
# One-hot encode each categorical feature and append the encoded columns
one_hot = OneHotEncoder()
for col in ['Title', 'AgeGroup', 'CabinClass', 'sex', 'embarked']:
    # Fit on a single column and convert the sparse result to a dense array
    encoded = one_hot.fit_transform(titanic[[col]]).toarray()
    # Name the new columns "<feature>_<category>" and align them with the existing rows
    encoded_df = pd.DataFrame(encoded, columns=one_hot.get_feature_names_out([col]), index=titanic.index)
    titanic = pd.concat([titanic, encoded_df], axis=1)

# Drop irrelevant and non-encoded features
titanic = titanic.drop(['name', 'ticket', 'Title', 'cabin', 'sex', 'embarked', 'AgeGroup', 'CabinClass', 'home.dest'], axis=1)

## View all columns in updated dataset

In [None]:
titanic.columns

## Preview the updated dataset 

In [None]:
titanic.head()