# Feature Engineering for Titanic Dataset
This notebook prepares the Titanic dataset for the modeling phase. It fills missing values, derives new features from the raw columns, encodes the categorical variables as numbers, and standardizes the resulting features so they are ready for machine learning algorithms.

In [129]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import StandardScaler

In [130]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
submission_data = pd.read_csv('gender_submission.csv')

## Data Preparation for Feature Engineering
- Split the training data into features (`x_train`) and target (`y_train`). For the test data, the features come from `test.csv` and the `Survived` column comes from `gender_submission.csv`.
- Combined the training and test datasets so that every feature engineering step is applied to both at once.
- Dropped the `PassengerId` column, since it is only a row identifier and carries no information for model training.


In [131]:
y_train = train_data['Survived']
x_train = train_data.drop(columns='Survived')

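# gender_submission.csv is Kaggle's sample submission (it predicts survival for every
# female passenger), so y_test holds baseline predictions, not ground-truth labels
y_test = submission_data['Survived']
x_test = test_data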


x_data = pd.concat([x_train, x_test], axis=0, ignore_index=True)
y_data = pd.concat([y_train, y_test], axis=0, ignore_index=True)

train_test_data = pd.concat([x_data, y_data], axis=1)

In [132]:
train_test_data.drop(columns='PassengerId', inplace=True)

In [133]:
train_test_data.head(3)

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |


In [134]:
train_test_data.describe()

| | Pclass | Age | SibSp | Parch | Fare | Survived |
|---|---|---|---|---|---|---|
| count | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 1309.0 |
| mean | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 | 0.377387 |
| std | 0.837836 | 14.413493 | 1.041658 | 0.86556 | 51.758668 | 0.484918 |
| min | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 0.0 |
| 50% | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 0.0 |
| 75% | 3.0 | 39.0 | 1.0 | 0.0 | 31.275 | 1.0 |
| max | 3.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 1.0 |


## Handling Missing Values 
- Filled missing `Fare` values with the median and `Embarked` values with the mode.
- Created a new feature `AgeGroup` by binning `Age` into the intervals (0, 5], (5, 16], (16, 30], (30, 65], and (65, ∞), labeled Infant, Child, Adult, Middle, and Senior.
- For missing values in `AgeGroup`, filled them based on the mode within `Pclass` and `Sex` groups.
- Created a binary feature `HasCabin` indicating the presence of cabin information.
- Dropped the original `Age` and `Cabin` columns after creating the new features.


In [135]:
train_test_data.isnull().sum()

Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
Survived       0
dtype: int64

### 1. Fare & Embarked

In [136]:
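# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may operate on a copy and is deprecated in recent pandas versions
train_test_data['Fare'] = train_test_data['Fare'].fillna(train_test_data['Fare'].median())
train_test_data['Embarked'] = train_test_data['Embarked'].fillna(train_test_data['Embarked'].mode()[0])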

### 2. Age

In [137]:
bins = [0, 5, 16, 30, 65, np.inf]
labels = ['Infant', 'Child', 'Adult', 'Middle', 'Senior']

train_test_data['AgeGroup'] = pd.cut(train_test_data['Age'], bins=bins, labels=labels)
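
To make the bin boundaries concrete, here is a minimal sketch that applies the same `bins` and `labels` to a handful of made-up ages (the values in `ages` are hypothetical and chosen to sit on or near the edges). With the default `right=True`, each interval includes its right edge, and missing ages stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages placed on and around the bin edges, plus one missing value
ages = pd.Series([0.17, 5, 5.5, 16, 30, 65, 80, np.nan])

bins = [0, 5, 16, 30, 65, np.inf]
labels = ['Infant', 'Child', 'Adult', 'Middle', 'Senior']

# Default right=True: intervals are (left, right], so an edge value goes to the lower bin
age_group = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({'Age': ages, 'AgeGroup': age_group}))
```

An age of exactly 5 is still an Infant and exactly 16 is still a Child, while the missing age produces a missing `AgeGroup` that has to be filled in a later step.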


In [138]:
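# Fill each missing AgeGroup with the most frequent group among passengers of the same Pclass and Sex
train_test_data['AgeGroup'] = train_test_data.groupby(['Pclass', 'Sex'])['AgeGroup'].transform(lambda x: x.fillna(x.mode()[0]))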
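
The group-wise fill can be checked on a small example. The sketch below uses a made-up frame (`toy` and its values are hypothetical) with plain string labels instead of a categorical column, and applies the same `groupby(...).transform(...)` pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two (Pclass, Sex) groups, each with one missing AgeGroup
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 3, 3, 3],
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'AgeGroup': ['Middle', 'Middle', np.nan, 'Adult', np.nan, 'Adult'],
})

# Within each group, replace missing values with that group's most frequent label
toy['AgeGroup'] = (
    toy.groupby(['Pclass', 'Sex'])['AgeGroup']
       .transform(lambda s: s.fillna(s.mode()[0]))
)

print(toy)
```

One caveat of this pattern: if every value in a group were missing, `mode()` would return an empty result and the `[0]` lookup would fail, so it relies on each group having at least one known `AgeGroup`.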

### 3. Cabin

In [139]:
train_test_data['HasCabin'] = train_test_data['Cabin'].notna().astype(int)

In [140]:
train_test_data.drop(columns=['Age', 'Cabin'], inplace=True)

## Feature Combining and Transformation
- Extracted the `Title` from the `Name` column and mapped it to a smaller set of categories (`Mr`, `Miss`, `Mrs`, `Master`, and `Others` for all rare titles).
- Split titles labeled as `Others` by gender (`Sex`) into `Others_male` and `Others_female`.
- Created a new feature `FamilySize` as `SibSp + Parch + 1`, then categorized it into `Solo` (1), `Small` (2 to 4), and `Large` (5 or more) groups.
- Categorized the `Fare` feature into `Very_Low`, `Low`, `Medium`, `High`, and `Very_High` groups.
- Created a `ClassCategory` feature that combines `Pclass` and `Sex` into four groups: `HighClassFemale` (1st or 2nd class), `LowClassFemale` (3rd class), `HighClassMale` (1st class), and `LowClassMale` (2nd or 3rd class).
- Mapped `Pclass` values to `1st`, `2nd`, and `3rd` class labels.
- Dropped `Name`, `SibSp`, `Parch`, `FamilySize`, `Ticket`, and `Fare`, which are either replaced by the derived features above or not used.


### 1. Name

In [141]:
train_test_data['Title'] = train_test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
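
The regular expression captures the first run of letters that is preceded by a space and followed by a period. A small sketch on three names shows the behavior; the first two names appear in the data preview above, and the third is a made-up example.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Doe, Dr. Jane',            # hypothetical passenger, for illustration only
])

# ' ([A-Za-z]+)\.' : a space, then one or more letters (captured), then a literal period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

print(titles.tolist())
```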

In [142]:
train_test_data['Title'].value_counts()

Title
Mr          757
Miss        260
Mrs         197
Master       61
Rev           8
Dr            8
Col           4
Mlle          2
Major         2
Ms            2
Lady          1
Sir           1
Mme           1
Don           1
Capt          1
Countess      1
Jonkheer      1
Dona          1
Name: count, dtype: int64

In [143]:
title_mapping = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Rev': 'Others',
    'Dr': 'Others',
    'Col': 'Others',
    'Mlle': 'Others',
    'Major': 'Others',
    'Ms': 'Others',
    'Lady': 'Others',
    'Sir': 'Others',
    'Mme': 'Others',
    'Don': 'Others',
    'Capt': 'Others',
    'Countess': 'Others',
    'Jonkheer': 'Others',
    'Dona': 'Others'
}

# Map titles to the 'Title' column
train_test_data['Title'] = train_test_data['Title'].map(title_mapping)

# Update 'Others' titles based on 'Sex'
train_test_data['Title'] = train_test_data.apply(
    lambda row: f"Others_{row['Sex']}" if row['Title'] == 'Others' else row['Title'],
    axis=1
)
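
As an alternative to the row-wise `apply`, the same relabeling can be sketched with a vectorized `Series.mask`. This is only an illustration on a made-up frame (`toy` is hypothetical), not a change to the cell above.

```python
import pandas as pd

# Hypothetical titles after mapping, with two rare titles collapsed to 'Others'
toy = pd.DataFrame({
    'Title': ['Mr', 'Others', 'Others', 'Miss'],
    'Sex':   ['male', 'male', 'female', 'female'],
})

# Where Title is 'Others', replace it with 'Others_<sex>'; leave other titles unchanged
is_other = toy['Title'].eq('Others')
toy['Title'] = toy['Title'].mask(is_other, 'Others_' + toy['Sex'])

print(toy['Title'].tolist())
```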


### 2. Family

In [144]:
train_test_data['FamilySize'] = train_test_data['SibSp'] + train_test_data['Parch'] + 1

In [145]:
bins = [0, 2, 5, np.inf]
labels = ['Solo', 'Small', 'Large']

train_test_data['FamilySizeCategory'] = pd.cut(train_test_data['FamilySize'], bins=bins, labels=labels, right=False)

### 3. Fare

In [146]:
bins = [0, 12, 35, 55, 100, np.inf]
labels = ['Very_Low', 'Low', 'Medium', 'High', 'Very_High']

# Create a new column 'FareCategory' by binning the 'Fare' column
train_test_data['FareCategory'] = pd.cut(train_test_data['Fare'], bins=bins, labels=labels, right=False)
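
The `right=False` argument matters here because the minimum fare in the data is 0. The following sketch, on a few sample fares (the selection in `fares` is made up for illustration), compares left-closed bins with the default right-closed bins.

```python
import numpy as np
import pandas as pd

# A few sample fares, including 0 and values that sit exactly on bin edges
fares = pd.Series([0.0, 7.25, 12.0, 35.0, 512.3292])

bins = [0, 12, 35, 55, 100, np.inf]
labels = ['Very_Low', 'Low', 'Medium', 'High', 'Very_High']

# right=False: intervals are [left, right), so a fare of 0 lands in 'Very_Low'
left_closed = pd.cut(fares, bins=bins, labels=labels, right=False)

# Default right=True: intervals are (left, right], so a fare of 0 falls outside every bin
right_closed = pd.cut(fares, bins=bins, labels=labels)

print(pd.DataFrame({'Fare': fares, 'right=False': left_closed, 'right=True': right_closed}))
```

With the default setting, the zero fare would become a missing `FareCategory`, and edge values such as 12 and 35 would move one bin lower.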

### 4. Class - Gender Combination

In [147]:
def categorize_class(row):
    if row['Pclass'] in [1, 2] and row['Sex'] == 'female':
        return 'HighClassFemale'
    elif row['Pclass'] == 3 and row['Sex'] == 'female':
        return 'LowClassFemale'
    elif row['Pclass'] == 1 and row['Sex'] == 'male':
        return 'HighClassMale'
    elif row['Pclass'] in [2, 3] and row['Sex'] == 'male':
        return 'LowClassMale'
    else:
        return 'Other'

train_test_data['ClassCategory'] = train_test_data.apply(categorize_class, axis=1)
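
The same four rules can also be written without a row-wise `apply`. The sketch below uses `np.select` on a made-up frame (`toy` is hypothetical) that covers every class and gender combination, and it is meant as an illustration of the rules rather than a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with every (Pclass, Sex) combination
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 1, 2, 3],
    'Sex':    ['female', 'female', 'female', 'male', 'male', 'male'],
})

female = toy['Sex'].eq('female')
male = toy['Sex'].eq('male')

# Conditions are checked in order; the first one that is True determines the label
conditions = [
    female & toy['Pclass'].isin([1, 2]),
    female & toy['Pclass'].eq(3),
    male & toy['Pclass'].eq(1),
    male & toy['Pclass'].isin([2, 3]),
]
choices = ['HighClassFemale', 'LowClassFemale', 'HighClassMale', 'LowClassMale']

toy['ClassCategory'] = np.select(conditions, choices, default='Other')

print(toy)
```

Note the asymmetry in the rules: 2nd class counts as "high" for women but as "low" for men.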

### 5. Pclass

In [148]:
train_test_data['Pclass'] = train_test_data['Pclass'].map({1: '1st', 2: '2nd', 3: '3rd'})

In [149]:
train_test_data.drop(columns=['Name','SibSp', 'Parch', 'FamilySize','Ticket', 'Fare'], inplace=True)

## Encoding Categorical Features Based on Survival Rates
- Converted categorical columns to `object` type for consistency.
- Calculated survival rates for each category within the categorical features.
- Mapped categories to integer ranks (0, 1, 2, ...) in ascending order of survival rate, so the category with the lowest survival rate gets 0 and the category with the highest survival rate gets the largest value.
- Transformed the entire dataset to `float` type after encoding.


In [150]:
train_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Pclass              1309 non-null   object  
 1   Sex                 1309 non-null   object  
 2   Embarked            1309 non-null   object  
 3   Survived            1309 non-null   int64   
 4   AgeGroup            1309 non-null   category
 5   HasCabin            1309 non-null   int32   
 6   Title               1309 non-null   object  
 7   FamilySizeCategory  1309 non-null   category
 8   FareCategory        1309 non-null   category
 9   ClassCategory       1309 non-null   object  
dtypes: category(3), int32(1), int64(1), object(5)
memory usage: 71.0+ KB


In [151]:

categorical_columns = train_test_data.select_dtypes(include='category').columns
train_test_data[categorical_columns] = train_test_data[categorical_columns].astype('object')

In [152]:
# Ordinal-encode every object column by the rank of each category's mean survival rate.
# Note: the rates are computed over the combined train and test rows.
for column in train_test_data.select_dtypes(include='object').columns:
    # Calculate survival rates for each category in the column
    survival_rates = train_test_data.groupby(column)['Survived'].mean()

    # Sort the categories based on survival rates (ascending order)
    survival_rates = survival_rates.sort_values()

    # Create a mapping from category to survival rate-based values (0, 1, 2,...)
    category_mapping = {category: idx for idx, category in enumerate(survival_rates.index)}

    # Map the categories to their new values
    train_test_data[column] = train_test_data[column].map(category_mapping)
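
To see what the loop does to a single column, here is a minimal sketch on made-up data (`toy` and its survival values are hypothetical). Categories are ranked by their mean survival rate, and the rank replaces the label.

```python
import pandas as pd

# Hypothetical data: S survives 1 of 4, Q survives 1 of 2, C survives 2 of 2
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 0, 1, 1, 1, 0, 1],
})

# Mean survival per category, sorted from lowest to highest
rates = toy.groupby('Embarked')['Survived'].mean().sort_values()

# The position in the sorted order becomes the encoded value
mapping = {category: rank for rank, category in enumerate(rates.index)}

toy['Embarked_encoded'] = toy['Embarked'].map(mapping)

print(rates)
print(mapping)
```

Because the encoding is derived from the target, a common variant is to compute `rates` from the training rows only and then apply the resulting mapping to all rows.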

In [153]:
train_test_data = train_test_data.astype('float')

## Feature Scaling
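- Dropped the `Survived` column so that the target is not scaled or included among the features. The training target is still available in `y_train`.
- Applied `StandardScaler` to standardize every feature to mean 0 and standard deviation 1. Standardization shifts and rescales each column but does not change the shape of its distribution.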


In [154]:
train_test_data = train_test_data.drop(columns='Survived')

In [155]:
scaler = StandardScaler()

numerical_columns = train_test_data.select_dtypes(include=['int64', 'float64']).columns

train_test_data[numerical_columns] = scaler.fit_transform(train_test_data[numerical_columns])
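
For intuition, the standardization can be reproduced by hand. The sketch below recomputes z = (x - mean) / std with pandas on a made-up frame (`toy` is hypothetical), using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. It is a manual illustration, not the scaler call itself.

```python
import pandas as pd

# Hypothetical encoded features: one binary column and one ordinal column
toy = pd.DataFrame({
    'HasCabin':     [0.0, 0.0, 0.0, 1.0],
    'FareCategory': [0.0, 1.0, 2.0, 4.0],
})

# z = (x - mean) / std, column by column, with the population std (ddof=0)
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(scaled.mean())        # approximately 0 for every column
print(scaled.std(ddof=0))   # approximately 1 for every column
```

The binary column still takes only two distinct values after scaling, which is why the `HasCabin` column in the output below shows just two numbers.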

In [156]:
train_test_data.head()

| | Pclass | Sex | Embarked | AgeGroup | HasCabin | Title | FamilySizeCategory | FareCategory | ClassCategory |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.841916 | -0.743497 | -0.622279 | -0.749575 | -0.539377 | -0.808559 | 1.282478 | -0.863471 | -0.868057 |
| 1 | 1.546098 | 1.344995 | 1.834926 | 0.508365 | 1.853992 | 1.596904 | 1.282478 | 1.627969 | 1.636615 |
| 2 | -0.841916 | 1.344995 | -0.622279 | -0.749575 | -0.539377 | 0.995538 | -0.477232 | -0.863471 | 0.801725 |
| 3 | 1.546098 | 1.344995 | -0.622279 | 0.508365 | 1.853992 | 1.596904 | 1.282478 | 0.797489 | 1.636615 |
| 4 | -0.841916 | -0.743497 | -0.622279 | 0.508365 | -0.539377 | -0.808559 | -0.477232 | -0.863471 | -0.868057 |


## Data Preparation for Modeling
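- Split the combined dataset back into training features (`train_x`) and testing features (`test_x`) by row position. The first `len(y_train)` rows are the training rows, and `y_train` remains the training target.
- The prepared data can be saved to CSV files for model training and evaluation. The export lines are left commented out in the last cell.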


In [157]:
train_x = train_test_data.iloc[:y_train.shape[0], :]
test_x = train_test_data.iloc[y_train.shape[0]:, :]
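
The positional split relies on the training rows coming first in the combined frame, which holds because the frames were concatenated in that order with `ignore_index=True`. A minimal sanity check of the idea, on made-up stand-ins for the real objects (`combined` and this toy `y_train` are hypothetical), could look like this:

```python
import pandas as pd

# Hypothetical stand-ins: 3 training rows followed by 2 test rows
y_train = pd.Series([0, 1, 1])
combined = pd.DataFrame({'feature': [0.1, 0.2, 0.3, 0.4, 0.5]})

n_train = len(y_train)
train_x = combined.iloc[:n_train]   # first n_train rows
test_x = combined.iloc[n_train:]    # all remaining rows

# Features and target must line up, and no row may be lost or duplicated
assert len(train_x) == len(y_train)
assert len(train_x) + len(test_x) == len(combined)

print(train_x.shape, test_x.shape)
```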

In [158]:
#train_x.to_csv('train_x.csv', index=False)
#y_train.to_csv('train_y.csv', index=False)
#test_x.to_csv('test_x.csv', index=False)