# Feature Engineering on Spaceship Titanic Data
This notebook focuses on transforming and creating new features from the raw Spaceship Titanic dataset to enhance model performance. 
It includes:
- Extraction of meaningful features from columns such as `PassengerId`, `Cabin`, and `Age`.
- Handling missing values using appropriate strategies for categorical and numerical features.
- Feature normalization and transformation.
- Target encoding for categorical features.
- Final data preparation for machine learning models.


In [203]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

pd.set_option('display.max_columns', None)

from sklearn.preprocessing import StandardScaler

In [204]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
submission_data = pd.read_csv('sample_submission.csv')

In [205]:
y_train = train_data['Transported']
x_train = train_data.drop(columns='Transported')

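# sample_submission.csv holds placeholder labels, not ground truth; they are kept
# only so that the combined frame has a Transported value for every row
y_test = submission_data['Transported']
x_test = test_data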


x_data = pd.concat([x_train, x_test], axis=0, ignore_index=True)
y_data = pd.concat([y_train, y_test], axis=0, ignore_index=True)

train_test_data = pd.concat([x_data, y_data], axis=1)

In [206]:
train_test_data.head(3)

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |


## Extracting Features
- Created `Group_Number` by extracting the prefix from `PassengerId`.
- Calculated `Group_Size` by counting the occurrences of each `Group_Number` and mapped it accordingly.
- Converted both `Group_Number` and `Group_Size` to float type.
- Split the `Cabin` column into three new features: `Cabin_Deck`, `Cabin_Num`, and `Cabin_Side`.
- Dropped the `PassengerId`, `Cabin`, and `Name` columns as they are not necessary for model training.

In [207]:
train_test_data['Group_Number'] = train_test_data['PassengerId'].str.split('_').str[0]

group_sizes = train_test_data['Group_Number'].value_counts()
train_test_data['Group_Size'] = train_test_data['Group_Number'].map(group_sizes)

train_test_data[['Group_Number', 'Group_Size']] = train_test_data[['Group_Number', 'Group_Size']].astype('float')

In [208]:
train_test_data[['Cabin_Deck', 'Cabin_Num', 'Cabin_Side']] = train_test_data['Cabin'].str.split('/', expand=True)

train_test_data['Cabin_Num'] = train_test_data['Cabin_Num'].astype('float')

In [209]:
# Left disabled: splitting Name into first name and surname
#train_test_data[['FirstName', 'Surname']] = train_test_data['Name'].str.split(' ', n=1, expand=True)

In [210]:
train_test_data.drop(columns=['PassengerId', 'Cabin', 'Name'], inplace=True)
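
The extraction steps above can be checked on a few hand-made rows. This is a minimal sketch, assuming the same `gggg_pp` ID format and `deck/num/side` cabin format as the dataset; the `toy` frame is made up for illustration and is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical rows in the dataset's format
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"],
    "Cabin": ["B/0/P", "A/0/S", "A/0/S", None],
})

# Group number is the part of the ID before the underscore
toy["Group_Number"] = toy["PassengerId"].str.split("_").str[0]

# Group size is the number of passengers sharing a group number
group_sizes = toy["Group_Number"].value_counts()
toy["Group_Size"] = toy["Group_Number"].map(group_sizes)

# A missing Cabin produces missing values in all three derived columns
toy[["Cabin_Deck", "Cabin_Num", "Cabin_Side"]] = toy["Cabin"].str.split("/", expand=True)
toy["Cabin_Num"] = toy["Cabin_Num"].astype("float")

print(toy[["Group_Number", "Group_Size", "Cabin_Deck", "Cabin_Num", "Cabin_Side"]])
```

The last row shows why the three cabin columns have identical missing-value counts later in the notebook: they all come from the same missing `Cabin` entries.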

## Handling Missing Values
- Checked and displayed the columns with missing values.
- Filled missing values in categorical features (object dtype, or at most 10 unique values) with the mode of the respective feature.
- Filled missing numerical values with the median of the respective feature.


In [211]:
missing_values = train_test_data.isnull().sum()
print(missing_values[missing_values > 0])


HomePlanet      288
CryoSleep       310
Destination     274
Age             270
VIP             296
RoomService     263
FoodCourt       289
ShoppingMall    306
Spa             284
VRDeck          268
Cabin_Deck      299
Cabin_Num       299
Cabin_Side      299
dtype: int64


In [212]:
for feature in train_test_data.columns:
    if train_test_data[feature].isnull().sum() > 0: 
        if (train_test_data[feature].dtype == 'object') or (train_test_data[feature].nunique() <= 10):
            # Fill categorical missing values with mode
            train_test_data[feature] = train_test_data[feature].fillna(train_test_data[feature].mode()[0])
        else:
            # Fill numerical missing values with median
            train_test_data[feature] = train_test_data[feature].fillna(train_test_data[feature].median())
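
The cell above computes the mode and median over the combined train and test rows. A possible variant, sketched here under assumptions, learns the fill values from training rows only and applies them to both sets. The helper `fit_fill_values`, its `max_categories` parameter, and the toy frames are hypothetical names introduced for illustration, and the numeric-dtype check is a stand-in for the notebook's `object` dtype test.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fit_fill_values(train_df, max_categories=10):
    """Return {column: fill value} learned from the training rows only."""
    fill_values = {}
    for col in train_df.columns:
        series = train_df[col]
        if not is_numeric_dtype(series) or series.nunique() <= max_categories:
            fill_values[col] = series.mode()[0]   # categorical: most frequent value
        else:
            fill_values[col] = series.median()    # numerical: median
    return fill_values

# Hypothetical toy data; max_categories=2 keeps Age on the numerical branch
toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Earth", "Mars", None],
                          "Age": [20.0, 30.0, 40.0, None]})
toy_test = pd.DataFrame({"HomePlanet": [None, "Mars"],
                         "Age": [None, 100.0]})

fill_values = fit_fill_values(toy_train, max_categories=2)
filled_train = toy_train.fillna(value=fill_values)
filled_test = toy_test.fillna(value=fill_values)
print(fill_values)
```

With this arrangement the extreme test age of 100 does not shift the median used for imputation.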

In [213]:
train_test_data.isnull().sum()

HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
Group_Number    0
Group_Size      0
Cabin_Deck      0
Cabin_Num       0
Cabin_Side      0
dtype: int64

## Creating New Features
- Created `Total_Spend` by summing the expenditures across different services.
- Created `Has_Spent` as a boolean feature indicating whether the individual has spent any money.
- Categorized individuals into age groups using the `Age` feature, resulting in the `Age_Group` feature.
- Categorized individuals into cabin regions based on `Cabin_Num`, resulting in the `Cabin_Region` feature.
- Converted `Age_Group` and `Cabin_Region` from the `category` dtype returned by `pd.cut` to `object`.
- Dropped the original `Age` and `Cabin_Num` columns as they were no longer needed.

In [214]:
spend_columns = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
train_test_data['Total_Spend'] = train_test_data[spend_columns].sum(axis=1)

# True for passengers with any recorded spending
train_test_data['Has_Spent'] = train_test_data['Total_Spend'] > 0

In [215]:
bins = [0, 10, 16, 20, 26, 50, float('inf')]
labels = ['Child', 'Young','Adult', 'Middle-Age', 'Senior', 'Old']

train_test_data['Age_Group'] = pd.cut(train_test_data['Age'], bins=bins, labels=labels, right=False)

In [216]:
bins = [0, 300, 800, 1100, 1550, float('inf')]
labels = ['r1', 'r2', 'r3', 'r4', 'r5']

train_test_data['Cabin_Region'] = pd.cut(train_test_data['Cabin_Num'], bins=bins, labels=labels, right=False)

In [217]:
train_test_data[['Age_Group', 'Cabin_Region']] = train_test_data[['Age_Group', 'Cabin_Region']].astype('object')
train_test_data.drop(columns=['Age', 'Cabin_Num'], inplace=True)
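
Because `right=False` is passed to `pd.cut`, each bin includes its left edge and excludes its right edge, so an age of exactly 10 falls into `Young` rather than `Child`. A small sketch with made-up ages, reusing the bins and labels from the cell above:

```python
import pandas as pd

bins = [0, 10, 16, 20, 26, 50, float("inf")]
labels = ["Child", "Young", "Adult", "Middle-Age", "Senior", "Old"]

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0, 9.9, 10, 24, 39, 58])

# Intervals are [0, 10), [10, 16), [16, 20), [20, 26), [26, 50), [50, inf)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())
```

The ages 39, 24, and 58 are those of the three sample rows shown at the top of the notebook.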

## Normalization & Transformation
- Converted each expenditure feature to its share of `Total_Spend` (0 for passengers who spent nothing).
- Applied a `log1p` transformation to numerical features with more than 10 unique values and an absolute skewness above 0.5, to reduce their skew.


In [218]:
feature_to_normalize = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Divide each expenditure by Total_Spend; passengers with a zero total get a share of 0
nonzero_total = train_test_data['Total_Spend'].replace(0, np.nan)
train_test_data[feature_to_normalize] = (
    train_test_data[feature_to_normalize].div(nonzero_total, axis=0).fillna(0)
)

In [219]:
numerical_features = train_test_data.select_dtypes(include=['float64', 'int64']).columns[
    train_test_data.select_dtypes(include=['float64', 'int64']).nunique() > 10
]

for feature in numerical_features:
    skewness = train_test_data[feature].skew() 
    if abs(skewness) > 0.5: 
        print(f"Applying log1p to: {feature}, Skewness: {skewness:.2f}")
        train_test_data[feature] = np.log1p(train_test_data[feature])  # log1p transformation

Applying log1p to: RoomService, Skewness: 2.14
Applying log1p to: FoodCourt, Skewness: 2.15
Applying log1p to: ShoppingMall, Skewness: 2.48
Applying log1p to: Spa, Skewness: 2.41
Applying log1p to: VRDeck, Skewness: 2.51
Applying log1p to: Total_Spend, Skewness: 4.55
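
As a worked check of the two steps, the raw spending of the second sample row (index 1 in the first `head(3)` output) can be pushed through the same arithmetic by hand. The sketch below only uses values printed in this notebook; the variable names are illustrative.

```python
import numpy as np

# Raw spending of the sample row with index 1
spend = {"RoomService": 109.0, "FoodCourt": 9.0, "ShoppingMall": 25.0,
         "Spa": 549.0, "VRDeck": 44.0}

total = sum(spend.values())                                # 736.0
shares = {k: v / total for k, v in spend.items()}          # share of total spend
logged = {k: float(np.log1p(s)) for k, s in shares.items()}  # log(1 + share)
logged_total = float(np.log1p(total))                      # log(1 + Total_Spend)

print(round(logged["RoomService"], 6), round(logged["Spa"], 6), round(logged_total, 6))
```

The shares sum to 1 before the log transform, and the transformed values agree with the `RoomService`, `Spa`, and `Total_Spend` entries of the same row in the `head(3)` output of the next section.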


## Target Encoding
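- Applied a target-guided encoding to every column with at most 10 unique values (the categorical and boolean features, plus any low-cardinality numeric column).
- Replaced each category with its rank (0, 1, 2, ...) after sorting the categories by their mean `Transported` rate, so the encoded value is an ordinal rank rather than the rate itself.
- Converted the entire dataset to float type after encoding.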


In [220]:
train_test_data.head(3)

| | HomePlanet | CryoSleep | Destination | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Group_Number | Group_Size | Cabin_Deck | Cabin_Side | Total_Spend | Has_Spent | Age_Group | Cabin_Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | TRAPPIST-1e | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 1.0 | 1.0 | B | P | 0.0 | False | Senior | r1 |
| 1 | Earth | False | TRAPPIST-1e | False | 0.138107 | 0.012154 | 0.033403 | 0.557284 | 0.058064 | True | 2.0 | 1.0 | F | S | 6.602588 | True | Middle-Age | r1 |
| 2 | Europa | False | TRAPPIST-1e | True | 0.004133 | 0.295955 | 0.0 | 0.498792 | 0.004708 | False | 3.0 | 2.0 | A | S | 9.248021 | True | Old | r1 |


In [221]:
# The test rows carry placeholder labels from sample_submission.csv, so the
# transported rates are computed on the training rows only
train_rows = train_test_data.iloc[:y_train.shape[0]].copy()

for column in train_test_data.columns:
    if train_test_data[column].nunique() <= 10:
        # Mean of Transported for each category in the column
        transported_rates = train_rows.groupby(column)['Transported'].mean()

        # Sort the categories by transported rate (ascending order)
        transported_rates = transported_rates.sort_values()

        # Map each category to its rank in that ordering (0, 1, 2, ...)
        category_mapping = {category: idx for idx, category in enumerate(transported_rates.index)}

        # Replace the categories with their ranks in the full dataset
        train_test_data[column] = train_test_data[column].map(category_mapping)
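
To make the mapping concrete, here is a minimal sketch of the rank-by-target-mean idea on a made-up frame. The helper `rank_by_target_mean` and the toy data are hypothetical; the sketch assumes the mapping is learned from rows with real labels.

```python
import pandas as pd

def rank_by_target_mean(labeled_df, column, target):
    """Map each category to its rank when sorted by the target mean (ascending)."""
    rates = labeled_df.groupby(column)[target].mean().sort_values()
    return {category: rank for rank, category in enumerate(rates.index)}

# Hypothetical labeled rows: Earth 0/2, Mars 1/2, Europa 2/2 transported
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Europa", "Europa", "Mars", "Mars"],
    "Transported": [False, False, True, True, True, False],
})

mapping = rank_by_target_mean(toy, "HomePlanet", "Transported")
encoded = toy["HomePlanet"].map(mapping)
print(mapping)
```

Only the ordering of the categories survives this encoding: a category with a rate of 0.5 and one with a rate of 0.99 are one step apart if they are adjacent in the ranking.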

In [222]:
train_test_data = train_test_data.astype('float')

## Scaling Features
- Applied `StandardScaler` to every column (all are numeric after encoding) so that each has a mean of 0 and a standard deviation of 1. The scaler is fit on the combined train and test rows.

In [223]:
scaler = StandardScaler()

numerical_columns = train_test_data.columns

train_test_data[numerical_columns] = scaler.fit_transform(train_test_data[numerical_columns])
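
A common alternative to fitting the scaler on the combined rows is to fit it on the training rows and reuse the learned mean and standard deviation for the test rows. The sketch below illustrates that variant on made-up arrays; it is not what the cell above does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: three training rows, one test row
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The training columns end up with mean 0 and standard deviation 1, while the test rows are free to fall outside that range.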

## Splitting and Saving Data for Modeling
- Dropped the `Transported` column from `train_test_data` to separate features from the target variable.
- Split the dataset into training (`train_x`, `train_y`) and testing (`test_x`, `test_y`) sets:
  - `train_x`: Features for training.
  - `test_x`: Features for testing.
  - `train_y`: Target variable for training.
  - `test_y`: Placeholder target for testing (taken from `submission_data`; these are not ground-truth labels).
- Saved `train_x`, `test_x`, `train_y`, and `test_y` as CSV files: `train_x.csv`, `test_x.csv`, `train_y.csv`, and `test_y.csv`.

In [224]:
train_test_data.drop(columns='Transported', inplace=True)

train_x = train_test_data.iloc[:y_train.shape[0], :]
test_x = train_test_data.iloc[y_train.shape[0]:, :]

test_y = submission_data['Transported']
train_y = y_train

In [225]:
train_x.shape, y_train.shape, test_x.shape, test_y.shape

((8693, 17), (8693,), (4277, 17), (4277,))

In [226]:
# Exporting datasets to CSV
train_x.to_csv('train_x.csv', index=False)
test_x.to_csv('test_x.csv', index=False)
train_y.to_csv('train_y.csv', index=False)
test_y.to_csv('test_y.csv', index=False)
