# About this notebook

This notebook is the first part of a larger collection of notebooks that try to solve the Kaggle Competition - Spaceship Titanic. The goal of this competition is to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

The competition can be found at the following link: https://www.kaggle.com/c/spaceship-titanic/overview

The collection of notebooks is structured as follows:
- Notebook 1: Problem Exploration
- Notebook 2: Data Wrangling + Exploratory Data Analysis
- Notebook 3: Model Selection and Evaluation

In particular, once the data is understood, this notebook applies a cleaning and transformation process to the datasets and then generates new features in the feature engineering stage. In summary, it covers the following:

- Data transformation: dealing with missing values, converting columns to other types, etc.
- Feature Generation

That being said, let's get started! 🤘


---

# Project summary

So, we could summarize the project as follows:


- Total number of passengers: 13K passengers (14 columns)
- **Goal**: To predict which passengers were transported by the anomaly
- **Target column**: Transported (0 = No, 1 = Yes)
- Evaluation: Accuracy (percentage of passengers correctly predicted)
- Type of problem: Binary Classification
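
To make the evaluation metric listed above concrete, here is a minimal sketch of how accuracy is computed. The function name and the toy labels are made up for illustration and are not part of the competition data.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Toy labels (illustrative only): 1 = transported, 0 = not transported
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 3 of 5 predictions are correct -> 0.6
```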

It is important to take into account that, although I will approach this problem as a binary classification, I will try to use different models (logistic regression, random forest, neural networks, etc.) and different feature engineering techniques.

The goal is to learn and practice different techniques and to compare the results obtained with each of them.

In [1]:
# Libraries
import os
import sys

# Add root folder to path if not already there
root_folder = os.path.dirname(os.getcwd())
if root_folder not in sys.path:
    sys.path.append(root_folder)

# Data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
new_palette = sns.color_palette("Set2")  # You can replace "Set2" with any other palette name
sns.set_palette(new_palette)

# Own libraries (Data Wrangling)
from preprocessing.preproc_basic import preprocess_dataset
from preprocessing.preproc_missing_values import *
from preprocessing.preproc_features import *

path_data = r'D:\Kaggle Data/spaceship-titanic/'

In [2]:
# Load main datasets
train = pd.read_csv(path_data + 'train.csv')
test = pd.read_csv(path_data + 'test.csv')
sample_submission = pd.read_csv(path_data + 'sample_submission.csv')

submission_ids = test['PassengerId'] # Extract the ids for a future submission

# Get target
TARGET = train['Transported']

In [3]:
print('Train dataset before the preprocessing...')
train.head(1)

Train dataset before the preprocessing...


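| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |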


# Basic pre-processing

The initial phase of data wrangling makes the data easier to understand through tasks such as renaming columns and changing data types. It also removes initial features that do not provide substantial value.

Once this stage is completed, we can proceed with more in-depth data preprocessing.
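
The internals of `preprocess_dataset` are not shown in this notebook. The sketch below only illustrates the kind of steps described above. It is inferred from the before/after rows printed in this notebook (lowercased categories, `Transported` encoded as 0/1, `VIP` no longer present). The function name `basic_preprocess_sketch` and every step in it are assumptions, not the actual implementation.

```python
import pandas as pd

def basic_preprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_dataset; not the author's code."""
    df = df.copy()
    # Normalize categorical labels: lowercase, then turn '-' and spaces into '_'
    for col in ["HomePlanet", "Destination"]:
        df[col] = df[col].str.lower().str.replace(r"[-\s]+", "_", regex=True)
    # Encode the boolean target as 0/1 (guarded: the test set has no target)
    if "Transported" in df.columns:
        df["Transported"] = df["Transported"].astype(int)
    # Drop a column judged to add little value (assumed here to be VIP)
    return df.drop(columns=["VIP"], errors="ignore")

# One-row sample using values from the first training row shown above
sample = pd.DataFrame({
    "PassengerId": ["0001_01"],
    "HomePlanet": ["Europa"],
    "Destination": ["TRAPPIST-1e"],
    "VIP": [False],
    "Transported": [False],
})
print(basic_preprocess_sketch(sample))
```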

In [4]:
print('Train dataset after basic preprocessing')
df_train = preprocess_dataset(train)
df_train.head(1)

Train dataset after basic preprocessing


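| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | europa | False | B/0/P | trappist_1e | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | 0 |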


# Handling Missing Values and Outliers

In [5]:
print('Train dataset after handling missing values and outliers')
df_train = process_missing_values(df_train)
df_train.head(1)

Train dataset after handling missing values and outliers


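| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | europa | False | B/0/P | trappist_1e | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | 0 |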
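
This notebook does not show what `process_missing_values` does internally. As a point of reference only, one common baseline is to fill numeric columns with the median and the remaining columns with the most frequent value. The sketch below shows that baseline; the function name `impute_sketch` and the strategy are assumptions, the sample rows are made up, and outlier handling is left out.

```python
import numpy as np
import pandas as pd

def impute_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical baseline: median for numeric columns, mode for the rest."""
    df = df.copy()
    for col in df.columns:
        if not df[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            mode = df[col].mode()
            if not mode.empty:  # an all-missing column has no mode
                df[col] = df[col].fillna(mode.iloc[0])
    return df

# Made-up rows for illustration
sample = pd.DataFrame({
    "Age": [39.0, np.nan, 21.0, 30.0],
    "HomePlanet": ["europa", "earth", None, "earth"],
})
print(impute_sketch(sample))
```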


# Generating New Features - First Iteration

Now that we have completed the initial dataset cleaning, it's time to create new features using the information at our disposal. Based on the observations from our previous Exploratory Data Analysis (EDA), we will introduce the following features:

1. **Age Group (String):** We will categorize passengers into age ranges based on insights gained from our distribution analysis: child (0-12), teenager (12-17), young adult (18-25), adult (26-30), middle-aged (31-50), and senior (50+).

2. **Total Billing (Float):** This feature represents the total amount of money spent during the trip, calculated as the sum of expenses from the 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', and 'RoomService' columns.

3. **Has Spent (Int):** This binary feature indicates whether a passenger has made any expenditures during the trip. It will be 1 if the passenger has spent money and 0 otherwise.

4. **Deck, Cabin Number, and Cabin Letter (String, Int, String):** The 'Cabin' field combines deck, number, and letter (for example, 'B/0/P'), so we will split it into these three components and investigate their relationship with our target variable.

5. **Group ID (Int):** Extracted from the passenger ID.

6. **Group Size (Int):** Derived from the passenger ID. We aim to explore whether passengers traveling in larger groups have different probabilities of being transported.

7. **Travel Alone (Int):** This binary feature indicates whether a passenger is traveling alone. It will be 1 if the passenger is in a group of size 1 and 0 otherwise.
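
The seven features listed above can be sketched in pandas as follows. This is an illustration, not the code behind `feature_engineering_iteration_one`. The function name, the new column names, and the bin edges are assumptions. In particular, the listed age ranges overlap at 12 and 50, so the sketch assigns age 12 to "child" and age 50 to "middle_aged". It also assumes missing values have already been handled, and the sample rows are made up apart from the first ID and cabin.

```python
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def feature_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical version of the seven features listed above."""
    df = df.copy()

    # 1. Age group, using right-closed bins: (-1, 12], (12, 17], (17, 25], ...
    bins = [-1, 12, 17, 25, 30, 50, float("inf")]
    labels = ["child", "teenager", "young_adult", "adult", "middle_aged", "senior"]
    df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels).astype(str)

    # 2-3. Total billing and a binary "has spent anything" flag
    df["TotalBilling"] = df[SPEND_COLS].sum(axis=1)
    df["HasSpent"] = (df["TotalBilling"] > 0).astype(int)

    # 4. Split Cabin ("deck/number/letter", e.g. "B/0/P") into three columns
    cabin = df["Cabin"].str.split("/", expand=True)
    df["Deck"] = cabin[0]
    df["CabinNumber"] = cabin[1].astype(int)
    df["CabinLetter"] = cabin[2]

    # 5-7. The part of PassengerId before "_" identifies the travel group
    df["GroupId"] = df["PassengerId"].str.split("_").str[0].astype(int)
    df["GroupSize"] = df.groupby("GroupId")["GroupId"].transform("size")
    df["TravelAlone"] = (df["GroupSize"] == 1).astype(int)
    return df

# Three illustrative passengers: one alone, two sharing a group
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Age": [39.0, 12.0, 58.0],
    "Cabin": ["B/0/P", "A/0/S", "A/0/S"],
    "RoomService": [0.0, 10.0, 0.0],
    "FoodCourt": [0.0, 5.0, 0.0],
    "ShoppingMall": [0.0, 0.0, 0.0],
    "Spa": [0.0, 0.0, 20.0],
    "VRDeck": [0.0, 0.0, 0.0],
})
print(feature_sketch(sample).T)
```

Note that the group size is counted only within the DataFrame passed in, so members of the same group that sit in a different file would not be counted.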

In [7]:
print('Train dataset after generating new features')
df_train = feature_engineering_iteration_one(df_train)
df_train.head(1).T

Train dataset after generating new features


Unnamed: 0,0
PassengerId,0001_01
HomePlanet,europa
CryoSleep,False
Cabin,B/0/P
Destination,trappist_1e
Age,39.0
RoomService,0.0
FoodCourt,0.0
ShoppingMall,0.0
Spa,0.0
