# About this notebook

This notebook is the first part of a larger collection of notebooks that try to solve the Kaggle Competition - Spaceship Titanic. The goal of this competition is to predict which passengers were transported by the anomaly using records recovered from the spaceship's damaged computer system.

The competition can be found at the following link: https://www.kaggle.com/c/spaceship-titanic/overview

The collection of notebooks is structured as follows:
- Notebook 1: Problem Exploration
- Notebook 2: Data Wrangling + Exploratory Data Analysis
- Notebook 3: Model Selection and Evaluation

In particular, this notebook focuses on data wrangling: once the data is understood, the datasets are cleaned and transformed, and new features are generated in the feature engineering stage. In summary, this section covers the following:

- Data transformation: dealing with missing values, converting columns to other types, etc.
- Feature Generation

That being said, let's get started! 🤘


---

# Project summary

So, we could summarize the project as follows:


- Total number of passengers: ~13K (12,970 across the train and test sets), 14 columns
- **Goal**: To predict which passengers were transported by the anomaly
- **Target column**: Transported (0 = No, 1 = Yes)
- Evaluation: Accuracy (percentage of passengers correctly predicted)
- Type of problem: Binary Classification

Although I will approach this problem as a binary classification, I will try to use different models (logistic regression, random forest, neural networks, etc.) and different feature engineering techniques.

The goal is to learn and practice different techniques and to compare the results obtained with each of them.

In [1]:
# Libraries
import os, sys

# Add root folder to path if not already there
root_folder = os.path.dirname(os.getcwd())
if root_folder not in sys.path:
    sys.path.append(root_folder)

# Data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
new_palette = sns.color_palette("Set2")  # You can replace "Set2" with any other palette name
sns.set_palette(new_palette)

# Own libraries (Data Wrangling)
from lib_eda.dataset_analysis import basic_eda_dataset

from preprocessing.preproc_basic import preprocess_dataset
from preprocessing.missing_home_planet import missing_values_homeplanet
from preprocessing.preproc_features import *

path_data = r'D:\Kaggle Data/spaceship-titanic/'

In [2]:
# Load main datasets
train = pd.read_csv(path_data + 'train.csv')
test = pd.read_csv(path_data + 'test.csv')
sample_submission = pd.read_csv(path_data + 'sample_submission.csv')

submission_ids = test['PassengerId'] # Extract the ids for a future submission

# Get target
TARGET = train['Transported']

In [3]:
print('Train dataset before the preprocessing...')
train.head(1)

Train dataset before the preprocessing...


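|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |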


# Basic pre-processing

The initial phase of data wrangling makes the data easier to work with through tasks such as renaming columns and changing data types. It also removes features that do not provide substantial value.

Once this stage is completed, we can proceed with more in-depth data preprocessing.

In [4]:
print('Train dataset after basic preprocessing')
df_train = preprocess_dataset(train)
df_test = preprocess_dataset(test)
df_train.head(1)

Train dataset after basic preprocessing


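|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | europa | False | B/0/P | trappist_1e | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | 0 |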
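
As a rough illustration of the kind of steps this stage involves, here is a minimal sketch inferred only from comparing the two outputs above (lowercased labels, `VIP` dropped, `Transported` encoded as 0/1). The function `basic_preprocess_sketch` is a hypothetical stand-in and is not the code behind `preprocess_dataset`, which is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def basic_preprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_dataset (assumption, not the author's code)."""
    out = df.copy()

    # Normalise category labels: lowercase, then turn every run of
    # non-alphanumeric characters into a single underscore.
    for col in ["HomePlanet", "Destination"]:
        out[col] = (
            out[col]
            .str.lower()
            .str.replace(r"[^0-9a-z]+", "_", regex=True)
        )

    # VIP does not appear in the preprocessed output, so it is dropped here.
    out = out.drop(columns=["VIP"], errors="ignore")

    # Encode the boolean target as 0/1 when present (the test set has no target).
    if "Transported" in out.columns:
        out["Transported"] = out["Transported"].astype(int)

    return out


raw = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "HomePlanet": ["Europa", np.nan],
    "Destination": ["TRAPPIST-1e", "PSO J318.5-22"],
    "VIP": [False, False],
    "Transported": [False, True],
})
clean = basic_preprocess_sketch(raw)
print(clean)
```

Missing values are left untouched at this point, which matches the notebook's plan of handling them in a later step.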


# Iteration 1

## Feature Engineering

Now that we have completed the initial dataset cleaning, it's time to create new features using the information at our disposal. Since we haven't filled in missing values yet, we will only create basic features that don't rely on the data being complete. For instance, although the `Cabin` feature will be split into several new ones, we'll handle its missing values later.

As observed in our previous Exploratory Data Analysis (EDA), we will introduce the following features:

1. **Group ID** (Int): Extracted from the passenger ID.

2. **Group Size** (Int): Derived from the passenger ID. We aim to explore whether passengers traveling in larger groups have different probabilities of being transported.

3. **Travel Alone** (Int): This binary feature indicates whether a passenger is traveling alone: 1 if the passenger is in a group of size 1 and 0 otherwise. At this stage it is stored in `fe_is_alone` as the labels `alone` / `not_alone`.

4. **Name** and **Surname** (String): Two different columns where we split the name and surname of the passenger if possible.

5. **Cabin Deck**, **Cabin Number**, and **Cabin Letter** (String, Int, String): Recognizing that the `Cabin` field combines Deck, Number, and Letter, we will split this format to investigate potential relationships between these individual components and our target variable.
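
The features listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of `feature_engineering_it_1`: the function name `fe_iteration_1_sketch` is hypothetical, the column names are copied from the output shown in this notebook, and the surname is assumed to be the last whitespace-separated token of `Name`.

```python
import numpy as np
import pandas as pd


def fe_iteration_1_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the iteration-1 features (assumption, not the author's code)."""
    out = df.copy()

    # PassengerId has the form gggg_pp: group number, then position within the group.
    out["fe_group_id"] = out["PassengerId"].str.split("_").str[0].astype(int)

    # Group size = number of passengers sharing the same group id in this dataframe.
    out["fe_group_size"] = out.groupby("fe_group_id")["PassengerId"].transform("count")

    # Label passengers whose group has a single member.
    out["fe_is_alone"] = np.where(out["fe_group_size"] == 1, "alone", "not_alone")

    # Surname: last token of the name (stays missing when Name is missing).
    out["Surname"] = out["Name"].str.split().str[-1]

    # Cabin has the form deck/number/letter; missing cabins stay missing.
    cabin_parts = out["Cabin"].str.split("/")
    out["fe_cabin_deck"] = cabin_parts.str[0]
    out["fe_cabin_number"] = pd.to_numeric(cabin_parts.str[1], errors="coerce")
    out["fe_cabin_letter"] = cabin_parts.str[2]

    return out


sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"],
    "Name": ["Maham Ofracculy", "Altark Susent", "Solam Susent", np.nan],
    "Cabin": ["B/0/P", "A/0/S", "A/0/S", np.nan],
})
feats = fe_iteration_1_sketch(sample)
print(feats)
```

Because the group size is counted within whichever dataframe is passed in, applying this separately to train and test would undercount groups that span both files; whether the original helper accounts for that is not visible here.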

In [5]:
print('Train dataset after generating new features')

df_train = feature_engineering_it_1(df_train)
df_test = feature_engineering_it_1(df_test)

df_train

Train dataset after generating new features


|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | fe_group_id | fe_group_size | fe_is_alone | Surname | fe_cabin_deck | fe_cabin_number | fe_cabin_letter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | europa | False | B/0/P | trappist_1e | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | 0 | 1 | 1 | alone | Ofracculy | B | 0 | P |
| 1 | 0002_01 | earth | False | F/0/S | trappist_1e | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | 1 | 2 | 1 | alone | Vines | F | 0 | S |
| 2 | 0003_01 | europa | False | A/0/S | trappist_1e | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | 0 | 3 | 2 | not_alone | Susent | A | 0 | S |
| 3 | 0003_02 | europa | False | A/0/S | trappist_1e | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | 0 | 3 | 2 | not_alone | Susent | A | 0 | S |
| 4 | 0004_01 | earth | False | F/1/S | trappist_1e | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | 1 | 4 | 1 | alone | Santantines | F | 1 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276_01 | europa | False | A/98/P | 55_cancri_e | 41.0 | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther | 0 | 9276 | 1 | alone | Noxnuther | A | 98 | P |
| 8689 | 9278_01 | earth | True | G/1499/S | pso_j318_5_22 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley | 0 | 9278 | 1 | alone | Mondalley | G | 1499 | S |
| 8690 | 9279_01 | earth | False | G/1500/S | trappist_1e | 26.0 | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon | 1 | 9279 | 1 | alone | Connon | G | 1500 | S |
| 8691 | 9280_01 | europa | False | E/608/S | 55_cancri_e | 32.0 | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre | 0 | 9280 | 2 | not_alone | Hontichre | E | 608 | S |


## Handling Missing Values

One effective approach to address missing values is to combine the train and test datasets into a single dataset while excluding the target variable from the training data. This consolidation allows for a more organized and systematic analysis of missing values.

Upon inspecting the dataset, it becomes evident that missing values account for approximately 2% of the data, which represents a relatively modest proportion.
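
A per-column missing-value summary of this kind can be computed with a few lines of pandas. The sketch below is an assumption about how such a table might be produced: `missing_summary` is a hypothetical helper (not `basic_eda_dataset`), and the toy dataframe is invented for illustration; only the two output column names are borrowed from the table printed by this notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing-value count and percentage per column."""
    counts = df.isna().sum()
    percentages = (counts / len(df) * 100).round(1)
    return pd.DataFrame({
        "missing_values_percentage": percentages,
        "missing_values_count": counts,
    })


# Toy data, invented for illustration only.
demo = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"],
    "HomePlanet": ["europa", np.nan, "europa", "europa", np.nan],
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0],
})
summary = missing_summary(demo)
print(summary)
```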

To handle missing data, a straightforward method involves using the median for continuous features and the mode for categorical features. Although this approach can produce acceptable results, achieving optimal model accuracy requires a more nuanced strategy that includes a thorough examination of missing data patterns.
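
For reference, the straightforward median/mode baseline described above could look like the sketch below. `simple_impute` is a hypothetical name and the toy data is invented; this is the baseline only, not the more nuanced strategy this notebook goes on to apply.

```python
import numpy as np
import pandas as pd


def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline imputation: median for numeric columns, mode for the rest."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "HomePlanet": ["earth", np.nan, "earth", "mars"],
})
imputed = simple_impute(toy)
print(imputed)
```

When several values tie for the mode, `mode()` returns them sorted and this sketch simply takes the first one.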

In [6]:
# Extract the target and create a single dataset with the data
y = df_train['Transported'].copy().astype(int)
X = df_train.drop('Transported', axis=1).copy()

# We'll split the data again later
data = pd.concat([X, df_test], axis=0).reset_index(drop=True)

# Review missing values in total
basic_eda_dataset(data, 'all_data')  # Comment this out to skip printing the main stats of the dataset

all_data shape: (12970, 19)
--------------------------------------------------
all_data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PassengerId      12970 non-null  object 
 1   HomePlanet       12682 non-null  object 
 2   CryoSleep        12660 non-null  object 
 3   Cabin            12671 non-null  object 
 4   Destination      12696 non-null  object 
 5   Age              12700 non-null  float64
 6   RoomService      12707 non-null  float64
 7   FoodCourt        12681 non-null  float64
 8   ShoppingMall     12664 non-null  float64
 9   Spa              12686 non-null  float64
 10  VRDeck           12702 non-null  float64
 11  Name             12970 non-null  object 
 12  fe_group_id      12970 non-null  int64  
 13  fe_group_size    12970 non-null  int64  
 14  fe_is_alone      12970 non-null  object 
 15  Surname   

|   | missing_values_percentage | missing_values_count |
|---|---|---|
| PassengerId | 0.0 | 0 |
| HomePlanet | 2.2 | 288 |
| CryoSleep | 2.4 | 310 |
| Cabin | 2.3 | 299 |
| Destination | 2.1 | 274 |
| Age | 2.1 | 270 |
| RoomService | 2.0 | 263 |
| FoodCourt | 2.2 | 289 |
| ShoppingMall | 2.4 | 306 |
| Spa | 2.2 | 284 |


all_data describe:
                Age   RoomService     FoodCourt  ShoppingMall           Spa  \
count  12700.000000  12707.000000  12681.000000  12664.000000  12686.000000   
mean      28.771969    222.897852    451.961675    174.906033    308.476904   
std       14.387261    647.596664   1584.370747    590.558690   1130.279641   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%       19.000000      0.000000      0.000000      0.000000      0.000000   
50%       27.000000      0.000000      0.000000      0.000000      0.000000   
75%       38.000000     49.000000     77.000000     29.000000     57.000000   
max       79.000000  14327.000000  29813.000000  23492.000000  22408.000000   

             VRDeck   fe_group_id  fe_group_size  
count  12702.000000  12970.000000   12970.000000  
mean     306.789482   4635.337471       2.022976  
std     1180.097223   2685.904299       1.577102  
min        0.000000      1.000000       1.000000  
25%        0.0

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | fe_group_id | fe_group_size | fe_is_alone | Surname | fe_cabin_deck | fe_cabin_number | fe_cabin_letter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | europa | False | B/0/P | trappist_1e | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | 1 | 1 | alone | Ofracculy | B | 0 | P |
| 1 | 0002_01 | earth | False | F/0/S | trappist_1e | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | 2 | 1 | alone | Vines | F | 0 | S |


### Expectation-Maximization (EM) algorithm

One common technique for handling missing data using joint distributions is the Expectation-Maximization (EM) algorithm. The EM algorithm estimates missing values by iteratively maximizing the likelihood function of the observed and missing data jointly.

It assumes a joint distribution for the complete data and iteratively refines its estimates of missing values until convergence.
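
The two alternating steps can be written compactly. The notation here is generic and is an assumption of this sketch rather than something the notebook defines: $X_{\text{obs}}$ denotes the observed entries, $X_{\text{mis}}$ the missing ones, and $\theta$ the parameters of the assumed joint distribution.

$$
\begin{aligned}
\text{E-step:}\quad & Q\left(\theta \mid \theta^{(t)}\right) = \mathbb{E}_{X_{\text{mis}} \mid X_{\text{obs}},\, \theta^{(t)}}\left[\log p\left(X_{\text{obs}}, X_{\text{mis}} \mid \theta\right)\right] \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta}\; Q\left(\theta \mid \theta^{(t)}\right)
\end{aligned}
$$

After convergence to some $\hat{\theta}$, a missing entry is typically imputed from the conditional distribution $p\left(X_{\text{mis}} \mid X_{\text{obs}}, \hat{\theta}\right)$, for example with its conditional mean.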

In the context of missing data, a joint distribution is often used to model the relationships between variables that are observed (non-missing) and those that are missing. It allows statisticians and data scientists to represent the joint probability distribution of both observed and missing data, which can be valuable for various tasks, such as imputing missing values or understanding the dependencies between variables.

Given the numerous potential combinations, this summary will highlight valuable trends identified by me and other Kagglers.

Here are the main Joint Distributions:
- Group and HomePlanet

In [7]:
data_clean = missing_values_homeplanet(data)

Filling HomePlanet feature
----------------------------------------------------------------------
Missing values in HomePlanet before preprocessing: 288
++EDA JOINT: HomePlanet <> GroupID
Min planets per group: 1
Max planets per group: 1
Conclusion: Everyone in the same group has the same HomePlanet!


++EDA JOINT: HomePlanet <> GroupID


| fe_cabin_deck \ HomePlanet | earth | europa | mars |
|---|---|---|---|
| A | 0.0 | 1.0 | 0.0 |
| B | 0.0 | 1.0 | 0.0 |
| C | 0.0 | 1.0 | 0.0 |
| D | 0.0 | 1.0 | 1.0 |
| E | 1.0 | 1.0 | 1.0 |
| F | 1.0 | 0.0 | 1.0 |
| G | 1.0 | 0.0 | 0.0 |
| T | 0.0 | 1.0 | 0.0 |
| Unkown | 1.0 | 1.0 | 1.0 |


Conclusions: 
-All passengers from Deck A-B-C-T have as HomePlanet Europa
-All passengers from G have as a HomePlanet Earth
-All passengers from Deck D-E-F came from multiple planets


Filling HomePlanet feature
----------------------------------------------------------------------
Missing values in HomePlanet after preprocessing: 157
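
The two conclusions printed above (one HomePlanet per group, and decks A, B, C, T and G each mapping to a single planet) suggest a simple rule-based fill. The sketch below is one possible way to apply those rules; it is an assumption and not the code of `missing_values_homeplanet`. The names `DECK_TO_PLANET` and `fill_homeplanet_sketch` are hypothetical, and the toy data is invented.

```python
import numpy as np
import pandas as pd

# Decks that the printed conclusions tie to a single planet.
DECK_TO_PLANET = {"A": "europa", "B": "europa", "C": "europa", "T": "europa", "G": "earth"}


def fill_homeplanet_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical rule-based fill for HomePlanet (assumption, not the author's code)."""
    out = df.copy()

    # Rule 1: everyone in a group shares one HomePlanet, so borrow the
    # first known value within the group ("first" skips missing values).
    group_planet = out.groupby("fe_group_id")["HomePlanet"].transform("first")
    out["HomePlanet"] = out["HomePlanet"].fillna(group_planet)

    # Rule 2: for single-planet decks, infer the planet from the deck.
    deck_planet = out["fe_cabin_deck"].map(DECK_TO_PLANET)
    out["HomePlanet"] = out["HomePlanet"].fillna(deck_planet)

    return out


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "fe_group_id": [1, 1, 2, 3, 4],
    "HomePlanet": ["europa", np.nan, np.nan, np.nan, "mars"],
    "fe_cabin_deck": ["B", "B", "G", "F", "F"],
})
filled = fill_homeplanet_sketch(toy)
print(filled)
```

Passengers on mixed decks with no known group member stay missing under these two rules, which is consistent with some missing values remaining after this step.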


# Iteration 2

## Feature Engineering

Now that we have completed the initial dataset cleaning, it's time to create new features using the information at our disposal. As observed in our previous Exploratory Data Analysis (EDA), we will introduce the following features:

1. **Age Group (String):** We will categorize passengers into age ranges based on insights gained from our distribution analysis: child (0-12), teenager (12-17), young adult (18-25), adult (26-30), middle-aged (31-50), and senior (50+).

2. **Total Billing (Float):** This feature represents the total amount of money spent during the trip, calculated as the sum of expenses from the 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', and 'RoomService' columns.

3. **Has Spent (Int):** This binary feature indicates whether a passenger has made any expenditures during the trip. It will be 1 if the passenger has spent money and 0 otherwise.



5. **Group ID (Int):** Extracted from the passenger ID.

6. **Group Size (Int):** Derived from the passenger ID. We aim to explore whether passengers traveling in larger groups have different probabilities of being transported.

7. **Travel Alone (Int):** This binary feature indicates whether a passenger is traveling alone. It will be 1 if the passenger is in a group of size 1 and 0 otherwise.
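
The Age Group, Total Billing, and Has Spent features can be sketched as below. This is a minimal illustration under stated assumptions: the function name and the `fe_age_group`, `fe_total_billing`, and `fe_has_spent` column names are hypothetical, and because the listed age ranges share boundary values, the sketch assumes that age 12 counts as child and age 50 as middle-aged.

```python
import numpy as np
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Right-inclusive bin edges (assumption): <=12, 13-17, 18-25, 26-30, 31-50, >50.
AGE_BINS = [-np.inf, 12, 17, 25, 30, 50, np.inf]
AGE_LABELS = ["child", "teenager", "young_adult", "adult", "middle_aged", "senior"]


def fe_iteration_2_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the iteration-2 features (assumption, not the author's code)."""
    out = df.copy()

    # Age group: missing ages stay missing.
    out["fe_age_group"] = pd.cut(out["Age"], bins=AGE_BINS, labels=AGE_LABELS)

    # Total billing: row-wise sum; missing amounts are skipped (treated as 0).
    out["fe_total_billing"] = out[SPEND_COLS].sum(axis=1)

    # Has spent: 1 if any money was spent, 0 otherwise.
    out["fe_has_spent"] = (out["fe_total_billing"] > 0).astype(int)

    return out


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [39.0, 16.0, 58.0, np.nan],
    "RoomService": [0.0, 303.0, 43.0, 0.0],
    "FoodCourt": [0.0, 70.0, np.nan, 0.0],
    "ShoppingMall": [0.0, 151.0, 0.0, 0.0],
    "Spa": [0.0, 565.0, 6715.0, 0.0],
    "VRDeck": [0.0, 2.0, 49.0, 0.0],
})
feats2 = fe_iteration_2_sketch(toy)
print(feats2[["fe_age_group", "fe_total_billing", "fe_has_spent"]])
```

Treating a missing expense as 0 is a design choice of this sketch; if the spending columns are imputed first, the sum would use the imputed values instead.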