# About this notebook

This notebook is the first part of a larger collection of notebooks that try to solve the Kaggle competition Spaceship Titanic. The goal of this competition is to predict which passengers were transported by the anomaly using records recovered from the spaceship's damaged computer system.

The competition can be found at the following link: https://www.kaggle.com/c/spaceship-titanic/overview

The collection of notebooks is structured as follows:
- Notebook 1: Problem Exploration
- Notebook 2: Data Wrangling + Exploratory Data Analysis
- Notebook 3: Model Selection and Evaluation

In particular, once the data is understood, this notebook applies a cleaning and transformation process to the datasets and generates new features in the feature engineering stage. In summary, it covers the following:

- Data transformation: dealing with missing values, converting columns to other types, etc.
- Feature Generation
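
As a rough illustration of the two stages listed above, here is a minimal sketch on a tiny, made-up sample. The imputation choices (median for `Age`, zero for `RoomService`) and the new column names (`CabinDeck`, `CabinNum`, `CabinSide`, `Group`, `GroupSize`) are assumptions for the example only, not necessarily the choices made later in this collection. It relies on the competition's documented formats: `Cabin` is `deck/num/side` and `PassengerId` is `gggg_pp`, where `gggg` identifies a travel group.

```python
import pandas as pd

# Tiny hypothetical sample that mimics a few Spaceship Titanic columns
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
    "Age": [39.0, None, 33.0],
    "RoomService": [0.0, 43.0, None],
})

# Data transformation: fill missing numeric values
# (median / zero are illustrative assumptions, not a recommendation)
sample["Age"] = sample["Age"].fillna(sample["Age"].median())
sample["RoomService"] = sample["RoomService"].fillna(0.0)

# Feature generation: split the cabin string into deck / number / side
cabin_parts = sample["Cabin"].str.split("/", expand=True)
cabin_parts.columns = ["CabinDeck", "CabinNum", "CabinSide"]
sample = pd.concat([sample, cabin_parts], axis=1)

# Feature generation: passengers sharing an id prefix belong to the same group
sample["Group"] = sample["PassengerId"].str.split("_").str[0]
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Rows with a missing `Cabin` simply get missing values in the three derived columns, so they can be handled by whatever imputation strategy is chosen later.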

That being said, let's get started! 🤘


---

# Project summary

So, we could summarize the project as follows:


- Total number of passengers: 13K passengers (14 columns)
- **Goal**: To predict which passengers were transported by the anomaly
- **Target column**: Transported (False = No, True = Yes)
- Evaluation: Accuracy (percentage of passengers correctly predicted)
- Type of problem: Binary Classification
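
To make the evaluation metric concrete, here is a small sketch of how accuracy can be computed for boolean predictions. The `accuracy` helper and the example labels are hypothetical and only illustrate the definition (share of passengers predicted correctly).

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float((y_true == y_pred).mean())

# Made-up labels: 3 of the 4 predictions are correct
y_true = [True, False, True, True]
y_pred = [True, False, False, True]
print(accuracy(y_true, y_pred))  # 0.75
```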

It is important to note that, although I will approach this problem as a binary classification, I will try to use different models (logistic regression, random forest, neural networks, etc.) and different feature engineering techniques.

The goal is to learn and practice different techniques and to compare the results obtained with each of them.

In [17]:
# Libraries
import os, sys

# Add root folder to path if not already there
root_folder = os.path.dirname(os.getcwd())
if root_folder not in sys.path: sys.path.append(root_folder)

# Data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
new_palette = sns.color_palette("Set2")  # You can replace "Set2" with any other palette name
sns.set_palette(new_palette)
# Own libraries (EDA)
from lib_eda.dataset_analysis import basic_eda_dataset
from lib_eda.target_analysis import basic_eda_target
from lib_eda.features_num import basic_eda_num

path_kaggle_data = r'D:\Kaggle Data'
path_competition = 'spaceship-titanic'
path_data = path_kaggle_data + '/' + path_competition

In [18]:
# Load main datasets
train = pd.read_csv(path_data + '/' + 'train.csv')
test = pd.read_csv(path_data + '/' + 'test.csv')
sample_submission = pd.read_csv(path_data + '/' + 'sample_submission.csv')

submission_ids = test['PassengerId'] # Extract the ids for a future submission

# Get target
TARGET = train['Transported']

# Basic pre-processing

The initial phase of data wrangling makes the data easier to understand through tasks such as renaming columns and changing data types. It also removes features that do not provide substantial value.

Once this stage is completed, we can proceed with more in-depth data preprocessing.

In [19]:
train.head(1).T

Unnamed: 0,0
PassengerId,0001_01
HomePlanet,Europa
CryoSleep,False
Cabin,B/0/P
Destination,TRAPPIST-1e
Age,39.0
VIP,False
RoomService,0.0
FoodCourt,0.0
ShoppingMall,0.0


In [20]:
def camel_to_snake(name):
    name = "".join(['_' + i.lower() if i.isupper() else i for i in name]).lstrip('_')
    return name

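def preproc_basic(dataset):
    # Remove useless columns (insights from the previous EDA).
    # drop() returns a new DataFrame, so the input is not modified
    # and the function can safely be re-run on the same data.
    cols_remove = ['VIP']
    dataset = dataset.drop(columns=cols_remove)

    # Rename columns to snake_case; 'VRDeck' becomes 'v_r_deck', so fix it by hand
    dataset.columns = [camel_to_snake(column) for column in dataset.columns]
    dataset = dataset.rename(columns={'v_r_deck': 'vr_deck'})

    return dataset

df_train = preproc_basic(train)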
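
The sketch below shows what `camel_to_snake` returns for a few of the dataset's column names, and why `v_r_deck` needs a manual rename. It also includes a regex-based variant, `camel_to_snake_regex`, which is a hypothetical alternative (not part of the original notebook) that keeps runs of capitals such as `VR` together.

```python
import re

def camel_to_snake(name):
    # Same logic as the helper above: prefix every capital with '_' and lowercase it
    return "".join("_" + c.lower() if c.isupper() else c for c in name).lstrip("_")

def camel_to_snake_regex(name):
    # Hypothetical alternative: insert '_' at lower->upper boundaries and
    # before the last capital of an acronym that is followed by a lowercase letter
    boundary = r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    return re.sub(boundary, "_", name).lower()

for col in ["PassengerId", "HomePlanet", "VRDeck"]:
    print(col, "->", camel_to_snake(col), "|", camel_to_snake_regex(col))
```

With the regex variant, `VRDeck` maps directly to `vr_deck`, so the extra `rename` step would no longer be needed. Whether that is worth the less readable pattern is a judgment call.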

In [22]:
df_train

Unnamed: 0,passenger_id,home_planet,cryo_sleep,cabin,destination,age,room_service,food_court,shopping_mall,spa,vr_deck,name,transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False
