## Data processing
The competition provides three files: 
* `train.csv`: personal records for about two-thirds (~8700) of the passengers, to be used as training data
* `test.csv`: personal records for the remaining one-third (~4300) of the passengers, to be used as test data
* `sample_submission`: a submission file in the correct format

All three files are located in the `data/` directory.

Our first task is to load and preprocess the data so that it can be fed into our neural network for training. As we will see, much of the data is non-numeric.
We are going to encode every column that contains non-numeric data, and then do some feature engineering to (potentially) improve the model's performance.

### First things first
Import the libraries. Make sure you have them installed (check the instructions in `README.md`).

In [None]:
import os
import re
import string
import random
from statistics import mean
from collections import Counter
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Load training and test data
base_file_path = "./data/"
test_data = pd.read_csv(base_file_path+'test.csv')
training_data = pd.read_csv(base_file_path+'train.csv')
print(training_data.columns)

pd.set_option('future.no_silent_downcasting', True)
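
To see which columns need encoding, one quick check is to list the non-numeric columns and count the missing values in each. The sketch below uses a small hypothetical frame (`toy`) in place of `training_data`; the column names mirror the competition data, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for training_data (values are made up)
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", np.nan],
    "Cabin": ["B/0/P", np.nan, "F/1/S"],
    "Age": [39.0, 24.0, np.nan],
    "Transported": [False, True, True],
})

# Columns that are neither numeric nor boolean still need encoding
non_numeric_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Number of missing values per column
missing_counts = toy.isna().sum()

print(non_numeric_cols)
print(missing_counts)
```

Running the same two calls on `training_data` should show which columns the following sections have to deal with.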

### Turning planets into numbers
We will turn the `HomePlanet` and `Destination` columns into numeric values, with one integer code per planet.

In [None]:
# Drop the "Name" column - we won't be using that data to train the model
training_data = training_data.drop('Name', axis=1)

# Collect the distinct values of the 'HomePlanet' and 'Destination' columns,
# dropping missing values explicitly rather than assuming NaN comes last
home_planets = training_data['HomePlanet'].dropna().unique().tolist()
destinations = training_data['Destination'].dropna().unique().tolist()

# Build a {name: integer code} dictionary for each column
home_planets_map = {k: v for v, k in enumerate(home_planets)}
destinations_map = {k: v for v, k in enumerate(destinations)}

# Replace each name with its integer code (missing values are left as NaN)
training_data['HomePlanet'] = training_data['HomePlanet'].replace(home_planets_map)
training_data['Destination'] = training_data['Destination'].replace(destinations_map)
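
The test data will eventually need the same encoding, and it should reuse the codes learned from the training data rather than build new ones. A minimal sketch of that idea, using made-up series in place of the real columns; the `-1` placeholder for missing or unseen values is an assumption, not something the notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the training and test columns
train_col = pd.Series(["Europa", "Earth", "Mars", np.nan, "Earth"])
test_col = pd.Series(["Mars", np.nan, "Earth"])

# Build the mapping from the training data only, ignoring missing values
planet_map = {planet: code for code, planet in enumerate(train_col.dropna().unique())}

# Series.map returns NaN for anything not in the mapping;
# here those entries get the (assumed) placeholder code -1
train_codes = train_col.map(planet_map).fillna(-1).astype(int)
test_codes = test_col.map(planet_map).fillna(-1).astype(int)

print(planet_map)
print(test_codes.tolist())
```

Keep in mind that integer codes imply an ordering between planets that does not really exist; one-hot encoding is a common alternative if that turns out to hurt the model.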

### Cabin IDs are a problem
Cabin IDs are structured as follows: "letter/number/letter" (e.g. `B/0/P`).

We need to:
- encode them
- replace missing values in a non-random way

The number in the middle tends to increase through the dataset, though not monotonically, so I decided to replace each missing value with the absolute value of a random integer drawn from the range $[\text{prev valid value} - 10, \text{prev valid value}]$.

For the letters, it's clear that they repeat, so I decided to create probability mappings and replace missing letters based on those.

In [None]:

# Get the 'Cabin' column, whose values are formatted as 'letter/number/letter'
cabins = training_data.get('Cabin')

# Separate 'Cabin' into 'Cabin_N' with N \in {1,2,3}
pattern = re.compile(r'^([a-zA-Z])/(\d+)/([a-zA-Z])$')

parsed_cabins = [
    (matches.group(1), int(matches.group(2)), matches.group(3)) if (matches := pattern.match(str(cabin)))
    else ('NA', -1, 'NA')
    for cabin in cabins
]

# Transpose the parsed_cabins to separate lists for cabin_1, cabin_2, and cabin_3
cabin_1, cabin_2, cabin_3 = map(list, zip(*parsed_cabins))

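# Replace each missing number with the absolute value of a random integer in
# [previous valid value - 10, previous valid value], just because it seems to work.
# Start from the first valid number (or 0) so a missing first cabin does not raise a TypeError
prev_valid_val = next((x for x in cabin_2 if x != -1), 0)
cabin_2 = [abs(random.randint(prev_valid_val-10, prev_valid_val)) if x == -1 else (prev_valid_val := x) for x in cabin_2]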


# Methods to compute probabilities for letter in the cabin IDs
def list_to_probability_mapping(value_list):
    value_counts = Counter(value_list)
    total_count = len(value_list)

    probability_mapping = {value: count / total_count for value, count in value_counts.items()}

    return probability_mapping

def get_index_from_probability_mapping(probability_mapping):
    indices = list(range(len(probability_mapping)))
    probabilities = list(probability_mapping.values())
    return np.random.choice(indices, p=probabilities)

def get_mean_val_randomized(dictionary):
    return int(np.random.normal(loc=mean(dictionary), scale=2))

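# Create probability mappings from the known letters only (skip the 'NA' placeholder)
cabin_1_map = list_to_probability_mapping([x for x in cabin_1 if x != 'NA'])
cabin_3_map = list_to_probability_mapping([x for x in cabin_3 if x != 'NA'])

# Replace missing letters with a letter sampled according to its observed frequency
# (the sampled index is looked up in the mapping's keys so the column stays all letters)
cabin_1 = [list(cabin_1_map)[get_index_from_probability_mapping(cabin_1_map)] if x == 'NA' else x for x in cabin_1]
cabin_3 = [list(cabin_3_map)[get_index_from_probability_mapping(cabin_3_map)] if x == 'NA' else x for x in cabin_3]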

# Replace the 'Cabin' column with 'Cabin_ID_1', 'Cabin_ID_2' and 'Cabin_ID_3'
training_data = training_data.drop('Cabin', axis=1)
training_data['Cabin_ID_1'] = cabin_1
training_data['Cabin_ID_2'] = cabin_2
training_data['Cabin_ID_3'] = cabin_3
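
The split into three parts can also be done with pandas string methods, which keeps missing cabins as `NaN` so that imputation can be handled as a separate step. This is only a sketch on made-up cabin values, not a drop-in replacement for the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including one missing entry
cabins = pd.Series(["B/0/P", "F/12/S", np.nan, "G/3/P"])

# Split on '/' into three columns; a missing cabin stays missing in every part
parts = cabins.str.split("/", expand=True)
parts.columns = ["Cabin_ID_1", "Cabin_ID_2", "Cabin_ID_3"]

# Convert the middle part to a number (NaN where the cabin was missing)
parts["Cabin_ID_2"] = pd.to_numeric(parts["Cabin_ID_2"])

print(parts)
```

Unlike the regex version, this does not validate the format, so a malformed ID would pass through silently or fail in `pd.to_numeric`.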

### Turning booleans into numbers is easier
The "CryoSleep", "VIP" and "Transported" columns can be turned into numbers trivially.

In [None]:
# Note: missing values compare unequal to True, so they are encoded as 0
training_data['CryoSleep'] = np.where(training_data['CryoSleep'] == True, 1, 0)
training_data['VIP'] = np.where(training_data['VIP'] == True, 1, 0)
training_data['Transported'] = np.where(training_data['Transported'] == True, 1, 0)
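
Because a missing value ends up as 0, the encoded column cannot tell "False" apart from "unknown". If that distinction matters, one option is to keep a separate indicator column. A small sketch on a made-up series (the name `cryo_missing` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing entry
cryo = pd.Series([True, False, np.nan, True], dtype=object)

# Same result as np.where(col == True, 1, 0): NaN compares unequal to True
cryo_encoded = (cryo == True).astype(int)

# Optional extra feature recording which rows were missing
cryo_missing = cryo.isna().astype(int)

print(cryo_encoded.tolist())
print(cryo_missing.tolist())
```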

In [None]:
# Split features and target by column name ("Transported" is no longer the last
# column once the Cabin_ID_* columns have been appended)
X = training_data.drop('Transported', axis=1)
y = training_data['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Save processed training data in a new CSV
training_data.to_csv(base_file_path+'train_processed.csv', index=False)

print(training_data)
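
`StandardScaler` is imported above, and neural networks usually train better on standardized inputs. A hedged sketch of how the split and the scaling could fit together, on random made-up data: the scaler is fitted on the training split only and then reused on the held-out split, and `stratify=y` (an addition, not used in the cell above) keeps the class balance the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features and binary labels
rng = np.random.default_rng(123)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.array([0, 1] * 50)

# stratify=y keeps the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the held-out split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting on the training split alone avoids leaking statistics from the held-out rows into training.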

### TODO
- Replacing cabin_1 and cabin_3 with numerical data
- Turning Passenger IDs into numbers
- Group non-essential and essential spending categories together
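
As a starting point for the first two items, here is a rough sketch on made-up rows. It assumes that `PassengerId` follows a `gggg_pp` pattern (group number, then position within the group); the column names `Group` and `GroupMember` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the 'gggg_pp' PassengerId format is an assumption
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin_ID_1": ["B", "F", "B"],
})

# Cabin letters -> integer codes, using sorted order so the mapping is reproducible
deck_map = {letter: code for code, letter in enumerate(sorted(toy["Cabin_ID_1"].unique()))}
toy["Cabin_ID_1"] = toy["Cabin_ID_1"].map(deck_map)

# PassengerId -> two integer columns (hypothetical names)
id_parts = toy["PassengerId"].str.split("_", expand=True).astype(int)
toy["Group"] = id_parts[0]
toy["GroupMember"] = id_parts[1]

print(toy)
```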