# Spaceship Titanic Kaggle Work

- This work is my own. I did not reference any existing guides for this Kaggle competition to get my score.
- Kaggle score for this work: 0.80336 (position 550/2543)

In [None]:
import os
os.environ['HSA_OVERRIDE_GFX_VERSION'] = '10.3.0'

# basic modules
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.display import display
import matplotlib.pyplot as plt

from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, PowerTransformer, MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier

***Read Data***

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 50)

data_path = './data/'
train = 'train.csv'

train_file = os.path.join(data_path,train)

In [None]:
df = pd.read_csv(train_file)

## Begin Exploratory Data Analysis

In [None]:
print(df.info())

df.head()

## Basic Column Data Extraction

### PassengerId Column
**Engineer 'PassengerId' column**:
- The 'PassengerId' column is of the form: gggg_pp where gggg indicates the group they are traveling with and pp is their number within the group. By extracting these values we can create meaning from this column and hopefully use it to impute missing values later.

In [None]:
df['GroupId'] = df['PassengerId'].apply(lambda x: x.split("_")[0]).astype(int)
df['PersonId'] = df['PassengerId'].apply(lambda x: x.split("_")[1]).astype(int)
df.drop(columns=['PassengerId'], inplace=True)
df.head(2)
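
The same split can also be done in one vectorized pass. The sketch below is illustrative only: it uses a made-up toy frame (`toy`) rather than the competition data, and relies on `Series.str.split(..., expand=True)` to produce both parts at once.

```python
import pandas as pd

# Toy IDs in the gggg_pp format described above (values are made up).
toy = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02"]})

# Split once into two columns instead of applying a lambda per column.
parts = toy["PassengerId"].str.split("_", expand=True).astype(int)
toy["GroupId"] = parts[0]
toy["PersonId"] = parts[1]

print(toy)
```

Casting to `int` drops the zero padding, which is fine here because the values are only used as numeric identifiers.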

### 'Cabin' Column
**Engineer the 'Cabin' column:**
- The 'Cabin' column is of the form deck/num/side. We can engineer this feature to extract meaningful info for the column.

In [None]:
# Split 'deck/num/side' once; rows with a missing Cabin stay NaN in all three columns
df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split('/', expand=True)
# 'Num' is treated as a continuous feature later, so store it as a number
df['Num'] = pd.to_numeric(df['Num'])
df.drop(columns=['Cabin'],inplace=True)
df.head(2)

### Name Column
**First Name and Last Name extraction**
- By extracting the first and last names, we can hopefully use this data, the last names in particular, to impute missing features.

In [None]:
df['FirstName'] = df[df['Name'].notna()]['Name'].str.split(' ').apply(lambda x: x[0].strip())
df['LastName'] = df[df['Name'].notna()]['Name'].str.split(' ').apply(lambda x: x[1].strip())
df.drop(columns=['Name'], inplace = True)
df.head(2)

## Imputation: Missing Home Planets

**Missing home planet imputation**
- Imputing missing home planets might be relatively simple. We may only have to consider attributes shared within groups we are confident about, such as last names, group IDs and home planets.
- Before we attempt to impute missing home planets, we should have some confidence that the missing values are in fact MCAR (missing completely at random), so we will run a chi-squared test on the home planets.

In [None]:
contingency = pd.crosstab(df['HomePlanet'], df['Transported'], dropna=False)

print(contingency)

c, p, dof, expected = chi2_contingency(contingency)

print(c,p,dof)
print(expected)

Observation
- For the rows where the home planet is missing, the observed counts and the chi-squared expected counts are nearly the same, so there is no apparent association between a home planet being missing and the passenger being transported or not. This is consistent with the missing home planets being MCAR with respect to the target, although it does not prove it, so we will proceed with imputation of this feature.
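
A more direct way to probe missingness is to test an explicit "is missing" flag against the target. The sketch below is only an illustration under stated assumptions: the frame `toy` is synthetic, with about 5% of home planets blanked out at random, and the names `missing_flag` and `table` are hypothetical. A high p-value is consistent with MCAR relative to `Transported`, but it cannot rule out dependence on other columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the training frame (values are made up).
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=n),
    "Transported": rng.choice([True, False], size=n),
})
# Blank out roughly 5% of home planets completely at random.
toy.loc[rng.random(n) < 0.05, "HomePlanet"] = np.nan

# Rows: is HomePlanet missing? Columns: Transported.
missing_flag = toy["HomePlanet"].isna().rename("HomePlanetMissing")
table = pd.crosstab(missing_flag, toy["Transported"])

stat, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")
```

On the real data the same two lines (`isna()` flag plus `pd.crosstab`) would be applied to `df` instead of `toy`.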

### Impute missing home planets (High-Confidence) (Family members)
Missing home planets will be imputed for groups of people such that:
- They all have the same GroupId
- All have the same LastName
- All come from the same planet
- All are going to the same destination

The home planet imputed will be the group's home planet.

In [None]:
def get_groups_0(df) -> list:
  groups = []
  group_ids = df['GroupId'].unique().tolist()
  for group_id in group_ids:

    # get sub-dataframe based off of group id
    group_df = df[df['GroupId'] == group_id]

    has_missing_planet = group_df['HomePlanet'].isna().any()
    has_one_distinct_home = group_df['HomePlanet'].dropna().nunique() == 1
    has_one_distinct_destination = group_df['Destination'].dropna().nunique() == 1
    has_one_distinct_last_name = group_df['LastName'].dropna().nunique() == 1
    
    if (
        has_missing_planet and
        has_one_distinct_last_name and
        has_one_distinct_home and
        has_one_distinct_destination        
    ):
        groups.append(group_df)
      
  return groups

group_dfs = get_groups_0(df)

print(f"number of samples where home planet is missing:{df['HomePlanet'].isna().sum()}")
print(f"Number of groups: {len(group_dfs)}")

while group_dfs:
  group_df = group_dfs.pop()
  home_planets = group_df['HomePlanet'].dropna().unique()
  if len(home_planets) > 1:
    raise ValueError(home_planets)
  home_planet = home_planets[0]
  # fill only the rows in this group whose home planet is missing
  df.loc[group_df.index[group_df['HomePlanet'].isna()], 'HomePlanet'] = home_planet
  
print(f"number of samples where home planet is missing:{df['HomePlanet'].isna().sum()}")
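
The high-confidence rule can also be written without an explicit Python loop. This is a minimal sketch on a made-up frame (`toy`), not the notebook's own method: it uses `groupby(...).transform("nunique")` to flag eligible groups and `transform("first")` to broadcast each group's first non-missing home planet.

```python
import numpy as np
import pandas as pd

# Toy data: group 1 qualifies, group 2 has no gap, group 3 has two last names.
toy = pd.DataFrame({
    "GroupId":     [1, 1, 2, 2, 3, 3],
    "LastName":    ["Smith", "Smith", "Jones", "Jones", "Brown", "Green"],
    "HomePlanet":  ["Europa", np.nan, "Earth", "Mars", "Earth", np.nan],
    "Destination": ["TRAPPIST-1e"] * 6,
})

g = toy.groupby("GroupId")

# nunique ignores NaN, so each check means "exactly one known value in the group".
eligible = (
    (g["LastName"].transform("nunique") == 1)
    & (g["HomePlanet"].transform("nunique") == 1)
    & (g["Destination"].transform("nunique") == 1)
)

# First non-missing home planet of each group, repeated for every member.
group_planet = g["HomePlanet"].transform("first")

# Only touch rows that are both missing and in an eligible group.
fill_mask = toy["HomePlanet"].isna() & eligible
toy.loc[fill_mask, "HomePlanet"] = group_planet[fill_mask]

print(toy)
```

Only the second member of group 1 is filled; the gap in group 3 is left alone because its members do not share a last name.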

### Impute missing home planets (Medium Confidence) (Family members from same planet)
**Impute home planets for:**
  - Groups where the GroupId values are all the same
  - Groups where the LastName values are all the same
  - Groups with only one distinct home planet among their members
  
Update missing home planets with the group's single distinct non-NA home planet.

In [None]:
def get_groups_1(df):
    groups =[]
    group_ids = df['GroupId'].unique().tolist()
    for group_id in group_ids:
        group_df = df[df['GroupId'] == group_id]
        
        missing_home_planet = group_df['HomePlanet'].isna().any()
        one_distinct_last_name = group_df['LastName'].dropna().nunique() == 1
        one_distinct_home_planet = group_df['HomePlanet'].dropna().nunique() == 1
        
        if (
            missing_home_planet and
            one_distinct_last_name and
            one_distinct_home_planet
        ):
            groups.append(group_df)
            
    return groups

group_dfs = get_groups_1(df)

print(f"number of samples with missing home planet: {df['HomePlanet'].isna().sum()}")
print("groups:",len(group_dfs))

while group_dfs:
    group_df = group_dfs.pop()
    
    planets = group_df['HomePlanet'].dropna().unique().tolist()
    
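    if len(planets) != 1:
        raise ValueError(f"expected exactly one home planet, got {planets}")

    # fill only the rows in this group whose home planet is missing
    df.loc[group_df.index[group_df['HomePlanet'].isna()], 'HomePlanet'] = planets[0]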
    
print(f'samples remaining with missing home planets: {df["HomePlanet"].isna().sum()}')
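
One pandas detail is worth keeping in mind for this kind of group-wise fill: `Series.isna()` returns a boolean Series with the same index as the original, so taking `.index` on it yields every label, not just the missing ones. The toy Series below (made-up values, hypothetical names) shows the difference.

```python
import numpy as np
import pandas as pd

group = pd.Series(["Earth", np.nan, "Earth"], index=[10, 11, 12], name="HomePlanet")

# Index of the boolean result: every label in the group.
all_labels = group.isna().index

# Boolean-mask selection: only the labels whose value is NaN.
missing_labels = group.index[group.isna()]

print(list(all_labels))
print(list(missing_labels))
```

When every known value in the group is identical, writing to all labels gives the same result as writing to the missing ones only, but the masked form states the intent directly.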

## Imputation: Missing destination planets
- Same process as with missing home planets essentially

In [None]:
contingency = pd.crosstab(df['Destination'], df['Transported'], dropna=False)

print(contingency)

c, p, dof, expected = chi2_contingency(contingency)

print(c,p,dof)
print(expected)

Observation:
- Similar to the home planet feature: the missing destination planets appear to be MCAR because of the low association with the target. We will impute the missing values.

Action
- For groups sharing a GroupId such that:
    - There is at least one person in the group with a missing destination planet
    - All people in the group have the same last name (family)
    - All people in the group have the same home planet
    - All people in the group have the same destination (excluding the missing destinations)

Fill the missing destinations with the group's single known destination.

In [None]:
def get_group_2(df):
  groups = []
  group_ids = df['GroupId'].unique().tolist()
    
  for group_id in group_ids:
    
    group_df = df[df['GroupId'] == group_id]
    
    at_least_one_missing_destination_planet = group_df['Destination'].isna().any()
    only_one_distinct_destination_planet = group_df['Destination'].dropna().nunique() == 1    
    only_one_distinct_home_planet = group_df['HomePlanet'].dropna().nunique() == 1
    only_one_distinct_last_name = group_df['LastName'].dropna().nunique() == 1
    
    if (
        at_least_one_missing_destination_planet and
        only_one_distinct_home_planet and
        only_one_distinct_destination_planet and
        only_one_distinct_last_name
    ):
        groups.append(group_df)
    
  return groups

print('number of remaining samples where destination is missing:',df['Destination'].isna().sum())

groups = get_group_2(df)
print(f'number of groups found: {len(groups)}')

while groups:
    group_df = groups.pop()
    destination_planets = group_df['Destination'].dropna().unique().tolist()
    if len(destination_planets) != 1:
        raise ValueError(len(destination_planets))
    # fill only the rows in this group whose destination is missing
    df.loc[group_df.index[group_df['Destination'].isna()], 'Destination'] = destination_planets[0]
    
print('number of remaining samples where destination is missing after imputation:',df['Destination'].isna().sum())


## Feature Importance
To determine which categorical features to impute next, let's rank the importance of our features with respect to the target using a random forest classifier.

In [None]:
display(df.info())
display(df.head(2))

In [None]:
categorical_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']
continuous_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'GroupId', 'Num', 'PersonId']
target_feature = 'Transported'

# Drop incomplete rows so the fit also works on scikit-learn versions
# whose random forests do not accept NaN inputs.
temp_df = df[categorical_features + continuous_features + [target_feature]].dropna()

temp_df = pd.get_dummies(temp_df, columns=categorical_features)

X = temp_df.drop(columns=[target_feature])
y = temp_df[target_feature]

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X, y)

importance_df = pd.DataFrame({'Feature': rf_classifier.feature_names_in_, 'Importance': rf_classifier.feature_importances_})

# Sort the features by importance in descending order
importance_df = importance_df.sort_values('Importance', ascending=False)

# Plot the feature importance
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance (Random Forest) for Transported')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Observations**:
- Continuous features have the largest impact on the target.
- Unexpectedly, 'Num' and 'GroupId' have a large impact on the target.
- The categorical feature that has the largest impact on the target is CryoSleep.
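
Impurity-based importances from tree ensembles are known to favor numeric features with many distinct values, which may be part of why 'Num' and 'GroupId' rank so high. A cross-check is permutation importance on held-out rows. The sketch below is an assumption-laden illustration on synthetic data: `Signal` and `RowLikeId` are hypothetical columns, with `RowLikeId` acting as an ID-like feature that carries no real information.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "Signal": rng.normal(size=n),                    # drives the target
    "RowLikeId": rng.permutation(n).astype(float),   # unique per row, no signal
})
# Noisy binary target that depends only on Signal.
y = (toy["Signal"] + rng.normal(scale=0.5, size=n)) > 0

X_train, X_val, y_train, y_val = train_test_split(
    toy, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: drop in validation accuracy when a column is shuffled.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=toy.columns)
impurity = pd.Series(model.feature_importances_, index=toy.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": perm}))
```

If an ID-like column keeps a noticeable impurity score but a permutation score near zero, its ranking in the bar chart should be read with caution.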

## Cryosleep Analysis
- Let's try to glean information about cryosleep for imputation.

In [None]:
did_cryo = df[df['CryoSleep'] == True]
no_cryo = df[df['CryoSleep'] == False]

spending_money_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
display("DID CRYO")
display(did_cryo.head())
display(did_cryo[spending_money_cols].describe())
display('-----------')
display('NO CRYO')
display(no_cryo.head())
display(no_cryo[spending_money_cols].describe())

**Observations**
- We can see that people who did cryosleep never spent any money. This can help us impute missing cryosleep data and missing spending data.
- We can see that people who did cryosleep and were VIP were all from the Europa home planet. We can quickly impute this.

## Imputation: CryoSleep and Shopping Data
- We saw above that shopping data is a good proxy for determining whether someone did cryosleep. Let's begin by imputing shopping data.

***Imputation:***
- If someone did cryosleep, we know they didn't spend any money. So, for any sample where shopping data is missing, if the person did cryosleep, we will impute a 0.0.
- If cryosleep data is missing and we see the person spent money, we know they didn't do cryosleep. That is, if a person spent money and cryosleep is missing, we will fill in False.


In [None]:
spending_money_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

- If cryosleep is missing, and money was spent, that person did not do cryosleep, so impute cryosleep with false

In [None]:
# impute missing cryosleep columns with false (pretty safe)
mask = df[(df['CryoSleep'].isna()) & (df[spending_money_cols].sum(axis=1) > 0.0)].index
df.loc[mask, 'CryoSleep'] = False

- If someone did cryosleep and any spending columns are missing, those can be imputed with zero because the person was in cryosleep. This step is somewhat risky, so the cell below is left commented out.

In [None]:
# impute spending columns (somewhat risky)
# cond_1 = (df['CryoSleep'] == True)
# cond_2 = (df[spending_money_cols].sum(axis=1) == 0.0)
# cond_3 = (df[spending_money_cols].isna().any(axis=1))
# mask = df[cond_1 & cond_2 & cond_3].index
# df.loc[mask, spending_money_cols] = 0.0

In [None]:
# sanity check, this should have nothing
df[(df['CryoSleep'] == True) & (df[spending_money_cols].sum(axis=1) > 0.0)]
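
To make the two rules concrete, here is a small sketch on made-up rows. It assumes a shortened spending list (`spend_cols`) and a hypothetical frame `toy`; rule 2 fills only the missing cells of passengers known to be in cryosleep and leaves everything else untouched.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "Spa"]  # shortened list for the toy example
toy = pd.DataFrame({
    "CryoSleep":   [True, np.nan, np.nan, False],
    "RoomService": [np.nan, 120.0, 0.0, 30.0],
    "Spa":         [0.0, np.nan, 0.0, np.nan],
})

# Rule 1: CryoSleep missing and some spending recorded -> not in cryosleep.
spent_something = toy[spend_cols].sum(axis=1) > 0
toy.loc[toy["CryoSleep"].isna() & spent_something, "CryoSleep"] = False

# Rule 2: known cryosleepers -> their missing spending cells become 0.0.
in_cryo = toy["CryoSleep"].eq(True)
toy.loc[in_cryo, spend_cols] = toy.loc[in_cryo, spend_cols].fillna(0.0)

print(toy)
```

The third row stays undecided: with no recorded spending and no cryosleep flag, neither rule applies.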

### Feature Engineering: TotalSpent column
- Spending money appears to be a good proxy for certain information. We can create a simple continuous feature that tells how much money, if any, was spent.

- Total spent comes in two versions. The less strict version, used below, sums whatever spending values are present. The strict version, left commented out, only computes the total when every shopping column has a value.

In [None]:
spending_money_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# total spent - less strict version (sum skips NaN, so a missing column counts as 0)
df['TotalSpent'] = df[spending_money_cols].sum(axis=1)

# total spent strict version
# cond_1 = (df[spending_money_cols].notna().all(axis=1))
# mask = df.loc[cond_1].index
# df.loc[mask, 'TotalSpent'] = df.loc[mask][spending_money_cols].sum(axis=1)

# display(df.loc[mask])
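
The difference between the two versions can be seen on a two-row toy frame. As a sketch, the strict total can also be expressed with the `min_count` argument of `DataFrame.sum`, which returns NaN when fewer than `min_count` values are present in a row; the frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
toy = pd.DataFrame(
    [
        [10.0, 0.0, 5.0, 0.0, 1.0],      # complete row
        [10.0, np.nan, 5.0, 0.0, 1.0],   # one spending value missing
    ],
    columns=spend_cols,
)

# Less strict: NaN is skipped, so the missing value silently counts as 0.
lenient = toy[spend_cols].sum(axis=1)

# Strict: require all five values, otherwise the total is NaN.
strict = toy[spend_cols].sum(axis=1, min_count=len(spend_cols))

print(pd.DataFrame({"lenient": lenient, "strict": strict}))
```

The lenient total can understate spending for rows with gaps, while the strict total leaves those rows missing for a later imputation step.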

## Feature Importance Round II: Gradient Boosting
- We've imputed some data and engineered some new features. Let's check the new feature importances.

In [None]:
categorical_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']
continuous_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'GroupId', 'Num', 'PersonId', 'TotalSpent']
target_feature = 'Transported'

temp_df = df[categorical_features + continuous_features + [target_feature]].dropna()

temp_df = pd.get_dummies(temp_df, columns=categorical_features)

X = temp_df.drop(columns=[target_feature])
y = temp_df[target_feature]

# Train a gradient boosting classifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=1337)
gb_classifier.fit(X, y)

importance_df = pd.DataFrame({'Feature': gb_classifier.feature_names_in_, 'Importance': gb_classifier.feature_importances_})

# Sort the features by importance in descending order
importance_df = importance_df.sort_values('Importance', ascending=False)

# Plot the feature importance
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance (Gradient Boosting) for Transported')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

***Observations:***
- 'TotalSpent' has a significant feature importance for the target.
- The individual continuous spending columns have the second largest importance.

### Remaining Missing Data:
- Let's check what else we can try to impute and analyze.

In [None]:
print("number of samples without missing data:")
print(len(df.dropna()))

print("Missing features total:")
display(df.isna().sum())

### Further HomePlanet-based EDA

In [None]:
planets = df['HomePlanet'].dropna().unique().tolist()
for planet in planets:
    print('planet:',planet)
    df_planet = df[df['HomePlanet'] == planet]
    print(df_planet['VIP'].value_counts(normalize=True))
    print(df_planet['Destination'].value_counts(normalize=True))
    print(df_planet['CryoSleep'].value_counts(normalize=True))
    print(df_planet['Deck'].value_counts(normalize=True))
    print(df_planet['Side'].value_counts(normalize=True))
    print()

### VIP based EDA

In [None]:
df_isvip = df[df['VIP'] == True]
df_novip = df[df['VIP'] == False]
df_nanvip = df[df['VIP'].isna()]

df_vips = [df_isvip, df_novip, df_nanvip]
for df_vip in df_vips:
    display(df_vip['VIP'].value_counts())
    display(df_vip['Deck'].value_counts(normalize=True))
    display(df_vip[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].max())
    display(df_vip['CryoSleep'].value_counts(normalize=True))
    print()

### Cryo-sleep based EDA

In [None]:
cryosleeps = df['CryoSleep'].dropna().unique().tolist()
for cryosleep in cryosleeps:
    print('cryosleep:',cryosleep)
    df_cryo = df[df['CryoSleep'] == cryosleep]
    print(df_cryo['VIP'].value_counts(normalize=True))
    print(df_cryo['Destination'].value_counts(normalize=True))
    print(df_cryo['Deck'].value_counts(normalize=True))
    print(df_cryo['Side'].value_counts(normalize=True))
    print()

In [None]:
df[(df['CryoSleep'] == True) & (df['VIP'] == True)]

# Moving on to training
- No further correlations between features are immediately obvious. Further analysis would be very granular and focused on features with low importance. At this point, we will move on to training the model and hyperparameter tuning, and revisit the analysis if performance is unacceptable.

# Kaggle score
- 0.80336 (top 550 people!!)

# Reflection
Imputations I missed:
- The most obvious imputations I missed relate to the correlations between age and spending and between age and cryosleep. There are some correlations with decks, but I do not feel I could have made any assertions or assumptions with confidence.