# change index for passenger ID and maybe get rid of iloc

In this notebook I show the algorithmic approach I used to fill (nearly) every missing Cabin. This is not a guessing or probabilistic approach: cabins are filled in a structured order based on each passenger's HomePlanet and their group (taken from their PassengerId).

Cabins are filled in order of their number within each deck and side, i.e. if a passenger is in cabin A/05/P, a passenger in a later group cannot be in A/04/P, but they could be in A/01/S or B/01/P.

We define the components of a cabin, for example A/01/P, as follows:
* A = cabin deck, can take values 'A','B','C','D','E','F','G','T'
* 01 = cabin number, can take values 0,1,2...
* P = cabin side, can take values 'P', 'S' (presumably 'Port' and 'Starboard')
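
As a small illustration of the three components above, here is a minimal sketch that splits a cabin string into its parts. The names `CabinParts` and `parse_cabin` are hypothetical helpers for this example only and are not used elsewhere in the notebook.

```python
from typing import NamedTuple


class CabinParts(NamedTuple):
    deck: str    # e.g. 'A'
    number: int  # e.g. 1
    side: str    # 'P' or 'S'


def parse_cabin(cabin: str) -> CabinParts:
    """Split a cabin string such as 'A/01/P' into deck, number and side."""
    deck, number, side = cabin.split("/")
    return CabinParts(deck, int(number), side)


parts = parse_cabin("A/01/P")
print(parts.deck, parts.number, parts.side)
```

Storing the number as an integer is what makes the ordering comparisons used later possible.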

Some assumptions
* If two passengers are in the same group then they are on the same side, Appendix A.1
* If two passengers are in the same group then they are from the same home planet, Appendix A.2
* If two passengers share a last name then they are from the same home planet, Appendix A.3
* Children <= 12 in age have no bills, Appendix A.4
* Those who have Cryosleep = True have no bills, Appendix A.5
* Cabins can only be shared with group members, Appendix A.6
* Home planets restrict which decks a passenger is on, Appendix B
** Passengers with Mars as their home planet are in decks 'D','E' or 'F'
** Passengers with Earth as their home planet are in decks 'E','F' or 'G'
** Passengers with Europa as their home planet are in decks 'A','B','C','D','E','T'
** If a passenger has no bills (RoomService + ShoppingMall + Spa + VRDeck + FoodCourt) and the other members of their group are spread over different decks, then they are restricted to these decks
*** Earth :'G'
*** Europa: 'B'
*** Mars: 'E','F'



# Feature engineering

Below we combine the training data and the test data, as the ordering by group is preserved across both.

In [171]:
import pandas as pd 
from collections import defaultdict # Slightly modified from a regular dictionary


training_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
training_data['Set'] = 'Train'
test_data['Set'] = 'Test'

# The combined dataframe we will be using for the rest of this project
df = pd.concat([training_data,test_data]) 



This is our starting point: 299 Cabins are missing.

In [172]:
df['Cabin'].isna().sum()

299

We then split each PassengerId into the Group and the passenger's number within the group,
split the Cabin into deck, number and side, and split the Name into first and last name.

In [173]:
def column_splits(data_frame):
    data_frame[['Group', 'GroupNumber']] = data_frame['PassengerId'].str.split('_', expand=True)

    data_frame[['CabinDeck', 'CabinNumber', 'CabinSide']]= data_frame['Cabin'].str.split("/", expand = True)
    data_frame.CabinNumber = data_frame.CabinNumber.astype('Int64')
    
    data_frame[['FirstName','LastName']] = data_frame['Name'].str.split(" ",expand = True)

    return data_frame

df = column_splits(df)


df = df.sort_values(by = ['Group','GroupNumber'])
df = df.reset_index(drop = True)

Total bills are the sum of each passenger's RoomService, FoodCourt, ShoppingMall, Spa and VRDeck payments.
We can impute Bills as 0 if someone is under 13 and/or in CryoSleep (Appendix A.4, A.5).

In [174]:
df['Bills'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
df.loc[(df['Age'] < 13), 'Bills'] = 0
df.loc[(df['CryoSleep'] == True),'Bills'] = 0 
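
To make the effect of this imputation concrete, here is a minimal sketch on a made-up three-passenger frame (the values are invented for illustration). It uses `sum(axis=1, skipna=False)`, which should behave like the chained `+` above: any missing payment makes the total missing, unless the age or CryoSleep rule sets it to 0.

```python
import numpy as np
import pandas as pd

spend_columns = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Invented toy rows: a child, a passenger in CryoSleep, and an awake adult,
# each with one payment missing
toy = pd.DataFrame({
    'Age': [5, 30, 40],
    'CryoSleep': [False, True, False],
    'RoomService': [np.nan, 0.0, 10.0],
    'FoodCourt': [0.0, np.nan, 20.0],
    'ShoppingMall': [0.0, 0.0, np.nan],
    'Spa': [0.0, 0.0, 5.0],
    'VRDeck': [0.0, 0.0, 0.0],
})

# A missing payment makes the row total NaN
toy['Bills'] = toy[spend_columns].sum(axis=1, skipna=False)

# The two rules from Appendix A.4 and A.5 recover the total for rows 0 and 1
toy.loc[toy['Age'] < 13, 'Bills'] = 0
toy.loc[toy['CryoSleep'] == True, 'Bills'] = 0

bills = toy['Bills'].tolist()
print(bills)
```

The awake adult keeps a missing total, since neither rule applies to them.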
    

GroupSize is a useful column showing how many passengers are in each passenger's group.

In [175]:
def add_group_size_column(dataframe):
    dataframe['GroupSize'] = dataframe.groupby('Group')['Group'].transform('count')
    return dataframe


df = add_group_size_column(df)


Below is a function to impute an attribute based on a shared feature. Rows with a missing HomePlanet are imputed if the passenger shares a group or a last name with someone whose HomePlanet is known (Appendix A.2, A.3).

In [176]:
def impute_attribute_by_shared_features(dataframe,attribute,shared_feature):
    
    # Iterates through all the rows that have nan for this attribute
    for index, row in dataframe[dataframe[attribute].isna()].iterrows():
        rows_with_shared_features = dataframe[dataframe[shared_feature] == row[shared_feature]].dropna(subset=[attribute])
        
        if not rows_with_shared_features.empty:
            dataframe.loc[index, attribute] = rows_with_shared_features[attribute].iloc[0]

    return dataframe

df = impute_attribute_by_shared_features(df,'HomePlanet','Group')
df = impute_attribute_by_shared_features(df,'HomePlanet','LastName')
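
The same idea can also be written without a row loop. The following is a sketch of an alternative, not the function used in this notebook: `fill_from_group` is a hypothetical name, and it relies on `groupby(...).transform('first')` returning the first non-missing value in each group. On a single pass it should give the same result as the loop above.

```python
import pandas as pd


def fill_from_group(frame, attribute, shared_feature):
    """Fill missing `attribute` values from the first known value among rows
    that share the same `shared_feature` value."""
    first_known = frame.groupby(shared_feature)[attribute].transform('first')
    out = frame.copy()
    out[attribute] = out[attribute].fillna(first_known)
    return out


# Invented toy data: group 0001 has one known planet, group 0002 has none
toy = pd.DataFrame({
    'Group': ['0001', '0001', '0002', '0003'],
    'HomePlanet': ['Earth', None, None, 'Mars'],
})

filled = fill_from_group(toy, 'HomePlanet', 'Group')
print(filled)
```

The passenger in group 0002 stays missing because nobody in that group has a known HomePlanet.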

For those with missing Cabin values, we can limit their options by ruling out some cabin decks or sides. We know that each home planet is restricted to certain decks, and that if a passenger has Bills = 0 and their group is spread across multiple decks then they must be on a particular deck based on their home planet (Appendix B).

We also know that every group is on only one cabin side, even if the group is split across multiple cabins (Appendix A.1).

In [177]:

def add_potential_decks_column(dataframe):
    
    potential_decks_by_homeplanet = {
    'Earth':['E','F','G'],
    'Europa': ['A','B','C','D','E','T'],
    'Mars': ['D','E','F']
    }

    potential_decks_by_homeplanet_no_bills = {
        'Earth':['G'],
        'Europa':['B'],
        'Mars': ['E','F']
    }
    
    def func_potential_decks_apply(row):
        if pd.isna(row.Cabin):
            if row.Bills == 0 and not pd.isna(row.HomePlanet) and row.GroupSize > 1:
                
                group_members = dataframe[(dataframe.Group == row.Group) & (dataframe.PassengerId != row.PassengerId)].CabinDeck
                # Checking if other members of group are in multiple different cabin decks
                if group_members.dropna().nunique() > 1:
                    return potential_decks_by_homeplanet_no_bills[row.HomePlanet]
                
                elif not group_members.isna().any():
                    return list(set(potential_decks_by_homeplanet_no_bills[row.HomePlanet] + list(group_members.dropna().unique())))

                if group_members.nunique() == 1:
                    if group_members.iloc[0] in potential_decks_by_homeplanet_no_bills[row.HomePlanet]:
                        return potential_decks_by_homeplanet_no_bills[row.HomePlanet]
                    
            # Otherwise (e.g. their bills are greater than 0) fall back to the standard decks for their homeplanet
            if not pd.isna(row.HomePlanet):
                return potential_decks_by_homeplanet[row.HomePlanet]
            
            # If their homeplanet isn't known then they could be in any cabin deck
            else:
                return list(dataframe.CabinDeck.dropna().unique())
            
    dataframe['PotentialDecks'] = dataframe.apply(func_potential_decks_apply,axis = 1)
    return dataframe
    
            

def add_potential_sides_column(dataframe):
    
    def func_potential_sides_apply(row):
        if pd.isna(row.Cabin):
            
            # Checks to see if anyone else in their group has a known cabin side
            group_sides = dataframe[dataframe.Group == row.Group].CabinSide.dropna()
            if group_sides.nunique() > 0:
                return [group_sides.iloc[0]]
            
            # If no one else is in their group or they haven't got a known cabin side then the passenger could be on either side
            return ['P','S']
        
    dataframe['PotentialSides'] = dataframe.apply(func_potential_sides_apply,axis = 1)
    return dataframe

    
df = add_potential_decks_column(df)
df = add_potential_sides_column(df)



Ordering the rows by Group then GroupNumber is useful, as this is the order in which free cabins are filled.

In [126]:
df = df.sort_values(by = ['Group','GroupNumber'])
df = df.reset_index(drop = True)


# Imputing

This is a helper function so that when we fill a Cabin it also fills the corresponding CabinDeck, CabinNumber and CabinSide.

In [128]:
def impute_from_cabin_and_index(dataframe,cabin,index):
    dataframe.loc[index,['Cabin','CabinDeck','CabinNumber','CabinSide']] = [cabin,cabin.split("/")[0],int(cabin.split("/")[1]),cabin.split("/")[2]]
    return dataframe

Below is the main function, which finds every passenger missing a cabin and collects all the cabins they could potentially fill. For each of the passenger's potential decks and sides, it looks at the highest cabin number seen before the passenger and the lowest cabin number seen after them.
I.e. if cabin A/02/S came before and A/03/S came after, there is no cabin in between for the passenger to fill.
If however cabin A/05/P came before and A/07/P came after, then A/06/P is a cabin they could potentially take.

In [129]:
def passengers_empty_cabin_options(dataframe):
    
    df_passengers_without_cabin = dataframe[dataframe['Cabin'].isna()]
    all_passenger_cabin_options = {}

    for passenger_index, passenger in df_passengers_without_cabin.iterrows():
        all_passenger_cabin_options[passenger_index] = []

        for deck in passenger.PotentialDecks:
            for side in passenger.PotentialSides:
                
                # Filter dataframe for the current deck and side
                df_filtered = dataframe[(dataframe['CabinDeck'] == deck) & (dataframe['CabinSide'] == side)]

                # Split into cabins before and after the current passenger index
                max_cabin_no_before = max(df_filtered.loc[df_filtered.index < passenger_index, 'CabinNumber'].dropna().unique(), default = -1 )
                min_cabin_no_after = min(df_filtered.loc[df_filtered.index > passenger_index, 'CabinNumber'].dropna().unique(), default = -1)

                # If no cabins were found of that deck and side before or after the row
                if max_cabin_no_before == -1 or min_cabin_no_after == -1:
                    continue
                
                # If a cabin number is seen before the row and the next cabin number is more than 1 higher after the row
                # then there is an empty cabin it can potentially fill
                if max_cabin_no_before + 1 < min_cabin_no_after:
                    all_passenger_cabin_options[passenger_index] += [f"{deck}/{i}/{side}" for i in range(max_cabin_no_before + 1, min_cabin_no_after)]

    return all_passenger_cabin_options
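
The gap rule at the heart of this function can be shown on plain lists. This is a minimal sketch for one deck and side; `free_cabin_numbers` is a hypothetical helper written for this example, and it mirrors the function above in returning nothing when no cabin is known on either side of the passenger.

```python
def free_cabin_numbers(numbers_before, numbers_after):
    """Cabin numbers strictly between the highest number seen before the
    passenger and the lowest number seen after them (one deck and side)."""
    if not numbers_before or not numbers_after:
        # Nothing known on one side of the passenger, so no gap can be bounded
        return []
    return list(range(max(numbers_before) + 1, min(numbers_after)))


# A/02/S before and A/03/S after: no room in between
print(free_cabin_numbers([1, 2], [3, 4]))

# A/05/P before and A/07/P after: cabin 6 is free
print(free_cabin_numbers([3, 5], [7, 8]))
```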





# solo group and only one room that fits

The reason we can't simply fill a cabin when passengers_empty_cabin_options() gives a passenger only one option is that the passenger may instead share a cabin. This function checks that they are alone in their group, meaning they can't share, as cabins are only shared with group members (Appendix A.6). If they are alone in their group and have only one cabin option based on their position onboard, that cabin is imputed for them.

In [130]:
def solo_group_one_cabin_option(dataframe):
    
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)

    # Iterates through all the passengers that haven't got a Cabin yet and are alone in their group (ie can't share)
    for passenger_index in list(dataframe[(dataframe.Cabin.isna()) & (dataframe.GroupSize == 1)].index):

        # If they have only one free cabin that they could fill
        if len(all_passenger_cabin_options[passenger_index]) == 1:
            matching_cabin = all_passenger_cabin_options[passenger_index][0]
            dataframe = impute_from_cabin_and_index(dataframe,matching_cabin,passenger_index)

    return dataframe


# no free rooms so has to share

The next function imputes a cabin for passengers who have no empty cabin options, and so must share a cabin with a member of their group. Some groups occupy multiple cabins, and we don't want to guess which one the passenger shares. That problem disappears if only one of the group's cabins meets the requirements we found in potential decks and potential sides.

In [131]:
def no_suitable_cabin_so_shares(dataframe):
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)
    
    for passenger_index,passenger_cabin_options in all_passenger_cabin_options.items():
        
        # If there are no free cabins that the passenger can fill
        if not passenger_cabin_options:
            
            passenger_row = dataframe.loc[passenger_index]
            
            # Finding all other group members cabins and filtering them by whether they are in the same deck that the passenger must be in
            passengers_group_cabins = dataframe[(dataframe['Group'] == passenger_row['Group']) &
                                  (dataframe['CabinDeck'].isin(passenger_row['PotentialDecks']))].Cabin.dropna()
            
            # If there is only one Cabin from their group they could share with
            if passengers_group_cabins.nunique() == 1:
                matching_cabin = passengers_group_cabins.iloc[0]
                dataframe = impute_from_cabin_and_index(dataframe,matching_cabin,passenger_index)
                
    return dataframe
    


# only passenger that can take that cabin

This final function works on the presumption that every cabin is filled (i.e. there are no gaps in the cabin numbers). If a passenger is the only one who suits a certain cabin, that cabin is allocated to them.

In [132]:
def only_matching_passenger_for_cabin(dataframe):
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)
    
    cabins_to_fill = defaultdict(list)
    
    # Iterate over cabins to see which passengers can fit that cabin
    for passenger_index, cabin_options in all_passenger_cabin_options.items():
        for cabin in cabin_options:
            cabins_to_fill[cabin].append(passenger_index)
    
    # Iterate over cabin and impute passengers where only one fits
    for cabin, passengers_indices in cabins_to_fill.items():
        if len(passengers_indices) == 1:
            dataframe = impute_from_cabin_and_index(dataframe, cabin, passengers_indices[0])
    
    return dataframe
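
The inversion from "options per passenger" to "candidates per cabin" is the key step here, so a minimal sketch on an invented options dictionary may help. `cabins_with_single_candidate` is a hypothetical name for this example; the passenger indices and cabins are made up.

```python
from collections import defaultdict


def cabins_with_single_candidate(options_by_passenger):
    """Invert {passenger: [cabins]} into {cabin: [passengers]} and keep only
    the cabins that exactly one passenger could take."""
    candidates = defaultdict(list)
    for passenger, cabins in options_by_passenger.items():
        for cabin in cabins:
            candidates[cabin].append(passenger)
    return {cabin: who[0] for cabin, who in candidates.items() if len(who) == 1}


# Invented example: passenger 10 could take two cabins, passenger 11 only one
toy_options = {10: ['A/5/P', 'A/6/P'], 11: ['A/6/P'], 12: []}
matches = cabins_with_single_candidate(toy_options)
print(matches)
```

Only A/5/P is resolved: A/6/P still has two candidates, so it is left for a later pass.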


# all imputes

As we impute cabins for some passengers, this limits the free cabins left for the remaining passengers and also reduces the competition for certain cabins. With my functions, two iterations of all three fill every cabin that the functions can find.

In [186]:
def all_imputes(dataframe):
    dataframe = solo_group_one_cabin_option(dataframe)
    dataframe = no_suitable_cabin_so_shares(dataframe)
    dataframe = only_matching_passenger_for_cabin(dataframe)

    dataframe = solo_group_one_cabin_option(dataframe)
    dataframe = no_suitable_cabin_so_shares(dataframe)
    dataframe = only_matching_passenger_for_cabin(dataframe)
    
    return dataframe
    
df = all_imputes(df)
df.Cabin.isna().sum()

37

37 cabins still remain unfilled, and we started with 299! There are a few more that we can find which those functions did not cover.
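
Rather than hard-coding two passes, the steps could be repeated until the number of missing cabins stops changing. The following is a sketch of that idea and not what this notebook runs: `impute_until_stable` and the toy step `fill_first_missing` are hypothetical names, and `max_rounds` is an arbitrary safety limit.

```python
import pandas as pd


def impute_until_stable(frame, steps, column='Cabin', max_rounds=10):
    """Apply each step in order, repeating until a full round fills nothing."""
    missing = frame[column].isna().sum()
    for _ in range(max_rounds):
        for step in steps:
            frame = step(frame)
        still_missing = frame[column].isna().sum()
        if still_missing == missing:
            break  # a whole round made no progress
        missing = still_missing
    return frame


def fill_first_missing(frame):
    """Toy step for the demo: fills one missing cabin per call."""
    missing_index = frame.index[frame['Cabin'].isna()]
    if len(missing_index) > 0:
        frame.loc[missing_index[0], 'Cabin'] = 'X/0/P'
    return frame


toy = pd.DataFrame({'Cabin': ['A/0/P', None, None]})
toy = impute_until_stable(toy, [fill_first_missing])
remaining = int(toy['Cabin'].isna().sum())
print(remaining)
```

With the real functions, the call would presumably look like `impute_until_stable(df, [solo_group_one_cabin_option, no_suitable_cabin_so_shares, only_matching_passenger_for_cabin])`.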

# Manual workings


This function prints out some useful data to help us deduce some of the remaining cabins for ourselves

In [182]:
def all_cabin_options_for_each_row(dataframe):
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)
    for passenger_index, passenger_options in all_passenger_cabin_options.items():
        print()
        print("Index:",passenger_index, "GroupSize:", dataframe.loc[passenger_index].GroupSize)
        print("Free cabins that match:")
        print(passenger_options)

                
             


In [183]:
all_cabin_options_for_each_row(df)


Index: 404 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']
0.0

Index: 421 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']
nan

Index: 479 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']
2385.0

Index: 505 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']
1298.0

Index: 517 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']
789.0

Index: 1429 GroupSize: 2
Free cabins that match:
['E/58/P']
1692.0

Index: 1466 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S', 'E/58/P']
0.0

Index: 1543 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S']
0.0

Index: 2442 GroupSize: 7
Free cabins that match:
[]
1338.0

Index: 2970 GroupSize: 5
Free cabins that match:
[]
9597.0

Index: 3529 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']
711.0

Index: 3530 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']
791.0

Index: 4233 GroupSize: 1
Free cabins that match:
['B/98/P', 'B/99/P']
0.0

Index: 4254 GroupSize: 1
Free cabins tha

### Manual imputation reasoning

* Index = 1429, Cabin = E/58/P
** This cabin can only be filled by 1429 or 1466. But 1466 and 1543 are the only two passengers that can fill C/40/S and D/36/S, so one must take each, which leaves index 1429 to fill E/58/P
* Index = 4233, 4254, Cabin = B/98/P, B/99/P
** These indices weren't filled automatically because the consecutive free cabins gave each passenger multiple options. As no one else can fill them and index 4233 comes before 4254, they are filled in that order
* Index = 6493, 6514, Cabin = E/300/S, E/301/S
** As with the previous example, they are the only two passengers that can fill these cabins, and they weren't imputed because the free cabins are consecutive
* Index = 8413, Cabin = A/57/P
** Indices 8465 and 8450 are each alone in their group and have only two cabins to fill, so one of them must fill D/191/P and the other E/387/P. This leaves index 8413 no other option but to join the only cabin that the rest of its group is in
* Index = 12892, 12893, Cabin = F/1785/S, F/1785/S
** These two indices are the only members of their group and have one cabin option between them, so they must both share F/1785/S
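
The reasoning for index 1429 can be sketched in code. This is an illustration of the manual deduction, not a function used in this notebook: `resolve_by_locked_pairs` is a hypothetical name, and it assumes (as stated earlier) that every free cabin must be filled by one of the passengers missing a cabin. If two cabins have exactly two candidate passengers between them, those two passengers are locked to those cabins and can be removed from every other cabin.

```python
from collections import defaultdict
from itertools import combinations


def resolve_by_locked_pairs(options_by_passenger):
    """Return {cabin: passenger} for cabins left with a single candidate
    after removing passengers that are locked to a pair of other cabins."""
    candidates = defaultdict(set)
    for passenger, cabins in options_by_passenger.items():
        for cabin in cabins:
            candidates[cabin].add(passenger)

    for cabin_a, cabin_b in combinations(list(candidates), 2):
        pair = candidates[cabin_a] | candidates[cabin_b]
        if len(pair) == 2:
            # These two passengers must take cabin_a and cabin_b between them
            for cabin in candidates:
                if cabin not in (cabin_a, cabin_b):
                    candidates[cabin] -= pair

    return {cabin: next(iter(who)) for cabin, who in candidates.items() if len(who) == 1}


# Options taken from the printed output above
options = {
    1429: ['E/58/P'],
    1466: ['C/40/S', 'D/36/S', 'E/58/P'],
    1543: ['C/40/S', 'D/36/S'],
}
resolved = resolve_by_locked_pairs(options)
print(resolved)
```

Index 1429 is in a group of 2, so without this deduction it could also have shared with its group member, which is why the automatic functions left it alone.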

In [184]:




cabin_list = [(1429,'E/58/P'),(4233,'B/98/P'),(4254,'B/99/P'),(6493,'E/300/S'),(6514,'E/301/S'),(8413,'A/57/P'), (12892,'F/1785/S'),(12893,'F/1785/S')]

for index,cabin in cabin_list:
    impute_from_cabin_and_index(df,cabin,index)




# Conclusion

In [187]:
df.Cabin.isna().sum()

29

Here we have finished with all the cabins I can find that can be imputed. We have just 29 Cabins remaining, which should help everyone rise up the leaderboard!

Below I detail how to split the data back into the training set and the test set, and then the reasoning behind those last 29 passengers and why we can't decide which cabin they should take (yet). If any new inferences come to light please let me know!

In [109]:
traindata = df[df.Set == 'Train']
testdata = df[df.Set == 'Test']

# Remaining Passengers

In [188]:
all_cabin_options_for_each_row(df)


Index: 404 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']
0.0

Index: 421 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']
nan

Index: 479 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']
2385.0

Index: 505 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']
1298.0

Index: 517 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']
789.0

Index: 1466 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S']
0.0

Index: 1543 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S']
0.0

Index: 2442 GroupSize: 7
Free cabins that match:
[]
1338.0

Index: 2970 GroupSize: 5
Free cabins that match:
[]
9597.0

Index: 3529 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']
711.0

Index: 3530 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']
791.0

Index: 4569 GroupSize: 3
Free cabins that match:
[]
770.0

Index: 4751 GroupSize: 7
Free cabins that match:
[]
3674.0

Index: 5016 GroupSize: 1
Free cabins that match:
['G/590/P', 'G/579/S']
67

Cases of two passengers alone in their group and two cabins that fit both of their requirements
* Index 404 and 421, cabins B/13/P and C/13/S
* Index 1466 and 1543, Cabins C/40/S and D/36/S
* Index 3529 and 3530, Cabins E/150/P and F/519/P
* Index 5016 and 5017, Cabins G/590/P and G/579/S
* Index 8450 and 8465, Cabins D/191/P and E/387/P
* Index 10081 and 10082, Cabins F/1489/P and G/1157/P
* Index 10434 and 10440, Cabins F/1544/P and G/1212/S **
* Index 11129 and 11148, Cabins C/298/S and E/528/S
** Index 10434 also has the option of F/1424/S, but as 10434 and 10440 are the only passengers that can take F/1544/P and G/1212/S, they must logically take one each


Cases of passengers who have to share a cabin with a member of their group, but there are multiple cabins that meet their requirements
* Index 2442, Cabins F/326/S, D/61/S, E/127/S
* Index 2970, Cabins D/70/S, E/153/S, F/410/S
* Index 4569, Cabins G/522/S, F/621/S
* Index 4751, Cabins E/232/S, F/645/S
* Index 12174, Cabins F/1798/P, G/1416/P

Other cases
* Indices 479, 505 and 517 are each in a group of 2 and are the only ones that can take cabins E/20/P and E/21/P, meaning that one of them shares with their group member and the other two take those cabins
* Either 10290 takes cabin C/270/S and then 10313 takes D/235/P, 10394 takes F/1424/S, 10408 takes G/1206/S and 10411 shares E/495/S
Or 10290 shares C/269/S and then 10313 takes C/270/S, 10394 takes D/235/P, 10408 takes F/1424/S and 10411 takes G/1206/S
* 10290 10313 10394 10408 10411 10434 10440 

Further assumptions:
There aren't more cabins available than we have assumed. While working on this project I initially missed imputing lots of cabins: if a passenger was in a later group than the highest cabin number seen on a given deck and side, there was the potential for the cabin numbers to go on past what we had seen. I gave up this belief because, after assuming they didn't, all the passengers fit within the constraints, which seemed unlikely if the assumption were wrong. I.e. I found no passengers with neither a cabin option nor anyone to share with, and no cabins without a passenger that could match them.

# Appendix

For the evidence in the Appendix I reload the combined original dataframes without any imputations, so as not to misrepresent the underlying distributions.

In [138]:
training_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
df = pd.concat([training_data,test_data]) 

df = column_splits(df)
df = df.sort_values(by = ['Group','GroupNumber'])
df = df.reset_index(drop = True)

## A.1

Evidence of passengers sharing a group implying that their cabin is on the same side

In [144]:
# Group by 'Group' and check if all non-NaN 'CabinSide' values within each group are the same
consistent_count = 0
inconsistent_count = 0

# Iterate through each group
for group, group_df in df.groupby('Group'):
    if len(group_df) > 1:
        # Get unique non-NaN CabinSide values
        unique_sides = group_df['CabinSide'].dropna().unique()
        
        if len(unique_sides) <= 1:
            # All rows in this group are consistent
            consistent_count += len(group_df)
        else:
            # Some rows in this group are inconsistent
            inconsistent_count += len(group_df)

print(f"Number of rows with consistent cabin sides: {consistent_count}")
print(f"Number of rows with inconsistent cabin sides: {inconsistent_count}")

Number of rows with consistent cabin sides: 5825
Number of rows with inconsistent cabin sides: 0
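
The same kind of consistency check can be written more compactly with `groupby(...).transform('nunique')`, which counts distinct non-missing values per group. This is a sketch of an alternative on invented data; `rows_in_inconsistent_groups` is a hypothetical helper and is not used for the figures reported here.

```python
import pandas as pd


def rows_in_inconsistent_groups(frame, group_column, value_column):
    """Count rows belonging to groups that hold more than one distinct
    non-missing value of `value_column`."""
    distinct = frame.groupby(group_column)[value_column].transform('nunique')
    return int((distinct > 1).sum())


# Invented toy data: every group is on a single side (missing values ignored)
consistent = pd.DataFrame({
    'Group': ['1', '1', '2', '2', '3'],
    'CabinSide': ['P', 'P', 'S', None, 'P'],
})

# Invented toy data: one group split across both sides
inconsistent = pd.DataFrame({
    'Group': ['1', '1'],
    'CabinSide': ['P', 'S'],
})

print(rows_in_inconsistent_groups(consistent, 'Group', 'CabinSide'))
print(rows_in_inconsistent_groups(inconsistent, 'Group', 'CabinSide'))
```

The same helper would apply to the HomePlanet checks in A.2 and A.3 by changing the column names.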


## A.2

Evidence of passengers sharing a group implying that they have the same home planet

In [143]:
# Initialize counters
planet_consistent_count = 0
planet_inconsistent_count = 0

# Iterate through each group
for group, group_df in df.groupby('Group'):
    if len(group_df) > 1:
        # Get unique non-NaN HomePlanet values
        unique_home_planets = group_df['HomePlanet'].dropna().unique()
        
        if len(unique_home_planets) <= 1:
            # All rows in this group are consistent in HomePlanet
            planet_consistent_count += len(group_df)
        else:
            # Some rows in this group are inconsistent in HomePlanet
            planet_inconsistent_count += len(group_df)

print(f"Number of rows with consistent home planets: {planet_consistent_count}")
print(f"Number of rows with inconsistent home planets: {planet_inconsistent_count}")

Number of rows with consistent home planets: 5825
Number of rows with inconsistent home planets: 0


## A.3

Evidence of passengers sharing a last name implying that they have the same home planet

In [145]:
# Initialize counters
planet_consistent_count = 0
planet_inconsistent_count = 0

# Iterate through each last name group
for last_name, group_df in df.groupby('LastName'):
    if len(group_df) > 1:  # Exclude last names with only one passenger
        # Get unique non-NaN HomePlanet values
        unique_home_planets = group_df['HomePlanet'].dropna().unique()
        
        if len(unique_home_planets) <= 1:
            # All rows in this group are consistent in HomePlanet
            planet_consistent_count += len(group_df)
        else:
            # Some rows in this group are inconsistent in HomePlanet
            planet_inconsistent_count += len(group_df)

print(f"Number of rows with consistent home planets by last name: {planet_consistent_count}")
print(f"Number of rows with inconsistent home planets by last name: {planet_inconsistent_count}")

Number of rows with consistent home planets by last name: 12468
Number of rows with inconsistent home planets by last name: 0


## A.4

Evidence of children under the age of 13 having no bills

In [147]:
df['Bills'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']


In [167]:
# Filter out rows with NaN values in Bills
df_filtered = df[df['Bills'].notna()]

# Check if passengers under the age of 13 have bills = 0
under_13 = df_filtered[df_filtered['Age'] < 13]
under_13_bills_zero = under_13['Bills'] == 0

# Calculate summary statistics
total_under_13 = len(under_13)
bills_zero_under_13 = under_13_bills_zero.sum()
bills_not_zero_under_13 = total_under_13 - bills_zero_under_13

# Create a summary DataFrame
summary = pd.DataFrame({
    'Total Under 13': [total_under_13],
    'Bills = 0': [bills_zero_under_13],
    'Bills != 0': [bills_not_zero_under_13],
    'Consistency Ratio': [bills_zero_under_13 / total_under_13]
})

# Print summary statistics
print("Summary statistics for passengers under the age of 13 (excluding NaN Bills):")
print(summary)

Summary statistics for passengers under the age of 13 (excluding NaN Bills):
   Total Under 13  Bills = 0  Bills != 0  Consistency Ratio
0            1030       1030           0                1.0


## A.5

Evidence of those in CryoSleep having no bills

In [166]:
# Filter out rows with NaN values in Bills
df_filtered = df[df['Bills'].notna()]

# Check if passengers in CryoSleep have bills = 0
cryo = df_filtered[df_filtered['CryoSleep'] == True]
cryo_bills_zero = cryo['Bills'] == 0

# Calculate summary statistics
total_in_cryo = len(cryo)
cryo_bills_zero = cryo_bills_zero.sum()
cryo_bills_not_zero = total_in_cryo - cryo_bills_zero

# Create a summary DataFrame
summary = pd.DataFrame({
    'Total CryoSleep': [total_in_cryo],
    'Bills == 0': [cryo_bills_zero],
    'Bills != 0': [cryo_bills_not_zero],
    'Consistency Ratio': [cryo_bills_zero / total_in_cryo]
})

# Print summary statistics
print("Summary statistics for passengers in CryoSleep (excluding NaN Bills):")
print(summary)

Summary statistics for passengers in CryoSleep (excluding NaN Bills):
   Total CryoSleep  Bills == 0  Bills != 0  Consistency Ratio
0             4068        4068           0                1.0


## A.6

Evidence of cabins only being shared by members of the same group

In [180]:
# Filter cabins with more than one member
cabin_counts = df['Cabin'].value_counts()
multi_member_cabins = cabin_counts[cabin_counts > 1].index

# Group by Cabin and list unique groups for each Cabin with more than one member
cabin_group_mapping = df[df['Cabin'].isin(multi_member_cabins)].groupby('Cabin')['Group'].unique()

# Check if any Cabin is associated with more than one group
shared_cabins = cabin_group_mapping[cabin_group_mapping.apply(lambda groups: len(groups) > 1)]

# Print the results
print("Cabins shared by more than one group (with more than one member):")
print(shared_cabins)

# Summary statistics
total_multi_member_cabins = len(cabin_group_mapping)
shared_cabin_count = len(shared_cabins)
unique_cabin_count = total_multi_member_cabins - shared_cabin_count

print(f"\nTotal multi-member cabins: {total_multi_member_cabins}")
print(f"Multi-member cabins shared by multiple groups: {shared_cabin_count}")
print(f"Multi-member cabins unique to one group: {unique_cabin_count}")

Cabins shared by more than one group (with more than one member):
Series([], Name: Group, dtype: object)

Total multi-member cabins: 1684
Multi-member cabins shared by multiple groups: 0
Multi-member cabins unique to one group: 1684


## Appendix B

Evidence of home planets restricting which deck a passenger's cabin is on 

In [146]:
# Group by 'HomePlanet' and 'CabinDeck' and count occurrences
deck_counts = df.groupby(['HomePlanet', 'CabinDeck']).size().reset_index(name='Count')

# Pivot the table to get a better overview
pivot_table = deck_counts.pivot(index='CabinDeck', columns='HomePlanet', values='Count').fillna(0).astype(int)

print(pivot_table)

HomePlanet  Earth  Europa  Mars
CabinDeck                      
A               0     346     0
B               0    1124     0
C               0    1081     0
D               0     296   406
E             583     197   508
F            2426       0  1713
G            3700       0     0
T               0      10     0


In [156]:
# Initialize a dictionary to store results
results = {}

# Iterate through each group
for group, group_df in df.groupby('Group'):
    # Check if within the group there are more than one CabinDeck
    unique_decks = group_df['CabinDeck'].dropna().unique()
    
    if len(unique_decks) > 1:
        # Find passengers with bills = 0
        zero_bill_passengers = group_df[(group_df['Bills'] == 0)  & (group_df['HomePlanet'].notna())]
        
        for idx, passenger in zero_bill_passengers.iterrows():
            home_planet = passenger['HomePlanet']
            cabin_deck = passenger['CabinDeck']
            
            if home_planet not in results:
                results[home_planet] = []
            
            results[home_planet].append(cabin_deck)

# Print the results
print("CabinDecks for passengers with bills = 0 in groups with multiple CabinDecks:")
for home_planet, cabin_decks in results.items():
    cabin_deck_counts = pd.Series(cabin_decks).value_counts().to_dict()
    print(f"HomePlanet: {home_planet}")
    for deck, count in cabin_deck_counts.items():
        print(f"  CabinDeck: {deck}, Count: {count}")


CabinDecks for passengers with bills = 0 in groups with multiple CabinDecks:
HomePlanet: Earth
  CabinDeck: G, Count: 526
HomePlanet: Mars
  CabinDeck: F, Count: 160
  CabinDeck: E, Count: 47
HomePlanet: Europa
  CabinDeck: B, Count: 11


# End

In [110]:
df_to_comp = pd.read_csv('data/31remaining.csv')
df_to_comp = df_to_comp.rename(columns = {'Number':'CabinNumber'})
df_to_comp['CabinNumber'] = df_to_comp['CabinNumber'].astype('Int64')


In [111]:
for index,row in df.iterrows():
    if not (pd.isna(row.Cabin) and pd.isna(df_to_comp.iloc[index].Cabin)):
        if row.Cabin != df_to_comp.iloc[index].Cabin:
            print(index,row.Cabin, df_to_comp.iloc[index].Cabin)

9267 G/1077/S F/1267/S
12651 A/94/P nan
12668 B/297/P nan


In [112]:
for index,row in df_to_comp.iterrows():
    if row.HomePlanet == 'Earth':
        if row.Deck == 'E':
            if row.GroupSize == 2:
                if row.Bills == 0:
                    if df_to_comp[df_to_comp.Group == row.Group].Deck.nunique() == 1:
                        print(index)


1512
1513
2778
7216
