
## Introduction
In this notebook, I demonstrate an algorithmic approach to filling in (nearly) every missing Cabin value in the Spaceship Titanic dataset. Unlike probabilistic or guess-based methods, this approach follows a structured order based on each passenger's HomePlanet and group (derived from their PassengerId).

Cabins are filled sequentially: within a given deck and side, cabin numbers never decrease as group numbers increase. For instance, if a passenger is assigned to cabin A/05/P, a passenger in a later group cannot be assigned to A/04/P. However, they could be assigned to A/01/S or B/01/P, since those lie on a different side or deck.
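
To make the ordering rule concrete, here is a minimal sketch on made-up rows. The toy passengers and the helper name `order_violations` are illustrative assumptions, not part of the dataset or of the notebook's own code. It flags any row whose cabin number is lower than an earlier row's number on the same deck and side.

```python
import pandas as pd

def order_violations(frame):
    """Return rows whose cabin number is lower than an earlier row's
    cabin number on the same deck and side (rows must be in group order)."""
    # Running maximum of the cabin number within each (deck, side) pair
    running_max = frame.groupby(['CabinDeck', 'CabinSide'])['CabinNumber'].cummax()
    return frame[frame['CabinNumber'] < running_max]

# Hypothetical passengers, already sorted by group
toy = pd.DataFrame({
    'Group': ['0001', '0002', '0003', '0004'],
    'CabinDeck': ['A', 'A', 'B', 'A'],
    'CabinNumber': [5, 1, 1, 4],
    'CabinSide': ['P', 'S', 'P', 'P'],
})

violations = order_violations(toy)
print(violations)
```

Only the last row (A/4/P, coming after A/5/P) is flagged; A/1/S and B/1/P are fine because they sit on a different side or deck.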

By employing this method, we aim to achieve a more accurate and logical imputation of missing Cabin values, which in turn should help the performance of our predictive model.

## Defining Cabin and PassengerID Components and Assumptions

The Cabin column in the dataset is structured in the format A/01/P, where:

- **A**: Cabin deck, which can take values 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'
- **01**: Cabin number, which can take values 0, 1, 2, ...
- **P**: Cabin side, which can take values 'P' (Port) or 'S' (Starboard)

The PassengerId column in the dataset is structured in the format 0201_01, where:
- **0201**: Group code, shared by all members of the same group
- **01**: Number within the group, which always starts at 1 and counts up through the members of the group
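
As a quick illustration of both formats, the two strings used above can be split as follows. The helper names `parse_cabin` and `parse_passenger_id` are hypothetical and exist only for this sketch; the notebook itself does the splitting on whole columns later on.

```python
def parse_cabin(cabin):
    """Split 'deck/number/side' into its three parts."""
    deck, number, side = cabin.split('/')
    return deck, int(number), side

def parse_passenger_id(passenger_id):
    """Split 'gggg_pp' into the group code and the number within the group."""
    group, position = passenger_id.split('_')
    return group, int(position)

print(parse_cabin('A/01/P'))          # ('A', 1, 'P')
print(parse_passenger_id('0201_01'))  # ('0201', 1)
```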


### Assumptions

To facilitate our imputation approach, we make several key assumptions:

1. **Group Members Share the Same Side**: If two passengers are in the same group, they are on the same side of the ship. (Appendix A.1)
2. **Group Members Share the Same Home Planet**: If two passengers are in the same group, they originate from the same home planet. (Appendix A.2)
3. **Shared Last Names Indicate Same Home Planet**: Passengers sharing a last name are from the same home planet. (Appendix A.3)
4. **Children Have No Bills**: Passengers aged 12 or younger do not incur any bills. (Appendix A.4)
5. **Cryosleep Implies No Bills**: Passengers who are in cryosleep do not incur any bills. (Appendix A.5)
6. **Cabins Shared Within Groups**: Cabins can only be shared by members of the same group. (Appendix A.6)
7. **Home Planets and Deck Restrictions**: Passengers' home planets restrict which decks they can be assigned to. (Appendix B)
    - **Mars**: Decks 'D', 'E', or 'F'
    - **Earth**: Decks 'E', 'F', or 'G'
    - **Europa**: Decks 'A', 'B', 'C', 'D', 'E', or 'T'
    - Passengers with no bills and group members on different decks have further restrictions:
        - **Earth**: Restricted to deck 'G'
        - **Europa**: Restricted to deck 'B'
        - **Mars**: Restricted to decks 'E' and 'F'
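
The appendix checks are not reproduced in this section. As a rough sketch of how an assumption such as the first one could be tested, the snippet below counts the distinct known sides per group on a few made-up rows (the toy data is an assumption for illustration only).

```python
import pandas as pd

# Hypothetical rows; one passenger has no known cabin side
toy = pd.DataFrame({
    'Group': ['0003', '0003', '0006', '0006', '0008'],
    'CabinSide': ['S', 'S', 'P', None, 'S'],
})

# Number of distinct known sides per group (missing values are ignored)
sides_per_group = toy.groupby('Group')['CabinSide'].nunique()

# Groups that would contradict the assumption
conflicting_groups = sides_per_group[sides_per_group > 1]
print(conflicting_groups.empty)
```

An empty result means no group in the data spans both sides, which is what the assumption requires.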



# Feature engineering

To maintain the order of passengers by group, we combine the training and test datasets. This combined dataframe will be used for the rest of the project, ensuring consistency in our imputation process.

By combining the datasets, we have a more comprehensive view of all passengers, which will aid in the structured filling of missing Cabin values.

Below is the code to achieve this:

In [1]:
# Import necessary libraries
import pandas as pd
from collections import defaultdict  # dict subclass that creates a default value for missing keys

# Load the training and test data
training_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

# Add a column to distinguish between the training and test sets
training_data['Set'] = 'Train'
test_data['Set'] = 'Test'

# Combine the training and test datasets (ignore_index gives every row a unique index label)
df = pd.concat([training_data, test_data], ignore_index=True)

# Display the first few rows of the combined dataframe
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Set
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,Train
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,Train
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,Train
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,Train
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,Train


In this section, we perform several essential preprocessing steps to prepare our data for imputation and analysis. These steps include splitting relevant columns, sorting the dataframe, and handling missing values in a structured manner.

First, let's check how many Cabin values are missing in our combined dataframe. As the output below shows, there are 299. This is our starting point; we will address these missing values systematically using the assumptions defined earlier.


In [2]:
# Check the number of missing values in the Cabin column
df['Cabin'].isna().sum()

299

We then split the unique PassengerId into Group and their number in the group, split the Cabin into deck, side, and number, and split their names into first and last name. This segmentation helps us leverage the information contained within these columns more effectively.

Here is the code to achieve these splits:

In [3]:
# Define a function to split columns into multiple components
def column_splits(data_frame):
    # Split PassengerId into Group and GroupNumber
    data_frame[['Group', 'GroupNumber']] = data_frame['PassengerId'].str.split('_', expand=True)
    
    # Split Cabin into CabinDeck, CabinNumber, and CabinSide
    data_frame[['CabinDeck', 'CabinNumber', 'CabinSide']] = data_frame['Cabin'].str.split("/", expand=True)
    data_frame['CabinNumber'] = data_frame['CabinNumber'].astype('Int64')
    
    # Split Name into FirstName and LastName (n=1 splits on the first space only,
    # so longer names cannot produce extra columns)
    data_frame[['FirstName', 'LastName']] = data_frame['Name'].str.split(" ", n=1, expand=True)

    return data_frame

# Apply the function to the combined dataframe
df = column_splits(df)

# Sort the dataframe by Group and GroupNumber
df = df.sort_values(by=['Group', 'GroupNumber'])
df = df.reset_index(drop=True)

# Display the first few rows of the modified dataframe
df.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Name,Transported,Set,Group,GroupNumber,CabinDeck,CabinNumber,CabinSide,FirstName,LastName
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,Maham Ofracculy,False,Train,1,1,B,0,P,Maham,Ofracculy
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,Juanna Vines,True,Train,2,1,F,0,S,Juanna,Vines
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,Altark Susent,False,Train,3,1,A,0,S,Altark,Susent
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,Solam Susent,False,Train,3,2,A,0,S,Solam,Susent
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,Willy Santantines,True,Train,4,1,F,1,S,Willy,Santantines


Total bills are the sum of each passenger's RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck payments.
We can set bills to 0 for anyone who is under 13 or in cryosleep (Appendix A.4, A.5).

In [4]:
# Calculate total bills for each passenger
df['Bills'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']

# Impute bills to be zero for passengers under 13 or in cryosleep
df.loc[df['Age'] < 13, 'Bills'] = 0
df.loc[df['CryoSleep'] == True, 'Bills'] = 0
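
A small worked example, on made-up rows, of how this rule behaves when a spending column is missing. Adding columns with `+` gives NaN whenever any term is NaN, so only the under-13 and cryosleep rules can turn such a row into a known 0. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

spend_columns = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Hypothetical passengers: an adult, a child, a cryosleeper, and an adult
# with one unknown spending value
toy = pd.DataFrame({
    'Age': [30.0, 8.0, 40.0, 25.0],
    'CryoSleep': [False, False, True, False],
    'RoomService': [10.0, np.nan, np.nan, np.nan],
    'FoodCourt': [5.0, 0.0, 0.0, 20.0],
    'ShoppingMall': [0.0, 0.0, 0.0, 0.0],
    'Spa': [0.0, 0.0, 0.0, 0.0],
    'VRDeck': [1.0, 0.0, 0.0, 0.0],
})

# skipna=False reproduces the behaviour of adding the columns with "+"
toy['Bills'] = toy[spend_columns].sum(axis=1, skipna=False)
toy.loc[toy['Age'] < 13, 'Bills'] = 0
toy.loc[toy['CryoSleep'] == True, 'Bills'] = 0

print(toy['Bills'].tolist())  # [16.0, 0.0, 0.0, nan]
```

The last passenger's total stays unknown, because neither rule applies and one of the spending values is missing.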


We then add a useful column to our dataframe: GroupSize, which indicates the number of passengers in each group. It lets us tell solo travellers, who cannot share a cabin, apart from passengers in larger groups, a distinction the imputation functions rely on:


In [5]:
# Define a function to add a GroupSize column
def add_group_size_column(dataframe):
    dataframe['GroupSize'] = dataframe.groupby('Group')['Group'].transform('count')
    return dataframe

# Apply the function to the combined dataframe
df = add_group_size_column(df)

# Display the first few rows of the dataframe with the new GroupSize column
df.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Set,Group,GroupNumber,CabinDeck,CabinNumber,CabinSide,FirstName,LastName,Bills,GroupSize
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,Train,1,1,B,0,P,Maham,Ofracculy,0.0,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,Train,2,1,F,0,S,Juanna,Vines,736.0,1
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,Train,3,1,A,0,S,Altark,Susent,10383.0,2
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,Train,3,2,A,0,S,Solam,Susent,5176.0,2
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,Train,4,1,F,1,S,Willy,Santantines,1091.0,1


To further enhance our dataset, we define a function to impute missing values based on shared features. For instance, rows with missing values for HomePlanet can be imputed if they share a group with someone whose HomePlanet is known or share a last name with someone whose HomePlanet is known (Appendix A.2, A.3):


In [6]:
# Define a function to impute attributes based on shared features
def impute_attribute_by_shared_features(dataframe, attribute, shared_feature):
    # Iterate through rows with missing values for the specified attribute
    for index, row in dataframe[dataframe[attribute].isna()].iterrows():
        # Find rows that share the specified feature and have known values for the attribute
        rows_with_shared_features = dataframe[dataframe[shared_feature] == row[shared_feature]].dropna(subset=[attribute])
        
        # Impute the attribute if there are rows with shared features and known values
        if not rows_with_shared_features.empty:
            dataframe.loc[index, attribute] = rows_with_shared_features[attribute].iloc[0]

    return dataframe

# Impute missing HomePlanet values based on shared group or last name
df = impute_attribute_by_shared_features(df, 'HomePlanet', 'Group')
df = impute_attribute_by_shared_features(df, 'HomePlanet', 'LastName')

# Display the first few rows of the dataframe with imputed HomePlanet values
df.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Set,Group,GroupNumber,CabinDeck,CabinNumber,CabinSide,FirstName,LastName,Bills,GroupSize
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,Train,1,1,B,0,P,Maham,Ofracculy,0.0,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,Train,2,1,F,0,S,Juanna,Vines,736.0,1
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,Train,3,1,A,0,S,Altark,Susent,10383.0,2
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,Train,3,2,A,0,S,Solam,Susent,5176.0,2
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,Train,4,1,F,1,S,Willy,Santantines,1091.0,1
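
For larger dataframes, the same "first known value in the group" idea can also be written without a Python loop. This is an alternative sketch on made-up rows, not the notebook's method; it relies on pandas' grouped `first`, which skips missing values.

```python
import pandas as pd

# Hypothetical rows: group 0011 has no known HomePlanet at all
toy = pd.DataFrame({
    'Group': ['0010', '0010', '0011', '0012', '0012'],
    'HomePlanet': [None, 'Earth', None, 'Mars', None],
})

# First known HomePlanet within each group (missing values are skipped)
first_known = toy.groupby('Group')['HomePlanet'].transform('first')

# Only fill rows that are missing; known values are left untouched
toy['HomePlanet'] = toy['HomePlanet'].fillna(first_known)
print(toy['HomePlanet'].tolist())
```

Groups with no known value, such as 0011 here, simply stay missing.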


## Determining Potential Cabin Decks and Sides

For passengers with missing Cabin values, we can limit their options by excluding certain cabin decks or sides based on their attributes. This step is crucial for our structured imputation process. Specifically:

1. **Home Planet Deck Restrictions**: Each home planet restricts passengers to certain decks. Additionally, if a passenger has no bills and their group members are on multiple decks, their deck is further restricted based on their home planet (Appendix B).
2. **Group Cabin Side Consistency**: Every group is restricted to a single cabin side, even if they are split into multiple cabins (Appendix A.1).

We define functions to add columns for potential decks and sides for passengers with missing Cabin values.


In [7]:
# Define a function to add a column for potential decks based on home planet and other conditions
def add_potential_decks_column(dataframe):
    # Define potential decks for each home planet
    potential_decks_by_homeplanet = {
        'Earth': ['E', 'F', 'G'],
        'Europa': ['A', 'B', 'C', 'D', 'E', 'T'],
        'Mars': ['D', 'E', 'F']
    }

    # Define restricted decks for passengers with no bills
    potential_decks_by_homeplanet_no_bills = {
        'Earth': ['G'],
        'Europa': ['B'],
        'Mars': ['E', 'F']
    }
    
    # Inner function to determine potential decks for each passenger
    def func_potential_decks_apply(row):
        # If the Cabin value is missing
        if pd.isna(row.Cabin):
            # If the passenger has no bills, a known HomePlanet, and is part of a group
            if row.Bills == 0 and not pd.isna(row.HomePlanet) and row.GroupSize > 1:
                # Get the decks of other group members
                group_members = dataframe[(dataframe.Group == row.Group) & (dataframe.PassengerId != row.PassengerId)].CabinDeck
                
                # If the other group members are spread over more than one known deck,
                # restrict to the no-bills decks for this home planet
                if group_members.dropna().nunique() > 1:
                    return potential_decks_by_homeplanet_no_bills[row.HomePlanet]
                
                # If every other group member has a known deck (and they all share one deck),
                # allow the no-bills decks plus that deck
                elif not group_members.isna().any():
                    return list(set(potential_decks_by_homeplanet_no_bills[row.HomePlanet] + list(group_members.dropna().unique())))
                
                # Otherwise some group members are missing a deck; if exactly one deck is known,
                # check whether the first listed member's deck is one of the no-bills decks
                # (iloc[0] may be NaN, in which case we fall through to the standard decks)
                if group_members.nunique() == 1:
                    if group_members.iloc[0] in potential_decks_by_homeplanet_no_bills[row.HomePlanet]:
                        return potential_decks_by_homeplanet_no_bills[row.HomePlanet]
            
            # In all remaining cases with a known HomePlanet, return the standard decks for that planet
            if not pd.isna(row.HomePlanet):
                return potential_decks_by_homeplanet[row.HomePlanet]
            
            # If the HomePlanet is unknown, return all unique decks in the dataframe
            else:
                return list(dataframe.CabinDeck.dropna().unique())
    
    # Apply the inner function to each row in the dataframe
    dataframe['PotentialDecks'] = dataframe.apply(func_potential_decks_apply, axis=1)
    return dataframe

# Define a function to add a column for potential sides based on group consistency
def add_potential_sides_column(dataframe):
    # Inner function to determine potential sides for each passenger
    def func_potential_sides_apply(row):
        # If the Cabin value is missing
        if pd.isna(row.Cabin):
            # Get the known sides within the group (the passenger's own side is missing, so it drops out)
            group_sides = dataframe[dataframe.Group == row.Group].CabinSide.dropna()
            
            # If other group members have a known side, return that side
            if group_sides.nunique() > 0:
                return [group_sides.iloc[0]]
            
            # If no group members have a known side, return both possible sides
            return ['P', 'S']
        
    # Apply the inner function to each row in the dataframe
    dataframe['PotentialSides'] = dataframe.apply(func_potential_sides_apply, axis=1)
    return dataframe

# Apply the functions to add potential decks and sides columns
df = add_potential_decks_column(df)
df = add_potential_sides_column(df)

In [8]:
# Sort the dataframe by Group and GroupNumber
df = df.sort_values(by=['Group', 'GroupNumber'])
df = df.reset_index(drop=True)


By adding these columns, we can better manage the imputation of missing Cabin values by limiting the possible options based on the passenger's attributes and group dynamics.

Sorting the dataframe by Group and GroupNumber is useful as it allows us to fill in free cabins in a logical order.

With this preparation, we are now ready to proceed with the structured imputation of missing Cabin values.


In [9]:
# Display the first few rows of the dataframe with potential decks and sides
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,GroupNumber,CabinDeck,CabinNumber,CabinSide,FirstName,LastName,Bills,GroupSize,PotentialDecks,PotentialSides
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,1,B,0,P,Maham,Ofracculy,0.0,1,,
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,1,F,0,S,Juanna,Vines,736.0,1,,
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,1,A,0,S,Altark,Susent,10383.0,2,,
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,2,A,0,S,Solam,Susent,5176.0,2,,
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,1,F,1,S,Willy,Santantines,1091.0,1,,


# Imputing functions

## Imputing Missing Cabin Values

To accurately impute missing Cabin values, we utilize several helper functions. These functions ensure that when a cabin is filled, the corresponding Deck, Number, and Side are also updated. Additionally, we identify potential cabin options for passengers based on the possible decks and sides determined earlier.




### Helper Function to Impute Cabin Details

First, we define a function to fill a Cabin and its corresponding Deck, Number, and Side for a given passenger.


In [10]:
# Define a function to impute cabin details for a given passenger index
def impute_from_cabin_and_index(dataframe, cabin, index):
    # Split the cabin string into Deck, Number, and Side
    cabin_deck = cabin.split("/")[0]
    cabin_number = int(cabin.split("/")[1])
    cabin_side = cabin.split("/")[2]
    
    # Update the dataframe with the cabin details
    dataframe.loc[index, ['Cabin', 'CabinDeck', 'CabinNumber', 'CabinSide']] = [cabin, cabin_deck, cabin_number, cabin_side]
    
    return dataframe


### Finding Potential Cabin Options

Next, we define the main function, which finds every passenger missing a cabin and collects all the cabins they could potentially fill. For each potential deck and side, it takes the highest cabin number among passengers earlier in the dataframe and the lowest cabin number among passengers later in the dataframe; any numbers strictly in between are free. For instance, if Cabin A/02/S comes before and A/03/S comes after, there is no room in between for the passenger to fill. However, if Cabin A/05/P comes before and A/07/P comes after, then A/06/P is a potential cabin the passenger could take.


In [11]:
# Define a function to find potential cabin options for passengers missing a cabin
def passengers_empty_cabin_options(dataframe):
    # Filter dataframe to find passengers without a cabin
    df_passengers_without_cabin = dataframe[dataframe['Cabin'].isna()]
    
    # Dictionary to store potential cabin options for each passenger
    all_passenger_cabin_options = {}

    # Iterate through each passenger without a cabin
    for passenger_index, passenger in df_passengers_without_cabin.iterrows():
        all_passenger_cabin_options[passenger_index] = []

        # Iterate through each potential deck for the passenger
        for deck in passenger.PotentialDecks:
            # Iterate through each potential side for the passenger
            for side in passenger.PotentialSides:
                # Filter dataframe for the current deck and side
                df_filtered = dataframe[(dataframe['CabinDeck'] == deck) & (dataframe['CabinSide'] == side)]

                # Find the maximum cabin number before the current passenger index
                max_cabin_no_before = max(df_filtered.loc[df_filtered.index < passenger_index, 'CabinNumber'].dropna().unique(), default=-1)
                
                # Find the minimum cabin number after the current passenger index
                min_cabin_no_after = min(df_filtered.loc[df_filtered.index > passenger_index, 'CabinNumber'].dropna().unique(), default=-1)

                # If no cabins were found of that deck and side before or after the row
                if max_cabin_no_before == -1 or min_cabin_no_after == -1:
                    continue
                
                # If there is a gap between the maximum cabin number before and the minimum cabin number after
                # Then there are potential cabins the passenger can fill
                if max_cabin_no_before + 1 < min_cabin_no_after:
                    potential_cabins = [f"{deck}/{i}/{side}" for i in range(max_cabin_no_before + 1, min_cabin_no_after)]
                    all_passenger_cabin_options[passenger_index].extend(potential_cabins)

    return all_passenger_cabin_options



By identifying the potential cabins for each passenger, we can systematically fill in the missing Cabin values. This ensures that each imputation is consistent with the constraints and assumptions defined earlier.
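
The gap rule at the heart of `passengers_empty_cabin_options()` can be seen in isolation in the sketch below. The helper name `free_cabins_between` is hypothetical; it reproduces only the `range(...)` step of the function, using the two examples given earlier.

```python
def free_cabins_between(deck, side, max_number_before, min_number_after):
    """Cabins on one deck and side whose numbers lie strictly between the
    highest number seen before a passenger and the lowest number seen after."""
    return [f"{deck}/{number}/{side}"
            for number in range(max_number_before + 1, min_number_after)]

print(free_cabins_between('A', 'S', 2, 3))  # []
print(free_cabins_between('A', 'P', 5, 7))  # ['A/6/P']
```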

In the next section, we will use these potential cabin options to impute the missing values and complete our dataset.


## Imputing Cabins for Passengers in Solo Groups with Only One Cabin Option

A passenger with only one cabin option from `passengers_empty_cabin_options()` cannot automatically be given that cabin, because they might instead be sharing a cabin with someone. This function therefore checks that the passenger is alone in their group: since cabins can only be shared with a group member (Appendix A.6), a solo passenger cannot be sharing. If they are alone in their group and have exactly one cabin option based on their position onboard, that cabin is imputed for them.

### Function to Impute Cabins for Solo Group Passengers


In [12]:
# Define a function to impute cabins for passengers in solo groups with only one cabin option
def solo_group_one_cabin_option(dataframe):
    # Get potential cabin options for passengers missing a cabin
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)

    # Iterate through passengers who don't have a cabin yet and are alone in their group
    for passenger_index in list(dataframe[(dataframe.Cabin.isna()) & (dataframe.GroupSize == 1)].index):
        # If they have only one free cabin that they could fill
        if len(all_passenger_cabin_options[passenger_index]) == 1:
            matching_cabin = all_passenger_cabin_options[passenger_index][0]
            dataframe = impute_from_cabin_and_index(dataframe, matching_cabin, passenger_index)

    return dataframe



By identifying and imputing cabins for passengers who are alone in their group and have only one cabin option, we ensure that these passengers are assigned cabins in a consistent and logical manner.

In the next step, we will handle the remaining passengers with missing Cabin values by considering additional constraints and options.


## Imputing Cabins for Passengers with No Suitable Free Cabins

The next function handles passengers who do not have any empty cabin options and therefore must share a cabin with a member of their group. Some passengers belong to groups with multiple cabins, which makes it challenging to determine which cabin they should share. This problem is resolved if exactly one of the cabins occupied by their group members lies on one of the passenger's potential decks (the side is already shared by the whole group).

### Function to Impute Cabins for Passengers with No Suitable Free Cabins


In [13]:
# Define a function to impute cabins for passengers who have no suitable free cabins and must share
def no_suitable_cabin_so_shares(dataframe):
    # Get potential cabin options for passengers missing a cabin
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)
    
    # Iterate through each passenger and their potential cabin options
    for passenger_index, passenger_cabin_options in all_passenger_cabin_options.items():
        # If there are no free cabins that the passenger can fill
        if not passenger_cabin_options:
            passenger_row = dataframe.loc[passenger_index]
            
            # Find all other group members' cabins that are in the same potential decks as the passenger
            passengers_group_cabins = dataframe[(dataframe['Group'] == passenger_row['Group']) &
                                                (dataframe['CabinDeck'].isin(passenger_row['PotentialDecks']))].Cabin.dropna()
            
            # If there is only one cabin from their group they could share with
            if passengers_group_cabins.nunique() == 1:
                matching_cabin = passengers_group_cabins.iloc[0]
                dataframe = impute_from_cabin_and_index(dataframe, matching_cabin, passenger_index)
                
    return dataframe





By identifying passengers who do not have any suitable free cabins and ensuring they share a cabin with a member of their group, we maintain consistency in the imputation process. This method is only applied when there is a clear, single cabin option that meets all requirements for sharing.

With this approach, we further refine our dataset and fill in more missing Cabin values systematically.

In the next step, we look at cabins for which only a single passenger is a candidate.


## Imputing Cabins for the Only Passenger That Can Take a Certain Cabin

This final function works based on the assumption that every cabin is filled (i.e., there are no gaps in the cabin numbers). If a passenger is the only one that suits a certain cabin, then that passenger will have that cabin allocated to them.

### Function to Impute Cabins for the Only Matching Passenger


In [14]:
# Define a function to impute cabins for the only matching passenger for certain cabins
def only_matching_passenger_for_cabin(dataframe):
    # Get potential cabin options for passengers missing a cabin
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)
    
    # Dictionary to store which passengers can fit each cabin
    cabins_to_fill = defaultdict(list)
    
    # Iterate over each passenger and their potential cabin options
    for passenger_index, cabin_options in all_passenger_cabin_options.items():
        for cabin in cabin_options:
            cabins_to_fill[cabin].append(passenger_index)
    
    # Iterate over each cabin and impute passengers where only one passenger fits
    for cabin, passengers_indices in cabins_to_fill.items():
        if len(passengers_indices) == 1:
            dataframe = impute_from_cabin_and_index(dataframe, cabin, passengers_indices[0])
    
    return dataframe


# Display the first few rows of the dataframe (unchanged so far: the functions above have only been defined, not applied)
df.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,GroupNumber,CabinDeck,CabinNumber,CabinSide,FirstName,LastName,Bills,GroupSize,PotentialDecks,PotentialSides
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,1,B,0,P,Maham,Ofracculy,0.0,1,,
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,1,F,0,S,Juanna,Vines,736.0,1,,
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,1,A,0,S,Altark,Susent,10383.0,2,,
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,2,A,0,S,Solam,Susent,5176.0,2,,
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,1,F,1,S,Willy,Santantines,1091.0,1,,


By assigning cabins to passengers who are the only ones suitable for certain cabins, we ensure that the imputation process is thorough and consistent. This method helps to finalize the imputation of missing Cabin values by addressing the remaining gaps systematically.
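
The key step in `only_matching_passenger_for_cabin()` is turning the "passenger to cabins" mapping into a "cabin to passengers" mapping. A minimal sketch with made-up row indices and options (all values below are hypothetical):

```python
from collections import defaultdict

# Hypothetical output of passengers_empty_cabin_options():
# row index -> free cabins that row could take
options_by_passenger = {
    10: ['B/13/P', 'C/13/S'],
    11: ['B/13/P'],
    12: ['E/20/P'],
}

# Invert the mapping: cabin -> row indices that could take it
passengers_by_cabin = defaultdict(list)
for passenger_index, cabins in options_by_passenger.items():
    for cabin in cabins:
        passengers_by_cabin[cabin].append(passenger_index)

# Cabins with exactly one candidate can be assigned immediately
forced = {cabin: candidates[0]
          for cabin, candidates in passengers_by_cabin.items()
          if len(candidates) == 1}
print(forced)  # {'C/13/S': 10, 'E/20/P': 12}
```

Once row 10 takes C/13/S, row 11 becomes the only candidate for B/13/P, which is why a later pass can fill cabins that an earlier pass left open.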

With these three functions defined, we can now apply them to the dataframe to fill in missing Cabin values based on the assumptions and constraints defined earlier.


## Imputation Process

Each cabin we impute reduces the number of free cabins available to the remaining passengers and reduces the competition for certain cabins, so a later pass can resolve cases that an earlier pass could not. We therefore run the imputation functions twice.

### Function to Apply All Imputation Steps


In [15]:
# Define a function to apply all imputation steps iteratively
def all_imputes(dataframe):
    # Apply the imputation functions in sequence
    dataframe = solo_group_one_cabin_option(dataframe)
    dataframe = no_suitable_cabin_so_shares(dataframe)
    dataframe = only_matching_passenger_for_cabin(dataframe)

    # Apply the imputation functions again to ensure thorough filling of cabins
    dataframe = solo_group_one_cabin_option(dataframe)
    dataframe = no_suitable_cabin_so_shares(dataframe)
    dataframe = only_matching_passenger_for_cabin(dataframe)
    
    return dataframe

# Apply the all_imputes function to the dataframe
df = all_imputes(df)

# Check the number of missing values in the Cabin column after imputation
df.Cabin.isna().sum()


37

Running the imputation functions twice reduces the number of missing Cabin values from 299 to 37. There are still a few more that we can find that those functions did not cover.
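
Instead of fixing the number of passes at two, the steps could be repeated until a full round fills nothing new. The sketch below is one possible way to do this, not the notebook's code: `apply_until_stable` and the stand-in step `fill_one` are hypothetical names, and whether extra rounds would fill anything beyond the two passes above has not been checked here.

```python
import pandas as pd

def apply_until_stable(dataframe, steps, max_rounds=10):
    """Apply each step in order, repeating until a full round fills nothing."""
    for _ in range(max_rounds):
        missing_before = dataframe['Cabin'].isna().sum()
        for step in steps:
            dataframe = step(dataframe)
        if dataframe['Cabin'].isna().sum() == missing_before:
            break
    return dataframe

# Stand-in step for illustration only: fills one missing cabin per call
def fill_one(dataframe):
    missing = dataframe.index[dataframe['Cabin'].isna()]
    if len(missing) > 0:
        dataframe.loc[missing[0], 'Cabin'] = 'X/0/P'
    return dataframe

toy = pd.DataFrame({'Cabin': ['A/0/P', None, None, None]})
toy = apply_until_stable(toy, [fill_one])
print(toy['Cabin'].isna().sum())  # 0
```

With the notebook's functions, the call would pass `[solo_group_one_cabin_option, no_suitable_cabin_so_shares, only_matching_passenger_for_cabin]` as `steps`.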

## Manual Imputation of Remaining Cabins

After applying our imputation functions, 37 cabins still remain unfilled. To address these, we can manually deduce some of the remaining cabins using a helper function that prints out useful data for each passenger with missing cabin information. This function will provide insights into the potential cabin options for each passenger, which can help us manually impute the remaining cabins.

### Function to Print Cabin Options for Each Passenger


In [16]:
# Define a function to print potential cabin options for each passenger with missing cabin information
def all_cabin_options_for_each_row(dataframe):
    # Get potential cabin options for passengers missing a cabin
    all_passenger_cabin_options = passengers_empty_cabin_options(dataframe)
    
    # Iterate through each passenger and their potential cabin options
    for passenger_index, passenger_options in all_passenger_cabin_options.items():
        print()
        # passenger_index is an index label, so look it up with .loc rather than .iloc
        print("PassengerId:", dataframe.loc[passenger_index].PassengerId, "GroupSize:", dataframe.loc[passenger_index].GroupSize)
        print("Free cabins that match:")
        print(passenger_options)




By printing out the potential cabin options for each passenger with missing cabin information, we can manually review and deduce the best possible cabin assignments for the remaining passengers. This manual step ensures that we leave no stone unturned in our imputation process.

With these insights, we can further reduce the number of missing Cabin values and improve the completeness of our dataset.


In [17]:
all_cabin_options_for_each_row(df)


PassengerId: 0293_01 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']

PassengerId: 0310_01 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']

PassengerId: 0348_02 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']

PassengerId: 0364_02 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']

PassengerId: 0374_02 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']

PassengerId: 1011_01 GroupSize: 2
Free cabins that match:
['E/58/P']

PassengerId: 1041_01 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S', 'E/58/P']

PassengerId: 1095_01 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S']

PassengerId: 1709_03 GroupSize: 7
Free cabins that match:
[]

PassengerId: 2092_03 GroupSize: 5
Free cabins that match:
[]

PassengerId: 2513_01 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']

PassengerId: 2514_01 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']

PassengerId: 3034_01 GroupSize: 1
Free cabins that match:
['B/98/P', 'B

## Manual Imputation Reasoning

Despite our algorithmic efforts, some cabins remain unfilled due to specific constraints and options available. By analyzing the potential options for these passengers manually, we can deduce the best possible cabin assignments for the remaining cases.

### Manual Imputation Details

1. **Passenger 1011_01, Cabin E/58/P**
   - The cabin can only be filled by 1011_01 or 1041_01. However, 1041_01 and 1095_01 are the only two passengers who can fill C/40/S and D/36/S, so one of them has to fill C/40/S and the other D/36/S. This leaves passenger 1011_01 to fill E/58/P.

2. **Passengers 3034_01, 3053_01, Cabins B/98/P, B/99/P**
   - These passengers were not imputed automatically because the two consecutive free cabins gave each of them more than one option. As no one else can fill these cabins and passenger 3034_01 comes before 3053_01, they are filled in that order.

3. **Passengers 4637_01, 4652_01, Cabins E/300/S, E/301/S**
   - These passengers are the only two that can fill these cabins. They weren't imputed earlier as the free cabins are consecutive.

4. **Passenger 6028_04, Cabin A/57/P**
   - Passengers 6048_01 and 6060_01 are each alone in their groups and are the only candidates for two free cabins, so one must fill D/191/P and the other must fill E/387/P. This leaves passenger 6028_04 no other option but to join the only cabin that the rest of its group is in.

5. **Passengers 9223_01, 9223_02, Cabin F/1785/S**
   - These two passengers are the only members of their group and have one option for a cabin, so they both must share F/1785/S.
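
The elimination used in cases 1 and 4 can be sketched in code. This is a minimal illustration, not the notebook's implementation: the helper name `lock_cabins` is hypothetical, the option lists are copied from the printed output above, and it assumes that every listed free cabin must end up occupied by one of the listed passengers.

```python
from collections import defaultdict


def lock_cabins(options):
    """Narrow each passenger's options using cabins that only they can fill.

    If k cabins share exactly the same k candidate passengers, those
    passengers must take those cabins, so their other options are dropped.
    Assumption: every free cabin has to be occupied by someone.
    """
    options = {pid: set(cabins) for pid, cabins in options.items()}
    changed = True
    while changed:
        changed = False
        # Invert the mapping: cabin -> passengers who could take it.
        candidates = defaultdict(set)
        for pid, cabins in options.items():
            for cabin in cabins:
                candidates[cabin].add(pid)
        # Bucket cabins by their exact candidate set.
        buckets = defaultdict(set)
        for cabin, pids in candidates.items():
            buckets[frozenset(pids)].add(cabin)
        for pids, cabins in buckets.items():
            if len(pids) != len(cabins):
                continue
            for pid in pids:
                # options[pid] always contains `cabins`; drop anything extra.
                if options[pid] != cabins:
                    options[pid] = set(cabins)
                    changed = True
    return options


# Option lists copied from the output of all_cabin_options_for_each_row(df).
example = {
    "1011_01": ["E/58/P"],
    "1041_01": ["C/40/S", "D/36/S", "E/58/P"],
    "1095_01": ["C/40/S", "D/36/S"],
}
narrowed = lock_cabins(example)
# Passengers who can still take E/58/P after the elimination.
sole_candidates = [pid for pid, cabins in narrowed.items() if "E/58/P" in cabins]
print(sole_candidates)
```

After the elimination, 1041_01 is restricted to C/40/S and D/36/S, which leaves 1011_01 as the only candidate for E/58/P.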


In [18]:
# List of manually determined cabin assignments
cabin_list = [
    (1429, 'E/58/P'), (4233, 'B/98/P'), (4254, 'B/99/P'), 
    (6493, 'E/300/S'), (6514, 'E/301/S'), (8413, 'A/57/P'), 
    (12892, 'F/1785/S'), (12893, 'F/1785/S')
]

# Apply the manual imputation
for index, cabin in cabin_list:
    df = impute_from_cabin_and_index(df, cabin, index)

# Check the number of missing values in the Cabin column after manual imputation
df.Cabin.isna().sum()


29

By manually reviewing and assigning these cabins based on the detailed reasoning, we ensure that the imputed values are consistent with our earlier assumptions and constraints. This step further reduces the number of missing Cabin values, enhancing the completeness and accuracy of our dataset.

With these manual imputations, 29 Cabin values remain missing. They are examined in the Remaining Missing Cabins section.


## Conclusion

We have successfully imputed most of the missing Cabin values, reducing the number of unfilled cabins from 299 to just 29. This significant reduction should help improve our standings in the competition.

**Thank you for making it this far! This project required a significant amount of effort and dedication to address the complex problem of imputing missing cabin values. By meticulously analyzing the data and applying structured algorithms, we have managed to reduce the number of missing cabins from 299 to just 29. This progress is a testament to the power of data-driven approaches and careful reasoning.**

**I sincerely hope that the techniques and insights shared in this notebook will be beneficial for your own projects and help you climb the leaderboard. If you found this work helpful, I would greatly appreciate your upvotes and feedback. It would mean a lot to know that my contributions are making a positive impact.**

**Best of luck with your submissions, and may this notebook serve as a valuable resource in your data science journey. If you have any further questions or if new inferences come to light, please feel free to reach out. Together, we can continue to improve and refine our approaches.**

**Thank you once again for your time and effort in reviewing this work. Your support and encouragement are much appreciated!**




Next, we will split the data back into the training and test sets. Additionally, I will detail the remaining 29 passengers and explain why we cannot yet decide which cabin each should take. If any new inferences come to light, please let me know.

In [19]:
# Split the data back into training and test sets
traindata = df[df.Set == 'Train']
testdata = df[df.Set == 'Test']

# Number of Cabin values still missing after manual imputation
# (not rendered: a notebook cell only displays its last expression)
df.Cabin.isna().sum()

# Preview both sets; only testdata.head() is rendered below
traindata.head()
testdata.head()


|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | ... | GroupNumber | CabinDeck | CabinNumber | CabinSide | FirstName | LastName | Bills | GroupSize | PotentialDecks | PotentialSides |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | ... | 1 | G | 3 | S | Nelly | Carsoning | 0.0 | 1 |  |  |
| 22 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | ... | 1 | F | 4 | S | Lerome | Peckers | 2832.0 | 1 |  |  |
| 23 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | ... | 1 | C | 0 | S | Sabih | Unhearfus | 0.0 | 1 |  |  |
| 30 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | ... | 1 | C | 1 | S | Meratz | Caltilter | 7418.0 | 1 |  |  |
| 32 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | ... | 1 | F | 5 | S | Brence | Harperez | 645.0 | 1 |  |  |


In [20]:
all_cabin_options_for_each_row(df)


PassengerId: 0293_01 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']

PassengerId: 0310_01 GroupSize: 1
Free cabins that match:
['B/13/P', 'C/13/S']

PassengerId: 0348_02 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']

PassengerId: 0364_02 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']

PassengerId: 0374_02 GroupSize: 2
Free cabins that match:
['E/20/P', 'E/21/P']

PassengerId: 1041_01 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S']

PassengerId: 1095_01 GroupSize: 1
Free cabins that match:
['C/40/S', 'D/36/S']

PassengerId: 1709_03 GroupSize: 7
Free cabins that match:
[]

PassengerId: 2092_03 GroupSize: 5
Free cabins that match:
[]

PassengerId: 2513_01 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']

PassengerId: 2514_01 GroupSize: 1
Free cabins that match:
['E/150/P', 'F/519/P']

PassengerId: 3287_02 GroupSize: 3
Free cabins that match:
[]

PassengerId: 3411_02 GroupSize: 7
Free cabins that match:
[]

PassengerId: 3598_01 GroupS


### Remaining Missing Cabins

Despite our best efforts, 29 passengers still have no cabin assigned. Here, we detail these remaining cases and explain why we could not impute them:

#### Cases of Two Passengers Alone in Their Group with Two Suitable Cabins
1. **Passengers 0293_01 and 0310_01**: Cabins B/13/P and C/13/S
2. **Passengers 1041_01 and 1095_01**: Cabins C/40/S and D/36/S
3. **Passengers 2513_01 and 2514_01**: Cabins E/150/P and F/519/P
4. **Passengers 3598_01 and 3599_01**: Cabins G/590/P and G/579/S
5. **Passengers 6048_01 and 6060_01**: Cabins D/191/P and E/387/P
6. **Passengers 7182_01 and 7183_01**: Cabins F/1489/P and G/1157/P
7. **Passengers 7463_01 and 7469_01**: Cabins F/1544/P and G/1212/S
    - Passenger 7463_01 also has the option of F/1424/S, but as 7463_01 and 7469_01 are the only passengers that can take F/1544/P and G/1212/S, logically 7463_01 must take one of F/1544/P or G/1212/S, and 7469_01 must take the other.
8. **Passengers 7983_01 and 7995_01**: Cabins C/298/S and E/528/S

#### Cases of Passengers Who Have to Share a Cabin with a Group Member, but There Are Multiple Suitable Cabins
1. **Passenger 1709_03**: Cabins F/326/S, D/61/S, E/127/S
2. **Passenger 2092_03**: Cabins D/70/S, E/153/S, F/410/S
3. **Passenger 3287_02**: Cabins G/522/S, F/621/S
4. **Passenger 3411_02**: Cabins E/232/S, F/645/S
5. **Passenger 8728_07**: Cabins F/1798/P, G/1416/P

#### Other Cases
1. **Passengers 0348_02, 0364_02, and 0374_02**: Each of these passengers is in a group of 2 and can only take cabins E/20/P and E/21/P, meaning that one of them shares with their group member, and the other two take those cabins.
2. **Passenger 7353_03**: Two scenarios are consistent with the constraints.
    - Either 7353_03 takes cabin C/270/S, and then:
        - Passenger 7368_01 takes D/235/P
        - Passenger 7429_01 takes F/1424/S
        - Passenger 7503_02 takes G/1206/S
        - Passenger 7442_02 shares E/495/S with a group member
    - Or 7353_03 shares C/269/S with a group member, and then:
        - Passenger 7368_01 takes C/270/S
        - Passenger 7429_01 takes D/235/P
        - Passenger 7503_02 takes F/1424/S
        - Passenger 7442_02 takes G/1206/S
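
To see why the two-passenger, two-cabin cases above cannot be decided, it helps to enumerate every assignment that satisfies the constraints. The following is a brute-force sketch for small cases only. The function name `all_assignments` is hypothetical, and the option lists are copied from the printed output for passengers 0293_01 and 0310_01.

```python
from itertools import permutations


def all_assignments(options):
    """Yield every way to give each passenger a distinct cabin from their list."""
    passengers = sorted(options)
    cabins = sorted({cabin for opts in options.values() for cabin in opts})
    # Try every ordered choice of distinct cabins, one per passenger.
    for perm in permutations(cabins, len(passengers)):
        if all(cabin in options[pid] for pid, cabin in zip(passengers, perm)):
            yield dict(zip(passengers, perm))


pair = {
    "0293_01": ["B/13/P", "C/13/S"],
    "0310_01": ["B/13/P", "C/13/S"],
}
solutions = list(all_assignments(pair))
print(len(solutions))  # 2
for solution in solutions:
    print(solution)
```

Both assignments are valid, so nothing in the constraints used here distinguishes them. A case is only imputable when this enumeration yields the same cabin for a passenger in every solution.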


## Further Assumptions

The imputation assumes that there are no cabins beyond the highest cabin numbers observed in the data. While working on this project, I initially missed imputing many cabins because I believed cabin numbers could extend beyond what we observed. Once I assumed that they do not, the passengers all fit within the given constraints: we did not find any passengers without options or rooms without matching passengers, which would be unlikely if there were more cabins available than assumed.

In conclusion, this project has significantly improved the completeness of our dataset by reducing the number of missing Cabin values from 299 to just 29. These remaining cases present complex scenarios that require further inference or additional data to resolve. If new patterns or information emerge, we can continue to refine our imputation strategy to achieve even higher accuracy.


In [21]:
# Load the reference file to compare against, then align its column name and dtype with df
df_to_comp = pd.read_csv('data/31remaining.csv')
df_to_comp = df_to_comp.rename(columns={'Number': 'CabinNumber'})
df_to_comp['CabinNumber'] = df_to_comp['CabinNumber'].astype('Int64')


In [22]:
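# Print every row whose Cabin differs between df and df_to_comp
for index, row in df.iterrows():
    # NaN != NaN, so skip rows where both values are missing
    if not (pd.isna(row.Cabin) and pd.isna(df_to_comp.iloc[index].Cabin)):
        if row.Cabin != df_to_comp.iloc[index].Cabin:
            print(index, row.Cabin, df_to_comp.iloc[index].Cabin)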

9267 G/1077/S F/1267/S
12651 A/94/P nan
12668 B/297/P nan


# Appendix

For the evidence in the Appendix, I will reuse the combined original dataframes without any imputations, so as not to misrepresent the underlying distributions.

In [23]:
training_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
df = pd.concat([training_data,test_data]) 

df = column_splits(df)
df = df.sort_values(by = ['Group','GroupNumber'])
df = df.reset_index(drop = True)

## A.1

Evidence of passengers sharing a group implying that their cabin is on the same side

In [24]:
# Group by 'Group' and check if all non-NaN 'CabinSide' values within each group are the same
consistent_count = 0
inconsistent_count = 0

# Iterate through each group
for group, group_df in df.groupby('Group'):
    if len(group_df) > 1:
        # Get unique non-NaN CabinSide values
        unique_sides = group_df['CabinSide'].dropna().unique()
        
        if len(unique_sides) <= 1:
            # All rows in this group are consistent
            consistent_count += len(group_df)
        else:
            # Some rows in this group are inconsistent
            inconsistent_count += len(group_df)

print(f"Number of rows with consistent cabin sides: {consistent_count}")
print(f"Number of rows with inconsistent cabin sides: {inconsistent_count}")

Number of rows with consistent cabin sides: 5825
Number of rows with inconsistent cabin sides: 0
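
The same kind of check can also be written without an explicit loop. The sketch below is an alternative formulation, not the code that produced the counts above: the helper name `inconsistent_groups` is hypothetical and the toy frame is made-up data used only to show the behaviour.

```python
import pandas as pd


def inconsistent_groups(frame, key, column):
    """Return the keys whose rows hold more than one distinct non-NaN value."""
    # nunique() ignores missing values by default.
    distinct = frame.groupby(key)[column].nunique()
    return distinct[distinct > 1].index.tolist()


# Made-up data: group 0002 has a missing side, group 0003 is inconsistent.
toy = pd.DataFrame({
    "Group": ["0001", "0002", "0002", "0003", "0003"],
    "CabinSide": ["P", "S", None, "P", "S"],
})
print(inconsistent_groups(toy, "Group", "CabinSide"))  # ['0003']
```

Passing a different `key` and `column` (for example `'Group'` with `'HomePlanet'`, or `'LastName'` with `'HomePlanet'`) covers the other group-level assumptions in the same way. An empty list means no counterexample was found in the data.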


## A.2

Evidence of passengers sharing a group implying that they have the same home planet

In [25]:
# Initialize counters
planet_consistent_count = 0
planet_inconsistent_count = 0

# Iterate through each group
for group, group_df in df.groupby('Group'):
    if len(group_df) > 1:
        # Get unique non-NaN HomePlanet values
        unique_home_planets = group_df['HomePlanet'].dropna().unique()
        
        if len(unique_home_planets) <= 1:
            # All rows in this group are consistent in HomePlanet
            planet_consistent_count += len(group_df)
        else:
            # Some rows in this group are inconsistent in HomePlanet
            planet_inconsistent_count += len(group_df)

print(f"Number of rows with consistent home planets: {planet_consistent_count}")
print(f"Number of rows with inconsistent home planets: {planet_inconsistent_count}")

Number of rows with consistent home planets: 5825
Number of rows with inconsistent home planets: 0


## A.3

Evidence of passengers sharing a last name implying that they have the same home planet

In [26]:
# Initialize counters
planet_consistent_count = 0
planet_inconsistent_count = 0

# Iterate through each last name group
for last_name, group_df in df.groupby('LastName'):
    if len(group_df) > 1:  # Exclude last names with only one passenger
        # Get unique non-NaN HomePlanet values
        unique_home_planets = group_df['HomePlanet'].dropna().unique()
        
        if len(unique_home_planets) <= 1:
            # All rows in this group are consistent in HomePlanet
            planet_consistent_count += len(group_df)
        else:
            # Some rows in this group are inconsistent in HomePlanet
            planet_inconsistent_count += len(group_df)

print(f"Number of rows with consistent home planets by last name: {planet_consistent_count}")
print(f"Number of rows with inconsistent home planets by last name: {planet_inconsistent_count}")

Number of rows with consistent home planets by last name: 12468
Number of rows with inconsistent home planets by last name: 0


## A.4

Evidence of children under the age of 13 having no bills

In [27]:
df['Bills'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
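
Adding the columns with `+` means that `Bills` is NaN whenever any one of the five amounts is missing, which is why the checks that follow filter on `Bills.notna()`. The toy example below (made-up values, two columns only) illustrates the difference from `DataFrame.sum`, which skips missing values by default.

```python
import pandas as pd

# Made-up values: the third passenger has an unknown RoomService amount.
toy = pd.DataFrame({
    "RoomService": [0.0, 10.0, None],
    "FoodCourt": [0.0, 5.0, 0.0],
})

# `+` propagates NaN, so an unknown amount gives an unknown total.
bills_plus = toy["RoomService"] + toy["FoodCourt"]
# sum(axis=1) skips NaN by default, so the unknown amount is treated as 0.
bills_sum = toy[["RoomService", "FoodCourt"]].sum(axis=1)

print(bills_plus.isna().tolist())  # [False, False, True]
print(bills_sum.tolist())          # [0.0, 15.0, 0.0]
```

With `+`, a passenger with an unknown amount is never counted as having zero bills, so a missing value cannot be mistaken for evidence.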


In [28]:
# Filter out rows with NaN values in Bills
df_filtered = df[df['Bills'].notna()]

# Check if passengers under the age of 13 have bills = 0
under_13 = df_filtered[df_filtered['Age'] < 13]
under_13_bills_zero = under_13['Bills'] == 0

# Calculate summary statistics
total_under_13 = len(under_13)
bills_zero_under_13 = under_13_bills_zero.sum()
bills_not_zero_under_13 = total_under_13 - bills_zero_under_13

# Create a summary DataFrame
summary = pd.DataFrame({
    'Total Under 13': [total_under_13],
    'Bills = 0': [bills_zero_under_13],
    'Bills != 0': [bills_not_zero_under_13],
    'Consistency Ratio': [bills_zero_under_13 / total_under_13]
})

# Print summary statistics
print("Summary statistics for passengers under the age of 13 (excluding NaN Bills):")
print(summary)

Summary statistics for passengers under the age of 13 (excluding NaN Bills):
   Total Under 13  Bills = 0  Bills != 0  Consistency Ratio
0            1030       1030           0                1.0


## A.5

Evidence of those in CryoSleep having no bills

In [29]:
# Filter out rows with NaN values in Bills
df_filtered = df[df['Bills'].notna()]

# Check if passengers in CryoSleep have bills = 0
cryo = df_filtered[df_filtered['CryoSleep'] == True]
cryo_bills_zero = cryo['Bills'] == 0

# Calculate summary statistics
total_in_cryo = len(cryo)
cryo_bills_zero = cryo_bills_zero.sum()
cryo_bills_not_zero = total_in_cryo - cryo_bills_zero

# Create a summary DataFrame
summary = pd.DataFrame({
    'Total CryoSleep': [total_in_cryo],
    'Bills == 0': [cryo_bills_zero],
    'Bills != 0': [cryo_bills_not_zero],
    'Consistency Ratio': [cryo_bills_zero / total_in_cryo]
})

# Print summary statistics
print("Summary statistics for passengers in CryoSleep (excluding NaN Bills):")
print(summary)

Summary statistics for passengers in CryoSleep (excluding NaN Bills):
   Total CryoSleep  Bills == 0  Bills != 0  Consistency Ratio
0             4068        4068           0                1.0


## A.6

Evidence of cabins with more than one occupant never being shared between different groups

In [30]:
# Filter cabins with more than one member
cabin_counts = df['Cabin'].value_counts()
multi_member_cabins = cabin_counts[cabin_counts > 1].index

# Group by Cabin and list unique groups for each Cabin with more than one member
cabin_group_mapping = df[df['Cabin'].isin(multi_member_cabins)].groupby('Cabin')['Group'].unique()

# Check if any Cabin is associated with more than one group
shared_cabins = cabin_group_mapping[cabin_group_mapping.apply(lambda groups: len(groups) > 1)]

# Print the results
print("Cabins shared by more than one group (with more than one member):")
print(shared_cabins)

# Summary statistics
total_multi_member_cabins = len(cabin_group_mapping)
shared_cabin_count = len(shared_cabins)
unique_cabin_count = total_multi_member_cabins - shared_cabin_count

print(f"\nTotal multi-member cabins: {total_multi_member_cabins}")
print(f"Multi-member cabins shared by multiple groups: {shared_cabin_count}")
print(f"Multi-member cabins unique to one group: {unique_cabin_count}")

Cabins shared by more than one group (with more than one member):
Series([], Name: Group, dtype: object)

Total multi-member cabins: 1684
Multi-member cabins shared by multiple groups: 0
Multi-member cabins unique to one group: 1684


# Appendix B

## Evidence of Home Planets Restricting Which Deck a Passenger's Cabin is On

We have observed that certain home planets restrict which decks their passengers' cabins are on. To provide evidence of this pattern, we analyze the data by grouping it based on 'HomePlanet' and 'CabinDeck', and counting the occurrences of each combination.

### Grouping by HomePlanet and CabinDeck


In [31]:
# Group by 'HomePlanet' and 'CabinDeck' and count occurrences
deck_counts = df.groupby(['HomePlanet', 'CabinDeck']).size().reset_index(name='Count')

# Pivot the table to get a better overview
pivot_table = deck_counts.pivot(index='CabinDeck', columns='HomePlanet', values='Count').fillna(0).astype(int)

print(pivot_table)

HomePlanet  Earth  Europa  Mars
CabinDeck                      
A               0     346     0
B               0    1124     0
C               0    1081     0
D               0     296   406
E             583     197   508
F            2426       0  1713
G            3700       0     0
T               0      10     0
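
The table can be turned into a per-planet list of candidate decks. The sketch below hard-codes the counts printed above. The names `allowed_decks` and `potential_decks` are hypothetical, and treating a zero count as "not allowed" is an assumption drawn from the observed data rather than a documented rule.

```python
# Counts copied from the pivot table above: deck -> {planet: count}.
deck_counts = {
    "A": {"Earth": 0, "Europa": 346, "Mars": 0},
    "B": {"Earth": 0, "Europa": 1124, "Mars": 0},
    "C": {"Earth": 0, "Europa": 1081, "Mars": 0},
    "D": {"Earth": 0, "Europa": 296, "Mars": 406},
    "E": {"Earth": 583, "Europa": 197, "Mars": 508},
    "F": {"Earth": 2426, "Europa": 0, "Mars": 1713},
    "G": {"Earth": 3700, "Europa": 0, "Mars": 0},
    "T": {"Earth": 0, "Europa": 10, "Mars": 0},
}


def allowed_decks(counts):
    """Map each planet to the decks on which at least one passenger was observed."""
    allowed = {}
    for deck, per_planet in counts.items():
        for planet, count in per_planet.items():
            if count > 0:
                allowed.setdefault(planet, []).append(deck)
    return allowed


potential_decks = allowed_decks(deck_counts)
for planet, decks in potential_decks.items():
    print(planet, decks)
```

Under this reading, Earth passengers are limited to decks E, F and G, Mars passengers to D, E and F, and Europa passengers to A through E plus T.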


## Evidence from Passengers with Bills = 0 in Groups with Multiple CabinDecks

To further support our assumption, we analyze passengers who have bills equal to 0 and belong to groups with multiple CabinDecks.



In [32]:
# Initialize a dictionary to store results
results = {}

# Iterate through each group
for group, group_df in df.groupby('Group'):
    # Check whether the group spans more than one CabinDeck
    unique_decks = group_df['CabinDeck'].dropna().unique()
    
    if len(unique_decks) > 1:
        # Find passengers with bills = 0 and a known HomePlanet
        zero_bill_passengers = group_df[(group_df['Bills'] == 0) & (group_df['HomePlanet'].notna())]
        
        for idx, passenger in zero_bill_passengers.iterrows():
            home_planet = passenger['HomePlanet']
            cabin_deck = passenger['CabinDeck']
            
            if home_planet not in results:
                results[home_planet] = []
            
            results[home_planet].append(cabin_deck)

# Print the results
print("CabinDecks for passengers with bills = 0 in groups with multiple CabinDecks:")
for home_planet, cabin_decks in results.items():
    cabin_deck_counts = pd.Series(cabin_decks).value_counts().to_dict()
    print(f"HomePlanet: {home_planet}")
    for deck, count in cabin_deck_counts.items():
        print(f"  CabinDeck: {deck}, Count: {count}")


CabinDecks for passengers with bills = 0 in groups with multiple CabinDecks:
HomePlanet: Earth
  CabinDeck: G, Count: 526
HomePlanet: Mars
  CabinDeck: F, Count: 160
  CabinDeck: E, Count: 47
HomePlanet: Europa
  CabinDeck: B, Count: 11
