- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

Plan:
1) Split `Cabin` into deck, num and side
2) Drop `Name`
3) Split `PassengerId` into group and pp
4) Encode CryoSleep, VIP, HomePlanet, Destination and the split parts of PassengerId and Cabin as labels
5) Handle NaNs

In [87]:
# import libraries
import numpy as np
import pandas as pd

from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score

RANDOM_STATE = 654321

In [88]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
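# note: reset_index() keeps the old row labels as an extra `index` column, which later ends up among the features
df = pd.concat([train, test]).reset_index()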
train_index = ~df.Transported.isna()
test_index = df.Transported.isna()
print(df.info())
# print(test_index)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   PassengerId   12970 non-null  object 
 2   HomePlanet    12682 non-null  object 
 3   CryoSleep     12660 non-null  object 
 4   Cabin         12671 non-null  object 
 5   Destination   12696 non-null  object 
 6   Age           12700 non-null  float64
 7   VIP           12674 non-null  object 
 8   RoomService   12707 non-null  float64
 9   FoodCourt     12681 non-null  float64
 10  ShoppingMall  12664 non-null  float64
 11  Spa           12686 non-null  float64
 12  VRDeck        12702 non-null  float64
 13  Name          12676 non-null  object 
 14  Transported   8693 non-null   object 
dtypes: float64(6), int64(1), object(8)
memory usage: 1.5+ MB
None


In [89]:
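def cabin2dns(row):
    # split 'deck/num/side' into three columns; pd.notna is a reliable NaN test
    # (an identity check against np.NAN is fragile, and np.NAN was removed in NumPy 2.0)
    if pd.notna(row['Cabin']):
        a = row['Cabin'].split('/')
        row['Deck'] = a[0]
        row['Num'] = int(a[1])
        row['Side'] = a[2]
    else:
        row['Deck'] = np.nan
        row['Num'] = np.nan
        row['Side'] = np.nan
    return row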

In [90]:
def passenger2grouppp(row):
    # split 'gggg_pp' into the group number and the passenger's number within the group
    if pd.notna(row['PassengerId']):
        a = row['PassengerId'].split('_')
        row['Group'] = int(a[0])
        row['Pp'] = int(a[1])
    else:
        row['Group'] = np.nan
        row['Pp'] = np.nan
    return row

In [91]:
df = df.apply(cabin2dns, axis=1)
df.drop(columns=['Cabin'], inplace=True)

In [92]:
df = df.apply(passenger2grouppp, axis=1)
df.drop(columns=['PassengerId'], inplace=True)
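
The two row-wise `apply` calls above can also be written with pandas string methods, which is usually much faster on 12970 rows. This is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative, not from the dataset); it assumes the same `deck/num/side` and `gggg_pp` formats described at the top.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column formats as the competition data (values are made up).
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0003_01', '0003_02'],
    'Cabin': ['B/0/P', np.nan, 'A/0/S'],
})

# Cabin -> Deck / Num / Side; a missing Cabin stays missing in all three parts.
cabin_parts = toy['Cabin'].str.split('/', expand=True)
toy['Deck'] = cabin_parts[0]
toy['Num'] = pd.to_numeric(cabin_parts[1])  # float column, NaN where Cabin was missing
toy['Side'] = cabin_parts[2]

# PassengerId -> Group / Pp; PassengerId has no missing values, so int is safe.
id_parts = toy['PassengerId'].str.split('_', expand=True)
toy['Group'] = id_parts[0].astype(int)
toy['Pp'] = id_parts[1].astype(int)

toy = toy.drop(columns=['Cabin', 'PassengerId'])
print(toy)
```

`Num` ends up as a float column because of the NaN, the same dtype `df.info()` reports for it after the `apply` version.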

In [93]:
df.drop(columns=['Name'], inplace=True)

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   HomePlanet    12682 non-null  object 
 2   CryoSleep     12660 non-null  object 
 3   Destination   12696 non-null  object 
 4   Age           12700 non-null  float64
 5   VIP           12674 non-null  object 
 6   RoomService   12707 non-null  float64
 7   FoodCourt     12681 non-null  float64
 8   ShoppingMall  12664 non-null  float64
 9   Spa           12686 non-null  float64
 10  VRDeck        12702 non-null  float64
 11  Transported   8693 non-null   object 
 12  Deck          12671 non-null  object 
 13  Num           12671 non-null  float64
 14  Side          12671 non-null  object 
 15  Group         12970 non-null  int64  
 16  Pp            12970 non-null  int64  
dtypes: float64(7), int64(3), object(7)
memory usage: 1.7+ MB


In [95]:
group_np = df[['Group', 'Pp']].groupby('Group', as_index=False).count()

# group_np

In [96]:
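def count_group(row):
    # look up the size of this passenger's group in the precomputed group_np table
    # (no `global` needed: group_np is only read here, never reassigned)
    row['Group_count'] = group_np.loc[group_np.Group == row.Group, 'Pp'].iloc[0]
    return row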

In [97]:
df = df.apply(count_group, axis=1)
df



,index,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group,Pp,Group_count
0,0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,0.0,P,1,1,1
1,1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,0.0,S,2,1,1
2,2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,0.0,S,3,1,2
3,3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,0.0,S,3,2,2
4,4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,1.0,S,4,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12965,4272,Earth,True,TRAPPIST-1e,34.0,False,0.0,0.0,0.0,0.0,0.0,,G,1496.0,S,9266,2,2
12966,4273,Earth,False,TRAPPIST-1e,42.0,False,0.0,847.0,17.0,10.0,144.0,,,,,9269,1,1
12967,4274,Mars,True,55 Cancri e,,False,0.0,0.0,0.0,0.0,0.0,,D,296.0,P,9271,1,1
12968,4275,Europa,False,,,False,0.0,2680.0,0.0,0.0,523.0,,D,297.0,P,9273,1,1
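
The group size can also be computed without a per-row lookup. A minimal sketch, assuming the `Group` and `Pp` columns built earlier; the toy values below copy the first five rows of the table above, and the name `toy` is illustrative.

```python
import pandas as pd

# Group / Pp values copied from the first five rows of the table above.
toy = pd.DataFrame({'Group': [1, 2, 3, 3, 4], 'Pp': [1, 1, 1, 2, 1]})

# transform('count') returns one value per original row: the size of that row's group.
toy['Group_count'] = toy.groupby('Group')['Pp'].transform('count')
print(toy['Group_count'].tolist())  # [1, 1, 2, 2, 1]
```

On the full frame the equivalent line would be `df['Group_count'] = df.groupby('Group')['Pp'].transform('count')`.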


In [98]:
categorial = ['CryoSleep', 'VIP', 'HomePlanet', 'Destination', 'Deck', 'Side']

In [99]:
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=10000)
ordinal.fit(df[categorial])
df.loc[:,categorial] = ordinal.transform(df.loc[:,categorial])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   HomePlanet    12682 non-null  float64
 2   CryoSleep     12660 non-null  float64
 3   Destination   12696 non-null  float64
 4   Age           12700 non-null  float64
 5   VIP           12674 non-null  float64
 6   RoomService   12707 non-null  float64
 7   FoodCourt     12681 non-null  float64
 8   ShoppingMall  12664 non-null  float64
 9   Spa           12686 non-null  float64
 10  VRDeck        12702 non-null  float64
 11  Transported   8693 non-null   object 
 12  Deck          12671 non-null  float64
 13  Num           12671 non-null  float64
 14  Side          12671 non-null  float64
 15  Group         12970 non-null  int64  
 16  Pp            12970 non-null  int64  
 17  Group_count   12970 non-null  int64  
dtypes: float64(13), int64(4), object(1)



In [100]:
df['Transported'] = df.Transported.astype('bool')
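
One thing to be aware of with this cast: casting an object column to `bool` turns NaN into `True`, so the unlabeled test rows all become `True` in `df`. It is harmless here only because `train_index` and `test_index` were computed before the cast and `Transported` is dropped from the test features. A small sketch on a made-up series (the name `target` is illustrative):

```python
import numpy as np
import pandas as pd

# Object column with a missing label, like Transported on the test rows.
target = pd.Series([True, False, np.nan], dtype=object)

as_bool = target.astype('bool')
print(as_bool.tolist())  # [True, False, True] -> the NaN silently became True

# Casting only the labeled rows avoids inventing labels.
labeled = target.dropna().astype('bool')
print(labeled.tolist())  # [True, False]
```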

In [101]:
train = df[train_index]
test = df[test_index]

In [102]:
# reassign instead of dropping in place: `train` is a slice of `df`,
# and an in-place drop on it triggers SettingWithCopyWarning
train = train.dropna()


In [103]:
y_train = train.Transported
X_train = train.drop(columns=['Transported'])
X_test = test.drop(columns=['Transported'])

In [104]:
model = RandomForestClassifier(random_state=RANDOM_STATE)

In [105]:
model.fit(X_train, y_train)

In [106]:
pred_train = model.predict(X_train)

In [107]:
accuracy_score(y_train, pred_train)

1.0
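
The 1.0 above is accuracy on the same rows the forest was fit on, so it mostly shows that the trees memorised the training set and says little about test performance. A cross-validated score gives a more honest estimate. A minimal sketch on synthetic data from `make_classification` (the data, `n_estimators=50` and `cv=5` are assumptions for illustration, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

RANDOM_STATE = 654321

# Synthetic stand-in for X_train / y_train (not the competition data).
X, y = make_classification(n_samples=500, n_features=10, random_state=RANDOM_STATE)

model = RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE)

# Accuracy on the rows the model was fit on (optimistic).
train_acc = model.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored on rows the model has not seen.
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print('train accuracy:', train_acc)
print('cv accuracy   :', cv_scores.mean())
```

With the notebook's own data the call would be `cross_val_score(model, X_train, y_train, cv=5)`.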

In [108]:
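# note: isna().count() counts rows (4277 for every column), not missing values;
# X_test.isna().sum() would give the number of NaNs per column
X_test.isna().count()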

index           4277
HomePlanet      4277
CryoSleep       4277
Destination     4277
Age             4277
VIP             4277
RoomService     4277
FoodCourt       4277
ShoppingMall    4277
Spa             4277
VRDeck          4277
Deck            4277
Num             4277
Side            4277
Group           4277
Pp              4277
Group_count     4277
dtype: int64

In [109]:
pred_test = model.predict(X_test)

ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
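
The error comes from step 5 of the plan: NaNs were dropped from the training rows, but the test rows still contain them and cannot be dropped, since every test passenger needs a prediction. One option the message itself points to is an imputer in a pipeline. A minimal sketch on made-up toy frames (the `*_toy` names, the values and the median strategy are assumptions, not the notebook's final choice):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

RANDOM_STATE = 654321

# Toy stand-ins for X_train / y_train / X_test, with NaNs on both sides.
X_train_toy = pd.DataFrame({'Age': [39.0, 24.0, np.nan, 33.0, 16.0, 58.0],
                            'Spa': [0.0, 549.0, 6715.0, np.nan, 565.0, 0.0]})
y_train_toy = pd.Series([False, True, False, False, True, True])
X_test_toy = pd.DataFrame({'Age': [np.nan, 42.0], 'Spa': [10.0, np.nan]})

# The imputer learns column medians on the training rows only
# and reuses them to fill NaNs at predict time.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('forest', RandomForestClassifier(n_estimators=20, random_state=RANDOM_STATE)),
])

pipe.fit(X_train_toy, y_train_toy)
pred = pipe.predict(X_test_toy)
print(pred)
```

With the imputer inside the pipeline the training rows no longer need `dropna()`. For the ordinal-encoded categorical columns, `strategy='most_frequent'` may be a better fit than the median.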