- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

Plan:
1) Split `Cabin` into deck, num and side
2) Drop `Name`
3) Split `PassengerId` into group and pp
4) Encode CryoSleep, VIP, HomePlanet, Destination and the split passenger and cabin parts as labels
5) Fill the NaNs
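
Steps 1 and 3 of the plan can also be done without a row-wise `apply`. The following is a minimal sketch, assuming the `deck/num/side` and `gggg_pp` formats described above; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data in the documented formats (illustrative values only)
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0003_01', '0003_02'],
    'Cabin': ['B/0/P', np.nan, 'A/0/S'],
})

# Cabin -> Deck / Num / Side; a missing Cabin yields NaN in all three parts
cabin_parts = toy['Cabin'].str.split('/', expand=True)
toy['Deck'] = cabin_parts[0]
toy['Num'] = pd.to_numeric(cabin_parts[1])
toy['Side'] = cabin_parts[2]

# PassengerId -> Group / Pp (PassengerId is never missing in this data)
id_parts = toy['PassengerId'].str.split('_', expand=True)
toy['Group'] = id_parts[0].astype(int)
toy['Pp'] = id_parts[1].astype(int)
```

On the full data this does the same job as the two row-wise helpers, usually much faster, because the string splitting is applied to whole columns at once.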

In [84]:
# import libraries
import numpy as np
import pandas as pd
import datetime

from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier, LGBMRegressor

from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score

RANDOM_STATE = 654321

In [85]:
train = pd.read_csv('data/train.csv', index_col=None)
test = pd.read_csv('data/test.csv', index_col=None)
df = pd.concat([train, test]).reset_index(drop=True)
train_index = ~df.Transported.isna()
test_index = df.Transported.isna()
print(df.info())
# print(test_index)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  object 
 2   CryoSleep     12660 non-null  object 
 3   Cabin         12671 non-null  object 
 4   Destination   12696 non-null  object 
 5   Age           12700 non-null  float64
 6   VIP           12674 non-null  object 
 7   RoomService   12707 non-null  float64
 8   FoodCourt     12681 non-null  float64
 9   ShoppingMall  12664 non-null  float64
 10  Spa           12686 non-null  float64
 11  VRDeck        12702 non-null  float64
 12  Name          12676 non-null  object 
 13  Transported   8693 non-null   object 
dtypes: float64(6), object(8)
memory usage: 1.4+ MB
None


In [86]:
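def cabin2dns(row):
    # Split 'deck/num/side' into three columns; a missing Cabin gives NaN in all three.
    # pd.notna is used instead of an identity test against np.NAN (removed in NumPy 2.0).
    if pd.notna(row['Cabin']):
        deck, num, side = row['Cabin'].split('/')
        row['Deck'] = deck
        row['Num'] = int(num)
        row['Side'] = side
    else:
        row['Deck'] = np.nan
        row['Num'] = np.nan
        row['Side'] = np.nan
    return row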

In [87]:
def passenger2grouppp(row):
    # Split 'gggg_pp' into the group id and the passenger's number within the group
    if pd.notna(row['PassengerId']):
        group, pp = row['PassengerId'].split('_')
        row['Group'] = int(group)
        row['Pp'] = int(pp)
    else:
        row['Group'] = np.nan
        row['Pp'] = np.nan
    return row

In [88]:
df = df.apply(cabin2dns, axis=1)
df.drop(columns=['Cabin'], inplace=True)

In [89]:
df = df.apply(passenger2grouppp, axis=1)

In [90]:
df.drop(columns=['Name'], inplace=True)

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  object 
 2   CryoSleep     12660 non-null  object 
 3   Destination   12696 non-null  object 
 4   Age           12700 non-null  float64
 5   VIP           12674 non-null  object 
 6   RoomService   12707 non-null  float64
 7   FoodCourt     12681 non-null  float64
 8   ShoppingMall  12664 non-null  float64
 9   Spa           12686 non-null  float64
 10  VRDeck        12702 non-null  float64
 11  Transported   8693 non-null   object 
 12  Deck          12671 non-null  object 
 13  Num           12671 non-null  float64
 14  Side          12671 non-null  object 
 15  Group         12970 non-null  int64  
 16  Pp            12970 non-null  int64  
dtypes: float64(7), int64(2), object(8)
memory usage: 1.7+ MB


In [92]:
group_np = df[['Group', 'Pp']].groupby('Group', as_index=False).count()
# group_np

In [93]:
def count_group(row):
    global group_np
    row['Group_count'] = group_np[group_np.Group == row.Group].reset_index().Pp[0]
    return row

In [94]:
df = df.apply(count_group, axis=1)
df.drop(columns=['Pp', 'Group'], inplace=True)
df



Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group_count
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,0.0,P,1
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,0.0,S,1
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,0.0,S,2
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,0.0,S,2
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,1.0,S,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12965,9266_02,Earth,True,TRAPPIST-1e,34.0,False,0.0,0.0,0.0,0.0,0.0,,G,1496.0,S,2
12966,9269_01,Earth,False,TRAPPIST-1e,42.0,False,0.0,847.0,17.0,10.0,144.0,,,,,1
12967,9271_01,Mars,True,55 Cancri e,,False,0.0,0.0,0.0,0.0,0.0,,D,296.0,P,1
12968,9273_01,Europa,False,,,False,0.0,2680.0,0.0,0.0,523.0,,D,297.0,P,1
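
The group size can also be computed in one step with `groupby(...).transform`, which avoids the global lookup table and the per-row filter. This is a sketch on made-up toy values, not the notebook's own code:

```python
import pandas as pd

# Toy Group / Pp columns (illustrative values only)
toy = pd.DataFrame({'Group': [1, 2, 3, 3, 4], 'Pp': [1, 1, 1, 2, 1]})

# transform('count') returns one value per row: the number of rows sharing that Group
toy['Group_count'] = toy.groupby('Group')['Pp'].transform('count')
```

Because `transform` keeps the original row order and index, the result can be assigned straight back as a new column.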


In [95]:
label = LabelEncoder()
label.fit(df.Transported)
label.classes_

array([False, True, nan], dtype=object)

In [96]:
df.loc[:,'Transported'] = label.transform(df.loc[:,'Transported'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  object 
 2   CryoSleep     12660 non-null  object 
 3   Destination   12696 non-null  object 
 4   Age           12700 non-null  float64
 5   VIP           12674 non-null  object 
 6   RoomService   12707 non-null  float64
 7   FoodCourt     12681 non-null  float64
 8   ShoppingMall  12664 non-null  float64
 9   Spa           12686 non-null  float64
 10  VRDeck        12702 non-null  float64
 11  Transported   12970 non-null  int64  
 12  Deck          12671 non-null  object 
 13  Num           12671 non-null  float64
 14  Side          12671 non-null  object 
 15  Group_count   12970 non-null  int64  
dtypes: float64(7), int64(2), object(7)
memory usage: 1.6+ MB
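
As `label.classes_` shows, the encoder gives the missing test-set targets a class of their own (2). If the NaN should stay a NaN instead, one alternative is a plain dictionary mapping. This is a sketch of that alternative on made-up values, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy target column: booleans for train rows, NaN for test rows (illustrative values)
target = pd.Series([False, True, np.nan, True], dtype=object)

# Values not present in the mapping (here NaN) come back as NaN
encoded = target.map({False: 0, True: 1})

# The unlabeled rows can still be found afterwards
is_test = encoded.isna()
```

With this variant the target column is float-typed and the train/test split can be recovered from the column itself.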




In [97]:
categorial = ['CryoSleep', 'VIP', 'HomePlanet', 'Destination', 'Deck', 'Side']
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=10000)
ordinal.fit(df[categorial])
df.loc[:,categorial] = ordinal.transform(df.loc[:,categorial])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   12970 non-null  object 
 1   HomePlanet    12682 non-null  float64
 2   CryoSleep     12660 non-null  float64
 3   Destination   12696 non-null  float64
 4   Age           12700 non-null  float64
 5   VIP           12674 non-null  float64
 6   RoomService   12707 non-null  float64
 7   FoodCourt     12681 non-null  float64
 8   ShoppingMall  12664 non-null  float64
 9   Spa           12686 non-null  float64
 10  VRDeck        12702 non-null  float64
 11  Transported   12970 non-null  int64  
 12  Deck          12671 non-null  float64
 13  Num           12671 non-null  float64
 14  Side          12671 non-null  float64
 15  Group_count   12970 non-null  int64  
dtypes: float64(13), int64(2), object(1)
memory usage: 1.6+ MB




In [101]:
values = {
    'RoomService': 0, 
    'FoodCourt': 0,
    'ShoppingMall': 0,
    'Spa': 0,
    'VRDeck': 0
}
df.fillna(values, inplace=True)
df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].isna().sum()

RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

In [130]:
df['sum'] = df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group_count,sum,CrioSleep
0,0001_01,1.0,0.0,2.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1.0,0.0,0.0,1,0.0,0.0
1,0002_01,0.0,0.0,2.0,24.0,0.0,109.0,9.0,25.0,549.0,44.0,1,5.0,0.0,1.0,1,736.0,0.0
2,0003_01,1.0,0.0,2.0,58.0,1.0,43.0,3576.0,0.0,6715.0,49.0,0,0.0,0.0,1.0,2,10383.0,0.0
3,0003_02,1.0,0.0,2.0,33.0,0.0,0.0,1283.0,371.0,3329.0,193.0,0,0.0,0.0,1.0,2,5176.0,0.0
4,0004_01,0.0,0.0,2.0,16.0,0.0,303.0,70.0,151.0,565.0,2.0,1,5.0,1.0,1.0,1,1091.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12965,9266_02,0.0,1.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,2,6.0,1496.0,1.0,2,0.0,1.0
12966,9269_01,0.0,0.0,2.0,42.0,0.0,0.0,847.0,17.0,10.0,144.0,2,,,,1,1018.0,0.0
12967,9271_01,2.0,1.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,2,3.0,296.0,0.0,1,0.0,1.0
12968,9273_01,1.0,0.0,,,0.0,0.0,2680.0,0.0,0.0,523.0,2,3.0,297.0,0.0,1,3203.0,0.0


In [131]:
def fill_cryo(row):
    # Infer a missing CryoSleep from spending: any spend means the passenger was awake.
    # pd.isna is used because an identity test against np.NAN does not match these NaNs,
    # which would leave every missing value unfilled.
    if pd.isna(row['CryoSleep']):
        row['CrioSleep'] = 0 if row['sum'] > 0 else 1
    else:
        row['CrioSleep'] = row['CryoSleep']
    return row

In [129]:
df = df.apply(fill_cryo, axis=1)
df[df['CrioSleep'].isna()]
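
Missing values inside a DataFrame are often NaN objects that are not the `np.nan` singleton, so an identity test can miss them while `pd.isna` does not. The sketch below shows the difference and a vectorized version of the same fill rule (any spend means awake). The column name `CryoSleep_filled` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

value = float('nan')             # a NaN that is a different object from np.nan
identity_match = value is np.nan
isna_match = pd.isna(value)

# Toy frame: encoded CryoSleep (0.0 / 1.0 / NaN) and total spend (illustrative values)
toy = pd.DataFrame({
    'CryoSleep': [0.0, np.nan, np.nan],
    'sum': [736.0, 0.0, 703.0],
})

# Rule: no spend at all -> assume cryosleep (1.0), otherwise awake (0.0)
inferred = (toy['sum'] == 0).astype(float)
toy['CryoSleep_filled'] = toy['CryoSleep'].fillna(inferred)
```

`fillna` with a Series only touches the rows where the original value is missing, so observed values are kept as they are.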


In [None]:
df.drop(columns=['sum'], inplace=True)
ds = df.drop(columns=['Transported', 'PassengerId'])
ds.dropna(inplace=True)

In [None]:
def predict_categorial(field: str):
    # Fit a classifier on the fully observed rows (ds) and fill the missing values of `field` in df
    global df
    global ds
    ds_y = ds[field]
    ds_x = ds.drop(columns=[field])
    ds_model = GridSearchCV(
        LGBMClassifier(random_state=RANDOM_STATE, n_jobs=-1),
        param_grid={
            'max_depth': [5, 7, 9, 11, 13],
            'n_estimators': [50, 75, 100, 125, 150, 175, 200],
        },
        scoring='accuracy',
        verbose=0)
    ds_model.fit(ds_x, ds_y)
    print(f'--------- {field} --------')
    print(f'{field}\tbest score:\t{ds_model.best_score_}')
    print(f'{field}\tbest params:\t{ds_model.best_params_}\n-------------------')
    missing = df[field].isna()
    if missing.any():  # skip prediction when the column has no gaps left
        df.loc[missing, field] = ds_model.predict(
            df.loc[missing].drop(columns=[field, 'PassengerId', 'Transported']))

In [None]:
def predict_numeric(field: str):
    # Fit a regressor on the fully observed rows (ds) and fill the missing values of `field` in df
    global df
    global ds
    ds_y = ds[field]
    ds_x = ds.drop(columns=[field])
    ds_model = GridSearchCV(
        LGBMRegressor(random_state=RANDOM_STATE, n_jobs=-1),
        param_grid={
            'max_depth': [5, 7, 9, 11, 13],
            'n_estimators': [50, 75, 100, 125, 150, 175, 200],
        },
        scoring='explained_variance',
        verbose=0)
    ds_model.fit(ds_x, ds_y)
    print(f'--------- {field} --------')
    print(f'{field}\tbest score:\t{ds_model.best_score_}')
    print(f'{field}\tbest params:\t{ds_model.best_params_}\n-------------------')
    missing = df[field].isna()
    if missing.any():  # the amenity columns were already filled with 0, so they may have no gaps
        df.loc[missing, field] = ds_model.predict(
            df.loc[missing].drop(columns=[field, 'PassengerId', 'Transported']))

In [None]:
predict_numeric('Num') # 0.99
predict_numeric('FoodCourt') # 0.93
predict_numeric('VRDeck') # 0.88
predict_numeric('Spa') # 0.87
predict_numeric('RoomService') # 0.85
predict_numeric('ShoppingMall') # 0.77

predict_numeric('Age') # 0.20

In [None]:
predict_categorial('VIP') # 0.98
predict_categorial('HomePlanet') # 0.96
predict_categorial('Deck') # 0.88

predict_categorial('Destination') # 0.69
predict_categorial('Side') # 0.67

In [None]:
df.isna().sum()

In [None]:
train = df[train_index]
test = df[test_index]

In [None]:
y_train = train.Transported
X_train = train.drop(columns=['Transported'])
X_test = test.drop(columns=['Transported'])

In [None]:
model = GridSearchCV(
    LGBMClassifier(random_state=RANDOM_STATE, n_jobs=-1),
    param_grid={
        'max_depth': [5, 7, 9, 11, 13],
        'n_estimators': [50, 75, 100, 125, 150, 175, 200],
    },
    scoring='accuracy',
    verbose=4
)

In [None]:
model.fit(X_train.drop(columns=['PassengerId']), y_train)

In [None]:
print(f'best score:\t{model.best_score_}')  # 0.70
print(f'best params:\t{model.best_params_}')
model = model.best_estimator_

In [None]:
# Pair each feature name with its importance and sort from most to least important
features = X_train.drop(columns=['PassengerId']).columns
feature_imp = pd.DataFrame({'importance': model.feature_importances_, 'feature': features})
feature_imp = feature_imp.sort_values(by='importance', ascending=False)
feature_imp

In [None]:
pred_train = model.predict(X_train.drop(columns=['PassengerId']))

In [None]:
accuracy_score(y_train, pred_train) # 0.83

In [None]:
X_test.isna().sum()

In [None]:
pred_test = model.predict(X_test.drop(columns=['PassengerId']))

In [None]:
pred = pd.concat([X_test['PassengerId'].reset_index(drop=True), pd.DataFrame(pred_test)], axis=1)
pred.columns = ['PassengerId', 'Transported']
pred

In [None]:
def class2bool(row):
    row['Transported'] = 'True' if row['Transported'] == 1 else 'False'
    return row

In [None]:
pred = pred.apply(class2bool, axis=1)
pred
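
The same conversion can be written as a single column operation. This is a sketch on a made-up two-row frame (the ids are illustrative); it produces real booleans rather than the strings `'True'`/`'False'`, which `to_csv` writes out as `True`/`False` as well.

```python
import pandas as pd

# Toy predictions: class 1 means transported, class 0 means not (illustrative ids)
toy_pred = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01'],
    'Transported': [1, 0],
})

# Compare against the positive class to get a boolean column
toy_pred['Transported'] = toy_pred['Transported'].eq(1)

csv_text = toy_pred.to_csv(index=False)
```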

In [None]:
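# Use a compact timestamp: str(datetime.now()) contains spaces and colons, which are not valid in file names on every OS
pred.to_csv(f'data/attempt_{datetime.datetime.now():%Y%m%d_%H%M%S}.csv', index=False, sep=',', header=True)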