- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

Plan:
1) Cabin split into deck & num  & side
2) delete `Name`
3) PassengerId split into group & pp
4) CryoSleep, VIP, HomePlanet, Destination and parts of passenger and cabin to labels
5) NaNs

In [191]:
# import libraries
import numpy as np
import pandas as pd
import datetime

from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score

RANDOM_STATE = 654321

In [158]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
df = pd.concat([train, test]).reset_index()
train_index = ~df.Transported.isna()
test_index = df.Transported.isna()
print(df.info())
# print(test_index)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   PassengerId   12970 non-null  object 
 2   HomePlanet    12682 non-null  object 
 3   CryoSleep     12660 non-null  object 
 4   Cabin         12671 non-null  object 
 5   Destination   12696 non-null  object 
 6   Age           12700 non-null  float64
 7   VIP           12674 non-null  object 
 8   RoomService   12707 non-null  float64
 9   FoodCourt     12681 non-null  float64
 10  ShoppingMall  12664 non-null  float64
 11  Spa           12686 non-null  float64
 12  VRDeck        12702 non-null  float64
 13  Name          12676 non-null  object 
 14  Transported   8693 non-null   object 
dtypes: float64(6), int64(1), object(8)
memory usage: 1.5+ MB
None


In [159]:
def cabin2dns(row):
    # print(row['Cabin'])
    if row['Cabin'] is not np.NAN:
        a = row['Cabin'].split('/')
        row['Deck'] = a[0]
        row['Num'] = int(a[1])
        row['Side'] = a[2]
    else:
        row['Deck'] = np.NAN
        row['Num'] = np.NAN
        row['Side'] = np.NAN
    return row

In [160]:
def passenger2grouppp(row):
    if row['PassengerId'] is not np.NAN:
        a = row['PassengerId'].split('_')
        row['Group'] = int(a[0])
        row['Pp'] = int(a[1])
    else:
        row['Group'] = np.NAN
        row['Pp'] = np.NAN
    return row

In [161]:
df = df.apply(cabin2dns, axis=1)
df.drop(columns=['Cabin'], inplace=True)

In [162]:
df = df.apply(passenger2grouppp, axis=1)
# df.drop(columns=['PassengerId'], inplace=True)

In [163]:
df.drop(columns=['Name'], inplace=True)

In [164]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   PassengerId   12970 non-null  object 
 2   HomePlanet    12682 non-null  object 
 3   CryoSleep     12660 non-null  object 
 4   Destination   12696 non-null  object 
 5   Age           12700 non-null  float64
 6   VIP           12674 non-null  object 
 7   RoomService   12707 non-null  float64
 8   FoodCourt     12681 non-null  float64
 9   ShoppingMall  12664 non-null  float64
 10  Spa           12686 non-null  float64
 11  VRDeck        12702 non-null  float64
 12  Transported   8693 non-null   object 
 13  Deck          12671 non-null  object 
 14  Num           12671 non-null  float64
 15  Side          12671 non-null  object 
 16  Group         12970 non-null  int64  
 17  Pp            12970 non-null  int64  
dtypes: float64(7), int64(3), o

In [165]:
group_np = df[['Group', 'Pp']].groupby('Group', as_index=False).count()

# group_np

In [166]:
def count_group(row):
    global group_np
    row['Group_count'] = group_np[group_np.Group == row.Group].reset_index().Pp[0]
    return row

In [167]:
df = df.apply(count_group, axis=1)
df

  output = repr(obj)
  return method()


Unnamed: 0,index,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group,Pp,Group_count
0,0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,0.0,P,1,1,1
1,1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,0.0,S,2,1,1
2,2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,0.0,S,3,1,2
3,3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,0.0,S,3,2,2
4,4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,1.0,S,4,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12965,4272,9266_02,Earth,True,TRAPPIST-1e,34.0,False,0.0,0.0,0.0,0.0,0.0,,G,1496.0,S,9266,2,2
12966,4273,9269_01,Earth,False,TRAPPIST-1e,42.0,False,0.0,847.0,17.0,10.0,144.0,,,,,9269,1,1
12967,4274,9271_01,Mars,True,55 Cancri e,,False,0.0,0.0,0.0,0.0,0.0,,D,296.0,P,9271,1,1
12968,4275,9273_01,Europa,False,,,False,0.0,2680.0,0.0,0.0,523.0,,D,297.0,P,9273,1,1


In [168]:
categorial = ['CryoSleep', 'VIP', 'HomePlanet', 'Destination', 'Deck', 'Side']

In [169]:
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=10000)
ordinal.fit(df[categorial])
df.loc[:,categorial] = ordinal.transform(df.loc[:,categorial])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   PassengerId   12970 non-null  object 
 2   HomePlanet    12682 non-null  float64
 3   CryoSleep     12660 non-null  float64
 4   Destination   12696 non-null  float64
 5   Age           12700 non-null  float64
 6   VIP           12674 non-null  float64
 7   RoomService   12707 non-null  float64
 8   FoodCourt     12681 non-null  float64
 9   ShoppingMall  12664 non-null  float64
 10  Spa           12686 non-null  float64
 11  VRDeck        12702 non-null  float64
 12  Transported   8693 non-null   object 
 13  Deck          12671 non-null  float64
 14  Num           12671 non-null  float64
 15  Side          12671 non-null  float64
 16  Group         12970 non-null  int64  
 17  Pp            12970 non-null  int64  
 18  Group_count   12970 non-nu

  df.loc[:,categorial] = ordinal.transform(df.loc[:,categorial])


In [170]:
label = LabelEncoder()
label.fit(df.Transported)
df.loc[:,'Transported'] = label.transform(df.loc[:,'Transported'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12970 entries, 0 to 12969
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         12970 non-null  int64  
 1   PassengerId   12970 non-null  object 
 2   HomePlanet    12682 non-null  float64
 3   CryoSleep     12660 non-null  float64
 4   Destination   12696 non-null  float64
 5   Age           12700 non-null  float64
 6   VIP           12674 non-null  float64
 7   RoomService   12707 non-null  float64
 8   FoodCourt     12681 non-null  float64
 9   ShoppingMall  12664 non-null  float64
 10  Spa           12686 non-null  float64
 11  VRDeck        12702 non-null  float64
 12  Transported   12970 non-null  int64  
 13  Deck          12671 non-null  float64
 14  Num           12671 non-null  float64
 15  Side          12671 non-null  float64
 16  Group         12970 non-null  int64  
 17  Pp            12970 non-null  int64  
 18  Group_count   12970 non-nu

  df.loc[:,'Transported'] = label.transform(df.loc[:,'Transported'])


In [171]:
df['Age'] = df.Age.median()

In [172]:
def fill_cryo(row):
    costs = row[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum()
    if costs > 0:
        row['CryoSleep'] = 1
    else:
        row['CryoSleep'] = 0
    return row

In [173]:
df = df.apply(fill_cryo, axis=1)

In [174]:
df[df.VIP == 0].describe()

Unnamed: 0,index,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group,Pp,Group_count
count,12401.0,12122.0,12401.0,12137.0,12401.0,12401.0,12149.0,12123.0,12107.0,12125.0,12142.0,12401.0,12120.0,12120.0,12120.0,12401.0,12401.0,12401.0
mean,3607.53278,0.65575,0.572938,1.495922,27.0,0.0,216.735698,419.522808,174.006608,292.849402,285.578406,1.001371,4.366832,610.400825,0.504538,4627.730425,1.512297,2.023466
std,2402.472697,0.802317,0.494671,0.812723,0.0,0.0,630.845068,1465.779835,595.481307,1090.430391,1116.953592,0.813345,1.732529,513.11353,0.5,2682.124276,1.044894,1.584302
min,0.0,0.0,0.0,0.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,1615.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,174.0,0.0,2306.0,1.0,1.0
50%,3237.0,0.0,1.0,2.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,445.0,1.0,4625.0,1.0,1.0
75%,5420.0,1.0,1.0,2.0,27.0,0.0,45.0,59.0,26.0,50.0,37.0,2.0,6.0,1018.0,1.0,6908.0,2.0,2.0
max,8692.0,2.0,1.0,2.0,27.0,0.0,14327.0,27071.0,23492.0,22408.0,24133.0,2.0,7.0,1894.0,1.0,9280.0,8.0,8.0


In [175]:
df[df.VIP == 1].describe()

Unnamed: 0,index,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group,Pp,Group_count
count,273.0,267.0,273.0,270.0,273.0,273.0,269.0,267.0,266.0,269.0,266.0,273.0,266.0,266.0,266.0,273.0,273.0,273.0
mean,3932.428571,1.337079,0.871795,1.288889,27.0,1.0,486.349442,1793.651685,273.680451,932.847584,1207.199248,0.820513,2.255639,281.597744,0.466165,4856.47619,1.527473,2.065934
std,2533.740898,0.4736,0.334932,0.915514,0.0,0.0,1058.272031,3593.931435,575.222273,2131.559126,2523.478754,0.831824,1.664974,372.27244,0.499794,2732.573542,0.939422,1.301601
min,2.0,1.0,0.0,0.0,27.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0
25%,1815.0,1.0,1.0,0.0,27.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,76.0,0.0,2529.0,1.0,1.0
50%,3599.0,1.0,1.0,2.0,27.0,1.0,1.0,280.0,0.0,46.0,28.5,1.0,2.0,166.0,0.0,4833.0,1.0,2.0
75%,6049.0,2.0,1.0,2.0,27.0,1.0,600.0,2167.5,277.25,928.0,1281.0,2.0,3.75,278.75,1.0,7396.0,2.0,3.0
max,8688.0,2.0,1.0,2.0,27.0,1.0,8243.0,29813.0,3700.0,15255.0,19086.0,2.0,5.0,1791.0,1.0,9276.0,6.0,7.0


In [176]:
df.isna().sum()

index             0
PassengerId       0
HomePlanet      288
CryoSleep         0
Destination     274
Age               0
VIP             296
RoomService     263
FoodCourt       289
ShoppingMall    306
Spa             284
VRDeck          268
Transported       0
Deck            299
Num             299
Side            299
Group             0
Pp                0
Group_count       0
dtype: int64

In [177]:
train = df[train_index]
test = df[test_index]

In [178]:
train.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train.dropna(inplace=True)


In [179]:
y_train = train.Transported
X_train = train.drop(columns=['Transported'])
X_test = test.drop(columns=['Transported'])

In [180]:
model = LGBMClassifier(random_state=RANDOM_STATE)

In [181]:
model.fit(X_train.drop(columns=['PassengerId']), y_train)

In [182]:
pred_train = model.predict(X_train.drop(columns=['PassengerId']))

In [183]:
accuracy_score(y_train, pred_train)

0.8879334648999154

In [184]:
X_test.isna().sum()

index             0
PassengerId       0
HomePlanet       87
CryoSleep         0
Destination      92
Age               0
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Deck            100
Num             100
Side            100
Group             0
Pp                0
Group_count       0
dtype: int64

In [185]:
pred_test = model.predict(X_test.drop(columns=['PassengerId']))

In [186]:
pred_test.shape

(4277,)

In [187]:
pred = pd.concat([X_test['PassengerId'].reset_index(drop=True), pd.DataFrame(pred_test)], axis=1)
pred.columns = ['PassengerId', 'Transported']
pred

Unnamed: 0,PassengerId,Transported
0,0013_01,1
1,0018_01,0
2,0019_01,1
3,0021_01,1
4,0023_01,1
...,...,...
4272,9266_02,1
4273,9269_01,1
4274,9271_01,1
4275,9273_01,1


In [188]:
def class2bool(row):
    row['Transported'] = 'True' if row['Transported'] == 1 else 'False'
    return row

In [189]:
pred = pred.apply(class2bool, axis=1)
pred

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True
...,...,...
4272,9266_02,True
4273,9269_01,True
4274,9271_01,True
4275,9273_01,True


In [194]:
pred.to_csv(f'data/attempt{str(datetime.date.today())}.csv', index=False, sep=',', header=True)