<a href="https://www.kaggle.com/code/jannikca/titanic-competition-with-xgboost?scriptVersionId=124526848" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import xgboost as xgb

# Own imports
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/spaceship-titanic/sample_submission.csv
/kaggle/input/spaceship-titanic/train.csv
/kaggle/input/spaceship-titanic/test.csv


# **Intention of the Notebook**

This is my first competition notebook that does not follow a course exercise.
I recap a range of concepts on my own to produce reasonable predictions.
Later on, I think about how to improve those predictions.

In [2]:
#Read training and test data
X = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
X_test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
X_test_old = X_test.copy()


#Remove rows with empty target and separate target from predictors
X.dropna(axis=0, subset=['Transported'], inplace=True)
y = X.Transported
X.drop(['Transported'], axis=1, inplace=True)





Next, we split the passenger ID into a group ID and an ID within the group, and the cabin into deck, cabin number, and side.

In [3]:
X[['GroupId', 'IdInGroup']] = X.PassengerId.str.split("_", expand = True)
X.drop(['PassengerId'], axis=1, inplace=True)

# Inspect the test frame before and after the split (pandas abbreviates long frames)
print(X_test)
X_test[['GroupId', 'IdInGroup']] = X_test.PassengerId.str.split("_", expand = True)
print(X_test)
X_test.drop(['PassengerId'], axis=1, inplace=True)


     PassengerId HomePlanet CryoSleep     Cabin    Destination   Age    VIP  \
0        0013_01      Earth      True     G/3/S    TRAPPIST-1e  27.0  False   
1        0018_01      Earth     False     F/4/S    TRAPPIST-1e  19.0  False   
2        0019_01     Europa      True     C/0/S    55 Cancri e  31.0  False   
3        0021_01     Europa     False     C/1/S    TRAPPIST-1e  38.0  False   
4        0023_01      Earth     False     F/5/S    TRAPPIST-1e  20.0  False   
...          ...        ...       ...       ...            ...   ...    ...   
4272     9266_02      Earth      True  G/1496/S    TRAPPIST-1e  34.0  False   
4273     9269_01      Earth     False       NaN    TRAPPIST-1e  42.0  False   
4274     9271_01       Mars      True   D/296/P    55 Cancri e   NaN  False   
4275     9273_01     Europa     False   D/297/P            NaN   NaN  False   
4276     9277_01      Earth      True  G/1498/S  PSO J318.5-22  43.0  False   

      RoomService  Fo

In [4]:
X[['Deck', 'CabinNumber','Side']] = X.Cabin.str.split("/", expand = True)
X.drop(['Cabin'], axis=1, inplace=True)

X_test[['Deck', 'CabinNumber','Side']] = X_test.Cabin.str.split("/", expand = True)
X_test.drop(['Cabin'], axis=1, inplace=True)
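
The two cells above apply the same splits to the training and the test frame. A minimal sketch of how both steps could be bundled into one function so that the frames cannot drift apart; `split_id_and_cabin` and the small `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def split_id_and_cabin(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with PassengerId and Cabin split into their parts.

    Hypothetical helper, not part of the original notebook.
    """
    out = df.copy()
    # "0013_01" -> GroupId "0013", IdInGroup "01"
    out[["GroupId", "IdInGroup"]] = out["PassengerId"].str.split("_", expand=True)
    # "G/3/S" -> Deck "G", CabinNumber "3", Side "S"; a missing cabin stays missing
    out[["Deck", "CabinNumber", "Side"]] = out["Cabin"].str.split("/", expand=True)
    return out.drop(columns=["PassengerId", "Cabin"])


# Tiny made-up frame in the shape of the competition data
demo = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0018_02"],
    "Cabin": ["G/3/S", None, "F/4/S"],
})
result = split_id_and_cabin(demo)
print(result)
```

With such a helper, the notebook's two frames could be processed as `X = split_id_and_cabin(X)` and `X_test = split_id_and_cabin(X_test)`. The helper returns a copy, so the input frame keeps its original columns.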

In [5]:
# Select categorical columns with relatively low cardinality. This removes columns
# such as IDs or names, whose values are essentially unique, so we cannot learn from them.
categorical_cols = [cname for cname in X.columns if
                    X[cname].nunique() < 100 and 
                    X[cname].dtype == "object"]



# Select numerical columns
numerical_cols = [cname for cname in X.columns if 
                X[cname].dtype in ['int64', 'float64']]


# GroupId holds digit strings such as "0064"; it is kept as a numerical feature
numerical_cols.append("GroupId")

# Keep selected columns only
my_cols = categorical_cols + numerical_cols



X_train = X[my_cols].copy()
X_test = X_test[my_cols].copy()
print(X_train)

     HomePlanet CryoSleep    Destination    VIP IdInGroup Deck Side   Age  \
0        Europa     False    TRAPPIST-1e  False        01    B    P  39.0   
1         Earth     False    TRAPPIST-1e  False        01    F    S  24.0   
2        Europa     False    TRAPPIST-1e   True        01    A    S  58.0   
3        Europa     False    TRAPPIST-1e  False        02    A    S  33.0   
4         Earth     False    TRAPPIST-1e  False        01    F    S  16.0   
...         ...       ...            ...    ...       ...  ...  ...   ...   
8688     Europa     False    55 Cancri e   True        01    A    P  41.0   
8689      Earth      True  PSO J318.5-22  False        01    G    S  18.0   
8690      Earth     False    TRAPPIST-1e  False        01    G    S  26.0   
8691     Europa     False    55 Cancri e  False        01    E    S  32.0   
8692     Europa     False    TRAPPIST-1e  False        02    E    S  44.0   

      RoomService  FoodCourt  ShoppingMall   
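
The selection above depends on column dtypes. A True/False column that also contains missing values is stored with object dtype, which is why CryoSleep and VIP appear among the categorical columns in the output. A minimal sketch of the same selection on made-up data, using `select_dtypes` instead of comparing dtypes by hand; the frame and the threshold of 3 are assumptions for illustration (the notebook uses 100).

```python
import numpy as np
import pandas as pd

# Made-up frame that mimics the column types of the competition data
demo = pd.DataFrame({
    "CryoSleep": [True, False, np.nan],   # booleans plus NaN -> object dtype
    "Deck": ["B", "F", "B"],              # low cardinality
    "Name": ["a b", "c d", "e f"],        # unique per row
    "Age": [39.0, 24.0, np.nan],
})

max_cardinality = 3  # hypothetical threshold for this toy frame

# Non-numeric columns with few distinct values are treated as categorical
non_numeric = demo.select_dtypes(exclude="number").columns
categorical_cols = [c for c in non_numeric if demo[c].nunique() < max_cardinality]

# Numeric columns are taken as they are
numerical_cols = list(demo.select_dtypes(include="number").columns)

print(categorical_cols)
print(numerical_cols)
```

Note that `nunique()` ignores missing values, so a column of True, False, and NaN counts as two distinct values.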

Next, we use the observation that all passengers within the same group have the same home planet and impute missing values accordingly.

In [6]:
# Most frequent home planet per group (NaN if the whole group is missing it)
group_wise_mf_planet = X_train[['HomePlanet', 'GroupId']].groupby(['GroupId']).agg(
    lambda x: x.value_counts().index[0] if x.value_counts().size > 0 else np.nan)

X_train = pd.merge(X_train, group_wise_mf_planet, how='left', left_on='GroupId', right_index=True)
# Rows whose home planet was missing before the imputation
print(X_train.loc[pd.isna(X_train["HomePlanet_x"]), :].index)
# Fill only the missing values with the planet of the passenger's group
X_train['HomePlanet_x'] = X_train['HomePlanet_x'].fillna(X_train['HomePlanet_y'])
X_train.drop(['HomePlanet_y'], axis=1, inplace=True)
X_train.rename(columns={'HomePlanet_x': 'HomePlanet'}, inplace=True)
print(X_train.loc[59])

Int64Index([  59,  113,  186,  225,  234,  274,  286,  291,  347,  365,
            ...
            8353, 8383, 8454, 8468, 8489, 8515, 8613, 8666, 8674, 8684],
           dtype='int64', length=201)
HomePlanet             Mars
CryoSleep              True
Destination     TRAPPIST-1e
VIP                   False
IdInGroup                02
Deck                      E
Side                      S
Age                    33.0
RoomService             0.0
FoodCourt               0.0
ShoppingMall            NaN
Spa                     0.0
VRDeck                  0.0
GroupId                0064
Name: 59, dtype: object
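
The group-wise imputation can also be written without a merge. The following is a minimal sketch on made-up data, not the notebook's exact code: it computes one planet per group and fills only the missing entries. The helper `first_mode` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: group 0001 has one known planet, group 0002 has none
demo = pd.DataFrame({
    "GroupId": ["0001", "0001", "0002", "0003", "0003"],
    "HomePlanet": ["Mars", np.nan, np.nan, "Earth", "Earth"],
})


def first_mode(values: pd.Series):
    """Most frequent non-missing value, or NaN if the group has none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan


# One planet per group, indexed by GroupId
planet_by_group = demo.groupby("GroupId")["HomePlanet"].agg(first_mode)

# Look up each passenger's group planet and fill only the gaps
demo["HomePlanet"] = demo["HomePlanet"].fillna(demo["GroupId"].map(planet_by_group))
print(demo)
```

Passengers whose whole group lacks a home planet (group 0002 here) stay missing and are left to the generic imputer in the pipeline.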


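Now, we set up a pipeline that replaces the remaining missing values with standard imputation strategies (the median for numerical columns, the most frequent value for categorical columns) and one-hot encodes the categorical columns. Afterward, we train an XGBoost model.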

In [7]:
numerical_transformer = SimpleImputer(strategy='median')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# eta is XGBoost's alias for learning_rate, so only learning_rate is set
model = xgb.XGBClassifier(learning_rate=0.05, n_estimators=20, n_jobs=8,
                          max_depth=5, min_child_weight=2, gamma=0)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('std_scaler', StandardScaler()),
    ('model', model)
])


pipeline.fit(X_train,y)



Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='median'),
                                                  ['Age', 'RoomService',
                                                   'FoodCourt', 'ShoppingMall',
                                                   'Spa', 'VRDeck',
                                                   'GroupId']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['HomePlanet', 'CryoSleep',
                                

**The following code is optional. I used it to optimize a set of hyperparameters.**

In [8]:
'''
parameters = {
    'model__max_depth': range (5, 10, 1),
    'model__n_estimators': range(20, 50, 10),
    'model__learning_rate': [0.1, 0.05],
    'model__min_child_weight' : range(1,5),
    'model__eta' : [0,0.1,0.2],
    'model__gamma' : [0,0.1,0.2]
}


grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    scoring = 'accuracy',
    n_jobs = 4,
    cv = 5,
    verbose=True
)


grid_search.fit(X_train,y)
'''

**Next, the fitted pipeline predicts the test set.**

In [9]:
#predictions = grid_search.best_estimator_.predict(X_test).astype('bool')
predictions = pipeline.predict(X_test).astype('bool')

#print(grid_search.best_score_)
#print(grid_search.best_params_)
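
The commented-out `best_score_` above would report the grid search's cross-validated accuracy. A comparable estimate for a single, fixed pipeline can be obtained with `cross_val_score`, which the notebook imports but does not call. A minimal sketch on synthetic data: all values are made up, and `GradientBoostingClassifier` stands in for `XGBClassifier` so that the example depends only on scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the training data (made-up values)
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "Age": rng.uniform(0, 80, n),
    "Spa": rng.exponential(100, n),
    "Deck": rng.choice(["A", "B", "C"], n),
})
demo_y = demo_X["Spa"] < 70  # synthetic boolean target

# Knock out some values so that the imputers have work to do
demo_X.loc[::10, "Age"] = np.nan
demo_X.loc[5::10, "Deck"] = np.nan

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["Age", "Spa"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Deck"]),
])

demo_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", GradientBoostingClassifier(random_state=0)),  # stand-in model
])

# Five accuracy values, one per held-out fold
scores = cross_val_score(demo_pipeline, demo_X, demo_y, cv=5, scoring="accuracy")
print(scores.mean())
```

Because the preprocessing lives inside the pipeline, the imputers and the encoder are refitted on the training folds only, so no information from the held-out fold leaks into the estimate.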



**Lastly, the predictions are stored in a .csv file.**

In [10]:
# Run the code to save predictions in the format used for competition scoring
output = pd.DataFrame({'PassengerId': X_test_old.PassengerId,
                       'Transported': predictions})
output.to_csv('submission.csv', index=False)
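
Before uploading, it can be worth reading the file back and checking its shape. A minimal sketch, assuming the two-column layout built in the cell above; the IDs and predictions are made up, and an in-memory buffer replaces `submission.csv` so the example has no side effects.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for X_test_old.PassengerId and the model's predictions
passenger_ids = pd.Series(["0013_01", "0018_01", "0019_01"], name="PassengerId")
predictions = np.array([True, False, True])

output = pd.DataFrame({"PassengerId": passenger_ids, "Transported": predictions})

# Write to an in-memory buffer instead of a file for this demo
buffer = io.StringIO()
output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic sanity checks on the round-tripped submission
assert list(check.columns) == ["PassengerId", "Transported"]
assert len(check) == len(passenger_ids)      # one row per test passenger
assert check["PassengerId"].is_unique        # no duplicated passengers
assert check["Transported"].dtype == bool    # True/False, no missing values
print(check)
```

In the notebook itself, the same checks could be run on `pd.read_csv('submission.csv')` with `len(X_test_old)` as the expected row count.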