<a href="https://www.kaggle.com/code/mmellinger66/spaceship-titanic-voting-classifier?scriptVersionId=115229694" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Spaceship Titanic: Voting Classifier</h1>
</div>

I'm taking a deep dive into the Spaceship Titanic learning competition by solving the problem with several different methods. I'll try to reuse code across the notebooks. Hopefully, this will be a learning experience for both me and the readers.

## Notebooks in the Series

- [Spaceship Titanic: XGB+LGBM Blend](https://www.kaggle.com/code/mmellinger66/spaceship-titanic-xgb-lgbm-blend)
- [Spaceship Titanic: Voting Classifier](https://www.kaggle.com/code/mmellinger66/spaceship-titanic-voting-classifier)
- [Spaceship Titanic: XGBoost/Optuna](https://www.kaggle.com/code/mmellinger66/spaceship-titanic-xgboost-optuna)

In [1]:
# Black formatter https://black.readthedocs.io/en/stable/

! pip install nb-black > /dev/null

%load_ext lab_black


<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Import Libraries</h1>
</div>

A best practice is to include all imports here. However, I will put a few imports farther down, where they are first used, so beginners can learn with an "as needed" approach.

In [2]:
from typing import List, Set, Dict, Tuple, Optional

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

import xgboost as xgb
import catboost as cat
import lightgbm as lgb

pd.options.display.max_columns = 100  # Want to view all the columns

from IPython.display import Markdown as md

<div style='background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill'><h1 style='text-align: center;padding: 12px 0px 12px 0px;'>Configuration</h1>
</div>

In [3]:
data_dir = "../input/spaceship-titanic"

In [4]:
TARGET = "Transported"

<div style='background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill'><h1 style='text-align: center;padding: 12px 0px 12px 0px;'>Library</h1>
</div>

In [5]:
def read_data(path: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    data_dir = Path(path)

    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    submission_df = pd.read_csv(data_dir / "sample_submission.csv")

    print(f"train data: Rows={train.shape[0]}, Columns={train.shape[1]}")
    print(f"test data : Rows={test.shape[0]}, Columns={test.shape[1]}")
    return train, test, submission_df

In [6]:
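def create_submission(model_name: str, target: str, preds: List[float]) -> pd.DataFrame:
    # Uses the global `sample_submission` DataFrame returned by read_data()
    sample_submission[target] = preds

    if len(model_name) > 0:
        # f-string, so the model name is actually substituted into the file name
        fname = f"submission_{model_name}.csv"
    else:
        fname = "submission.csv"

    sample_submission.to_csv(fname, index=False)

    return sample_submission[:5]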

In [7]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)


def show_scores(gt: List[float], yhat: List[float]) -> None:
    accuracy = accuracy_score(gt, yhat)
    precision = precision_score(gt, yhat)
    recall = recall_score(gt, yhat)
    f1 = f1_score(gt, yhat)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"f1: {f1:.4f}")
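
To make the four numbers printed by `show_scores` concrete, here is a minimal sketch on hand-made labels. The `toy_true` and `toy_pred` lists are hypothetical and are not taken from the competition data; the comments spell out the counts behind each metric.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels (1 = transported, 0 = not transported)
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 1]

# Counts: TP=3, FN=1, FP=2, TN=2
toy_accuracy = accuracy_score(toy_true, toy_pred)    # (TP + TN) / total = 5/8
toy_precision = precision_score(toy_true, toy_pred)  # TP / (TP + FP) = 3/5
toy_recall = recall_score(toy_true, toy_pred)        # TP / (TP + FN) = 3/4
toy_f1 = f1_score(toy_true, toy_pred)                # 2PR / (P + R) = 2/3

print(f"Accuracy: {toy_accuracy:.4f}")
print(f"Precision: {toy_precision:.4f}")
print(f"Recall: {toy_recall:.4f}")
print(f"f1: {toy_f1:.4f}")
```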

In [8]:
from sklearn.preprocessing import LabelEncoder


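def label_encoder(
    train: pd.DataFrame, test: pd.DataFrame, columns: List[str]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    for col in columns:
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # NOTE: train and test are encoded by separate encoders, so a category
        # only gets the same integer in both splits when both contain the same
        # set of values. Columns such as PassengerId, Cabin and Name do not;
        # fitting one encoder on the combined values would keep them aligned.
        train[col] = LabelEncoder().fit_transform(train[col])
        test[col] = LabelEncoder().fit_transform(test[col])
    return train, test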
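
Label encoding only lines up across train and test when both splits are transformed by the same fitted encoder. The sketch below uses hypothetical toy columns (`toy_train_col`, `toy_test_col`) to show how the codes can drift when each split gets its own encoder, and how one encoder fit on the combined values keeps them consistent. It illustrates the idea and is not necessarily how the other notebooks in the series handle it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; the test split happens to lack "Earth"
toy_train_col = pd.Series(["Earth", "Europa", "Mars"])
toy_test_col = pd.Series(["Mars", "Europa"])

# Separate encoder on test: it only sees (and sorts) Europa and Mars
separate_test = LabelEncoder().fit_transform(toy_test_col)

# Shared encoder: fit once on the values from both splits
shared_encoder = LabelEncoder().fit(pd.concat([toy_train_col, toy_test_col]))
shared_train = shared_encoder.transform(toy_train_col)
shared_test = shared_encoder.transform(toy_test_col)

print(shared_train.tolist())   # [0, 1, 2]  Earth=0, Europa=1, Mars=2
print(separate_test.tolist())  # [1, 0]     Mars=1 here, but 2 in train
print(shared_test.tolist())    # [2, 1]     consistent with train
```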

In [9]:
def show_missing_features(df: pd.DataFrame) -> None:
    missing_vals = df.isna().sum().sort_values(ascending=False)
    print(missing_vals[missing_vals > 0])

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Load Train/Test Data</h1>
</div>
- train.csv - Data used to build our machine learning model
- test.csv - Data the trained model makes predictions on for the submission. Does not contain the target variable
- sample_submission.csv - A file in the proper format to submit test predictions

In [10]:
train, test, sample_submission = read_data(data_dir)

train data: Rows=8693, Columns=14
test data : Rows=4277, Columns=13


In [11]:
train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [12]:
train.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Missing Data</h1>
</div>

In [13]:
show_missing_features(train)

CryoSleep       217
ShoppingMall    208
VIP             203
HomePlanet      201
Name            200
Cabin           199
VRDeck          188
FoodCourt       183
Spa             183
Destination     182
RoomService     181
Age             179
dtype: int64


In [14]:
## Separate Categorical and Numerical Features
cat_features = list(train.select_dtypes(include=["category", "object"]).columns)
num_features = list(test.select_dtypes(include=["number"]).columns)

### Impute Missing Categorical Features

In [15]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")

train[cat_features] = imputer.fit_transform(train[cat_features])
test[cat_features] = imputer.transform(test[cat_features])

### Impute Missing Numerical Features

In [16]:
# imputer = SimpleImputer(strategy="mean")
imputer = SimpleImputer(strategy="median")  # median is more robust to outliers

train[num_features] = imputer.fit_transform(train[num_features])
test[num_features] = imputer.transform(test[num_features])
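
The comment above says the median is more robust to outliers. A minimal sketch with a hypothetical spending column (`toy_spend`, one extreme value and one missing entry) shows how differently the two strategies fill the gap:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three small values, one outlier, one missing entry
toy_spend = np.array([[0.0], [10.0], [20.0], [5000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy_spend)
median_filled = SimpleImputer(strategy="median").fit_transform(toy_spend)

print(mean_filled[-1, 0])    # 1257.5, pulled up by the 5000.0 outlier
print(median_filled[-1, 0])  # 15.0, midway between 10.0 and 20.0
```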

## Verify No Missing Data

In [17]:
missing_vals = train.isna().sum()
print(missing_vals[missing_vals > 0])

Series([], dtype: int64)


<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Encode Categorical Features</h1>
</div>

In [18]:
train, test = label_encoder(train, test, cat_features)

In [19]:
FEATURES = cat_features + num_features

y = train[TARGET]
X = train[FEATURES].copy()

X_test = test[FEATURES].copy()

In [20]:
X.head()

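| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | VIP | Name | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 149 | 2 | 0 | 5252 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0 | 0 | 2184 | 2 | 0 | 4502 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 |
| 2 | 2 | 1 | 0 | 1 | 2 | 1 | 457 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 |
| 3 | 3 | 1 | 0 | 1 | 2 | 0 | 7149 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 |
| 4 | 4 | 0 | 0 | 2186 | 2 | 0 | 8319 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 |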


## Scale the Data

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)  # learn mean/std from the training features, then scale them
X_test = scaler.transform(X_test)
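
Standardization subtracts each column's mean and divides by its standard deviation, using the statistics learned during `fit`; later calls to `transform` reuse those same statistics. A minimal sketch on toy numbers (the `demo_` arrays are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

demo_train = np.array([[1.0], [2.0], [3.0]])  # hypothetical training column
demo_new = np.array([[4.0]])                  # hypothetical unseen row

# fit() learns mean = 2.0 and (population) std = sqrt(2/3) from demo_train
demo_scaler = StandardScaler().fit(demo_train)

demo_train_scaled = demo_scaler.transform(demo_train)
demo_new_scaled = demo_scaler.transform(demo_new)  # (4 - 2) / sqrt(2/3)

print(demo_scaler.mean_, demo_scaler.scale_)
print(demo_train_scaled.ravel())
print(demo_new_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the unseen row is simply expressed in those same units.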

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Train Model with Train/Test Split</h1>
</div>

We split the training data so we can evaluate how well each model performs. We hold out 20% of the training data to validate the model(s).

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.2,  # Save 20% for validation
    random_state=42,  # Make the split deterministic
)
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((6954, 13), (6954,), (1739, 13), (1739,))

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Create Models</h1>
</div>

In [23]:
from sklearn.linear_model import LogisticRegression, RidgeClassifier

model = LogisticRegression(
    solver="liblinear",
    #                             penalty="l1",
    random_state=42,
)

model.fit(X_train, y_train)
valid_preds = model.predict(X_valid)
show_scores(y_valid, valid_preds)

Accuracy: 0.7683
Precision: 0.7529
Recall: 0.8052
f1: 0.7782


In [24]:
model = RidgeClassifier(alpha=0.5)
model.fit(X_train, y_train)

valid_preds = model.predict(X_valid)
show_scores(y_valid, valid_preds)

Accuracy: 0.7539
Precision: 0.7907
Recall: 0.6970
f1: 0.7409


In [25]:
test_preds = model.predict(X_test)

test_preds[:5]

array([ True, False,  True,  True, False])

In [26]:
rf_clf = RandomForestClassifier(n_estimators=2000)
xgb_clf = xgb.XGBClassifier(n_estimators=2_000, eta=0.001, max_depth=10)
cat_clf = cat.CatBoostClassifier(n_estimators=2_000, eta=0.001, max_depth=10, verbose=0)
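lgb_clf = lgb.LGBMClassifier(
    # NOTE: "binary" is LightGBM's usual objective for a two-class target;
    # "regression" is the setting that produced the scores shown in this notebook
    n_estimators=2_000, objective="regression", learning_rate=0.001, max_depth=8
)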

svc_clf = SVC(kernel="poly", degree=2, gamma="auto", coef0=1, C=5, probability=True)

ridge_clf = RidgeClassifier(alpha=0.5)

lr_clf = LogisticRegression(
    solver="liblinear",
    #                             penalty="l1",
    random_state=42,
)

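### Hard Voting

Hard voting predicts the majority class label across the estimators, while soft voting averages their predicted probabilities. The cell below passes `voting="soft"`, so the scores shown come from probability averaging; set `voting="hard"` for a majority vote.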

In [27]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ("lr", lr_clf),
        ("rf", rf_clf),
        ("svc", svc_clf),
        ("xgb_clf", xgb_clf),
        ("cat_clf", cat_clf),
        ("lgb_clf", lgb_clf),
        # RidgeClassifier has no predict_proba, so it cannot take part in soft voting
        #         ("ridge", ridge_clf),
    ],
    voting="soft",
)

voting_clf.fit(X_train, y_train)

valid_preds = voting_clf.predict(X_valid)
show_scores(y_valid, valid_preds)

Accuracy: 0.7867
Precision: 0.7700
Recall: 0.8235
f1: 0.7958
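
The difference between the two voting modes can be worked through by hand. The sketch below uses hypothetical probabilities from three classifiers for two passengers and plain NumPy rather than `VotingClassifier` itself; for a binary target, thresholding the averaged probability at 0.5 matches what soft voting does, up to ties.

```python
import numpy as np

# Hypothetical P(Transported=True); rows = classifiers, columns = passengers
toy_proba = np.array(
    [
        [0.90, 0.60],  # classifier A
        [0.45, 0.55],  # classifier B
        [0.45, 0.10],  # classifier C
    ]
)

# Hard voting: each classifier votes at the 0.5 threshold, majority wins
toy_votes = (toy_proba >= 0.5).astype(int)
hard_pred = toy_votes.sum(axis=0) > toy_proba.shape[0] / 2

# Soft voting: average the probabilities, then threshold the mean
soft_pred = toy_proba.mean(axis=0) >= 0.5

print(hard_pred)  # [False  True]
print(soft_pred)  # [ True False]
```

For the first passenger one very confident classifier outweighs two mildly negative ones under soft voting, while for the second a single confident "no" overturns two lukewarm "yes" votes.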


In [28]:
test_preds = voting_clf.predict(X_test)

create_submission("", TARGET, test_preds)

| | PassengerId | Transported |
|---|---|---|
| 0 | 0013_01 | True |
| 1 | 0018_01 | False |
| 2 | 0019_01 | True |
| 3 | 0021_01 | True |
| 4 | 0023_01 | True |


### Soft Voting