# Exercise 1

Take the `titanic` dataset and use all attributes to predict the class `Survived`. Convert age and fare into classes, and exclude names from the attribute list.

Build a support vector machine (SVM) model with each of the following kernels:
1. Linear kernel
2. Polynomial kernel
3. Radial basis function (RBF) kernel
4. Sigmoid kernel
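
For orientation, the four kernels listed above can be computed directly with the helpers in `sklearn.metrics.pairwise`. This is only a sketch on two toy points: the values of `gamma` and `coef0` are assumptions chosen for readability and are not the settings used later in this notebook (`SVC` defaults to `gamma='scale'`).

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel,
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two toy samples with two features each (illustrative values only).
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])

gamma = 0.5  # assumed value for illustration

# Linear: <x, x'>
K_linear = linear_kernel(X_toy)
# Polynomial: (gamma * <x, x'> + coef0) ** degree
K_poly = polynomial_kernel(X_toy, degree=3, gamma=gamma, coef0=0.0)
# RBF: exp(-gamma * ||x - x'||^2)
K_rbf = rbf_kernel(X_toy, gamma=gamma)
# Sigmoid: tanh(gamma * <x, x'> + coef0)
K_sigmoid = sigmoid_kernel(X_toy, gamma=gamma, coef0=0.0)

for name, K in [('linear', K_linear), ('poly', K_poly), ('rbf', K_rbf), ('sigmoid', K_sigmoid)]:
    print(name)
    print(K)
```

Each result is a 2x2 matrix of pairwise similarities; passing `kernel='linear'`, `'poly'`, `'rbf'` or `'sigmoid'` to `SVC` selects the corresponding function.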

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
titanic = pd.read_csv('../Data/titanic.csv.zst', index_col='Name')

titanic['Age Group'] = pd.qcut(x=titanic['Age'], q=4)
titanic['Fare Group'] = pd.qcut(x=titanic['Fare'], q=4)

titanic.head(5)

| Name | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | Age Group | Fare Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mr. Owen Harris Braund | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | (20.25, 28.0] | (-0.001, 7.925] |
| Mrs. John Bradley (Florence Briggs Thayer) Cumings | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | (28.0, 38.0] | (31.138, 512.329] |
| Miss. Laina Heikkinen | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | (20.25, 28.0] | (-0.001, 7.925] |
| Mrs. Jacques Heath (Lily May Peel) Futrelle | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | (28.0, 38.0] | (31.138, 512.329] |
| Mr. William Henry Allen | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | (28.0, 38.0] | (7.925, 14.454] |


In [2]:
titanic.describe(include='all')

|  | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | Age Group | Fare Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 887.0 | 887.0 | 887 | 887.0 | 887.0 | 887.0 | 887.0 | 887 | 887 |
| unique |  |  | 2 |  |  |  |  | 4 | 4 |
| top |  |  | male |  |  |  |  | (20.25, 28.0] | (-0.001, 7.925] |
| freq |  |  | 573 |  |  |  |  | 243 | 238 |
| mean | 0.385569 | 2.305524 |  | 29.471443 | 0.525366 | 0.383315 | 32.30542 |  |  |
| std | 0.487004 | 0.836662 |  | 14.121908 | 1.104669 | 0.807466 | 49.78204 |  |  |
| min | 0.0 | 1.0 |  | 0.42 | 0.0 | 0.0 | 0.0 |  |  |
| 25% | 0.0 | 2.0 |  | 20.25 | 0.0 | 0.0 | 7.925 |  |  |
| 50% | 0.0 | 3.0 |  | 28.0 | 0.0 | 0.0 | 14.4542 |  |  |
| 75% | 1.0 | 3.0 |  | 38.0 | 1.0 | 0.0 | 31.1375 |  |  |


In [3]:
from sklearn import preprocessing

for col in ['Sex', 'Age Group', 'Fare Group']:
    le = preprocessing.LabelEncoder()
    titanic[col] = le.fit_transform(titanic[col])

titanic.head(5)

| Name | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | Age Group | Fare Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mr. Owen Harris Braund | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 1 | 0 |
| Mrs. John Bradley (Florence Briggs Thayer) Cumings | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 2 | 3 |
| Miss. Laina Heikkinen | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 1 | 0 |
| Mrs. Jacques Heath (Lily May Peel) Futrelle | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 2 | 3 |
| Mr. William Henry Allen | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 2 | 1 |
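
The binning and encoding steps above can also be combined: `pd.qcut` accepts `labels=False` and then returns the quartile index directly instead of `Interval` objects. The snippet below is a small sketch on made-up ages (not rows from the dataset) and assumes there are no missing values.

```python
import pandas as pd

# Illustrative ages only; these are not taken from the dataset.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])

# labels=False yields the bin index (0..3) for each value.
age_codes = pd.qcut(ages, q=4, labels=False)

# The same codes are available from the categorical interval bins.
age_bins = pd.qcut(ages, q=4)

print(pd.DataFrame({'age': ages, 'bin': age_bins, 'code': age_codes}))
```

Because the quartile bins are ordered, the integer codes keep that order, which is the property the SVM relies on when these columns are used as numeric features.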


The feature list and a cross-validation helper are defined here for use in the comparison that follows.

In [4]:
all_features = ['Pclass', 'Sex', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Age Group', 'Fare Group']

In [5]:
import numpy as np
from sklearn.svm import SVC
from util import kfold_eval


def svm_performance(kernel: str = 'linear', degree: int = 3, k: int = 5) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Performs a k-fold cross-validation on an SVM with the specified kernel.

    :param kernel: The SVM kernel to evaluate ('linear', 'poly', 'rbf' or 'sigmoid')
    :param degree: The degree of the polynomial kernel. Ignored by all other kernels.
    :param k: The number of cross-validation folds

    :return: Per-fold accuracy, precision, recall and F1-score
    """
    X = titanic[all_features]
    y = titanic['Survived']

    svm = SVC(kernel=kernel, degree=degree)
    return kfold_eval(model=svm, X=X, y=y, k=k)
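
The `kfold_eval` helper is imported from a local `util` module that is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in named `kfold_eval_sketch`. The stratified, shuffled split and the `seed` parameter are assumptions; the real helper may split and score differently.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold


def kfold_eval_sketch(model, X, y, k=5, seed=0):
    """Hypothetical stand-in for util.kfold_eval; returns one score per fold."""
    X, y = np.asarray(X), np.asarray(y)
    # Assumption: stratified folds with shuffling, so class balance is kept per fold.
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    acc, prec, rec, f1 = [], [], [], []
    for train_idx, test_idx in folds.split(X, y):
        # Fit a fresh copy of the model on the training part of this fold.
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        y_pred = fitted.predict(X[test_idx])
        acc.append(accuracy_score(y[test_idx], y_pred))
        prec.append(precision_score(y[test_idx], y_pred, zero_division=0))
        rec.append(recall_score(y[test_idx], y_pred, zero_division=0))
        f1.append(f1_score(y[test_idx], y_pred, zero_division=0))
    return np.array(acc), np.array(prec), np.array(rec), np.array(f1)
```

A call such as `kfold_eval_sketch(SVC(kernel='rbf'), X, y, k=5)` would return four arrays of length 5, matching the shape that `svm_performance` passes on.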

## Comparison of Kernel Performance

In [None]:
kernels = {
    'Linear SVM Kernel': 'linear',
    'Poly SVM Kernel': 'poly',
    'Rbf SVM Kernel': 'rbf',
    'Sigmoid SVM Kernel': 'sigmoid'
}

for name, kernel in kernels.items():
    print(f'{name}:')
    a, p, r, f = svm_performance(kernel, k=5)
    data = {
        'Fold': range(1, 6),
        'Accuracy': a,
        'Precision': p,
        'Recall': r,
        'F1-Score': f,
    }

    scores = pd.DataFrame(data).set_index('Fold')
    display(scores)

Linear SVM Kernel:


| Fold | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| 1 | 0.814607 | 0.745763 | 0.709677 | 0.727273 |
| 2 | 0.792135 | 0.721519 | 0.791667 | 0.754967 |
| 3 | 0.785311 | 0.7 | 0.742424 | 0.720588 |
| 4 | 0.734463 | 0.56338 | 0.714286 | 0.629921 |
| 5 | 0.80226 | 0.68254 | 0.741379 | 0.710744 |


Poly SVM Kernel:


| Fold | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| 1 | 0.831461 | 0.779661 | 0.730159 | 0.754098 |
| 2 | 0.808989 | 0.721519 | 0.826087 | 0.77027 |
| 3 | 0.79096 | 0.7 | 0.753846 | 0.725926 |
| 4 | 0.768362 | 0.577465 | 0.788462 | 0.666667 |
| 5 | 0.836158 | 0.746032 | 0.783333 | 0.764228 |


Rbf SVM Kernel:


| Fold | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| 1 | 0.820225 | 0.762712 | 0.714286 | 0.737705 |
| 2 | 0.814607 | 0.734177 | 0.828571 | 0.778523 |
| 3 | 0.819209 | 0.728571 | 0.796875 | 0.761194 |
| 4 | 0.785311 | 0.577465 | 0.836735 | 0.683333 |
| 5 | 0.853107 | 0.746032 | 0.824561 | 0.783333 |


Sigmoid SVM Kernel:


## Results
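Of the three kernels whose scores are shown, the RBF kernel performs best. Its mean accuracy over the five folds is about 0.82, compared with about 0.81 for the polynomial kernel and 0.79 for the linear kernel, and its mean F1-score is also the highest (about 0.75). It is also the most stable: the standard deviation of its fold accuracies is about 0.024, versus 0.028 for the polynomial and 0.031 for the linear kernel.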
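
The comparison can be checked by aggregating the per-fold accuracies reported above. The sketch below copies those values by hand into a DataFrame (the sigmoid kernel is left out because its scores are not shown) and computes the mean and sample standard deviation per kernel.

```python
import pandas as pd

# Per-fold accuracies copied from the result tables in this notebook.
accuracy = pd.DataFrame(
    {
        'linear': [0.814607, 0.792135, 0.785311, 0.734463, 0.80226],
        'poly': [0.831461, 0.808989, 0.79096, 0.768362, 0.836158],
        'rbf': [0.820225, 0.814607, 0.819209, 0.785311, 0.853107],
    },
    index=pd.RangeIndex(1, 6, name='Fold'),
)

# One row per kernel: mean accuracy and its spread across folds.
summary = accuracy.agg(['mean', 'std']).T
print(summary.round(3))
```

A higher mean together with a lower standard deviation is what "best with good stability" refers to here; with only five folds, small differences between kernels should be read with some caution.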