# Titanic Survivor - Predictive model

**1 - Initial imports**

In [2]:
import pandas as pd

Note that `pd.get_dummies`, used in the feature engineering step, does not one-hot encode `Pclass`, because that column is numeric rather than a string. `Pclass` does not really need it, though, because it can be treated as an ordered set (1st class > 2nd class > 3rd class). This is just a shortcut to get both `Sex` and `Pclass` into the model.
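To make this behaviour concrete, here is a minimal sketch on a toy frame (the three rows are invented for illustration, not taken from `train.csv`). By default `pd.get_dummies` only expands string-like or categorical columns, so an integer `Pclass` passes through unchanged. Passing `columns=` would force one-hot encoding if that were preferred.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 2]})

# Default: only the string column is expanded; Pclass stays one numeric column.
default_encoding = pd.get_dummies(toy)
print(list(default_encoding.columns))

# Alternative: explicitly ask for Pclass to be one-hot encoded as well.
forced_encoding = pd.get_dummies(toy, columns=["Sex", "Pclass"])
print(list(forced_encoding.columns))
```

Keeping `Pclass` numeric means a linear model sees the step from 1st to 2nd class as the same size as the step from 2nd to 3rd; the one-hot variant drops that assumption at the cost of more columns.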

In [3]:
data = pd.read_csv("./train.csv").set_index("PassengerId")

**2 - Feature engineering**

In [52]:
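def delete_digit(string):
    # Missing cabins are read as NaN (a float, not a str): map them to an empty label.
    if not isinstance(string, str):
        return ""
    # Keep only the leading deck letter, e.g. "C85" -> "C".
    return string[0]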

data["Cabin"] = data["Cabin"].apply(delete_digit)

features = pd.get_dummies(data[["Sex", "Pclass", "Cabin"]])
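As a side note, the same cabin reduction can probably be written without a helper function by using the pandas string accessor. The sketch below runs on made-up cabin values and is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy cabin values, invented for illustration; NaN marks a missing cabin.
cabins = pd.Series(["C85", float("nan"), "E46", "B57 B59"])

# .str[0] takes the first character and leaves missing values as NaN,
# which fillna("") then turns into the empty label that delete_digit returns.
decks = cabins.str[0].fillna("")
print(decks.tolist())
```

Both versions leave passengers without a recorded cabin in their own empty-string category, which `pd.get_dummies` then encodes as a separate indicator column.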

In [None]:
data["Embarked"]

**3 - Feature normalization**

SVMs are sensitive to feature scaling, so one could argue that we should normalize the features.

Currently our features have almost the same domain, or at least the same order of magnitude:
- Sex ∈ {0, 1} (once dummy-encoded)
- Pclass ∈ {1, 2, 3}
- Cabin ∈ {0, 1} (one indicator column per deck letter)
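If normalization were wanted after all, one common approach is to put a scaler in front of the classifier. The sketch below assumes scikit-learn's `StandardScaler` and `make_pipeline`, neither of which this notebook uses, and runs on a small invented feature matrix.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy features (Sex indicator, Pclass) and labels, invented for illustration.
X = np.array([[0, 1], [1, 3], [0, 2], [1, 1], [0, 3], [1, 2]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# The scaler is fitted inside the pipeline, so the statistics it learns
# come only from the data passed to fit().
scaled_clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="linear"))
scaled_clf.fit(X, y)

# After scaling, every column has zero mean and unit standard deviation.
scaled = scaled_clf.named_steps["standardscaler"].transform(X)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Wrapping both steps in one pipeline means the same scaling is reapplied automatically at prediction time.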

**4 - Split train / test**

Holding out a test set lets us see how the model performs before submitting predictions. Before the final submission, though, we could retrain the model on the complete training data.
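The two-step idea (evaluate on a hold-out set, then refit on everything) could look roughly like this. The data is synthetic and the names `holdout_clf` and `final_clf` are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3)).astype(float)
y = X[:, 0].astype(int)  # label copies the first column, purely for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: estimate performance on rows the model has not seen.
holdout_clf = SVC(C=1.0, kernel="linear").fit(X_train, y_train)
print("hold-out accuracy:", holdout_clf.score(X_test, y_test))

# Step 2: refit with the same hyperparameters on every labelled row
# before predicting the unlabelled submission set.
final_clf = SVC(C=1.0, kernel="linear").fit(X, y)
```

The hold-out score then describes a model trained on 70% of the data, so it is only an estimate of how the refitted model will behave.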

In [54]:
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, data.Survived, test_size=0.3, random_state=2029)

**5 - Model training**

In [55]:
from sklearn import svm
clf = svm.SVC(C=1.0, kernel="linear")
clf.fit(features_train, labels_train)
None

**6 - Predictions**

In [56]:
predictions = clf.predict(features_test)

**7 - Evaluation**

In [57]:
from sklearn.metrics import accuracy_score
accuracy_score(labels_test, predictions)

0.77238805970149249

Surprisingly, that is not bad for such a simple model!

In [58]:
from sklearn.metrics import f1_score
f1_score(labels_test, predictions)

0.72398190045248867

In [59]:
from sklearn.metrics import confusion_matrix
confusion_matrix(labels_test, predictions)

array([[127,  25],
       [ 36,  80]])
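As a cross-check, the accuracy and F1 score reported above can be recomputed by hand from this confusion matrix. scikit-learn puts true labels on the rows and predicted labels on the columns, so the layout is `[[TN, FP], [FN, TP]]`. The variable names below are illustrative.

```python
# Counts read off the confusion matrix above.
tn, fp = 127, 25
fn, tp = 36, 80

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all passengers classified correctly
precision = tp / (tp + fp)     # predicted survivors who actually survived
recall = tp / (tp + fn)        # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The model misses more survivors (36 false negatives) than it wrongly predicts (25 false positives), which is why recall comes out lower than precision.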

**8 - Predict & export for Kaggle**
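The cell below calls `attributes_to_features`, which is not defined in the cells shown here. The following is a hypothetical sketch of what such a helper might do, assuming it repeats the steps from section 2. The optional `columns` argument is an addition of this sketch: it aligns the dummy columns with the training features, because the submission file may contain cabin letters the training set lacks, or lack some it has.

```python
import pandas as pd


def attributes_to_features(raw, columns=None):
    """Hypothetical helper: rebuild the section 2 features from raw passenger rows."""
    frame = raw[["Sex", "Pclass", "Cabin"]].copy()
    # Same reduction as delete_digit: deck letter, or "" when the cabin is missing.
    frame["Cabin"] = frame["Cabin"].apply(
        lambda cabin: cabin[0] if isinstance(cabin, str) else ""
    )
    encoded = pd.get_dummies(frame)
    if columns is not None:
        # Align with the training columns: absent dummies become 0,
        # dummies unseen during training are dropped.
        encoded = encoded.reindex(columns=columns, fill_value=0)
    return encoded


# Toy frames, invented for illustration.
train_raw = pd.DataFrame(
    {"Sex": ["male", "female"], "Pclass": [3, 1], "Cabin": [None, "C85"]}
)
test_raw = pd.DataFrame(
    {"Sex": ["female", "female"], "Pclass": [2, 3], "Cabin": ["T1", None]}
)

train_features = attributes_to_features(train_raw)
test_features = attributes_to_features(test_raw, columns=train_features.columns)
print(list(test_features.columns))
```

With the notebook's own names, this would mean passing `features.columns` as the second argument, so that `clf.predict` receives the column layout it was trained on.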

In [11]:
data_out = pd.read_csv("./test.csv")
features_out = attributes_to_features(data_out)
data_out["Survived"] = clf.predict(features_out)

In [12]:
data_out[["PassengerId", "Survived"]].to_csv("output.csv", index=False)