<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Titanic: Support Vector Machine</h1>
</div>

In [1]:
# Black formatter https://black.readthedocs.io/en/stable/

! pip install nb-black > /dev/null

%load_ext lab_black

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Import Libraries</h1>
</div>

A best practice is to put all imports here. However, a few imports appear farther down, where they are first used, so beginners can learn with an "as needed" approach.

In [2]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

from pathlib import Path

pd.options.display.max_columns = 100  # Want to view all the columns

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Configuration</h1>
</div>

In [3]:
TARGET = "Survived"

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Library</h1>
</div>

A few helper functions that we will reuse in each project.

In [4]:
def read_data(path):
    data_dir = Path(path)

    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    submission_df = pd.read_csv(data_dir / "gender_submission.csv")

    print(f"train data: Rows={train.shape[0]}, Columns={train.shape[1]}")
    print(f"test data : Rows={test.shape[0]}, Columns={test.shape[1]}")
    return train, test, submission_df

In [5]:
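def create_submission(model_name, target, preds):
    # Relies on the global `sample_submission` DataFrame returned by read_data()
    sample_submission[target] = preds

    if len(model_name) > 0:
        # f-string, so the model name is actually substituted into the file name
        fname = f"submission_{model_name}.csv"
    else:
        fname = "submission.csv"

    sample_submission.to_csv(fname, index=False)

    return sample_submission[:5]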

In [6]:
from sklearn.metrics import accuracy_score


def show_scores(gt, yhat):
    score = accuracy_score(gt, yhat)
    print(f"Accuracy: {score:.4f}")

In [7]:
from sklearn.preprocessing import LabelEncoder


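def label_encoder(train, test, columns):
    for col in columns:
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # Fit one encoder on the union of both sets so that a category
        # gets the same integer code in train and in test
        encoder = LabelEncoder().fit(pd.concat([train[col], test[col]]))
        train[col] = encoder.transform(train[col])
        test[col] = encoder.transform(test[col])
    return train, test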
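
Label encoding is only useful to the model if a given category maps to the same integer in both data sets. The sketch below uses made-up toy values (not the Titanic files) to show how two independently fitted encoders can disagree, while one encoder fitted on the union stays consistent. The variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values): "C" appears in train but not in test
toy_train = pd.Series(["C", "Q", "S", "S"])
toy_test = pd.Series(["Q", "S", "S"])

# Separate encoders: each one numbers only the categories it has seen
code_s_train = LabelEncoder().fit(toy_train).transform(["S"])[0]
code_s_test = LabelEncoder().fit(toy_test).transform(["S"])[0]

# Shared encoder fitted on the union of both series
shared = LabelEncoder().fit(pd.concat([toy_train, toy_test]))
shared_s = shared.transform(["S"])[0]

print("separate encoders:", code_s_train, code_s_test)
print("shared encoder   :", shared_s)
```

With separate encoders, "S" receives a different code in each set, so the model would see a different value at prediction time than it saw during training.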

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Load Train/Test Data</h1>
</div>

- train.csv - Data used to build our machine learning model
- test.csv - Data we make predictions on with the trained model. Does not contain the target variable
- gender_submission.csv - A file in the proper format to submit test predictions

In [8]:
train, test, sample_submission = read_data("../input/titanic")

train data: Rows=891, Columns=12
test data : Rows=418, Columns=11


In [9]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Missing Data</h1>
</div>

In [11]:
missing_vals = train.isna().sum()
print(missing_vals[missing_vals > 0])

Age         177
Cabin       687
Embarked      2
dtype: int64


In [12]:
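## Separate Categorical and Numerical Features
cat_features = list(train.select_dtypes(include=["category", "object"]).columns)
# Numeric columns are taken from test, which has no target column,
# so "Survived" is not included in the feature list
num_features = list(test.select_dtypes(include=["number"]).columns)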

### Impute Missing Categorical Features

In [13]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")

train[cat_features] = imputer.fit_transform(train[cat_features])
test[cat_features] = imputer.transform(test[cat_features])

### Impute Missing Numerical Features

In [14]:
# imputer = SimpleImputer(strategy="mean")
imputer = SimpleImputer(strategy="median")  # median is more robust to outliers

train[num_features] = imputer.fit_transform(train[num_features])
test[num_features] = imputer.transform(test[num_features])
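
To illustrate the comment about the median being more robust to outliers, here is a minimal sketch on a made-up column (the numbers are illustrative, not taken from the Titanic data). A single extreme value pulls the mean far from the typical values, while the median barely moves.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values): one outlier and one missing entry
toy = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# The last row held the missing value, so it shows what each strategy filled in
print("mean fill  :", mean_filled[-1, 0])
print("median fill:", median_filled[-1, 0])
```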

### Verify No Missing Data

In [15]:
missing_vals = train.isna().sum()
print(missing_vals[missing_vals > 0])

Series([], dtype: int64)


<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Encode Categorical Features</h1>
</div>

In [16]:
train, test = label_encoder(train, test, cat_features)

In [17]:
FEATURES = cat_features + num_features

y = train[TARGET]
X = train[FEATURES].copy()

X_test = test[FEATURES].copy()

In [18]:
X.head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,108,1,523,47,2,1.0,3.0,22.0,1.0,0.0,7.25
1,190,0,596,81,0,2.0,1.0,38.0,1.0,0.0,71.2833
2,353,0,669,47,2,3.0,3.0,26.0,0.0,0.0,7.925
3,272,0,49,55,2,4.0,1.0,35.0,1.0,0.0,53.1
4,15,1,472,47,2,5.0,3.0,35.0,0.0,0.0,8.05


## Scale the Data

In [19]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)  # learn mean/std from the training data, then scale it
X_test = scaler.transform(X_test)
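
As a small sketch of what the scaler does, the toy example below (made-up numbers) fits `StandardScaler` on a "train" array and reuses the learned mean and standard deviation on a "test" array. The point is that the test data is transformed with the training statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays (hypothetical values)
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler().fit(toy_train)  # learns mean=2 and the population std
scaled_train = toy_scaler.transform(toy_train)
scaled_test = toy_scaler.transform(toy_test)  # uses the train mean/std

print("train mean/std after scaling:", scaled_train.mean(), scaled_train.std())
print("scaled test value:", scaled_test[0, 0])
```

SVMs are sensitive to feature scale because the kernel is computed from dot products of the features, so unscaled columns with large ranges (such as Fare) would otherwise dominate.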

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Train Model with Train/Test Split</h1>
</div>

We split the training data so we can evaluate how well each model performs. We hold out 20% of the training data to validate the model(s).

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.2,  # Save 20% for validation
    random_state=42,  # Make the split deterministic
)
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((712, 11), (712,), (179, 11), (179,))
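
One optional variation, not used in this notebook, is a stratified split, which keeps the class proportions the same in the training and validation parts. The sketch below uses made-up labels and the `stratify` parameter of `train_test_split`; the toy names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,  # preserve the 80/20 class ratio in both parts
)

print("positive rate in train:", y_tr.mean())
print("positive rate in valid:", y_va.mean())
```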

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Create Models</h1>
</div>

In [21]:
%%time
from sklearn.svm import SVC

model = SVC(kernel="poly", degree=2, gamma="auto", coef0=1, C=5, probability=True)


model.fit(X_train, y_train)
valid_preds = model.predict(X_valid)
show_scores(y_valid, valid_preds)

Accuracy: 0.8156
CPU times: user 112 ms, sys: 0 ns, total: 112 ms
Wall time: 112 ms
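
For readers curious what `kernel="poly"` computes, scikit-learn documents the polynomial kernel as `(gamma * <x, x'> + coef0) ** degree`, and `gamma="auto"` as `1 / n_features`. The sketch below checks that formula against `sklearn.metrics.pairwise.polynomial_kernel` on random data (the array `A` is made up, with 11 columns only to mirror the feature count above).

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 11))  # hypothetical data: 4 samples, 11 features

gamma = 1.0 / A.shape[1]  # value that gamma="auto" resolves to
manual = (gamma * (A @ A.T) + 1.0) ** 2  # degree=2, coef0=1
library = polynomial_kernel(A, degree=2, gamma=gamma, coef0=1)

print("formulas agree:", np.allclose(manual, library))
```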


In [22]:
preds = model.predict(X_test)
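
A single 80/20 split gives only one accuracy estimate. As a hedged sketch of an alternative, the example below runs 5-fold cross-validation with the scaler inside a `Pipeline`, so the scaler is refitted on each training fold. It uses synthetic data from `make_classification` rather than the Titanic features, so the printed score says nothing about this competition.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 11 features, to mirror the notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

# Scaling happens inside the pipeline, separately for every fold
pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=2, gamma="auto", coef0=1, C=5),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```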

<div style="background-color:rgba(128, 0, 128, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Submission</h1>
</div>

In [23]:
ss = create_submission("", TARGET, preds)
ss

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,0
