# Accelerated version

In this case we use [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex), which accelerates the execution of Scikit-learn calls.
The two lines below are the only difference from the previous kernel:


In [1]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
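
If the notebook also has to run on a machine where the extension is not installed, the patching step could be guarded as in the minimal sketch below. The `accelerated` flag is a hypothetical name introduced only for illustration; it is not part of the original kernel.

```python
# Guarded patching: fall back to stock scikit-learn if sklearnex is missing.
try:
    from sklearnex import patch_sklearn

    # Patch before importing any scikit-learn estimators so that
    # later imports pick up the accelerated implementations.
    patch_sklearn()
    accelerated = True  # hypothetical flag, for illustration only
except ImportError:
    # Assumption: the notebook should still run, just without acceleration.
    accelerated = False

print("Accelerated:", accelerated)
```

The rest of the notebook stays unchanged either way, since the patch only swaps the implementation behind the usual `sklearn` imports.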


This notebook is based on the Kaggle solution https://www.kaggle.com/napetrov/tps04-svm-with-scikit-learn-intelex for the Tabular Playground Series - Apr 2021.

In [2]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

The next set of cells reads the data and performs feature engineering.

In [3]:
train = pd.read_csv('./SVM/train.csv', index_col='PassengerId')
test = pd.read_csv('./SVM/test.csv', index_col='PassengerId')

target = train.pop('Survived')

In [4]:
train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [5]:
test_prepared = test.copy()
train_prepared = train.copy()

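# Fill missing values using statistics computed on the training set only.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
test_prepared['Age'] = test_prepared['Age'].fillna(train['Age'].median())
train_prepared['Age'] = train_prepared['Age'].fillna(train['Age'].median())

test_prepared['Fare'] = test_prepared['Fare'].fillna(train['Fare'].median())
train_prepared['Fare'] = train_prepared['Fare'].fillna(train['Fare'].median())

# Missing embarkation ports are filled with 'S'.
test_prepared['Embarked'] = test_prepared['Embarked'].fillna('S')
train_prepared['Embarked'] = train_prepared['Embarked'].fillna('S')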


In [6]:
for col in ['Pclass', 'Sex', 'Embarked']:
    le = LabelEncoder()
    le.fit(train_prepared[col])
    train_prepared[col] = le.transform(train_prepared[col])
    test_prepared[col] = le.transform(test_prepared[col])
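
`LabelEncoder` is documented by scikit-learn as a target encoder, and its `transform` raises an error on categories it did not see during `fit`. As a hedged alternative, the same integer encoding of feature columns could be sketched with `OrdinalEncoder`, which the notebook already imports. The toy frames and the `-1` code for unseen values are assumptions made for illustration, and `handle_unknown="use_encoded_value"` requires scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for train_prepared / test_prepared (illustrative values only).
toy_train = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["S", "X"]})  # "X" never appears in training

cols = ["Sex", "Embarked"]

# Unseen categories are mapped to -1 instead of raising an error (assumed policy).
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[cols])
test_codes = encoder.transform(toy_test[cols])

print(test_codes)
```

Categories are sorted before codes are assigned, so `female` maps to 0 and `male` to 1 here, matching what `LabelEncoder` would produce per column.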

In [7]:
train_prepared.head()


| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 39.0 | 2 | 0 | 27.14 | 2 |
| 1 | 2 | 1 | 39.0 | 0 | 0 | 13.35 | 2 |
| 2 | 2 | 1 | 0.33 | 1 | 2 | 71.29 | 2 |
| 3 | 2 | 1 | 19.0 | 0 | 0 | 13.04 | 2 |
| 4 | 2 | 1 | 25.0 | 0 | 0 | 7.76 | 2 |


In [8]:
train_prepared_scaled = train_prepared.copy()
test_prepared_scaled = test_prepared.copy()

scaler = StandardScaler()
scaler.fit(train_prepared)
train_prepared_scaled = scaler.transform(train_prepared_scaled)
test_prepared_scaled = scaler.transform(test_prepared_scaled)

train_prepared_scaled = pd.DataFrame(train_prepared_scaled, columns=train_prepared.columns)
test_prepared_scaled = pd.DataFrame(test_prepared_scaled, columns=train_prepared.columns)
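
To make the scaling step concrete, here is a small NumPy sketch of what standardization does: each column is centered and divided by its standard deviation, with both statistics taken from the training data and then reused for the test data. The toy arrays are made up for illustration.

```python
import numpy as np

# Toy data standing in for the prepared train/test features (illustrative values only).
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Statistics come from the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses

train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std  # test data reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

After this transformation the training columns have zero mean and unit variance, which matters for an RBF kernel because it depends on distances between samples.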


In [9]:
X_train, X_valid, y_train, y_valid = train_test_split(train_prepared_scaled, target, test_size=0.1, random_state=0)

Here we train an SVM with an RBF kernel. The code cell is timed, and you can see a noticeable reduction in training time: around 10x on a 2-core system.

In [10]:
%%time
svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01)
svc_kernel_rbf.fit(X_train, y_train)
y_pred = svc_kernel_rbf.predict(X_valid)
accuracy_score(y_valid, y_pred)

CPU times: user 1min 11s, sys: 1.26 s, total: 1min 12s
Wall time: 1min 12s


0.7613

For SVM prediction we observe an even greater acceleration: around 20x on a 2-core system.

In [11]:
%%time
final_pred = svc_kernel_rbf.predict(test_prepared_scaled)

CPU times: user 16 s, sys: 20.3 ms, total: 16 s
Wall time: 16 s
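
The `%%time` magic only works inside IPython or Jupyter. To reproduce this kind of comparison in a plain script, wall time could be measured with a small helper like the sketch below. `wall_timer` and `speedup` are hypothetical helpers, and the workload is a stand-in rather than the SVM calls above.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label):
    """Measure wall-clock time of the enclosed block (hypothetical helper)."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} s")


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of the unpatched wall time to the patched wall time."""
    return baseline_seconds / accelerated_seconds


# Stand-in workload; in practice this would wrap fit() or predict().
with wall_timer("toy workload") as t:
    total = sum(i * i for i in range(100_000))
```

Running the same block once with and once without `patch_sklearn()` and passing both wall times to `speedup` gives the kind of ratio quoted in this notebook.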


With such dramatic acceleration, we can now perform a hyperparameter search in reasonable time!
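
As a rough illustration of what such a search could look like, the sketch below runs scikit-learn's `GridSearchCV` over a small grid of `C` and `gamma` values. The synthetic dataset, the grid values, and the 3-fold setting are all assumptions chosen to keep the example fast; they are not taken from the original kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (7 features, like the prepared frame).
X_toy, y_toy = make_classification(n_samples=200, n_features=7, random_state=0)

# Illustrative grid; real values would need to be chosen for the actual dataset.
param_grid = {"C": [0.01, 0.1, 1.0], "gamma": ["scale", 0.1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 4))
```

Each grid point is trained once per fold, so the total cost grows with the product of grid size and fold count, which is where faster `fit` and `predict` calls pay off.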

In [12]:
%%time
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(train_prepared_scaled, target)):
    print("Running Fold {}".format(fold + 1))
    X_train, X_valid = pd.DataFrame(train_prepared_scaled.iloc[train_index]), pd.DataFrame(train_prepared_scaled.iloc[valid_index])
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]
    svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01)
    svc_kernel_rbf.fit(X_train, y_train)
    print("  Accuracy: {}".format(accuracy_score(y_valid, svc_kernel_rbf.predict(X_valid))))
    y_pred += svc_kernel_rbf.predict(test_prepared_scaled)

y_pred /= n_folds

print("")
print("Done!")

Running Fold 1
  Accuracy: 0.5406
Running Fold 2
  Accuracy: 0.7649
Running Fold 3
  Accuracy: 0.6269
Running Fold 4
  Accuracy: 0.7681
Running Fold 5
  Accuracy: 0.7622
Running Fold 6
  Accuracy: 0.7653
Running Fold 7
  Accuracy: 0.768
Running Fold 8
  Accuracy: 0.7692
Running Fold 9
  Accuracy: 0.7653
Running Fold 10
  Accuracy: 0.7674

Done!
CPU times: user 11min 50s, sys: 9.01 s, total: 11min 59s
Wall time: 11min 58s
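
After the loop, `y_pred` holds, for each test row, the fraction of folds that predicted class 1. One possible way to turn these fractions into final labels is a majority vote, sketched here on made-up votes. The 0.5 threshold and the tie-breaking toward class 0 are assumptions, not something the original kernel specifies.

```python
import numpy as np

# Made-up per-fold predictions: 3 folds (rows) for 4 test samples (columns).
fold_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
])

# Same accumulation as in the loop above: sum the votes, then divide by the fold count.
avg = fold_votes.sum(axis=0) / fold_votes.shape[0]

# Majority vote; with an even number of folds an exact 0.5 tie falls to class 0 here.
final_labels = (avg > 0.5).astype(int)

print(final_labels)
```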
