## Support Vector Machine with Titanic Dataset

The Titanic dataset contains data for 891 of the real Titanic passengers. Each row represents a single person. The columns describe attributes of that person, including whether they survived, their age, passenger class, sex, and the fare they paid.

In this notebook we will use an SVM classification model to predict who survived the disaster.

### About the algorithm

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support-vector machines, a data point is viewed as a *p*-dimensional vector (a list of *p* numbers), and we want to know whether we can separate such points with a *(p-1)*-dimensional hyperplane. 

(Source: Wikipedia, https://en.wikipedia.org/wiki/Support-vector_machine)
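To make the hyperplane idea concrete, here is a minimal NumPy sketch that is not part of the original notebook. The weight vector `w`, the offset `b`, and the sample points are made-up values chosen for illustration; a trained SVM would learn `w` and `b` from the data.

```python
import numpy as np

# Hypothetical hyperplane in 2-D (p = 2): the line x0 + x1 - 1 = 0.
w = np.array([1.0, 1.0])
b = -1.0

# Three illustrative points, each a p-dimensional vector.
points = np.array([[2.0, 2.0], [0.0, 0.0], [1.5, 0.5]])

# Value of w.x + b for each point: positive on one side of the
# hyperplane, negative on the other.
scores = points @ w + b

# Assign class 1 to the positive side and class 0 to the negative side.
labels = (scores > 0).astype(int)
print(labels)
```

The sign of `w.x + b` is all that decides the class here; the magnitude indicates how far the point sits from the boundary, in units that depend on the scale of `w`.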

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. The support vectors are the training observations that lie closest to the decision boundary; they alone determine where the separating hyperplane is placed.

For more information, see: https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72

In [None]:
import os
path = '/rapids/notebooks/ml_tutorial/ML_intro/data/'
if not os.path.exists(str(path + 'titanic')):
    print("error, data is missing!")

## 1. Loading the Data

In [None]:
import cudf
titanic_gdf = cudf.read_csv(path + 'titanic/train.csv')  # 891 rows x 12 columns

## 2. Cleaning the Data 

In [None]:
# Boolean masks for passengers who survived and who died
survived = titanic_gdf['Survived'] == 1
die = titanic_gdf['Survived'] == 0


In [None]:
import numpy as np
# Encode Sex as a number: male = 1, female = 0
sex = np.zeros(len(titanic_gdf))
sex[titanic_gdf['Sex']== 'male'] = 1
sex[titanic_gdf['Sex']== 'female'] = 0
titanic_gdf['Sex'] = sex

In [None]:
# Fill missing Embarked values with the most common port ('S')
titanic_gdf['Embarked'] = titanic_gdf['Embarked'].fillna('S')
# Encode the port as a number: S = 1, C = 2, Q = 3
embarked = np.zeros(len(titanic_gdf))
embarked[titanic_gdf['Embarked']== 'S'] = 1
embarked[titanic_gdf['Embarked']== 'C'] = 2
embarked[titanic_gdf['Embarked']== 'Q'] = 3
titanic_gdf['Embarked'] = embarked

In [None]:
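# Fill missing Age values with the mean age.
# Assigning the result back avoids relying on an in-place update of a column selection.
titanic_gdf['Age'] = titanic_gdf['Age'].fillna(titanic_gdf['Age'].mean())
age = np.zeros(len(titanic_gdf))
# Encode Age into three groups: under 20 = 1, 20 to 59 = 2, 60 and over = 3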
age[titanic_gdf['Age']<20] = 1
age[(titanic_gdf['Age']>=20)&(titanic_gdf['Age']<60)] = 2
age[(titanic_gdf['Age']>=60)] = 3
titanic_gdf['Age'] = age
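The age grouping above can also be checked on the CPU with plain NumPy. The following is a small sketch, not the notebook's method: it uses `np.digitize` with the same two thresholds (20 and 60) on a handful of made-up ages to confirm which code each boundary value receives.

```python
import numpy as np

# Illustrative ages, including the boundary values 20 and 60.
ages = np.array([5.0, 20.0, 59.0, 60.0, 80.0])

# Bin edges matching the thresholds used above: <20, 20 to <60, >=60.
edges = np.array([20.0, 60.0])

# np.digitize returns 0, 1, or 2 for the three intervals;
# adding 1 gives the codes 1, 2, 3.
age_codes = np.digitize(ages, edges) + 1
print(age_codes)
```

With the default `right=False`, a value equal to an edge falls into the upper interval, which matches the `>=20` and `>=60` comparisons in the cell above.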

In [None]:
# Drop the columns we are not using for the analysis
titanic_gdf = titanic_gdf.drop(columns= ['PassengerId','Name','Ticket','Cabin'])

In [None]:
titanic_gdf.head()

## 3. Splitting the Data into Training and Testing

In [None]:
import cudf
from cuml.preprocessing.model_selection import train_test_split

target = titanic_gdf['Survived']
# Remove the target column from the features (columns= makes the axis explicit)
titanic_gdf = titanic_gdf.drop(columns=['Survived'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(titanic_gdf, target,
                                                    train_size=0.8, shuffle = False)
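With `shuffle = False` the rows keep their original order, so the training set is the leading part of the table and the test set is the tail. The sketch below illustrates that idea in NumPy with a hypothetical helper, `ordered_split`; it assumes the cut is made at `int(n_rows * train_size)`, which may differ from how cuML rounds internally.

```python
import numpy as np

def ordered_split(X, y, train_size):
    """Hypothetical helper: cut at a fixed row index without shuffling."""
    n_train = int(len(X) * train_size)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Ten illustrative rows with two features each, and alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

X_tr, X_te, y_tr, y_te = ordered_split(X, y, train_size=0.8)
print(X_tr.shape, X_te.shape)
```

Because nothing is shuffled, any ordering in the source file carries over into the split, so the test rows may not be representative if the file is sorted in a meaningful way.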

## 4. Modelling

### 4.1. Define Parameters

In [None]:
C = 30          # Penalty parameter C of the error term.
tol = 1e-3      # Tolerance for the stopping criterion.
kernel = 'rbf'  # Kernel type; cuML's SVC accepts 'linear', 'poly', 'rbf', or 'sigmoid'.
gamma = 0.01    # Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
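To give a feel for what `gamma` controls, here is a small NumPy sketch of the standard RBF kernel, `exp(-gamma * ||x - y||^2)`. The function `rbf_kernel` and the two sample points are illustrative and not taken from the notebook or from cuML's internals.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Two illustrative points whose squared distance is 1 + 4 = 5.
a = [1.0, 2.0]
b = [2.0, 4.0]

# Compare a small gamma (as used above) with a larger one.
for g in (0.01, 1.0):
    print(g, rbf_kernel(a, b, g))
```

A small `gamma` keeps the kernel value close to 1 even for points that are fairly far apart, which tends to give a smoother decision boundary; a large `gamma` makes similarity fall off quickly with distance.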

### 4.2. Fit the model

In [None]:
import cuml.svm 
cumlSVC = cuml.svm.SVC(C=C,  gamma=gamma, tol=tol, kernel=kernel)
cumlSVC.fit(X_train, y_train)

### 4.3. Prediction

In [None]:
cuml_pred = cumlSVC.predict(X_test)

### 4.4. Scoring

In [None]:
cuml_accuracy = np.sum(cuml_pred.to_array()==y_test.to_array()) / y_test.shape[0] * 100
print("Accuracy: cumlSVC {}%".format(cuml_accuracy))
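Accuracy alone does not show which kind of mistake the model makes. As a rough sketch with made-up labels (not results from this notebook), the same element-wise comparison can be extended to count true and false positives and negatives in NumPy:

```python
import numpy as np

# Illustrative labels only; 1 = survived, 0 = did not survive.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# Same accuracy formula as above: share of matching predictions, in percent.
accuracy = np.mean(y_pred == y_true) * 100

# Confusion-matrix counts.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted survived, did survive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted died, did die
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted survived, did die
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted died, did survive

print("Accuracy: {:.1f}%".format(accuracy))
print("TP={} TN={} FP={} FN={}".format(tp, tn, fp, fn))
```

To apply this to the notebook's results, `y_true` and `y_pred` would be the host arrays obtained from `y_test` and `cuml_pred`.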