### Author: Kubam Ivo
### Date: 8/17/2020
### Purpose: Titanic Kaggle competition

## Support Vector Classifier

A support vector classifier (SVC) was used to try to improve on the performance of the baseline model (logistic regression). With the kernel set to linear (gamma was set to scale, although gamma has no effect with a linear kernel), 5-fold cross-validation gave an AUC of 0.77, an accuracy of 0.79 and a weighted F1 score of 0.78. The test dataset received the same preprocessing as the train dataset. The Kaggle score for the support vector classifier (0.76555) was slightly lower than that of the simple logistic regression.

In [58]:
#Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score


In [45]:
#Importing the clean dataset
train_data = pd.read_csv("train_clean1")
#train_data = train_data.drop(["Fare","Age"],axis=1)
train_data.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,0,0,1,0,1,0,0,1
1,1,38.0,1,0,71.2833,1,0,0,1,0,1,0,0
2,1,26.0,0,0,7.925,0,0,1,1,0,0,0,1
3,1,35.0,1,0,53.1,1,0,0,1,0,0,0,1
4,0,35.0,0,0,8.05,0,0,1,0,1,0,0,1


In [46]:
# Importing the test data set
test_data = pd.read_csv("test.csv")

In [47]:
# Extracting the features
X = train_data.iloc[:,1:]
#Extracting the labels
y = train_data["Survived"]

In [53]:
#Initialising model class
svm = SVC(kernel = 'linear',gamma='scale')
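
scikit-learn documents `gamma` as a coefficient for the `rbf`, `poly` and `sigmoid` kernels, so with `kernel='linear'` the `gamma='scale'` setting should not influence the fitted model. A minimal sketch that checks this on synthetic data (the `make_classification` dataset and the names `X_demo`, `y_demo`, `svm_scale`, `svm_other` are assumptions for illustration, not the Titanic features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Two linear SVCs that differ only in gamma
svm_scale = SVC(kernel="linear", gamma="scale").fit(X_demo, y_demo)
svm_other = SVC(kernel="linear", gamma=10.0).fit(X_demo, y_demo)

# Compare the predictions and the learned weight vectors
same_predictions = np.array_equal(svm_scale.predict(X_demo), svm_other.predict(X_demo))
same_weights = np.allclose(svm_scale.coef_, svm_other.coef_)
print(same_predictions, same_weights)
```

If both comparisons hold, the regularisation strength `C` is the hyperparameter left to tune for this kernel.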

In [54]:
# 5 fold cross validation
y_pred = cross_val_predict(svm, X, y, cv=5)

In [55]:
# Confusion matrix
confusion_matrix(y,y_pred, labels=[0, 1])

array([[469,  80],
       [110, 232]], dtype=int64)

In [56]:
print(classification_report(y, y_pred, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       549
           1       0.74      0.68      0.71       342

   micro avg       0.79      0.79      0.79       891
   macro avg       0.78      0.77      0.77       891
weighted avg       0.78      0.79      0.78       891



In [57]:
# AUC computed from the hard 0/1 predictions (no continuous scores are used)
roc_auc_score(y,y_pred)

0.766321541558815
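
`roc_auc_score` is normally given continuous scores rather than hard class labels, because hard labels provide only a single operating point on the ROC curve. A hedged sketch of how cross-validated decision scores could be obtained with `cross_val_predict(..., method="decision_function")`; it runs on synthetic data (`X_demo` and `y_demo` are placeholders, not the Titanic features), so its numbers say nothing about the model in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = SVC(kernel="linear")

# Hard 0/1 labels from 5-fold cross-validation
labels = cross_val_predict(clf, X_demo, y_demo, cv=5)

# Signed distances to the separating hyperplane from the same folds
scores = cross_val_predict(clf, X_demo, y_demo, cv=5, method="decision_function")

auc_from_labels = roc_auc_score(y_demo, labels)
auc_from_scores = roc_auc_score(y_demo, scores)
print(auc_from_labels, auc_from_scores)
```

Applied to this notebook, the same call with `svm`, `X` and `y` would give a score-based AUC that can be compared with other models on equal terms.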

In [59]:
#Accuracy
accuracy_score(y, y_pred)

0.7867564534231201
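
The reported metrics can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below uses only the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows = true class, columns = predicted class)
tn, fp = 469, 80
fn, tp = 110, 232
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Per-class recall, precision and F1
recall_0, recall_1 = tn / (tn + fp), tp / (tp + fn)
precision_0, precision_1 = tn / (tn + fn), tp / (tp + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Support-weighted F1, as in the "weighted avg" row of the report
support_0, support_1 = tn + fp, tp + fn
weighted_f1 = (support_0 * f1_0 + support_1 * f1_1) / total

# Mean of the two recalls (balanced accuracy)
balanced_accuracy = (recall_0 + recall_1) / 2

print(round(accuracy, 4), round(weighted_f1, 4), round(balanced_accuracy, 4))
```

The mean of the two recalls reproduces the 0.7663 returned by `roc_auc_score(y, y_pred)`, which is what one would expect when AUC is computed from hard 0/1 labels rather than continuous scores.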

In [60]:
# Fitting the support vector classifier on the full training set
svm.fit(X,y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [61]:
test_data1 = pd.get_dummies(test_data,columns=["Pclass","Sex","Embarked"])


In [62]:
test_data1 = test_data1.drop(["PassengerId","Name", "Ticket", "Cabin"], axis=1)


In [63]:
test_data1.head(5)

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,34.5,0,0,7.8292,0,0,1,0,1,0,1,0
1,47.0,1,0,7.0,0,0,1,1,0,0,0,1
2,62.0,0,0,9.6875,0,1,0,0,1,0,1,0
3,27.0,0,0,8.6625,0,0,1,0,1,0,0,1
4,22.0,1,1,12.2875,0,0,1,1,0,0,0,1
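
Here the one-hot columns of the test set happen to match those of the train set. When `pd.get_dummies` is run separately on two files, a category missing from one of them produces mismatched columns. A minimal sketch of one way to guard against that with `DataFrame.reindex`, using made-up toy frames (`train_raw` and `test_raw` are hypothetical, not the competition data):

```python
import pandas as pd

# Toy frames (made up): the test frame contains no "C" or "Q" passengers
train_raw = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test_raw = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [7.83, 9.69]})

train_enc = pd.get_dummies(train_raw, columns=["Embarked"])
test_enc = pd.get_dummies(test_raw, columns=["Embarked"])

# Force the test frame onto the training columns, in the same order;
# dummy columns absent from the test frame are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```
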


In [64]:
test_data1.isnull().sum()

Age           86
SibSp          0
Parch          0
Fare           1
Pclass_1       0
Pclass_2       0
Pclass_3       0
Sex_female     0
Sex_male       0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
dtype: int64

In [65]:
# Handling missing values
#test_data1["Age"] = test_data1["Age"].fillna(29.7) # imputing the mean value of 29.7 for all missing ages
test_data1["Fare"] = test_data1["Fare"].fillna(32.2) # imputing 32.2 for the single missing fare

# Handling missing values in the Age column using linear regression
from sklearn.linear_model import LinearRegression

# Extracting the predictors for Age (every column after Age)
X_age = train_data.iloc[:,2:]
# Extracting the regression target
y_age = train_data["Age"]

reg = LinearRegression().fit(X_age,y_age)
print(reg.score(X_age,y_age)) # R^2 on the training data


age_missing = pd.isnull(test_data1["Age"]) # Boolean mask of rows with null values for Age
missing_idx = test_data1[age_missing].index # Index of all rows with null values for Age
x_test = test_data1.iloc[missing_idx,1:] # Test rows where Age is null, predictors only
age_pred = reg.predict(x_test) # Predicting Age
test_data1.iloc[missing_idx,0] = age_pred # Replacing all null ages in test_data1 with predicted values
test_data1.isnull().sum()


0.32237586235082505


Age           0
SibSp         0
Parch         0
Fare          0
Pclass_1      0
Pclass_2      0
Pclass_3      0
Sex_female    0
Sex_male      0
Embarked_C    0
Embarked_Q    0
Embarked_S    0
dtype: int64
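
The regression-based imputation above can also be written with a boolean mask and `.loc`, which selects by column name instead of position and does not rely on the index being a default range. A small sketch on made-up toy frames (`train_toy`, `test_toy` and the two predictors chosen are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up) with the same kind of columns as the notebook
train_toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "SibSp": [1, 1, 0, 1, 0, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
})
test_toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "SibSp": [0, 1, 0, 0],
    "Fare": [7.83, 7.00, 9.69, 8.66],
})

predictors = ["SibSp", "Fare"]
age_model = LinearRegression().fit(train_toy[predictors], train_toy["Age"])

# Fill only the rows where Age is missing; known ages are left untouched
missing = test_toy["Age"].isna()
test_toy.loc[missing, "Age"] = age_model.predict(test_toy.loc[missing, predictors])
print(test_toy["Age"].isna().sum())
```

With an R² of about 0.32 on the training data, the regression explains only part of the variation in age, so the imputed values are rough estimates.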

In [43]:
pred = svm.predict(test_data1)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': pred})
output.to_csv('my_svm1.csv', index=False)