## A notebook for Kaggle's Titanic Intro Challenge using SVM for binary classification.

Import the necessary libraries and read the provided data from the .csv files.

In [43]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import svm

In [153]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Display the first few rows of the training data.

In [154]:
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


From the given data we choose the passenger class, sex, age and fare as features, and the survival flag as the target.

In [185]:
column_features = ['Pclass', 'Sex', 'Age', 'Fare']
column_target = ['Survived']

In [186]:
features_train = train_df[column_features].copy()
target_train = train_df[column_target].copy()

### Now we perform basic preprocessing of the data.

We start by converting the entries in the 'Sex' column into numbers (0/1).

In [187]:
# Map the two categories to integers and assign the result back to the column.
features_train['Sex'] = features_train['Sex'].map({'female': 0, 'male': 1})

We check whether the command gave the desired result in the dataset.

In [188]:
features_train.head()

| | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 7.25 |
| 1 | 1 | 0 | 38.0 | 71.2833 |
| 2 | 3 | 0 | 26.0 | 7.925 |
| 3 | 1 | 0 | 35.0 | 53.1 |
| 4 | 3 | 1 | 35.0 | 8.05 |
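The encoding step can be tried in isolation on a toy column. This is only a sketch: the frame `toy` and its values are made up for illustration, and `map` with a dictionary is one of several ways to do the conversion.

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# map() returns a new Series; assigning it back replaces the text labels
# with the integers 0 (female) and 1 (male).
toy['Sex'] = toy['Sex'].map({'female': 0, 'male': 1})

print(toy['Sex'].tolist())
```

Any label that is missing from the dictionary would become NaN, so it is worth checking `isnull().sum()` afterwards, as the notebook does for the other columns.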


We can now proceed to cleaning NaN values if they occur.

In [189]:
features_train.isnull().sum()

Pclass      0
Sex         0
Age       177
Fare        0
dtype: int64

We see that many entries in the 'Age' column are missing (177 NaN values). Replacing them with zeros would not be a good idea, since it would distort the age distribution. It is more suitable to replace them with the mean or median of the remaining values in the column.

If we just use the median, we get the following.

In [190]:
features_train['Age'].median()

28.0

Note that pandas skips NaN values when computing the median, so 28.0 is already the median of the known ages. As an alternative, we can use the **mean**: dividing the total sum by the number of non-empty cells gives the mean age of the passengers whose age is known.

In [191]:
features_train['Age'].sum()

21205.169999999998

In [192]:
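# Despite its name, this variable holds the mean of the known ages
# (equivalent to features_train['Age'].mean()).
median = features_train['Age'].sum()/(features_train['Age'].count())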

In [193]:
median

29.69911764705882
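The computation above can be checked on a toy series. The sketch below uses made-up ages and shows that `sum()` and `count()` both skip NaN, so their ratio equals `mean()`, while `median()` also ignores NaN and gives a different value.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([20.0, 30.0, np.nan, 70.0])

# sum() skips NaN and count() counts only non-missing entries,
# so this ratio is the mean of the known values.
mean_by_hand = ages.sum() / ages.count()

print(mean_by_hand)    # 40.0
print(ages.mean())     # 40.0, same value
print(ages.median())   # 30.0, the median of 20, 30 and 70
```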

The mean age (29.70) is slightly higher than the median (28.0).

Now we can replace the NaN values with this mean, which is stored in the variable `median`.

In [194]:
features_train['Age'] = features_train['Age'].fillna(median)

We can double-check that the action above removed the NaN values.

In [195]:
features_train.isnull().sum()

Pclass    0
Sex       0
Age       0
Fare      0
dtype: int64

In [196]:
features_train.head()

| | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 7.25 |
| 1 | 1 | 0 | 38.0 | 71.2833 |
| 2 | 3 | 0 | 26.0 | 7.925 |
| 3 | 1 | 0 | 35.0 | 53.1 |
| 4 | 3 | 1 | 35.0 | 8.05 |


We can now check the target dataset as well.

In [197]:
target_train.isnull().sum()

Survived    0
dtype: int64

In [198]:
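# Convert the single-column DataFrame to a flat 1-D NumPy array
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it).
target_train = target_train.to_numpy().ravel()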
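The flattening step turns an (n, 1) column into a 1-D array of length n, which is the shape scikit-learn expects for a target vector. A small sketch with made-up values, using `to_numpy()`:

```python
import pandas as pd

# Toy target column (values are made up).
toy = pd.DataFrame({'Survived': [0, 1, 1]})

column = toy[['Survived']]        # double brackets keep a DataFrame, shape (3, 1)
y = column.to_numpy().ravel()     # flat NumPy array, shape (3,)

print(column.shape, y.shape)
```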

### Now we will do the same for the testing dataset.

In [199]:
features_test = test_df[column_features].copy()

In [200]:
features_test.isnull().sum()

Pclass     0
Sex        0
Age       86
Fare       1
dtype: int64

In [201]:
# As before, this is the mean of the known ages, despite the variable name.
median = features_test['Age'].sum()/(features_test['Age'].count())

In [202]:
median

30.272590361445783

In [203]:
features_test['Age'] = features_test['Age'].fillna(median)

Here, there is a missing entry in the 'Fare' column as well, so we also have to fill that in. This time we use the median.

In [204]:
median_fare = features_test['Fare'].median()

In [205]:
median_fare

14.4542

In [206]:
features_test['Fare'] = features_test['Fare'].fillna(median_fare)

In [207]:
# Same encoding as for the training set: female -> 0, male -> 1.
features_test['Sex'] = features_test['Sex'].map({'female': 0, 'male': 1})

In [208]:
features_test.head()

| | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 0 | 3 | 1 | 34.5 | 7.8292 |
| 1 | 3 | 0 | 47.0 | 7.0 |
| 2 | 2 | 1 | 62.0 | 9.6875 |
| 3 | 3 | 1 | 27.0 | 8.6625 |
| 4 | 3 | 0 | 22.0 | 12.2875 |


In [209]:
features_test.isnull().sum()

Pclass    0
Sex       0
Age       0
Fare      0
dtype: int64

### Now we can proceed to applying the SVM algorithm to the training dataset.

In [220]:
clf = svm.SVC(C = 1.0, kernel = 'linear')

We can now fit the classifier. Apart from the linear kernel, all parameters keep their default values.

In [221]:
clf.fit(features_train, target_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
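Before predicting on the test set, one might want a rough estimate of how well such a classifier does. The following is a minimal sketch using cross-validation on synthetic data; the arrays `X` and `y` are randomly generated stand-ins, not the Titanic features, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 100 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Same kind of classifier as in the notebook: a linear-kernel SVC.
clf_sketch = svm.SVC(C=1.0, kernel='linear')

# 5-fold cross-validation returns one accuracy value per fold.
scores = cross_val_score(clf_sketch, X, y, cv=5)
print(scores.mean())
```

With the real data, `features_train` and `target_train` would take the place of `X` and `y`.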

Now that we have trained our classifier, we can predict the result for our testing dataset.

In [222]:
target_test = clf.predict(features_test)

Once we have the predictions for the test dataset, we convert them to a pandas DataFrame and add them as a 'Survived' column to a DataFrame containing the passenger IDs.

In [223]:
target_test = pd.DataFrame(target_test)

In [224]:
result = test_df[['PassengerId']].copy()

In [225]:
result = result.assign(Survived = target_test)

In [226]:
result.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
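The same two-column result can also be built in a single step from the IDs and the prediction array. This is just a sketch with made-up values; `ids`, `predictions` and `submission` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up passenger IDs and predictions.
ids = pd.Series([892, 893, 894], name='PassengerId')
predictions = np.array([0, 1, 0])

# Build the frame directly from the two columns.
submission = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})

# to_csv() without a path returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```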


**We are done! All that is left is to export the resulting DataFrame to a .csv file and upload it to Kaggle.**

In [227]:
result.to_csv('results-SVM.csv', index = False)