# Binary classification using the Titanic dataset

One of the classic public datasets used to demonstrate binary classification is the Titanic dataset, which lists the passengers aboard the RMS Titanic when it sank on April 15, 1912. The dataset includes the name of each passenger as well as other information such as the fare class, the fare price, the person's age and gender, and whether that person survived the sinking of the ship. In this example, we will build a binary-classification model that predicts whether a passenger will survive. We will build the model two ways — first as a logistic-regression model, and then as a Support Vector Machine (SVM) model — and compare the results.

![](Images/titanic.png)

## Load and prepare the dataset

The first step is to load the dataset and prepare it for training a machine-learning model. One of the reasons the Titanic dataset is popular is that it provides ample opportunity for data scientists to practice their data-cleaning skills.

In [1]:
# Load the dataset
import pandas as pd

df = pd.read_csv('Data/titanic.csv')
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


We'll drop columns such as "PassengerId" and "Name" that have no bearing on the outcome. We will also drop the "Fare" column because there is colinearity between that column and the "Pclass" column. Finally, we will one-hot-encode the "Sex" and "Pclass" columns, and remove rows containing missing values.

In [3]:
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked', 'Fare'], axis=1, inplace=True)
df = pd.get_dummies(df, columns=['Sex', 'Pclass'])
df.dropna(inplace=True)
df.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3
0,0,22.0,1,0,0,1,0,0,1
1,1,38.0,1,0,1,0,1,0,0
2,1,26.0,0,0,1,0,0,0,1
3,1,35.0,1,0,1,0,1,0,0
4,0,35.0,0,0,0,1,0,0,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 9 columns):
Survived      714 non-null int64
Age           714 non-null float64
SibSp         714 non-null int64
Parch         714 non-null int64
Sex_female    714 non-null uint8
Sex_male      714 non-null uint8
Pclass_1      714 non-null uint8
Pclass_2      714 non-null uint8
Pclass_3      714 non-null uint8
dtypes: float64(1), int64(3), uint8(5)
memory usage: 31.4 KB


Now let's see which input variables have the mosty influence on the outcome.

In [5]:
df.corr()["Survived"].sort_values(ascending=False)

Survived      1.000000
Sex_female    0.538826
Pclass_1      0.301831
Parch         0.093317
Pclass_2      0.084753
SibSp        -0.017358
Age          -0.077221
Pclass_3     -0.337587
Sex_male     -0.538826
Name: Survived, dtype: float64

The number of parents accompanying the passenger ("Parch") and the number of siblings ("SibSp") have little effect on the outcome, so we'll remove those columns.

In [6]:
df.drop(['Parch', 'SibSp'], axis=1, inplace=True)

The final step is to split the data into two datasets: one for training and one for testing.

In [7]:
from sklearn.model_selection import train_test_split

x = df.drop('Survived', axis=1)
y = df['Survived']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)

## Build and train a logistic-regression model

Our first classifier will use logistic regression. One of the advantages of logistic regression is that it will not only make predictions, it will give you probabilities as well.

In [8]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x, y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [9]:
# Get the overall accuracy of the model
model.score(x_test, y_test)

0.7762237762237763

In [10]:
# Show the confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model.predict(x_test))

array([[69, 16],
       [16, 42]])

In [11]:
from sklearn.metrics import classification_report

predictions = model.predict(x_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.81      0.81      0.81        85
           1       0.72      0.72      0.72        58

   micro avg       0.78      0.78      0.78       143
   macro avg       0.77      0.77      0.77       143
weighted avg       0.78      0.78      0.78       143



Now let's the use the model to predict whether a 30-year-old female traveling in first class will survive the voyage.

In [12]:
input = [30, 1, 0, 1, 0, 0]
model.predict([input])

array([1])

More to the point, what is the probability that a 30-year-old female traveling in first class will survive?

In [13]:
probability = model.predict_proba([input])[0][1]
print('Probability of survival: {:.1%}'.format(probability))

Probability of survival: 92.7%


## Build and train an SVM model

Support-vector classifiers (classifiiers that use Support Vector Machines, or SVMs) frequently do better at fitting the data than classifiers that rely on logistic regression. Let's try a support-vector classifier on the same dataset and see if it fares better.

In [14]:
from sklearn.svm import SVC

model = SVC()
model.fit(x, y)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [15]:
# Get the overall accuracy of the model
model.score(x_test, y_test)

0.8181818181818182

In [16]:
# Show the confusion matrix
confusion_matrix(y_test, model.predict(x_test))

array([[71, 14],
       [12, 46]])

In [17]:
predictions = model.predict(x_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.86      0.84      0.85        85
           1       0.77      0.79      0.78        58

   micro avg       0.82      0.82      0.82       143
   macro avg       0.81      0.81      0.81       143
weighted avg       0.82      0.82      0.82       143



Now let's the use the model to predict whether a 30-year-old female traveling in first class will survive the voyage.

In [18]:
model.predict([[30, 1, 0, 1, 0, 0]])

array([1])

How about a 30-year-old male traveling in first class?

In [19]:
model.predict([[30, 0, 1, 1, 0, 0]])

array([0])

Unfortunately, we can't get the probability that a passenger will survive because Support Vector Machines don't compute probabilities.