# Exercise 1

The folder “Data” contains the dataset Titanic.csv, which lists the travellers together with their survival status (Survived: 1 = yes, 0 = no).
Each record also gives the passenger class (Pclass), name (Name), gender (Sex),
age (Age), the number of siblings or spouses aboard, the number of parents or children aboard, and the fare price (Fare).

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
titanic = pd.read_csv('../Data/titanic.csv.zst', index_col='Name')

# Encode Sex as an integer label: 1 = male, 0 = female
titanic['Sex'] = (titanic['Sex'].to_numpy() == 'male').astype(int)
titanic.head(5)

Name,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Mr. Owen Harris Braund,0,3,1,22.0,1,0,7.25
Mrs. John Bradley (Florence Briggs Thayer) Cumings,1,1,0,38.0,1,0,71.2833
Miss. Laina Heikkinen,1,3,0,26.0,0,0,7.925
Mrs. Jacques Heath (Lily May Peel) Futrelle,1,1,0,35.0,1,0,53.1
Mr. William Henry Allen,0,3,1,35.0,0,0,8.05


In [2]:
titanic.describe(include='all')

,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,0.645998,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,0.47848,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292


The only non-numeric feature column, Sex, was encoded as an integer label above (male = 1, female = 0).
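The encoding step can also be written with an explicit mapping. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `sex_codes` dictionary are illustrative assumptions, not part of the exercise); unlike a boolean comparison, it fails loudly if an unexpected label appears instead of silently turning it into 0.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Sex' column of the real dataset.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Explicit mapping: labels missing from the dictionary become NaN.
sex_codes = {'female': 0, 'male': 1}
encoded = toy['Sex'].map(sex_codes)

# Stop early if any label was not covered by the mapping.
if encoded.isna().any():
    raise ValueError('unexpected label in Sex column')

toy['Sex_encoded'] = encoded.astype(int)
print(toy)
```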

## (a) In the correlation visualization, select the two features that have the most significant correlation to the target feature, Survived.

In [3]:
correlation = titanic.corr()['Survived'].abs().sort_values(ascending=False)
del correlation['Survived']
correlation

Sex                        0.542152
Pclass                     0.336528
Fare                       0.256179
Parents/Children Aboard    0.080097
Age                        0.059665
Siblings/Spouses Aboard    0.037082
Name: Survived, dtype: float64
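For reference, `DataFrame.corr()` computes the Pearson correlation by default. A small sketch on made-up values (the `toy` frame and the `pearson` helper are assumptions for illustration, not the Titanic data) shows the same ranking computed by hand and via pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0],
    'Sex':      [1, 0, 0, 0, 1, 1],
    'Fare':     [7.0, 70.0, 8.0, 50.0, 8.0, 9.0],
})


def pearson(x, y):
    """Pearson correlation: covariance over the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())


# Absolute correlation of each feature with the target, computed by hand.
manual = pd.Series(
    {col: abs(pearson(toy[col], toy['Survived'])) for col in ['Sex', 'Fare']}
).sort_values(ascending=False)

# Same quantity via pandas; drop() removes the target's correlation with itself.
via_pandas = toy.corr()['Survived'].abs().drop('Survived').sort_values(ascending=False)

print(manual)
print(via_pandas)
```

Taking the absolute value matters here: a strongly negative correlation is as informative for ranking features as a strongly positive one.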

The two features most strongly correlated with Survived are:

In [4]:
ms_features = list(correlation.index[0:2])
ms_features

['Sex', 'Pclass']

In [5]:
all_features = list(correlation.index)
all_features

['Sex',
 'Pclass',
 'Fare',
 'Parents/Children Aboard',
 'Age',
 'Siblings/Spouses Aboard']

## (b) Using a Naive Bayes classifier and the two most significant features, predict the survival of the travellers.

In [6]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def bayes_accuracy(X_idx):
    X = titanic[X_idx]
    y = titanic['Survived']

    # Fixed seed for a reproducible split; with 224 the accuracies below come out nearly equal
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=224)

    clf = GaussianNB()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy

In [7]:
ms_accuracies = pd.DataFrame(
    data=[[ms_features, bayes_accuracy(X_idx=ms_features)]],
    columns=['Features', 'Accuracy'],
)

print('Accuracy of the most correlated features:')
ms_accuracies

Accuracy of the most correlated features:


,Features,Accuracy
0,"[Sex, Pclass]",0.81982


## (c) Compare the performance of your model when using all the attributes of the travellers.

In [8]:
all_accuracies = pd.DataFrame(
    data=[['All', bayes_accuracy(X_idx=all_features)]],
    columns=['Features', 'Accuracy'],
)

print('Accuracy of using all features:')
all_accuracies

Accuracy of using all features:


,Features,Accuracy
0,All,0.828829


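Interestingly, with the chosen `random_state = 224` the two results are nearly the same: 0.820 with `Sex` and `Pclass` alone versus 0.829 with all features, a difference of two correctly classified travellers in the 222-row test set.
Changing this value makes the results diverge more, so the comparison depends on the particular train/test split.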
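One way to make the comparison less dependent on a single seed is to repeat the split over several seeds and look at the mean and spread of the accuracies. The sketch below illustrates the idea on synthetic data from `make_classification` (an assumption so that the example runs on its own; it is not the Titanic dataset, and `split_accuracy` is a hypothetical helper mirroring `bayes_accuracy`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (NOT the Titanic dataset): 6 features, binary target.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)


def split_accuracy(X, y, seed):
    """Accuracy of GaussianNB on one 75/25 split determined by `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf = GaussianNB().fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


seeds = range(20)
# First two columns as an arbitrary feature subset, versus all columns.
acc_subset = np.array([split_accuracy(X[:, :2], y, s) for s in seeds])
acc_all = np.array([split_accuracy(X, y, s) for s in seeds])

print(f'subset: {acc_subset.mean():.3f} +/- {acc_subset.std():.3f}')
print(f'all:    {acc_all.mean():.3f} +/- {acc_all.std():.3f}')
```

Applied to the exercise, the same loop could pass the seed through to `bayes_accuracy` instead of hard-coding 224; `cross_val_score` from scikit-learn is an alternative that uses disjoint folds rather than repeated random splits.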