# Exercise 1

The folder “Data” contains the dataset Titanic.csv, which describes the passengers and whether each one survived (Survived = 1 for yes, 0 for no).
It also records the passenger class (Pclass), name (Name), gender (Sex),
age (Age), the number of siblings or spouses aboard, the number of parents or children aboard, and the fare paid (Fare).

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
titanic = pd.read_csv('../Data/titanic.csv.zst', index_col='Name')
titanic.head(5)

Name,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Mr. Owen Harris Braund,0,3,male,22.0,1,0,7.25
Mrs. John Bradley (Florence Briggs Thayer) Cumings,1,1,female,38.0,1,0,71.2833
Miss. Laina Heikkinen,1,3,female,26.0,0,0,7.925
Mrs. Jacques Heath (Lily May Peel) Futrelle,1,1,female,35.0,1,0,53.1
Mr. William Henry Allen,0,3,male,35.0,0,0,8.05


In [2]:
titanic.describe(include='all')

,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887,887.0,887.0,887.0,887.0
unique,,,2,,,,
top,,,male,,,,
freq,,,573,,,,
mean,0.385569,2.305524,,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,,0.42,0.0,0.0,0.0
25%,0.0,2.0,,20.25,0.0,0.0,7.925
50%,0.0,3.0,,28.0,0.0,0.0,14.4542
75%,1.0,3.0,,38.0,1.0,0.0,31.1375


We encode Sex as a binary column (female = 0, male = 1) so that it can be used as a numeric feature.

In [3]:
titanic.insert(
    loc=titanic.columns.get_loc('Sex') + 1,
    column='Binary Gender',
    value=titanic['Sex'].map({'female': 0, 'male': 1})
)
titanic.head(5)

Name,Survived,Pclass,Sex,Binary Gender,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Mr. Owen Harris Braund,0,3,male,1,22.0,1,0,7.25
Mrs. John Bradley (Florence Briggs Thayer) Cumings,1,1,female,0,38.0,1,0,71.2833
Miss. Laina Heikkinen,1,3,female,0,26.0,0,0,7.925
Mrs. Jacques Heath (Lily May Peel) Futrelle,1,1,female,0,35.0,1,0,53.1
Mr. William Henry Allen,0,3,male,1,35.0,0,0,8.05
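
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray label such as `'Male'` would silently become a missing value. In this dataset `describe` reports exactly two unique values for `Sex`, so the mapping covers every row. A minimal sketch of a guard for less tidy data, using a small made-up frame (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; 'Male' is deliberately mis-capitalised.
toy = pd.DataFrame({'Sex': ['male', 'female', 'Male']})

encoded = toy['Sex'].map({'female': 0, 'male': 1})

# Values missing from the mapping come back as NaN instead of raising an error.
unmapped = toy.loc[encoded.isna(), 'Sex'].tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every label was encoded.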


## (a) In the correlation visualization, select the two features that have the most significant correlation to the target feature, Survived.

In [4]:
# 'Name' is the index, so only the text column 'Sex' and the target need to be excluded.
features = [c for c in titanic.columns if c not in ('Sex', 'Survived')]
X = titanic[features]
y = titanic['Survived']
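
Task (a) refers to a correlation visualization, while the cells here rank features with `SelectKBest`. As a rough cross-check, the Pearson correlation of each feature with the target can be computed directly with `DataFrame.corrwith`. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy data shaped like a few of the Titanic columns.
toy = pd.DataFrame({
    'Survived':      [0, 1, 1, 1, 0, 0, 1, 0],
    'Pclass':        [3, 1, 3, 1, 3, 2, 2, 3],
    'Binary Gender': [1, 0, 0, 0, 1, 1, 0, 1],
    'Age':           [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 27.0, 2.0],
})

X_toy = toy.drop(columns='Survived')

# Pearson correlation of every feature column with the target.
corr = X_toy.corrwith(toy['Survived'])

# Rank by absolute value: a strong negative correlation is as informative as a positive one.
top_two = corr.abs().nlargest(2)
print(top_two)
```

The same `corr` series could be passed to a bar plot or heatmap to obtain the visualization the task mentions.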

In [5]:
from sklearn.feature_selection import SelectKBest, f_classif, f_regression, chi2
from sklearn.model_selection import train_test_split

score_funcs = [f_classif, f_regression, chi2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

ms_features = []
for f in score_funcs:
    print(f.__name__, ':')
    # Score every feature against the target on the training split
    selector = SelectKBest(score_func=f, k=2).fit(X_train, y_train)
    feature_scores = pd.DataFrame(
        {'Feature': X_train.columns, 'Score': selector.scores_}
    )

    # Keep the two highest-scoring features for this score function
    best_2 = feature_scores.nlargest(2, 'Score')
    print(best_2)
    print('')

    # Record each distinct pair once, in descending score order
    best_2_features = list(best_2['Feature'])
    if best_2_features not in ms_features:
        ms_features.append(best_2_features)

print('Most correlating pair of features:')
print(ms_features)

f_classif :
         Feature       Score
1  Binary Gender  225.705798
0         Pclass   73.585815

f_regression :
         Feature       Score
1  Binary Gender  225.705798
0         Pclass   73.585815

chi2 :
         Feature        Score
5           Fare  2788.932599
1  Binary Gender    58.159436

Most correlating pair of features:
[['Binary Gender', 'Pclass'], ['Fare', 'Binary Gender']]
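
The `chi2` ranking differs from the other two because scikit-learn's `chi2` treats each feature as a non-negative count or frequency, so its statistic grows with the magnitude of the feature, and `Fare` spans a far larger range than `Pclass` or `Binary Gender`. The ANOVA F-score from `f_classif` does not depend on the unit of measurement. A small sketch on synthetic data (the arrays are made up for illustration) shows the effect of rescaling a single column:

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# One non-negative synthetic feature, shifted upward for class 1.
x = rng.uniform(0, 1, size=200) + 0.5 * y

# The same feature expressed in two different units.
X = np.column_stack([x, 100 * x])

chi2_scores, _ = chi2(X, y)
f_scores, _ = f_classif(X, y)

print(chi2_scores[1] / chi2_scores[0])  # ratio of chi2 scores
print(f_scores[1] / f_scores[0])        # ratio of F-scores
```

The first ratio should come out near 100 and the second near 1, which suggests that the high `chi2` score for `Fare` partly reflects its scale rather than a stronger association with `Survived`.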


## (b) Using a Naive Bayes classifier and the two most significant features, predict the survival of the travellers.

In [6]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from typing import Iterable


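def bayes_accuracy(X_idx: Iterable[str]) -> float:
    """Fit GaussianNB on the given columns and return accuracy on a random 25% hold-out."""
    X = titanic[list(X_idx)]
    y = titanic['Survived']

    # A new random split is drawn on every call (no random_state is set)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    clf = GaussianNB()
    clf.fit(X_train, y_train)

    # Score the fitted model on the held-out rows
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy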
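
`bayes_accuracy` draws a fresh random split on every call, so accuracies reported for different feature sets are measured on different test rows and change from run to run. One way to make such comparisons repeatable is to pass `random_state` (and optionally `stratify`) to `train_test_split`. A minimal sketch on made-up data; the seed value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # hypothetical feature matrix
y = np.array([0, 1] * 10)          # hypothetical balanced labels

# Two calls with the same seed return identical partitions.
split_a = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

With a fixed seed, every feature set is evaluated on the same held-out rows.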

In [7]:
ms_accuracies = pd.DataFrame(
    data=[[feats, bayes_accuracy(X_idx=feats)] for feats in ms_features],
    columns=['Features', 'Accuracy'],
)

print('Accuracy of the most correlated features:')
ms_accuracies

Accuracy of the most correlated features:


,Features,Accuracy
0,"[Binary Gender, Pclass]",0.783784
1,"[Fare, Binary Gender]",0.756757


## (c) Compare the performance of your model when using all the attributes of the travellers.

In [8]:
all_accuracies = pd.DataFrame(
    data=[['All', bayes_accuracy(X_idx=features)]],
    columns=['Features', 'Accuracy'],
)

print('Accuracy of using all features:')
all_accuracies

Accuracy of using all features:


,Features,Accuracy
0,All,0.824324
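
Each accuracy above comes from a single random 75/25 split, so a gap of a few percentage points between feature sets may be within split-to-split noise. A hedged sketch of a steadier comparison uses `cross_val_score` with one shared `StratifiedKFold` object, so that every feature set is scored on the same folds. The data below is synthetic and the column names are placeholders, not the Titanic columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)
data = pd.DataFrame({
    'signal_a': target + rng.normal(0, 0.8, size=n),
    'signal_b': target + rng.normal(0, 1.5, size=n),
    'noise': rng.normal(0, 1.0, size=n),
})

# One splitter with a fixed seed, reused for every feature set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
feature_sets = {
    'two features': ['signal_a', 'signal_b'],
    'all features': list(data.columns),
}

results = {
    name: cross_val_score(GaussianNB(), data[cols], target, cv=cv, scoring='accuracy')
    for name, cols in feature_sets.items()
}
for name, scores in results.items():
    print(f'{name}: mean={scores.mean():.3f}, std={scores.std():.3f}')
```

Comparing the fold-wise means together with their standard deviations gives a sense of whether one feature set is reliably better than the other.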
