# Exercise 1

The folder “Data” contains the dataset Titanic.csv, which describes the passengers and whether each one survived (`Survived` = 1 for yes, 0 for no).
It also records the passenger class (`Pclass`), name (`Name`), gender (`Sex`),
age (`Age`), the number of siblings or spouses aboard, the number of parents or children aboard, and the fare price (`Fare`).

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
titanic = pd.read_csv('../Data/titanic.csv.zst', index_col='Name')
titanic.head(5)

Name,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
Mr. Owen Harris Braund,0,3,male,22.0,1,0,7.25
Mrs. John Bradley (Florence Briggs Thayer) Cumings,1,1,female,38.0,1,0,71.2833
Miss. Laina Heikkinen,1,3,female,26.0,0,0,7.925
Mrs. Jacques Heath (Lily May Peel) Futrelle,1,1,female,35.0,1,0,53.1
Mr. William Henry Allen,0,3,male,35.0,0,0,8.05


In [2]:
titanic.describe(include='all')

,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887,887.0,887.0,887.0,887.0
unique,,,2,,,,
top,,,male,,,,
freq,,,573,,,,
mean,0.385569,2.305524,,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,,0.42,0.0,0.0,0.0
25%,0.0,2.0,,20.25,0.0,0.0,7.925
50%,0.0,3.0,,28.0,0.0,0.0,14.4542
75%,1.0,3.0,,38.0,1.0,0.0,31.1375


In [3]:
rules = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

In [4]:
from typing import Iterable
from mlxtend.evaluate import accuracy_score
from sklearn.model_selection import train_test_split
from mlxtend.classifier import OneRClassifier


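def train_and_predict(X_idx: Iterable, y_idx: str) -> float:
    """Fit a OneRClassifier on a random 75/25 split and return the test accuracy."""
    X = titanic[X_idx]
    y = titanic[y_idx]

    # No random_state is set, so every call draws a different split
    # and the reported accuracy varies between runs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    one_r = OneRClassifier()
    one_r.fit(X_train.to_numpy(), y_train)

    y_pred = one_r.predict(X_test.to_numpy())
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy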

## (a) What is the best default rule for this dataset? (Default means without any evidence about the person.)

The best default rule predicts the most frequent class. With a mean of 0.385569 for `Survived`, that class is 0 (did not survive).
If we assume that the "women and children first" strategy was followed when the Titanic sank, the attributes most likely to matter are `Sex` or `Parents/Children Aboard`.

We test our default rule (`Sex` and `Parents/Children Aboard` vs. `Survived`):

In [5]:
default_rule_accuracy = train_and_predict(X_idx=["Sex", "Parents/Children Aboard"], y_idx="Survived")

print('Default rule accuracy:', default_rule_accuracy)

Default rule accuracy: 0.8063063063063063
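
For comparison, a default rule in the strict sense uses no attribute at all and always predicts the majority class. The sketch below is a minimal illustration of that baseline. The helper `default_rule` is a hypothetical name, and the label counts are derived from the `describe()` output above (887 rows, mean of `Survived` 0.385569) rather than read from the CSV. The accuracy is computed on the whole dataset, not on a held-out test set.

```python
from collections import Counter


def default_rule(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Label counts implied by describe(): 887 rows, mean(Survived) = 0.385569,
# i.e. round(887 * 0.385569) = 342 survivors (an assumption derived from the mean).
n_rows = 887
n_survived = round(n_rows * 0.385569)
labels = [1] * n_survived + [0] * (n_rows - n_survived)

majority_class, accuracy = default_rule(labels)
print(majority_class, round(accuracy, 4))  # 0 0.6144
```

Any rule that uses an attribute should beat this baseline of roughly 61% to be worth keeping.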


## (b) What is the best 1R for this dataset?

In [6]:
all_one_r = pd.DataFrame(
    map(lambda rule: [rule, train_and_predict(X_idx=[rule], y_idx='Survived')], rules),
    columns=['Rule', 'Accuracy'],
)

all_one_r

,Rule,Accuracy
0,Pclass,0.711712
1,Sex,0.77027
2,Age,0.513514
3,Siblings/Spouses Aboard,0.630631
4,Parents/Children Aboard,0.630631
5,Fare,0.693694


In [7]:
best_one_r = all_one_r[all_one_r.Accuracy == all_one_r.Accuracy.max()]

print('Best 1R:')
best_one_r

Best 1R:


,Rule,Accuracy
1,Sex,0.77027


We therefore conclude that `Sex` is the best attribute for a 1R rule on the Titanic dataset, at least for this random training/test split.
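
To make the selection step concrete, here is a minimal pure-Python sketch of the 1R idea: for each attribute, map every value to its most frequent class, count the training errors, and keep the attribute with the fewest. It is not the mlxtend implementation; `fit_one_r` is a hypothetical helper and the rows are toy data, not taken from Titanic.csv.

```python
from collections import Counter, defaultdict


def fit_one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the attribute with the fewest training errors."""
    best = None
    for attr in attributes:
        # Count the target classes observed for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # The rule maps each attribute value to its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != row[target] for row in rows)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy rows for illustration only.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 2, "Survived": 1},
]

attr, rule, errors = fit_one_r(rows, ["Sex", "Pclass"], "Survived")
print(attr, rule, errors)  # Sex {'female': 1, 'male': 0} 1
```

On these toy rows `Sex` misclassifies one passenger while `Pclass` misclassifies two, so `Sex` is selected.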

## (c) Can you produce a second rule based on a single attribute with good effectiveness? You need to split the dataset into two disjoint samples, the training set and the test set. For example, use 75% for the training sample and the remaining 25% for the test set.

In [8]:
rules2 = [[a,b] for a in rules for b in rules if rules.index(a) < rules.index(b)]
all_two_r = pd.DataFrame(
    map(lambda rule: [rule, train_and_predict(X_idx=rule, y_idx="Survived")], rules2),
    columns=["Rules", "Accuracy"],
)

all_two_r

,Rules,Accuracy
0,"[Pclass, Sex]",0.720721
1,"[Pclass, Age]",0.626126
2,"[Pclass, Siblings/Spouses Aboard]",0.702703
3,"[Pclass, Parents/Children Aboard]",0.716216
4,"[Pclass, Fare]",0.653153
5,"[Sex, Age]",0.806306
6,"[Sex, Siblings/Spouses Aboard]",0.743243
7,"[Sex, Parents/Children Aboard]",0.779279
8,"[Sex, Fare]",0.698198
9,"[Age, Siblings/Spouses Aboard]",0.585586
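
The pair list built in cell In [8] can also be written with `itertools.combinations`, which avoids the repeated `rules.index` lookups. This is only an alternative sketch; it produces the same pairs in the same order, 15 in total for six attributes.

```python
from itertools import combinations

rules = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

# Every unordered pair of distinct attributes, in the order they appear in `rules`.
rules2 = [list(pair) for pair in combinations(rules, 2)]

print(len(rules2))  # 15
print(rules2[0])    # ['Pclass', 'Sex']
```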


In [9]:
best_two_r = all_two_r[all_two_r.Accuracy == all_two_r.Accuracy.max()]

print("Best 2R:")
best_two_r

Best 2R:


,Rules,Accuracy
5,"[Sex, Age]",0.806306


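There is no clear answer as to which pair of attributes is best.
`Sex` is always part of the best pair, but the second attribute seems to have little impact and changes with the random split of the dataset into training and test sets.
As far as I understand, a 1R classifier uses only one attribute per rule, so whenever `Sex` is in the pair the classifier can simply select it, which would explain why the second attribute matters so little.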

When taking the best out of 100 runs, I get the following result:
```
Best rule:     ['Sex', 'Parents/Children Aboard']
with accuracy: 0.8603603603603603
```
However, this may also be the result of overfitting to the present dataset.
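
One way to reduce this risk is to report the mean and spread over the repeated splits instead of the single best run, since the maximum is optimistic by construction. The sketch below illustrates the idea in pure Python on synthetic data. The helper `holdout_accuracy`, the generated rows, and the 78% agreement rate are all assumptions made for illustration and are unrelated to the actual Titanic results.

```python
import random
import statistics
from collections import Counter, defaultdict


def holdout_accuracy(rows, rng, test_size=0.25):
    """One random split: learn value -> majority class on train, score on test."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    test, train = shuffled[:n_test], shuffled[n_test:]

    counts = defaultdict(Counter)
    for value, label in train:
        counts[value][label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fall back to the overall majority class for values unseen in training.
    default = Counter(label for _, label in train).most_common(1)[0][0]

    hits = sum(rule.get(value, default) == label for value, label in test)
    return hits / len(test)


# Synthetic (sex, survived) rows: the label follows the attribute about 78% of the time.
rng = random.Random(0)
rows = []
for _ in range(400):
    sex = rng.choice(["female", "male"])
    agrees = rng.random() < 0.78
    rows.append((sex, int((sex == "female") == agrees)))

scores = [holdout_accuracy(rows, rng) for _ in range(100)]
print("mean:", round(statistics.mean(scores), 3))
print("std: ", round(statistics.stdev(scores), 3))
print("max: ", round(max(scores), 3))
```

The gap between `max` and `mean` gives a rough sense of how much of a "best of 100 runs" figure is due to a lucky split rather than to the rule itself.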