# Exercise 1

The folder “Data” contains the dataset Titanic.csv, which describes the passengers and whether they survived (Survived = 1 for yes, 0 for no).
It also records the passenger class (Pclass), name (Name), gender (Sex),
age (Age), the number of siblings or spouses aboard, the number of parents or children aboard, and the fare (Fare).

In [140]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
titanic = pd.read_csv('../Data/titanic.csv.zst', index_col='Name')
print(titanic.head(5))
titanic.describe(include='all')

                                                    Survived  Pclass     Sex  \
Name                                                                           
Mr. Owen Harris Braund                                     0       3    male   
Mrs. John Bradley (Florence Briggs Thayer) Cumings         1       1  female   
Miss. Laina Heikkinen                                      1       3  female   
Mrs. Jacques Heath (Lily May Peel) Futrelle                1       1  female   
Mr. William Henry Allen                                    0       3    male   

                                                     Age  \
Name                                                       
Mr. Owen Harris Braund                              22.0   
Mrs. John Bradley (Florence Briggs Thayer) Cumings  38.0   
Miss. Laina Heikkinen                               26.0   
Mrs. Jacques Heath (Lily May Peel) Futrelle         35.0   
Mr. William Henry Allen                             35.0   

                  

| | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|
| count | 887.0 | 887.0 | 887 | 887.0 | 887.0 | 887.0 | 887.0 |
| unique | | | 2 | | | | |
| top | | | male | | | | |
| freq | | | 573 | | | | |
| mean | 0.385569 | 2.305524 | | 29.471443 | 0.525366 | 0.383315 | 32.30542 |
| std | 0.487004 | 0.836662 | | 14.121908 | 1.104669 | 0.807466 | 49.78204 |
| min | 0.0 | 1.0 | | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | | 20.25 | 0.0 | 0.0 | 7.925 |
| 50% | 0.0 | 3.0 | | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | | 38.0 | 1.0 | 0.0 | 31.1375 |


In [141]:
rules = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

In [142]:
from typing import Iterable
from mlxtend.evaluate import accuracy_score
from sklearn.model_selection import train_test_split
from mlxtend.classifier import OneRClassifier


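def train_and_predict(X_idx: Iterable[str], y_idx: str) -> float:
    """Fit a OneR classifier on a random 75/25 split and return its test accuracy."""
    X = titanic[X_idx]
    y = titanic[y_idx]

    # No random_state is set, so every call draws a different split
    # and the returned accuracy varies from run to run.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    one_r = OneRClassifier()
    one_r.fit(X_train.to_numpy(), y_train)

    # Score the fitted rule on the held-out 25%.
    y_pred = one_r.predict(X_test.to_numpy())
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy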

## (a) What is the best default rule for this dataset? (Default means without any evidence about the person.)

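Strictly speaking, the best default rule uses no attributes at all: it predicts the most frequent class for everyone.
The mean of `Survived` is 0.385569, so the majority class is `Survived = 0`, and always predicting it is correct for about 61.4% of the 887 passengers.
If we assume the strategy of "women and children first" when the Titanic sank, the attributes most likely to improve on this are `Sex` or `Parents/Children Aboard`.

We test this attribute-based rule, which the code below labels as our default rule (`Sex` and `Parents/Children Aboard` vs. `Survived`):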

In [143]:
default_rule_accuracy = train_and_predict(X_idx=["Sex", "Parents/Children Aboard"], y_idx="Survived")

print('Default rule accuracy:', default_rule_accuracy)

Default rule accuracy: 0.8018018018018018
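
For comparison, a rule that uses no evidence about the person simply predicts the most frequent class for everyone. The following is a minimal sketch of that baseline, not the notebook's own method. It assumes the class counts can be reconstructed from the summary statistics above (count 887, mean of `Survived` 0.385569, which corresponds to 342 survivors); `majority_baseline` is a hypothetical helper and not part of mlxtend.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most frequent label, share of samples carrying that label)."""
    counts = Counter(labels)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(labels)


# Assumption: 342 survivors and 545 non-survivors, reconstructed from the
# describe() output (count = 887, mean of Survived = 0.385569).
survived = [1] * 342 + [0] * 545

label, accuracy = majority_baseline(survived)
print(f"Predict Survived = {label} for everyone: accuracy = {accuracy:.4f}")
```

This gives a reference point of roughly 61% against which any attribute-based rule can be compared.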


## (b) What is the best 1R for this dataset?

In [144]:
all_one_r = pd.DataFrame(
    map(lambda rule: [rule, train_and_predict(X_idx=[rule], y_idx="Survived")], rules),
    columns=["Rule", "Accuracy"],
)

best_one_r = all_one_r[all_one_r.Accuracy == all_one_r.Accuracy.max()]

print(all_one_r)
print("\nBest 1R:\n", best_one_r)

                      Rule  Accuracy
0                   Pclass  0.671171
1                      Sex  0.747748
2                      Age  0.590090
3  Siblings/Spouses Aboard  0.608108
4  Parents/Children Aboard  0.612613
5                     Fare  0.653153

Best 1R:
   Rule  Accuracy
1  Sex  0.747748


We therefore conclude that `Sex` gives the best 1R for the Titanic dataset, with a test accuracy of about 0.75 in this run.
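
To make explicit what a 1R learner does, here is a small sketch of the idea in plain Python: for each attribute, map every value to its majority class, count the training errors, and keep the attribute with the fewest errors. It is an illustration under assumptions, not the mlxtend implementation; the function names and the toy rows are hypothetical and not taken from Titanic.csv.

```python
from collections import Counter, defaultdict


def one_r_for_attribute(rows, labels, attr):
    """Build the 1R rule for one attribute: each value maps to its majority class."""
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[attr]][label] += 1
    rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    # Training errors: samples whose label differs from the rule's prediction.
    errors = sum(label != rule[row[attr]] for row, label in zip(rows, labels))
    return rule, errors


def best_one_r(rows, labels, attrs):
    """Return (attribute, rule, training errors) for the attribute with fewest errors."""
    candidates = []
    for attr in attrs:
        rule, errors = one_r_for_attribute(rows, labels, attr)
        candidates.append((attr, rule, errors))
    return min(candidates, key=lambda c: c[2])


# Hypothetical toy data, for illustration only.
rows = [
    {"Sex": "female", "Pclass": 1},
    {"Sex": "female", "Pclass": 3},
    {"Sex": "male", "Pclass": 1},
    {"Sex": "male", "Pclass": 3},
    {"Sex": "male", "Pclass": 3},
    {"Sex": "female", "Pclass": 3},
]
labels = [1, 1, 0, 0, 0, 1]

attr, rule, errors = best_one_r(rows, labels, ["Pclass", "Sex"])
print(attr, rule, errors)
```

On this toy data `Sex` separates the classes without error, so it is selected over `Pclass`.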

## (c) Can you produce a second rule based on a single attribute with good effectiveness? You need to split the dataset into two disjoint samples, the training set and the test set. For example, use 75% for the training sample and the remaining 25% for the test set.

In [145]:
from itertools import combinations

# All unordered pairs of distinct attributes.
rules2 = [list(pair) for pair in combinations(rules, 2)]
all_two_r = pd.DataFrame(
    map(lambda rule: [rule, train_and_predict(X_idx=rule, y_idx="Survived")], rules2),
    columns=["Rules", "Accuracy"],
)

best_two_r = all_two_r[all_two_r.Accuracy == all_two_r.Accuracy.max()]

print(all_two_r)
print("\nBest 2R:\n", best_two_r)

                                                 Rules  Accuracy
0                                        [Pclass, Sex]  0.788288
1                                        [Pclass, Age]  0.684685
2                    [Pclass, Siblings/Spouses Aboard]  0.680180
3                    [Pclass, Parents/Children Aboard]  0.662162
4                                       [Pclass, Fare]  0.729730
5                                           [Sex, Age]  0.756757
6                       [Sex, Siblings/Spouses Aboard]  0.801802
7                       [Sex, Parents/Children Aboard]  0.747748
8                                          [Sex, Fare]  0.680180
9                       [Age, Siblings/Spouses Aboard]  0.576577
10                      [Age, Parents/Children Aboard]  0.608108
11                                         [Age, Fare]  0.693694
12  [Siblings/Spouses Aboard, Parents/Children Aboard]  0.662162
13                     [Siblings/Spouses Aboard, Fare]  0.639640
14                     [P

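There is no clear answer as to which pair of attributes is best.
In my runs `Sex` is always part of the best pair, but the second attribute has little impact, and which one comes out on top changes with the random split of the dataset into training and test sets.
A likely reason, as far as I understand `OneRClassifier`, is that it still selects a single attribute from the columns it is given, so each pair is effectively scored on whichever of its attributes looks best on the training split.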

When taking the best out of 100 runs, I get the following result:
```
Best rule:     ['Sex', 'Parents/Children Aboard']
with accuracy: 0.8603603603603603
```
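However, picking the maximum over 100 random splits favours a lucky test set, so this accuracy is likely optimistic and may also be the result of overfitting to the present dataset.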
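
One way to reduce the influence of a single lucky split is to report the mean and spread over many seeded splits rather than the maximum. The sketch below shows the idea in plain Python. `repeated_holdout` and `majority_accuracy` are hypothetical helpers, the majority-class predictor only stands in for a real classifier, and the label counts are an assumption reconstructed from the summary statistics (887 passengers, mean of `Survived` 0.385569).

```python
import random
import statistics


def repeated_holdout(n_samples, evaluate, n_runs=100, test_size=0.25, seed=0):
    """Call evaluate(train_idx, test_idx) on n_runs seeded random splits."""
    rng = random.Random(seed)  # fixed seed makes the whole experiment repeatable
    n_test = max(1, round(n_samples * test_size))
    scores = []
    for _ in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return scores


# Assumption: 342 survivors and 545 non-survivors, reconstructed from the
# describe() output; the real notebook would use the Titanic rows instead.
labels = [1] * 342 + [0] * 545


def majority_accuracy(train_idx, test_idx):
    """Stand-in classifier: predict the training majority class on the test part."""
    train_labels = [labels[i] for i in train_idx]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


scores = repeated_holdout(len(labels), majority_accuracy)
print(f"mean = {statistics.mean(scores):.3f}")
print(f"std  = {statistics.stdev(scores):.3f}")
print(f"max  = {max(scores):.3f}")
```

The gap between the mean and the maximum indicates how much of a "best of 100 runs" figure is due to the choice of split rather than to the rule itself.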