## Introduction

Decision trees are non-probabilistic classifiers. Because they rely on labeled data, they belong to the class of supervised learning models. Decision trees can be used for both classification and regression; this exercise focuses on classification.

The documentation for Decision Trees in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/tree.html).

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [None]:
# Load the training data from the CSV file (the test split is created from it further below)
train = pd.read_csv('train.csv')

In [None]:
def add_missing_float_values(data: pd.DataFrame, attribute: str):
    """Fill missing values with random integers drawn from [mean - std, mean + std)."""
    mean = data[attribute].mean()
    std = data[attribute].std()
    null_mask = data[attribute].isnull()
    # randint expects integer bounds, so truncate mean - std and mean + std
    random_values = np.random.randint(int(mean - std), int(mean + std), size=null_mask.sum())
    data_copy = data[attribute].copy()
    # assign one random value to each missing entry
    data_copy[null_mask] = random_values
    data[attribute] = data_copy
    return data

In [None]:
# fill nan values
train["Embarked"] = train["Embarked"].fillna("S")
# add missing ages
train = add_missing_float_values(train, "Age")
# create age bins
train.loc[train['Age'] <= 14, 'Age_bin'] = 0
train.loc[(train['Age'] > 14) & (train['Age'] <= 30), 'Age_bin'] = 1
train.loc[(train['Age'] > 30) & (train['Age'] <= 40), 'Age_bin'] = 2
train.loc[(train['Age'] > 40) & (train['Age'] <= 50), 'Age_bin'] = 3
train.loc[(train['Age'] > 50) & (train['Age'] <= 60), 'Age_bin'] = 4
train.loc[ train['Age'] > 60, 'Age_bin'] = 5
train['Age_bin'] = train['Age_bin'].astype(int)

In [None]:
train.head(5)

In [None]:
train.info()
train.describe()

In [None]:
classification_target = "Survived"
train = train[["Pclass", "Embarked", "Sex", "Survived", "Age_bin"]]
# hold out 20% of the rows as a test set; random_state seeds the sampling
test = train.sample(frac=0.2, random_state=200)
train = train.drop(index=test.index)

## Decision Trees
### Representation
* Each node represents a split of the data on an attribute
* Each edge represents an attribute value
* Each leaf represents the classification result
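
As a minimal sketch of this representation, a tree can be written as nested dictionaries: inner nodes hold the split attribute, the dictionary keys under `children` are the edges (attribute values), and plain values are the leaves. The tree below is hand-written for illustration only; its structure and labels are assumptions and are not derived from the data. The names `toy_tree` and `classify` are hypothetical.

```python
# Hypothetical hand-written tree: inner nodes are dicts, leaves are class labels.
toy_tree = {
    "attribute": "Sex",
    "children": {
        "female": 1,
        "male": {
            "attribute": "Pclass",
            "children": {1: 1, 2: 0, 3: 0},
        },
    },
}


def classify(node, sample):
    """Follow the edge matching the sample's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["children"][sample[node["attribute"]]]
    return node


print(classify(toy_tree, {"Sex": "male", "Pclass": 2}))
```

The `Node` class used later in this exercise stores the same information as objects with parent and child references instead of dictionaries.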

### ID3 Algorithm

1. Compute the overall best attribute (e.g. using information gain)
   1. Compute the entropy of the target for the whole data set
   2. Compute the entropy of the target for each attribute value
   3. Weight the entropy of each attribute value by the relative frequency of that value
2. Assign the best attribute to the next node
3. Create an edge from the node for each attribute value and distribute the data according to the attribute value split
4. Repeat until no further splits are possible

**ToDo**: Complete the function f_entropy

Entropy H: $$H(S) = \sum_{x \in X}{-p(x) \log_{2}p(x)}$$
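
To make the formula concrete, here is a small numeric sketch. It assumes the probabilities are passed as a plain sequence and uses the hypothetical name `entropy_bits`, so it is not the required signature of `f_entropy`.

```python
import numpy as np


def entropy_bits(probabilities):
    """Shannon entropy in bits for a sequence of probabilities."""
    p = np.asarray(probabilities, dtype=float)
    # Terms with p(x) = 0 contribute nothing and would break the logarithm.
    p = p[p > 0]
    # -p * log2(p) is written as p * log2(1 / p) to avoid returning a negative zero.
    return float(np.sum(p * np.log2(1 / p)))


print(entropy_bits([0.5, 0.5]))  # maximal uncertainty for two classes
print(entropy_bits([1.0]))       # a pure set has no uncertainty
```

A uniform distribution over two classes has an entropy of one bit, while a set containing a single class has an entropy of zero.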

In [None]:
def f_entropy(probabilities: pd.Series):
    """Calculate the entropy for given probabilities."""
    return 0

**ToDo**: Complete the function probability_attribute_value.

In [None]:
def probability_attribute_value(
    data: pd.DataFrame,
    attribute: str,
    attribute_value: any,
    target: str,
):
    """Calculate the probabilities of the target classes among the rows where attribute equals attribute_value."""
    return 0

**ToDo**: Complete the function f_information_gain.

Information Gain IG: $$IG(S,A) = H(S) - \sum_{v \in V(A)}{ \frac{|S_v|}{|S|} H(S_v)}$$
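
The formula can be checked by hand on a tiny data set. The sketch below uses made-up data and the hypothetical helpers `entropy_of` and `information_gain`; it is one possible reading of the formula, not the reference solution for `f_information_gain`.

```python
import numpy as np
import pandas as pd


def entropy_of(labels: pd.Series) -> float:
    """H(S): entropy of the class distribution in labels, in bits."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1 / p)))


def information_gain(data: pd.DataFrame, attribute: str, target: str) -> float:
    """IG(S, A): H(S) minus the weighted entropies H(S_v) of the subsets S_v."""
    weighted = 0.0
    for _, subset in data.groupby(attribute):
        weighted += len(subset) / len(data) * entropy_of(subset[target])
    return entropy_of(data[target]) - weighted


# Made-up data: "A" separates the classes perfectly, "B" carries no information.
toy = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": ["x", "y", "x", "y"],
    "label": [1, 1, 0, 0],
})
print(information_gain(toy, "A", "label"))
print(information_gain(toy, "B", "label"))
```

Splitting on `A` leaves two pure subsets, so the whole entropy of one bit is gained; splitting on `B` leaves two subsets that are as mixed as the original data, so nothing is gained.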

In [None]:
def f_information_gain(data: pd.DataFrame, attribute: str, target: str):
    """Calculate the information gain for a specific attribute given the target."""
    entropies, weights = entropy_attribute_value(data, attribute, target)
    weighted_attribute_value_entropy = 0
    for attribute_value, entropy in entropies.items():
        weighted_attribute_value_entropy += weights[attribute_value] * entropy
    return f_entropy(probability_attribute_values(data[target])) - weighted_attribute_value_entropy

In [None]:
def probability_attribute_values(attribute_values: pd.Series):
    """Calculate the probability for all attribute values."""
    return attribute_values.value_counts() / len(attribute_values)

def entropy_attribute_value(data: pd.DataFrame, attribute: str, target: str):
    """Calculate the entropies for all values of a given attribute as well as the weight of the attribute value."""
    entropies = {}
    for attribute_value in data[attribute].unique():
        probabilities = probability_attribute_value(data, attribute, attribute_value, target)
        entropies[attribute_value] = f_entropy(probabilities)
    return entropies, dict(data[attribute].value_counts() / len(data[attribute]))


def entropies_attributes(data: pd.DataFrame, target: str):
    """Calculate the entropies for all attributes."""
    entropies = {}
    for attribute in data.columns:
        if attribute == target:
            continue
        entropies[attribute] = entropy_attribute_value(data, attribute, target)[0]
    return entropies

def f_information_gains(data: pd.DataFrame, target: str):
    """Calculate the information gain for each attribute, excluding the target."""
    information_gains = {}
    for attribute in data.columns:
        if attribute == target:
            continue
        information_gains[attribute] = f_information_gain(data, attribute, target)
    return information_gains


def max_information_gain_attribute(data: pd.DataFrame, target: str):
    """Returns the attribute with the maximum information gain"""
    information_gain = f_information_gains(data, target)
    return max(information_gain, key=information_gain.get)

**ToDo**: Calculate the entropy and information gain for all attributes. Which attribute should be the first node in the tree?

In [None]:
probabilities = probability_attribute_values(train[classification_target])

### Node 
The Node class is used to build up the tree.

In [None]:
class Node:
    """
    Node with associated decision tree information

    Attributes
    ----------
    parent: Node
        The reference to the parent of this node.
    children: Node[]
        The list of references to the children of this node.
    in_attribute_value : any
        The value of the parent attribute that led to this node.
    attribute : any
        The attribute the data is split on in this node.
    prediction : any
        The predicted class value at a leaf node.
    depth : int
        The depth of the node; used for printing the tree.
    """

    def __init__(self, attribute: str, in_attribute_value: any, parent: any):
        self.attribute = attribute
        self.in_attribute_value = in_attribute_value
        self.parent = parent
        self.children = []
        self.prediction = None
        self.depth = 0 if parent is None else parent.depth + 1

    def add_child(self, child: any):
        """Appends the given child to the list of children."""
        self.children.append(child)

    def find_child(self, attribute_value: any):
        """Returns the child with the given attribute_value."""
        for child in self.children:
            if child.in_attribute_value == attribute_value:
                return child
        raise KeyError

    def predict(self, sample: pd.Series):
        """Returns the prediction of the leaf node for the given sample."""
        node = self
        while node.prediction is None:
            node = node.find_child(sample[node.attribute])
        return node.prediction

    def set_prediction(self, data: pd.DataFrame, node_data: pd.DataFrame, target: str):
        """Sets the prediction to the most frequent target class in node_data, falling back to data if node_data is empty."""
        values, counts = np.unique(node_data[target], return_counts=True)
        if len(counts) > 0:
            self.prediction = values[np.argmax(counts)]
        # if the attribute value has no instances, choose the most frequent class over all attribute values
        else:
            values, counts = np.unique(data[target], return_counts=True)
            self.prediction = values[np.argmax(counts)]

    def to_str(self):
        """Converts a node to a string for printing."""
        return "{}{}:{} --> {}".format(
            self.parent.depth * "\t",
            self.parent.attribute,
            self.in_attribute_value,
            self.prediction,
        )

## ID3 Algorithm

1. Compute the overall best attribute (e.g. using information gain)
   1. Compute the entropy of the target for the whole data set
   2. Compute the entropy of the target for each attribute value
   3. Weight the entropy of each attribute value by the relative frequency of that value
2. Assign the best attribute to the next node
3. Create an edge from the node for each attribute value and distribute the data according to the attribute value split
4. Repeat until no further splits are possible
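
The steps above can be sketched in plain Python on a handful of made-up rows. This is an illustrative sketch under simplifying assumptions (rows as dictionaries, the tree as nested dictionaries, majority vote when no attribute is left); it does not use the `Node` and `DecisionTree` classes of this exercise, and the names `entropy`, `information_gain` and `id3` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(c / total * log2(total / c) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    gain = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


def id3(rows, attributes, target):
    """Return a nested-dict tree; leaves are class labels."""
    labels = [r[target] for r in rows]
    # Stop when the labels are pure or no attribute is left; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1 and 2: pick the attribute with the highest information gain for this node.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Step 3 and 4: one edge per attribute value, then recurse on the matching rows.
    children = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        children[value] = id3(subset, remaining, target)
    return {"attribute": best, "children": children}


# Made-up rows for illustration only.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]
tree = id3(rows, ["Sex", "Pclass"], "Survived")
print(tree)
```

On these rows `Sex` has the higher information gain and becomes the root; the `female` branch is already pure, while the `male` branch is split once more on `Pclass`.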

**ToDos**:
1. select the attribute with the maximum information gain
2. create a data set for determining the maximum information gain after the split by the parent attribute value
3. create a tree node with the attribute and an in_attribute_value of parent_attribute_value
4. recursively grow the tree deeper

ToDos are marked with `### TODO`. The code will not run until all ToDos are completed.

In [None]:
class DecisionTree:
    """DecisionTree with associated methods.
    Attributes
    ----------
    root: Node
        The root node of the tree.
    """

    def __init__(self):
        self.root = None

    def fit(
        self,
        base_data: pd.DataFrame,
        target: str,
        log: bool = False,
        parent: Node = None,
        data: pd.DataFrame = None,
    ):
        """
        This method fits a decision tree. A decision tree consists of three different kinds of nodes.
            1. the root node
            The root node is the attribute with the maximum information gain over all attributes.
            2. the tree nodes
            The tree nodes are the remaining attributes with descending information gain.
            3. the leaf nodes
            The leaf nodes encode the prediction of the respective branch of the tree.
        """
        if data is None:
            data = base_data
        if not self.is_pure(data, target):
            # create a root node
            if parent is None:
                # remove pure attributes from the data frame for the rest of the tree
                data = self.remove_pure_attributes(data, target)
                ### TODO select the attribute with the maximum information gain
                attribute = "Sex"
                parent = Node(attribute, None, None)

            # create child nodes with in_attribute_values for all attribute values of the parent
            for parent_attribute_value in base_data[parent.attribute].unique():
                ### TODO create a data set for determining the maximum information gain after the split by the parent attribute value
                node_data = data
                # remove pure attributes from the data frame for the rest of the tree
                node_data = self.remove_pure_attributes(node_data, target)
                # grow a tree node
                if not self.is_pure(node_data, target):
                    ### TODO select the attribute with the maximum information gain
                    attribute = "Sex"
                    ### TODO create a tree node with the attribute and an in_attribute_value of parent_attribute_value
                    node = None
                    # add the tree node to the children of the parent
                    parent.add_child(node)
                    # set the prediction to the most likely
                    node.set_prediction(data, node_data, target)
                    if log:
                        print(node.to_str())
                    ### TODO recursively grow the tree deeper
                # grow a leaf node
                elif self.is_pure(node_data, target):
                    # create a leaf node with no attribute and an in_attribute_value of parent_attribute_value
                    node = Node(None, parent_attribute_value, parent)
                    # add the leaf node to the children of the parent
                    parent.add_child(node)
                    # set the prediction to the most likely
                    node.set_prediction(data, node_data, target)
                    if log:
                        print(node.to_str())
        self.root = parent
        return parent

    def predict(self, data: pd.DataFrame, target: str):
        """Predicts the class of each sample and returns the mean accuracy and the chance level (1 / number of classes)."""
        result = []
        for index, row in data.iterrows():
            result.append(self.root.predict(row) == data[target][index])
        return np.mean(result), 1 / len(data[target].unique())

    def keep_attribute_value(self, data: pd.DataFrame, attribute: str, attribute_value: any):
        """Returns only the selected attribute values"""
        return data[data[attribute] == attribute_value].reset_index(drop=True)

    def is_pure(self, data, target):
        """Checks whether data is pure, i.e. every attribute other than the target has only a single value"""
        for attribute in data.columns:
            if attribute == target:
                continue
            if len(np.unique(data[attribute])) > 1:
                return False
        return True

    def remove_pure_attributes(self, data: pd.DataFrame, target: str):
        """Returns a copy of data without the attributes that are pure."""
        for attribute in data.columns:
            if attribute == target:
                continue
            if len(data[attribute].unique()) == 1:
                # drop on a copy so the caller's data frame is not modified
                data = data.drop(columns=attribute)
        return data

In [None]:
decision_tree = DecisionTree()
decision_tree.fit(train, classification_target, True)
accuracy, chance = decision_tree.predict(train, classification_target)

print("Mean accuracy: {:.2} vs Chance: {:.2}".format(accuracy, chance))

### Decision Trees in sklearn

In [None]:
from sklearn import tree
from sklearn import preprocessing

# encode the data numerically
label_encoder = preprocessing.LabelEncoder()
if "Embarked" in train.columns:
    label_encoder.fit(train["Embarked"])
    train["Embarked_int"] = label_encoder.transform(train["Embarked"])
    test["Embarked_int"] = label_encoder.transform(test["Embarked"])
    train = train.drop(columns="Embarked")
    test = test.drop(columns="Embarked")
if "Sex" in train.columns:    
    label_encoder.fit(train["Sex"])
    train["Sex_int"] = label_encoder.transform(train["Sex"])
    test["Sex_int"] = label_encoder.transform(test["Sex"])
    train = train.drop(columns="Sex")
    test = test.drop(columns="Sex")

# create training and test data
X = train.drop(columns=classification_target)
Y = train[classification_target]
X_test = test.drop(columns=classification_target)
Y_test = test[classification_target]

# create the tree
decision_tree = tree.DecisionTreeClassifier(criterion="entropy",max_depth=len(X.columns)-1)
# fit the tree
decision_tree.fit(X,Y);

In [None]:
# evaluate the mean accuracy on the held-out test set
accuracy = decision_tree.score(X_test, Y_test)
print("Mean accuracy: {:.2} vs Chance: {:.2}".format(accuracy, chance))

In [None]:
# visualize the tree
import graphviz
feature_names = list(X.columns)
dot_data = tree.export_graphviz(decision_tree, out_file=None, feature_names=feature_names,filled=True) 
graph = graphviz.Source(dot_data) 
graph.render("decision_tree")
fig, ax = plt.subplots(figsize=(50, 24))
tree.plot_tree(decision_tree, feature_names=feature_names, filled=True, ax=ax);