## **Task 4 - ID3 Decision Tree**

The goal of this task is to implement a decision tree built with the ID3 algorithm, with a limit on the maximum tree depth, and to build a classifier for the [Tic-Tac-Toe Endgame](https://archive.ics.uci.edu/dataset/101/tic+tac+toe+endgame) dataset and evaluate its quality.

**Steps:**
- Implement the ID3 decision tree (with a limit on its maximum depth).
- Evaluate the classifier on the Tic-Tac-Toe Endgame dataset by computing the accuracy and the confusion matrix.
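
The two metrics named in the steps above can be computed directly from the true and predicted labels. The snippet below is a minimal sketch that uses NumPy only; the helper names `confusion_matrix` and `accuracy` and the toy label lists are illustrative assumptions, not part of the task statement.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label], index[predicted_label]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Toy labels made up for illustration (not results from the dataset).
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
labels = ["positive", "negative"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
print(cm)   # [[2 1]
            #  [1 1]]
print(acc)  # 0.6
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the number of samples, which is a quick consistency check between the two metrics.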

**Notes**
- Remember to split the data into training, validation, and test sets.
- The implemented method should be general: do not hard-code things such as the dataset file name or the attribute values.
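
One way to obtain the three sets mentioned in the notes is to shuffle the row indices once and cut them into three consecutive parts. The sketch below assumes a 70/15/15 split and a fixed seed; the function name `split_indices` and its parameters are hypothetical, and the proportions are passed in rather than hard-coded.

```python
import numpy as np


def split_indices(n_samples, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return shuffled, disjoint index arrays for the train, validation and test sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # the test set takes whatever remains
    return train, val, test


# 958 is the number of instances reported for Tic-Tac-Toe Endgame.
train_idx, val_idx, test_idx = split_indices(958)
print(len(train_idx), len(val_idx), len(test_idx))  # 671 144 143
```

With pandas objects the index arrays can be applied positionally, for example `X.iloc[train_idx]` and `y.iloc[train_idx]`. The validation part is meant for choosing hyperparameters such as the maximum depth, and the test part for the final evaluation only.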

In [26]:
import numpy as np
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split

In [9]:
# fetch dataset
tic_tac_toe_endgame = fetch_ucirepo(id=101)

# data (as pandas dataframes)
X = tic_tac_toe_endgame.data.features
y = tic_tac_toe_endgame.data.targets

In [10]:
# metadata
print(tic_tac_toe_endgame.metadata)

{'uci_id': 101, 'name': 'Tic-Tac-Toe Endgame', 'repository_url': 'https://archive.ics.uci.edu/dataset/101/tic+tac+toe+endgame', 'data_url': 'https://archive.ics.uci.edu/static/public/101/data.csv', 'abstract': 'Binary classification task on possible configurations of tic-tac-toe game', 'area': 'Games', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 958, 'num_features': 9, 'feature_types': ['Categorical'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1991, 'last_updated': 'Mon Aug 19 1991', 'dataset_doi': '10.24432/C5688J', 'creators': ['David Aha'], 'intro_paper': None, 'additional_info': {'summary': 'This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first.  The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three

In [11]:
# variable information
print(tic_tac_toe_endgame.variables)

                   name     role         type demographic description units  \
0                 class   Target  Categorical        None        None  None   
1       top-left-square  Feature  Categorical        None        None  None   
2     top-middle-square  Feature  Categorical        None        None  None   
3      top-right-square  Feature  Categorical        None        None  None   
4    middle-left-square  Feature  Categorical        None        None  None   
5  middle-middle-square  Feature  Categorical        None        None  None   
6   middle-right-square  Feature  Categorical        None        None  None   
7    bottom-left-square  Feature  Categorical        None        None  None   
8  bottom-middle-square  Feature  Categorical        None        None  None   
9   bottom-right-square  Feature  Categorical        None        None  None   

  missing_values  
0             no  
1             no  
2             no  
3             no  
4             no  
5             no

In [57]:
class Node:
    def __init__(self, data):
        # data: the subset of training rows that reaches this node
        self.data = data
        self.children = []

    def add_child(self, child):
        self.children.append(child)
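
To classify a sample, a node needs more than its data subset: it also needs the attribute it splits on and a class label to return. The following is a minimal sketch of such a node, written as an assumption about one possible design; the names `TreeNode` and `classify` are hypothetical, and the tiny tree is built by hand rather than learned.

```python
class TreeNode:
    """Tree node that stores the split attribute, a class label and its children."""

    def __init__(self, feature=None, label=None):
        self.feature = feature  # attribute tested at this node (None for a leaf)
        self.label = label      # majority class of the samples at this node
        self.children = {}      # maps an attribute value to a child node

    def is_leaf(self):
        return not self.children


def classify(node, sample):
    """Follow the sample's attribute values down the tree and return a label."""
    while not node.is_leaf():
        child = node.children.get(sample[node.feature])
        if child is None:  # value not seen in training: use the current label
            break
        node = child
    return node.label


# Hand-built toy tree for illustration only.
root = TreeNode(feature="middle-middle-square", label="positive")
root.children["x"] = TreeNode(label="positive")
root.children["o"] = TreeNode(label="negative")

print(classify(root, {"middle-middle-square": "o"}))  # negative
print(classify(root, {"middle-middle-square": "b"}))  # positive
```

Keeping a label on every node, not only on leaves, gives a natural fallback when the depth limit stops the tree early or when an unseen attribute value appears at prediction time.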

In [64]:
class DecisionTreeID3:
    def __init__(self, X, y, train_val_test_split = (0.7, 0.15, 0.15), max_depth = 10, min_samples_split = 2):
        self.X = X
        self.y = y
        self.train_val_test_split = train_val_test_split
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split

    def get_train_val_test_split(self):
        # train is now train_val_test_split[0] of the entire data set
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size = 1 - self.train_val_test_split[0], random_state = 42)
        # split the test set into validation and test sets
        X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size = self.train_val_test_split[2]/
                                                        (self.train_val_test_split[1] + self.train_val_test_split[2]), random_state = 42)
        return X_train, X_val, X_test, y_train, y_val, y_test

    def calculate_dataset_entropy(self, y):
        entropy = 0
        for class_name in np.unique(y):
            p = np.sum(y == class_name) / len(y)
            entropy += -p * np.log2(p)
        return entropy

    def calculate_dataset_divide_entropy(self, feature_name, X, y):
        entropy = 0
        for value in np.unique(X[feature_name]):
            entropy += np.sum(X[feature_name] == value) / len(X) * self.calculate_dataset_entropy(y[X[feature_name] == value])

        return entropy

    def information_gain(self, X, y, feature):
        return self.calculate_dataset_entropy(y) - self.calculate_dataset_divide_entropy(feature, X, y)

    def choose_best_feature(self, X, y):
        best_feature = None
        best_information_gain = 0.0

        for feature in X.columns:
            # information_gain may return a one-element pandas Series (when y is a
            # DataFrame) or a NumPy scalar, so reduce it to a plain float first.
            gain = float(np.asarray(self.information_gain(X, y, feature)).item())
            if gain > best_information_gain:
                best_information_gain = gain
                best_feature = feature

        # None means that no feature yields a positive information gain.
        return best_feature

    def fit(self):
        X_train, X_val, X_test, y_train, y_val, y_test = self.get_train_val_test_split()

        self.root = Node(data = X_train)
        self.build_tree(self.root, X_train, y_train, 0)

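    def build_tree(self, node, X, y, depth):
        # Remember the majority class so that every node can act as a leaf.
        labels, counts = np.unique(np.asarray(y).ravel(), return_counts=True)
        node.label = labels[np.argmax(counts)]
        node.split_feature = None
        node.child_values = []

        # Stop at the depth limit, on a pure node, or when too few samples remain.
        if depth == self.max_depth or len(labels) == 1 or len(X) < self.min_samples_split:
            return

        # get split feature
        split_feature = self.choose_best_feature(X, y)
        if split_feature is None:
            return
        node.split_feature = split_feature

        # split the data, creating one child per observed attribute value
        for value in np.unique(X[split_feature]):
            mask = X[split_feature] == value
            child_node = Node(data = X[mask])
            node.add_child(child_node)
            node.child_values.append(value)
            self.build_tree(child_node, X[mask], y[mask], depth + 1)

    def predict(self, X):
        # Walk each row down the tree; fall back to the current node's
        # majority class when a value was not seen during training.
        predictions = []
        for _, row in X.iterrows():
            node = self.root
            while node.split_feature is not None and row[node.split_feature] in node.child_values:
                node = node.children[node.child_values.index(row[node.split_feature])]
            predictions.append(node.label)
        return np.array(predictions)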
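
The pieces that ID3 relies on (entropy, information gain, recursive splitting with a depth limit, and prediction) can also be seen end to end on a toy example. The sketch below is a standalone illustration in plain Python, independent of the class above; the function names, the nested-dictionary tree format, and the four-row dataset are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def build(rows, labels, features, depth, max_depth):
    """Recursively build a tree represented as nested dictionaries."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1 or not features:
        return {"label": majority}
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 0:
        return {"label": majority}  # no attribute reduces the entropy
    node = {"label": majority, "feature": best, "children": {}}
    remaining = [f for f in features if f != best]
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, depth + 1, max_depth)
    return node


def predict(node, row):
    """Descend until a leaf or an unseen value, then return that node's label."""
    while "feature" in node:
        child = node["children"].get(row[node["feature"]])
        if child is None:
            break
        node = child
    return node["label"]


# Toy data: the class depends only on attribute "a".
rows = [{"a": "x", "b": "x"}, {"a": "x", "b": "o"},
        {"a": "o", "b": "x"}, {"a": "o", "b": "o"}]
labels = ["positive", "positive", "negative", "negative"]

print(entropy(labels))                     # 1.0
print(information_gain(rows, labels, "a"))  # 1.0
print(information_gain(rows, labels, "b"))  # 0.0
tree = build(rows, labels, ["a", "b"], depth=0, max_depth=3)
print(predict(tree, {"a": "o", "b": "x"}))  # negative
```

Here attribute `a` separates the classes perfectly, so it carries the full 1 bit of information gain and becomes the root, while `b` carries none. Setting `max_depth=0` makes `build` return a single leaf with the majority class, which shows how the depth limit trades accuracy for a smaller tree.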

In [65]:
id3_tree = DecisionTreeID3(X, y)

In [67]:
# id3_tree.fit()