# Decision Tree with UCI Poker and scikit-learn

- **Student ID:** 21127135
- **Student name:** Diep Huu Phuc
- **Data set source:** https://archive.ics.uci.edu/ml/datasets/Poker+Hand

## Import library

- Make sure the `Import library` cell has been run at least once.
- Each task in this notebook is written to be stand-alone. As long as all libraries are imported and the files produced by earlier tasks already exist on disk, the code blocks do not have to be run sequentially.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## Preparing the data sets

### Merging the `.data` files

- Merge `poker-hand-training-true.data` and `poker-hand-testing.data` into `poker-hand-data.csv`. Both `.data` files and the resulting `.csv` are placed in the `pokerData` folder.
   - As no merging method was specified, all training data is written first, followed by the testing data.
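
The merge-and-write step can be tried on a toy example first. The sketch below is only an illustration: the two small frames are made-up stand-ins for the `.data` files, and an in-memory buffer replaces the real `poker-hand-data.csv`.

```python
import io
import pandas as pd

# Made-up stand-ins for the training and testing files (not real Poker Hand rows).
train_part = pd.DataFrame([[1, 10, 0], [2, 11, 1]])
test_part = pd.DataFrame([[3, 12, 9]])

# Training rows come first, test rows afterward.
merged = pd.concat([train_part, test_part])
print(list(merged.index))  # [0, 1, 0]: each part keeps its own row numbers

# Writing without header and index keeps only the data values,
# so the duplicated row numbers never reach the file.
buffer = io.StringIO()
merged.to_csv(buffer, header=False, index=False)

# Reading the text back with header=None restores a 3 x 3 table.
restored = pd.read_csv(io.StringIO(buffer.getvalue()), header=None)
print(restored.shape)  # (3, 3)
```

Because the index is dropped on write, there is no need to renumber the rows before saving.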

In [None]:
folder = './pokerData'
train_true_data = pd.read_csv(f'{folder}/poker-hand-training-true.data', header=None)
test_data = pd.read_csv(f'{folder}/poker-hand-testing.data', header=None)
pd.concat([train_true_data, test_data]).to_csv(f'{folder}/poker-hand-data.csv', header=False, index=False)

### Preparing the subsets

- Both the shuffling of the data and the stratified split are handled by `sklearn.model_selection.train_test_split`.
- The following four subsets are extracted from the merged data in `poker-hand-data.csv`.
   - **feature_train** - A set of training examples, each of which is a tuple of 10 attribute values (target attribute excluded).
   - **label_train** - A set of labels corresponding to the examples in feature_train.
   - **feature_test** - A set of test examples with the same structure as feature_train.
   - **label_test** - A set of labels corresponding to the examples in feature_test.
- The experiments use training and test sets of different proportions, namely `(train/test)` `40/60`, `60/40`, `80/20`, and `90/10`. Hence, a total of 16 subsets, 4 for each ratio, are written to the `pokerSubsets` folder.
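
To make the count of 16 files concrete, the sketch below lists the file names that follow from the four ratios. `subset_filenames` is a hypothetical helper introduced only for this illustration; the naming pattern itself mirrors the one used in this notebook.

```python
train_ratios = [0.4, 0.6, 0.8, 0.9]

def subset_filenames(ratio):
    """Return the four CSV names for one train ratio (hypothetical helper)."""
    train_pct = round(ratio * 100)
    test_pct = round((1 - ratio) * 100)
    return [f'{kind}_{part}_{pct}.csv'
            for kind in ('feature', 'label')
            for part, pct in (('train', train_pct), ('test', test_pct))]

all_names = [name for ratio in train_ratios for name in subset_filenames(ratio)]
print(len(all_names))         # 16
print(subset_filenames(0.9))

# Floating-point error is why the test percentage is rounded rather than truncated.
print(int((1 - 0.9) * 100), round((1 - 0.9) * 100))  # 9 10
```

The last line shows why `round` matters for the test side: `(1 - 0.9) * 100` evaluates to slightly less than 10, so truncating it would name the file with 9 instead of 10.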

In [None]:
folder = './pokerSubsets'
poker_hand_data = pd.read_csv('./pokerData/poker-hand-data.csv', header=None)
features = poker_hand_data.iloc[:, :-1] # features are every row of each column except the last column, which is Label.
labels = poker_hand_data.iloc[:, -1]    # labels are every row of the last column.
train_ratios = [0.4, 0.6, 0.8, 0.9]

for ratio in train_ratios:
    feature_train, feature_test, label_train, label_test = train_test_split(features, labels, train_size=ratio,
                                                                            random_state=42, shuffle=True, stratify=labels)

    feature_train.to_csv(f'{folder}/feature_train_{int(ratio * 100)}.csv', header=False, index=False)
    feature_test.to_csv(f'{folder}/feature_test_{round((1 - ratio) * 100)}.csv', header=False, index=False)
    label_train.to_csv(f'{folder}/label_train_{int(ratio * 100)}.csv', header=False, index=False)
    label_test.to_csv(f'{folder}/label_test_{round((1 - ratio) * 100)}.csv', header=False, index=False)

### Visualizing the distributions of classes

- For the `label` subsets, we merge `label_train` and `label_test` into `label_data` and check whether its class statistics match those of the original set.
- The `feature` subsets carry no `label` (or `class`) column, i.e., `Poker Hand`, so their class distributions cannot be plotted. Instead, their validity is confirmed by checking that the row counts of `feature_train_X.csv` and `feature_test_Y.csv` match their respective ratios.
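
The same two checks can be illustrated at a small scale. The sketch below assumes made-up, imbalanced labels and a single 80/20 split; it does not read the Poker Hand files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 of class 0, 15 of class 1, 5 of class 2.
labels = pd.Series([0] * 80 + [1] * 15 + [2] * 5)
features = pd.DataFrame({'x': range(len(labels))})

feature_train, feature_test, label_train, label_test = train_test_split(
    features, labels, train_size=0.8, random_state=42, shuffle=True, stratify=labels)

# Check 1: the feature subsets obey the requested ratio.
print(len(feature_train) / len(features))  # 0.8

# Check 2: the class counts in the training labels follow the original 80/15/5 proportions.
print(label_train.value_counts().sort_index().tolist())  # [64, 12, 4]

# Recombining both label subsets gives back the original proportions.
original = labels.value_counts(normalize=True).sort_index()
recombined = pd.concat([label_train, label_test]).value_counts(normalize=True).sort_index()
print(recombined.tolist() == original.tolist())  # True
```

Note that the recombined labels contain exactly the same rows as the original set, so their counts always match; it is the proportions inside each individual subset that the stratified split preserves.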

In [None]:
folder = './pokerSubsets'
Poker_Hand = ['Nothing in hand', 'One pair', 'Two pairs', 'Three of a kind', 'Straight',
              'Flush', 'Full house', 'Four of a kind', 'Straight flush', 'Royal flush']

# Plotting the original set
print("==== The original set ====")
poker_hand_data = pd.read_csv('./pokerData/poker-hand-data.csv', header=None)
class_counts = poker_hand_data.iloc[:, -1].value_counts().sort_index()
org_total_instances = total_instances = class_counts.sum()
class_ratios = [class_counts[i] / total_instances for i in range(len(class_counts))]

plt.figure(figsize=(14, 10))
plt.bar(range(len(class_counts)), class_counts)
plt.title('Distribution of classes in poker-hand-data.csv')
plt.xticks(range(len(class_counts)), Poker_Hand)
plt.xlabel('Class')
plt.ylabel('Instance')
for i in range(len(class_counts)):
    plt.text(i, class_counts[i], f'{class_counts[i]}\n({class_ratios[i] * 100:.4f}%)', ha='center', va='baseline')
plt.show()

# Plotting the training and test set
train_ratios = [0.4, 0.6, 0.8, 0.9]
for ratio in train_ratios:
    print(f"==== Train/Test Ratio of {int(ratio * 100)}/{round((1 - ratio) * 100)} ====")
    print('Proving the validity of feature subsets.')
    feature_train = pd.read_csv(f'{folder}/feature_train_{int(ratio * 100)}.csv', header=None)
    print(f'- Ratio of feature_train = {len(feature_train)} / {org_total_instances} = {len(feature_train) / org_total_instances}')
    feature_test = pd.read_csv(f'{folder}/feature_test_{round((1 - ratio) * 100)}.csv', header=None)
    print(f'- Ratio of feature_test = {len(feature_test)} / {org_total_instances} = {len(feature_test) / org_total_instances}')

    label_train = pd.read_csv(f'{folder}/label_train_{int(ratio * 100)}.csv', header=None)
    label_test = pd.read_csv(f'{folder}/label_test_{round((1 - ratio) * 100)}.csv', header=None)
    label_data = pd.concat([label_train, label_test])

    class_counts = label_data.iloc[:, -1].value_counts().sort_index()
    total_instances = class_counts.sum()
    class_ratios = [class_counts[i] / total_instances for i in range(len(class_counts))]

    plt.figure(figsize=(14, 10))
    plt.bar(range(len(class_counts)), class_counts)
    plt.title(f'Distribution of classes for train/test ratio of {int(ratio * 100)}/{round((1 - ratio) * 100)}')
    plt.xticks(range(len(class_counts)), Poker_Hand)
    plt.xlabel('Class')
    plt.ylabel('Instance')
    for i in range(len(class_counts)):
        plt.text(i, class_counts[i], f'{class_counts[i]}\n({class_ratios[i] * 100:.4f}%)', ha='center', va='baseline')
    plt.show()

## Building the decision tree classifiers

- The decision trees drawn by `graphviz` are output to the `decisionTreeClassifiers` folder.
   - **max_depth=None:** Rendering the full tree with `graphviz` takes impractically long, so only its first 10 levels are drawn. The decision tree, i.e., the `model`, is still built with `max_depth=None`; only the plot is cut off at depth 10.
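
The difference between limiting the model and limiting the plot can be seen on a toy tree. The sketch below uses made-up data (alternating labels along one feature, which forces a deep tree) and a plot cut-off of 2 instead of 10.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up data: labels alternate 0, 1, 0, 1, ... so every sample needs its own leaf.
X = [[i] for i in range(16)]
y = [i % 2 for i in range(16)]

# The model itself is grown without a depth limit.
model = DecisionTreeClassifier(criterion='entropy', random_state=42)
model.fit(X, y)

# With no out_file given, export_graphviz returns the DOT source as a string.
full_dot = export_graphviz(model)                 # every node
short_dot = export_graphviz(model, max_depth=2)   # only the top levels are drawn

print(model.get_n_leaves())             # 16: the fitted tree is unchanged
print(len(short_dot) < len(full_dot))   # True: only the drawing is truncated
```

Passing `max_depth` to `export_graphviz` affects the exported drawing only, while `max_depth` on `DecisionTreeClassifier` limits the fitted tree itself.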

In [None]:
sets_folder = './pokerSubsets'
tree_folder = './decisionTreeClassifiers'
Poker_Hand = ['Nothing in hand', 'One pair', 'Two pairs', 'Three of a kind', 'Straight',
              'Flush', 'Full house', 'Four of a kind', 'Straight flush', 'Royal flush']
train_ratios = [0.4, 0.6, 0.8, 0.9]
max_depth_observable = 10

for ratio in train_ratios:
    print(f"==== Decision Tree Classifier for Train/Test Ratio of {int(ratio * 100)}/{round((1 - ratio) * 100)} ====")
    feature_train = pd.read_csv(f'{sets_folder}/feature_train_{int(ratio * 100)}.csv', header=None)
    label_train = pd.read_csv(f'{sets_folder}/label_train_{int(ratio * 100)}.csv', header=None)
    model = DecisionTreeClassifier(criterion='entropy', random_state=42)
    model.fit(feature_train, label_train.iloc[:, 0])  # pass the labels as a 1-D Series

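    # The ten feature columns are the suit (S) and rank (C) of each of the five cards;
    # Poker_Hand holds the class names, not the feature names.
    feature_names = [f'{kind}{card}' for card in range(1, 6) for kind in ('S', 'C')]
    dot_data = export_graphviz(model, max_depth=max_depth_observable, feature_names=feature_names,
                               class_names=Poker_Hand, filled=True, rounded=True, special_characters=True)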
    graph = graphviz.Source(dot_data)
    graph.render(filename=f'depth_ob{max_depth_observable}_{int(ratio * 100)}{round((1 - ratio) * 100)}', directory=tree_folder)
    display(graph)

## Evaluating the decision tree classifiers

In [None]:
folder = './pokerSubsets'
Poker_Hand = ['Nothing in hand', 'One pair', 'Two pairs', 'Three of a kind', 'Straight',
              'Flush', 'Full house', 'Four of a kind', 'Straight flush', 'Royal flush']
train_ratios = [0.4, 0.6, 0.8, 0.9]

for ratio in train_ratios:
    feature_train = pd.read_csv(f'{folder}/feature_train_{int(ratio * 100)}.csv', header=None)
    label_train = pd.read_csv(f'{folder}/label_train_{int(ratio * 100)}.csv', header=None)
    model = DecisionTreeClassifier(criterion='entropy', random_state=42)
    model.fit(feature_train, label_train.iloc[:, 0])  # pass the labels as a 1-D Series

    feature_test = pd.read_csv(f'{folder}/feature_test_{round((1 - ratio) * 100)}.csv', header=None)
    label_test = pd.read_csv(f'{folder}/label_test_{round((1 - ratio) * 100)}.csv', header=None)
    label_pred = model.predict(feature_test)
    print(f"==== Train/Test Ratio of {int(ratio * 100)}/{round((1 - ratio) * 100)} ====")
    print('Decision Tree Classifier report')
    print(classification_report(label_test, label_pred, target_names=Poker_Hand, zero_division=0))

    cfm = confusion_matrix(label_test, label_pred)
    plt.figure(figsize=(14, 10))
    # fmt='d' prints the counts as plain integers instead of scientific notation.
    sns.heatmap(cfm, annot=True, fmt='d', linewidths=0.5, xticklabels=Poker_Hand, yticklabels=Poker_Hand)
    plt.title('Decision Tree Classifier confusion matrix')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

## The depth and accuracy of a decision tree

- This task uses the `80/20` training and test sets. `max_depth` takes the values `None`, `2`, `3`, `4`, `5`, `6`, and `7`.
- The decision trees drawn by `graphviz` are output to the `treeMaxDepth` folder.
   - **max_depth=None:** Rendering the full tree with `graphviz` takes impractically long, so only its first 10 levels are drawn. The decision tree, i.e., the `model`, is still built with `max_depth=None`; only the plot is cut off at depth 10.
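
The depth sweep can be rehearsed quickly on synthetic data before running it on the full subsets. The sketch below is an illustration under stated assumptions: the data comes from `make_classification` rather than the Poker Hand files, and the accuracies are collected in a dictionary named `scores` (a name chosen for this example) so they can be compared side by side.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 samples with 10 features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y)

# Fit one tree per depth limit and record its accuracy on the test set.
scores = {}
for depth in [None, 2, 3, 4, 5, 6, 7]:
    model = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, model.predict(X_test))

for depth, score in scores.items():
    print(f'max_depth={depth}: accuracy={score:.3f}')
```

Keeping the scores in one structure makes it straightforward to tabulate or plot accuracy against depth afterward.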

In [None]:
sets_folder = './pokerSubsets'
tree_folder = './treeMaxDepth'
Poker_Hand = ['Nothing in hand', 'One pair', 'Two pairs', 'Three of a kind', 'Straight',
              'Flush', 'Full house', 'Four of a kind', 'Straight flush', 'Royal flush']
ratio = 0.8 # train_ratio
max_depths = [None, 2, 3, 4, 5, 6, 7]
feature_train = pd.read_csv(f'{sets_folder}/feature_train_{int(ratio * 100)}.csv', header=None)
label_train = pd.read_csv(f'{sets_folder}/label_train_{int(ratio * 100)}.csv', header=None)

for depth in max_depths:
    print(f"==== Max Depth {depth} with Train/Test Ratio of {int(ratio * 100)}/{round((1 - ratio) * 100)} ====")
    model = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=42)
    model.fit(feature_train, label_train.iloc[:, 0])  # pass the labels as a 1-D Series

    # Read the test set from sets_folder, the folder variable defined in this cell.
    feature_test = pd.read_csv(f'{sets_folder}/feature_test_{round((1 - ratio) * 100)}.csv', header=None)
    label_test = pd.read_csv(f'{sets_folder}/label_test_{round((1 - ratio) * 100)}.csv', header=None)
    label_pred = model.predict(feature_test)
    
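    # The ten feature columns are the suit (S) and rank (C) of each of the five cards;
    # Poker_Hand holds the class names, not the feature names.
    feature_names = [f'{kind}{card}' for card in range(1, 6) for kind in ('S', 'C')]
    dot_data = export_graphviz(model, max_depth=(10 if depth is None else None), feature_names=feature_names,
                               class_names=Poker_Hand, filled=True, rounded=True, special_characters=True)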
    graph = graphviz.Source(dot_data)
    graph.render(filename=f'depth_{depth}_{int(ratio * 100)}{round((1 - ratio) * 100)}', directory=tree_folder)
    print(f'Accuracy score: {accuracy_score(label_test, label_pred)}')
    display(graph)