## Decision Tree Classifier - ID3 - Python
#### Ron Rounsifer | cs678 @ GVSU

In [80]:
import pandas as pd # for dataframe
import math # for log

### Leaves
A decision tree classifier is a collection of leaves, each representing a decision, that guides the program to the most likely classification of unseen data. We therefore need to define what a leaf is. Since I am working in Python 3, I will use a class (similar to a struct in C or C++) to represent each leaf. Each leaf will contain the following:

1. An attribute (that the decision will be made on)

2. Branches representing each possible decision of the attribute
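
As a rough illustration of the structure described above, a leaf could be sketched as the class below. The class name `Leaf`, the `label` field for terminal leaves, and the `classify` helper are assumptions made for this sketch and are not taken from the notebook's own code; the attribute values in the usage lines are made up.

```python
class Leaf:
    """One node of the tree (hypothetical sketch)."""

    def __init__(self, attribute=None, label=None):
        self.attribute = attribute  # attribute the decision is made on
        self.label = label          # class label, set only on terminal leaves (assumption)
        self.branches = {}          # attribute value -> child Leaf

    def add_branch(self, value, child):
        """Attach the child reached when the attribute takes this value."""
        self.branches[value] = child

    def classify(self, example):
        """Follow branches until a labelled leaf is reached; None if no branch matches."""
        if self.label is not None:
            return self.label
        child = self.branches.get(example[self.attribute])
        return None if child is None else child.classify(example)


# Usage with made-up values
root = Leaf(attribute='cost')
root.add_branch('low', Leaf(label='good'))
root.add_branch('high', Leaf(label='poor'))
print(root.classify({'cost': 'low'}))
```

Storing the branches in a dictionary keyed by attribute value keeps the lookup at classification time to a single step per level of the tree.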

### Cases
The data passed to this algorithm falls into one of three cases:

1) All examples are of the same classification => return a leaf with that class label

2) There are no more attributes to examine => return a leaf with the most common label

3) Otherwise (the examples have mixed labels and attributes remain) => choose the attribute that maximizes the Information Gain of the dataset as the root
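
The three cases can be read as the control flow of one recursive function. The sketch below is a minimal illustration under assumptions: `build_tree` and its `choose_attribute` parameter are hypothetical names, the tree is returned as a nested dictionary instead of leaf objects, and the toy rows are invented. The chooser passed in the usage lines simply takes the first remaining attribute and stands in for the Information Gain selection.

```python
import pandas as pd


def build_tree(df, choose_attribute, label='classification'):
    """Return a class label or a nested dict {attribute: {value: subtree}}."""
    labels = df[label].value_counts()  # sorted, most common label first

    # Case 1: every example has the same class
    if len(labels) == 1:
        return labels.index[0]

    # Case 2: no attributes left to test, fall back to the most common label
    attributes = [c for c in df.columns if c != label]
    if not attributes:
        return labels.index[0]

    # Case 3: split on the chosen attribute and recurse on each subset
    best = choose_attribute(df, attributes)
    tree = {best: {}}
    for value, subset in df.groupby(best):
        tree[best][value] = build_tree(subset.drop(columns=best), choose_attribute, label)
    return tree


# Usage on invented rows, with a placeholder chooser
toy = pd.DataFrame({
    'cost': ['low', 'low', 'high', 'high'],
    'maintenance': ['low', 'high', 'low', 'high'],
    'classification': ['good', 'good', 'poor', 'poor'],
})
tree = build_tree(toy, lambda df, attrs: attrs[0])
print(tree)
```

Dropping the chosen column before recursing is what eventually triggers case 2 on a branch whose examples never become pure.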

### Load the data

In [81]:
dataframe = pd.read_csv('./data/training.txt')

### Case 1 - all examples are of same class

In [82]:
# check if all examples are of the same class
def case_one(df):
    if len(df['classification'].value_counts().keys()) == 1:
        return df['classification'].value_counts().keys()[0]
    return False

### Case 2 - there are no attributes to test

In [83]:
# check if there are attributes to test
def case_two(df):
    if (len(df.columns) - 1) == 0:
        return df['classification'].value_counts().keys()[0]
    return False

### Case 3 - Normal Data
When neither of the first two cases applies, we can go about building the tree.


To begin, the root must be determined by calculating the Information Gain produced by each attribute. The Information Gain of an attribute is the entropy of the entire system minus the weighted average entropy of the subsets produced by splitting on that attribute. The code below starts with the entropy of the system.
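
Written out, the two quantities involved are as follows. The symbols are chosen for this note and do not appear in the code: $S$ is the dataset, $k$ the number of classes, $p_c$ the fraction of examples in class $c$, and $S_v$ the subset of $S$ where attribute $A$ takes the value $v$. The base-$k$ logarithm follows this notebook's code; the usual textbook statement of ID3 uses base 2.

$$H(S) = -\sum_{c=1}^{k} p_c \log_k p_c$$

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$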

In [84]:
def sys_entropy(df):
    """Calc systems entropy

    Calculate the entropy of the class labels over the whole dataset.
    """
    counts = df['classification'].value_counts()
    num_classes = len(counts)
    if num_classes < 2:
        # a single class has zero entropy, and log base 1 is undefined
        return 0.0

    N = len(df)
    entropy = 0.0
    for count in counts:
        x = count / N  # proportion of examples in this class
        # log base = number of classes, so the entropy is scaled to [0, 1]
        entropy -= x * math.log(x, num_classes)
    return entropy
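
To see how entropy feeds into Information Gain, here is a small worked sketch on invented data. The helper names `entropy` and `information_gain` and the toy rows are assumptions for illustration only; the log base is the number of classes, as in the cell above.

```python
import math

import pandas as pd


def entropy(labels, base):
    """Entropy of a Series of class labels, using the given log base."""
    if base < 2:
        return 0.0  # only one class: nothing is uncertain
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log(p, base) for p in probs)


def information_gain(df, attr, label='classification'):
    """Entropy of the whole frame minus the weighted entropy after splitting on attr."""
    base = df[label].nunique()
    remainder = sum(
        (len(subset) / len(df)) * entropy(subset[label], base)
        for _, subset in df.groupby(attr)
    )
    return entropy(df[label], base) - remainder


# Invented rows: 'cost' separates the classes perfectly, 'maintenance' not at all
toy = pd.DataFrame({
    'cost': ['low', 'low', 'high', 'high'],
    'maintenance': ['low', 'high', 'low', 'high'],
    'classification': ['good', 'good', 'poor', 'poor'],
})
for attr in ['cost', 'maintenance']:
    print(attr, information_gain(toy, attr))
```

In this toy frame every `cost` subset is pure, so its gain equals the full system entropy, while each `maintenance` subset is as mixed as the whole dataset and gains nothing. ID3 would therefore pick `cost` as the root.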

In [239]:
def gain(sys, df):
    # Exploratory version: prints the intermediate counts for inspection
    # and does not yet compute or return a gain.
    temp_entropy = 0.0
    num_classes = len(df['classification'].value_counts().keys())

    for col in df.columns:
        print('\n\n')
        print("Column: " + col)
        for attr_class in df[col].value_counts().keys():
            temp_entropy = 0.0
            print("\nAttr class: ", attr_class, df[col].value_counts()[attr_class])
            for final_class in df.groupby('classification'):
                # NOTE: len(final_class[1]) counts the class over the whole
                # dataset, not within attr_class, so x can exceed 1
                x = len(final_class[1]) / df[col].value_counts()[attr_class]
                temp_entropy -= (x * math.log(x, num_classes))

                print(final_class[0], len(final_class[1]), x)

In [240]:
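def run():
    # case 1: all examples share one class
    label = case_one(dataframe)
    if label:
        return label
    # case 2: no attributes left to test
    label = case_two(dataframe)
    if label:
        return label
    # case 3: pick the root from the information gain of each attribute
    system_entropy = sys_entropy(dataframe)
    root = gain(system_entropy, dataframe)
    return root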

run()




Column: cost

Attr class:  high 320
acceptable 278 0.86875
good 38 0.11875
poor 875 2.734375
vgood 45 0.140625

Attr class:  medium 309
acceptable 278 0.8996763754045307
good 38 0.12297734627831715
poor 875 2.831715210355987
vgood 45 0.14563106796116504

Attr class:  vhigh 309
acceptable 278 0.8996763754045307
good 38 0.12297734627831715
poor 875 2.831715210355987
vgood 45 0.14563106796116504

Attr class:  low 298
acceptable 278 0.9328859060402684
good 38 0.12751677852348994
poor 875 2.936241610738255
vgood 45 0.15100671140939598



Column: maintenance

Attr class:  high 314
acceptable 278 0.8853503184713376
good 38 0.12101910828025478
poor 875 2.786624203821656
vgood 45 0.14331210191082802

Attr class:  medium 311
acceptable 278 0.8938906752411575
good 38 0.12218649517684887
poor 875 2.8135048231511255
vgood 45 0.14469453376205788

Attr class:  low 307
acceptable 278 0.9055374592833876
good 38 0.1237785016286645
poor 875 2.8501628664495113
vgood 45 0.1465798045602606

Attr class:  

In [121]:
def gain(sys, df, max_gain=0.0):
    """Calculate gain

    Calculate the information gain of each attribute and return the
    attribute with the highest gain. max_gain is treated here as the
    gain an attribute must beat (an assumption); None is returned if
    no attribute beats it.
    """
    num_classes = df['classification'].nunique()
    if num_classes < 2:
        return None  # a pure dataset has nothing to gain from a split

    best_attr = None
    N = len(df)
    # every column except the label is a candidate attribute
    for attr in df.columns.drop('classification'):
        # weighted entropy of the subsets produced by splitting on attr
        attr_entropy = 0.0
        for attr_class, subset in df.groupby(attr):
            subset_entropy = 0.0
            for count in subset['classification'].value_counts():
                x = count / len(subset)  # class proportion within this subset
                subset_entropy -= x * math.log(x, num_classes)
            attr_entropy += (len(subset) / N) * subset_entropy

        # information gain = system entropy - entropy after the split
        info_gain = sys - attr_entropy
        if info_gain > max_gain:
            max_gain = info_gain
            best_attr = attr
    return best_attr