# Decision Tree Example
This notebook follows Jason Brownlee's [tutorial](https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/) on implementing a decision tree from scratch in Python.

In [3]:
#### Standard Libraries ####
import typing

#### Third party libraries ####
import pandas as pd

In [4]:
#### Import the dataset ####
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt'
column_labels = ['variance','skew','curtosis','entropy','class']
df = pd.read_csv(url,names=column_labels)

In [5]:
df.head()

Unnamed: 0,variance,skew,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


# Evaluating splits with Gini index
The Gini index measures how well a split separates the classes into groups. A perfect split, where one pool contains only data labeled class A and the other only data labeled class B, has a Gini score of 0. At the other end of the spectrum, for a two-class problem, two pools that each hold 50% class A and 50% class B have a Gini score of 0.5.
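
To make the two extremes concrete, here is a minimal sketch in plain Python. The helper names `gini_of_pool` and `gini_of_split` are hypothetical (they are not from the tutorial), and the sketch assumes the usual definition: one minus the sum of squared class proportions per pool, weighted by pool size.

```python
from collections import Counter


def gini_of_pool(labels):
    """Gini impurity of one pool: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0  # an empty pool contributes nothing
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_of_split(pools):
    """Size-weighted sum of the pool impurities."""
    num_samples = sum(len(pool) for pool in pools)
    if num_samples == 0:
        return 0.0
    return sum(gini_of_pool(pool) * len(pool) / num_samples for pool in pools)


# Perfect split: each pool holds a single class.
perfect = gini_of_split([["A", "A"], ["B", "B"]])
# Worst two-class split: each pool is half A, half B.
mixed = gini_of_split([["A", "B"], ["A", "B"]])
print(perfect, mixed)  # 0.0 0.5
```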

Let's consider a toy example where we have two groups of data. Each group has two data points (rows), and each data point is labeled with a class. In this example we are splitting children who attended a summer camp based on gender so that we can assign them to dormitories. In other words, campers are *classified* based on their gender. We adopt the notation that 1 represents a yes to the question *is male?* Each dorm represents a group.

https://www.sielc.com/Application-HPLC-Separation-of-Isomers-of-Aminobenzoic-Acid.html

<center>Dorm 1</center>

| Camper | Male? |
|--------|-------|
| Tim    | 1     |
| Mike   | 1     |

<center>Dorm 2</center>

| Camper | Male? |
|--------|-------|
| Julie  | 0     |
| Sarah  | 0     |

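To calculate the Gini index of a split, we first need to calculate the proportion of each class in each group. The cells below work through this on a second toy data set, in which molecules labeled with a class (aldehyde or ketone) have been divided into two groups.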

In [17]:
group = pd.DataFrame([['Tim','Male'],['Mike','Male'],['Eva','Female']],columns=['Camper','Sex'])

In [32]:
test_df = pd.DataFrame([['acetone','ketone',1],
                        ['acetaldehyde','aldehyde',1],
                        ['formaldeyde','aldehyde',2],
                        ['vanillin','aldehyde',1],
                        ['cyclohexanone','ketone',2],
                        ['ethyl methyl ketone','ketone',2]],
                        columns=['name','class','group'])


In [33]:
test_df

Unnamed: 0,name,class,group
0,acetone,ketone,1
1,acetaldehyde,aldehyde,1
2,formaldeyde,aldehyde,2
3,vanillin,aldehyde,1
4,cyclohexanone,ketone,2
5,ethyl methyl ketone,ketone,2


In [48]:
groups = test_df.groupby('group')


In [49]:
for key, group in groups:
    # classes present in this group
    print(group['class'].unique())
    # number of aldehydes in this group
    print((group['class'] == 'aldehyde').sum())
    print(group, "\n\n")
    

['ketone' 'aldehyde']
2
           name     class  group
0       acetone    ketone      1
1  acetaldehyde  aldehyde      1
3      vanillin  aldehyde      1 


['aldehyde' 'ketone']
1
                  name     class  group
2          formaldeyde  aldehyde      2
4        cyclohexanone    ketone      2
5  ethyl methyl ketone    ketone      2 




In [50]:
data = []

for grp, group in groups:
    total = len(group.index)  # number of samples in this group

    for cls in group['class'].unique():
        # fraction of the group's samples that belong to this class
        numerator = (group['class'] == cls).sum()
        prop = numerator / total
        data.append([grp, cls, prop])

columns = ['group', 'class', 'proportion']
pd.DataFrame(data=data, columns=columns)

Unnamed: 0,group,class,proportion
0,1,ketone,0.333333
1,1,aldehyde,0.666667
2,2,aldehyde,0.333333
3,2,ketone,0.666667
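
The proportion table above can also be built without explicit loops. The sketch below is one possible alternative, not the tutorial's method: it uses `pd.crosstab` with `normalize='index'`, and it rebuilds only the `class` and `group` columns of the toy data so that it runs on its own. The name `samples` is hypothetical.

```python
import pandas as pd

# Same class/group assignments as the toy data above (names omitted).
samples = pd.DataFrame({
    'class': ['ketone', 'aldehyde', 'aldehyde', 'aldehyde', 'ketone', 'ketone'],
    'group': [1, 1, 2, 1, 2, 2],
})

# Rows are groups, columns are classes; normalize='index' divides each
# row by its total, so every row sums to 1.
proportions = pd.crosstab(samples['group'], samples['class'], normalize='index')
print(proportions)
```

This returns a wide table (one column per class) rather than the long table built by the loop, but the values are the same proportions.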


In [None]:
# index, class, group,

In [54]:
def gini_index(data: pd.DataFrame) -> float:
    """
    A function for calculating the Gini index of a
    split.

    ARGS:
        data (pd.DataFrame) | One row per sample, with a 'class'
            column (the label) and a 'group' column (the side of
            the split the sample was assigned to).

    RETURNS:
        gini_index (float) | The score of the split.
    """
    num_samples = len(data.index)

    # calculate the proportion of each class in each group
    def _proportion(samples: pd.DataFrame) -> pd.DataFrame:
        rows = []
        for grp, group in samples.groupby('group'):
            total = len(group.index)
            for cls in group['class'].unique():
                count = (group['class'] == cls).sum()
                rows.append([grp, cls, count / total, total])
        columns = ['group', 'class', 'proportion', 'size']
        return pd.DataFrame(data=rows, columns=columns)

    # calculate the size-weighted Gini index of each group
    def _group_gini(proportions: pd.DataFrame) -> pd.DataFrame:
        rows = []
        for grp, group in proportions.groupby('group'):
            # number of samples in the group, not rows in the table
            group_size = group['size'].iloc[0]
            # ** is exponentiation; ^ is bitwise XOR in Python
            squared_prop = group['proportion'] ** 2
            group_gini = (1 - squared_prop.sum()) * (group_size / num_samples)
            rows.append([grp, group_gini])
        return pd.DataFrame(data=rows, columns=['group', 'Gini index'])

    # the score of the split is the sum of the weighted group scores
    return float(_group_gini(_proportion(data))['Gini index'].sum())

In [55]:
print(round(gini_index(test_df), 4))

0.4444
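
As an independent cross-check on the Gini calculation, here is a compact vectorized sketch. It is an assumption-laden alternative rather than the tutorial's implementation: the function name `gini_index_crosstab` is hypothetical, and it expects a frame with `class` and `group` columns like the toy data above.

```python
import pandas as pd


def gini_index_crosstab(samples: pd.DataFrame) -> float:
    """Weighted Gini index of a split, computed from a group-by-class count table."""
    # counts[g, c] = number of samples of class c in group g
    counts = pd.crosstab(samples['group'], samples['class'])
    group_sizes = counts.sum(axis=1)
    # class proportions within each group (each row sums to 1)
    proportions = counts.div(group_sizes, axis=0)
    # per-group impurity: 1 minus the sum of squared proportions
    group_gini = 1.0 - (proportions ** 2).sum(axis=1)
    # weight each group by its share of all samples
    weights = group_sizes / group_sizes.sum()
    return float((group_gini * weights).sum())


# Same class/group assignments as the molecule toy data.
mixed_split = pd.DataFrame({
    'class': ['ketone', 'aldehyde', 'aldehyde', 'aldehyde', 'ketone', 'ketone'],
    'group': [1, 1, 2, 1, 2, 2],
})
# A split that separates the two classes perfectly.
pure_split = pd.DataFrame({
    'class': ['ketone', 'ketone', 'aldehyde', 'aldehyde'],
    'group': [1, 1, 2, 2],
})

mixed_score = gini_index_crosstab(mixed_split)
pure_score = gini_index_crosstab(pure_split)
print(round(mixed_score, 4), pure_score)
```

For the molecule data, each group has proportions 1/3 and 2/3, so each group's impurity is 1 - (1/9 + 4/9) = 4/9, and the size-weighted total is also 4/9. The perfectly separated split scores 0.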