## Decision Tree Classifier

Tags: ID3, C4.5, C5.0, CART, Gini Index, Impurity, Information Gain, Entropy

In [17]:
# CART (Classification and Regression Trees) uses the Gini index
# ID3 uses entropy and information gain

In [18]:
# Pick the first (root) node:
# determine the attribute that best classifies the training data and use it at the root of the tree.
# Repeat this process for each branch.

1. Entropy: a measure of uncertainty (impurity) in the data.
2. Information Gain: the reduction in entropy after the data is split on attribute a.

ID3:
1. Compute the entropy of the data set.
2. For every attribute/feature:
       1. calculate the entropy for all categorical values
       2. take the weighted average entropy for the current attribute
       3. calculate the information gain for the current attribute
3. Pick the attribute with the highest gain.
4. Repeat on each branch until we get the tree we desire.
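The steps above can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the helper names `entropy_of` and `id3_gain` are hypothetical, rows are assumed to carry the class label in the last column, and every attribute is treated as categorical with one branch per distinct value, as ID3 does.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_of(rows):
    """Shannon entropy (in bits) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def id3_gain(rows, col):
    """Entropy of `rows` minus the weighted entropy after a multiway split on `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)  # one branch per distinct value
    weighted = sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return entropy_of(rows) - weighted

# Toy rows: color, size, label
rows = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 2, 'Grape'],
    ['Red', 2, 'Apple'],
]

# Gain of every attribute column (the last column is the label)
gains = {col: id3_gain(rows, col) for col in range(len(rows[0]) - 1)}
best_col = max(gains, key=gains.get)
print(gains, best_col)
```

On these rows the size column (index 1) has the larger gain, so under these assumptions it would be chosen for the root.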


In [1]:
import numpy as np

In [7]:
# color, size, label
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

In [8]:
# color, size, label
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 2, 'Grape'],
    ['Red', 2, 'Apple'],
]

In [9]:
def count_class_freq(rows):
    #last column is the class
    classes={} #dictionary
    for row in rows:
        c=row[-1]
        if c not in classes:
            classes[c]=1
        else:
            classes[c]+=1
    return classes

In [10]:
count_class_freq(training_data)

{'Apple': 3, 'Grape': 3}

<img src='imgs/gini-index-formula.png' width=20%>

In [11]:
def gini(rows):
    classes=count_class_freq(rows)
    impurity = 1
    for c in classes:
        prob_of_c = classes[c] / float(len(rows))
        impurity -= prob_of_c**2
    return impurity

In [12]:
gini(training_data)

0.5
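
As a check on the value above, the Gini index can be written out by hand. The symbols $p_i$ (share of class $i$ in the node) and $k$ (number of classes) are notation introduced here for illustration; the formula is the one the `gini` function computes.

$$
G = 1 - \sum_{i=1}^{k} p_i^2, \qquad G_{\text{training\_data}} = 1 - \left[\left(\tfrac{3}{6}\right)^2 + \left(\tfrac{3}{6}\right)^2\right] = 0.5
$$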


CART:
1. Compute the Gini index of the data set.
2. For every attribute/feature/column:
       1. calculate the Gini index for all categorical values
       2. take the weighted average Gini index for the current attribute
       3. calculate the Gini gain
3. Pick the attribute with the best Gini gain.
4. Repeat on each branch until we get the tree we desire.


#### Gini index and Entropy
Decision tree algorithms choose the split that yields the largest information gain. The gain is computed from an impurity criterion, either the Gini index or entropy.
Both Gini and entropy measure the impurity of a node. A node containing multiple classes is impure, whereas a node containing only one class is pure. Entropy in information theory is analogous to entropy in thermodynamics, where it signifies disorder: if there are multiple classes in a node, there is disorder in that node.

Information gain is the entropy of the parent node minus the sum of the weighted entropies of the child nodes.
The weight of a child node is the number of samples in that node divided by the total number of samples across all child nodes. The gain is calculated in the same way with the Gini score.
 <img src='imgs/ginientropy.jpg'>
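
Written out, the definition reads as follows. Here $I$ stands for either impurity measure (entropy or Gini), $n_k$ for the number of samples in child $k$, and $n$ for the total across all children; this notation is assumed for illustration and is not taken from the figure.

$$
\mathrm{Gain} = I(\text{parent}) - \sum_{k} \frac{n_k}{n}\, I(\text{child}_k)
$$

For example, a parent with entropy 1.0 that is split into a pure child of 2 samples and a child of 4 samples with entropy of about 0.811 gives

$$
1.0 - \tfrac{2}{6}(0) - \tfrac{4}{6}(0.811) \approx 0.459
$$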

#### Entropy vs gini
<img src='imgs/entropy_vs_gini.png' width=40%>
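
The two curves can also be compared numerically for the two-class case. The sketch below is illustrative only: `binary_gini` and `binary_entropy` are hypothetical helper names, and `p` is assumed to be the fraction of one of the two classes in a node.

```python
from math import log2

def binary_gini(p):
    """Gini index of a two-class node where one class has fraction p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def binary_entropy(p):
    """Entropy (bits) of a two-class node; 0 * log2(0) is taken as 0."""
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

# Both measures are 0 for a pure node and largest at p = 0.5
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:<4}  gini={binary_gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```

Both measures vanish for a pure node and peak at an even class mix, where Gini reaches 0.5 and entropy reaches 1.0.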

In [13]:
from math import log
def entropy(rows):
    classes=count_class_freq(rows)
    impurity = 0
    for c in classes:
        prob_of_c = classes[c] / float(len(rows))
        impurity -= prob_of_c * log(prob_of_c, 2)
    return impurity

In [14]:
entropy(training_data)

1.0

In [15]:
# Information gain for a split.
def info_gain(left, right, current_uncertainty):
    """Information Gain.

    The uncertainty of the starting node, minus the weighted impurity of
    two child nodes. Child impurity is measured with gini(), so
    current_uncertainty should be a Gini value as well.
    """
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)
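
Because `info_gain` always measures the children with `gini`, it yields a Gini gain rather than an entropy-based one. One possible generalisation is sketched below, under the assumption that the impurity function is passed in as a callable. The names `split_gain`, `class_probs`, `gini_impurity` and `entropy_impurity` are hypothetical and not part of the notebook, and both children are assumed to be non-empty.

```python
from collections import Counter
from math import log2

def class_probs(rows):
    """Class probabilities, taking the last column as the label."""
    counts = Counter(row[-1] for row in rows)
    return [n / len(rows) for n in counts.values()]

def gini_impurity(rows):
    return 1 - sum(p ** 2 for p in class_probs(rows))

def entropy_impurity(rows):
    return sum(-p * log2(p) for p in class_probs(rows))

def split_gain(left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    parent = left + right
    p = len(left) / len(parent)
    return impurity(parent) - p * impurity(left) - (1 - p) * impurity(right)

# The split "size >= 3" on the toy fruit rows
left = [['Green', 3, 'Apple'], ['Yellow', 3, 'Apple']]
right = [['Red', 1, 'Grape'], ['Red', 1, 'Grape'],
         ['Yellow', 2, 'Grape'], ['Red', 2, 'Apple']]
print(split_gain(left, right, gini_impurity))     # Gini gain
print(split_gain(left, right, entropy_impurity))  # entropy-based gain
```

The two gains are on different scales, so values from one criterion should only be compared with other values from the same criterion.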

In [16]:
def is_numeric(value):
    """Test if a value is numeric."""
    return isinstance(value, (int, float))

In [17]:
is_numeric(4)

True

In [18]:
is_numeric('Red')

False

In [19]:
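def partition(rows, q, col):
    """Split rows on column `col`.

    Numeric values go to true_rows when value >= q,
    non-numeric values when value == q; everything else goes to false_rows.
    """
    true_rows, false_rows = [], []
    for row in rows:
        cv = row[col]
        if is_numeric(cv):
            matched = cv >= q
        else:
            matched = cv == q
        if matched:
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows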

In [20]:
true_rows, false_rows = partition(training_data, 'Red', 0)
print('true: ', true_rows)
print('false: ', false_rows)

true:  [['Red', 1, 'Grape'], ['Red', 1, 'Grape'], ['Red', 2, 'Apple']]
false:  [['Green', 3, 'Apple'], ['Yellow', 3, 'Apple'], ['Yellow', 2, 'Grape']]


In [21]:
true_rows, false_rows = partition(training_data, 3, 1)
print('true: ', true_rows)
print('false: ', false_rows)

true:  [['Green', 3, 'Apple'], ['Yellow', 3, 'Apple']]
false:  [['Red', 1, 'Grape'], ['Red', 1, 'Grape'], ['Yellow', 2, 'Grape'], ['Red', 2, 'Apple']]


In [22]:
cr=gini(training_data) #current
print(cr)

0.5


In [23]:
true_rows, false_rows = partition(training_data, 'Green', 0)
info_gain(true_rows, false_rows, cr)

0.09999999999999998

In [24]:
true_rows, false_rows = partition(training_data, 'Red', 0)
info_gain(true_rows, false_rows, cr)

0.05555555555555555

In [25]:
col=0
values = set([row[col] for row in training_data])
print(values)
for value in values:
    true_rows, false_rows = partition(training_data, value, col)
    ig=info_gain(true_rows, false_rows, cr)
    print(value,' : IG=',ig)

{'Yellow', 'Green', 'Red'}
Yellow  : IG= 0.0
Green  : IG= 0.09999999999999998
Red  : IG= 0.05555555555555555


In [26]:
col=1
values = set([row[col] for row in training_data])
print(values)
for value in values:
    true_rows, false_rows = partition(training_data, value, col)
    ig=info_gain(true_rows, false_rows, cr)
    print(value,' : IG=',ig)

{1, 2, 3}
1  : IG= 0.0
2  : IG= 0.25
3  : IG= 0.25


In [27]:
def find_best_split(rows, igcol=None):
    """Return (best_gain, best_col, best_value) over all columns except igcol."""
    ncol = len(rows[0]) - 1  # last column is the class label
    current_uncertainty = gini(rows)
    best_gain = 0
    best_col = None
    best_feature = None
    for col in range(ncol):
        if igcol is not None and col == igcol:  # ignore this column
            continue
        values = set(row[col] for row in rows)  # unique values in the column
        for val in values:  # for each value
            # try splitting the dataset
            true_rows, false_rows = partition(rows, val, col)
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue  # skip splits that leave one side empty
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            # >= means a later split wins ties, and zero-gain splits are accepted
            if gain >= best_gain:
                best_gain = gain
                best_col = col
                best_feature = val

    return best_gain, best_col, best_feature

find_best_split(training_data)

(0.25, 1, 3)

In [28]:
gain, best_col, best_val=find_best_split(training_data)
print('gain: ', gain, ' best_col: ',best_col,' best_val=',best_val)

gain:  0.25  best_col:  1  best_val= 3


In [29]:
igcol=None
gain, best_col, best_val=find_best_split(training_data, igcol)
print('gain: ', gain, ' best_col: ',best_col,' best_val=',best_val)

gain:  0.25  best_col:  1  best_val= 3


In [30]:
true_rows, false_rows = partition(training_data, best_val, best_col)
print('true: ', true_rows)
print('false: ', false_rows)
igcol=best_col

true:  [['Green', 3, 'Apple'], ['Yellow', 3, 'Apple']]
false:  [['Red', 1, 'Grape'], ['Red', 1, 'Grape'], ['Yellow', 2, 'Grape'], ['Red', 2, 'Apple']]


In [31]:
gain, best_col, best_val=find_best_split(true_rows, igcol)
print('gain: ', gain, ' best_col: ',best_col,' best_val=',best_val)

gain:  0.0  best_col:  0  best_val= Green


In [32]:
gain, best_col, best_val=find_best_split(false_rows, igcol)
print('gain: ', gain, ' best_col: ',best_col,' best_val=',best_val)

gain:  0.04166666666666663  best_col:  0  best_val= Red


In [33]:
if gain==0:
    print('Leaf Node: ', best_val)
    print( 'leaf: '+str(best_col)+' :'+str(best_val) )

In [34]:

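def build_tree(rows, igcol=None):
    """Recursively split rows and return a text description of the tree.

    igcol is the column used by the parent split; it is skipped at this level.
    """
    gain, best_col, best_val=find_best_split(rows, igcol)
    print('gain: ', gain, ' best_col: ',best_col,' best_val=',best_val)
    if gain==0:
        # No split improves purity, so stop here. The leaf text reports the
        # last zero-gain split tried (or None), not a class label.
        print('Leaf Node: ', best_val)
        return 'leaf: '+str(best_col)+' :'+str(best_val)
    true_rows, false_rows = partition(rows, best_val, best_col)

    true_branch=build_tree(true_rows, best_col)
    false_branch=build_tree(false_rows, best_col)
    return 'Decision: '+str(best_col)+' :'+str(best_val)+'\nTrue:'+str(true_branch)+'\nFalse: '+str(false_branch)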
    

In [35]:
build_tree(training_data)

gain:  0.25  best_col:  1  best_val= 3
gain:  0.0  best_col:  0  best_val= Green
Leaf Node:  Green
gain:  0.04166666666666663  best_col:  0  best_val= Red
gain:  0.4444444444444445  best_col:  1  best_val= 2
gain:  0  best_col:  None  best_val= None
Leaf Node:  None
gain:  0  best_col:  None  best_val= None
Leaf Node:  None
gain:  0  best_col:  None  best_val= None
Leaf Node:  None


'Decision: 1 :3\nTrue:leaf: 0 :Green\nFalse: Decision: 0 :Red\nTrue:Decision: 1 :2\nTrue:leaf: None :None\nFalse: leaf: None :None\nFalse: leaf: None :None'
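
In the output above, each leaf reports `best_col` and `best_val` (often `None`) rather than a class, so the printed tree cannot be used for prediction. A minimal alternative is sketched below; it is not the notebook's method. It assumes leaves store class counts, represents the tree as nested tuples, accepts only splits with strictly positive Gini gain, and does not skip the parent's column. The names `matches`, `best_split`, `build` and `classify` are hypothetical.

```python
from collections import Counter

def gini(rows):
    counts = Counter(row[-1] for row in rows)
    return 1 - sum((n / len(rows)) ** 2 for n in counts.values())

def matches(value, q):
    """Numeric values test `>= q`; anything else tests equality."""
    if isinstance(value, (int, float)):
        return value >= q
    return value == q

def partition(rows, q, col):
    true_rows = [r for r in rows if matches(r[col], q)]
    false_rows = [r for r in rows if not matches(r[col], q)]
    return true_rows, false_rows

def best_split(rows):
    """Return (gain, col, value) of the best split, or (0, None, None)."""
    best = (0, None, None)
    current = gini(rows)
    for col in range(len(rows[0]) - 1):
        for val in sorted({r[col] for r in rows}):  # sorted for a stable order
            t, f = partition(rows, val, col)
            if not t or not f:
                continue
            p = len(t) / len(rows)
            gain = current - p * gini(t) - (1 - p) * gini(f)
            if gain > best[0]:
                best = (gain, col, val)
    return best

def build(rows):
    gain, col, val = best_split(rows)
    if col is None:  # nothing improves purity: make a leaf of class counts
        return dict(Counter(r[-1] for r in rows))
    t, f = partition(rows, val, col)
    return (col, val, build(t), build(f))

def classify(row, node):
    if isinstance(node, dict):  # leaf: predict the most frequent class
        return max(node, key=node.get)
    col, val, true_branch, false_branch = node
    return classify(row, true_branch if matches(row[col], val) else false_branch)

training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 2, 'Grape'],
    ['Red', 2, 'Apple'],
]
tree = build(training_data)
print(tree)
print([classify(row, tree) for row in training_data])
```

Storing counts in the leaves keeps the information needed for a prediction (the majority class) and for a rough confidence estimate (the class proportions).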

In [36]:
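def num_split(data, th):
    """Split a 2-D NumPy array on its first column: values <= th go left, > th go right."""
    lt=[]
    rt=[]
    for i,d in enumerate(data[:,0]):
        d=int(d)  # note: truncates fractional values before comparing
        if d>th:
            rt.append(data[i])
        else:
            lt.append(data[i])
    return lt,rt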

In [37]:
dt=[
    [10.0,'F'],
    [0.0, 'F'],
    [80.0, 'P'],
    [100.0,'P'],
    [35.0, 'F'],
    [37.0, 'P'],
    [34.0, 'F'],
    [25.0, 'F']
]

In [38]:
ct=gini(dt)
print(ct)

0.46875


In [39]:
th=25
lb=[]
rb=[]
for row in dt:
    v=row[0]
    if v<=th:
        lb.append(row)
    else:
        rb.append(row)

In [40]:
def split_data(dt, th):
    """Split rows on the first column: values <= th go to lb, the rest to rb."""
    lb=[]
    rb=[]
    for row in dt:
        v=row[0]
        if v<=th:
            lb.append(row)
        else:
            rb.append(row)
    return lb,rb

In [41]:
lb,rb=split_data(dt, 50)
print(lb)
print(rb)

[[10.0, 'F'], [0.0, 'F'], [35.0, 'F'], [37.0, 'P'], [34.0, 'F'], [25.0, 'F']]
[[80.0, 'P'], [100.0, 'P']]


In [42]:
import numpy as np
# ndt=np.array(dt)

In [44]:
lg=gini(lb)
rg=gini(rb)
print(lg,rg)
td=len(lb)+len(rb)

ig=ct- (len(lb)/td)*lg - (len(rb)/td)*rg
print(ig)

0.2777777777777777 0.0
0.26041666666666674
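
The threshold of 50 above was chosen by hand. A common way to pick it automatically is to try a cut between every pair of neighbouring values and keep the one with the largest Gini gain. The sketch below assumes rows of the form `[value, label]` and uses midpoints between consecutive distinct sorted values as candidates; that choice is an assumption, since the notebook itself only tries hand-picked thresholds. The names `gini_of` and `best_threshold` are hypothetical.

```python
from collections import Counter

def gini_of(rows):
    """Gini index of the labels in the last column (0 for an empty list)."""
    if not rows:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return 1 - sum((n / len(rows)) ** 2 for n in counts.values())

def best_threshold(rows):
    """Scan midpoints between sorted distinct values; return (gain, threshold)."""
    parent = gini_of(rows)
    values = sorted({row[0] for row in rows})
    best_gain, best_th = 0.0, None
    for lo, hi in zip(values, values[1:]):
        th = (lo + hi) / 2
        left = [r for r in rows if r[0] <= th]
        right = [r for r in rows if r[0] > th]
        w = len(left) / len(rows)
        gain = parent - w * gini_of(left) - (1 - w) * gini_of(right)
        if gain > best_gain:
            best_gain, best_th = gain, th
    return best_gain, best_th

marks = [
    [10.0, 'F'], [0.0, 'F'], [80.0, 'P'], [100.0, 'P'],
    [35.0, 'F'], [37.0, 'P'], [34.0, 'F'], [25.0, 'F'],
]
print(best_threshold(marks))
```

On these marks the scan settles on the cut between 35 and 37, which separates the two classes completely, so the gain equals the parent's Gini index.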


In [45]:
td=len(lb)+len(rb)

In [46]:
(len(lb)/td)*lg

0.20833333333333326


In [51]:
#information gain for a split.
def info_gain(left, right, current_uncertainty):
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)

In [52]:
dt=[
    [10.0,1],
    [30.0, 2],
    [60.0, 3],
    [50.0,2],
    [20.0, 1],
    [95.0, 3],
    [85.0, 1]
]

In [56]:
dt=[
    [30.0, 2],
    [60.0, 3],
    [50.0,2],
    [95.0, 3],
    [85.0, 1]
]

In [54]:
dt=[
    [60.0, 3],
    [95.0, 3],
    [85.0, 1]
]

In [55]:
dt=[
    [17,0],
    [25, 0],
    [38, 0],
    [42,1],
    [44, 1],
    [47, 2],
    [49, 2],
    [50, 3],
    [54, 3],
    [53, 3]
]

In [11]:
data=np.array(dt)

In [12]:
ndt=np.array(dt)

In [13]:
lt,rt=num_split(data, 25)
print(lt)
print()
print(rt)

[array([17,  0]), array([25,  0])]

[array([38,  0]), array([42,  1]), array([44,  1]), array([47,  2]), array([49,  2]), array([50,  3]), array([54,  3]), array([53,  3])]


In [23]:
lt,rt=num_split(data, 25)
ct=gini(data)  # Gini index of the full data set, used as the parent impurity
info_gain(lt,rt,ct)

0.16499999999999992

In [46]:
print(rt[0])
data=rt[0]

[30.  2.]


In [27]:
lt,rt=num_split(data, 54)
info_gain(lt,rt,ct)

0.0