## How Decision Trees Learn to Make Decisions
+ To understand how decision trees work, we first need the concept of entropy.

Information entropy is the average rate at which information is produced by a stochastic source of data.

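The information associated with each possible data value is the negative logarithm of the probability mass function for that value. Weighting it by the probability of the value gives that value's contribution to the entropy:
+ for a single class with probability $p$

$$\mathrm{ent}(p) = -p \log_2 p$$

+ summed over all classes $i$

$$S = -\sum_i p_i \log_2 p_i$$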
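The sum over classes can be sketched directly in code. The helper below, `entropy_from_labels`, is a hypothetical name that is not part of this notebook; it assumes base-2 logarithms, so the result is in bits, and it accepts any number of classes.

```python
import numpy as np
from collections import Counter

def entropy_from_labels(labels):
    # Hypothetical helper (not from the notebook): Shannon entropy in bits
    n = len(labels)
    probs = np.array([count / n for count in Counter(labels).values()])
    # Every class counted here has p > 0, so log2 is always defined
    return float(-np.sum(probs * np.log2(probs)))

two_classes = entropy_from_labels([0, 1, 0, 1])            # two equally likely classes
four_classes = entropy_from_labels(['a', 'b', 'c', 'd'])   # four equally likely classes
print(two_classes, four_classes)
```

With k equally likely classes the entropy is log2(k), so it only stays between zero and one in the binary case.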


In [122]:
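#### Define an entropy function for a single binary class (0/1) outcome
import numpy as np
from collections import Counter

def get_entropy(p):
    # Entropy contribution of one class with probability p, in bits.
    # By convention 0 * log2(0) is treated as 0.
    if p == 0:
        return 0
    else:
        return -p * np.log2(p)


def get_homogeneity(x):
    # Binary entropy of x: p is the share of the most common value and
    # 1 - p is treated as the share of the other class.
    n = len(x)
    counts = Counter(x).most_common()
    p = counts[0][1]/n
    return get_entropy(p) + get_entropy(1-p)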


#### Explore how entropy works on different distributions
For a binary variable, entropy is a value between zero and one that measures how mixed the variable is.
+ constant variables have an entropy of zero
+ a 50/50 mix has the highest entropy, one
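To see the shape of this relationship, the sketch below evaluates binary entropy at a few probabilities. `binary_entropy` is a hypothetical helper defined here only so the example is self-contained; it assumes base-2 logarithms.

```python
import math

def binary_entropy(p):
    # Hypothetical helper: entropy in bits of a 0/1 variable with P(1) = p
    if p in (0, 1):
        return 0.0  # a constant variable carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p = {p:.2f}  entropy = {binary_entropy(p):.3f}")
```

The values are symmetric around p = 0.5: a variable that is 25% ones is exactly as mixed as one that is 75% ones.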



In [126]:
# Constant distributions
print('entropy all ones', get_homogeneity(np.ones(100)))
print(' all zeros', get_homogeneity(np.zeros(100)))
# Mixed distributions
print('half ones and zeros', get_homogeneity([0,1] * 100))
print('Mostly zeros', get_homogeneity([0, 0, 0, 1] * 100))
print('Mostly Ones', get_homogeneity([1, 1, 0, 1] * 100))

entropy all ones 0.0
 all zeros 0.0
half ones and zeros 1.0
Mostly zeros 0.8112781244591328
Mostly Ones 0.8112781244591328


#### Information Gain is the difference in entropy of a given variable before and after a split
The example below uses data with p(y=1|x=1) = .25 and p(y=1|x=2) = .50

+ The function below estimates information gain for y given x as the difference between the entropy of y overall and the entropy of y within each value of x

In [132]:
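def get_information_gain(x, y):
    # Unweighted gain per value of x: entropy of y overall minus
    # entropy of y within the rows where x equals that value.
    x = np.array(x)
    y = np.array(y)
    counts = Counter(x).most_common()
    h = get_homogeneity(y)  # entropy before the split
    output = {}
    for c in counts:
        h_after = get_homogeneity(y[x == c[0]])  # entropy inside this branch
        info_gain = h - h_after
        output.update({c[0]: info_gain})
    return output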


    
    

In [133]:
import pandas as pd
x =  [1,1,1,1,2,2,2,2]
y = [0,0,0,1, 0,0,1,1]
get_information_gain(x,y)

{1: 0.1431558784658321, 2: -0.04556599707503506}
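The negative value for `x = 2` can be checked by hand. The sketch below recomputes the three entropies with the standard library only; the helper name `binary_entropy` is an assumption made for this example, not a function from the notebook.

```python
import math

def binary_entropy(p):
    # Hypothetical helper: entropy in bits of a 0/1 variable with P(1) = p
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h_all = binary_entropy(3 / 8)  # y overall: 3 ones out of 8
h_x1 = binary_entropy(1 / 4)   # y where x == 1: 1 one out of 4
h_x2 = binary_entropy(2 / 4)   # y where x == 2: 2 ones out of 4

gain_x1 = h_all - h_x1
gain_x2 = h_all - h_x2
# Weighting each branch by its share of rows (4 of 8 each) scores the split as a whole
gain_weighted = h_all - (4 / 8 * h_x1 + 4 / 8 * h_x2)

print('gain for x == 1:', gain_x1)
print('gain for x == 2:', gain_x2)
print('weighted gain for the split:', gain_weighted)
```

The `x == 2` branch is an even 50/50 mix, so its entropy is 1.0, which is higher than the entropy of `y` overall. That is why its unweighted gain is negative, while the row-weighted gain for the split as a whole is positive.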

In [None]:
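def get_information_gain_weighted(x, y):
    # Weighted information gain of splitting y on x: entropy of y minus
    # the average branch entropy, each branch weighted by its share of rows.
    x = np.array(x)
    y = np.array(y)
    n = len(y)
    counts = Counter(x).most_common()
    h = get_homogeneity(y)
    h_after = 0
    for c in counts:
        weight = c[1] / n  # fraction of rows in this branch
        h_after += weight * get_homogeneity(y[x == c[0]])
    return h - h_after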



### Gini Impurity 
An alternate way to determine splits is to use Gini impurity gain
+ Gini impurity for a variable with k classes is as follows

$$G(k) = \sum_{i=1}^{k} P(i)\,(1 - P(i))$$
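As a worked example, take a binary variable in which one class makes up three quarters of the values, like the 3:1 mixes used in this notebook:

$$G = 0.75\,(1 - 0.75) + 0.25\,(1 - 0.25) = 0.1875 + 0.1875 = 0.375$$

For an even 50/50 mix the same sum gives the two-class maximum:

$$G = 0.5\,(1 - 0.5) + 0.5\,(1 - 0.5) = 0.5$$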

In [148]:
def get_impurity(x):
    # Gini impurity: sum over classes of p * (1 - p), where p is the
    # share of each class. A constant variable has p = 1, so the sum is 0.
    counts = Counter(x).most_common()
    n = len(x)
    output = []
    for c in counts:
        p = c[1]/n
        output.append(p * (1-p))
    return sum(output)


    
    

In [150]:
print(' all zeros', get_impurity(np.zeros(100)))
print('half ones and zeros', get_impurity([0,1] * 100))
print('Mostly zeros', get_impurity([0, 0, 0, 1] * 100))
print('Mostly Ones', get_impurity([1, 1, 0, 1] * 100))

 all zeros 0.0
half ones and zeros 0.5
Mostly zeros 0.375
Mostly Ones 0.375


In [151]:
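def get_gini_gain(x, y):
    # Unweighted Gini gain per value of x: impurity of y overall minus
    # impurity of y within the rows where x equals that value.
    x = np.array(x)
    y = np.array(y)
    counts = Counter(x).most_common()
    g = get_impurity(y)  # impurity before the split
    output = {}
    for c in counts:
        g_after = get_impurity(y[x == c[0]])  # impurity inside this branch
        gini_gain = g - g_after
        output.update({c[0]: gini_gain})
    return output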

    
    
    

In [153]:

x =  [1,1,1,1,2,2,2,2]
y = [0,0,0,1, 0,0,1,1]
get_gini_gain(x,y)

{1: 0.09375, 2: -0.03125}
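The per-value gains above do not account for how many rows fall into each branch. A common way to score a split as a whole, used for example in CART-style trees, is to weight each branch's impurity by its share of the rows. The sketch below is one way to do that; `gini` and `weighted_gini_gain` are hypothetical names introduced for this example, not functions from the notebook.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: sum over classes of p * (1 - p)
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def weighted_gini_gain(x, y):
    # Hypothetical helper: parent impurity minus the row-weighted
    # average impurity of the branches defined by each value of x
    n = len(y)
    after = 0.0
    for value in set(x):
        branch = [yi for xi, yi in zip(x, y) if xi == value]
        after += len(branch) / n * gini(branch)
    return gini(y) - after

x = [1, 1, 1, 1, 2, 2, 2, 2]
y = [0, 0, 0, 1, 0, 0, 1, 1]
split_gain = weighted_gini_gain(x, y)
print(split_gain)
```

Under this scoring, a split whose branches have the same class mix as the parent gets a gain of zero, and a tree builder would typically pick the candidate split with the largest gain.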