Links úteis: 
1. http://www.patricklamle.com/Tutorials/Decision%20tree%20python/tuto_decision%20tree.html
2. https://jeremykun.com/tag/decision-trees/

Origem dos dados: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Attribute Information:

1. Number of times pregnant 
2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 

** UPDATE: Until 02/28/2011 this web page indicated that there were no missing values in the dataset. As pointed out by a repository user, this cannot be true: there are zeros in places where they are biologically impossible, such as the blood pressure attribute. It seems very likely that zero values encode missing data. However, since the dataset donors made no such statement we encourage you to use your best judgement and state your assumptions.

Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")

   Class Value  Number of instances
   0            500
   1            268

Brief statistical analysis:

    Attribute number:    Mean:   Standard Deviation:
    1.                     3.8     3.4
    2.                   120.9    32.0
    3.                    69.1    19.4
    4.                    20.5    16.0
    5.                    79.8   115.2
    6.                    32.0     7.9
    7.                     0.5     0.3
    8.                    33.2    11.8


## Download the data

In [None]:
# Downloading the datasets:
import requests as re
resp = re.get('https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names')
with open('data/pima-indians-diabetes.names', 'w') as f:
    f.write(resp.text)
resp = re.get('https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data')
with open('data/pima-indians-diabetes.data', 'w') as f:
    f.write('Times_pregnant,Plasma_glucose,Blood_pressure,Triceps_skin,Serum_insulin,'
            'BMI,Diabetes_pedigree,Age,Class\n')
    f.write(resp.text)

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-notebook')
plt.style.use('ggplot')
import pandas as pd
import numpy as np
import math
from IPython.display import display

## Read the data

In [2]:
df = pd.read_csv('data/pima-indians-diabetes.data', sep=',')
df = df.reset_index().rename(columns={'index': 'id'})
display(df.head())
display(df.describe())

"""
Provavelmente tratar zeros nessas colunas como nulos:
Plasma_glucose, Blood_pressure, Triceps_skin, Serum_insulin, BMI
"""

Unnamed: 0,id,Times_pregnant,Plasma_glucose,Blood_pressure,Triceps_skin,Serum_insulin,BMI,Diabetes_pedigree,Age,Class
0,0,6,148,72,35,0,33.6,0.627,50,1
1,1,1,85,66,29,0,26.6,0.351,31,0
2,2,8,183,64,0,0,23.3,0.672,32,1
3,3,1,89,66,23,94,28.1,0.167,21,0
4,4,0,137,40,35,168,43.1,2.288,33,1


Unnamed: 0,id,Times_pregnant,Plasma_glucose,Blood_pressure,Triceps_skin,Serum_insulin,BMI,Diabetes_pedigree,Age,Class
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,383.5,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,221.846794,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,191.75,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,383.5,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,575.25,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,767.0,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


'\nProvavelmente tratar zeros nessas colunas como nulos:\nPlasma_glucose, Blood_pressure, Triceps_skin, Serum_insulin, BMI\n'

In [None]:
# import the scatter_matrix functionality
from pandas.tools.plotting import scatter_matrix

# define colors list, to be used to plot survived either red (=0) or green (=1)
colors=['red','green']

# make a scatter plot
scatter_matrix(df,figsize=[20,20],marker='o',c=df.Class.apply(lambda x:colors[x]))
#df.hist()
plt.show()
df.sort_values('Blood_pressure').reset_index().plot(kind='area', y='Blood_pressure')
plt.show()

## Select train/test dataset

In [3]:
split_ratio = 0.67
test_len = np.int(df.shape[0]*split_ratio)
train_len = df.shape[0] - test_len
df_train = df.sample(test_len)
print 'Training dataset size: %d' % df_train.shape[0]
df_test = df.sample(train_len)
print 'Test dataset size: %d' % df_test.shape[0]

Training dataset size: 514
Test dataset size: 254


## Splitting the data

1. Function to split data on a column with respect to a value
2. Evaluate entropy of cut

In [4]:
# Parameters for the decision three
max_depth = 4
min_node = 50

def log2(x):
    return math.log(x)/math.log(2)

#@classmethod
def entropyCalc(df, class_var='Class'):
    """
    Calculates Shannon entropy.
    """
    #ck = df[class_var].unique() # possible classes
    total_len = np.float64(df.shape[0])
    ent = 0.0
    for name, ddf in df.groupby(class_var):
        #display(ddf.head())
        p = ddf.shape[0]/total_len
        ent -= p*log2(p)
    return ent

def divideSet(df, col=0, value=0, class_var=None):
    """
    Divide set and calculate entropy gain.
    """
    if np.issubdtype(df.ix[:,col].dtype, np.number): # if is numeric
        mask = (df.ix[:,col] >= value)
    else:
        mask = (df.ix[:,col] == value)
    df1 = df[mask]
    df2 = df[~mask]
    
    # Calculate Shannon's entropy gain
    if class_var:
        total_len = np.float64(df.shape[0])
        df_ent = entropyCalc(df, class_var=class_var)
        df1_ent = entropyCalc(df1, class_var=class_var)
        df2_ent = entropyCalc(df2, class_var=class_var)
        entropy_gain = df_ent - df1.shape[0]/total_len*df1_ent - df2.shape[0]/total_len*df2_ent
    
    return df1, df2, entropy_gain

In [13]:
df10 = df.head(10)
print 'original set:'
print 'entropy:', entropyCalc(df10, class_var='Class')
display(df10)

df1, df2, entropy_gain = divideSet(df10, col=1, value=5, class_var='Class')
print 'Entropy gain with cut:', entropy_gain
print 'child set 1:'
print 'entropy:', entropyCalc(df1, class_var='Class')
display(df1)
print 'child set 2:'
print 'entropy:', entropyCalc(df2, class_var='Class')
display(df2)

original set:
entropy: 0.970950594455


Unnamed: 0,id,Times_pregnant,Plasma_glucose,Blood_pressure,Triceps_skin,Serum_insulin,BMI,Diabetes_pedigree,Age,Class
0,0,6,148,72,35,0,33.6,0.627,50,1
1,1,1,85,66,29,0,26.6,0.351,31,0
2,2,8,183,64,0,0,23.3,0.672,32,1
3,3,1,89,66,23,94,28.1,0.167,21,0
4,4,0,137,40,35,168,43.1,2.288,33,1
5,5,5,116,74,0,0,25.6,0.201,30,0
6,6,3,78,50,32,88,31.0,0.248,26,1
7,7,10,115,0,0,0,35.3,0.134,29,0
8,8,2,197,70,45,543,30.5,0.158,53,1
9,9,8,125,96,0,0,0.0,0.232,54,1


Entropy gain with cut: 0.0
child set 1:
entropy: 0.970950594455


Unnamed: 0,id,Times_pregnant,Plasma_glucose,Blood_pressure,Triceps_skin,Serum_insulin,BMI,Diabetes_pedigree,Age,Class
0,0,6,148,72,35,0,33.6,0.627,50,1
2,2,8,183,64,0,0,23.3,0.672,32,1
5,5,5,116,74,0,0,25.6,0.201,30,0
7,7,10,115,0,0,0,35.3,0.134,29,0
9,9,8,125,96,0,0,0.0,0.232,54,1


child set 2:
entropy: 0.970950594455


Unnamed: 0,id,Times_pregnant,Plasma_glucose,Blood_pressure,Triceps_skin,Serum_insulin,BMI,Diabetes_pedigree,Age,Class
1,1,1,85,66,29,0,26.6,0.351,31,0
3,3,1,89,66,23,94,28.1,0.167,21,0
4,4,0,137,40,35,168,43.1,2.288,33,1
6,6,3,78,50,32,88,31.0,0.248,26,1
8,8,2,197,70,45,543,30.5,0.158,53,1


In [16]:
print 'original set:'
print 'entropy:', entropyCalc(df_train, class_var='Class')

col_idx = 2
col = df_train.columns.tolist()[col_idx]
print 'Column:', col
entropy_gain_min = 0.0
for value in df_train.sort_values(col)[col].unique():
    df1, df2, entropy_gain = divideSet(df_train, col=col_idx, value=value, class_var='Class')
    if entropy_gain > entropy_gain_min:
        min_value = value
        entropy_gain_min = entropy_gain
        
print 'Minimum entropy gain:', entropy_gain_min
print 'For value:', min_value

df1, df2, entropy_gain = divideSet(df_train, col=col_idx, value=min_value, class_var='Class')
print 'Value:', value
print 'child set 1 size:   ', df1.shape[0]
print 'child set 1 entropy:', entropyCalc(df1, class_var='Class')
print 'child set 2 size:   ', df2.shape[0]
print 'child set 2 entropy:', entropyCalc(df2, class_var='Class')

original set:
entropy: 0.925264820451
Column: Plasma_glucose
Minimum entropy gain: 0.140589997358
For value: 124
Value: 199
child set 1 size:    199
child set 1 entropy: 0.969159542257
child set 2 size:    315
child set 2 entropy: 0.668127333844


In [56]:
import pdb

class DecisionTree(object):
    """
    Class Decision Tree.
    """
    import pandas as pd
    import numpy as np
    
    def __init__(self, df, class_var, max_depth, min_node, node_depth=0):
        
        # Parameters for the decision three
        self.max_depth = max_depth
        self.min_node  = min_node
        self.node_depth = node_depth
        
        self.class_var = class_var
        self.df        = df
        self.total_len = np.float64(df.shape[0])
    
    @staticmethod
    def log2(x):
        return math.log(x)/math.log(2)
    
    def build_tree(self):
        self.first_node = DecisionNode(self.df, self.class_var, self.max_depth, self.min_node, self.node_depth)
        self._searchOptCut(self.first_node)
        
    def _searchOptCut(self, node):
        cut_col_idx = None
        max_gain = 0.0
        if (node.node_depth < node.max_depth):
            for col_idx, col in enumerate(self.df.columns.tolist()):
                if col in ['id', self.class_var]:
                    continue
                for value in self.df.sort_values(col)[col].unique():
                    node.divideSet(col=col_idx, value=value)
                    if (max_gain < node.entropy_gain) and (node.child_nodes_max_len >= node.min_node):
                        cut_col_idx = col_idx
                        cut_value = value
                        max_gain = node.entropy_gain
            if cut_col_idx:
                node.divideSet(col=cut_col_idx, value=cut_value)
                self._searchOptCut(node.child_nodes[0])
                self._searchOptCut(node.child_nodes[1])
        else:
            pass
        
class DecisionNode(DecisionTree):
    """
    Class Decision Node.
    """
    def __init__(self, df, class_var, max_depth, min_node, node_depth,
                 col=-1, value=None, result=None, parent_node=None, child_nodes=None):
        super(DecisionNode, self).__init__(df, class_var, max_depth, min_node, node_depth)
        
        self.col     = col
        self.value   = value
        self.result  = result
        self.parent_node = parent_node
        self.child_nodes = child_nodes
        
        self.entropy     = self.entropyCalc()
        
    def entropyCalc(self):
        """
        Calculates Shannon entropy.
        """
        ent = 0.0
        for name, ddf in self.df.groupby(self.class_var):
            p = ddf.shape[0]/self.total_len
            ent -= p*self.log2(p)
        return ent
    
    def divideSet(self, col=0, value=0):
        """
        Divide set, create child nodes and calculate entropy gain.
        """
        df = self.df
        if np.issubdtype(df.ix[:,col].dtype, np.number): # if is numeric
            mask = (df.ix[:,col] >= value)
        else:
            mask = (df.ix[:,col] == value)
        self._createChildNodes(df, mask, col, value)

        # Calculate Shannon's entropy gain
        self._entropyGain()
    
    def printChildNodes(self):
        node = self.child_nodes[0]
        print '%d.1, col: %s, value: %s' % (node.node_depth, node.col, node.value)
        if node.child_nodes:
            node.printChildNodes()
        node = self.child_nodes[1]
        print '%d.2, col: %s, value: %s' % (node.node_depth, node.col, node.value)
        if node.child_nodes:
            node.printChildNodes()
        
    def _entropyGain(self):
        ch_nd_1, ch_nd_2 = self.child_nodes
        self.entropy_gain = self.entropy - (ch_nd_1.total_len*ch_nd_1.entropy + ch_nd_1.total_len*ch_nd_2.entropy)/self.total_len
    
    def _createChildNodes(self, df, mask, col, value):
        ch_nd_1 = DecisionNode(df[mask], self.class_var, self.max_depth, self.min_node, self.node_depth+1,
                               col=col, value=value, result=True, parent_node=self, child_nodes=None)
        ch_nd_2 = DecisionNode(df[~mask], self.class_var, self.max_depth, self.min_node, self.node_depth+1,
                               col=col, value=value, result=False, parent_node=self, child_nodes=None)
        self.child_nodes = (ch_nd_1, ch_nd_2)
        self.child_nodes_max_len = min(ch_nd_1.total_len, ch_nd_2.total_len)
        
        

In [57]:
tree = DecisionTree(df, 'Class', 4, 10)
tree.build_tree()
tree.first_node.printChildNodes()

1.1, col: 2, value: 195
2.1, col: 8, value: 81
2.2, col: 8, value: 81
1.2, col: 2, value: 195
2.1, col: 6, value: 49.7
3.1, col: 8, value: 81
3.2, col: 8, value: 81
2.2, col: 6, value: 49.7
3.1, col: 7, value: 1.461
4.1, col: 8, value: 81
4.2, col: 8, value: 81
3.2, col: 7, value: 1.461
4.1, col: 7, value: 1.268
4.2, col: 7, value: 1.268


In [55]:
tree.first_node.printChildNodes()

1.1, col: 9, value: 1
1.2, col: 9, value: 1


## Make Prediction

We are now ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction.

We can divide this part into the following tasks:

1.Calculate Gaussian Probability Density Function

2.Calculate Class Probabilities

3.Make a Prediction

4.Estimate Accuracy

In [None]:
var_cols = df_summary.stack().columns.tolist()

df_test_sample = df_test
display(df_test_sample.head())
df_test_sample = pd.melt(df_test_sample, id_vars=['id', 'Class'], value_vars=var_cols)
display(df_test_sample.head())

df_aux = df_summary.stack().reset_index()
df_aux.rename(columns={'level_1':'measures'}, inplace=True)
df_metled_summary = pd.melt(df_aux, id_vars=['Class', 'measures'], value_vars=var_cols)
df_metled_summary = df_metled_summary.pivot_table(values='value', index=['Class', 'variable'], columns=['measures'])
df_metled_summary = df_metled_summary.reset_index()
display(df_metled_summary.head())

df_test_sample = df_test_sample.merge(df_metled_summary, on=['variable'], how='left')
display(df_test_sample.head())


In [None]:
import pdb
import math
def calculateProbability(df):
    #pdb.set_trace()
    x = df.value
    mean = df['mean']
    stdev = df['std']
    exponent = np.exp(-(np.square(x-mean)/(2*np.square(stdev))))
    return (1 / (np.sqrt(2*math.pi) * stdev)) * exponent

df_test_sample['prob'] = df_test_sample.apply(calculateProbability, axis=1)

########################
# Avoid underflow
df_test_sample['log_prob'] = df_test_sample.prob.map(np.log10)
########################

display(df_test_sample.head())

In [None]:
df_test_sample = df_test_sample.groupby([
        'id', 'Class_x', 'Class_y'], as_index=False)[['prob', 'log_prob']].agg({'prob': np.prod, 'log_prob': np.sum})
display(df_test_sample.head())

###############
df_test_sample['odds'] = df_test_sample.groupby(['id'])['prob'].transform(sum)
display(df_test_sample.head())
df_test_sample['odds'] = df_test_sample['prob']/df_test_sample['odds']
display(df_test_sample.head())
###############

idx = (df_test_sample.groupby(['id', 'Class_x'])['prob'].transform(max) == df_test_sample['prob'])
df_test_sample = df_test_sample[idx]
df_test_sample['right_prediction'] = (df_test_sample.Class_x - df_test_sample.Class_y == 0)
display(df_test_sample.head())

In [None]:
df_result = df_test_sample.groupby('right_prediction')[['id']].count()
display(df_result)
print 'Success rate: %.1f%%' % ((df_result.ix[True]/df_result.sum()).values[0]*100)