### Decision Trees

Decision Trees are another approach to machine learning. In essence, a Decision Tree repeatedly splits the data
into subsets until each resulting group is uniform with respect to the target.

For example, if I wanted to predict gender from a set of features, I could split the initial dataset
along those features until each resulting group contained members of only one gender.

Classification trees typically do this by scoring each candidate split with an impurity measure such as entropy, and choosing the split that leaves the least entropy. In other words, the algorithm attempts to sort the data into the most organized groups possible with respect to the target: a split resulting in subsets with 50% group A and 50% group B has high entropy, while a split resulting in subsets with 100% group A and 0% group B has low entropy. Regression trees, like the one in the example further down, apply the same idea to a numeric target; scikit-learn's `DecisionTreeRegressor` minimizes the squared error within each subset by default rather than entropy.
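To make the entropy comparison concrete, here is a minimal sketch that computes the weighted entropy of two candidate splits. The helper names `entropy` and `split_entropy` and the toy labels are assumptions made for illustration; they are not part of scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def split_entropy(subsets):
    """Entropy of a split: each subset's entropy, weighted by its share of the rows."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)


# Hypothetical toy splits: each inner list is one subset produced by a split
mixed_split = [["A", "B"], ["A", "B"]]  # every subset is 50% A, 50% B
pure_split = [["A", "A"], ["B", "B"]]   # every subset is 100% one group

print(split_entropy(mixed_split))  # 1.0 (maximum for two classes)
print(split_entropy(pure_split))   # 0.0 (perfectly sorted)
```

A tree builder following this idea would evaluate many candidate splits and keep the one with the lowest weighted entropy.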

While interesting and effective for modeling, Decision Trees carry a high risk of overfitting, so be wary when using them to make predictions outside of your training data.

Below is a quick example of a Decision Tree, using scikit-learn's `DecisionTreeRegressor` to predict the `PE` column of the dataset.

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Load the data
filename = 'data/Folds5x2_pp.xlsx'
df = pd.read_excel(filename)

In [2]:
# Assign every column except 'PE' to X (features) and 'PE' to y (target)

X = df[df.columns.difference(['PE'])]
y = df['PE']

In [3]:
# Split the data 50/50 into training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [4]:
# Fit the scikit-learn Decision Tree regressor to the training data; score() returns R^2 on the testing data

rt = DecisionTreeRegressor(random_state=1)

rt.fit(X_train, y_train)
rt.score(X_test, y_test)

0.9250580726905822
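
The overfitting risk mentioned earlier can be checked by comparing the score on the training data with the score on the testing data. The sketch below uses synthetic data (a noisy sine curve, which is an assumption made for illustration and not the dataset above) and compares an unrestricted tree against one limited with `max_depth=3`; the depth value is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# An unrestricted tree keeps splitting until it has memorized the training noise
deep = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

# Limiting the depth is one simple way to constrain the tree
shallow = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large gap between the training and testing R^2 is a sign that the tree has fit noise rather than signal. The same comparison can be run on the example above by also calling `rt.score(X_train, y_train)`.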