# Tree-based regression

Pros: Fits complex, nonlinear data

Cons: Difficult to interpret results

Works with: Numeric values, nominal values

CART (Classification And Regression Trees) is a well-known, well-documented
tree-building algorithm that makes binary splits and handles continuous variables.
With a simple modification, CART can also handle regression.

Regression trees are similar to the trees used for classification, but their leaves
represent a numeric value rather than a discrete one.

#### General approach to tree-based regression

1. Collect: Any method.

2. Prepare: Numeric values are needed. If you have nominal values, it’s a good idea
to map them into binary values.

3. Analyze: We’ll visualize the data in two-dimensional plots and generate trees as
dictionaries.

4. Train: The majority of the time will be spent building trees with models at the leaf
nodes.

5. Test: We’ll use the R² value with test data to determine the quality of our models.

6. Use: We’ll use our trees to make forecasts. We can do almost anything with
these results.
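
As a rough illustration of the Test step, the R² value can be computed from held-out targets and predictions. This is only a sketch under assumptions: the helper name `rSquared` and the sample numbers are made up, and it uses the coefficient-of-determination definition 1 - SS_res/SS_tot, which may differ from the exact measure used later in this notebook.

```python
import numpy as np

def rSquared(yTrue, yPred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    yTrue = np.asarray(yTrue, dtype=float)
    yPred = np.asarray(yPred, dtype=float)
    ssRes = np.sum((yTrue - yPred) ** 2)          # error left over after the model
    ssTot = np.sum((yTrue - yTrue.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ssRes / ssTot

yTest = [1.0, 2.0, 3.0, 4.0]                      # made-up test targets
perfect = rSquared(yTest, yTest)                  # a model that predicts exactly
baseline = rSquared(yTest, [2.5, 2.5, 2.5, 2.5])  # a model that always predicts the mean
```

A perfect model scores 1 and a model no better than the mean scores 0, so values closer to 1 indicate a better fit.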

#### Pseudo-code for createTree() would look like this:

Find the best feature to split on:
    
    If we can’t split the data, this node becomes a leaf node
    
    Make a binary split of the data
    
    Call createTree() on the right split of the data
    
    Call createTree() on the left split of the data

### CART tree-building

In [110]:
from numpy import *

# Previously we broke the target variable off into its own list; here we keep the data together.
def loadDataSet(fileName):
    """Parse a tab-delimited file of floats; the last column is assumed to be the target value."""
    dataMat = []
    with open(fileName) as fr:          # the file is closed even if parsing fails
        for line in fr:
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # convert every field to float
            dataMat.append(fltLine)
    return dataMat

In [99]:
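# The two subsets are created by array filtering on the given feature and value.
# Note: no trailing [0] on the results, which would keep only the first matching row.
def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:]   # rows where feature > value
    mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:]  # rows where feature <= value
    return mat0,mat1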

There are four
arguments to createTree(): a dataset on which to build the tree and three optional
arguments. The three optional arguments tell the function which type of tree to create.
The argument leafType is the function used to create a leaf. The argument errType is
a function used for measuring the error on the dataset. The last argument, ops, is a tuple
of parameters for creating a tree.

The split is determined by the function chooseBestSplit(). When it finds no
worthwhile split, it returns a leaf model instead. In the case of regression trees,
this model is a constant value; in the case of model trees, this model is a linear equation.
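
To make the two kinds of leaf concrete, here is a minimal sketch. The names `constantLeaf` and `linearLeaf`, the use of plain NumPy arrays, and the sample points are assumptions for illustration, not this notebook's code.

```python
import numpy as np

def constantLeaf(data):
    """Regression-tree style leaf: the mean of the target (last) column."""
    return float(np.mean(data[:, -1]))

def linearLeaf(data):
    """Model-tree style leaf: least-squares weights [intercept, slope, ...]."""
    X = np.column_stack([np.ones(len(data)), data[:, :-1]])  # prepend a bias column
    y = data[:, -1]
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

# Made-up points lying exactly on y = 1 + 2x
data = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
leafValue = constantLeaf(data)    # mean of 1, 3, 5, 7
leafWeights = linearLeaf(data)    # approximately [1, 2]
```

The constant leaf summarizes the subset with a single number, while the linear leaf keeps the trend inside the subset.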

In [100]:
# Note: regLeaf and regErr must already be defined when this def runs,
# because default argument values are evaluated at definition time.
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    # dataSet is assumed to be a NumPy matrix so that array filtering works
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)  # choose the best split
    if feat is None: return val  # a stop condition was hit, so val is the leaf value
    retTree = {}
    retTree['spInd'] = feat      # index of the feature to split on
    retTree['spVal'] = val       # threshold value of the split
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
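
The dictionary returned by createTree() can be walked to produce a forecast. The following is a sketch only: `predict` is a hypothetical helper and the tree is hand-written with made-up numbers. It follows the convention in binSplitDataSet(), where rows whose feature value is greater than the split value go to 'left'.

```python
def predict(tree, row):
    """Walk a createTree()-style dict until a leaf (a non-dict value) is reached."""
    while isinstance(tree, dict):
        branch = 'left' if row[tree['spInd']] > tree['spVal'] else 'right'
        tree = tree[branch]
    return tree

# Hand-written tree with made-up values, in the same layout createTree() builds
tree = {'spInd': 0, 'spVal': 0.5,
        'left': 1.0,                        # feature 0 > 0.5
        'right': {'spInd': 0, 'spVal': 0.2,
                  'left': 0.3,              # 0.2 < feature 0 <= 0.5
                  'right': -0.1}}           # feature 0 <= 0.2

high = predict(tree, [0.9])
mid = predict(tree, [0.4])
low = predict(tree, [0.1])
```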

In [101]:
testMat = mat(eye(4))

In [102]:
mat0,mat1 = binSplitDataSet(testMat,1,0.5)

In [103]:
mat0

matrix([[ 0.,  1.,  0.,  0.]])

The regression tree method breaks up data using a tree with constant values
on the leaf nodes.

In order to construct a tree of piecewise constant values, we need to be able to
measure the consistency of data. We do this with the total squared error: the sum of
the squared deviations of each target value from the mean of its subset.
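
To make the error measure concrete, the sketch below checks on made-up numbers that the variance times the number of samples equals the total squared error, which is the identity regErr() relies on. The plain NumPy arrays and variable names are assumptions for illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 6.0])          # made-up target values

totalSqErr = np.sum((y - y.mean()) ** 2)    # sum of squared deviations from the mean
viaVariance = np.var(y) * len(y)            # population variance times sample count

# Splitting into two more consistent groups lowers the summed error
left, right = y[:3], y[3:]
splitErr = np.var(left) * len(left) + np.var(right) * len(right)
```

Here both `totalSqErr` and `viaVariance` come to 14, while the split groups together have an error of 2, so this split would be an improvement.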

### Building the tree

chooseBestSplit() does only two
things: find the best possible binary split of a dataset and, when no worthwhile split exists, generate a leaf node for that dataset.
    
The leafType argument is a
reference to a function that we use to create the leaf node. The errType argument is a
reference to a function that will be used to calculate the squared deviation from the
mean described earlier. Finally, ops is a tuple of user-defined parameters to help with
tree building.

For every feature:

    For every unique value:
        Split the dataset in two
        Measure the error of the two splits
        If the error is less than bestError ➞ set bestSplit to this split, update bestError
Return bestSplit feature and threshold

In [104]:
def regLeaf(dataSet):#returns the value used for each leaf
    a = dataSet[:,-1]
    return mean(a)

def regErr(dataSet):
    return var(dataSet[:,-1]) * shape(dataSet)[0]

In [214]:
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    tolS = ops[0]; tolN = ops[1]
    #if all the target variables are the same value: quit and return value
    if len(set(dataSet[:,-1].T.tolist()[0])) == 1: #exit cond 1
        return None, leafType(dataSet)
    m,n = shape(dataSet)
    #the choice of the best feature is driven by Reduction in RSS error from mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n-1):
        for splitVal in set(dataSet[:,featIndex].T.tolist()[0]):  # unique values; matrix elements are unhashable, so convert to a list first
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS: 
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    #if the decrease (S-bestS) is less than a threshold don't do the split
    if (S - bestS) < tolS: 
        return None, leafType(dataSet) #exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):  #exit cond 3
        return None, leafType(dataSet)
    return bestIndex,bestValue#returns the best feature to split on
                              #and the value used for that split
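
The pieces can be tried end to end without a data file. The sketch below is a simplified variant under stated assumptions, not the notebook's code: it uses plain NumPy arrays and boolean masks instead of `mat`, the names `binSplit`, `sqErr`, `bestSplit` and `buildTree` are hypothetical, the redundant third exit condition is dropped, and the data is a synthetic noise-free step function rather than ex00.txt.

```python
import numpy as np

def binSplit(data, feature, value):
    """Split rows on one feature: (rows > value, rows <= value)."""
    mask = data[:, feature] > value
    return data[mask], data[~mask]

def sqErr(data):
    """Total squared error of the target (last) column."""
    return np.var(data[:, -1]) * len(data)

def bestSplit(data, tolS=1.0, tolN=4):
    """Return (feature, value) of the best split, or (None, leaf mean)."""
    leaf = float(np.mean(data[:, -1]))
    if len(np.unique(data[:, -1])) == 1:            # all targets identical
        return None, leaf
    baseErr, bestErr, best = sqErr(data), np.inf, None
    for feat in range(data.shape[1] - 1):
        for val in np.unique(data[:, feat]):
            hi, lo = binSplit(data, feat, val)
            if len(hi) < tolN or len(lo) < tolN:    # too few samples on one side
                continue
            err = sqErr(hi) + sqErr(lo)
            if err < bestErr:
                bestErr, best = err, (feat, float(val))
    if best is None or baseErr - bestErr < tolS:    # no valid split, or too little gain
        return None, leaf
    return best

def buildTree(data, tolS=1.0, tolN=4):
    """Recursively build a dict tree in the same layout as createTree()."""
    feat, val = bestSplit(data, tolS, tolN)
    if feat is None:
        return val
    hi, lo = binSplit(data, feat, val)
    return {'spInd': feat, 'spVal': val,
            'left': buildTree(hi, tolS, tolN),
            'right': buildTree(lo, tolS, tolN)}

# Synthetic step function: y = 0 for x <= 0.5 and y = 1 for x > 0.5
x = np.linspace(0.0, 1.0, 20)
y = (x > 0.5).astype(float)
tree = buildTree(np.column_stack([x, y]))
```

On this data the tree has a single split on feature 0 with two constant leaves, 1.0 on the 'left' (greater-than) side and 0.0 on the 'right'. Raising tolS or tolN makes the builder stop earlier and produce a smaller tree.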


In [215]:
myDat = loadDataSet('ex00.txt')

In [216]:
myDat

[[0.036098, 0.155096],
 [0.993349, 1.077553],
 [0.530897, 0.893462],
 [0.712386, 0.564858],
 [0.343554, -0.3717],
 [0.098016, -0.33276],
 [0.691115, 0.834391],
 [0.091358, 0.099935],
 [0.727098, 1.000567],
 [0.951949, 0.945255],
 [0.768596, 0.760219],
 [0.541314, 0.893748],
 [0.146366, 0.034283],
 [0.673195, 0.915077],
 [0.18351, 0.184843],
 [0.339563, 0.206783],
 [0.517921, 1.493586],
 [0.703755, 1.101678],
 [0.008307, 0.069976],
 [0.243909, -0.029467],
 [0.306964, -0.177321],
 [0.036492, 0.408155],
 [0.295511, 0.002882],
 [0.837522, 1.229373],
 [0.202054, -0.087744],
 [0.919384, 1.029889],
 [0.377201, -0.24355],
 [0.814825, 1.095206],
 [0.61127, 0.982036],
 [0.072243, -0.420983],
 [0.41023, 0.331722],
 [0.869077, 1.114825],
 [0.620599, 1.334421],
 [0.101149, 0.068834],
 [0.820802, 1.325907],
 [0.520044, 0.961983],
 [0.48813, -0.097791],
 [0.819823, 0.835264],
 [0.975022, 0.673579],
 [0.953112, 1.06469],
 [0.475976, -0.163707],
 [0.273147, -0.455219],
 [0.804586, 0.924033],
 [0.074795

In [217]:
myMat = mat(myDat)

In [218]:
myMat

matrix([[  3.60980000e-02,   1.55096000e-01],
        [  9.93349000e-01,   1.07755300e+00],
        [  5.30897000e-01,   8.93462000e-01],
        [  7.12386000e-01,   5.64858000e-01],
        [  3.43554000e-01,  -3.71700000e-01],
        [  9.80160000e-02,  -3.32760000e-01],
        [  6.91115000e-01,   8.34391000e-01],
        [  9.13580000e-02,   9.99350000e-02],
        [  7.27098000e-01,   1.00056700e+00],
        [  9.51949000e-01,   9.45255000e-01],
        [  7.68596000e-01,   7.60219000e-01],
        [  5.41314000e-01,   8.93748000e-01],
        [  1.46366000e-01,   3.42830000e-02],
        [  6.73195000e-01,   9.15077000e-01],
        [  1.83510000e-01,   1.84843000e-01],
        [  3.39563000e-01,   2.06783000e-01],
        [  5.17921000e-01,   1.49358600e+00],
        [  7.03755000e-01,   1.10167800e+00],
        [  8.30700000e-03,   6.99760000e-02],
        [  2.43909000e-01,  -2.94670000e-02],
        [  3.06964000e-01,  -1.77321000e-01],
        [  3.64920000e-02,   4.081

### List Comprehension

In [220]:
squares = []
for x in range(10):
    squares.append(x**2)
squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [222]:
squaresL = list(map(lambda x: x**2, range(10)))

In [223]:
squaresL

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [225]:
squaresList = [x**2 for x in range(10)]

In [226]:
squaresList

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

A list comprehension consists of brackets containing an expression followed by a for clause, then zero or more for or if clauses. The result will be a new list resulting from evaluating the expression in the context of the for and if clauses which follow it. For example, this listcomp combines the elements of two lists if they are not equal:

In [227]:
[(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]

[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]

In [228]:
combs = []
for x in [1,2,3]:
    for y in [3,1,4]:
        if x != y:
            combs.append((x, y))
combs

[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
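
The same idea applies to the parsing step in loadDataSet(), where list(map(float, curLine)) can be written as a list comprehension. A small sketch on a made-up tab-delimited line (the variable names are for illustration only):

```python
line = "0.036098\t0.155096\n"            # made-up line in the tab-delimited format
curLine = line.strip().split('\t')

viaMap = list(map(float, curLine))       # style used in loadDataSet()
viaComp = [float(x) for x in curLine]    # equivalent list comprehension

# An if clause can filter while converting, for example to skip empty fields
cleaned = [float(x) for x in "1.5\t\t2.5".split('\t') if x]
```

Both forms build the same list; the comprehension is often easier to extend with a condition, as the last line shows.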