
Regression Tree with the CART algorithm


The idea

The idea of CART is to partition the data into small chunks; each chunk becomes a node. We then continue to partition these chunks into smaller chunks, and so on, until we can assign a value of the target attribute.

CART uses a binary tree to represent the model. Each internal node holds one of the input attributes together with a condition applied to that attribute, which splits the data into smaller chunks (exactly 2 chunks).

(Figure: example of a binary decision tree)

In the figure above, the root node holds the attribute "elev" with the condition "< 7910". Records with elev < 7910 go to the left branch (chunk), and the rest go to the right. On the left branch, we continue to split by applying the rule "if the value of slope < 20, go left, otherwise go right". And so on...

When a new record comes and we want to predict its outcome value, we follow the same path: if elev < 7910, go left... until we reach a leaf node. Its value is the predicted value.
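To make the traversal concrete, here is a minimal Scala sketch of such a tree and its prediction walk (illustrative only: Tree, Leaf, Node and predict are our names, not from an actual implementation):

```scala
// A minimal sketch of a binary regression tree (illustrative names).
sealed trait Tree
case class Leaf(prediction: Double) extends Tree
// `test` returns true when a record should go to the left branch,
// e.g. record => record("elev") < 7910
case class Node(test: Map[String, Double] => Boolean, left: Tree, right: Tree) extends Tree

def predict(tree: Tree, record: Map[String, Double]): Double = tree match {
  case Leaf(value)             => value   // reached a leaf: return its value
  case Node(test, left, right) =>         // follow the branch the test selects
    if (test(record)) predict(left, record) else predict(right, record)
}

// Example: root tests elev < 7910, its left child tests slope < 20.
val tree: Tree = Node(r => r("elev") < 7910,
  Node(r => r("slope") < 20, Leaf(1.5), Leaf(2.3)),
  Leaf(4.0))
predict(tree, Map("elev" -> 7000.0, "slope" -> 25.0)) // => 2.3
```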

The main problem is how to choose the feature of each node and the condition to apply to the chosen feature. Since this condition splits a dataset into 2 sub-datasets, we are looking for the best way to split a node while growing the tree.

We solve these 2 problems with the algorithm below (CART).

General algorithm to build the tree model

Input:

a set of input features (attributes) F = {f1, f2, ..., fn} and the output feature Y

In other words, we will have a dataset like this:

f1v1, f2v1, f3v1,...,fnv1, Y1
f1v2, f2v2, f3v2,...,fnv2, Y2
...
f1vm, f2vm, f3vm,...,fnvm, Ym

Here, f1vi is the i-th value of feature f1, and so on. We are studying regression trees, so the value of Y is continuous.

Algorithm:

// step 1
For each feature fi in F
    For each possible splitting point sj of fi
        calculate I(j) = I(fi, sj)
    End for
    S(fi) = Best_Splitting_Point(fi) = argmax_j {I(j)}
    MaxI(fi) = max_j {I(j)}
End for

// step 2
For each (fi, S(fi))
    calculate G(fi, S(fi))
End for

Chosen_feature = argmax {G}
Splitting_Point (condition) = S(Chosen_feature)

The algorithm means: for every feature fi in F, we apply the function I to fi at every possible splitting point of fi.

But what are the possible splitting points of fi?

If fi is a categorical feature (also called a discrete feature), the possible splitting points are the subsets of V, where V is the set of values of fi. For example, if V = {hot, cold, mild}, the possible splitting points are {hot}, {cold}, {mild}, {hot, cold}, {hot, mild}, {cold, mild}.

If fi is a numerical feature (also called a continuous feature), the possible splitting points are all numbers from the min value to the max value of fi. But we only need to consider a few of them: if the sorted distinct values are v1 < v2 < ... < vn, it is enough to consider the midpoints s1 = (v1 + v2)/2; s2 = (v2 + v3)/2; ...; s(n-1) = (v(n-1) + vn)/2 and choose one. After applying the function I at all possible splitting points of fi, we get the point S(fi) which gives the maximum value of I. S(fi) is the best splitting point for feature fi.
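As a sketch, both kinds of candidates can be enumerated like this in Scala (numericCandidates and categoricalCandidates are illustrative names):

```scala
// Midpoints between consecutive sorted distinct values of a numeric feature.
def numericCandidates(values: Seq[Double]): Seq[Double] = {
  val v = values.distinct.sorted
  v.zip(v.tail).map { case (a, b) => (a + b) / 2 } // s_i = (v_i + v_{i+1}) / 2
}

// Non-empty proper subsets of a categorical feature's values; each subset s
// defines the split "x in s" (left) vs "x not in s" (right).
def categoricalCandidates(values: Set[String]): Seq[Set[String]] =
  values.subsets().filter(s => s.nonEmpty && s.size < values.size).toSeq

numericCandidates(Seq(1.0, 3.0, 7.0))             // => List(2.0, 5.0)
categoricalCandidates(Set("hot", "cold", "mild")) // => the six subsets listed above
```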

So, after getting the best splitting point of each feature, we apply the function G to them and find the feature which maximizes G. This is the feature used for the current node.

For simplicity, the function G may just be G(fi) = MaxI(fi), so the chosen feature is simply the one with the largest MaxI.
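A hedged sketch of the whole two-step selection in Scala, for numeric splitting points only and with G reduced to the simple choice just described (chooseFeature, candidatePoints and splitQuality are our names; I itself is defined in the next section):

```scala
case class BestSplit(feature: String, point: Double, quality: Double)

// `splitQuality` plays the role of I; G is simply "take the largest MaxI".
def chooseFeature(features: Seq[String],
                  candidatePoints: String => Seq[Double],
                  splitQuality: (String, Double) => Double): BestSplit = {
  val bestPerFeature = for (f <- features) yield {
    // step 1: S(fi) = best splitting point of fi, MaxI(fi) = its quality
    val (s, maxI) = candidatePoints(f).map(p => (p, splitQuality(f, p))).maxBy(_._2)
    BestSplit(f, s, maxI)
  }
  bestPerFeature.maxBy(_.quality) // step 2: the feature maximizing G
}
```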

Up to now, we have solved the problems of choosing the feature and choosing the best condition on that feature. But how do we calculate I? What does the function I look like?

Breiman splitting

Original algorithms

Finding the best split for a continuous variable

Input: Nt cases (records), the sum of their Y values (SumTotal), the variable Xv (which represents feature fi)

Output: The best cut-point split on Xv

Sort the cases according to their values in Xv
SumRight = SumTotal; SumLeft = 0
NumRight = Nt; NumLeft = 0
BestTillNow = 0
For all instances i (except the last) do:
    SumLeft = SumLeft + yi; SumRight = SumRight - yi // yi is the value of Y in instance i
    NumLeft = NumLeft + 1; NumRight = NumRight - 1
    If (x(i+1) > x(i)) then
        NewSplitValue = SumLeft*SumLeft/NumLeft + SumRight*SumRight/NumRight
        If (NewSplitValue > BestTillNow) then
            BestCutPoint = [x(i+1) + x(i)]/2
            BestTillNow = NewSplitValue
        EndIf
    EndIf
EndFor
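Note that maximizing SumLeft*SumLeft/NumLeft + SumRight*SumRight/NumRight is the same as minimizing the squared error of the two branches around their means, since that squared error equals Sum(y^2) minus this quantity and Sum(y^2) is constant. A runnable Scala sketch of the algorithm (mirroring the pseudocode; bestContinuousSplit is an illustrative name):

```scala
// Best cut-point on a continuous variable, Breiman-style.
// `cases` are (x, y) pairs; returns (bestCutPoint, bestScore).
def bestContinuousSplit(cases: Seq[(Double, Double)]): (Double, Double) = {
  val sorted   = cases.sortBy(_._1)           // sort by the value of Xv
  val sumTotal = sorted.map(_._2).sum
  val nt       = sorted.size
  var sumLeft      = 0.0
  var bestTillNow  = Double.NegativeInfinity
  var bestCutPoint = 0.0
  for (i <- 0 until nt - 1) {                 // the last case cannot start a split
    sumLeft += sorted(i)._2
    val sumRight = sumTotal - sumLeft
    val numLeft  = i + 1
    val numRight = nt - numLeft
    if (sorted(i + 1)._1 > sorted(i)._1) {    // only cut between distinct x values
      val score = sumLeft * sumLeft / numLeft + sumRight * sumRight / numRight
      if (score > bestTillNow) {
        bestTillNow  = score
        bestCutPoint = (sorted(i)._1 + sorted(i + 1)._1) / 2
      }
    }
  }
  (bestCutPoint, bestTillNow)
}
```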

Finding the best subset split for a discrete variable

This is a very good algorithm: it lets us consider only N-1 subsets out of the 2^N subsets of a set with N values. It works because, after sorting the values by their average Y, the best binary split is always a prefix of that ordering (a classical result from Breiman et al.).

Input: Nt cases, sum of their Y values (SumTotal), the variable Xv

Output: An ordered set of values of Xv and a partition of this set

Obtain the average Y value associated with each value of Xv
Sort the values of Xv according to the average Y associated with each value
SumRight = SumTotal; SumLeft = 0
NumRight = Nt; NumLeft = 0
BestTillNow = 0
For each value b of the ordered set of values (except the last) do:
    YB = sum of the Y values of the cases with Xv = b
    NB = number of cases with Xv = b
    SumLeft = SumLeft + YB; SumRight = SumRight - YB
    NumLeft = NumLeft + NB; NumRight = NumRight - NB
    NewSplitValue = SumLeft*SumLeft/NumLeft + SumRight*SumRight/NumRight
    If (NewSplitValue > BestTillNow) then
        BestTillNow = NewSplitValue
        BestPosition = position of b in the ordered set of values
    EndIf
EndFor
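A Scala sketch of the same procedure (illustrative; `cases` are (value, y) pairs). For example, if the average Y values order the values as Hot < Mild < Cool, only the prefixes {Hot} and {Hot, Mild} are evaluated:

```scala
// Best subset split on a discrete variable: sort values by mean Y,
// then only the N-1 "prefix" splits of that ordering need to be checked.
def bestDiscreteSplit(cases: Seq[(String, Double)]): (Set[String], Double) = {
  val sumTotal = cases.map(_._2).sum
  val nt       = cases.size
  // (value, sum of Y for that value, frequency), ordered by average Y
  val grouped = cases.groupBy(_._1).map { case (v, cs) =>
    (v, cs.map(_._2).sum, cs.size)
  }.toSeq.sortBy { case (_, sumY, freq) => sumY / freq }

  var sumLeft = 0.0; var numLeft = 0
  var bestTillNow  = Double.NegativeInfinity
  var bestPosition = 0
  for (i <- 0 until grouped.size - 1) { // last prefix would leave the right side empty
    val (_, yB, nB) = grouped(i)
    sumLeft += yB; numLeft += nB
    val sumRight = sumTotal - sumLeft
    val numRight = nt - numLeft
    val score = sumLeft * sumLeft / numLeft + sumRight * sumRight / numRight
    if (score > bestTillNow) { bestTillNow = score; bestPosition = i }
  }
  (grouped.take(bestPosition + 1).map(_._1).toSet, bestTillNow)
}
```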

Modified algorithm

As you can see, the basic task of the algorithms above is counting. So, when we work with big data using Spark or Hadoop, it is better to first count the frequency of each value of each feature and then apply these algorithms to the aggregated counts.

Assume that we have Nt cases, aggregated into instances of the form (fi, x, sumY, frequency), where:

fi is the index or the name of the feature
x is a value of that feature
sumY is the sum of all the Y values associated with that value
frequency is the number of times fi takes the value x

For example, with dataset:

Temperature(X),Money(Y)
Cool, 10
Hot, 3
Cool, 8
Mild, 5

We will have 3 aggregated instances:

(Temperature, Cool, 18, 2)
(Temperature, Hot, 3, 1)
(Temperature, Mild, 5, 1)
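As a sketch, this aggregation is a plain group-and-sum (illustrative Scala, reproducing the instances above):

```scala
val data = Seq(("Cool", 10.0), ("Hot", 3.0), ("Cool", 8.0), ("Mild", 5.0))

// group the raw (x, y) pairs and emit (feature, x, sumY, frequency)
val aggregated = data.groupBy(_._1).map { case (x, rows) =>
  ("Temperature", x, rows.map(_._2).sum, rows.size)
}
// => (Temperature,Cool,18.0,2), (Temperature,Hot,3.0,1), (Temperature,Mild,5.0,1)
```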

Input: Nt cases (the whole dataset before counting), the sum of their Y values (SumTotal), m aggregated instances of the form (fi, x, sumY, freq)

Output: the best cut point

Sort the instances according to their x values if the feature is continuous, or according to sumY/freq if it is discrete.
SumRight = SumTotal; SumLeft = 0
NumRight = Nt; NumLeft = 0
BestTillNow = 0
For all instances i (except the last) do:
    SumLeft = SumLeft + sumYi; SumRight = SumRight - sumYi
    NumLeft = NumLeft + freqi; NumRight = NumRight - freqi
    NewSplitValue = SumLeft*SumLeft/NumLeft + SumRight*SumRight/NumRight
    If (NewSplitValue > BestTillNow) then
        BestTillNow = NewSplitValue
        If this is a continuous feature then
            BestCutPoint = [x(i+1) + x(i)]/2
        Else
            BestCutPoint = position of xi in the ordered set of values
        EndIf
    EndIf
EndFor
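A Scala sketch of the modified algorithm over aggregated instances (illustrative; Agg is a hypothetical record type for the (fi, x, sumY, freq) tuples, and for a discrete feature we assume the caller has already sorted by sumY/freq and encoded the positions as x values):

```scala
case class Agg(feature: Int, x: Double, sumY: Double, freq: Int)

// Best split from pre-aggregated instances of one feature.
def bestSplitFromAggregates(instances: Seq[Agg],
                            sumTotal: Double, nt: Int): (Double, Double) = {
  val sorted = instances.sortBy(_.x)
  var sumLeft = 0.0; var numLeft = 0
  var bestTillNow = Double.NegativeInfinity; var bestCutPoint = 0.0
  for (i <- 0 until sorted.size - 1) {  // last instance: right side would be empty
    sumLeft += sorted(i).sumY; numLeft += sorted(i).freq
    val sumRight = sumTotal - sumLeft
    val numRight = nt - numLeft
    val score = sumLeft * sumLeft / numLeft + sumRight * sumRight / numRight
    if (score > bestTillNow) {
      bestTillNow  = score
      bestCutPoint = (sorted(i).x + sorted(i + 1).x) / 2 // midpoint (or boundary position)
    }
  }
  (bestCutPoint, bestTillNow)
}
```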

Example running steps with Spark + Scala

For each value of each feature, emit a tuple of the feature's index, x value, Y value, and 1, denoted F(index, xvalue, yvalue, 1).

For instance, consider the features Temperature (index 0), Humidity (index 1) and Money (index 2) in a dataset with the schema: Temperature, Humidity, Money, Some-Y-Feature.

Assume that, after emitting, we have:

F(0,Hot,8,1)
F(0,Hot,6,1)
F(0,Cool,5,1)
F(1,Normal,10,1)
F(1,High,3,1)
F(1,High,8,1)
F(2,20,3,1)
F(2,10,2,1)
F(2,10,8,1)

Group by (index, xvalue), we get:

( (0,Hot), [F(0,Hot,8,1), F(0,Hot,6,1)] )
( (0,Cool), [F(0,Cool,5,1)] )
( (1,Normal), [F(1,Normal,10,1)] )
( (1,High), [F(1,High,3,1), F(1,High,8,1)] )
( (2,20), [F(2,20,3,1)] )
( (2,10), [F(2,10,2,1), F(2,10,8,1)] )

Sum the Y values and frequencies of the elements in the second component:

F(0,Hot,14,2)
F(0,Cool,5,1)
F(1,Normal,10,1)
F(1,High,11,2)
F(2,20,3,1)
F(2,10,10,2)

Group by index; sort by xvalue for continuous features and by sumY/freq for discrete features:

(0, [F(0,Cool,5,1), F(0,Hot,14,2)])
(1, [F(1,High,11,2), F(1,Normal,10,1)])
(2, [F(2,10,10,2), F(2,20,3,1)])

And then apply the algorithm above to this data.
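As a sketch of these steps with Spark's classic RDD API (illustrative: the aggregate function and the assumption that the last column is Y are ours, not from the wiki):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey

// rows: one record per case; the last column is Y, the others are features
def aggregate(rows: RDD[Array[String]]): RDD[((Int, String), (Double, Int))] =
  rows
    .flatMap { r =>
      val y = r.last.toDouble
      // emit F(index, xvalue, yvalue, 1) for every feature value of the record
      r.init.zipWithIndex.map { case (x, i) => ((i, x), (y, 1)) }
    }
    // sum Y values and frequencies per (index, xvalue)
    .reduceByKey { case ((y1, f1), (y2, f2)) => (y1 + y2, f1 + f2) }

// Afterwards: group by the feature index, sort each group's aggregates
// (by xvalue or by sumY/freq), and run the modified algorithm on each feature.
```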