-
|
Setting mtry=p turns off random feature selection and essentially you are growing a bootstrapped (bagged) tree. This is a less randomized tree. When mtry<p this creates a more random tree and unusual things can happen. To see how this is playing a role in depth of a tree with discrete features, I suggest you use the |
-
|
An x-var is removed from mtry selection (from that point onward in the splitting process) if it is first selected as an mtry candidate and is then detected as being pure. Consider a worst-case scenario where mtry = 1. We select an x-var j, 1 <= j <= p, and split successfully on j at the root node. As a result of the split, an x-var k, k != j, becomes pure in the left daughter. The algorithm has no purity information on k at this point. In the next split attempt, on the left daughter, we randomly select k as our mtry candidate. The node terminates because we cannot split on k, and we have exhausted our mtry attempts. As you increase mtry toward p, you will get deeper trees. |
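
The worst-case mechanism described here can be imitated with a toy tree grower. This is a minimal Python sketch, not randomForestSRC's implementation: the function `grow`, the balanced 0/1 design, and the rule of splitting on the first impure candidate (rather than the best split) are all assumptions made for illustration. The only behaviour it tries to mirror is that a variable is dropped from the candidate pool only after it has been tried and found pure, and that a node terminates when every mtry candidate turns out to be pure.

```python
import itertools
import random


def grow(rows, features, mtry, rng, known_pure=frozenset()):
    """Return the number of terminal nodes of a toy tree grown on `rows`."""
    # Only variables already *detected* as pure upstream are excluded.
    pool = [f for f in features if f not in known_pure]
    candidates = rng.sample(pool, min(mtry, len(pool)))

    detected = set(known_pure)
    split_var = None
    for f in candidates:
        if len({row[f] for row in rows}) > 1:
            split_var = f  # toy rule: split on the first impure candidate
            break
        detected.add(f)  # purity is learned only by trying the variable

    if split_var is None:
        return 1  # every candidate was pure: the node becomes terminal

    left = [row for row in rows if row[split_var] == 0]
    right = [row for row in rows if row[split_var] == 1]
    pure = frozenset(detected)
    return (grow(left, features, mtry, rng, pure)
            + grow(right, features, mtry, rng, pure))


# Assumed data: p = 4 binary predictors, all 16 combinations, 75 copies each.
features = range(4)
rows = list(itertools.product([0, 1], repeat=4)) * 75

print("mtry = 4:", grow(rows, features, 4, random.Random(0)))
print("mtry = 1:", [grow(rows, features, 1, random.Random(s)) for s in range(10)])
```

In this toy setting, mtry = 4 always reaches all 16 cells, because a node stops only when every variable is pure. With mtry = 1 the leaf count depends on the seed and typically falls well short of 16: the variable just split on is pure in both daughters, yet it is still eligible to be drawn as the single candidate.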
-
|
If your goal is to improve MSE by encouraging a deeper tree, then consider the following, which might help. Make sure your 0/1 binary predictors are real-valued (no need for factors). Run the forests on the 10 features and see if that improves things. |
-
|
Thank you both very much. I'm really just trying to understand what the algorithm is doing in a relatively simple data scenario, not trying to improve performance for now. So, sticking with p=4 independent binary predictors, with a sufficiently high sample size (I'm using n=1200), I'm still trying to understand WHY trees will not be fully grown when mtry=1 and nodesize=1. In many cases, I get as few as 2 or 3 terminal nodes. Printing out the trees, as per Ishwaran's suggestion, indeed shows early termination (as does printing out the "leaf.count" values). When mtry=p, the number of terminal nodes = the number of unique factor-level combinations = 16 (2^4) in this case. But when mtry < p, the number of terminal nodes is < 16. The example that Kogalur gives about purity information not being available won't occur in my setting because the predictors are all independent and important (i.e., their effect is far from 0). If a pure variable cannot be selected as a candidate to be split on (other than in the case Kogalur states, which doesn't apply to the setting I described above), then my understanding is that only non-pure variables will be selected, from which a split should then proceed. But apparently not. So then, what else could be leading the trees to not be fully grown? Are there other constraints that I'm failing to appreciate that could cause such early termination? |
-
|
Purity information is only available on those x-variables that have been the subject of mtry attempts. That purity information is known only at the parent node (the node that was split) for those x-variables. Go back and look at the worst-case scenario above. The x-variable j was split at the root node. It was selected as our single mtry attempt. We know that j is impure at the root, because we were able to split on it. We don't know whether j is still impure in one or both daughter nodes. It retains the parent state as a valid mtry x-variable. Impure x-variables routinely turn pure in a daughter after a split. We won't know that until we try to split on them further down the tree. In the worst-case scenario above, we pick k as the single mtry candidate in the daughter. Then we test for purity, and we find that k is pure. So we terminate.

You state that the scenario above does not apply to your case. I think this is incorrect. Say that j is the colour of your eyes, and k is the colour of your hair. If we split people on the colour of their eyes at the root node, it might be that all people with brown hair end up going into the left daughter after the split on the root node. Next, we select hair colour as our split in the left daughter. But we detect that it is now pure. So we terminate. We don't get another chance to split on any other x-variable, because mtry is one. |
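
The eye-colour and hair-colour scenario can be checked on a handful of made-up records. This is only an illustrative Python sketch: the six people and the helper `is_pure` are hypothetical and are not part of randomForestSRC.

```python
# Hypothetical records: every brown-haired person also has brown eyes.
people = [
    {"eyes": "brown", "hair": "brown"},
    {"eyes": "brown", "hair": "brown"},
    {"eyes": "brown", "hair": "brown"},
    {"eyes": "blue", "hair": "blond"},
    {"eyes": "blue", "hair": "blond"},
    {"eyes": "blue", "hair": "red"},
]


def is_pure(rows, var):
    """A variable is pure in a node when it takes a single value there."""
    return len({row[var] for row in rows}) <= 1


# Split the root node on eye colour (the x-variable j).
left = [person for person in people if person["eyes"] == "brown"]
right = [person for person in people if person["eyes"] != "brown"]

print("hair pure at root:         ", is_pure(people, "hair"))
print("hair pure in left daughter:", is_pure(left, "hair"))
print("eyes pure in left daughter:", is_pure(left, "eyes"))
```

Hair colour is impure at the root but pure in the left daughter, and nothing records that fact until hair colour is actually drawn as a candidate there. With mtry = 1, drawing it ends the node.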
-
|
Here's a simple example that shows how to get 2^4=16 terminal nodes with p=4 binary (0/1) predictors. |
-
|
Use Dr. Ishwaran's example, but set n = 100 or less, p = 4, ntree = 1, mtry = 1, nodesize = 1, nsplit = 0, bootstrap = 1. Use
and for repeatability
So
Take a look at o$forest$nativeArray and actually subset the data at the root split, and then at each subsequent split according to $parmID and $contPT. You'll find that you get pure nodes quite fast. This is a bit artificial because we only have binary variables. Here, in this example, if we split and send x1=1 to the left and x1=2 to the right, we know that x1 will be pure on the left and right. But in the general case, when real values are involved, we don't actually know the purity of x1 after the split. So if we try to split on x1 again, we will fail. |
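
A rough back-of-envelope calculation shows how often a node would stop early under this mechanism. The Python sketch below rests on assumptions that do not come from the package documentation: the mtry candidates are drawn uniformly without replacement from all p variables, and d of those variables are pure in the node but have not yet been detected as pure. The helper name `p_early_stop` is made up for illustration.

```python
from math import comb


def p_early_stop(p, mtry, d):
    """Chance that all mtry candidates land among the d undetected pure variables."""
    return comb(d, mtry) / comb(p, mtry)


# p = 4 binary predictors, as in the question.
print(p_early_stop(4, 1, 1))  # mtry = 1, one pure variable (the one just split on)
print(p_early_stop(4, 2, 2))  # mtry = 2, two pure variables
print(p_early_stop(4, 2, 3))  # mtry = 2, three pure variables
print(p_early_stop(4, 4, 3))  # mtry = p, three pure variables
```

Under these assumptions the chance is 0.25 right after the first split when mtry = 1, and it is zero when mtry = p as long as at least one variable is still impure. For mtry = 2 it is nonzero once two or more variables are pure in a node, which is consistent with seeing fewer than 16 terminal nodes for any mtry below 4.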


-
Hi there,
I understand how pure variables are excluded from the pool of possible splitting variables. Could you clarify how this interacts with mtry when mtry < p? Specifically, is there some way in which the number of pure vs. non-pure variables could cause a tree to stop growing prematurely? For example, if I simulate binary predictors with p=4 (all important predictors) such that I have 16 unique factor-level combinations, set mtry=4 and set nodesize=1, I get 16 terminal nodes as expected. But when I set mtry=2 (or anything less than 4), I get <16 terminal nodes. I'm trying to understand what could cause this. Thank you!