DET has only one leaf prior to pruning every time #515

Closed
jeroneandrews opened this Issue Feb 4, 2016 · 4 comments

@jeroneandrews

For some reason, the DET keeps giving me only one leaf node, both before and after pruning, on several datasets I have tried. Here is one of the datasets (attached):

5vs2Train1.txt

It contains 896 observations of the MNIST digit 5.

Any help towards solving this issue will be greatly appreciated.

Thanks in advance.

Output from terminal:
Jerones-MacBook:5vs2 Jerone$ mlpack_det -t 5vs2Train1.txt -T 5vs2Test.txt -e aTrainEst.txt -E aTestEst.txt -M aOutput.txt -v -f 0
[INFO ] Loading '5vs2Train1.txt' as CSV data. Size is 784 x 896.
[INFO ] Performing leave-one-out cross validation.
[INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308.
[INFO ] 1 trees in the sequence; maximum alpha: 0.
[INFO ] Optimal alpha: -1.
[INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.
[INFO ] Saving raw ASCII formatted data to 'aTrainEst.txt'.
[INFO ] Loading '5vs2Test.txt' as CSV data. Size is 784 x 1536.
[INFO ] Saving raw ASCII formatted data to 'aTestEst.txt'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] folds: 0
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_model_file: ""
[INFO ] max_leaf_size: 10
[INFO ] min_leaf_size: 5
[INFO ] output_model_file: aOutput.txt
[INFO ] test_file: 5vs2Test.txt
[INFO ] test_set_estimates_file: aTestEst.txt
[INFO ] training_file: 5vs2Train1.txt
[INFO ] training_set_estimates_file: aTrainEst.txt
[INFO ] verbose: true
[INFO ] version: false
[INFO ] vi_file: ""
[INFO ]
[INFO ] Program timers:
[INFO ] cross_validation: 12.433051s
[INFO ] det_estimation_time: 0.001443s
[INFO ] det_test_set_estimation: 0.002492s
[INFO ] det_training: 12.465670s
[INFO ] loading_data: 1.454027s
[INFO ] saving_data: 0.002742s
[INFO ] total_time: 13.933760s

@jeroneandrews jeroneandrews reopened this Feb 4, 2016
@rcurtin rcurtin added the T: defect label Feb 4, 2016
@rcurtin
Member
rcurtin commented Feb 10, 2016

Diagnosis: the log negative error of a DET is defined as

R(t) = log(|t|^2 / (N^2 V_t)).

Here |t| is the number of points in node t, N is the total number of points, and V_t is the volume of the node. At the first level of this tree, the volume of the node is the entire volume spanned by the data, i.e. V = the width of every dimension multiplied together. But some dimensions have width 0 in this dataset, so V = 0 and R(t) = inf.
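To make the failure concrete, here is a minimal sketch of the formula above in plain Python. It is not mlpack's implementation; the function name `log_negative_error` and the toy data are assumptions made for illustration. It shows that a single zero-width dimension is enough to push the value to infinity.

```python
import math

def log_negative_error(points, total_points):
    """Illustrative only: log(|t|^2 / (N^2 V_t)) for a node holding `points`.

    `points` is a list of equal-length rows (one row per observation);
    `total_points` is N, the size of the full training set.
    """
    n_dims = len(points[0])
    # Width of the node's bounding box in each dimension.
    widths = [
        max(p[d] for p in points) - min(p[d] for p in points)
        for d in range(n_dims)
    ]
    # log(V_t) is the sum of the log-widths; a zero width means log(V_t) = -inf.
    log_volume = sum(math.log(w) if w > 0 else -math.inf for w in widths)
    return 2 * math.log(len(points)) - 2 * math.log(total_points) - log_volume

# Toy data (hypothetical): the third dimension is constant, like an always-zero pixel.
data = [[0.0, 1.0, 0.0], [0.5, 3.0, 0.0], [1.0, 2.0, 0.0]]
print(log_negative_error(data, total_points=len(data)))
```

With the constant third column removed, the same call returns a finite value, which is why the root node is only degenerate on data like this.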

I don't yet know how I want to handle this problem for the mlpack code; I need to review the paper and maybe send Pari an email or something depending on what I can come up with.

A quick workaround is to add a tiny amount of noise to your data points, or to drop any dimensions that have zero range (i.e. where every row has the same value in that dimension, which is 0 in this dataset).
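Both workarounds can be sketched as a preprocessing step before the data is handed to `mlpack_det`. This is a rough illustration in plain Python; the helper names (`constant_dimensions`, `drop_dimensions`, `add_noise`) and the noise scale are assumptions, not part of mlpack.

```python
import random

def constant_dimensions(points):
    """Return the indices of dimensions whose range over `points` is zero."""
    n_dims = len(points[0])
    return [
        d for d in range(n_dims)
        if max(p[d] for p in points) == min(p[d] for p in points)
    ]

def drop_dimensions(points, dims):
    """Return a copy of `points` without the dimensions listed in `dims`."""
    drop = set(dims)
    return [[v for d, v in enumerate(p) if d not in drop] for p in points]

def add_noise(points, scale=1e-6, seed=0):
    """Return a copy of `points` with uniform noise in [-scale, scale] added."""
    rng = random.Random(seed)
    return [[v + rng.uniform(-scale, scale) for v in p] for p in points]

# Toy data (hypothetical): the third dimension is constant.
train = [[0.0, 1.0, 0.0], [0.5, 3.0, 0.0], [1.0, 2.0, 0.0]]
dims = constant_dimensions(train)
print(dims)
print(drop_dimensions(train, dims))
```

If dimensions are dropped, the same index list computed from the training set should be applied to the test set so that both files keep matching columns.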

I'll keep digging and let you know what I think of.

@jeroneandrews

Thanks for your help. I'll try adding a bit of noise as a temporary solution.

@rcurtin
Member
rcurtin commented Feb 18, 2016

I talked with Pari and we decided that the best idea was just to ignore the zero-variance dimensions in the log negative error calculation. This change has been made in 4e069ab and should fix your issue, so there should be no more need to add noise. Let me know if it doesn't and we can reopen the ticket. Thanks for reporting the issue! :)
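The idea of ignoring zero-variance dimensions can be illustrated with the sketch below. It is not the code from 4e069ab; the function name `log_negative_error_skip_zero` and the toy data are hypothetical, and the sketch only shows that skipping zero-width dimensions in the volume keeps the log negative error finite.

```python
import math

def log_negative_error_skip_zero(points, total_points):
    """Illustrative only: log negative error with zero-width dimensions ignored."""
    n_dims = len(points[0])
    log_volume = 0.0
    for d in range(n_dims):
        width = max(p[d] for p in points) - min(p[d] for p in points)
        # A zero-width dimension contributes nothing to the volume term.
        if width > 0:
            log_volume += math.log(width)
    return 2 * math.log(len(points)) - 2 * math.log(total_points) - log_volume

# Toy data (hypothetical): the third dimension is constant.
data = [[0.0, 1.0, 0.0], [0.5, 3.0, 0.0], [1.0, 2.0, 0.0]]
print(log_negative_error_skip_zero(data, total_points=len(data)))
```

On this toy data the result is the same as if the constant column had been dropped beforehand, so the tree can compare candidate splits again.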

@rcurtin rcurtin closed this Feb 18, 2016
@rcurtin rcurtin added the R: fixed label Feb 18, 2016
@rcurtin rcurtin added this to the mlpack 2.0.2 milestone Feb 18, 2016
@jaelim
jaelim commented Apr 13, 2016

@jeroneandrews Hi, could you please provide your other test file?
