For some reason the DET keeps giving me only one leaf node, both before and after pruning, for several datasets I have tried. Here is one of the datasets (attached):
It contains 896 observations of the MNIST digit 5.
Any help towards solving this issue will be greatly appreciated.
Thanks in advance.
Output from terminal:
Jerones-MacBook:5vs2 Jerone$ mlpack_det -t 5vs2Train1.txt -T 5vs2Test.txt -e aTrainEst.txt -E aTestEst.txt -M aOutput.txt -v -f 0
[INFO ] Loading '5vs2Train1.txt' as CSV data. Size is 784 x 896.
[INFO ] Performing leave-one-out cross validation.
[INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308.
[INFO ] 1 trees in the sequence; maximum alpha: 0.
[INFO ] Optimal alpha: -1.
[INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.
[INFO ] Saving raw ASCII formatted data to 'aTrainEst.txt'.
[INFO ] Loading '5vs2Test.txt' as CSV data. Size is 784 x 1536.
[INFO ] Saving raw ASCII formatted data to 'aTestEst.txt'.
[INFO ] Execution parameters:
[INFO ] folds: 0
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_model_file: ""
[INFO ] max_leaf_size: 10
[INFO ] min_leaf_size: 5
[INFO ] output_model_file: aOutput.txt
[INFO ] test_file: 5vs2Test.txt
[INFO ] test_set_estimates_file: aTestEst.txt
[INFO ] training_file: 5vs2Train1.txt
[INFO ] training_set_estimates_file: aTrainEst.txt
[INFO ] verbose: true
[INFO ] version: false
[INFO ] vi_file: ""
[INFO ] Program timers:
[INFO ] cross_validation: 12.433051s
[INFO ] det_estimation_time: 0.001443s
[INFO ] det_test_set_estimation: 0.002492s
[INFO ] det_training: 12.465670s
[INFO ] loading_data: 1.454027s
[INFO ] saving_data: 0.002742s
[INFO ] total_time: 13.933760s
Diagnosis: the log negative error of a DET is defined as
R(t) = log(|t|^2 / (N^2 V_t)).
Here |t| is the number of points in node t, N is the total number of points, and V_t is the volume of the node. At the first level of this tree, the volume of the node is the entire volume spanned by the data, i.e. V is the width of every dimension multiplied together. But some dimensions have width 0 in this dataset, so V = 0 and R(t) = inf.
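To make the diagnosis concrete, here is a minimal Python sketch of the formula above applied to a root node. It is not mlpack's implementation; the function name `log_negative_error` and the toy data are made up for illustration, and points are stored one per row.

```python
import math

def log_negative_error(points, n_total):
    """Illustrative only: log(|t|^2 / (N^2 V_t)) for a node holding `points`."""
    dims = len(points[0])
    # Width of the node in each dimension (max minus min over its points).
    widths = [max(p[d] for p in points) - min(p[d] for p in points)
              for d in range(dims)]
    if any(w == 0.0 for w in widths):
        # V_t = 0, so the ratio inside the log is unbounded.
        return math.inf
    log_volume = sum(math.log(w) for w in widths)
    return 2 * math.log(len(points)) - 2 * math.log(n_total) - log_volume

# Second dimension is constant, so the root node has zero volume.
flat = log_negative_error([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]], 3)
# Same first dimension, but the second one now varies.
ok = log_negative_error([[0.0, 1.0], [2.0, 3.0], [4.0, 2.0]], 3)
print(flat)  # inf
print(ok)
```

With MNIST digits, border pixels that are always 0 play the role of the constant second dimension here.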
I don't yet know how I want to handle this problem for the mlpack code; I need to review the paper and maybe send Pari an email or something depending on what I can come up with.
A quick solution is to add tiny bits of noise to your data points, or to drop any dimensions that have zero range (i.e. where every observation has the same value in that dimension, such as pixels that are 0 in all of the rows).
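A rough sketch of the second workaround in plain Python, assuming the data has already been read into lists of rows (one observation per row, as in the CSV files). The helper names `constant_columns` and `drop_columns` are hypothetical, and reading and writing the files is left out. The columns to drop are decided on the training set and then removed from the test set too, so both keep the same dimensions.

```python
def constant_columns(rows):
    """Return indices of columns whose value is identical in every row."""
    first = rows[0]
    return [j for j in range(len(first))
            if all(r[j] == first[j] for r in rows)]

def drop_columns(rows, drop):
    """Return a copy of `rows` without the column indices listed in `drop`."""
    drop = set(drop)
    return [[v for j, v in enumerate(r) if j not in drop] for r in rows]

# Toy data: columns 0 and 2 never change in the training set.
train = [[0.0, 0.3, 5.0], [0.0, 0.7, 5.0], [0.0, 0.1, 5.0]]
test = [[0.0, 0.9, 5.0], [1.0, 0.2, 5.0]]

drop = constant_columns(train)            # decided on the training data only
train_reduced = drop_columns(train, drop)
test_reduced = drop_columns(test, drop)   # same columns removed from the test data
```

The reduced files can then be passed to mlpack_det with the same -t and -T options as before.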
I'll keep digging and let you know what I think of.
Thanks for your help. I'll try adding a bit of noise as a temporary solution.
I talked with Pari and we decided that the best idea was just to ignore the zero-variance dimensions in the log negative error calculation. This change has been made in 4e069ab and should fix your issue, so there should be no more need to add noise. Let me know if it doesn't and we can reopen the ticket. Thanks for reporting the issue! :)
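For anyone reading later, the idea can be sketched in Python roughly as follows. This is only an illustration of skipping zero-width dimensions when computing the volume, not the actual code from the commit; the function name and toy data are invented.

```python
import math

def log_negative_error_skip_flat(points, n_total):
    """Illustrative only: log(|t|^2 / (N^2 V_t)), ignoring zero-width dimensions."""
    dims = len(points[0])
    log_volume = 0.0
    for d in range(dims):
        column = [p[d] for p in points]
        width = max(column) - min(column)
        if width > 0.0:
            # Only dimensions that actually vary contribute to the volume.
            log_volume += math.log(width)
    return 2 * math.log(len(points)) - 2 * math.log(n_total) - log_volume

# The second dimension is constant, but the result is now finite.
value = log_negative_error_skip_flat([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]], 3)
print(value)
```

In this toy case only the first dimension (width 4) counts, so the result is log(1/4) rather than inf.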
@jeroneandrews Hi, could you please provide your other test file?