DecisionTree and RegressionTree use too much memory space #70
Good point. I will look into how to reduce the memory footprint. I don't quite get the bag[N] part. Can you please give more details? Thanks! BTW, how large are your data and RAM in the OOM example?
Instead of the samples[N] count array, use a bag[N] of sampled instance indices, as described in the issue text. I'm using the Kaggle rossmann dataset for evaluation. The number of training instances is about 1 million.
By default random forest trains fully grown trees. Before we get a solution that reduces the memory usage, you can use maxNodes to control the tree size as a workaround. This actually often gives better accuracy on large data too.
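A minimal sketch of that workaround (the four-argument constructor mirrors the test snippet later in this thread; the cap of 1000 nodes is an arbitrary illustration):

```java
// Cap the tree size instead of growing the tree fully.
int maxNodes = 1000;
RegressionTree tree = new RegressionTree(attributes, x, y, maxNodes);
```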
Sorry, I don't really get your bag example. What's the meaning of bag index and value? Thanks!
The original code draws the bootstrap sample by incrementing per-instance counts in samples. Replace the sampling so that it fills a bag with the sampled instance indices instead; a sketch follows.
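A minimal sketch of the two schemes, assuming a bootstrap sample over N training instances (illustrative code; the names follow the issue description rather than Smile's exact source):

```java
import java.util.Random;

Random random = new Random();
int N = 1_000_000;  // number of training instances

// Original scheme: a count array over all N instances.
// samples[i] is the multiplicity of instance i in the bootstrap sample;
// each recursive split() carries O(N) arrays (trueSamples/falseSamples).
int[] samples = new int[N];
for (int i = 0; i < N; i++) {
    samples[random.nextInt(N)]++;
}

// Bag scheme: store the sampled instance indices directly.
// bag[j] is an instance index; duplicates encode multiplicity
// (conceptually samples[bag[j]]++). A child node keeps only the part
// of the bag routed to its side of the split, so bag.length shrinks
// as the tree depth grows.
int[] bag = new int[N];
for (int j = 0; j < N; j++) {
    bag[j] = random.nextInt(N);
}
```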
The depth of a balanced binary tree of N nodes is about log2(N), and the number of nodes in a binary tree of depth D is at most 2^D - 1. For example, maxNodes = 1023 corresponds to a balanced tree of depth 10.
How about you try the bag idea and see the memory and speed effect? If it is good, please do a pull request. Thanks a lot!
It worked fine using the bag approach. There might be other, better approaches though. You can find my approach in
Thanks! Is the accuracy the same? How much was the memory usage reduced?
I haven't run all of smile's unit tests, but the results were the same for the Iris and Weather unit tests.
Thanks! How's speed?
It might consume 110% or so of the original construction time, as it involves some additional computation. But I think that is negligible, because memory is a more valuable resource than CPU when constructing a decision tree (suppose building decision trees in parallel for RandomForest). Getting a result competitive with scikit-learn's RF for the Kaggle rossmann dataset requires depth 30. Logloss was 0.29751 for depth=10, 0.13112 for depth=30, and 0.12992 for infinite depth. Partial overfitting plus ensembling a bunch of models can sometimes be a good solution. A favorite cynical quote about Kaggle:
Cool. Can you please make a pull request so that we can get your changes? Thanks a lot!
Training a RandomForest on the San Francisco Crime dataset (Kaggle) was hitting an OOM, plus a long run time. I decided to change the GC to G1 with the configuration -XX:+UseG1GC, and that solved the problems.
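For reference, the change amounts to adding that flag to the JVM invocation, e.g. (the heap size, jar, and main class here are placeholders):

```
java -XX:+UseG1GC -Xmx8g -cp app.jar com.example.TrainRandomForest
```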
Thanks Douglas! How's the result compared to other machine learning libraries in terms of speed, memory usage, accuracy, etc.?
I haven't tested this dataset with other libraries yet.
I found that smile's RegressionTree construction is very slow compared to scikit-learn on the rossmann dataset. I guess the exhaustive numeric split search is the culprit.
Did scikit-learn discretize numeric values first? Thanks
No, I treated all variables as numeric attributes, and just converted/identified categorical variables to numbers as follows:
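A minimal sketch of that kind of conversion, assuming simple label encoding (this is an illustration, not the exact mapping code used):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative label encoder: assign each distinct category string a
// small integer code and treat the coded column as a numeric feature.
class LabelEncoder {
    private final Map<String, Integer> codes = new HashMap<>();

    double encode(String category) {
        Integer code = codes.get(category);
        if (code == null) {
            code = codes.size();   // next unused code
            codes.put(category, code);
        }
        return code;
    }
}
```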
Uploaded the training data at https://dl.dropboxusercontent.com/u/13123103/rossmann_train.tsv.bz2

Test code snippet:

```java
@Test
public void testRossmann() throws FileNotFoundException, IOException, ParseException {
    DelimitedTextParser parser = new DelimitedTextParser();
    parser.setDelimiter("\t");
    parser.setResponseIndex(new NumericAttribute("sales"), 17);
    parser.setColumnNames(true);
    AttributeDataset dataset = parser.parse(new File("/Users/myui/Downloads/rossmann_train.tsv"));

    double[][] features = new double[dataset.size()][];
    dataset.toArray(features);
    double[] label = new double[dataset.size()];
    dataset.toArray(label);

    int maxLeafs = Integer.MAX_VALUE;
    StopWatch stopwatch = new StopWatch();
    RegressionTree tree = new RegressionTree(null, features, label, maxLeafs);
    System.out.println(stopwatch);
}
```
As I guessed, the numeric split search consumes most of the CPU time; I was a little surprised by how costly it is. Among the memory-consuming objects, I found that the `order` matrix is the largest for the rossmann dataset.
We check every possible split for a numeric value (up to n possible splits). For large data, that is a big cost. To speed up, we could check only a small number of splits, say at every 1% step. This should speed things up about 100 times. The error may be larger though.
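A sketch of that idea (illustrative, assuming `order` holds the instance indices sorted by the feature's value, as described in the next comment):

```java
// Sketch: evaluate ~100 quantile cut points ("every 1% step") instead
// of all n-1 adjacent-value splits for one numeric feature j.
// x: feature matrix; order: instance indices sorted by x[.][j].
static void coarseSplitSearch(double[][] x, int[] order, int j) {
    int n = order.length;
    int step = Math.max(1, n / 100);
    for (int rank = step; rank < n; rank += step) {
        double threshold = x[order[rank]][j];
        // ... evaluate the impurity/SSE decrease of the split x[i][j] <= threshold ...
    }
}
```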
The variable `order` is shared by all trees in the forest. It should not cause memory problems unless I made a bug somewhere. The matrix itself is large though: as large as the input data if all variables are numeric.
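A sketch of what such a shared structure looks like (the construction below is illustrative, not Smile's exact code):

```java
import java.util.Comparator;
import java.util.stream.IntStream;

// order[j][r] = index of the instance with the r-th smallest value of
// feature j. For n instances and p numeric features this is an
// int[p][n], i.e. the same shape (and size) as the input matrix.
static int[][] sortOrder(double[][] x) {
    int n = x.length, p = x[0].length;
    int[][] order = new int[p][];
    for (int j = 0; j < p; j++) {
        final int col = j;
        order[j] = IntStream.range(0, n)
                .boxed()
                .sorted(Comparator.comparingDouble(i -> x[i][col]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
    return order;
}
```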
Btw, why do you treat all variables as numerical? It doesn't make sense to me.
@haifengl Just to compare the accuracy of Smile to scikit-learn, and to investigate CPU bottlenecks of Smile when treating all variables as quantitative. scikit-learn cannot handle categorical variables (i.e., a one-hot encoder is required for that), but its accuracy and training speed were excellent. I'll compare again using proper attribute types in Smile. The test result will follow.
Agreed. However, it would be better to reduce memory usage where the number of variables is large.
Thanks! Can you also contribute your bag implementation back to smile? Thanks again!
@haifengl Still working on reducing the CPU bottlenecks in node splitting. Will do after fixing the implementation!
Sounds like a good idea. I'll test that where the number of training instances is large.
Cool. Thank you!
Hi @myui, can you please make a pull request for your improvements on memory usage? Thank you very much!
@myui, how's your test result with checking fewer splits for numeric values? Thanks!
I haven't tested the sampling scheme for splitting numeric values yet. I've been busy with other daily work :-(
Thanks! I am also working on my bag implementation. Will let you know the performance.
I found that the `DecisionTree` and `RegressionTree` implementations use too much heap space when constructing trees as the `split` method call depth grows, because `trueSamples` and `falseSamples` are not released for GC until each `split` call returns.

https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/classification/DecisionTree.java#L655
https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L599

Other RandomForest implementations use `bag[N]`, in which `bag[i]` holds an instance index (conceptually `sample[bag[i]]++`), instead of `samples[N]`, and `bag.length` shrinks as the tree depth grows. In smile, `N` is the number of training examples, and `split` consumes `O(2N * depth)` memory because each recursion level holds two length-`N` arrays.

Releasing `samples` and `trueSamples` as soon as possible helps a little: https://github.com/myui/hivemall/pull/259/files

Here is a condition under which OOM happened due to recursive node splits.

Using `SparseIntArray` for `samples` is an option, but it would consume more time for tree construction. @haifengl What do you think?