
DecisionTree and RegressionTree uses too much memory space #70

Closed
myui opened this issue Mar 2, 2016 · 32 comments

@myui
Contributor

myui commented Mar 2, 2016

I found that the DecisionTree and RegressionTree implementations use too much heap space when constructing trees as the split method call depth grows, because trueSamples and falseSamples are not released for GC until each split method call returns.

https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/classification/DecisionTree.java#L655
https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L599

Other RandomForest implementations use bag[N], in which bag[i] represents sample[bag[i]]++, instead of samples[N], and bag.length shrinks as the tree depth grows. In smile, N is the number of training examples, so split consumes O(2N * depth) memory.
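For illustration, the allocation pattern being described looks roughly like this (a minimal sketch with illustrative names, not the actual smile code):

    // Each recursive split allocates two arrays of length N, and they stay reachable
    // until both child calls return, so a path of depth D pins roughly 2N * D ints.
    void split(int[] samples) {
        int n = samples.length;            // always the full training size N
        int[] trueSamples = new int[n];    // sample counts routed to the true branch
        int[] falseSamples = new int[n];   // sample counts routed to the false branch
        // ... partition samples into trueSamples / falseSamples ...
        split(trueSamples);                // samples, trueSamples and falseSamples are
        split(falseSamples);               // all still live during these calls
    }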

Releasing samples and trueSamples as soon as possible helps a little.
https://github.com/myui/hivemall/pull/259/files

Here is a stack trace from an OOM that happened due to recursive node splits.

    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:705)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree.<init>(RegressionTree.java:799)

Using a SparseIntArray for samples is an option, but it would make tree construction slower. @haifengl What do you think?

@haifengl
Owner

haifengl commented Mar 2, 2016

Good point. I will look into how to reduce the memory footprint. I don't quite get the bag[N] part. Can you please give more details? Thanks!

BTW, how large are your data and RAM in the OOM example?

@myui
Contributor Author

myui commented Mar 3, 2016

Instead of samples[0]=1, samples[1]=2, samples[2]=0, use bag[0]=1, bag[1]=0, bag[2]=1.
As the depth grows, bag.length can be reduced because a bag is divided into two bags at each node split.
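A toy illustration of the two representations (illustrative only, not code from smile or hivemall):

    // samples[i] = how many times training instance i was drawn; length is always N.
    int[] samples = {1, 2, 0};   // instance 0 once, instance 1 twice, instance 2 never
    // bag[j] = index of the j-th drawn instance; the same multiset, but this array
    // can shrink because it is partitioned into two smaller bags at each node split.
    int[] bag = {1, 0, 1};       // instance 1, instance 0, instance 1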

-Xmx is restricted to 2 GB. But I guess the current implementation needs more than 10 GB, as it consumes 2N * nodes * 4 bytes.

I'm using the Kaggle Rossmann dataset for evaluation. The number of training instances is about 1 million.
https://www.kaggle.com/c/rossmann-store-sales/data

@haifengl
Owner

haifengl commented Mar 3, 2016

By default, random forest trains fully grown trees. Until we have a solution to reduce the memory usage, you can use maxNodes to control the tree size as a workaround. This actually often gives better accuracy on large data as well.

@haifengl
Owner

haifengl commented Mar 3, 2016

Sorry, I don't really get your bag example. What is the meaning of the bag index and value? Thanks!

@myui
Contributor Author

myui commented Mar 3, 2016

Replace the sampling as follows:

            // Training samples draw with replacement.
            final int[] bags = new int[N];
            final BitSet sampled = new BitSet(N);
            for (int i = 0; i < N; i++) {
                int index = rnd1.nextInt(N);
                bags[i] = index;
                sampled.set(index);
            }
            ...
            ...
            // out-of-bag prediction
            for (int i = sampled.nextClearBit(0); i < N; i = sampled.nextClearBit(i + 1)) {
                 int p = tree.predict(_x[i]);
                 ...
            }

The original code is

Random random = new Random(Thread.currentThread().getId() * System.currentTimeMillis());

@myui
Contributor Author

myui commented Mar 3, 2016

The depth of a balanced binary tree with N nodes is log_2(N+1) − 1, which is about 5.658 for N = 100. The maxNodes of smile is actually maxLeafNodes, and thus N should be 127 for depth 6 (though a DecisionTree is unbalanced, so this is not exact). Depth 6 is often too small when the number of features is large, and I would like to target depth 10.

A full binary tree of depth D has 2^D − 1 nodes, and the current implementation would consume 2N * nodes * 4 bytes ≈ 8 GiB where D = 10, nodes = 1023 (512 leaf nodes), and N = 1,000,000.
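Worked out explicitly with those numbers:

    2N * nodes * 4 bytes = 2 * 1,000,000 * 1,023 * 4 bytes
                         ≈ 8.2 * 10^9 bytes
                         ≈ 7.6 GiB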

@haifengl
Owner

haifengl commented Mar 3, 2016

How about you try the bag idea and see its effect on memory and speed? If it is good, please make a pull request. Thanks a lot!

@myui
Contributor Author

myui commented Mar 3, 2016

The bag approach worked fine, though there may be other, better approaches.

You can find my approach in
https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L632

@haifengl
Owner

haifengl commented Mar 3, 2016

Thanks! Is the accuracy the same? How much was the memory usage reduced?

@myui
Contributor Author

myui commented Mar 4, 2016

I haven't run all of smile's unit tests, but the results were the same for the Iris and Weather unit tests.
It can run depth=10 for the Rossmann dataset, while the original smile implementation caused OOM at depth=5 with -Xmx2g.

@haifengl
Owner

haifengl commented Mar 4, 2016

Thanks! How's the speed?

@myui
Contributor Author

myui commented Mar 7, 2016

It might consume 110% or so of the original construction time, as it involves some additional computation, e.g. in
https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L568

But I think that is negligible because memory is a more valuable resource than CPU for constructing a decision tree (suppose we build decision trees in parallel for RandomForest).

To get a result competitive with scikit-learn's RF on the Kaggle Rossmann dataset, depth 30 is required. Logloss was 0.29751 for depth=10, 0.13112 for depth=30, and 0.12992 for unlimited depth.

Partial overfitting plus a bunch of model ensembling can sometimes be a good solution. A favorite cynical quote about Kaggle: "Overfitting? Nay, it's called winning."

@haifengl
Owner

haifengl commented Mar 7, 2016

Cool. Can you please make a pull request so that we can get your changes? Thanks a lot!

@douglasArantes

Training a RandomForest on the San Francisco Crime dataset (Kaggle), I was getting an OOM plus a long run time. I decided to switch the GC to G1 with the flag -XX:+UseG1GC, and that solved the problems.

@haifengl
Owner

haifengl commented Mar 7, 2016

Thanks Douglas! How does the result compare to other machine learning libraries in terms of speed, memory usage, accuracy, etc.?

@douglasArantes

I haven't tested this dataset with other libraries yet.

@myui
Contributor Author

myui commented Mar 9, 2016

I found that smile's RegressionTree construction is very slow compared to scikit-learn on the Rossmann dataset.

I guess the findBestSplit call for Attribute.Type.NUMERIC is the main factor, though it would be better to use a profiler for a detailed investigation.
https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L492

@haifengl
Owner

haifengl commented Mar 9, 2016

Did scikit-learn discretize numeric values first? Thanks

@myui
Contributor Author

myui commented Mar 9, 2016

No, I treated all variables as numeric attributes. I just converted/identified categorical variables as numbers, as follows:

stateholiday    store   promo   dayofweek       schoolholiday   promo2sinceweek competitionopensinceyear        assortment      promo2sinceyear competitiondistance     promointerval   promo2  storetype       competitionopensincemonth       year    month   day     target_sales
0       660     0       2       0       0       0       0       0       1200    0       1       0       0       0       0       0       8.085178748074537
0       669     0       2       0       1       1       0       1       17080   0       1       1       1       0       0       0       7.88570539124302
0       760     0       2       0       0       2       0       0       560     0       0       0       2       0       0       0       8.696008608880904
0       1018    0       3       0       0       1       1       0       140     0       0       2       3       0       0       1       8.931419805192975
0       15      1       4       0       2       3       1       2       4110    0       1       1       4       0       0       2       8.725994381014571
0       685     0       6       0       3       4       0       3       650     0       1       0       0       0       1       3       8.37816098272068
0       686     0       6       0       0       5       0       0       20050   0       0       0       5       0       1       3       7.661997558901893
0       687     0       6       0       0       0       1       0       2770    0       0       1       0       0       1       3       8.857799727175905

Uploaded the training data at https://dl.dropboxusercontent.com/u/13123103/rossmann_train.tsv.bz2

Test code snippet:

    @Test
    public void testRossmann() throws FileNotFoundException, IOException, ParseException {
        DelimitedTextParser parser = new DelimitedTextParser();
        parser.setDelimiter("\t");
        parser.setResponseIndex(new NumericAttribute("sales"), 17);
        parser.setColumnNames(true);
        AttributeDataset dataset = parser.parse(new File("/Users/myui/Downloads/rossmann_train.tsv"));
        double[][] features = new double[dataset.size()][];
        dataset.toArray(features);
        double[] label = new double[dataset.size()];
        dataset.toArray(label);

        int maxLeafs = Integer.MAX_VALUE;
        StopWatch stopwatch = new StopWatch();
        RegressionTree tree = new RegressionTree(null, features, label, maxLeafs);        
        System.out.println(stopwatch);
    }

@myui
Contributor Author

myui commented Mar 9, 2016

As I guessed, findBestSplit() and findBestSplit(int n, double sum, int j) consume most of the time when profiling the above test.
https://gyazo.com/584f1cafe5e1d76057256ed9bfed55c6

I was a little surprised that findBestSplit() itself was consuming time, while 52% of it was spent in findBestSplit(int n, double sum, int j).

As for memory-consuming objects, I found that the int[][] order array is the dominant one after applying my bag approach.
https://gyazo.com/1e57183f244f0139d35024c6e22a598b

For the Rossmann dataset, an int[17][N] is allocated for the 17 explanatory variables, where N is the number of training instances. This scheme should be revised.
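For reference, a hypothetical sketch of the kind of per-attribute sort index being discussed, assuming order[j] holds the row indices sorted by the value of attribute j (my reading of the discussion, not smile's actual code):

    import java.util.stream.IntStream;

    // Hypothetical sketch: build one sorted index per attribute so that candidate
    // numeric splits can be scanned in value order. For p attributes and n rows this
    // needs p * n ints, i.e. int[17][N] for the Rossmann data mentioned above.
    static int[][] buildOrder(final double[][] x) {
        int n = x.length, p = x[0].length;
        int[][] order = new int[p][];
        for (int j = 0; j < p; j++) {
            final int col = j;
            order[j] = IntStream.range(0, n)
                    .boxed()
                    .sorted((a, b) -> Double.compare(x[a][col], x[b][col]))
                    .mapToInt(Integer::intValue)
                    .toArray();
        }
        return order;
    }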

@haifengl
Owner

We check every possible split for a numeric value (up to n possible splits). For large data, that is a big cost. To speed things up, we could check only a small number of splits, say at every 1% step. This should give roughly a 100x speedup, though the error may be larger.
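A rough sketch of that idea (illustrative only, not the smile implementation; the score argument below is a placeholder for whatever impurity or variance criterion the tree uses):

    import java.util.function.DoubleUnaryOperator;

    // Check candidate thresholds at roughly every 1% quantile of the sorted values
    // instead of at every distinct value, trading some split quality for speed.
    static double bestSplit(double[] sortedValues, DoubleUnaryOperator score) {
        int n = sortedValues.length;
        int step = Math.max(1, n / 100);          // about 100 candidate thresholds
        double bestThreshold = Double.NaN;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = step; i < n; i += step) {
            double threshold = (sortedValues[i - 1] + sortedValues[i]) / 2.0;
            double s = score.applyAsDouble(threshold);
            if (s > bestScore) {
                bestScore = s;
                bestThreshold = threshold;
            }
        }
        return bestThreshold;
    }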

@haifengl
Owner

The variable order is shared by all trees in the forest. It should not cause memory problems unless I made a bug. The matrix itself is large though, as large as the input data if all variables are numeric.

@haifengl
Owner

Btw, why do you treat all variables as numerical? It doesn't make sense to me.

@myui
Contributor Author

myui commented Mar 10, 2016

@haifengl just to compare the accuracy of Smile to scikit-learn, and to investigate the CPU bottlenecks of Smile when all variables are treated as quantitative.

scikit-learn cannot handle categorical variables (i.e., a one-hot encoder is required for that), but its accuracy and training speed were excellent.

I'll compare again when using proper attribute types in Smile. The test results will follow.

The variable order is shared by all trees in the forest. It should not cause memory problems 

Agreed. However, it would be better to reduce memory usage for cases where the number of variables is large.

@haifengl
Owner

Thanks! Can you also contribute your bag implementation back to smile? Thanks again!

@myui
Contributor Author

myui commented Mar 10, 2016

@haifengl still working on reducing the CPU bottlenecks in the node split. Will do after fixing the implementation!

@myui
Contributor Author

myui commented Mar 10, 2016

We check every possible split for a numeric value (up to n possible splits). For large data, that is a big cost. To speed things up, we could check only a small number of splits, say at every 1% step. This should give roughly a 100x speedup, though the error may be larger.

Sounds like a good idea. I'll test that where the number of training instances is large.

@haifengl
Owner

Cool. Thank you!

@haifengl
Owner

haifengl commented Jun 5, 2016

Hi @myui, can you please make a pull request for your improvements on memory usage? Thank you very much!

@haifengl
Owner

haifengl commented Jun 5, 2016

@myui, how are your test results with checking fewer splits for numeric values? Thanks!

@myui
Contributor Author

myui commented Jun 6, 2016

I haven't tested the sampling scheme for splitting numeric values yet. I've been busy with other daily jobs :-(
I filed a ticket on my own project.

@haifengl
Owner

haifengl commented Jun 7, 2016

Thanks! I am also working on my bag implementation. Will let you know the performance.
