
DecisionTree and RegressionTree uses too much memory space #70

Closed
myui opened this issue Mar 2, 2016 · 32 comments

@myui
Contributor

myui commented Mar 2, 2016

I found that the DecisionTree and RegressionTree implementations use too much heap space when constructing trees as the split method call depth grows, because trueSamples and falseSamples are not released for GC until each split method call returns.

https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/classification/DecisionTree.java#L655
https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L599

Other RandomForest implementations use bag[N], in which bag[i] represents sample[bag[i]]++, instead of samples[N], and bag.length shrinks as the tree depth grows. In smile, N is the number of training examples, so split consumes O(2N * depth) memory.
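For illustration, the allocation pattern being described looks roughly like this (a minimal sketch with illustrative names, not the actual smile code):

    // Each recursive split allocates two arrays of length N, and they stay reachable
    // until both child calls return, so a path of depth D pins roughly 2N * D ints.
    void split(int[] samples) {
        int n = samples.length;            // always the full training size N
        int[] trueSamples = new int[n];    // sample counts routed to the true branch
        int[] falseSamples = new int[n];   // sample counts routed to the false branch
        // ... partition samples into trueSamples / falseSamples ...
        split(trueSamples);                // samples, trueSamples and falseSamples are
        split(falseSamples);               // all still live during these calls
    }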

Releasing samples and trueSamples as soon as possible helps a little.
https://github.com/myui/hivemall/pull/259/files

Here is a stack trace from an OOM that happened due to recursive node splits.

    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:705)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:714)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree$TrainNode.split(RegressionTree.java:701)
    at smile.regression.RegressionTree.<init>(RegressionTree.java:799)

Using a SparseIntArray for samples is an option, but it would make tree construction slower. @haifengl What do you think?

@haifengl
Owner

haifengl commented Mar 2, 2016

Good point. I will look into how to reduce the memory footprint. I don't quite get the bag[N] part. Can you please give more details? Thanks!

BTW, how large are your data and RAM in the OOM example?

@myui
Contributor Author

myui commented Mar 3, 2016

Instead of samples[0]=1, samples[1]=2, samples[2]=0, use bag[0]=1, bag[1]=0, bag[2]=1.
As the depth grows, bag.length can be reduced because a bag is divided into two bags at each node split.
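A toy illustration of the two representations (illustrative only, not code from smile or hivemall):

    // samples[i] = how many times training instance i was drawn; length is always N.
    int[] samples = {1, 2, 0};   // instance 0 once, instance 1 twice, instance 2 never
    // bag[j] = index of the j-th drawn instance; the same multiset, but this array
    // can shrink because it is partitioned into two smaller bags at each node split.
    int[] bag = {1, 0, 1};       // instance 1, instance 0, instance 1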

-Xmx is restricted to 2 GB. But I guess the current implementation needs more than 10 GB, as it consumes 2N * nodes * 4 bytes.

I'm using the Kaggle Rossmann dataset for evaluation. The number of training instances is about 1 million.
https://www.kaggle.com/c/rossmann-store-sales/data

@haifengl
Owner

haifengl commented Mar 3, 2016

By default, random forest trains fully grown trees. Until we have a solution to reduce the memory usage, you can use maxNodes to control the tree size as a workaround. This actually often gives better accuracy on large data as well.

@haifengl
Owner

haifengl commented Mar 3, 2016

Sorry, I don't really get your bag example. What is the meaning of the bag index and value? Thanks!

@myui
Contributor Author

myui commented Mar 3, 2016

Replace the sampling as follows:

            // Training samples draw with replacement.
            final int[] bags = new int[N];
            final BitSet sampled = new BitSet(N);
            for (int i = 0; i < N; i++) {
                int index = rnd1.nextInt(N);
                bags[i] = index;
                sampled.set(index);
            }
            ...
            ...
            // out-of-bag prediction
            for (int i = sampled.nextClearBit(0); i < N; i = sampled.nextClearBit(i + 1)) {
                 int p = tree.predict(_x[i]);
                 ...
            }

The original code is

Random random = new Random(Thread.currentThread().getId() * System.currentTimeMillis());

@myui
Contributor Author

myui commented Mar 3, 2016

The depth of a balanced binary tree with N nodes is log_2(N+1) − 1, which is about 5.658 for N = 100. The maxNodes of smile is actually maxLeafNodes, and thus N should be 127 for depth 6 (though a DecisionTree is unbalanced, so this is not exact). Depth 6 is often too small when the number of features is large, and I would like to target depth 10.

A full binary tree of depth D has 2^D − 1 nodes, and the current implementation would consume 2N * nodes * 4 bytes ≈ 8 GiB where D = 10, nodes = 1023 (512 leaf nodes), and N = 1,000,000.
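Worked out explicitly with those numbers:

    2N * nodes * 4 bytes = 2 * 1,000,000 * 1,023 * 4 bytes
                         ≈ 8.2 * 10^9 bytes
                         ≈ 7.6 GiB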

@haifengl
Owner

haifengl commented Mar 3, 2016

How about you try the bag idea and see its effect on memory and speed? If it is good, please make a pull request. Thanks a lot!

@myui
Contributor Author

myui commented Mar 3, 2016

The bag approach worked fine, though there may be other, better approaches.

You can find my approach in
https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L632

@haifengl
Owner

haifengl commented Mar 3, 2016

Thanks! Is the accuracy the same? How much was the memory usage reduced?

@myui
Contributor Author

myui commented Mar 4, 2016

I haven't run all of smile's unit tests, but the results were the same for the Iris and Weather unit tests.
It can run depth=10 for the Rossmann dataset, while the original smile implementation caused OOM at depth=5 with -Xmx2g.

@haifengl
Owner

haifengl commented Mar 4, 2016

Thanks! How's the speed?

@myui
Contributor Author

myui commented Mar 7, 2016

It might consume 110% or so of the original construction time, as it involves some additional computation, e.g. in
https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L568

But I think that is negligible because memory is a more valuable resource than CPU for constructing a decision tree (suppose we build decision trees in parallel for RandomForest).

To get a result competitive with scikit-learn's RF on the Kaggle Rossmann dataset, depth 30 is required. Logloss was 0.29751 for depth=10, 0.13112 for depth=30, and 0.12992 for unlimited depth.

Partial overfitting plus a bunch of model ensembling can sometimes be a good solution. A favorite cynical quote about Kaggle: "Overfitting? Nay, it's called winning."

@haifengl
Owner

haifengl commented Mar 7, 2016

Cool. Can you please make a pull request so that we can get your changes? Thanks a lot!

@douglasArantes

Training a RandomForest on the San Francisco Crime dataset (Kaggle), I was getting an OOM plus a long run time. I decided to switch the GC to G1 with the flag -XX:+UseG1GC, and that solved the problems.

@haifengl
Owner

haifengl commented Mar 7, 2016

Thanks Douglas! How does the result compare to other machine learning libraries in terms of speed, memory usage, accuracy, etc.?

@douglasArantes

I haven't tested this dataset with other libraries yet.

@myui
Contributor Author

myui commented Mar 9, 2016

I found that smile's RegressionTree construction is very slow compared to scikit-learn on the Rossmann dataset.

I guess the findBestSplit call for Attribute.Type.NUMERIC is the main factor, though it would be better to use a profiler for a detailed investigation.
https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/regression/RegressionTree.java#L492

@haifengl
Owner

haifengl commented Mar 9, 2016

Did scikit-learn discretize numeric values first? Thanks

@myui
Contributor Author

myui commented Mar 9, 2016

No, I treated all variables as numeric attributes. I just converted/identified categorical variables as numbers, as follows:

stateholiday    store   promo   dayofweek       schoolholiday   promo2sinceweek competitionopensinceyear        assortment      promo2sinceyear competitiondistance     promointerval   promo2  storetype       competitionopensincemonth       year    month   day     target_sales
0       660     0       2       0       0       0       0       0       1200    0       1       0       0       0       0       0       8.085178748074537
0       669     0       2       0       1       1       0       1       17080   0       1       1       1       0       0       0       7.88570539124302
0       760     0       2       0       0       2       0       0       560     0       0       0       2       0       0       0       8.696008608880904
0       1018    0       3       0       0       1       1       0       140     0       0       2       3       0       0       1       8.931419805192975
0       15      1       4       0       2       3       1       2       4110    0       1       1       4       0       0       2       8.725994381014571
0       685     0       6       0       3       4       0       3       650     0       1       0       0       0       1       3       8.37816098272068
0       686     0       6       0       0       5       0       0       20050   0       0       0       5       0       1       3       7.661997558901893
0       687     0       6       0       0       0       1       0       2770    0       0       1       0       0       1       3       8.857799727175905

Uploaded the training data at https://dl.dropboxusercontent.com/u/13123103/rossmann_train.tsv.bz2

Test code snippet:

    @Test
    public void testRossmann() throws FileNotFoundException, IOException, ParseException {
        DelimitedTextParser parser = new DelimitedTextParser();
        parser.setDelimiter("\t");
        parser.setResponseIndex(new NumericAttribute("sales"), 17);
        parser.setColumnNames(true);
        AttributeDataset dataset = parser.parse(new File("/Users/myui/Downloads/rossmann_train.tsv"));
        double[][] features = new double[dataset.size()][];
        dataset.toArray(features);
        double[] label = new double[dataset.size()];
        dataset.toArray(label);

        int maxLeafs = Integer.MAX_VALUE;
        StopWatch stopwatch = new StopWatch();
        RegressionTree tree = new RegressionTree(null, features, label, maxLeafs);        
        System.out.println(stopwatch);
    }

@myui
Contributor Author

myui commented Mar 9, 2016

As I guessed, findBestSplit() and findBestSplit(int n, double sum, int j) consume most of the time when profiling the above test.
https://gyazo.com/584f1cafe5e1d76057256ed9bfed55c6

I was a little surprised that findBestSplit() itself was consuming time, while 52% of it was spent in findBestSplit(int n, double sum, int j).

As for memory-consuming objects, I found that the int[][] order array is the dominant one after applying my bag approach.
https://gyazo.com/1e57183f244f0139d35024c6e22a598b

For the Rossmann dataset, an int[17][N] is allocated for the 17 explanatory variables, where N is the number of training instances. This scheme should be revised.
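For reference, a hypothetical sketch of the kind of per-attribute sort index being discussed, assuming order[j] holds the row indices sorted by the value of attribute j (my reading of the discussion, not smile's actual code):

    import java.util.stream.IntStream;

    // Hypothetical sketch: build one sorted index per attribute so that candidate
    // numeric splits can be scanned in value order. For p attributes and n rows this
    // needs p * n ints, i.e. int[17][N] for the Rossmann data mentioned above.
    static int[][] buildOrder(final double[][] x) {
        int n = x.length, p = x[0].length;
        int[][] order = new int[p][];
        for (int j = 0; j < p; j++) {
            final int col = j;
            order[j] = IntStream.range(0, n)
                    .boxed()
                    .sorted((a, b) -> Double.compare(x[a][col], x[b][col]))
                    .mapToInt(Integer::intValue)
                    .toArray();
        }
        return order;
    }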

@haifengl
Owner

We check every possible split for a numeric value (up to n possible splits). For large data, that is a big cost. To speed things up, we could check only a small number of splits, say at every 1% step. This should give roughly a 100x speedup, though the error may be larger.
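A rough sketch of that idea (illustrative only, not the smile implementation; the score argument below is a placeholder for whatever impurity or variance criterion the tree uses):

    import java.util.function.DoubleUnaryOperator;

    // Check candidate thresholds at roughly every 1% quantile of the sorted values
    // instead of at every distinct value, trading some split quality for speed.
    static double bestSplit(double[] sortedValues, DoubleUnaryOperator score) {
        int n = sortedValues.length;
        int step = Math.max(1, n / 100);          // about 100 candidate thresholds
        double bestThreshold = Double.NaN;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = step; i < n; i += step) {
            double threshold = (sortedValues[i - 1] + sortedValues[i]) / 2.0;
            double s = score.applyAsDouble(threshold);
            if (s > bestScore) {
                bestScore = s;
                bestThreshold = threshold;
            }
        }
        return bestThreshold;
    }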

@haifengl
Owner

The variable order is shared by all trees in the forest. It should not cause memory problems unless I made a bug. The matrix itself is large though, as large as the input data if all variables are numeric.

@haifengl
Owner

Btw, why do you treat all variables as numerical? It doesn't make sense to me.

@myui
Contributor Author

myui commented Mar 10, 2016

@haifengl just to compare the accuracy of Smile to scikit-learn, and to investigate the CPU bottlenecks of Smile when all variables are treated as quantitative.

scikit-learn cannot handle categorical variables (i.e., a one-hot encoder is required for that), but its accuracy and training speed were excellent.

I'll compare again when using proper attribute types in Smile. The test results will follow.

The variable order is shared by all trees in the forest. It should not cause memory problems 

Agreed. However, it would be better to reduce memory usage for cases where the number of variables is large.

@haifengl
Owner

Thanks! Can you also contribute your bag implementation back to smile? Thanks again!

@myui
Contributor Author

myui commented Mar 10, 2016

@haifengl still working on reducing the CPU bottlenecks in the node split. Will do after fixing the implementation!

@myui
Contributor Author

myui commented Mar 10, 2016

We check every possible split for a numeric value (up to n possible splits). For large data, that is a big cost. To speed things up, we could check only a small number of splits, say at every 1% step. This should give roughly a 100x speedup, though the error may be larger.

Sounds like a good idea. I'll test that where the number of training instances is large.

@haifengl
Owner

Cool. Thank you!

@haifengl
Owner

haifengl commented Jun 5, 2016

Hi @myui, can you please make a pull request for your improvements on memory usage? Thank you very much!

@haifengl
Owner

haifengl commented Jun 5, 2016

@myui, how are your test results with checking fewer splits for numeric values? Thanks!

@myui
Contributor Author

myui commented Jun 6, 2016

I haven't tested the sampling scheme for splitting numeric values yet. I've been busy with other daily jobs :-(
I filed a ticket on my own project.

@haifengl
Owner

haifengl commented Jun 7, 2016

Thanks! I am also working on my bag implementation. Will let you know the performance.
