
performance issue with random forests #18

Closed
amir2040 opened this issue Apr 10, 2022 · 5 comments

amir2040 commented Apr 10, 2022

Hi,
I'm running a RANDOM_FOREST model trained in tf_df through the Yggdrasil C++ API, and inference takes about 50 μs per example, but as you said it probably shouldn't take more than 10 μs.

Running with large batches (vs. batch size = 1) or compiling with --copt=-mavx2 doesn't make any difference.
I've also used the benchmark_inference tool, and the result was the same.
Another interesting observation is the difference between the min and max execution time per instance.
Execution times (in ns) for 10 runs:
########################################
0 max 2133059 min 17293 avg 52250
1 max 1054634 min 16696 avg 52982
2 max 1038110 min 14949 avg 45468
3 max 1068611 min 16752 avg 53064
4 max 1657415 min 16790 avg 54514
5 max 1125537 min 16432 avg 53145
6 max 1939590 min 17591 avg 74354
7 max 2997816 min 17284 avg 70325
8 max 1064365 min 16554 avg 56063
9 max 1044182 min 16145 avg 51429
########################################

Even if I ignore some of the first execution times (to account for cold caches), the variance between execution times is still high:
########################################
0 max 1488318 min 15841 avg 51085
1 max 955384 min 16501 avg 45567
2 max 928377 min 16370 avg 44606
3 max 1018261 min 15124 avg 44204
4 max 1429345 min 17299 avg 79810
5 max 1628887 min 17539 avg 80997
6 max 2126679 min 16487 avg 67346
7 max 1058939 min 16616 avg 53941
8 max 1098449 min 16242 avg 48047
9 max 1103341 min 16659 avg 53750
########################################

I'm wondering whether there is a problem in the model or the inference setup, or whether this is the best performance I can get.
Model spec:
RANDOM_FOREST
300 root(s), 618972 node(s), and 28 input feature(s).
RandomForestOptPred engine

Compiled with these flags:
--config=linux_cpp17 --config=linux_avx2 --repo_env=CC=gcc-9 --copt=-mavx2

System spec:
Ubuntu 18.04.4
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
on an ESXi virtual machine


achoum commented Apr 11, 2022

Hi Amir,

Thanks for the report. Here are some thoughts.

The inference speed of the model depends on multiple factors:

  1. The most important factor is the type of model. Generally, Random Forest models are slower than Gradient Boosted Tree models. In your case, if the quality is similar, the main way to speed up inference is to use a Gradient Boosted Tree model instead. The impact of --copt=-mavx2 should then be visible.

  2. The second important factor is the size of the model, which depends on the size of the dataset (number of examples, number of features, types of features) and on the hyper-parameters. For example, increasing "num_trees" in a Random Forest model will make it more accurate but also bigger. In this post, the dataset was small (a few hundred examples).

  3. The speed and load of your machine. Since the benchmark is not multi-threaded, the reported value is approximately the inference speed per core. Those figures should be relatively stable. Unstable results can be explained by other programs running concurrently with the benchmark, or by variance in the CPU speed (for example, if your machine is configured in power-saving mode; make sure to run benchmarks in "performance" mode).

How many examples are there in the benchmark? If this value is small, the stability of the results can be improved using more rounds (i.e. increase the value of --num_runs).


amir2040 commented Apr 11, 2022

Thanks for the reply and the suggestions.
There are about 19k examples in the benchmark, and using more rounds didn't make a difference.
I'm using performance mode and an isolated core to run the benchmark.
The number of records in the dataspec is about 75k.
So the question about the large gap between min, max and average remains.
If I could get the average execution time near the min (17 μs), that would be ideal.
Some of the longer execution times may be due to cache misses (especially for the first examples), but subsequent runs still take far more than the min.


Update:
After some more observation, I've noticed that the execution times for a specific example are quite stable, but some examples take more time than others. This is probably due to differences in leaf depth.
A file containing the execution times (in ns) is attached.

sbs.txt


achoum commented Apr 12, 2022

Can you do the following:

  1. Train and measure the speed of a gradient boosted model.
  2. Report the node depth histogram of your Random Forest model. It is printed in the model description (c++, cli).
  3. Since you are not using the benchmark_inference tool, can you share details about how you implemented your benchmark? Notably, the file you shared seems to contain the timing of individual model calls (instead of averages over multiple calls). Make sure not to contaminate your benchmark with it.
  4. For the sake of it, can you also share the performance numbers from the benchmark_inference tool?

amir2040 commented

Thanks for the suggestions.
1. I'll be working on it.
2.
Depth by leafs:
Count: 309636 Average: 12.1423 StdDev: 2.35022
Min: 3 Max: 15 Ignored: 0
[ 3, 4) 15 0.00% 0.00%
[ 4, 5) 128 0.04% 0.05%
[ 5, 6) 754 0.24% 0.29%
[ 6, 7) 2882 0.93% 1.22%
[ 7, 8) 6945 2.24% 3.46% #
[ 8, 9) 13921 4.50% 7.96% ##
[ 9, 10) 22572 7.29% 15.25% ###
[ 10, 11) 30770 9.94% 25.19% #####
[ 11, 12) 38115 12.31% 37.50% ######
[ 12, 13) 42086 13.59% 51.09% ######
[ 13, 14) 43319 13.99% 65.08% #######
[ 14, 15) 42147 13.61% 78.69% ######
[ 15, 15] 65982 21.31% 100.00% ##########

3. I followed the example in the user manual: loading the model, building the fast engine, getting the features by name, allocating and setting the example, and finally calling Predict (the part I measure).
I can run it with different batch sizes, but changing the batch size (num_examples) doesn't change the execution time.
I've also done another benchmark where I set one example and then call Predict ~10 times, something like this:

struct timespec begin, end;
long min = LONG_MAX, max = 0, avg = 0;  // avg accumulates a sum; divide by (ITERATION - 1) at the end
for (int iter = 0; iter < ITERATION; iter++) {
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &begin);
  engine->Predict(*examples, /*num_examples=*/1, &predictions);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
  long seconds = end.tv_sec - begin.tv_sec;
  long nanoseconds = end.tv_nsec - begin.tv_nsec;
  long exec_time = seconds * 1000000000L + nanoseconds;
  if (iter != 0) {  // skip the first (warm-up) call
    avg += exec_time;
    min = exec_time < min ? exec_time : min;
    max = exec_time > max ? exec_time : max;
  }
}

With this, the execution time comes down to about 35 μs.
4.
Results with these options (changing them doesn't have much impact on the execution time):
--generic=false --batch_size=32 --num_runs=40 --warmup_runs=5
time/example(us) time/batch(us) method
47.389 1516.4 RandomForestOptPred [virtual interface]
63.123 2019.9 RandomForestGeneric [virtual interface]


achoum commented Dec 15, 2023

Thanks for the details and the patience :).

50 μs / example / core is a slow speed, but it is possible for a large model. As I mentioned before, using gradient boosted trees would lead to significantly faster models.

The histogram you shared indicates that most of the forest paths go to full depth (15). While this is common on large datasets, on small datasets it can indicate that the model is spending a lot of capacity on features that do not generalize. This type of problem is typically observed when feeding string IDs (e.g. example IDs, user IDs) to the model; example IDs are not generalizable. Looking at the other histograms in the model description can help you identify which feature is at fault.

This benchmark code has a few issues:

  • Predict is executed on a single example at a time. It is better to run it on multiple examples per call.
  • Predict is always executed on the same example. CPU branch prediction is very good today, which likely makes this benchmark over-optimistic. The benchmark should be executed on multiple different examples.
  • Instead of reading the clock and summing times inside the for-loop, it is better to time the whole loop once, outside of it.
  • The prediction object is never used, possibly leading to unwanted compiler optimization / code pruning.

A simple and accurate way to run a benchmark is to use the model.benchmark() method in the Python API:
https://ydf.readthedocs.io/en/latest/tutorial/getting_started/#benchmark-model-speed

achoum closed this as completed on Dec 15, 2023.