
[Backend] Utilizes no_alloc variant of JErasure to avoid heap allocation #1

Merged: 4 commits into develop on Feb 2, 2016

Conversation

windkit (Contributor) commented Jan 19, 2016

Description

The JErasure library uses a lot of heap allocation for temporary structures.
The large number of small allocations/deallocations may have a performance impact.
When used in a NIF, memory issues are encountered (random segmentation faults).

Proposed Solution

Added a separate code path that avoids heap allocation: structures are either allocated by the caller beforehand or kept as temporaries on the stack.
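
A minimal sketch of the two code paths in plain C (the function names and structure here are illustrative only, not the actual jerasure API):

#include <stdlib.h>
#include <string.h>

/* Original style: the library heap-allocates its temporary matrix on
 * every call and frees it before returning. */
static void decode_with_alloc(int k, int w) {
    int *decoding_matrix = malloc(sizeof(int) * k * k * w * w);
    if (decoding_matrix == NULL) return;
    memset(decoding_matrix, 0, sizeof(int) * k * k * w * w);
    /* ... build the decoding matrix and decode ... */
    free(decoding_matrix);
}

/* no_alloc style used by this PR: the temporary lives on the caller's
 * stack (C99 variable-length array), so there is no heap traffic. */
static void decode_noalloc(int k, int w) {
    int decoding_matrix[k * k * w * w];
    memset(decoding_matrix, 0, sizeof(decoding_matrix));
    /* ... build the decoding matrix and decode ... */
}

The trade-off is that the worst-case temporary size now has to fit on the calling thread's stack, which is what the +sss discussion below is about.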

Related Issue

leo-project/leofs#440

Related PRs

leo-project/jerasure#3
leo-project/jerasure#4
#1

mocchira (Member) commented:

@windkit
Since this PR uses a large amount of stack,
in order to tune the +sss Erlang emulator flag (http://www.erlang.org/doc/man/erl.html)
we need to know the maximum amount of stack used per thread in the worst case (the most deeply nested call tree that includes jerasure).
Can you estimate that with the practical maximum k, m and w?
I'd recommend using a runtime estimation (http://stackoverflow.com/questions/1756285/stack-size-estimation).
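
One possible runtime estimation, along the lines of the linked Stack Overflow question (a sketch only, not code from this PR; the helper names are made up): take the address of a local variable at the NIF entry point and again at the deepest point of the call tree, and look at the difference.

#include <stddef.h>
#include <stdio.h>

static char *stack_base;                 /* set at the NIF entry point */

/* Call once at the top of the entry function. */
void stack_mark_base(void) {
    char marker;
    stack_base = &marker;
}

/* Call at the deepest point of the call tree (e.g. inside the innermost
 * jerasure helper); on x86 the stack grows downwards, so base - here
 * approximates the bytes of stack consumed since the entry point. */
void stack_report_usage(const char *where) {
    char marker;
    ptrdiff_t used = stack_base - &marker;
    fprintf(stderr, "%s: ~%ld bytes of stack used\n", where, (long)used);
}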

windkit (Contributor, Author) commented Jan 19, 2016

@mocchira Stack size is indeed one of my concerns about these PRs.

The biggest usage happens in jerasure_schedule_decode_data_lazy:

int smart_n[k*m*w*w+1][5];    /* per-operation entries of the decoding schedule */
int *smartptr[k*m*w*w+1][5];  /* schedule pointer table */
...
// jerasure_generate_decoding_data_schedule_noalloc()
int real_decoding_matrix[k*w*(cdf+ddf)*w];
int decoding_matrix[k*k*w*w];
int inverse[k*k*w*w];

Each instance would consume roughly 10*k*k*w*w*(size of int) bytes.
Practically, w = 8 (since k + m < 2^w) should be enough for Cauchy-RS, with k = 20 (number of data blocks) and m < k.
In total, it would take about 2 MB.

To set a suitable suggested stack size, we first need to limit the parameters users can choose. Alternatively, we can move these large structures back to the heap (with enif_alloc provided by erl_nif.h).
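
For reference, the enif_alloc fallback would look roughly like this (a sketch; enif_alloc/enif_free are the real erl_nif.h allocator calls, but the surrounding function is hypothetical):

#include <erl_nif.h>

/* Hypothetical helper: keep the big temporary on the heap managed by the
 * Erlang VM allocators instead of the scheduler thread's stack. */
static int decode_with_enif_alloc(int k, int w) {
    int *decoding_matrix = enif_alloc(sizeof(int) * k * k * w * w);
    if (decoding_matrix == NULL) return -1;
    /* ... build the decoding matrix and decode ... */
    enif_free(decoding_matrix);
    return 0;
}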

mocchira (Member) commented:

@windkit Thanks.
2MB sounds reasonable to me.

Given that there are 32 cores (which typically means 32 scheduler threads with the default VM parameters),
it will take approximately 64 MB (32 * 2 MB) of stack, which is no problem.

I may ask you to benchmark erasure coding features with different +sss settings.

mocchira (Member) commented:

@windkit
Please link this information (#1) somewhere around leo_erasure/README.md.

windkit (Contributor, Author) commented Jan 20, 2016

While the size of int varies across platforms, it is usually 4 bytes.
Tested on Ubuntu 14.04.3 x64 with GCC 4.8.4 and Erlang 17.5: setting the suggested stack size to 64 kilowords (+sss 64, i.e. 512 KB on a 64-bit system) works fine with Cauchy-RS {20,4,8} during the decoding test.

I will start benchmarking the performance with different stack sizes.

mocchira (Member) commented:

LGTM.
I will merge after the benchmarking is finished.

windkit (Contributor, Author) commented Jan 22, 2016

Benchmark results are uploaded. Cauchy-RS is used as it is the heaviest scheme in terms of stack size.

In short, the PR improves performance slightly (~5%), and the stack size (+sss) has no significant effect on performance.

Encoding

https://github.com/leo-project/notes/tree/master/leofs/benchmark/libs/leo_erasure/20160121_1m_cauchyrs_k10m4_t32_enc

Decoding

https://github.com/leo-project/notes/tree/master/leofs/benchmark/libs/leo_erasure/20160121_1m_cauchyrs_k10m4_t32_dec

mocchira (Member) commented:

@windkit Thanks.
To make sure, can you do a long-running test of around 8 hours?

windkit (Contributor, Author) commented Jan 22, 2016

Sure, I am running a 10-hour encode+decode 1:1 test now.

yosukehara (Member) commented:

@windkit After the current benchmark, I'd like to ask you to benchmark LeoFS v1.4.0 with the latest leo_erasure for a long duration - 3 hours, 6 hours and more - similar to 20151222_isars_k10m4_15m_r49w1_60min_1.

windkit (Contributor, Author) commented Jan 25, 2016

@mocchira Please find the long-running test result at
https://github.com/leo-project/notes/tree/master/leofs/benchmark/libs/leo_erasure/20160122_1m_cauchyrs_k10m4_t32_10hr
With R:W = 1:1, the throughput is the average of the two, around 7,000 ops.

@yosukehara I will now move on to testing with LeoFS.

yosukehara (Member) commented:

@windkit Thanks for benchmarking that. It is a good result to me, and I've just noticed it is almost the same result as 20151222_isars_k10m4_15m_r49w1_60min_1.

I'd like to ask you to benchmark both isars and vandrs with LeoFS v1.4.0-pre.3-dev for a long duration, 4 or 6 hours.

windkit (Contributor, Author) commented Jan 26, 2016

@yosukehara I will start testing the two coding schemes for 6 hours to check their stability.

mocchira (Member) commented:

@windkit Thanks.
I'll also check your PR for jerasure tomorrow and
merge both PRs if there is no problem.

yosukehara (Member) commented:

@windkit Thanks a lot.

mocchira (Member) commented:

@windkit
To make sure: since the jerasure code base was changed,
I'd recommend you run the primary benchmarks (the longest long-running one, etc.) again.

windkit (Contributor, Author) commented Jan 29, 2016

@mocchira I am starting a 6-hour test for all the supported coding schemes.

windkit (Contributor, Author) commented Feb 1, 2016

By mistake I messed up the branches. I am currently fixing them; please wait a moment.

windkit (Contributor, Author) commented Feb 1, 2016

The problem has been fixed, sorry about that.

In the process I spotted another memory leak with Coding* getCoder in c_src/leo_erasure_nif.cpp; I will fix it separately.

mocchira added a commit that referenced this pull request on Feb 2, 2016: [Backend] Utilizes no_alloc variant of JErasure to avoid heap allocation
mocchira merged commit 59464fa into leo-project:develop on Feb 2, 2016
mocchira (Member) commented Feb 2, 2016

LGTM.
Thank you as always!
