
Memory leak P-1 #40

Closed
sillygitter opened this issue Apr 11, 2019 · 16 comments

@sillygitter
Contributor

Each time a new test starts, it uses a little system memory and doesn't seem to free it afterwards. I encountered this when doing many small P-1 tests; it took ~150 tests to fill 16GB of memory, so it's unlikely to be hit under normal use.

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
@valeriob01
Contributor

That should happen sooner on the RX580, which has 8GB of RAM.

@valeriob01
Contributor

Doing the P-1 test right now on the RX580; let's see how soon it occurs.
However, it did report "no factor"; I will post the gpuowl.log later.

@preda
Owner

preda commented Apr 11, 2019

Does this concern GPU memory or main system memory?

@preda
Owner

preda commented Apr 11, 2019

OK. Could it be that there was not enough memory on the GPU, for example because something else (or another instance) was running at the same time and taking up some GPU memory?

By default, P-1 allocates almost all GPU memory at the start of stage 2. It checks the available memory at the very beginning of the test (when it computes the plan), but only allocates it at the start of stage 2. So if something reduces the available GPU memory in the meantime, the allocation in stage 2 will fail with a bad_alloc.
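A minimal sketch of that failure mode (hypothetical names; host-side std::vector stands in for the GPU buffers, this is not gpuowl's actual code): the buffer count is fixed when the plan is computed, but the memory is only claimed at stage 2, so anything that eats memory in between makes the deferred allocation throw.

```cpp
// Sketch only: plan early, allocate late.
#include <cstddef>
#include <vector>

struct Plan {
    size_t nBuffers;    // decided up front from the memory available *now*
    size_t bufferSize;  // bytes per stage-2 buffer
};

Plan makePlan(size_t availableBytes, size_t bufferSize) {
    return {availableBytes / bufferSize, bufferSize};
}

// Called much later, at the start of stage 2. If another process grabbed
// memory in the meantime, one of these allocations throws std::bad_alloc.
std::vector<std::vector<char>> allocStage2(const Plan& plan) {
    std::vector<std::vector<char>> bufs;
    for (size_t i = 0; i < plan.nBuffers; ++i) {
        bufs.emplace_back(plan.bufferSize);
    }
    return bufs;
}
```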

@preda
Owner

preda commented Apr 11, 2019

To identify a memory leak, you should observe a gradual reduction in available memory over time, e.g. every time a new P-1 test starts, the available GPU memory drops by, say, 100MB.
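One way to watch for that on Linux (a diagnostic sketch, not part of gpuowl): sample the process's resident set size between tests and look for growth by a similar amount per test.

```cpp
// Diagnostic sketch: read VmRSS (resident memory, in kB) from /proc/self/status.
#include <cstdio>

long residentKb() {
    FILE* f = fopen("/proc/self/status", "r");
    if (!f) return 0;
    char line[256];
    long kb = 0;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    }
    fclose(f);
    return kb;
}
// Logging residentKb() before each test: steady growth by a similar amount
// per test points at a per-test leak rather than fragmentation.
```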

@preda
Owner

preda commented Apr 11, 2019

On the failed P-1, what was the "buffers" message at the start of the test? Something like:
GPU RAM fits 388 stage2 buffers @ 40.0 MB each, using 360

@valeriob01
Contributor

  • The first exponent:
    GPU RAM fits 184 stage2 buffers @ 40.0 MB each, using 180
  • Subsequent exponents until now:
    GPU RAM fits 205 stage2 buffers @ 36.0 MB each, using 192

@valeriob01
Contributor

{"exponent":"86100473", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"6.4-f6d3153"}, "timestamp":"2019-04-11 15:04:44 UTC", "user":"selroc", "computer":"RX580-9", "fft-length":4718592, "B1":20000, "B2":600000}

verified with:
./openowl -pm1 86100473 -B1 20000 -user selroc -cpu RX580-9 -device 0

No factor found.

@preda
Owner

preda commented Apr 11, 2019

@valeriob01: That's fine. Here's why:
grep 86100473 test-pm1/pm1.txt
86100473,15290240534639630110561,74,223,323467
So the factor is 15290240534639630110561.
In gp (PARI-GP):
factor(15290240534639630110560)
%1 =
[ 2 5]
[ 3 1]
[ 5 1]
[ 23 1]
[ 223 2]
[ 323467 1]
[86100473 1]

The key is the exponent of 223: to cover 223^2 in stage 1, we must have B1 >= 223^2 = 49729, which is larger than 20000. Try it with B1=50000 or larger.
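A quick standalone check of that reasoning (not gpuowl code; it assumes, as Mersenne P-1 implementations do, that the exponent p is covered for free, since every factor f of 2^p-1 satisfies f ≡ 1 mod 2p):

```cpp
// Factor f-1 by trial division and report the B1/B2 a P-1 test would need:
// stage 1 must cover every remaining prime power up to B1, and stage 2 may
// cover one extra prime up to B2. Uses unsigned __int128 (GCC/Clang) since
// f-1 = 15290240534639630110560 does not fit in 64 bits.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const uint64_t exponent = 86100473;  // p divides f-1 and comes for free
    unsigned __int128 n =                // f-1, assembled in two parts to fit literals
        (unsigned __int128)15290240534639ULL * 1000000000ULL + 630110560ULL;

    std::vector<std::pair<uint64_t, int>> fac;  // (prime, multiplicity), ascending
    for (uint64_t p = 2; (unsigned __int128)p * p <= n; p += (p == 2) ? 1 : 2) {
        int e = 0;
        while (n % p == 0) { n /= p; ++e; }
        if (e) fac.push_back({p, e});
    }
    if (n > 1) fac.push_back({(uint64_t)n, 1});  // leftover prime cofactor

    uint64_t needB1 = 1, needB2 = 1;
    for (auto [p, e] : fac) {
        if (p == exponent) continue;  // included automatically in stage 1
        uint64_t pw = 1;
        for (int i = 0; i < e; ++i) pw *= p;
        if (e == 1 && p > needB2) {                // candidate for the one stage-2 prime
            if (needB2 > needB1) needB1 = needB2;  // demote the previous candidate to B1
            needB2 = p;
        } else if (pw > needB1) {
            needB1 = pw;
        }
    }
    printf("needs B1 >= %llu and B2 >= %llu\n",
           (unsigned long long)needB1, (unsigned long long)needB2);
    // Prints: needs B1 >= 49729 and B2 >= 323467
    // 223^2 = 49729 > 20000 explains the miss; B1 = 50000 covers it.
}
```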

Thanks for reporting it; luckily it was a bug in the test, not in the program :)

@valeriob01
Contributor

OK.

./openowl -pm1 86100473 -B1 50000 -user selroc -cpu RX580-9 -device 0

...
2019-04-12 05:57:28 RX580-9 86100473 P-1 GPU RAM fits 205 stage2 buffers @ 36.0 MB each, using 192
...

2019-04-12 06:09:36 RX580-9 {"exponent":"86100473", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"6.4-f6d3153"}, "timestamp":"2019-04-12 04:09:36 UTC", "user":"selroc", "computer":"RX580-9", "fft-length":4718592, "B1":50000, "B2":1500000, "factors":["15290240534639630110561"]}

@sillygitter
Contributor Author

Sorry I didn't reply sooner; 4G signal is hard to come by. I meant system memory, not GPU memory: this PC just happens to have 16GB of DDR4. Each time a test starts, the memory usage of openowl increases when you look at it with top, until it fails to allocate memory for the next test and quits. GPU RAM is unaffected.

@valeriob01
Contributor

I think some part of the P-1 computation is done on the CPU when a test starts or finishes, because the CPU becomes hot.

@preda
Owner

preda commented Apr 12, 2019

There might be a leak related to GMP, which is used for the GCD computation on the CPU. Investigating.
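For reference, the classic GMP leak has this shape (a hypothetical sketch of a CPU-side GCD, not gpuowl's actual code): every mpz_t allocates heap memory that is only released by mpz_clear.

```cpp
// Hypothetical sketch: test gcd(residue - 1, 2^p - 1) > 1 on the CPU with GMP.
#include <gmp.h>
#include <cstdint>
#include <vector>

bool hasFactor(const std::vector<uint64_t>& words, unsigned long exponent) {
    mpz_t x, m, g;
    mpz_init(x); mpz_init(m); mpz_init(g);
    // Least-significant word first, 8-byte words, native endianness, no nails.
    mpz_import(x, words.size(), -1, sizeof(uint64_t), 0, 0, words.data());
    mpz_sub_ui(x, x, 1);            // residue - 1
    mpz_ui_pow_ui(m, 2, exponent);  // 2^p
    mpz_sub_ui(m, m, 1);            // 2^p - 1
    mpz_gcd(g, x, m);
    bool found = mpz_cmp_ui(g, 1) > 0;
    // A leak of this shape: without these clears, the limb buffers behind
    // x, m, g (several MB each at these exponents) survive every test.
    mpz_clear(x); mpz_clear(m); mpz_clear(g);
    return found;
}
```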

@preda
Owner

preda commented Apr 12, 2019

Thanks! Fixed.

preda closed this as completed Apr 12, 2019
@sillygitter
Contributor Author

Wow that was quick. Nice one.

@preda
Owner

preda commented Apr 12, 2019 via email
