This repository has been archived by the owner on Sep 28, 2021. It is now read-only.

Memory leak? #310

Closed

darkyne opened this issue Dec 21, 2020 · 10 comments

Comments


darkyne commented Dec 21, 2020

When running a cluster of nodes locally for a while, they eventually crash after running out of memory, and my computer freezes.
At first this occurred after about 30 minutes with 5 nodes running in a cluster sharing an awset CRDT.
Then I tried reducing the state_interval (to 1000ms instead of 10000ms, for example) and noticed the memory problem shows up much sooner.

I noticed the culprit was the lasp_ets_storage_backend process, since it was quickly growing to hundreds of MB.
To figure out what it was storing, I printed every record it writes to the ets table, just to see what was going on...
What it stores looks very strange to me.

I run a small local cluster of 5 nodes with an awset that is updated slowly (the value is updated every 10 seconds, and state_interval is also 10000ms). As you can see (I simply printed the record when lasp_ets_storage_backend tries to store it in ets), it stores the state_awset with extremely redundant partisan_remote_reference information: there are only 5 nodes, but their references are repeated many times.

What you can see in the image is a single record being put into the ets table. I don't understand why one record would contain such redundant information.

Any idea what's going on, or is it just a bug?
[screenshot: lasp_ets_storage_backend record]

cmeiklejohn (Member) commented:

First of all, if you send a pid between nodes, it's converted to a partisan_remote_reference, because pids can't be sent between nodes that aren't connected with distributed Erlang; they have to be converted to a format that works without disterl. That's what that represents.

Second, without seeing the code that's inserting the values into the set, I can't tell whether this is a bug. It could be that you're using the awset incorrectly.
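
Roughly, the conversion amounts to something like this (a sketch, not Partisan's actual code; the record shape below only approximates the real partisan_remote_reference representation):

```erlang
%% Approximation only: Partisan's real representation differs,
%% but the idea is the same. A pid is node-relative, so it is
%% wrapped with its owning node and a serialized copy before
%% being shipped over a non-disterl channel.
-module(pid_wrap_sketch).
-export([wrap/1]).

wrap(Pid) when is_pid(Pid) ->
    {partisan_remote_reference, node(Pid), term_to_binary(Pid)}.
```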

darkyne (Author) commented Jan 2, 2021

Thanks for looking into this.
Like you, my first thought was that the issue was in my own code.
That's why I cloned a fresh copy of Lasp from this repo, with no modifications at all, to check whether the same behaviour shows up.

I simply followed the instructions from the README:
clone
make
open two terminals
rebar3 shell --name a@127.0.0.1 in the first terminal
rebar3 shell --name b@127.0.0.1 in the second terminal

From node a:
lasp_peer_service:join('b@127.0.0.1').
lasp:update({<<"test">>, state_awset}, {add, "5"}, self()).
(simply putting a dummy value into a set so that the node will send its state; if I'm not mistaken, update declares the state_awset implicitly, so there's no need to declare it explicitly).
erlang:process_info(whereis(lasp_ets_storage_backend), memory). run manually from time to time to watch the process size.
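
To watch that without typing the call by hand each time, a throwaway poller like this works from the rebar3 shell (my own snippet, not part of Lasp):

```erlang
%% Ad-hoc poller: print the storage backend's memory footprint
%% every 5 seconds. Kill the spawned process to stop it.
Poll = fun Loop() ->
           {memory, Bytes} =
               erlang:process_info(whereis(lasp_ets_storage_backend), memory),
           io:format("lasp_ets_storage_backend: ~p bytes~n", [Bytes]),
           timer:sleep(5000),
           Loop()
       end,
spawn(Poll).
```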

Result:
Growth is very slow, but over time the size keeps increasing. For example, it started around 140 kB and was around 500-600 kB after 10 minutes of running (with exactly the 2-node cluster just described).
Previously (not on the freshly cloned version), I had made some small modifications that made memory usage grow much faster; but the fact that it grows even on the clean, unmodified version seems like a potential issue to me.

As a remark:
If I create a bigger cluster (say 5 nodes instead of 2), memory usage grows faster.
If I decrease the state_interval (the default was 10000ms) to, say, 1000ms, memory usage grows a lot faster.

By simply adding a small print in lasp_ets_storage_backend to see what it was recording in the ets table (which makes the terminal output ugly but does not modify any behaviour), I see something similar to the screenshot above (that screenshot was from a cluster of 5 nodes), where some references are repeated many times. I don't know why it does this, but exact redundancy inside a single record seems strange to me.

I will post a screenshot of that little print from the freshly cloned version to see whether it's exactly the same as before (printing the Record inside the do_put function, just before ets:insert).
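
For reference, the trace amounts to something like this (a sketch from memory; the real do_put in lasp_ets_storage_backend is shaped differently, and Ref here stands for the ets table identifier):

```erlang
%% Sketch of the modification: print the record just before inserting it.
do_put(Id, Value, Ref) ->
    Record = {Id, Value},
    io:format("storing record: ~p~n", [Record]),  %% the added trace
    true = ets:insert(Ref, Record),
    ok.
```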

I might be wrong in my approach; if you believe that's the case, please let me know.

darkyne (Author) commented Jan 2, 2021

Here is the output from the io:format call printing what is put into the ets table.
This is from exactly the 2-node local cluster described above; node b is the source of the initial update (putting element "5" into the awset). Over time, the record grows, with more and more references to node b, slowly making the process memory size increase.

As explained, this is from the freshly cloned version of this repo with no modifications at all (other than the little print added in lasp_ets_storage_backend).

[screenshot: the record printed from do_put]

After waiting a little while, it looks like this (clearly showing the strange redundancy):
[screenshot: the same record later, with repeated references]

cmeiklejohn (Member) commented Jan 2, 2021

I want you to try something else.

Instead of:

lasp:update({<<"test">>, state_awset}, {add, "5"}, self()).

try:

lasp:update({<<"test">>, state_awset}, {add, "5"}, term_to_binary(node())).

This should help narrow down the problem.

Related: you really shouldn't be using pid() or self() as part of CRDT updates without converting it to a list or binary first. Process identifiers are special memory references that appear different depending on which node you look at them from (and they are manipulated by distributed Erlang on the wire, causing nodes to be connected/disconnected). For example, <0.156.0> on node A will appear as something like <1.156.0> on node B; the leading zero indicates that a process is local rather than remote, so process identifiers are relative references (with respect to the node), not absolute ones.
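
A quick shell illustration of that relativity, and of a node-independent alternative (the exact remote pid numbers will vary):

```erlang
%% On node a@127.0.0.1:
Pid = self().
%% => <0.156.0>   (leading 0 = local to the observing node)

%% On node b@127.0.0.1, the same process received over disterl prints
%% with a nonzero first component, e.g. <1.156.0>: the textual form
%% depends on where you look at it from.

%% A node-independent actor that reads the same everywhere:
Actor = term_to_binary(node()).
```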

darkyne (Author) commented Jan 2, 2021

Oh. That surprises me, because using self() in the CRDT update line is precisely what's written in the README example I followed.
I will try term_to_binary(node()) and check whether it changes anything.

cmeiklejohn (Member) commented:

Yes, well. The README hasn't been updated in about 4 years, while the software has, and we've learned a lot since then.

darkyne (Author) commented Jan 2, 2021

Wow. The process size is no longer increasing, and the Record content is much cleaner!
Well, I got tricked by the README... I spent hours not understanding what was causing this, haha.
Thanks for the easy resolution; it was not obvious to me.

darkyne closed this as completed on Jan 2, 2021
darkyne (Author) commented Jan 2, 2021

I suppose using term_to_binary(node()) instead of self() in the README would easily close this :)

cmeiklejohn (Member) commented:

#312 addresses the README update.

cmeiklejohn (Member) commented:

#313 addresses prevention of non-iolist values as actors, to ensure proper node-independent serialization.
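
The check presumably amounts to a guard along these lines (a hypothetical sketch, not the actual #313 patch):

```erlang
%% Hypothetical validation, not the actual #313 code: accept only
%% binaries and (io)lists as actors, since pids, refs, etc. do not
%% serialize in a node-independent way.
valid_actor(Actor) when is_binary(Actor) -> true;
valid_actor(Actor) when is_list(Actor)   -> true;
valid_actor(_Other)                      -> false.
```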
