This repository has been archived by the owner on Sep 28, 2021. It is now read-only.

Memory leak? #310

Closed

darkyne opened this issue Dec 21, 2020 · 10 comments

Comments


darkyne commented Dec 21, 2020

When running a cluster of nodes locally for a while, they eventually crash after running out of memory, and my computer freezes.
At first this occurred after about 30 minutes with 5 nodes running in a cluster sharing an awset CRDT.
Then I tried reducing the state_interval (to 1000ms instead of 10000ms, for example) and noticed the memory problem shows up much sooner.

I noticed the culprit was the lasp_ets_storage_backend process, since it was quickly growing to hundreds of MB.
To figure out what it was storing, I printed every record it writes to the ets table, just to see what was going on...
What it stores looks very strange to me.

I run a small local cluster of 5 nodes with an awset that is updated slowly (the value is updated every 10 seconds, and state_interval is also 10000ms). As you can see (I simply printed the record when lasp_ets_storage_backend tries to store it in ets), it stores the state_awset with extremely redundant partisan_remote_reference information: there are only 5 nodes, but their references are repeated many times.

What you can see in the image is a single record being put into the ets table. I don't understand why one record would contain such redundant information.

Any idea what's going on, or is it just a bug?
[screenshot: lasp_ets_storage_backend record]

cmeiklejohn (Member) commented:

First of all, if you send a pid between nodes, it's converted to a partisan_remote_reference, because pids can't be sent between nodes that aren't connected with distributed Erlang; they have to be converted to a format that works without disterl. That's what that represents.

Second, without seeing the code that's inserting the values into the set, I can't tell whether this is a bug. It could be that you're using the awset incorrectly.
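
Roughly, the conversion amounts to something like this (a sketch, not Partisan's actual code; the record shape below only approximates the real partisan_remote_reference representation):

```erlang
%% Approximation only: Partisan's real representation differs,
%% but the idea is the same. A pid is node-relative, so it is
%% wrapped with its owning node and a serialized copy before
%% being shipped over a non-disterl channel.
-module(pid_wrap_sketch).
-export([wrap/1]).

wrap(Pid) when is_pid(Pid) ->
    {partisan_remote_reference, node(Pid), term_to_binary(Pid)}.
```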

darkyne (Author) commented Jan 2, 2021

Thanks for looking into this.
Like you, my first thought was that the issue was in my own code.
That's why I cloned a fresh copy of Lasp from this repo, with no modifications at all, to check whether the same behaviour shows up.

I simply followed the instructions from the README:
clone
make
open two terminals
rebar3 shell --name a@127.0.0.1 in the first terminal
rebar3 shell --name b@127.0.0.1 in the second terminal

From node a:
lasp_peer_service:join('b@127.0.0.1').
lasp:update({<<"test">>, state_awset}, {add, "5"}, self()).
(simply putting a dummy value into a set so that the node will send its state; if I'm not mistaken, update declares the state_awset implicitly, so there's no need to declare it explicitly).
erlang:process_info(whereis(lasp_ets_storage_backend), memory). run manually from time to time to watch the process size.
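
To watch that without typing the call by hand each time, a throwaway poller like this works from the rebar3 shell (my own snippet, not part of Lasp):

```erlang
%% Ad-hoc poller: print the storage backend's memory footprint
%% every 5 seconds. Kill the spawned process to stop it.
Poll = fun Loop() ->
           {memory, Bytes} =
               erlang:process_info(whereis(lasp_ets_storage_backend), memory),
           io:format("lasp_ets_storage_backend: ~p bytes~n", [Bytes]),
           timer:sleep(5000),
           Loop()
       end,
spawn(Poll).
```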

Result:
Growth is very slow, but over time the size keeps increasing. For example, it started around 140 kB and was around 500-600 kB after 10 minutes of running (with exactly the 2-node cluster just described).
Previously (not on the freshly cloned version), I had made some small modifications that made memory usage grow much faster; but the fact that it grows even on the clean, unmodified version seems like a potential issue to me.

As a remark:
If I create a bigger cluster (say 5 nodes instead of 2), memory usage grows faster.
If I decrease the state_interval (the default was 10000ms) to, say, 1000ms, memory usage grows a lot faster.

By simply adding a small print in lasp_ets_storage_backend to see what it was recording in the ets table (which makes the terminal output ugly but does not modify any behaviour), I see something similar to the screenshot above (that screenshot was from a cluster of 5 nodes), where some references are repeated many times. I don't know why it does this, but exact redundancy inside a single record seems strange to me.

I will post a screenshot of that little print from the freshly cloned version to see whether it's exactly the same as before (printing the Record inside the do_put function, just before ets:insert).
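
For reference, the trace amounts to something like this (a sketch from memory; the real do_put in lasp_ets_storage_backend is shaped differently, and Ref here stands for the ets table identifier):

```erlang
%% Sketch of the modification: print the record just before inserting it.
do_put(Id, Value, Ref) ->
    Record = {Id, Value},
    io:format("storing record: ~p~n", [Record]),  %% the added trace
    true = ets:insert(Ref, Record),
    ok.
```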

I might be wrong in my approach; if you believe that's the case, please let me know.

darkyne (Author) commented Jan 2, 2021

Here is the output from the io:format call printing what is put into the ets table.
This is from exactly the 2-node local cluster described above; node b is the source of the initial update (putting element "5" into the awset). Over time, the record grows, with more and more references to node b, slowly making the process memory size increase.

As explained, this is from the freshly cloned version of this repo with no modifications at all (other than the little print added in lasp_ets_storage_backend).

[screenshot: the record printed from do_put]

After waiting a little while, it looks like this (clearly showing the strange redundancy):
[screenshot: the same record later, with repeated references]

cmeiklejohn (Member) commented Jan 2, 2021

I want you to try something else.

Instead of:

lasp:update({<<"test">>, state_awset}, {add, "5"}, self()).

try:

lasp:update({<<"test">>, state_awset}, {add, "5"}, term_to_binary(node())).

This should help narrow down the problem.

Related: you really shouldn't be using pid() or self() as part of CRDT updates without converting it to a list or binary first. Process identifiers are special memory references that appear different depending on which node you look at them from (and they are manipulated by distributed Erlang on the wire, causing nodes to be connected/disconnected). For example, <0.156.0> on node A will appear as something like <1.156.0> on node B; the leading zero indicates that a process is local rather than remote, so process identifiers are relative references (with respect to the node), not absolute ones.
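
A quick shell illustration of that relativity, and of a node-independent alternative (the exact remote pid numbers will vary):

```erlang
%% On node a@127.0.0.1:
Pid = self().
%% => <0.156.0>   (leading 0 = local to the observing node)

%% On node b@127.0.0.1, the same process received over disterl prints
%% with a nonzero first component, e.g. <1.156.0>: the textual form
%% depends on where you look at it from.

%% A node-independent actor that reads the same everywhere:
Actor = term_to_binary(node()).
```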

darkyne (Author) commented Jan 2, 2021

Oh. That surprises me, because using self() in the CRDT update line is precisely what's written in the README example I followed.
I will try term_to_binary(node()) and check whether it changes anything.

cmeiklejohn (Member) commented:

Yes, well. The README hasn't been updated in about 4 years, while the software has, and we've learned a lot since then.

darkyne (Author) commented Jan 2, 2021

Wow. The process size is no longer increasing, and the Record content is much cleaner!
Well, I got tricked by the README... I spent hours not understanding what was causing this, haha.
Thanks for the easy resolution; it was not obvious to me.

darkyne closed this as completed on Jan 2, 2021
darkyne (Author) commented Jan 2, 2021

I suppose using term_to_binary(node()) instead of self() in the README would easily close this :)

cmeiklejohn (Member) commented:

#312 addresses the README update.

cmeiklejohn (Member) commented:

#313 addresses prevention of non-iolist values as actors, to ensure proper node-independent serialization.
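
The check presumably amounts to a guard along these lines (a hypothetical sketch, not the actual #313 patch):

```erlang
%% Hypothetical validation, not the actual #313 code: accept only
%% binaries and (io)lists as actors, since pids, refs, etc. do not
%% serialize in a node-independent way.
valid_actor(Actor) when is_binary(Actor) -> true;
valid_actor(Actor) when is_list(Actor)   -> true;
valid_actor(_Other)                      -> false.
```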
