Memory leak #1531
This might actually have been the culprit of the OOM crashes that we were seeing the other day. |
Do we have any idea of a workload that can reproduce this? So far I haven't been able to. |
FYI, we are seeing what we believe to be this issue (or something similar) on PipelineDB v0.9.5 on Ubuntu Linux:
We see this regularly and it continually leaks memory until the whole process gets killed every few days. |
@sat can you share your continuous view definitions here or in a private Gitter channel? |
@derekjn let me know if you want me to provide more context.

```sql
CREATE CONTINUOUS VIEW ml_score_signal_stw_view AS
SELECT coalesce(g.group_member_id, s.group_id) AS group_id,
       signal_type,
       avg(score) AS score_avg,
       min(score) AS min_score,
       max(score) AS max_score,
       percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY score) AS percentiles,
       max(sampled_at) AS last_sampled_at,
       count(*) AS signals_count
FROM ml_score_signal_stream s
-- TODO: Add sender_id to groups and use name as human reference
LEFT JOIN groups_flattened g ON g.group_id = s.group_id
WHERE (sampled_at >= clock_timestamp() - interval '1 minute')
GROUP BY coalesce(g.group_member_id, s.group_id), signal_type;

CREATE CONTINUOUS VIEW ml_score_signal_latest_view AS
SELECT group_id,
       keyed_max(last_sampled_at, score_avg) AS score_avg,
       keyed_max(last_sampled_at, min_score) AS min_score,
       keyed_max(last_sampled_at, max_score) AS max_score,
       keyed_max(last_sampled_at, percentile_25) AS percentile_25,
       keyed_max(last_sampled_at, percentile_50) AS percentile_50,
       keyed_max(last_sampled_at, percentile_75) AS percentile_75,
       max(last_sampled_at) AS last_sampled_at,
       keyed_max(last_sampled_at, signals_count) AS signals_count
FROM ml_score_signal_latest_stream
GROUP BY group_id;

CREATE CONTINUOUS VIEW ml_score_signal_ttw_view AS
SELECT coalesce(g.group_member_id, s.group_id) AS group_id,
       minute(sampled_at) AS minute,
       signal_type,
       avg(score) AS score_avg,
       min(score) AS min_score,
       max(score) AS max_score,
       percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY score) AS percentiles,
       max(sampled_at) AS last_sampled_at,
       count(*) AS signals_count
FROM ml_score_signal_stream s
LEFT JOIN groups_flattened g ON g.group_id = s.group_id
WHERE sampled_at >= clock_timestamp() - interval '1 hour'
GROUP BY minute, coalesce(g.group_member_id, s.group_id), signal_type;

CREATE CONTINUOUS VIEW health_status_signal_group_latest_view AS
SELECT group_id,
       keyed_max(sampled_at, score) AS score,
       max(sampled_at) AS last_sampled_at
FROM health_status_signal_stream
GROUP BY group_id;

CREATE CONTINUOUS VIEW system_status_signal_group_latest_view AS
SELECT group_id,
       keyed_max(sampled_at, score) AS score,
       avg(score) AS avg_score,
       max(sampled_at) AS last_sampled_at
FROM system_status_signal_stream
GROUP BY group_id;
```
|
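A note for readers following the view definitions above: keyed_max(key, value) is PipelineDB's aggregate returning the value associated with the largest key, which is how these views track the most recent reading per group. A toy illustration with made-up data, assuming keyed_max is callable in a regular query as PipelineDB's aggregates generally are:

```sql
-- keyed_max returns the value paired with the largest key; here, the
-- score from the most recent sampled_at. The data below is made up.
SELECT keyed_max(sampled_at, score) AS latest_score
FROM (VALUES
        ('2016-01-01 00:00'::timestamp, 10),
        ('2016-01-01 00:05'::timestamp, 42)
     ) AS t(sampled_at, score);
-- expected result: 42 (the score from the latest sampled_at)
```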
Thanks @sat! One more thing that would be helpful is a
|
@derekjn here you go; if it makes it easier I can give you an SQL dump. |
@sat after doing some initial investigation, a couple more things would be helpful:
|
@derekjn would a full schema / data dump be helpful? |
Sure, wouldn't hurt. Thanks! |
@derekjn my colleague @dominicpacquing will send it to you via PM in Gitter. |
Hi @sat, I've run quite a few tests against the dump you guys sent us, and haven't been able to reproduce any unexpected behavior. And we're running these tests on CVs that are several orders of magnitude larger than the sizes you previously indicated. A couple of notes:
|
@derekjn ok, at the moment we are using monit to terminate PipelineDB at 80% system memory usage, and it then re-exhibits the same slow leak. We can try TTLs for the tumbling windows, but unfortunately the sliding windows are needed. The other factor that may not have been reproduced on your end is our use of the pipeline_kinesis extension. |
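For reference, a minimal sketch of the TTL idea mentioned above, assuming a PipelineDB build that supports the ttl/ttl_column storage parameters (availability on the 0.9.x line may differ); the view name is hypothetical:

```sql
-- Hypothetical variant of the hour-window view with an explicit TTL.
-- Rows whose "minute" value is older than 2 hours become eligible for
-- reaping instead of accumulating in the CV's backing table.
CREATE CONTINUOUS VIEW ml_score_signal_ttw_view_ttl
WITH (ttl = '2 hours', ttl_column = 'minute') AS
SELECT minute(sampled_at) AS minute,
       signal_type,
       avg(score) AS score_avg,
       count(*) AS signals_count
FROM ml_score_signal_stream
GROUP BY minute, signal_type;
```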
@sat gotcha. If the data that sliding-window CVs cache for writing to output streams can't account for the memory consumption, then this is still a bug and we'll get to the bottom of it. A couple more questions as we home in on the issue:
|
@derekjn The number of rows: potentially hundreds to thousands, depending on the join table (groups_flattened). At the moment we are only running it against something quite small (10 rows).

That join, which feeds the GROUP BY, may be duplicating a lot of records, since it essentially maintains aggregates over a recursive grouping structure. Like a graph: a leaf group also belongs to the group that contains it, then the group that contains that group, all the way up to the super group that contains all groups. To maintain aggregates for all groups, each data point would need to be duplicated into all of its groups' aggregate rows. I can elaborate on this if it's not clear.

The increasing memory consumption can be isolated to the autovacuum launcher process. Is this what you mean? We will try the TTL on the views and report back as well. |
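To make the duplication @sat describes concrete: if groups_flattened is a closure table over a parent/child hierarchy, it could be derived as below. The groups table and parent_group_id column are assumptions, since the actual schema isn't shown in the thread. Each stream event then joins to one row per containing group, so a single data point is aggregated once per ancestor.

```sql
-- Hypothetical: build a (group_member_id, group_id) closure table from
-- an assumed parent/child "groups" table. Every group contains itself;
-- the recursive step walks up to each ancestor, so a leaf appears once
-- per level of the hierarchy above it.
WITH RECURSIVE flattened AS (
    SELECT group_id AS group_member_id, group_id
    FROM groups
  UNION ALL
    SELECT p.parent_group_id AS group_member_id, f.group_id
    FROM flattened f
    JOIN groups p ON p.group_id = f.group_member_id
    WHERE p.parent_group_id IS NOT NULL
)
SELECT group_member_id, group_id FROM flattened;
```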
@sat are you able to run PipelineDB under valgrind? This will create a file for each pid launched by the main PipelineDB process. |
@derekjn sure, will do that now |
@sat were you guys able to take a memory dump with valgrind? |
@derekjn I saw that comment, and it must be someone with a related issue. Will take the memory dump today. Sorry for the delay. |
@derekjn does this help? |
@derekjn, ok, so am I safe to run HEAD of master? The leak is slightly quicker when the system is under load (nothing was being ingested when I ran this). |
Yes, we'll also backport this to the latest 0.9.6 release. I'll let you know here when the releases are published. |
@derekjn ok thanks |
@sat the 0.9.6 releases containing the backported fix have been published. |
@derekjn, I used the updated 0.9.6 release and it still looks like I'm getting the same issue. |
So, right when I mentioned that the 0.9.6 releases were updated, the nightly build probably hadn't been updated yet. Can you run this?
|
Hmm, that revision should definitely contain the fix. The cause of the leak was very obvious once we got the dump, but I'll double-check that there isn't anything else. Also note that if you downloaded the nightly binary last night after the fix went out, it would not have contained the fix. Only the 0.9.6 release was patched; is it possible that you were using the nightly version? |
@derekjn I built it based on the following: https://www.pipelinedb.com/download/0.9.6/ubuntu14 |
Ok, I'm double-checking the fix from last night; will report back shortly. |
@sat I haven't been able to repro the issue on the latest release binaries. What is the y-axis on the chart you attached, GB? If so, it looks like an increase of less than 1 GB over an 8-hour period, which could easily be legitimate and not a leak. Or have you been able to isolate it to the autovacuum process? |
@derekjn you can use the same dump. The y-axis is %. I can leave it running for the weekend; here is today only: I can run valgrind against it again if it runs out of memory over the weekend. There is nothing being ingested into the views at present, so I'm not sure what could account for the increased memory usage. |
Hmm, what exactly do you mean by use the same dump? If there's less than a 1% increase in memory usage over a long period of time, then there's nothing that would indicate a problem anywhere. Various system caches, shared buffers, etc. can easily accumulate small amounts of memory over time. I'll keep looking to see if there's a small leak anywhere, but it doesn't seem like it. If nothing is being ingested, then no part of PipelineDB is really even running, so this would be happening at the PostgreSQL level. |
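As context for the point about caches and shared buffers: the settings that bound this steady-state memory are stock PostgreSQL configuration, which PipelineDB inherits, and they can be inspected with a catalog query such as the following:

```sql
-- Inspect the settings that bound the "legitimate" steady-state memory
-- mentioned above (shared buffers, per-operation work memory, etc.).
-- pg_settings is a stock PostgreSQL catalog view.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem',
               'maintenance_work_mem', 'autovacuum_work_mem');
```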
@derekjn the same dump my colleague gave you on Gitter. Yes, I think it's fine. Had a look at the memory consumption of PipelineDB and it's below 9%, and the autovacuum process is now well down the list of top memory consumers. Interestingly, the awslogs Python agent that is providing the statistics for those graphs (in CloudWatch) is using quite a bit of memory :(. Will monitor over the weekend and see. |
@derekjn all good |