
Port isolation from old TSDB PR #6841

Merged 2 commits into master from beorn7/isolation on Feb 28, 2020

Conversation

@beorn7 (Member) commented Feb 18, 2020

The original PR was prometheus-junkyard/tsdb#306 .

I tried to carefully adjust to the new world order, but please give this a very careful review, especially around iterator reuse (marked with a TODO).

On the bright side, I definitely found and fixed a bug in txRing.

@beorn7 (Member, Author) commented Feb 18, 2020

Fixes #1893.

@beorn7 (Member, Author) commented Feb 18, 2020

Note: The recently merged #6777 didn't trigger any formal merge conflicts, but it still breaks the tests here. I'll rebase at my next convenience. As far as I can see, this should not affect anything but tests, so please go ahead with the review.

@brian-brazil (Contributor) left a comment

Great to see this finally back on the table.

(Review comments on tsdb/isolation.go, tsdb/db_test.go, and tsdb/head.go — all marked resolved)
@roidelapluie (Member) commented

Just dropping a note that I would like, once this is ready for merge, to prombench this against master.

@brian-brazil (Contributor) commented

We should prombench as soon as it compiles. I wonder how my memory estimates from 3 years ago hold up.

@roidelapluie (Member) commented Feb 18, 2020

Is this documentation-worthy? Perhaps in tsdb/docs?

@beorn7 (Member, Author) commented Feb 18, 2020

/benchmark master

@roidelapluie (Member) commented

> /benchmark master

/prombench master

@prombot (Contributor) commented Feb 18, 2020

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-6841 and master

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.12.0

@beorn7 (Member, Author) commented Feb 19, 2020

Something locks up, and then the instances OOM.
Debugging now.
@gouthamve your eyes on this will be much appreciated.

@beorn7 (Member, Author) commented Feb 19, 2020

/prombench cancel

@prombot (Contributor) commented Feb 19, 2020

Benchmark cancel is in progress.

@roidelapluie (Member) commented

This pull request has been worked on by multiple people over the last few years. That alone shows both how much this feature is needed and how complex the issue is.

I want to drop a note here: as this is an important and risky change, and as release shepherd, this pull request is under my watch. If we do not settle on it (merge it) before the 5th of March, I will veto it for the 2.17 release.

That said, we are still two weeks away from that date, and the Prometheus and TSDB maintainers can merge it in the meantime without my intervention.

@krasi-georgiev (Contributor) commented

@beorn7 Prombench has Loki logs, so maybe those will help with the debugging.
I had a quick look and can't find the logs though; maybe @geekodour can point us to where the logs for this crashing Prometheus are.

@geekodour (Member) commented

@beorn7 (Member, Author) commented Feb 20, 2020

I had a look at the logs; that's where my conclusions are coming from. My next goal is to reproduce the crash locally so that I don't have to run prombench for eight hours. So far no luck.
It would be great if @gouthamve could chime in, as he probably has the most intimate understanding of how this code works.

@beorn7 (Member, Author) commented Feb 20, 2020

@roidelapluie yes sure. It's your call when the time has come. Currently, we don't even know if this will work at all.

@beorn7 (Member, Author) commented Feb 20, 2020

Looking at this dashboard, the hypothesis is the following:

For some reason, the low watermark got stuck around 02:45 UTC. The first crash happened a while after the head truncation at 05:00, which apparently ran into trouble (head chunks didn't drop). Things didn't look too bad after that, but then the server went into a tight crash loop.

I'll focus my investigation on possible reasons why the low watermark didn't get updated. (Wild guess: a scrape got canceled halfway through.)

Still, hints from people who know this better than I do are highly appreciated.
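
To make the low-watermark mechanics concrete, here is a minimal sketch (not the actual Prometheus code; the type, field, and method names are assumptions) of how isolation might track open append transactions and why an append that never commits or rolls back pins the low watermark:

```go
package main

import (
	"fmt"
	"sync"
)

// isolationState is a hypothetical tracker of in-flight append transactions.
// The real tsdb/isolation.go is structured differently; a map is enough here
// to illustrate why a stuck append pins the low watermark.
type isolationState struct {
	mtx          sync.Mutex
	lastAppendID uint64
	open         map[uint64]struct{} // appendIDs that have neither committed nor rolled back
}

func newIsolationState() *isolationState {
	return &isolationState{open: map[uint64]struct{}{}}
}

// newAppendID hands out the next appendID and marks it as open.
func (i *isolationState) newAppendID() uint64 {
	i.mtx.Lock()
	defer i.mtx.Unlock()
	i.lastAppendID++
	i.open[i.lastAppendID] = struct{}{}
	return i.lastAppendID
}

// closeAppend is called on Commit or Rollback.
func (i *isolationState) closeAppend(id uint64) {
	i.mtx.Lock()
	defer i.mtx.Unlock()
	delete(i.open, id)
}

// lowWatermark returns the smallest appendID that is still open, or the last
// assigned ID if nothing is open. A scrape that gets canceled halfway through
// (neither committing nor rolling back) keeps this value stuck, so old
// appendIDs can never be cleaned up.
func (i *isolationState) lowWatermark() uint64 {
	i.mtx.Lock()
	defer i.mtx.Unlock()
	if len(i.open) == 0 {
		return i.lastAppendID
	}
	var min uint64
	for id := range i.open {
		if min == 0 || id < min {
			min = id
		}
	}
	return min
}

func main() {
	iso := newIsolationState()
	stuck := iso.newAppendID() // simulate a scrape that never finishes
	for n := 0; n < 3; n++ {
		id := iso.newAppendID()
		iso.closeAppend(id) // normal commits
	}
	fmt.Println(iso.lowWatermark() == stuck) // true: the low watermark is pinned
}
```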

@beorn7 (Member, Author) commented Feb 20, 2020

/prombench master

@prombot (Contributor) commented Feb 20, 2020

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-6841 and master

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.12.0

@beorn7 (Member, Author) commented Feb 21, 2020

I removed the downsizing code. Prombench has now run for ~20 hours without issues. RAM usage has increased a bit, but not more than before, when the downsizing code was still included.

Maybe the downsizing code has a bug, which is plausible, but not proven yet.
Or we were just lucky in the current Prombench run.

Today I'm busy with other things, so I'll let Prombench run for a few days.

@prombot (Contributor) commented Feb 23, 2020

Benchmark tests are running for 3 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@bwplotka (Member) left a comment

Oh my. I think I see it, but...

> Making ingestion of all the samples from a single scrape atomic would solve the problem

Do we know how much overhead this simple solution would add? Would it really be too slow? Do we have data?

It looks good. I reviewed most of it, but I still can't wrap my head around the writeID and isolation logic; I will continue the review later. Also, can we remove/clarify the commented-out code? It does not help in review 😄

Some suggestions for now.

(Review comments on tsdb/db_test.go, tsdb/head.go, and tsdb/head_test.go — all marked resolved)
@beorn7 (Member, Author) commented Feb 27, 2020

/prombench cancel

@prombot (Contributor) commented Feb 27, 2020

Benchmark cancel is in progress.

@beorn7 (Member, Author) commented Feb 27, 2020

Benchmark looks OK, but we have to benchmark the new changes anyway. So canceling the current run.

@beorn7 (Member, Author) commented Feb 27, 2020

> Don't we need (*txRing) cleanupAppendIDsBelow when rolling back?

We do. I have added it, and amended a test so that it now exposes the bug.
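
For context, here is a minimal sketch of a txRing-style ring buffer and its cleanupAppendIDsBelow method (this is not the actual tsdb/isolation.go code; the layout and the add helper are assumptions). The point of the fix is that Rollback has to prune the ring just like Commit does:

```go
package main

import "fmt"

// txRing sketches a ring buffer of appendIDs kept per series.
type txRing struct {
	txIDs     []uint64 // ring buffer of appendIDs, one per pending sample
	txIDFirst int      // index of the oldest appendID in the ring
	txIDCount int      // number of appendIDs currently in the ring
}

// add appends a new appendID, growing the ring when it is full.
func (r *txRing) add(appendID uint64) {
	if r.txIDCount == len(r.txIDs) {
		// Grow and unroll the ring so the oldest entry lands at index 0.
		newRing := make([]uint64, r.txIDCount*2+4)
		for i := 0; i < r.txIDCount; i++ {
			newRing[i] = r.txIDs[(r.txIDFirst+i)%len(r.txIDs)]
		}
		r.txIDs = newRing
		r.txIDFirst = 0
	}
	r.txIDs[(r.txIDFirst+r.txIDCount)%len(r.txIDs)] = appendID
	r.txIDCount++
}

// cleanupAppendIDsBelow drops all appendIDs strictly below bound from the
// front of the ring. Commit already prunes the ring this way; the fix
// discussed above is that Rollback must do the same, otherwise rolled-back
// appendIDs linger in the ring.
func (r *txRing) cleanupAppendIDsBelow(bound uint64) {
	pos := r.txIDFirst
	for r.txIDCount > 0 {
		if r.txIDs[pos] >= bound {
			break
		}
		pos = (pos + 1) % len(r.txIDs)
		r.txIDCount--
	}
	r.txIDFirst = pos
}

func main() {
	var r txRing
	for id := uint64(1); id <= 5; id++ {
		r.add(id)
	}
	r.cleanupAppendIDsBelow(4)
	fmt.Println(r.txIDCount) // 2: only appendIDs 4 and 5 remain
}
```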

(Review comment on tsdb/head_test.go — marked resolved)
@beorn7 (Member, Author) commented Feb 28, 2020

/prombench master

@prombot (Contributor) commented Feb 28, 2020

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-6841 and master

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.12.0

@beorn7 (Member, Author) commented Feb 28, 2020

Prombench results:

  • 3% increase in CPU usage.
  • Avg query latencies increased 4% for query, 19% for query_range.
  • Avg go_memstats_alloc_bytes increased by 19%.

This is now much higher than before, but I believe that's because, with the buggy "clean up everything while no reads are in progress" approach we had before, we weren't really doing the full isolation work.

@beorn7 (Member, Author) commented Feb 28, 2020

I think those values are still within expectations, based on what @brian-brazil said above.

This will still, I guess, be noticed painfully by users with high-load tight-resource setups.

beorn7 and others added 2 commits February 28, 2020 14:17
Signed-off-by: beorn7 <beorn@grafana.com>
This has been ported from prometheus-junkyard/tsdb#306.

Original implementation by @brian-brazil, explained in detail in the
2nd half of this talk:
https://promcon.io/2017-munich/talks/staleness-in-prometheus-2-0/

The implementation was then processed by @gouthamve into the PR linked
above. Relevant slide deck:
https://docs.google.com/presentation/d/1-ICg7PEmDHYcITykD2SR2xwg56Tzf4gr8zfz1OerY5Y/edit?usp=drivesdk

Signed-off-by: beorn7 <beorn@grafana.com>
Co-authored-by: Brian Brazil <brian.brazil@robustperception.io>
Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com>
@beorn7 (Member, Author) commented Feb 28, 2020

Rebased and squashed.

Last call for objections, otherwise I'll merge in about an hour or so…

@beorn7 (Member, Author) commented Feb 28, 2020

/prombench cancel

@prombot (Contributor) commented Feb 28, 2020

Benchmark cancel is in progress.

beorn7 merged commit d137cdd into master on Feb 28, 2020
beorn7 deleted the beorn7/isolation branch on February 28, 2020 at 16:48
@roidelapluie (Member) commented

Hooray 🎉🎉🎉🎉

@roidelapluie (Member) commented

@beorn7 @brian-brazil can we agree on a sentence to put at the beginning of the release notes about this? Especially something about the memory increase.

@beorn7 (Member, Author) commented Feb 28, 2020

I'm also running this now on a moderately loaded production server for comparison (500k series, 30k samples/s, very low query volume). The increase in RAM consumption is even more pronounced here: The peak value of go_memstats_heap_alloc_bytes jumps from 4.4GB to 7.2GB. container_memory_working_set_bytes goes from 8.2GB to 11.3GB.

I'll do a bit of heap analysis at my next convenience to see if there is any low-hanging fruit.

@beorn7 (Member, Author) commented Feb 28, 2020

This should definitely come with a big warning in the release notes. It will be a hard sell anyway, given that only very few users will ever have noticed the problems with isolation. I'd also keep reverting this on the agenda. We need to give it some thought…

@beorn7 (Member, Author) commented Feb 28, 2020

Peak container_memory_working_set_bytes (just before head truncation) is 11.7GB vs. 8.6GB, i.e. 36% increase.

@beorn7 (Member, Author) commented Feb 29, 2020

I think I have found the bug: When we replay the WAL, we dutifully append every sample with appendID 0. Those are all cleaned up after the next commit, but the ring buffer has then already grown to accommodate all samples in the WAL, and it will never shrink again.

The solution should be easy: appendID==0 should never be recorded. PR in preparation…
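
A minimal sketch of the kind of guard this describes (the types, field names, and method shape are assumptions for illustration, not the actual fix):

```go
package main

import "fmt"

// txRing is reduced to a bare slice here; only the growth behavior matters.
type txRing struct{ ids []uint64 }

func (r *txRing) add(id uint64) { r.ids = append(r.ids, id) }

// memSeries stands in for a series in the head; the txs field is an assumption.
type memSeries struct{ txs txRing }

// append sketches the guard: appendID 0 means "no isolation tracking", which
// is what WAL replay uses. Recording it would grow the ring to cover every
// replayed sample, and the ring never shrinks again afterwards.
func (s *memSeries) append(t int64, v float64, appendID uint64) {
	// ... the sample itself would be written to the head chunk here ...
	if appendID > 0 {
		s.txs.add(appendID)
	}
}

func main() {
	var s memSeries
	// Simulate WAL replay: many samples, all appended with appendID 0.
	for i := 0; i < 1000; i++ {
		s.append(int64(i), float64(i), 0)
	}
	fmt.Println(len(s.txs.ids)) // 0: nothing recorded, the ring stays empty
}
```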

@roidelapluie (Member) commented

> I think I have found the bug: When we replay the WAL, we dutifully append every sample with appendID 0. Those are all cleaned up after the next commit, but the ring buffer has then already grown to accommodate all samples in the WAL, and it will never shrink again.
>
> The solution should be easy: appendID==0 should never be recorded. PR in preparation…

I still think it might be worthwhile to have the downsizing code again.

@beorn7 (Member, Author) commented Feb 29, 2020

> I still think it might be worthwhile to have the downsizing code again.

Let's see how it turns out in practice. Downsizing could easily lead to oscillating behavior, which would be worse than no downsizing (spikier memory usage, more allocations, slower appends).

@bobrik (Contributor) commented Mar 23, 2020

I think this could fix #4580 as well.
