Write and compaction stability #8384

Merged
merged 7 commits into master from jw-write-values on May 12, 2017

Conversation

@jwilder jwilder commented May 11, 2017

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

These are some changes to improve stability for writes and compactions. #8302 improved write throughput by ~2x. As a result, snapshot compactions are written much more frequently, which causes TSM files to accumulate much faster. The current compaction planning attempts to kick off as many compactions as possible and lets the Go runtime manage them.

When there are many shards, this can cause problems: many full or expensive compactions running at once spike CPU and use a lot of disk space. #8348 added a config option to limit the global number of running level and full compactions across shards, but an individual level could still kick off more compactions than the limit. This PR limits concurrent compactions within a level and reduces some allocations to improve stability and reduce jitter.

I ran a 2B point write load w/ 50 concurrent writers against 1.2.3, master, and this PR. Total series is 100k w/ 1 value. There was no query load. This is on a c4.8xlarge. I also ran a larger 20B point run against master and this PR to see when many level/full compactions within a shard would be kicked off and what the impact was.

The main changes in this PR:

  • Convert Point.Name to return []byte instead of string. We were converting from []byte to string and back to []byte in many places, which generated a lot of garbage.
  • Switch the TSM WAL buffer pool back to a managed pool from sync.Pool. The sync.Pool wasn't able to hold onto the allocations, so we were allocating many []byte over and over. The replacement pool is bounded and drops buffers that are too big.
  • Limit concurrent compactions within a shard. Instead of kicking off as many goroutines as there are compaction groups, cap it at 4 (determined empirically to work well).
  • Rework key parsing so it does not allocate an intermediate tag map that we then ignore (a simplified sketch follows this list).
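
To make the last point (and the []byte conversion point) concrete, here is a simplified sketch of scanning a series key of the form measurement,tag1=v1,tag2=v2 and handing out sub-slices instead of materializing a map. The names are hypothetical and line-protocol escaping is ignored; this is illustrative only, not the actual models package code.

```go
package main

import (
	"bytes"
	"fmt"
)

// walkKey calls fn for each tag key/value pair without allocating a map.
// All returned slices alias the input key, so nothing is copied.
func walkKey(key []byte, fn func(k, v []byte)) (measurement []byte) {
	i := bytes.IndexByte(key, ',')
	if i < 0 {
		return key // no tags: the whole key is the measurement
	}
	measurement, key = key[:i], key[i+1:]
	for len(key) > 0 {
		pair := key
		if j := bytes.IndexByte(key, ','); j >= 0 {
			pair, key = key[:j], key[j+1:]
		} else {
			key = nil
		}
		if eq := bytes.IndexByte(pair, '='); eq >= 0 {
			fn(pair[:eq], pair[eq+1:])
		}
	}
	return measurement
}

func main() {
	m := walkKey([]byte("cpu,host=server01,region=us-west"), func(k, v []byte) {
		fmt.Printf("%s=%s\n", k, v)
	})
	fmt.Printf("measurement=%s\n", m) // measurement=cpu
}
```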

Writes

This is HTTP write throughput and latency. 1.2.3 sustains about 1.1M w/s with latency varying around 210ms-320ms. master and this PR sustain around 2.3M w/s. master latency fluctuates around 100-110ms, and with this change it drops to 90-100ms. In the master run, you can see the throughput jump around quite a bit and slowly trend downward. On larger test runs (20B), this variance can be more dramatic. The variance appears to be due to CPU starvation caused by lots of allocations and compactions competing for CPU.

2B Point Run

[Graph: HTTP write throughput and latency for 1.2.3, master, and this PR (2B point run)]

20B Point Run

This larger test run shows what happens to writes when too many level or full compactions kick in at once within a shard.

master

[Screenshot (2017-05-11, 3:02 PM): write throughput on master, 20B point run]

In the middle of the graph, the write throughput fluctuations are correlated with about 21 concurrent compactions running (14 full and 7 level 2).

[Screenshot (2017-05-11, 3:07 PM): concurrent compactions on master, 20B point run]

This is the same write load against this PR.

[Screenshot (2017-05-11, 3:05 PM): write throughput with this PR, 20B point run]

Compactions are limited in this run and larger variations are gone.

[Screenshot (2017-05-11, 3:15 PM): concurrent compactions with this PR, 20B point run]

Compactions

Watching the TSM file count shows that master compactions are not keeping up with the increased write load and are slowly backing up. This is a 2B point run.

[Graph: TSM file count over time, master vs this PR (2B point run)]

The current compaction planning is also spinning up a lot of compactions that are competing with each other and slowing progress.

[Graph: number of running compactions, master vs this PR (2B point run)]

With this change, file count stays more steady with the increased write load and the number of concurrent compactions stays more consistent.

While I was not measuring it directly, I noticed CPU utilization and load were significantly lower with this change than on master as well.

This pool was previously a pool.Bytes to avoid repetitive allocations.
It was recently switched to a sync.Pool because pool.Bytes held onto
very large buffers at times which were never released.  sync.Pool is
showing up in allocation profiles quite frequently.

This switches the pool to a new pool that limits how many buffers are
in the pool as well as the max size of each buffer in the pool.  This
provides better bounds on allocations.
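
As an illustration of what the commit message describes, here is a minimal sketch of a bounded buffer pool that caps both the number of buffers it retains and the maximum size of any buffer it will keep. The names are illustrative and do not necessarily match the pool code in this PR.

```go
package pool

// BoundedBytesPool retains at most a fixed number of buffers, each no larger
// than maxSize bytes, so oversized buffers are dropped rather than held forever.
type BoundedBytesPool struct {
	maxSize int
	pool    chan []byte
}

// NewBoundedBytesPool creates a pool holding at most capacity buffers of at
// most maxSize bytes each.
func NewBoundedBytesPool(capacity, maxSize int) *BoundedBytesPool {
	return &BoundedBytesPool{
		maxSize: maxSize,
		pool:    make(chan []byte, capacity),
	}
}

// Get returns a buffer of length sz, reusing a pooled buffer when one with
// enough capacity is available.
func (p *BoundedBytesPool) Get(sz int) []byte {
	select {
	case b := <-p.pool:
		if cap(b) >= sz {
			return b[:sz]
		}
	default:
	}
	return make([]byte, sz)
}

// Put returns a buffer to the pool, dropping it if it exceeds maxSize or if
// the pool is already full. This is what bounds the allocations.
func (p *BoundedBytesPool) Put(b []byte) {
	if cap(b) > p.maxSize {
		return
	}
	select {
	case p.pool <- b:
	default:
	}
}
```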
@jwilder jwilder added this to the 1.3.0 milestone May 11, 2017
@jwilder jwilder added the review label May 11, 2017
models/points.go Outdated
return buf[:i-1], nil
}
return buf[:i], err
}
Contributor

The comment says "ignore the error" but it gets passed back. I just want to make sure that's intentional.

Contributor Author

Good catch, it should be returning nil here and err should be _.
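
In Go terms, the fix described here boils down to the idiom below: when an inner error is intentionally ignored, bind it to the blank identifier and return an explicit nil rather than passing a stale err back. This is a generic sketch with a made-up helper, not the actual models/points.go change.

```go
package main

import (
	"fmt"
	"strconv"
)

// takePrefix returns the first n bytes of buf, where n comes from a string
// that may not parse. The parse error is deliberately ignored: bad input just
// means "take everything", so it is bound to _ and nil is returned explicitly.
func takePrefix(buf []byte, n string) ([]byte, error) {
	i, _ := strconv.Atoi(n) // error intentionally ignored
	if i <= 0 || i > len(buf) {
		return buf, nil
	}
	return buf[:i], nil // return nil, not the ignored error
}

func main() {
	b, err := takePrefix([]byte("abcdef"), "3")
	fmt.Printf("%s %v\n", b, err) // abc <nil>
}
```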

jwilder and others added 6 commits May 12, 2017 14:05
Measurement name and field were converted between []byte and string
repeatedly, causing lots of garbage.  This switches the code to use
[]byte in the write path.
This changes full compactions within a shard to run sequentially
instead of running all the compaction groups in parallel.  Normally,
there is only 1 full compaction group to run.  At times, there could
be several, which causes instability if they all run concurrently,
as they tie up a CPU for long periods of time.

Level compactions are also capped at a max of 4 running concurrently for each level
in a shard.  This prevents sudden spikes in CPU and disk usage due to a large backlog
of TSM files at a given level.
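
The dispatch policy described in this commit message can be sketched roughly as follows; the types and helpers are hypothetical, not the actual tsm1 engine code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxConcurrentPerLevel = 4 // cap used by this PR for level compactions

type group struct{ id int } // stand-in for a compaction group

// runFullSequentially runs full compaction groups one after another, so
// several expensive full compactions never run at once within a shard.
func runFullSequentially(groups []group, compact func(group) error) error {
	for _, g := range groups {
		if err := compact(g); err != nil {
			return err
		}
	}
	return nil
}

// runLevel runs level compaction groups with at most maxConcurrentPerLevel in
// flight at once, using a buffered channel as a counting semaphore.
func runLevel(groups []group, compact func(group) error) {
	sem := make(chan struct{}, maxConcurrentPerLevel)
	var wg sync.WaitGroup
	for _, g := range groups {
		g := g
		sem <- struct{}{} // blocks once 4 compactions for this level are running
		wg.Add(1)
		go func() {
			defer func() { <-sem; wg.Done() }()
			_ = compact(g)
		}()
	}
	wg.Wait()
}

func main() {
	compact := func(g group) error {
		time.Sleep(10 * time.Millisecond) // simulate work
		fmt.Println("compacted group", g.id)
		return nil
	}
	_ = runFullSequentially([]group{{1}, {2}}, compact)
	runLevel([]group{{3}, {4}, {5}, {6}, {7}, {8}}, compact)
}
```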
@@ -383,20 +387,17 @@ func (l *WAL) writeToLog(entry WALEntry) (int, error) {
// limit how many concurrent encodings can be in flight. Since we can only
Contributor

This comment block appears redundant now that the limiter is gone

Contributor

The Wal.limiter field is no longer used, so can we remove that?

Contributor Author

Yes. I didn't see these comments earlier before I merged. I'll submit a new PR to remove these lines.

@jwilder jwilder merged commit 3b70086 into master May 12, 2017
@jwilder jwilder deleted the jw-write-values branch May 12, 2017 20:48
@jwilder jwilder removed the review label May 12, 2017