Improved compaction scheduling #8886

Merged
jwilder merged 16 commits into master from jw-planner on Oct 4, 2017

Conversation


@jwilder jwilder commented Sep 27, 2017

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

This fixes some regressions and new issues introduced by the recent changes in #8872 and #8856.

  • Fix long process stalls #8872 fixed some long process stalls due to major page faults, but it also caused a significant reduction in throughput for lower cardinalities. This PR reverts the use of the O_SYNC flag for the WAL for now.
  • Snapshot compaction improvements #8856 increased the throughput of snapshot compactions, which keeps memory usage lower at higher cardinalities, but it also increases the number of level 1 TSM files more quickly. This exposed an issue with the current compaction planning process, where planning for a level could get blocked on a slower compaction within that level. For example, a level 3 compaction might plan for 4 groups to be compacted concurrently; if 3 completed quickly, the level 3 planner would still wait until the last compaction finished. Since lower levels are written more quickly, a large backlog could accumulate. This PR changes the planning to better handle these situations and make use of all the cores allocated for compactions (limited by max-concurrent-compactions); see the sketch after this description.

There are also two bug fixes: the temp TSM index file could be left around when running out of disk space, and the cache mem bytes stat now includes the size of the series key.
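To make the scheduling change above concrete, here is a minimal, self-contained sketch of the idea, not the actual engine code: the type and function names (`group`, `planLevel`, `compact`) are hypothetical, and only the max-concurrent-compactions limit comes from the PR. A single planning pass walks every level and starts as many groups as there are free slots, so a slow group in one level no longer blocks planning for the rest.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// group is a hypothetical stand-in for a planned compaction group.
type group struct{ level, id int }

// planLevel stands in for the per-level planner and returns the groups that
// are ready to be compacted for that level.
func planLevel(level int) []group {
	return []group{{level: level, id: 0}, {level: level, id: 1}}
}

// compact stands in for running a single compaction group to completion.
func compact(g group) {
	time.Sleep(50 * time.Millisecond)
	fmt.Printf("compacted level %d group %d\n", g.level, g.id)
}

func main() {
	const maxConcurrent = 4                     // analogous to max-concurrent-compactions
	slots := make(chan struct{}, maxConcurrent) // free compaction slots
	var wg sync.WaitGroup

	// A single planning goroutine (here, main) walks every level and starts
	// as many groups as there are free slots. A slow group in one level no
	// longer blocks planning for the remaining levels.
	for pass := 0; pass < 3; pass++ {
		for level := 1; level <= 4; level++ {
			for _, g := range planLevel(level) {
				select {
				case slots <- struct{}{}: // acquire a slot without blocking
					wg.Add(1)
					go func(g group) {
						defer wg.Done()
						defer func() { <-slots }() // release the slot when done
						compact(g)
					}(g)
				default:
					// No capacity right now; this group is picked up again
					// on the next planning pass instead of blocking here.
				}
			}
		}
		time.Sleep(100 * time.Millisecond) // planning interval
	}
	wg.Wait()
}
```

Groups that don't fit are simply re-planned on the next pass, which is what keeps the lower levels from piling up behind a single slow level 3 or 4 compaction.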

@e-dard e-dard left a comment

Not necessarily changes, but just a couple of queries about closing/removing files.

return nil
}

if err := d.fd.Close(); err != nil {
Contributor

So if there is a problem closing the file we're not going to clean up, but we're not going to return an error to the caller either? This scenario seems unusual enough that we don't want to ignore it or not log it.

Contributor Author

That should be returning the err here. Will fix.
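For context, a minimal sketch of the pattern being agreed on here; `closeAndRemoveTemp` is a hypothetical helper, not the actual directIndex code, and only the general shape (close the handle, then clean up the temp file) comes from the quoted diff. The point is to propagate a failed Close to the caller rather than silently skipping cleanup.

```go
package main

import "os"

// closeAndRemoveTemp is a hypothetical helper, not the actual directIndex
// code: it closes the temp index file and removes it only if the close
// succeeded, returning any error so the caller can see cleanup did not happen.
func closeAndRemoveTemp(fd *os.File, path string) error {
	if fd == nil {
		return nil
	}
	if err := fd.Close(); err != nil {
		// Propagate the error instead of silently skipping cleanup.
		return err
	}
	return os.Remove(path)
}

func main() {
	fd, err := os.CreateTemp("", "index-*.tmp")
	if err != nil {
		panic(err)
	}
	if err := closeAndRemoveTemp(fd, fd.Name()); err != nil {
		panic(err)
	}
}
```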

}

if f, ok := t.wrapped.(*os.File); ok {
f.Close()
Contributor

In this case we're going to ignore a Close error (fine) but then still try and remove the file. This seems at odds with how the directIndex is handled.

Contributor Author

I'll fix this up too.
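In the same spirit, a hedged sketch of one way to make this path consistent with the directIndex handling; `writerShim` and `removeFile` are illustrative stand-ins (only `t.wrapped` holding an io.Writer that may be an *os.File comes from the quoted diff), and this is not necessarily the fix that landed. It still removes the file, but it surfaces a failed Close instead of discarding it.

```go
package main

import (
	"io"
	"os"
)

// writerShim is an illustrative stand-in for a writer wrapping an io.Writer
// that may or may not be an *os.File (mirrors the t.wrapped shape above).
type writerShim struct {
	wrapped io.Writer
	path    string
}

// removeFile still removes the file, but surfaces a failed Close instead of
// discarding it, so this path behaves like the directIndex handling. This is
// an illustration of the pattern, not necessarily the fix that landed.
func (t *writerShim) removeFile() error {
	var closeErr error
	if f, ok := t.wrapped.(*os.File); ok {
		closeErr = f.Close()
	}
	if err := os.Remove(t.path); err != nil {
		return err
	}
	return closeErr
}

func main() {
	f, err := os.CreateTemp("", "tsm-*.tmp")
	if err != nil {
		panic(err)
	}
	w := &writerShim{wrapped: f, path: f.Name()}
	if err := w.removeFile(); err != nil {
		panic(err)
	}
}
```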

@jwilder jwilder changed the title Fix regressions Fix regressions and improve compaction scheduling Oct 3, 2017
@jwilder jwilder changed the title Fix regressions and improve compaction scheduling Improved compaction scheduling Oct 3, 2017
Doing this for the WAL reduces throughput quite a bit.

This changes the compaction scheduling to better utilize the available cores that are free. Previously, a level was planned in its own goroutine and would kick off a number of compaction groups. The problem with this model was that if there were 4 groups, and 3 completed quickly, the planning would be blocked for that level until the last group finished. If the compactions at the prior level are running more quickly, a large backlog could accumulate.

This now moves the planning to a single goroutine that plans each level in succession and starts as many groups as it can. When one group finishes, the planning will start the next group for the level.

Shows up under high cardinality compactions.

This switches the thresholds that are used for writing snapshots concurrently. This scales better than the prior model.

The chunked slice is unnecessary and we can re-use k.blocks throughout the compaction.

This gives an indication as to whether compactions are backed up or not.

With higher cardinality or larger series keys, the files can roll over early, which causes them to take longer to be compacted by higher levels. This causes larger disk usage and higher numbers of TSM files at times.

One shard might be able to run a compaction, but could fail due to limits being hit. This loop would continue indefinitely as the same task would continue to be rescheduled.

This check doesn't make sense for high cardinality data, as the files typically get big and sparse very quickly. This causes a lot of extra disk space to be used, which is taken up by large indexes and sparse data.

If there is a backlog of level 3 and 4 compactions, but few level 1 and 2 compactions, allow them to use some excess capacity (see the sketch after these commit notes).

Some files seem to get orphaned behind higher levels. This causes the compactions to get blocked, as the lower level files will not get picked up by their lower level planners. This allows the full plan to identify them and pull them into their plans.
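On the excess-capacity note above, here is an illustrative policy sketch, not the scheduler that actually shipped in this PR; `freeSlotsFor` and the equal per-level shares are assumptions. Each level gets a base share of max-concurrent-compactions, and a backlogged level may borrow whatever share idle levels are not using, without ever exceeding the global limit.

```go
package main

import "fmt"

// freeSlotsFor is an illustrative policy, not the actual scheduler in the PR:
// each level gets an equal base share of the global limit, and any share left
// unused by idle levels can be borrowed by levels with a backlog.
func freeSlotsFor(level, maxConcurrent int, running, queued map[int]int) int {
	levels := []int{1, 2, 3, 4}
	base := maxConcurrent / len(levels)
	if base < 1 {
		base = 1
	}

	// Capacity left unused by the other levels that have little or no work.
	spare := 0
	for _, l := range levels {
		if l == level {
			continue
		}
		if idle := base - running[l] - queued[l]; idle > 0 {
			spare += idle
		}
	}

	allowed := base + spare - running[level]
	if allowed < 0 {
		allowed = 0
	}

	// Never exceed the global max-concurrent-compactions limit.
	totalRunning := 0
	for _, l := range levels {
		totalRunning += running[l]
	}
	if remaining := maxConcurrent - totalRunning; allowed > remaining {
		allowed = remaining
	}
	return allowed
}

func main() {
	running := map[int]int{1: 0, 2: 0, 3: 1, 4: 1}
	queued := map[int]int{1: 0, 2: 0, 3: 5, 4: 3} // backlog at levels 3 and 4
	fmt.Println("level 3 may start:", freeSlotsFor(3, 8, running, queued))
}
```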
@jwilder jwilder merged commit 7e2b22f into master Oct 4, 2017
@ghost ghost removed the review label Oct 4, 2017
@jwilder jwilder deleted the jw-planner branch October 4, 2017 15:04