
Fix loading huge series into RAM when points are overwritten #6556

Merged
merged 1 commit into master from jw-tsm-values
May 5, 2016

Conversation


@jwilder jwilder commented May 4, 2016

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated

In some query scenarios, if there are a lot of points on disk for the same series, spread across many blocks in TSM files, and a point is overwritten near the beginning of the shard's time range, the full series could be loaded into RAM, triggering huge allocations and OOMs. I believe the same problem exists in the compactor, but this PR only fixes the query path; a separate fix will be needed for compactions.

The issue was that the KeyCursor code that handles overwriting points had a simple implementation that just deduped the whole series in this case. This falls over when the series is quite large.

Instead, the KeyCursor has been changed to decode only blocks containing updated points. It then keeps track of which sections of each block have already been read, so they are not re-read when later points are decoded.
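
Roughly, the bookkeeping amounts to remembering a read range per block. The sketch below is illustrative only; the names and layout are stand-ins, not the actual tsm1 types:

    // blockLocation tracks which part of a decoded block has already
    // been returned to the caller, so an overlapping read can skip it
    // instead of decoding the block again. (Illustrative, not tsm1's
    // real structure.)
    type blockLocation struct {
        minTime, maxTime int64 // block's full time range, from the TSM index
        readMin, readMax int64 // sub-range already returned to the caller
        hasRead          bool  // whether anything was returned yet
    }

    // markRead records that timestamps in [min, max] were handed back.
    func (b *blockLocation) markRead(min, max int64) {
        if !b.hasRead || min < b.readMin {
            b.readMin = min
        }
        if !b.hasRead || max > b.readMax {
            b.readMax = max
        }
        b.hasRead = true
    }

    // fullyRead reports whether the entire block has been consumed.
    func (b *blockLocation) fullyRead() bool {
        return b.hasRead && b.readMin <= b.minTime && b.readMax >= b.maxTime
    }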

Since the points in a block are always sorted, the code was also changed to remove the Deduplicate calls, which end up reallocating the slice (as well as creating an equally sized map). Instead, we do a sorted merge and reuse the slices as much as we can.
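
As a minimal illustration of that merge: the PR's real code operates on the tsm1 value type and reuses the input slices in place; the val type here is a stand-in, and this version allocates the output once for clarity.

    package main

    import "fmt"

    // val stands in for the tsm1 value type; only the timestamp matters.
    type val struct {
        t int64
        v float64
    }

    func (v val) UnixNano() int64 { return v.t }

    // merge combines two time-sorted runs, with b (the newer block)
    // winning on equal timestamps. One pass, no map, unlike a general
    // Deduplicate.
    func merge(a, b []val) []val {
        out := make([]val, 0, len(a)+len(b))
        var i, j int
        for i < len(a) && j < len(b) {
            av, bv := a[i].UnixNano(), b[j].UnixNano()
            switch {
            case av < bv:
                out = append(out, a[i])
                i++
            case av > bv:
                out = append(out, b[j])
                j++
            default: // equal timestamps: the overwrite wins
                out = append(out, b[j])
                i++
                j++
            }
        }
        out = append(out, a[i:]...)
        return append(out, b[j:]...)
    }

    func main() {
        a := []val{{1, 1}, {2, 1}, {4, 1}}
        b := []val{{2, 9}, {3, 9}}
        fmt.Println(merge(a, b)) // [{1 1} {2 9} {3 9} {4 1}]
    }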

To reproduce the issue, I wrote 50M points to a single series and overwrote the first point a few times, until the TSM file blocks were in the state where this issue surfaces. Then I ran a select count(value) from cpu query to count every point in the series.
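
The exact stress tooling isn't shown here, but a hypothetical generator for an equivalent workload against the v1 line-protocol /write endpoint (database stress, measurement cpu, field value, matching the query above) could look like this:

    package main

    // Illustrative workload generator, not the tool used in this PR.
    // Timestamps and batch sizes are arbitrary; in practice each
    // overwrite may need to land in a separate flush/TSM file for the
    // blocks to reach the state described above.

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        const url = "http://localhost:8086/write?db=stress&precision=s"

        write := func(body *bytes.Buffer) {
            resp, err := http.Post(url, "text/plain", body)
            if err != nil {
                log.Fatal(err)
            }
            resp.Body.Close()
        }

        // 50M points on one series, one second apart, in ~1MB batches.
        var buf bytes.Buffer
        for i := 0; i < 50000000; i++ {
            fmt.Fprintf(&buf, "cpu value=1 %d\n", i)
            if buf.Len() > 1<<20 {
                write(&buf)
                buf.Reset()
            }
        }
        write(&buf)

        // Overwrite the first point a few times so newer TSM blocks
        // contain a duplicate of a timestamp near the start of the
        // shard's time range.
        for k := 0; k < 3; k++ {
            write(bytes.NewBufferString(fmt.Sprintf("cpu value=%d 0", k)))
        }
    }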

Some before and after system stats:

Before

$ time influx -database stress -execute "select count(value) from cpu"
name: cpu
---------
time    count
0   50000000


real    0m39.994s
user    0m0.004s
sys 0m0.006s

$ ps -o rss,vsz,pid $(pgrep influxd)
   RSS      VSZ   PID
5960276 573563196 20644

After

$ time influx -database stress -execute "select count(value) from cpu"
name: cpu
---------
time    count
0   50000000

real    0m11.352s
user    0m0.004s
sys 0m0.006s
$ ps -o rss,vsz,pid $(pgrep influxd)
   RSS      VSZ   PID
137428 573564744 20907

Query time dropped from 40s to 11.3s, and RSS from 6GB to 137MB.

@jwilder jwilder added this to the 1.0.0 milestone May 4, 2016
@mention-bot

By analyzing the blame information on this pull request, we identified @mark-rushakoff and @joelegasse to be potential reviewers

@jwilder

jwilder commented May 4, 2016

@benbjohnson

    } else if a[i].UnixNano() == b[j].UnixNano() {
        a[i] = b[j]
        i++
        j++
    }
@benbjohnson

Minor nit, but we should probably save UnixNano() since we're calling it twice. Maybe atime & btime?

@benbjohnson

Oh, wait, I was thinking those were time.Time values. Ok, it probably doesn't make a big difference if they're just grabbing int64 values underneath.

@jwilder

Might be a small savings to save the slice indexing call.

@e-dard

minor nit: you could save a few lines with:

if a[i].UnixNano() > b[j].UnixNano() {
    a[i], b[j] = b[j], a[i]
} else if a[i].UnixNano() == b[j].UnixNano() {
    a[i] = b[j]
    j++
}
i++

@jwilder

@e-dard That's nicer. I'll update it to use:

    var i, j int
    for ; i < len(a) && j < len(b); i++ {
        av, bv := a[i].UnixNano(), b[j].UnixNano()
        if av > bv {
            a[i], b[j] = b[j], a[i]
        } else if av == bv {
            a[i] = b[j]
            j++
        }
    }

@benbjohnson

Overall it lgtm. Just a few minor comments.

@e-dard

e-dard commented May 5, 2016

LGTM 👍

In some query scenarios, if there are a lot of points on disk spread
across many blocks in TSM files and a point is overwritten near the
beginning of the shard's time range, the full series could be loaded
into RAM triggering OOMs and huge allocations.

The issue was that the KeyCursor code that handles overwriting points
had a simple implementation that just deduped the whole series in this
case.  This falls over when the series is quite large.

Instead, the KeyCursor has been changed to only decode blocks with
updated points.  It then keeps track of what section of the blocks
have been read so they are not re-read when the later points are
decoded.

Since the points in a block are always sorted, the code was also changed
to remove the Deduplicate calls since they end up
reallocating the slice.  Instead, we do a sorted merge and re-use
the slice as much as we can.
@jwilder jwilder merged commit fbf1e4a into master May 5, 2016
@jwilder jwilder deleted the jw-tsm-values branch May 5, 2016 16:09
jwilder added a commit that referenced this pull request May 6, 2016
If a large series contains a point that is overwritten, the compactor
would load the whole series into RAM during a full compaction.  If
the series was large, it could cause very large RAM spikes and OOMs.

The change reworks the compactor to merge blocks more incrementally
similar to the fix done in #6556.
jwilder added a commit that referenced this pull request May 17, 2016
Fixes #6557
@jwilder jwilder mentioned this pull request May 17, 2016
jwilder added a commit that referenced this pull request May 18, 2016
jwilder added a commit that referenced this pull request May 18, 2016