WAL #3036

Merged: 1 commit merged into influxdata:master on Jun 25, 2015

Conversation

benbjohnson
Contributor

Overview

This commit adds a write ahead log to the shard. Entries are cached in memory and periodically flushed back into the index. The goal is to optimize write performance by flushing multiple points into series instead of writing a single point which causes a lot of random write disk overhead.
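For illustration, here is a minimal sketch of that write path, assuming a Bolt-backed "wal" bucket plus an in-memory cache keyed by series; the names here (writeToWAL, the point interface, the bucket name) are illustrative, not the exact code in this PR.

package walsketch

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// point is a stand-in for the shard's point type.
type point interface {
	Key() []byte           // series key
	MarshalBinary() []byte // encoded timestamp and field data
}

// writeToWAL appends each point to the "wal" bucket under an autoincrementing
// sequence number and mirrors it into an in-memory cache keyed by series.
func writeToWAL(db *bolt.DB, cache map[string][][]byte, points []point) error {
	return db.Update(func(tx *bolt.Tx) error {
		wal, err := tx.CreateBucketIfNotExists([]byte("wal"))
		if err != nil {
			return err
		}
		for _, p := range points {
			// The autoincrementing sequence keeps WAL entries in insert order.
			id, _ := wal.NextSequence()
			key := make([]byte, 8)
			binary.BigEndian.PutUint64(key, id)
			if err := wal.Put(key, p.MarshalBinary()); err != nil {
				return err
			}
			// Cache by series key so queries can see unflushed points.
			cache[string(p.Key())] = append(cache[string(p.Key())], p.MarshalBinary())
		}
		return nil
	})
}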

TODO

  • Add flushing configuration.
  • Add periodic flusher.
  • Integrate with query engine.
  • Partition flushes so that a large section of the WAL can be saved in multiple small batches.

@benbjohnson
Contributor Author

The CI failures are expected since the query engine isn't hooked up yet. This PR just validates that the WAL can be written, flushed, and flushed again after a restart.

	for _, p := range points {
		bp, err := tx.CreateBucketIfNotExists(p.Key())
		// Generate an autoincrementing index for the WAL.
		id, _ := wal.NextSequence()
Contributor

The id that is created here is used as a key, and data is inserted using this key. Is this data ever deleted from the WAL bucket?

Contributor Author

Ah, nope, not yet. I missed that. I think I might actually change the flusher to go off the WAL bucket instead. It'll make consistency easier.
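To illustrate that idea, here is a rough sketch of a flusher driven entirely by the WAL bucket, which drops the bucket once its contents have been written back; the function name, bucket name, and decode callback are hypothetical, not the PR's actual code.

package walsketch

import "github.com/boltdb/bolt"

// flushWALBucket drains the "wal" bucket back into per-series buckets within
// one transaction, then drops the WAL bucket. decode is a stand-in for the
// real entry decoding; it returns the series bucket name plus the key/value
// to write into it.
func flushWALBucket(db *bolt.DB, decode func(k, v []byte) (series, key, value []byte)) error {
	return db.Update(func(tx *bolt.Tx) error {
		wal := tx.Bucket([]byte("wal"))
		if wal == nil {
			return nil // nothing to flush
		}
		if err := wal.ForEach(func(k, v []byte) error {
			series, key, value := decode(k, v)
			b, err := tx.CreateBucketIfNotExists(series)
			if err != nil {
				return err
			}
			return b.Put(key, value)
		}); err != nil {
			return err
		}
		// Removing the bucket clears all flushed entries at once, which keeps
		// the WAL and the index consistent.
		return tx.DeleteBucket([]byte("wal"))
	})
}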

@otoolep
Contributor

otoolep commented Jun 18, 2015

Generally makes sense. Would we really need the cache if queries weren't a concern? Couldn't the flusher just walk the WAL keys and build the batches on the fly?

Of course, queries are a concern. :-) But I want to fully understand the motivation behind the cache. After all, due to the Bolt mmap, that cached data is now in memory twice (most likely).

@benbjohnson
Contributor Author

The cache is only needed for queries. It'd be nice to just query directly off the WAL but the points are not necessarily going to be in order.

@benbjohnson
Contributor Author

@otoolep I made some changes in 6d0337e:

  • Flush from the WAL bucket instead of the cache.
  • Remove the WAL bucket after flush.
  • Add autoflush background goroutine.
  • Add configurable MaxWALSize.

I'm going to hook up the query side next.
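As a rough sketch of the autoflush item, a background goroutine can check the WAL size on an interval and trigger a flush when it crosses the limit; the type, field, and method names below are illustrative rather than the ones in 6d0337e.

package walsketch

import (
	"log"
	"sync"
	"time"
)

// walShard is a stand-in for the shard; only the flush-trigger logic is shown.
type walShard struct {
	mu       sync.Mutex
	walBytes int           // approximate size of unflushed WAL entries
	maxSize  int           // e.g. DefaultMaxWALSize
	interval time.Duration // how often to check the WAL size
}

// flush stands in for walking the WAL bucket, writing its points into the
// index, and deleting the flushed entries.
func (s *walShard) flush() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.walBytes = 0
	return nil
}

// autoflush runs in its own goroutine and flushes whenever the WAL exceeds
// its configured maximum size.
func (s *walShard) autoflush(done <-chan struct{}) {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			s.mu.Lock()
			oversize := s.walBytes >= s.maxSize
			s.mu.Unlock()
			if oversize {
				if err := s.flush(); err != nil {
					log.Printf("wal autoflush error: %v", err)
				}
			}
		}
	}
}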

// is responsible for combining the output of many shards into a single query result.
const (
	// DefaultMaxWALSize is the default size of the WAL before it is flushed.
	DefaultMaxWALSize = 10 * 1024 * 1024 // 10MB
Contributor

Not a blocker for this PR, but we'll need to make this configurable through tsdb/config.go.

Contributor Author

I added the WAL size and flush interval to the TSDB config.
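For reference, the config additions might look roughly like this; the field names and toml tags are illustrative, and the real tsdb/config.go may use its own duration wrapper rather than time.Duration.

package walsketch

import "time"

// Config sketches the two WAL settings mentioned above.
type Config struct {
	// Maximum WAL size before a flush is forced.
	MaxWALSize int `toml:"max-wal-size"`
	// Flush the WAL on this interval even if it is under MaxWALSize.
	WALFlushInterval time.Duration `toml:"wal-flush-interval"`
}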

@pauldix
Member

pauldix commented Jun 20, 2015

I think this makes sense. I'd be interested to see what the perf is like when the WAL gets close to its max size. Particularly if you have a single series key that is very hot (many times per second) so its cache entry would be very large.

I assume that most people would want a WAL much larger than 10MB. Would be good to test with ones in the hundreds of megabytes.

We should probably force a flush of a specific key after it gets up to a certain size. No need to flush the entire WAL, just that key.

I assume that you'd be removing the calls to flush in the query_executor_test.go once the query side of things is wired up?

Also, flush should be forced after a certain time as well as just size. That way once a shard is cold for writes, its WAL will get flushed. Maybe just force a flush if we haven't received a write after 10m?

@benbjohnson
Contributor Author

@otoolep @pauldix I added query integration, time-based flushing, and WAL options to the config. The flushes have been removed from the query_executor_test.go and the default WAL size was bumped to 100MB. I'll start doing some performance testing with it to make sure that it's working properly.

We should probably force a flush of a specific key after it gets up to a certain size. No need to flush the entire WAL, just that key.

We could do individual key flushing but that starts to complicate the WAL and has some other overhead. I'd prefer to defer that until it's an issue. I think striping the flushes is a better way to go, personally. I didn't want to include that in this first cut because that's also more complicated.

}

// Otherwise read from the cache.
// Continue skipping ahead through duplicate keys in the cache list.
Contributor

How could there be duplicate keys? Points for the series written for the same timestamp?

Contributor Author

Yes.

@benbjohnson
Contributor Author

@otoolep @pauldix

I'm running benchmarks on a WAL-enabled influxd and here's what I'm finding:

  • Write speed to the WAL is fairly consistent: about 50ms for batches of 5,000 points.
  • Flushing the WAL takes a long time and blocks, so we'll need to do striping. I'm trying to simulate striping by limiting the number of series, and I'm seeing about 150K writes/sec. It's not very optimized yet, though, and I would guess we could get to 200K+ pretty easily.
  • Memory usage is staying very consistent.
  • IOPS are around 30-50, except when the WAL flushes, when they jump momentarily to 1000-1500. Striping will help even this out.

	if err := s.db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("series"))
		for _, k := range keys {
			if err := b.Delete([]byte(k)); err != nil {
				return err
			}
-			if err := tx.DeleteBucket([]byte(k)); err != nil {
+			if err := tx.DeleteBucket([]byte(k)); err != nil && err != bolt.ErrBucketNotFound {
Contributor

This looks like it might be fixing another error that people have reported. Some users have complained of "bucket not found" coming back to them.

Contributor Author

It's definitely good to know the return errors from Bolt's operations. They're generally documented in the godoc but you can also find the exact list of named errors if you open up the function:

https://github.com/boltdb/bolt/blob/master/bucket.go#L208-L212
https://github.com/boltdb/bolt/blob/master/bucket.go#L218-L223

Failing on a bolt.ErrBucketNotFound and returning that error causes the whole Update() to roll back, which is probably not what you want.
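A self-contained sketch of that point, tolerating the missing bucket so the rest of the Update() still commits (the function and bucket naming are illustrative):

package walsketch

import "github.com/boltdb/bolt"

// dropSeriesBuckets deletes the per-series buckets for the given keys.
// A missing bucket is treated as already deleted; returning
// bolt.ErrBucketNotFound would roll back the entire transaction.
func dropSeriesBuckets(db *bolt.DB, keys []string) error {
	return db.Update(func(tx *bolt.Tx) error {
		for _, k := range keys {
			if err := tx.DeleteBucket([]byte(k)); err != nil && err != bolt.ErrBucketNotFound {
				return err
			}
		}
		return nil
	})
}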

@otoolep
Contributor

otoolep commented Jun 23, 2015

Looking forward to seeing this in action. +1

@pauldix
Member

pauldix commented Jun 24, 2015

@benbjohnson looks good so far. What do you mean by striping? I'm still concerned about series that have a much higher frequency than others. Say you have one series that gets an average of 500 points per second, and then a bunch of other, regular series that only get 1 point every 10 seconds.

I don't think the same approach is going to work well for both. So you really need a way to flush per series, IMO.

But I'm interested in hearing more about the striping approach and if you think that'll solve this potential issue.

@benbjohnson
Contributor Author

@pauldix By striping I mean that we break the WAL out into, say, 256 buckets, and series are sharded across those buckets. Then we can flush those buckets individually so that we're not flushing the whole WAL at once. It also still lets us get the benefit of grouping series writes together.

As for variable series sizes, I think that if you have a series doing 500 writes/sec then it's going to dwarf a series with a write every 10s. It seems like a premature optimization at this point.
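As a sketch of that sharding step, a series key can be hashed into a fixed number of WAL partition buckets so each bucket can be flushed on its own; the hash function and partition count below are illustrative.

package walsketch

import "hash/fnv"

// walPartitionN is the number of WAL partitions (illustrative value).
const walPartitionN = 8

// walPartition maps a series key onto one of the WAL partition buckets so
// each bucket can be flushed independently.
func walPartition(seriesKey []byte) uint8 {
	h := fnv.New32a()
	h.Write(seriesKey)
	return uint8(h.Sum32() % walPartitionN)
}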

@pauldix
Member

pauldix commented Jun 24, 2015

@benbjohnson cool, will be curious to see how striping affects performance.

@otoolep
Contributor

otoolep commented Jun 24, 2015

If this change makes it for 0.9.1, doesn't it also need an upgrade case?


@otoolep
Contributor

otoolep commented Jun 24, 2015

Of course the system will handle no WAL bucket existing and create it. But we should be sure about this and test upgrade.

On Wednesday, June 24, 2015, Philip O'Toole philip@influxdb.com wrote:

If this change makes it for 0.9.1, doesn't it also need an upgrade case?

On Wednesday, June 24, 2015, Paul Dix <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@benbjohnson https://github.com/benbjohnson cool, will be curious to
see how striping affects performance.


Reply to this email directly or view it on GitHub
#3036 (comment).

@benbjohnson changed the title from "Add write ahead log" to "WAL" on Jun 25, 2015
@benbjohnson
Contributor Author

@otoolep @pauldix This WAL PR is ready for review.

@otoolep
Contributor

otoolep commented Jun 25, 2015

@benbjohnson -- what impact did striping have on the system? Does it meet the design goals?

return
}

// WALPartitionN is the number of partitions in the write ahead log.
Contributor

Do we need a warning in the comment that this cannot be reduced without possibly making data in the higher-numbered partitions invisible?

Contributor Author

The WAL gets flushed on shard open so it can rebuild the partition buckets automatically.

Contributor

I don't follow you, perhaps I am missing something. A system is running with partition count of 8. It crashes hard. It stays down. It's upgraded to a version with a partition count of 4, and then restarted. The call to Flush() will only flush partitions 0-3 inclusive.

This is a real edge case, but I want to be sure I understand the code. Not saying we have to fix it.

Contributor

Perhaps the fix for this would be easy enough? The flush that takes place on open should walk the partition buckets actually on disk, and flush those? Ignore the partition count in the code, just for the first flush? Again, this is an edge case, but perhaps the fix is easy.
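One way to sketch that suggestion: on open, enumerate whatever WAL partition buckets actually exist in the Bolt file instead of trusting the compiled-in partition count (the "wal_" bucket-name prefix here is only an assumed naming scheme).

package walsketch

import (
	"bytes"

	"github.com/boltdb/bolt"
)

// walBucketsOnDisk lists every WAL partition bucket present in the data file,
// regardless of the current WALPartitionN, so all of them can be flushed on open.
func walBucketsOnDisk(db *bolt.DB) ([][]byte, error) {
	var names [][]byte
	err := db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, _ *bolt.Bucket) error {
			if bytes.HasPrefix(name, []byte("wal_")) {
				// Copy the name; Bolt reuses the backing slice between calls.
				names = append(names, append([]byte(nil), name...))
			}
			return nil
		})
	})
	return names, err
}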

Contributor Author

You're right, it wouldn't catch it in the flush. I thought I had fixed that. Good catch. I fixed it in the latest commit (b574e2f).

Contributor

Blame my work on Kafka at Loggly. Increasing partition count was, like, a 10-second operation. Decreasing it was fraught. :-)

@otoolep
Contributor

otoolep commented Jun 25, 2015

@benbjohnson -- how does partitioning actually help us? The WAL buckets are still in only 1 BoltDB instance, so only 1 writer can be doing its thing at once. There is still only 1 shard mutex as well. I don't follow how it helps. Can you explain?

@benbjohnson
Contributor Author

@otoolep The main goal of partitioning was to lower the amount of continuous time that the flush blocks. Previously, flushing 1.5M points would take 8s, which is a long time to block for. Now, with partitioning, it will block 8 times for about 1s each, which is much more reasonable from a client perspective.

Partitioning raised the IOPS to the WAL a bit -- from about 50 IOPS to 100-150 IOPS -- but it's still within our design goal. Previously we were seeing up to 20,000 IOPS on the box continuously. Now it's closer to 150 IOPS, and flushes push it up to 1500-2000 IOPS momentarily.

@otoolep
Contributor

otoolep commented Jun 25, 2015

OK, thanks @benbjohnson -- if it's about more, but shorter, delays, that makes sense.

@pauldix
Member

pauldix commented Jun 25, 2015

+1, looks awesome. We'll have to run this through its paces in testing over the next 7 days :)

This commit adds a write ahead log to the shard. Entries are cached
in memory and periodically flushed back into the index. The WAL and
the cache are both partitioned into buckets so that flushing doesn't
stop the world as long.
benbjohnson added a commit that referenced this pull request Jun 25, 2015
@benbjohnson benbjohnson merged commit e10fb0c into influxdata:master Jun 25, 2015
@benbjohnson benbjohnson deleted the wal branch June 25, 2015 21:51
@huhongbo

Seems to have better performance, but I'm testing with 10000 msg/s pulled from Kafka and InfluxDB hangs without any log information.

@pauldix
Member

pauldix commented Jun 26, 2015

@huhongbo can you give more information? How does it hang? What is your schema? How are you writing data? How many points do you write before it hangs? Basically, we need as much information as you can give. Our testing hasn't shown the issue you're talking about so we'd like to reproduce it if possible.

@huhongbo

https://gist.github.com/huhongbo/4d0882e4e0262dfa6991 has some sample data.
I'm writing the data using the line protocol, batching every 10000 lines into a single write to InfluxDB.
The very strange thing is that I have to empty the db to restore the write speed.
I'm using a machine with 2 CPUs (32 cores at 2.0 GHz), 128 GB of memory, and 2x2TB disks in RAID 1.
