WAL #3036

Merged: 1 commit merged into influxdata:master on Jun 25, 2015

Conversation

benbjohnson
Contributor

Overview

This commit adds a write ahead log to the shard. Entries are cached in memory and periodically flushed back into the index. The goal is to optimize write performance by flushing multiple points into series instead of writing a single point which causes a lot of random write disk overhead.
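For illustration, here is a minimal sketch of that write path, assuming a Bolt-backed "wal" bucket plus an in-memory cache keyed by series; the names here (writeToWAL, the point interface, the bucket name) are illustrative, not the exact code in this PR.

package walsketch

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// point is a stand-in for the shard's point type.
type point interface {
	Key() []byte           // series key
	MarshalBinary() []byte // encoded timestamp and field data
}

// writeToWAL appends each point to the "wal" bucket under an autoincrementing
// sequence number and mirrors it into an in-memory cache keyed by series.
func writeToWAL(db *bolt.DB, cache map[string][][]byte, points []point) error {
	return db.Update(func(tx *bolt.Tx) error {
		wal, err := tx.CreateBucketIfNotExists([]byte("wal"))
		if err != nil {
			return err
		}
		for _, p := range points {
			// The autoincrementing sequence keeps WAL entries in insert order.
			id, _ := wal.NextSequence()
			key := make([]byte, 8)
			binary.BigEndian.PutUint64(key, id)
			if err := wal.Put(key, p.MarshalBinary()); err != nil {
				return err
			}
			// Cache by series key so queries can see unflushed points.
			cache[string(p.Key())] = append(cache[string(p.Key())], p.MarshalBinary())
		}
		return nil
	})
}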

TODO

  • Add flushing configuration.
  • Add periodic flusher.
  • Integrate with query engine.
  • Partition flushes so that a large section of the WAL can be saved in multiple small batches.

@benbjohnson
Contributor Author

The CI failures are expected since the query engine isn't hooked up yet. This PR just validates that the WAL can be written, flushed, and flushed again after a restart.

	for _, p := range points {
		bp, err := tx.CreateBucketIfNotExists(p.Key())
		// Generate an autoincrementing index for the WAL.
		id, _ := wal.NextSequence()
Contributor

The id that is created here is used as a key, and data is inserted using this key. Is this data ever deleted from the WAL bucket?

Contributor Author

Ah, nope, not yet. I missed that. I think I might actually change the flusher to go off the WAL bucket instead. It'll make consistency easier.
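To illustrate that idea, here is a rough sketch of a flusher driven entirely by the WAL bucket, which drops the bucket once its contents have been written back; the function name, bucket name, and decode callback are hypothetical, not the PR's actual code.

package walsketch

import "github.com/boltdb/bolt"

// flushWALBucket drains the "wal" bucket back into per-series buckets within
// one transaction, then drops the WAL bucket. decode is a stand-in for the
// real entry decoding; it returns the series bucket name plus the key/value
// to write into it.
func flushWALBucket(db *bolt.DB, decode func(k, v []byte) (series, key, value []byte)) error {
	return db.Update(func(tx *bolt.Tx) error {
		wal := tx.Bucket([]byte("wal"))
		if wal == nil {
			return nil // nothing to flush
		}
		if err := wal.ForEach(func(k, v []byte) error {
			series, key, value := decode(k, v)
			b, err := tx.CreateBucketIfNotExists(series)
			if err != nil {
				return err
			}
			return b.Put(key, value)
		}); err != nil {
			return err
		}
		// Removing the bucket clears all flushed entries at once, which keeps
		// the WAL and the index consistent.
		return tx.DeleteBucket([]byte("wal"))
	})
}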

@otoolep
Contributor

otoolep commented Jun 18, 2015

Generally makes sense. Would we really need the cache if queries weren't a concern? Couldn't the flusher just walk the WAL keys and build the batches on the fly?

Of course, queries are a concern. :-) But I want to fully understand the motivation behind the cache. After all, due to the Bolt mmap, that cached data is now in memory twice (most likely).

@benbjohnson
Contributor Author

The cache is only needed for queries. It'd be nice to just query directly off the WAL but the points are not necessarily going to be in order.

@benbjohnson
Contributor Author

@otoolep I made some changes in 6d0337e:

  • Flush from the WAL bucket instead of the cache.
  • Remove the WAL bucket after flush.
  • Add autoflush background goroutine.
  • Add configurable MaxWALSize.

I'm going to hook up the query side next.
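As a rough sketch of the autoflush item, a background goroutine can check the WAL size on an interval and trigger a flush when it crosses the limit; the type, field, and method names below are illustrative rather than the ones in 6d0337e.

package walsketch

import (
	"log"
	"sync"
	"time"
)

// walShard is a stand-in for the shard; only the flush-trigger logic is shown.
type walShard struct {
	mu       sync.Mutex
	walBytes int           // approximate size of unflushed WAL entries
	maxSize  int           // e.g. DefaultMaxWALSize
	interval time.Duration // how often to check the WAL size
}

// flush stands in for walking the WAL bucket, writing its points into the
// index, and deleting the flushed entries.
func (s *walShard) flush() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.walBytes = 0
	return nil
}

// autoflush runs in its own goroutine and flushes whenever the WAL exceeds
// its configured maximum size.
func (s *walShard) autoflush(done <-chan struct{}) {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			s.mu.Lock()
			oversize := s.walBytes >= s.maxSize
			s.mu.Unlock()
			if oversize {
				if err := s.flush(); err != nil {
					log.Printf("wal autoflush error: %v", err)
				}
			}
		}
	}
}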

// is responsible for combining the output of many shards into a single query result.
const (
	// DefaultMaxWALSize is the default size of the WAL before it is flushed.
	DefaultMaxWALSize = 10 * 1024 * 1024 // 10MB
Contributor

Not a blocker for this PR, but we'll need to make this configurable through tsdb/config.go.

Contributor Author

I added the WAL size and flush interval to the TSDB config.
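For reference, the config additions might look roughly like this; the field names and toml tags are illustrative, and the real tsdb/config.go may use its own duration wrapper rather than time.Duration.

package walsketch

import "time"

// Config sketches the two WAL settings mentioned above.
type Config struct {
	// Maximum WAL size before a flush is forced.
	MaxWALSize int `toml:"max-wal-size"`
	// Flush the WAL on this interval even if it is under MaxWALSize.
	WALFlushInterval time.Duration `toml:"wal-flush-interval"`
}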

@pauldix
Member

pauldix commented Jun 20, 2015

I think this makes sense. I'd be interested to see what the perf is like when the WAL gets close to its max size. Particularly if you have a single series key that is very hot (many times per second) so its cache entry would be very large.

I assume that most people would want a WAL much larger than 10MB. Would be good to test with ones in the hundreds of megabytes.

We should probably force a flush of a specific key after it gets up to a certain size. No need to flush the entire WAL, just that key.

I assume that you'd be removing the calls to flush in the query_executor_test.go once the query side of things is wired up?

Also, flush should be forced after a certain time as well as just size. That way once a shard is cold for writes, its WAL will get flushed. Maybe just force a flush if we haven't received a write after 10m?

@benbjohnson
Contributor Author

@otoolep @pauldix I added query integration, time-based flushing, and WAL options to the config. The flushes have been removed from the query_executor_test.go and the default WAL size was bumped to 100MB. I'll start doing some performance testing with it to make sure that it's working properly.

We should probably force a flush of a specific key after it gets up to a certain size. No need to flush the entire WAL, just that key.

We could do individual key flushing but that starts to complicate the WAL and has some other overhead. I'd prefer to defer that until it's an issue. I think striping the flushes is a better way to go, personally. I didn't want to include that in this first cut because that's also more complicated.

}

// Otherwise read from the cache.
// Continue skipping ahead through duplicate keys in the cache list.
Contributor

How could there be duplicate keys? Points for the series written for the same timestamp?

Contributor Author

Yes.

@benbjohnson
Contributor Author

@otoolep @pauldix

I'm running benchmarks on a WAL-enabled influxd and here's what I'm finding:

  • Write speed to the WAL is fairly consistent: about 50ms for batches of 5,000 points.
  • Flushing the WAL takes a long time and blocks, so we'll need to do striping. I'm trying to simulate striping by limiting the number of series, and I'm seeing about 150K writes/sec. It's not very optimized yet, though, and I would guess we could get to 200K+ pretty easily.
  • Memory usage is staying very consistent.
  • IOPS are around 30-50, except when the WAL flushes, when they jump momentarily to 1000-1500. Striping will help even this out.

	if err := s.db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("series"))
		for _, k := range keys {
			if err := b.Delete([]byte(k)); err != nil {
				return err
			}
-			if err := tx.DeleteBucket([]byte(k)); err != nil {
+			if err := tx.DeleteBucket([]byte(k)); err != nil && err != bolt.ErrBucketNotFound {
Contributor

This looks like it might be fixing another error that people have reported. Some users have complained of "bucket not found" coming back to them.

Contributor Author

It's definitely good to know the return errors from Bolt's operations. They're generally documented in the godoc but you can also find the exact list of named errors if you open up the function:

https://github.com/boltdb/bolt/blob/master/bucket.go#L208-L212
https://github.com/boltdb/bolt/blob/master/bucket.go#L218-L223

Failing on a bolt.ErrBucketNotFound and returning that error causes the whole Update() to roll back, which is probably not what you want.
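A self-contained sketch of that point, tolerating the missing bucket so the rest of the Update() still commits (the function and bucket naming are illustrative):

package walsketch

import "github.com/boltdb/bolt"

// dropSeriesBuckets deletes the per-series buckets for the given keys.
// A missing bucket is treated as already deleted; returning
// bolt.ErrBucketNotFound would roll back the entire transaction.
func dropSeriesBuckets(db *bolt.DB, keys []string) error {
	return db.Update(func(tx *bolt.Tx) error {
		for _, k := range keys {
			if err := tx.DeleteBucket([]byte(k)); err != nil && err != bolt.ErrBucketNotFound {
				return err
			}
		}
		return nil
	})
}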

@otoolep
Contributor

otoolep commented Jun 23, 2015

Looking forward to seeing this in action. +1

@pauldix
Member

pauldix commented Jun 24, 2015

@benbjohnson looks good so far. What do you mean by striping? I'm still concerned about series that have a much higher frequency than others. Say you have one series that gets an average of 500 points per second, and then a bunch of other, regular series that only get 1 point every 10 seconds.

I don't think the same approach is going to work well for both. So you really need a way to flush per series, IMO.

But I'm interested in hearing more about the striping approach and if you think that'll solve this potential issue.

@benbjohnson
Contributor Author

@pauldix By striping I mean that we break the WAL out into, say, 256 buckets, and series are sharded across those buckets. Then we can flush those buckets individually so that we're not flushing the whole WAL at once. It also still lets us get the benefit of grouping series writes together.

As for variable series sizes, I think that if you have a series doing 500 writes/sec then it's going to dwarf a series with a write every 10s. It seems like a premature optimization at this point.
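As a sketch of that sharding step, a series key can be hashed into a fixed number of WAL partition buckets so each bucket can be flushed on its own; the hash function and partition count below are illustrative.

package walsketch

import "hash/fnv"

// walPartitionN is the number of WAL partitions (illustrative value).
const walPartitionN = 8

// walPartition maps a series key onto one of the WAL partition buckets so
// each bucket can be flushed independently.
func walPartition(seriesKey []byte) uint8 {
	h := fnv.New32a()
	h.Write(seriesKey)
	return uint8(h.Sum32() % walPartitionN)
}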

@pauldix
Member

pauldix commented Jun 24, 2015

@benbjohnson cool, will be curious to see how striping affects performance.

@otoolep
Contributor

otoolep commented Jun 24, 2015

If this change makes it for 0.9.1, doesn't it also need an upgrade case?


@otoolep
Contributor

otoolep commented Jun 24, 2015

Of course the system will handle no WAL bucket existing and create it. But we should be sure about this and test upgrade.

On Wednesday, June 24, 2015, Philip O'Toole philip@influxdb.com wrote:

If this change makes it for 0.9.1, doesn't it also need an upgrade case?

On Wednesday, June 24, 2015, Paul Dix <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@benbjohnson https://github.com/benbjohnson cool, will be curious to
see how striping affects performance.


Reply to this email directly or view it on GitHub
#3036 (comment).

@benbjohnson changed the title from "Add write ahead log" to "WAL" on Jun 25, 2015
@benbjohnson
Contributor Author

@otoolep @pauldix This WAL PR is ready for review.

@otoolep
Contributor

otoolep commented Jun 25, 2015

@benbjohnson -- what impact did striping have on the system? Does it meet the design goals?

return
}

// WALPartitionN is the number of partitions in the write ahead log.
Contributor

Do we need a warning in the comment that this cannot be reduced without possibly making data in the higher-numbered partitions invisible?

Contributor Author

The WAL gets flushed on shard open so it can rebuild the partition buckets automatically.

Contributor

I don't follow you, perhaps I am missing something. A system is running with partition count of 8. It crashes hard. It stays down. It's upgraded to a version with a partition count of 4, and then restarted. The call to Flush() will only flush partitions 0-3 inclusive.

This is a real edge case, but I want to be sure I understand the code. Not saying we have to fix it.

Contributor

Perhaps the fix for this would be easy enough? The flush that takes place on open should walk the partition buckets actually on disk, and flush those? Ignore the partition count in the code, just for the first flush? Again, this is an edge case, but perhaps the fix is easy.
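One way to sketch that suggestion: on open, enumerate whatever WAL partition buckets actually exist in the Bolt file instead of trusting the compiled-in partition count (the "wal_" bucket-name prefix here is only an assumed naming scheme).

package walsketch

import (
	"bytes"

	"github.com/boltdb/bolt"
)

// walBucketsOnDisk lists every WAL partition bucket present in the data file,
// regardless of the current WALPartitionN, so all of them can be flushed on open.
func walBucketsOnDisk(db *bolt.DB) ([][]byte, error) {
	var names [][]byte
	err := db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, _ *bolt.Bucket) error {
			if bytes.HasPrefix(name, []byte("wal_")) {
				// Copy the name; Bolt reuses the backing slice between calls.
				names = append(names, append([]byte(nil), name...))
			}
			return nil
		})
	})
	return names, err
}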

Contributor Author

You're right, it wouldn't catch it in the flush. I thought I had fixed that. Good catch. I fixed it in the latest commit (b574e2f).

Contributor

Blame my work on Kafka at Loggly. Increasing partition count was, like, a 10-second operation. Decreasing it was fraught. :-)

@otoolep
Contributor

otoolep commented Jun 25, 2015

@benbjohnson -- how does partitioning actually help us? The WAL buckets are still in only 1 BoltDB instance, so only 1 writer can be doing its thing at once. There is still only 1 shard mutex as well. I don't follow how it helps. Can you explain?

@benbjohnson
Contributor Author

@otoolep The main goal of partitioning was to lower the amount of continuous time that the flush blocks. Previously, flushing 1.5M points would take 8s, which is a long time to block for. Now, with partitioning, it will block 8 times for about 1s each, which is much more reasonable from a client perspective.

Partitioning raised the IOPS to the WAL a bit -- from about 50 IOPS to 100-150 IOPS -- but it's still within our design goal. Previously we were seeing up to 20,000 IOPS on the box continuously. Now it's closer to 150 IOPS, and flushes push it up to 1500-2000 IOPS momentarily.

@otoolep
Contributor

otoolep commented Jun 25, 2015

OK, thanks @benbjohnson -- if it's about more, but shorter, delays, that makes sense.

@pauldix
Member

pauldix commented Jun 25, 2015

+1, looks awesome. We'll have to run this through its paces in testing over the next 7 days :)

This commit adds a write ahead log to the shard. Entries are cached
in memory and periodically flushed back into the index. The WAL and
the cache are both partitioned into buckets so that flushing doesn't
stop the world as long.
benbjohnson added a commit that referenced this pull request Jun 25, 2015
@benbjohnson benbjohnson merged commit e10fb0c into influxdata:master Jun 25, 2015
@benbjohnson benbjohnson deleted the wal branch June 25, 2015 21:51
@huhongbo

Seems to have better performance, but I'm testing with 10000 msg/s pulled from Kafka and InfluxDB hangs without any log information.

@pauldix
Member

pauldix commented Jun 26, 2015

@huhongbo can you give more information? How does it hang? What is your schema? How are you writing data? How many points do you write before it hangs? Basically, we need as much information as you can give. Our testing hasn't shown the issue you're talking about so we'd like to reproduce it if possible.

@huhongbo

https://gist.github.com/huhongbo/4d0882e4e0262dfa6991 has some sample data.
I'm writing the data using the line protocol, batching every 10000 lines into a single write to InfluxDB.
The very strange thing is that I have to empty the db to restore the write speed.
I'm using a machine with 2 CPUs (32 cores at 2.0 GHz), 128 GB of memory, and 2x2TB disks in RAID 1.
