
Add roaring bitmaps to TSI index files. #10122

Merged

benbjohnson merged 8 commits into master from bj-tsi-roaring on Jul 31, 2018
Conversation

Contributor

@benbjohnson commented Jul 24, 2018

Problem

Planning time for high-cardinality data sets can consume a majority of the total query time. We previously added support for building series id sets (which wrap roaring bitmaps), which improved set operations between indices; however, planning time is now dominated by the cost of building those sets at query time.

Solution

This pull request uses the mmap support added in influxdata/roaring#1 and embeds series id sets directly into the TSI index for measurements and tag values. This allows us to reference the data in the mmap file directly so that sets do not need to be built at query time.

The flag field on measurement and tag value entries is used to remain backward compatible with the older file format.

TODO

  • Additional performance testing
  • Backwards compatibility tests
Required for all non-trivial PRs
  • Sign CLA (if not already signed)

@benbjohnson benbjohnson added this to the 1.7 milestone Jul 24, 2018
@benbjohnson benbjohnson self-assigned this Jul 24, 2018
@benbjohnson benbjohnson requested a review from e-dard July 24, 2018 17:06
@ghost ghost added the review label Jul 24, 2018
Contributor

@e-dard left a comment

Just some general nits/suggestions at the moment.

@@ -185,6 +187,13 @@ func (p IndexFiles) CompactTo(w io.Writer, sfile *tsdb.SeriesFile, m, k uint64,
return n, err
}

// Ensure block is word aligned.
if offset := (n) % 8; offset != 0 {
Contributor

are those parentheses necessary?

Contributor Author

The n needed a hug. Removed in 8d66027.

@@ -545,8 +548,17 @@ func uvarint(data []byte) (value uint64, n int, err error) {
return
}

// memalign returns data if its memory address is word aligned.
Contributor

Can't this be 4 bytes for 32-bit architectures? We could put the alignment size behind a build flag.

Contributor Author

I think roaring only uses popcnt for amd64 so the alignment only needs to be for 64-bit. Switching for architectures would only save us a few bytes so I don't think it's worth making it GOARCH dependent.

mm.seriesIDSet = tsdb.NewSeriesIDSet()
}
for _, seriesID := range seriesIDs {
mm.seriesIDSet.AddNoLock(seriesID)
Contributor

Might be faster to add these in bulk. I will write a benchmark and see.

fs, err := p.RetainFileSet()
if err != nil {
return nil // TODO(edd): this should probably return an error.
return nil, err // TODO(edd): this should probably return an error.
Contributor

Can you remove this comment 😄

Contributor Author

Removed in 8d66027.

series struct {
n uint64 // Series count
data []byte // Raw series data
}

seriesIDSetData []byte
Contributor

Maybe a little comment about this being used instead of series above when bitmaps are in use?

Contributor Author

Added a comment in 8d66027.

return ss, nil
}

// Otherwise decode series ids from uvarint encoding.
Contributor

Can't we memoize this? Otherwise we're going to be building a series id set each time the block is read until the next TSI compaction kicks in and converts to the bitmap.

Contributor Author

It doesn't seem like it's worth trying to optimize index files that are just going to get recompacted eventually. It also could add a lot to the heap if there are a lot of tag key values.

s.RLock()
defer s.RUnlock()

var a []uint64
Contributor

This can be sized to s.bitmap.GetCardinality().

Contributor

Having said that, I think this is only called from SeriesIDs() on the tag value block, which is only called from within tests.

Given that, I think it would be better to return a []uint32 here and then have the caller do the slice allocation and conversion. The reason being that if we expose this API here it may end up getting consumed, and it allocates a []uint64 each time.

Contributor Author

The rest of the SeriesIDSet API uses []uint64 so it seems strange to me to return []uint32.

// References to the underlying data are used so data should not be reused by caller.
func (s *SeriesIDSet) UnmarshalBinaryUnsafe(data []byte) error {
s.RLock()
defer s.RUnlock()
Contributor

Should this be a write lock?

Contributor

@zeebo I think we only need to protect access to bitmap, and not worry about mutations happening inside the roaring library.

Contributor

Ok. I asked because UnmarshalBinary above grabbed the write lock.

Contributor

Oh I see what you mean now. I guess this should be a lock, though it shouldn’t ever be called concurrently with another method on the type.

Contributor Author

Fixed in cb828f0. 👍
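The distinction being settled here, in sketch form (a toy map-backed set standing in for the roaring bitmap): methods that replace internal state must take the write lock, while read-only queries can share the read lock.

```go
package main

import (
	"fmt"
	"sync"
)

// idSet is a minimal stand-in for SeriesIDSet, used only to illustrate the
// locking rule discussed above.
type idSet struct {
	mu   sync.RWMutex
	bits map[uint64]struct{} // stand-in for the roaring bitmap
}

// unmarshal mutates the set's internal state, so it takes the write lock.
func (s *idSet) unmarshal(ids []uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.bits = make(map[uint64]struct{}, len(ids))
	for _, id := range ids {
		s.bits[id] = struct{}{}
	}
}

// contains is read-only, so the read lock suffices.
func (s *idSet) contains(id uint64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.bits[id]
	return ok
}

func main() {
	s := &idSet{}
	s.unmarshal([]uint64{7, 9})
	fmt.Println(s.contains(7), s.contains(8)) // → true false
}
```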

Contributor

@e-dard commented Jul 27, 2018

@benbjohnson can we see if we can have a more graceful/helpful exit if the user downgrades? I started up a previous version after using this branch and hit:

⇒  INFLUXDB_DATA_INDEX_VERSION=tsi1 INFLUXDB_DATA_CACHE_MAX_MEMORY_SIZE=5g ossinfluxd-v1.6.1rc1

 8888888           .d888 888                   8888888b.  888888b.
   888            d88P"  888                   888  "Y88b 888  "88b
   888            888    888                   888    888 888  .88P
   888   88888b.  888888 888 888  888 888  888 888    888 8888888K.
   888   888 "88b 888    888 888  888  Y8bd8P' 888    888 888  "Y88b
   888   888  888 888    888 888  888   X88K   888    888 888    888
   888   888  888 888    888 Y88b 888 .d8""8b. 888  .d88P 888   d88P
 8888888 888  888 888    888  "Y88888 888  888 8888888P"  8888888P"

2018-07-27T15:07:13.302124Z     info    InfluxDB starting       {"log_id": "09Z4Jj5W000", "version": "unknown", "branch": "unknown", "commit": "unknown"}
2018-07-27T15:07:13.302154Z     info    Go runtime      {"log_id": "09Z4Jj5W000", "version": "go1.10", "maxprocs": 24}
2018-07-27T15:07:13.402618Z     info    Using data dir  {"log_id": "09Z4Jj5W000", "service": "store", "path": "/home/edd/.influxdb/data"}
2018-07-27T15:07:13.402647Z     info    Compaction settings     {"log_id": "09Z4Jj5W000", "service": "store", "max_concurrent_compactions": 12, "throughput_bytes_per_second": 50331648, "throughput_burst_bytes": 50331648}
2018-07-27T15:07:13.402666Z     info    Open store (start)      {"log_id": "09Z4Jj5W000", "service": "store", "trace_id": "09Z4JjUW001", "op_name": "tsdb_open", "op_event": "start"}
2018-07-27T15:07:13.407542Z     info    Reading file    {"log_id": "09Z4Jj5W000", "engine": "tsm1", "service": "cacheloader", "path": "/home/edd/.influxdb/wal/_internal/monitor/1/_00001.wal", "size": 562990}
2018-07-27T15:07:14.065901Z     info    Opened shard    {"log_id": "09Z4Jj5W000", "service": "store", "trace_id": "09Z4JjUW001", "op_name": "tsdb_open", "index_version": "tsi1", "path": "/home/edd/.influxdb/data/_internal/monitor/1", "duration": "662.420ms"}
panic: runtime error: index out of range [recovered]
        panic: [Index file: /home/edd/.influxdb/data/stress/autogen/3/index/5/L4-00000079.tsi] runtime error: index out of range

goroutine 7 [running]:
github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Open.func1(0xc4200f65a0)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index_file.go:117 +0xff
panic(0xd4a7c0, 0x1419310)
        /usr/local/go/src/runtime/panic.go:505 +0x229
encoding/binary.binary.bigEndian.Uint64(...)
        /usr/local/go/src/encoding/binary/binary.go:124
github.com/influxdata/influxdb/tsdb/index/tsi1.(*MeasurementBlockElem).UnmarshalBinary(0xc420290000, 0x7fb22ff3be8b, 0x5, 0xc3d95, 0x7fb22fe881eb, 0xc4320fac10)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/measurement_block.go:419 +0x326
github.com/influxdata/influxdb/tsdb/index/tsi1.(*blockMeasurementIterator).Next(0xc420290000, 0xc4320fac00, 0xc4320e4c78)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/measurement_block.go:182 +0x5c
github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).UnmarshalBinary(0xc4200f65a0, 0x7fb22da98000, 0x2567c20, 0x2567c20, 0x2567c20, 0x2567c20)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index_file.go:216 +0x49f
github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Open(0xc4200f65a0, 0x0, 0x0)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index_file.go:129 +0x10b
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).openIndexFile(0xc42050c7e0, 0xc4320f7f40, 0x41, 0xc4320f7f40, 0x41, 0xc42ca6c330)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:264 +0x6f
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Open(0xc42050c7e0, 0x0, 0x0)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:206 +0x7b1
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Open.func1(0xc4210ae114, 0x8, 0xc4200fd900, 0xc4201da780, 0x1)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:257 +0x30
created by github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Open
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:251 +0x5b0
edd@work:~/go/src/github.com/influxdata/influxdb|master

Maybe add a message explaining that they need to delete their indexes and rebuild them?

Contributor

@e-dard commented Jul 27, 2018

I think we need to hold off merging this until we figure a couple of things out.

TSI Compactions

I'm seeing around a 10% reduction in series insertion performance, which I think is down to the TSI compactions.

Here are a couple of flame graphs from a run where I'm inserting 30M series. Both profiles were taken after about 19M series had been inserted.

Archive.zip

Disk Usage

I need to dig into this more but it looks like this PR is taking up more space on disk than 1.6.1. I would expect a compressed bitmap with 30M/8 series to take up less than writing out the binary encoding of the series ids.

1.6.1

edd@work:~|⇒  du -h ~/.influxdb/data/stress
121M    /home/edd/.influxdb/data/stress/autogen/2/index/6
122M    /home/edd/.influxdb/data/stress/autogen/2/index/3
122M    /home/edd/.influxdb/data/stress/autogen/2/index/5
122M    /home/edd/.influxdb/data/stress/autogen/2/index/7
122M    /home/edd/.influxdb/data/stress/autogen/2/index/4
122M    /home/edd/.influxdb/data/stress/autogen/2/index/1
122M    /home/edd/.influxdb/data/stress/autogen/2/index/0
122M    /home/edd/.influxdb/data/stress/autogen/2/index/2
972M    /home/edd/.influxdb/data/stress/autogen/2/index
3.6G    /home/edd/.influxdb/data/stress/autogen/2
3.6G    /home/edd/.influxdb/data/stress/autogen
256M    /home/edd/.influxdb/data/stress/_series/02
256M    /home/edd/.influxdb/data/stress/_series/04
256M    /home/edd/.influxdb/data/stress/_series/01
256M    /home/edd/.influxdb/data/stress/_series/00
256M    /home/edd/.influxdb/data/stress/_series/05
256M    /home/edd/.influxdb/data/stress/_series/03
256M    /home/edd/.influxdb/data/stress/_series/06
256M    /home/edd/.influxdb/data/stress/_series/07
2.0G    /home/edd/.influxdb/data/stress/_series
5.6G    /home/edd/.influxdb/data/stress

PR

edd@work:~|⇒  du -h ~/.influxdb/data/stress
192M    /home/edd/.influxdb/data/stress/autogen/3/index/6
191M    /home/edd/.influxdb/data/stress/autogen/3/index/3
191M    /home/edd/.influxdb/data/stress/autogen/3/index/5
191M    /home/edd/.influxdb/data/stress/autogen/3/index/7
191M    /home/edd/.influxdb/data/stress/autogen/3/index/4
191M    /home/edd/.influxdb/data/stress/autogen/3/index/1
191M    /home/edd/.influxdb/data/stress/autogen/3/index/0
191M    /home/edd/.influxdb/data/stress/autogen/3/index/2
1.5G    /home/edd/.influxdb/data/stress/autogen/3/index
4.1G    /home/edd/.influxdb/data/stress/autogen/3
4.1G    /home/edd/.influxdb/data/stress/autogen
256M    /home/edd/.influxdb/data/stress/_series/02
256M    /home/edd/.influxdb/data/stress/_series/04
256M    /home/edd/.influxdb/data/stress/_series/01
256M    /home/edd/.influxdb/data/stress/_series/00
256M    /home/edd/.influxdb/data/stress/_series/05
256M    /home/edd/.influxdb/data/stress/_series/03
256M    /home/edd/.influxdb/data/stress/_series/06
256M    /home/edd/.influxdb/data/stress/_series/07
2.0G    /home/edd/.influxdb/data/stress/_series
6.1G    /home/edd/.influxdb/data/stress

Contributor

@e-dard commented Jul 27, 2018

Actually, on the disk usage I was thinking about this the wrong way around.

With 30M series under one tag (which is what I did in this case), we end up with 30M separate bitmaps. I expect each of those to take up more space than the 8 bytes for a single series ID.

return n, err
}
}
// if offset := n % 8; offset != 0 {
Contributor

Can we remove the commented code if we don't need it?

@ghost ghost assigned e-dard Jul 31, 2018
@benbjohnson benbjohnson merged commit 99e2ad6 into master Jul 31, 2018
@ghost ghost removed the review label Jul 31, 2018
@e-dard e-dard modified the milestones: 1.7, 1.6.1 Jul 31, 2018
@e-dard e-dard deleted the bj-tsi-roaring branch January 7, 2019 11:54