
Add roaring bitmaps to TSI index files. #10122

Merged

benbjohnson merged 8 commits into master from bj-tsi-roaring on Jul 31, 2018
Conversation

Contributor

@benbjohnson commented Jul 24, 2018

Problem

Planning time for high-cardinality data sets can consume a majority of the total query time. We previously added support for building series id sets (which wrap roaring bitmaps), which improved set operations between indices; however, planning time is now dominated by the cost of building those sets at query time.

Solution

This pull request uses the mmap support added in influxdata/roaring#1 and embeds series id sets directly into the TSI index for measurements and tag values. This allows us to reference the data in the mmap file directly so that sets do not need to be built at query time.

The flag field on measurement and tag value entries is used to remain backward compatible with the older file format.

TODO

  • Additional performance testing
  • Backwards compatibility tests
Required for all non-trivial PRs
  • Sign CLA (if not already signed)

@benbjohnson benbjohnson added this to the 1.7 milestone Jul 24, 2018
@benbjohnson benbjohnson self-assigned this Jul 24, 2018
@benbjohnson benbjohnson requested a review from e-dard July 24, 2018 17:06
@ghost ghost added the review label Jul 24, 2018
Contributor

@e-dard left a comment

Just some general nits/suggestions at the moment.

@@ -185,6 +187,13 @@ func (p IndexFiles) CompactTo(w io.Writer, sfile *tsdb.SeriesFile, m, k uint64,
return n, err
}

// Ensure block is word aligned.
if offset := (n) % 8; offset != 0 {
Contributor

are those parentheses necessary?

Contributor Author

The n needed a hug. Removed in 8d66027.

@@ -545,8 +548,17 @@ func uvarint(data []byte) (value uint64, n int, err error) {
return
}

// memalign returns data if its memory address is word aligned.
Contributor

Can't this be 4 bytes for 32-bit architectures? We could put the alignment size behind a build flag.

Contributor Author

I think roaring only uses popcnt for amd64 so the alignment only needs to be for 64-bit. Switching for architectures would only save us a few bytes so I don't think it's worth making it GOARCH dependent.

mm.seriesIDSet = tsdb.NewSeriesIDSet()
}
for _, seriesID := range seriesIDs {
mm.seriesIDSet.AddNoLock(seriesID)
Contributor

Might be faster to add these in bulk. I will write a benchmark and see.

fs, err := p.RetainFileSet()
if err != nil {
return nil // TODO(edd): this should probably return an error.
return nil, err // TODO(edd): this should probably return an error.
Contributor

Can you remove this comment 😄

Contributor Author

Removed in 8d66027.

series struct {
n uint64 // Series count
data []byte // Raw series data
}

seriesIDSetData []byte
Contributor

Maybe a little comment about this being used instead of series above when bitmaps are in use?

Contributor Author

Added a comment in 8d66027.

return ss, nil
}

// Otherwise decode series ids from uvarint encoding.
Contributor

Can't we memoize this? Otherwise we're going to be building a series id set each time the block is read until the next TSI compaction kicks in and converts to the bitmap.

Contributor Author

It doesn't seem like it's worth trying to optimize index files that are just going to get recompacted eventually. It also could add a lot to the heap if there are a lot of tag key values.

s.RLock()
defer s.RUnlock()

var a []uint64
Contributor

This can be sized to s.bitmap.GetCardinality().

Contributor

Having said that, I think this is only called from SeriesIDs() on the tag value block, which is only called from within tests.

Given that, I think it would be better to return a []uint32 here and then have the caller do the slice allocation and conversion. The reason being that if we expose this API here it may end up getting consumed, and it allocates a []uint64 each time.

Contributor Author

The rest of the SeriesIDSet API uses []uint64 so it seems strange to me to return []uint32.

// References to the underlying data are used so data should not be reused by caller.
func (s *SeriesIDSet) UnmarshalBinaryUnsafe(data []byte) error {
s.RLock()
defer s.RUnlock()
Contributor

Should this be a write lock?

Contributor

@zeebo I think we only need to protect access to bitmap, and not worry about mutations happening inside the roaring library.

Contributor

Ok. I asked because UnmarshalBinary above grabbed the write lock.

Contributor

Oh I see what you mean now. I guess this should be a lock, though it shouldn’t ever be called concurrently with another method on the type.

Contributor Author

Fixed in cb828f0. 👍
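The distinction being settled here, in sketch form (a toy map-backed set standing in for the roaring bitmap): methods that replace internal state must take the write lock, while read-only queries can share the read lock.

```go
package main

import (
	"fmt"
	"sync"
)

// idSet is a minimal stand-in for SeriesIDSet, used only to illustrate the
// locking rule discussed above.
type idSet struct {
	mu   sync.RWMutex
	bits map[uint64]struct{} // stand-in for the roaring bitmap
}

// unmarshal mutates the set's internal state, so it takes the write lock.
func (s *idSet) unmarshal(ids []uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.bits = make(map[uint64]struct{}, len(ids))
	for _, id := range ids {
		s.bits[id] = struct{}{}
	}
}

// contains is read-only, so the read lock suffices.
func (s *idSet) contains(id uint64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.bits[id]
	return ok
}

func main() {
	s := &idSet{}
	s.unmarshal([]uint64{7, 9})
	fmt.Println(s.contains(7), s.contains(8)) // → true false
}
```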

Contributor

@e-dard commented Jul 27, 2018

@benbjohnson can we see if we can have a more graceful/helpful exit if the user downgrades? I started up a previous version after using this branch and hit:

⇒  INFLUXDB_DATA_INDEX_VERSION=tsi1 INFLUXDB_DATA_CACHE_MAX_MEMORY_SIZE=5g ossinfluxd-v1.6.1rc1

 8888888           .d888 888                   8888888b.  888888b.
   888            d88P"  888                   888  "Y88b 888  "88b
   888            888    888                   888    888 888  .88P
   888   88888b.  888888 888 888  888 888  888 888    888 8888888K.
   888   888 "88b 888    888 888  888  Y8bd8P' 888    888 888  "Y88b
   888   888  888 888    888 888  888   X88K   888    888 888    888
   888   888  888 888    888 Y88b 888 .d8""8b. 888  .d88P 888   d88P
 8888888 888  888 888    888  "Y88888 888  888 8888888P"  8888888P"

2018-07-27T15:07:13.302124Z     info    InfluxDB starting       {"log_id": "09Z4Jj5W000", "version": "unknown", "branch": "unknown", "commit": "unknown"}
2018-07-27T15:07:13.302154Z     info    Go runtime      {"log_id": "09Z4Jj5W000", "version": "go1.10", "maxprocs": 24}
2018-07-27T15:07:13.402618Z     info    Using data dir  {"log_id": "09Z4Jj5W000", "service": "store", "path": "/home/edd/.influxdb/data"}
2018-07-27T15:07:13.402647Z     info    Compaction settings     {"log_id": "09Z4Jj5W000", "service": "store", "max_concurrent_compactions": 12, "throughput_bytes_per_second": 50331648, "throughput_burst_bytes": 50331648}
2018-07-27T15:07:13.402666Z     info    Open store (start)      {"log_id": "09Z4Jj5W000", "service": "store", "trace_id": "09Z4JjUW001", "op_name": "tsdb_open", "op_event": "start"}
2018-07-27T15:07:13.407542Z     info    Reading file    {"log_id": "09Z4Jj5W000", "engine": "tsm1", "service": "cacheloader", "path": "/home/edd/.influxdb/wal/_internal/monitor/1/_00001.wal", "size": 562990}
2018-07-27T15:07:14.065901Z     info    Opened shard    {"log_id": "09Z4Jj5W000", "service": "store", "trace_id": "09Z4JjUW001", "op_name": "tsdb_open", "index_version": "tsi1", "path": "/home/edd/.influxdb/data/_internal/monitor/1", "duration": "662.420ms"}
panic: runtime error: index out of range [recovered]
        panic: [Index file: /home/edd/.influxdb/data/stress/autogen/3/index/5/L4-00000079.tsi] runtime error: index out of range

goroutine 7 [running]:
github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Open.func1(0xc4200f65a0)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index_file.go:117 +0xff
panic(0xd4a7c0, 0x1419310)
        /usr/local/go/src/runtime/panic.go:505 +0x229
encoding/binary.binary.bigEndian.Uint64(...)
        /usr/local/go/src/encoding/binary/binary.go:124
github.com/influxdata/influxdb/tsdb/index/tsi1.(*MeasurementBlockElem).UnmarshalBinary(0xc420290000, 0x7fb22ff3be8b, 0x5, 0xc3d95, 0x7fb22fe881eb, 0xc4320fac10)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/measurement_block.go:419 +0x326
github.com/influxdata/influxdb/tsdb/index/tsi1.(*blockMeasurementIterator).Next(0xc420290000, 0xc4320fac00, 0xc4320e4c78)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/measurement_block.go:182 +0x5c
github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).UnmarshalBinary(0xc4200f65a0, 0x7fb22da98000, 0x2567c20, 0x2567c20, 0x2567c20, 0x2567c20)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index_file.go:216 +0x49f
github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Open(0xc4200f65a0, 0x0, 0x0)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index_file.go:129 +0x10b
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).openIndexFile(0xc42050c7e0, 0xc4320f7f40, 0x41, 0xc4320f7f40, 0x41, 0xc42ca6c330)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:264 +0x6f
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Open(0xc42050c7e0, 0x0, 0x0)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/partition.go:206 +0x7b1
github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Open.func1(0xc4210ae114, 0x8, 0xc4200fd900, 0xc4201da780, 0x1)
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:257 +0x30
created by github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Open
        /home/edd/go/src/github.com/influxdata/influxdb/tsdb/index/tsi1/index.go:251 +0x5b0
edd@work:~/go/src/github.com/influxdata/influxdb|master

Maybe add a message explaining that they need to delete their indexes and rebuild them?

Contributor

@e-dard commented Jul 27, 2018

I think we need to hold off merging this until we figure a couple of things out.

TSI Compactions

I'm seeing around a 10% reduction in series insertion performance, which I think is down to the TSI compactions.

Here are a couple of flame graphs from a run where I'm inserting 30M series. Both profiles were taken after about 19M series had been inserted.

Archive.zip

Disk Usage

I need to dig into this more but it looks like this PR is taking up more space on disk than 1.6.1. I would expect a compressed bitmap with 30M/8 series to take up less than writing out the binary encoding of the series ids.

1.6.1

edd@work:~|⇒  du -h ~/.influxdb/data/stress
121M    /home/edd/.influxdb/data/stress/autogen/2/index/6
122M    /home/edd/.influxdb/data/stress/autogen/2/index/3
122M    /home/edd/.influxdb/data/stress/autogen/2/index/5
122M    /home/edd/.influxdb/data/stress/autogen/2/index/7
122M    /home/edd/.influxdb/data/stress/autogen/2/index/4
122M    /home/edd/.influxdb/data/stress/autogen/2/index/1
122M    /home/edd/.influxdb/data/stress/autogen/2/index/0
122M    /home/edd/.influxdb/data/stress/autogen/2/index/2
972M    /home/edd/.influxdb/data/stress/autogen/2/index
3.6G    /home/edd/.influxdb/data/stress/autogen/2
3.6G    /home/edd/.influxdb/data/stress/autogen
256M    /home/edd/.influxdb/data/stress/_series/02
256M    /home/edd/.influxdb/data/stress/_series/04
256M    /home/edd/.influxdb/data/stress/_series/01
256M    /home/edd/.influxdb/data/stress/_series/00
256M    /home/edd/.influxdb/data/stress/_series/05
256M    /home/edd/.influxdb/data/stress/_series/03
256M    /home/edd/.influxdb/data/stress/_series/06
256M    /home/edd/.influxdb/data/stress/_series/07
2.0G    /home/edd/.influxdb/data/stress/_series
5.6G    /home/edd/.influxdb/data/stress

PR

edd@work:~|⇒  du -h ~/.influxdb/data/stress
192M    /home/edd/.influxdb/data/stress/autogen/3/index/6
191M    /home/edd/.influxdb/data/stress/autogen/3/index/3
191M    /home/edd/.influxdb/data/stress/autogen/3/index/5
191M    /home/edd/.influxdb/data/stress/autogen/3/index/7
191M    /home/edd/.influxdb/data/stress/autogen/3/index/4
191M    /home/edd/.influxdb/data/stress/autogen/3/index/1
191M    /home/edd/.influxdb/data/stress/autogen/3/index/0
191M    /home/edd/.influxdb/data/stress/autogen/3/index/2
1.5G    /home/edd/.influxdb/data/stress/autogen/3/index
4.1G    /home/edd/.influxdb/data/stress/autogen/3
4.1G    /home/edd/.influxdb/data/stress/autogen
256M    /home/edd/.influxdb/data/stress/_series/02
256M    /home/edd/.influxdb/data/stress/_series/04
256M    /home/edd/.influxdb/data/stress/_series/01
256M    /home/edd/.influxdb/data/stress/_series/00
256M    /home/edd/.influxdb/data/stress/_series/05
256M    /home/edd/.influxdb/data/stress/_series/03
256M    /home/edd/.influxdb/data/stress/_series/06
256M    /home/edd/.influxdb/data/stress/_series/07
2.0G    /home/edd/.influxdb/data/stress/_series
6.1G    /home/edd/.influxdb/data/stress

Contributor

@e-dard commented Jul 27, 2018

Actually, on the disk usage I was thinking about this the wrong way around.

With 30M series under one tag (which is what I did in this case), we end up with 30M separate bitmaps. I expect each of those to take up more space than the 8 bytes for a single series ID.

return n, err
}
}
// if offset := n % 8; offset != 0 {
Contributor

Can we remove the commented code if we don't need it?

@ghost ghost assigned e-dard Jul 31, 2018
@benbjohnson benbjohnson merged commit 99e2ad6 into master Jul 31, 2018
@ghost ghost removed the review label Jul 31, 2018
@e-dard e-dard modified the milestones: 1.7, 1.6.1 Jul 31, 2018
@e-dard e-dard deleted the bj-tsi-roaring branch January 7, 2019 11:54