Integration tests #5
Looking really good!
And it looks like you also found a data race. Should I handle it, or do you want to fix it yourself?
FYI I started to create a Go client in this branch which includes some refactoring (message package) that will conflict with this, but should be easy to fix, and this will get merged first anyway.
Re. race ... I was actually wondering during your talk too - is the index really necessary? I mean, segments are more or less of constant size and sorted by time. Since messages have different sizes, binary search is not feasible, but a linear scan is. There is a whole rant at Kafka's design page (https://kafka.apache.org/08/design.html) about the performance of linear disk reads. Given a constant size (c) of a segment, finding the message with a given timestamp would be O(c), which is O(1).
The index has been there since the very beginning; the only reason is that I saw Kafka doing it and, since it seemed easy to replicate, I didn't question it. Then I started to wonder too, but I left it there assuming they knew what they were doing, and over time I started to see the point.
Aside from the faster offset lookup it has some other upsides, like allowing BigLog to be completely format agnostic. It also holds the timestamps, which I think are nice to keep separate from the actual data.
But where I believe the index is going to make the real difference is streaming (WIP).
While performance is stated as a non-goal, scanning one by one makes it impossible to achieve any meaningful throughput the moment there's a little bit of network latency.
Streaming works similarly to how Kafka actually works: the client says "give me as much as you can, up to N offsets and up to X bytes of data", and the server sends it all at once.
The streamer just needs to look at the index to calculate what it needs to send, then pass a LimitedReader directly over the data file to the ResponseWriter. This is more efficient than loading all the requested data into memory before starting to reply, or even loading it just to respond with an error.
Notice also that, for performance, messages are batched and compressed before being written.
That being said, I’d be curious to see an alternative index-less version and compare the two, but to be a fair comparison all features need to be implemented first.
It'd also be interesting to see how a regular index file would work instead of a pre-allocated memory-mapped one.
Re index. Something like the tar format would IMHO remove the need for an index ... so every segment file would have format
Great, welcome to the list of contributors :)
index: We kind of have that format already (except the timestamp) on the netlog side. If you want to experiment, it shouldn't be too difficult to implement an alternative streamer without the index and run some benchmarks.
If you decide to do so, I'd recommend you base it on this branch, where I'm slowly implementing a Go client, since there the message format has already been spun off into its own package and you won't have circular-dependency problems.
I still think the index is going to perform much better (especially on traditional disks). Take into account that you need to tell the client what you are actually sending in the HTTP headers before you start writing the response. You could use HTTP trailers instead, but I fear that would make developing clients more difficult (many people haven't even heard of HTTP trailers).
Maybe I'm wrong though. Or maybe with/without index could be a setting.