
Usage discussion: VersionStore vs TickStore, allowed options for VersionStore.write.. #473

Closed
rueberger opened this issue Dec 20, 2017 · 19 comments


@rueberger

rueberger commented Dec 20, 2017

First of all - my thanks to the maintainers. This library is exactly what I was looking for and looks very promising.

I've been having a bit of trouble figuring out how to use arctic optimally, though. I've been following the examples in /howto, which are... sparse. Is there somewhere else I might find examples or docs?

Now, some dumb questions about VersionStore and TickStore:

  • I've noticed that every time I write to a VersionStore, an entirely new version is created. Are finer-grained options for versioning available? For instance, I would like to write streaming updates to a single version, only incrementing the version when manually specified. I tried just passing version=1 to lib.write, but this doesn't seem to be supported.
  • In what scenarios might one want to use VersionStore vs TickStore? It's not clear to me what the differences are from the README or the code.
  • My current use case is primarily as a database for streams - is TickStore recommended for this use case? Is there a reason one might want to use VersionStore for this?
  • Is TickStore appropriate for data which may have more than one row for each timestamp (event data)? Nope, not allowed by TickStore

Thanks in advance for your help and patience!

@bmoscon
Collaborator

bmoscon commented Dec 20, 2017

  1. I don't think you can avoid writing a new version each time. You can tell it to remove old versions with prune_previous_version=True on writes (see the sketch after this list).

  2. tickstore is for constant streams of data; version store is for working with data (i.e. playing around with it). It keeps versions so you can 'undo' changes and keep track of updates over time. It sounds like you'd want to use tickstore.

  3. Tickstore is very limited (i.e. no strings, querying by date range is very specific, etc.), and it is only fast and efficient if you do the "bulking" yourself. If you append an update every time you get one, performance will be very bad. You need to cache updates and write them at an interval that makes sense for your retrieval. Version Store will internally cache updates in mongo and, when they are large enough, compress them and re-write the symbol's data.

  4. I thought tickstore supported multi indexes, but maybe not? (I don't use it much.) That being said, you can also write a new symbol for new data with overlapping timestamps.
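
To make point 1 concrete, a minimal sketch (host, library, and symbol names here are made up):

```python
# Minimal sketch: each write() still creates a new version, but
# prune_previous_version asks VersionStore to discard superseded
# versions as it goes. 'user.demo' and 'SYM' are placeholder names.
import pandas as pd
from arctic import Arctic, VERSION_STORE

store = Arctic('localhost')
store.initialize_library('user.demo', lib_type=VERSION_STORE)
lib = store['user.demo']

df = pd.DataFrame({'price': [1.0, 2.0]},
                  index=pd.date_range('2017-12-20', periods=2, freq='T'))
lib.write('SYM', df, prune_previous_version=True)
```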

I'll be working on our long-open issue #3 this weekend and over the holiday break, so look for some docs in the upcoming docs/ folder very soon...

I'm going to mark this as a dupe of #3 and close it

@bmoscon bmoscon closed this as completed Dec 20, 2017
@rueberger
Author

Thanks for the quick response!

Glad to hear there are more docs coming; looking forward to that!

One more question...

What would you recommend for storing streaming event data, where you may have more than one row for each timestamp (e.g. order data)? Is this use case supported by arctic?

@jamesblackburn
Contributor

What's the granularity of your timestamps? We use this library for tabular data here, so we can end up with relatively wide rows. You can append multiple times with the same timestamp if you like.
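
For example, something along these lines should work against a VersionStore library (library and symbol names below are illustrative):

```python
# Sketch: appending two rows that share one timestamp.
# 'user.events' and 'ORDERS' are made-up names.
import pandas as pd
from arctic import Arctic

lib = Arctic('localhost')['user.events']
ts = pd.Timestamp('2017-12-20 09:30:00')
batch = pd.DataFrame({'side': ['buy', 'sell'], 'qty': [100, 50]},
                     index=[ts, ts])      # duplicate index values
lib.append('ORDERS', batch, upsert=True)  # upsert creates the symbol if absent
```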

We use VersionStore for minute-bar data and lower frequencies. We append rows across thousands of instruments every minute using VersionStore. I wouldn't worry too much about incrementing the version count; the library should look after pruning old versions for you.

We use TickStore for streaming tick data (where we do our writes using a Java version of this API).

@rueberger
Author

Not exactly sure what you mean by granularity, but using timestamps with one-second resolution, there could potentially be thousands of rows sharing a timestamp. At microsecond resolution it should be more reasonable, but there's no guarantee of uniqueness.

> I thought tickstore supported multi indexes, but maybe not? (I don't use it much.) That being said, you can also write a new symbol for new data with overlapping timestamps.

I came to that conclusion on the basis of a really quick test where I got an error trying to insert such rows with a default-settings TickStore. I'd better take a closer look.

@bmoscon
Collaborator

bmoscon commented Dec 29, 2017

He means how frequent the ticks are.

@rueberger
Author

This is for a stream of event data, so there is no fixed frequency. The events have real-valued timestamps, although I'm afraid I couldn't say what the precision is.

@bmoscon
Collaborator

bmoscon commented Jan 9, 2018

I think tickstore is a fine choice then, but it's better to batch the writes up using something like kinesis/kafka. I think @jamesblackburn has some slides, or a link to slides, where he describes how this is done internally.
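
A rough sketch of that pattern with kafka-python; the topic, library, and message layout are assumptions, and the consumer config is stripped to a minimum:

```python
# Drain a Kafka topic and flush the buffered ticks to Arctic once a minute.
import json
import time

import pandas as pd
from arctic import Arctic
from kafka import KafkaConsumer   # pip install kafka-python

lib = Arctic('localhost')['user.ticks']   # assumed TICK_STORE library
consumer = KafkaConsumer('ticks', value_deserializer=json.loads)

buffer, last_flush = [], time.time()
for msg in consumer:
    tick = msg.value                       # e.g. {'index': ..., 'price': ...}
    # TickStore expects a tz-aware datetime under the 'index' key
    tick['index'] = pd.Timestamp(tick['index'], tz='UTC')
    buffer.append(tick)
    if buffer and time.time() - last_flush >= 60:
        lib.write('SYM', buffer)           # one bulk write per minute
        buffer, last_flush = [], time.time()
```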

@rueberger
Author

Thanks @bmoscon, it would be really useful to hear how you guys handle batching internally.

Recently I've been worrying that, because of the batching/buffering difficulties, we might not be able to use Arctic after all. We don't really have a good way of batching writes before passing them to Arctic, and we really want to treat it just like any database engine that intelligently buffers writes. Which is a shame, because it is otherwise so ideally suited to our purposes!

Is the design of arctic amenable to exposing the client as a lightweight service that can do buffering? There's no good way to do it purely from the client side.

It seems to me that in the niche of lightweight timeseries databases, Arctic has little competition, and that some simple buffering would put it in solid competition with some of the much heavier solutions like OpenTSDB and Druid.

I would be all too happy to help if it saves me from having to maintain an OpenTSDB deployment.

@bmoscon
Collaborator

bmoscon commented Jan 9, 2018

That being said, version store will batch data and compress it at certain intervals, so it may be fine for your use case.

@jamesblackburn
Contributor

VersionStore is pretty good up to minute-frequency data. It can stretch further depending on how many symbols you are writing simultaneously. The batching frequency is currently a hard-coded constant but can be changed to suit your use case.

@rueberger
Author

Thanks for the info.

I have followed the same approach and am now using Kafka to buffer writes. Any chance you could comment on the best practices for using Arctic in this way? What's the time complexity of appending to tick stores?

@sfkiwi

sfkiwi commented Aug 28, 2018

@rueberger How did your implementation go using Kafka? Are you still using it? I'm attempting to capture order book data, and I got a little concerned when I saw the comment above about

> it being good up to minute frequency data

but I think this was only referring to VersionStore and not to TickStore.

How does Arctic compare to, say, just dumping the tick stream to a Cassandra cluster, which is also fairly lightweight and has excellent write speeds?

@bmoscon
Collaborator

bmoscon commented Aug 28, 2018

It's easy to set up something like redis to batch the data for writes every minute.
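
For example (key, library, and symbol names are made up, and a production version would want a crash-safe flush):

```python
# Writer process: drain a redis list once a minute and bulk-write to Arctic.
import json
import time

import pandas as pd
import redis
from arctic import Arctic

r = redis.Redis()
lib = Arctic('localhost')['user.ticks']   # assumed TICK_STORE library

while True:
    time.sleep(60)
    pipe = r.pipeline()
    pipe.lrange('ticks', 0, -1)           # read everything buffered so far
    pipe.delete('ticks')                  # ...and clear it in the same step
    raw, _ = pipe.execute()
    if raw:
        ticks = [json.loads(t) for t in raw]
        for t in ticks:
            t['index'] = pd.Timestamp(t['index'], tz='UTC')
        lib.write('SYM', ticks)
```

Producers just RPUSH JSON-encoded ticks onto the same key; deleting the key in the pipeline doubles as the periodic cleanup.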

@sfkiwi

sfkiwi commented Aug 28, 2018

Presumably that would only work if your incoming stream is bursting beyond the DB write throughput; if it's constantly sending hundreds of rows a second, you'll eventually run out of memory in the redis cache.

@rueberger
Author

I no longer use arctic at all - I built a lightweight timeseries store on top of mongo and do indeed use kafka to decouple data harvesting and insertion. It works great, but it's a big bespoke solution, so I'm not sure I can recommend that route unless you're doing exactly the same thing as me.

@sfkiwi

sfkiwi commented Aug 28, 2018

That depends on what you're doing :) It sounds like we might be, though, given your initial comment about storing order data.

@bmoscon
Collaborator

bmoscon commented Aug 28, 2018

I use redis to batch updates to arctic; it works just fine. You just need to periodically clean out the written data from redis.

@saeedamen

It might be worth having a look at Redis's new streams functionality for producer/consumer usage (similar to using Kafka). I've started to use it to batch the data in preparation for dumping it to Arctic.
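
A sketch of that idea with redis-py (stream and field names are assumptions):

```python
# Redis streams (redis >= 5.0) as the tick buffer instead of a plain list.
import redis

r = redis.Redis()

# Producer side: append each tick as a stream entry.
r.xadd('ticks', {'ts': '2018-08-28T09:30:00Z', 'price': '101.5'})

# Consumer side: read everything newer than the last-seen id, then decode
# and batch-write to Arctic as in the earlier sketches.
last_id = '0-0'
for _stream, messages in r.xread({'ticks': last_id}, block=1000):
    for msg_id, fields in messages:
        last_id = msg_id          # remember progress between reads
        # ...decode `fields` and buffer for the next Arctic write
```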
