
Store API design #184

Open
jptmoore opened this issue Oct 30, 2017 · 23 comments

Comments

@jptmoore
Collaborator

The API in the new store currently looks like the following.

The current implementation supports POST/GET of JSON, text and binary data.

Suggestions welcome on changes/additions.

Key/Value API

Write entry

URL: /kv/<key>
Method: POST
Parameters: JSON body of data, replace <key> with a key
Notes: store data using given key

Read entry

URL: /kv/<key>
Method: GET
Parameters: replace <key> with a key
Notes: return data for given key

Time series API

Write entry

URL: /ts/<id>
Method: POST
Parameters: JSON body of data, replace <id> with an identifier
Notes: add data to time series with given identifier

Read latest entry

URL: /ts/<id>/latest
Method: GET
Parameters: replace <id> with an identifier
Notes: return the latest entry

Read last number of entries

URL: /ts/<id>/last/<n>
Method: GET
Parameters: replace <id> with an identifier, replace <n> with the number of entries
Notes: return the number of entries requested

Read all entries since a time

URL: /ts/<id>/since/<from>
Method: GET
Parameters: replace <id> with an identifier, replace <from> with epoch seconds
Notes: return all entries since the time provided

Read all entries in a time range

URL: /ts/<id>/range/<from>/<to>
Method: GET
Parameters: replace <id> with an identifier, replace <from> and <to> with epoch seconds
Notes: return all entries in the time range provided
@cgreenhalgh

For info, my current use case is importing rides from Strava, but it would be similar for importing tweets, fitbit activities, etc., that all have a well defined time of occurrence (started, posted, ...).

For the time series part of the API

A few clarifications to the details of the current proposal above, assuming it follows store-timeseries:

  • all times are seconds (integer) since UNIX epoch (so sub-second accuracy is not available)
  • the time associated with a value is the store time at which it is written (so it is not possible to provide an explicit time to associate with a value)
  • the values returned by range and since are arrays of values with the store times stripped out (so it is not possible to access the timestamps associated with values)
  • range end time is exclusive

Currently the most-used store is the store-json (API documented there) which differs in a few ways including:

  • times are milliseconds (integer) since UNIX epoch
  • write accepts a JSON object with fields data (the value) and timestamp (explicit timestamp, optional) or raw value
  • range end time is inclusive
  • all read operations (latest, range and since) return (arrays of) JSON objects with data and timestamp (and also datasource_id, but that isn't so useful externally except for subscriptions)

time series proposal

I would propose:

  • choosing a standard time representation that allows better than second accuracy, e.g. float64 seconds since UNIX epoch
  • allowing explicit timestamps to be specified when writing values, perhaps using the approach in the current store-json (will increase code compatibility)
  • returning values with explicit timestamps as in store-json
  • confirm if range end time is exclusive

In addition I suggest:

  • there should be a bulk write operation that takes an array of timestamped values
  • there should be some support for pagination, i.e. an optional limit on the number of entries returned by since and range (but unlike the current store-timeseries this should be after filtering, not before)

key value API

I haven't really used this so I won't make a concrete proposal, but it has always seemed odd to me that from a datasource perspective the datasource id IS the key, so it is really just a single value store, not a key-value store! It would seem to make more sense if there were, e.g.:

  • two levels of key, the first being the datastore ID and second unconstrained (a single default id would replace current use)

And consequently also:

  • a list operation that returned existing second-level keys, probably within some optional key range
  • a delete operation (or a well-defined 'does not exist' value to write), probably within some optional key range
  • a read multi operation that returns a map of second-level keys and values, probably within some optional key range
  • a write multi operation that sets multiple key-values at once

I would probably look at redis for inspiration.

@Toshbrown
Contributor

The next driver I'm writing is an IMAP email driver. A number of apps will want to process the stored data, filtering for addresses and/or keywords in subject and body. The only way to achieve this with the current store would be to retrieve all the emails and parse the data in the app.

So the question is should the store support some kind of query beyond filtering on timestamps?

@jptmoore
Collaborator Author

Thanks Chris. Everything looks reasonable. I will comment further when I tackle the points in more detail in the new store.

@mor1
Contributor

mor1 commented Oct 31, 2017

@jptmoore @cgreenhalgh Thoughts (some of which repeat Chris' comments):

  • Following discussion of time, presumably /latest, /last, /since, /range must refer to the receipt-times rather than the datum-times? (I think this is another argument for having the driver able to interpret at least the timestamp of the source to normalise it.)

  • A well-defined timestamp type seems like a good idea. I would avoid floats myself though -- e.g., Windows FILETIMEs are 100 ns intervals since 1/1/1601 IIRC. There was some possibly-related discussion about this in the Mirage/OCaml worlds a while back, see e.g. Complete the clock interface with .. mirage/mirage-clock#1 and the documentation for Ptime at http://erratique.ch/software/ptime/doc/index. If space considerations were a concern, perhaps some kind of packed 2*32-bit representation where one bit indicates whether a second 32-bit value follows. (But that doesn't make any sense at all if we continue using an ASCII-encoded JSON representation for everything.)

  • On the last point, are we going to enforce JSON encoding everywhere? For things like email, it seems that encodings will stack up pretty badly in that case. (Though perhaps this is just a use case for a BLOB store, to avoid fiddling with the bits in the mails.)

  • @Toshbrown I haven't forgotten about sorting out my offlineimap config (and related) to a container. Will get back to it soon. To expose mail as a useful datasource, will want some robust MIME etc parsing too-- there's a newish OCaml library that's good for this (processed my personal store of ~850k mails fine, and reasonably quickly), let me know if you want the URL (can't recall it right now)...

  • I see the need for pagination from a UI/web point of view -- how will that fit in with datasources where data is to be streamed? (I guess the "store" in question will look different?)

(Other suggestions all sound reasonable though.)

@Toshbrown
Contributor

@mor1 I was going to use this https://github.com/emersion/go-imap, but if you really want an OCaml version, let me know so I won't waste any time implementing it. It looks like email may form part of the risk awareness/communication studies we will be starting here at some point soon.

@mor1
Contributor

mor1 commented Oct 31, 2017

@Toshbrown Not so bothered about an OCaml version per se-- there's an OCaml IMAP implementation (a couple I think in fact), but it's the MIME parsing that I was mentioning particularly. It's insanely complicated to get right, but absolutely necessary if you want to robustly process mail contents (rather than just transport mails to/from/between servers).

  • ocaml-imap
  • imaplet -- less actively developed than the above I believe (was for a PhD, recently completed)
  • MrMime

@Toshbrown
Contributor

Toshbrown commented Oct 31, 2017

@mor1 so you're thinking about doing MIME parsing in the store?

I was going to do it in the driver, then link using UUIDs for the binary parts.

This is getting a bit off topic; I will create an issue to discuss the details of the IMAP driver.

@mor1
Contributor

mor1 commented Oct 31, 2017

@Toshbrown not in the store per se, but you may want to explicitly put the results of having extracted content from the mail into the store (attachments, mail headers, etc). or perhaps in a derived store rather than that associated with the imap (email) driver. even just extracting attachments robustly is a surprising pita. (agree this is off-topic though.)

@cgreenhalgh

See also me-box/core-export-service#28 on a possible job queue store API that might also be supported.

Also worth noting that the current store-json subscription API isn't shown here, and will need to be supported (or something like it)

@jptmoore
Collaborator Author

jptmoore commented Nov 1, 2017

@cgreenhalgh

I made some changes below (needs testing and error handling, e.g. reporting path errors back to the client, etc.)

You can try out the changes from the docker client/server

choosing a standard time representation that allows better than second accuracy, e.g. float64 seconds since UNIX epoch

Using milliseconds since epoch now

allowing explicit timestamps to be specified when writing values, perhaps using the approach in the current store-json (will increase code compatibility)

You can post with URL: /ts/[id]/at/[time] to specify your own time

returning values with explicit timestamps as in store-json

Timestamps are returned with data like this [1509564588450, [1,2,3,4,5]]

confirm if range end time is exclusive

The end range is now inclusive.

@Toshbrown
Contributor

You can post with URL: /ts/[id]/at/[time] to specify your own time

does this overwrite the internal timestamp? and are there any constraints on its format?

@jptmoore
Collaborator Author

jptmoore commented Nov 2, 2017

does this overwrite the internal timestamp? and are there any constraints on its format?

Yes, it overwrites the internal one. It is an integer of epoch milliseconds.

@cgreenhalgh

@jptmoore On the updated API...

Returning the time-value in a heterogeneous array (aka an array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??

The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.

@mor1 On the type of time I know ns since (an) Epoch was mentioned, but a caution I would give about that is that a sufficient range of values can't be exactly represented in a float64, which is all that some languages will use for numbers (e.g. JavaScript, max safe integer 2^53-1). Milliseconds is OK with me. I think microseconds might also fit but is rather non-standard.

When you say we can try them, I thought the new store only supported the zeromq transport, but afaik the go client library for this doesn't exist/isn't complete yet? ( @Toshbrown ?)

@Toshbrown
Contributor

Toshbrown commented Nov 2, 2017

@cgreenhalgh I've started updating the go library Toshbrown/lib-go-databox. I've got basic KV and TS reads and writes working with tokens inside the databox; example code is in Toshbrown/driver-tplink-smart-plug.

It's not ready to go yet. I need to add the observe API, think about the API exposed to app/driver developers, and turn the handle on the rest of the endpoints once they are stable. I'm thinking it will be mid next week before I get a chance to finish it (working on other projects until the 7th of Nov).

By trying I think @jptmoore is referring to the client and server he uses outside of the databox for testing here. It's all wrapped in docker containers and allows all the functionality to be tested.

@jptmoore
Collaborator Author

jptmoore commented Nov 2, 2017

@cgreenhalgh

Returning the time-value in a heterogeneous array (aka array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in go, which I'm using at the moment). It's not impossible but it is a hastle. It may also reduce consistency with the notification type/values??

Do you have an example of some JSON you would like to be returned?

The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.

Yeh, [1,2,3,4,5] is the JSON data POSTed. The API takes any JSON as the value.

@cgreenhalgh

@jptmoore

Do you have an example of some JSON you would like to be returned?

Current store-json uses (for your example) {"timestamp":1509564588450, "data":[1,2,3,4,5]}. Or you could use shorter property names ("ts"/"t","data"/"d"??) if you are worried about the byte count, not compressing and not too bothered about readability :-) Obviously a little more overhead than the array/tuple encoding, so that's the trade-off.

@jptmoore
Collaborator Author

jptmoore commented Nov 2, 2017

@cgreenhalgh

I pushed a new image which returns in this format:

{"timestamp":1509626879783,"data":[1,2,3]}

@Toshbrown
Contributor

@jptmoore While updating lib-go-databox I was trying the new /ts/<id>/at/<t>

and requested permissions like this from the arbiter:

{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/*","method":"POST"}

These are granted by the arbiter but rejected by the store. Do you parse wildcards in the macaroon caveats?

requesting permissions like this:

{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/at/1509641953193165","method":"POST"}

works fine, but this means that the macaroons can't be cached when using this endpoint.

@jptmoore
Collaborator Author

jptmoore commented Nov 2, 2017

@Toshbrown

works fine, but this means that the macaroons can't be cached when using this endpoint.

Yeh, currently it is matching the exact path so I will need to implement wildcards.

@jptmoore
Collaborator Author

jptmoore commented Nov 3, 2017

@Toshbrown

I have pushed a new image which should support wildcards.

@cgreenhalgh

cgreenhalgh commented Nov 20, 2017

@jptmoore I'd like to push hard for a bulk add operation in the timeseries API. I know @Toshbrown hit this as a (speed) limitation with the google takeout import, and I was struggling with a simple performance test (adding 1,000s of items in a reasonable time - even activity/heartrate at 1/minute = 1440/day...). Having this in the API amortises the overhead of the request/response communication and also opens the option of handling the set of values within a single transaction/commit in the datastore for further optimisation.

Perhaps

  • POST
  • to /ts/[id]/import.
  • Body should be JSON array of timestamp/data records (e.g. [{"timestamp":1509626879783,"data":[1,2,3]},...]).

I'm not sure about generating events: should each value generate an event, or only the last value, or should it generate a distinct import event, or nothing? For my current use cases I don't need any events (I'd be updating a parallel KV store with source metadata in parallel and could use the event from that).

It also raises a question (in my mind, at least) about whether the existing write entry point should change, e.g. to POST ts/[id]/value.

@jptmoore
Collaborator Author

@cgreenhalgh could you give me a sample of the bulk JSON data you have to test with please.

@cgreenhalgh

I assume the same kind of thing as you get back from a range query, e.g. for a simple value

[
  { "timestamp": 1509626879783, "data": 14.5 },
  { "timestamp": 1509626880783, "data": 14.8 },
  { "timestamp": 1509626881783, "data": 16.0 },
  { "timestamp": 1509626882783, "data": 16.5 }
]

or for a complex value

[
  { "timestamp": 1509626879783, "data": { "event": "event type 1", "value": 42, "content": "something" } },
  { "timestamp": 1509626880783, "data": { "event": "event type 1", "value": 44, "content": "nothing" } },
  { "timestamp": 1509626881783, "data": { "event": "event type 2", "content": "smells a bit" } },
  { "timestamp": 1509626882783, "data": { "event": "event type 1", "value": 48, "content": "something again" } }
]
