RFC0006: RSF version 1 #12

Open
wants to merge 15 commits into
base: master
from

@arnau
Collaborator

commented Apr 5, 2018

Context

RSF has been documented in multiple places but never in a consistent manner. This is an attempt to describe the current state of things and serve as the starting point to create the first RSF specification.

Note that RSF has no versioning system. This is a good opportunity to discuss whether we should add one, so that backwards-incompatible changes can be made without breaking clients.

Guidance to review

Check that the current implementation of RSF is accurately described in this RFC and that the explanation is understandable and unequivocal.

@arnau arnau requested review from nacnudus, michaelabenyohai and bahmady Apr 5, 2018

@arnau

Collaborator Author

commented Apr 5, 2018

/cc @gidsg

@arnau arnau added rfc wip labels Apr 5, 2018

is that diverges from the rest of the API where serialisation is expressed
either via suffix or via media type. The problem with using the same approach,
say `GET /register.rsf` is that we are not providing the same information when
querying `GET /register.json`.

@nacnudus

nacnudus Apr 20, 2018

Does this mean that the endpoint GET /download-rsf doesn't provide the same information as GET /register.rsf, or that the user has to provide different information?

The current implementation uses `GET /download-rsf`. The main issue with that
is that diverges from the rest of the API where serialisation is expressed
either via suffix or via media type. The problem with using the same approach,
say `GET /register.rsf` is that we are not providing the same information when
querying `GET /register.json`.

@arnau

arnau Apr 20, 2018

Author Collaborator

Well, /register.json right now offers some sort of summary-metadata of the register. RSF by nature is the full register. So both things are at odds.

@nacnudus

nacnudus Apr 20, 2018

I didn't know about that endpoint 😬. It goes without saying that I think it should have a different name, e.g. /summary.json.

@arnau

arnau Apr 20, 2018

Author Collaborator

At least it's clearer in intention 👍

say `GET /register.rsf` is that we are not providing the same information when
querying `GET /register.json`.

What is a good name for a resource that represents the whole raw database?

@nacnudus

nacnudus Apr 20, 2018

I think "register" is a good name for a register ;)

GET /register.rsf

@michaelabenyohai

michaelabenyohai May 3, 2018

There is another endpoint called /download-register that downloads a ZIP file containing "the whole database". We should also consider whether we keep that and/or rename it in line with these thoughts.

@michaelabenyohai

michaelabenyohai May 3, 2018

Also note there are other RSF endpoints:

  • /download-rsf/n gets RSF for the register after entry-number n, i.e. it returns the whole register from entry number n+1 to the end of the register.
  • /download-rsf/n/m gets RSF for entries n+1 to m.
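
For illustration, the range semantics described above can be sketched in Python. This is only a sketch under assumptions: `select_entries` is a hypothetical helper, not part of any implementation, and it assumes the entries are held in a list ordered by entry number, with entry 1 at index 0.

```python
def select_entries(entries, n=0, m=None):
    """Return the entries numbered n+1..m, or n+1..end when m is None.

    `entries` is a list ordered by entry number, where entries[0] is
    entry number 1 (an assumption made for this sketch).
    """
    if n < 0 or (m is not None and m < n):
        raise ValueError("expected 0 <= n <= m")
    # Entry number k lives at index k-1, so entries n+1..m are entries[n:m].
    return entries[n:m]
```

With five entries, `select_entries(entries, 2)` mirrors `/download-rsf/2` (entries 3 to 5) and `select_entries(entries, 2, 4)` mirrors `/download-rsf/2/4` (entries 3 and 4).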

@arnau

arnau May 3, 2018

Author Collaborator

Added the description in a new commit.

@arnau

arnau Aug 22, 2018

Author Collaborator

I'm thinking that the /download-register endpoint should be renamed to /archive. With that change, I think having /archive.rsf or, better, /archive -H 'Accept: application/vnd.rsf' would help consolidate resources that are conceptually the same.

To accommodate the filtering that happens in /download-rsf/n we could do something along the lines of /archive.rsf?from=n&to=m.

Thoughts?

/cc @MatMoore
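
As a rough sketch of the proposal above, the path for the filtered resource could be built like this. Both the `/archive.rsf` path and the `from`/`to` parameters are only the suggestion made in this thread, and `archive_rsf_path` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def archive_rsf_path(start=None, end=None):
    """Build the proposed /archive.rsf path with optional from/to filters."""
    query = {}
    if start is not None:
        query["from"] = start
    if end is not None:
        query["to"] = end
    # Only add a query string when at least one filter is present.
    return "/archive.rsf" + ("?" + urlencode(query) if query else "")
```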

@arnau

arnau Aug 22, 2018

Author Collaborator

To be clear, /archive (zip file) would be normative, rsf a non-normative extension.

@MatMoore

MatMoore Aug 23, 2018

That sounds like a good idea - I like the name archive better than register for this purpose, and it makes sense for the zip & rsf to be different representations of the same thing.


### Commands

#### <a id="assert-root-hash-command">`assert-root-hash` command</a>

@nacnudus

nacnudus Apr 20, 2018

Why are there two assert-root-hash commands per register? How do you use them?

@arnau

arnau Apr 20, 2018

Author Collaborator

assert-root-hash can appear as many times as you want in an RSF file. The idea is that every time you find one, you can use it to verify that you have got all the previous entries right. It's a checkpoint, really.

@nacnudus

nacnudus Apr 20, 2018

Thanks, that answers my later question about whether an RSF file starts at the beginning of a register. I should have checked with echo -n | sha256sum. In that case, should this spec require that the first line of an RSF file be assert-root-hash?
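
The `echo -n | sha256sum` check can be reproduced in Python. This sketch assumes, as that check implies, that the root hash of an empty register is the SHA-256 digest of zero bytes, and that hashes are written with the `sha-256:` prefix seen in the RSF examples in this thread. `starts_from_empty_register` is a hypothetical helper.

```python
import hashlib

# SHA-256 of zero bytes: the same digest that `echo -n | sha256sum` prints.
EMPTY_ROOT_HASH = "sha-256:" + hashlib.sha256(b"").hexdigest()


def starts_from_empty_register(first_command: str) -> bool:
    """True when the first command asserts the empty-register root hash."""
    parts = first_command.rstrip("\r\n").split("\t")
    return parts == ["assert-root-hash", EMPTY_ROOT_HASH]
```

A file whose first command passes this check describes a register from its beginning. A file that starts with any other asserted hash would be a patch on top of an existing register.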

@arnau

arnau Apr 20, 2018

Author Collaborator

I suggested that in another comment, the potential problem is that currently it is not a requirement so some RSF might be broken if we impose this restriction. Let me assess how big of a deal it is really.

@nacnudus

nacnudus Apr 20, 2018

Since the scope of this review is to check that the current implementation of RSF is accurately described in this RFC, I'll back off and open this as an issue instead.

#### <a id="add-item-command">`add-item` command</a>

Adds a new [Item resource][item-res] to the register. It will require an
[`append-entry` command](#append-entry-command) to make it visible to users.

@nacnudus

nacnudus Apr 20, 2018

Instead of "make it visible to users", I would put "associate the item with a record key".

@arnau

arnau Apr 20, 2018

Author Collaborator

umm, I like what you are suggesting but I don't like the idea of using the word "record". @michaelabenyohai ideas?

@nacnudus

nacnudus Apr 20, 2018

How about "There must be a corresponding append-entry command that refers to the item's hash"? Noting the later comment on "It is illegal to have orphan items", which insists on the append-entry for the file to be valid.

@arnau

arnau Apr 20, 2018

Author Collaborator

That could work 👍

2. The `key` of the entry. The primary key field is the field with the same
name as the register.
3. The `timestamp` of the entry. This is the time at which the entry was
appended to the register.

@nacnudus

nacnudus Apr 20, 2018

I believe this is the first time we have defined what the timestamp means. Are we sure?

@arnau

arnau Apr 20, 2018

Author Collaborator

I think it's the first time we have something that clear yes. Based on usage I'd say yes, the timestamp is the consequence of minting an item so it's the recording time for the entry. It mimics git's behaviour.

@nacnudus

nacnudus Apr 20, 2018

It makes me nervous because that's only the way we've used it. I haven't thought about the consequences of timestamps being out of sequence. @michaelabenyohai am I being paranoid?

@arnau

arnau Apr 20, 2018

Author Collaborator

what would be the problem of having timestamps out of sequence? The order of the log is dictated by the entry number.

@arnau

arnau Apr 20, 2018

Author Collaborator

It could mess with tooling using the timestamp to infer something related to time outside "time of recording" but again, nothing you wouldn't see in git or similar.

I think it's up to the tooling to be zealous about timestamps to the extent a tool can be.

@nacnudus

nacnudus Apr 20, 2018

Okay I'm satisfied 🙂

@arnau

arnau May 3, 2018

Author Collaborator

For reference, RFC0003 covers this topic (#16).

* [Commands](#commands) are executed in order of appearance, top to bottom.
* Entries are numbered in sequence in order of appearance starting with 1 if
the register is empty, otherwise incrementing on the latest entry number
found in the register.

@nacnudus

nacnudus Apr 20, 2018

Is there any way to tell whether an RSF file represents a complete register, or only appends to a register?

@arnau

arnau Apr 20, 2018

Author Collaborator

The only way as far as I know is via assert-root-hash. In a situation where the RSF has no assertions you wouldn't know.

@arnau

arnau Apr 20, 2018

Author Collaborator

One possibility that would require evolving the implementation would be requiring an assert-root-hash as the first command of any RSF file.

@nacnudus

nacnudus Apr 20, 2018

In that case I think it would be useful to require an assert-root-hash in the first line of every file, for the sake of reassembling a register from its patches.

@michaelabenyohai

michaelabenyohai May 3, 2018

Should we specify that this currently does not include system entries (though it should)?

@michaelabenyohai

michaelabenyohai May 3, 2018

By this I mean that entry numbers are computed independently for user and system currently (as they are separate logs on the inside).
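
A minimal sketch of that numbering behaviour, assuming commands have already been split on tabs into tuples. `number_entries` and the `latest` mapping are hypothetical names for this illustration, not the ORJ implementation.

```python
def number_entries(commands, latest=None):
    """Assign entry numbers to append-entry commands in order of appearance.

    `commands` is a list of tab-split command tuples. `latest` maps a log
    type ("user" or "system") to the last entry number already present in
    the register; a missing type means that log is empty.
    """
    counters = dict(latest or {})
    numbered = []
    for command in commands:
        if command[0] != "append-entry":
            continue  # only entries are numbered
        log_type = command[1]
        # Each log keeps its own independent counter, starting at 1.
        counters[log_type] = counters.get(log_type, 0) + 1
        numbered.append((log_type, counters[log_type], command[2]))
    return numbered
```

Because the counters are independent, the first system entry and the first user entry in an empty register both get entry number 1.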

@arnau

arnau May 3, 2018

Author Collaborator

Yes, it needs to explain the current implementation exactly.

the [`add-item` command](#add-item-command) that introduces the item is
referencing.
* It is illegal to have orphan items. An `add-item` must have at least one
`append-entry` referencing to the item.

@nacnudus

nacnudus Apr 20, 2018

If the validity of an add-item depends on a subsequent append-entry, then a file that gets accidentally truncated in transit will be invalid. This seems unnecessarily strict, but I haven't thought hard about the disadvantages of orphan items. Is it to do with treating patches as transactions?

@arnau

arnau Apr 20, 2018

Author Collaborator

Yes, the original discussion about the tradeoffs of allowing orphan items made the team decide to be transactional per file. If something goes wrong in the middle of consuming a delta, it must be rolled back so you can retry with the same delta without complex gymnastics. There were other considerations around how to handle orphans in the API and how to garbage-collect them if they were allowed.

I think it's a reasonable restriction given the consequences of not having it.
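
One way to picture the transactional-per-file behaviour is to apply the delta to a copy and only swap it in on full success. This is a sketch under assumptions, not how ORJ does it: `apply_patch`, `register` and `apply_command` are all placeholders.

```python
import copy


def apply_patch(register, commands, apply_command):
    """Apply every command in a delta, or none of them.

    `register` is any mutable state and `apply_command(state, command)`
    mutates that state, raising an exception on an invalid command.
    """
    working = copy.deepcopy(register)
    for command in commands:
        # Any exception here abandons `working`; `register` is untouched,
        # so the same delta can simply be retried later.
        apply_command(working, command)
    return working  # the caller replaces the old state only after success
```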

@nacnudus

nacnudus Apr 20, 2018

Does it matter whether the add-item command comes before or after the append-entry one?

@arnau

arnau Apr 20, 2018

Author Collaborator

If my memory serves me well, it has to do with exposing the entry to the API users while the item is not yet available (the record couldn't be computed either). This is me reading between the lines and guessing a bit here. In all fairness, if we think it's ok to expose an entry as soon as it gets parsed, it means we are not behaving transactionally (a rollback could cause a disruption for someone who has already consumed an item). I'll try to pull out the rationale for this restriction but, if it's not satisfactory, I might reconsider having the restriction in place.

#### Type checking

Although not part of the RSF specification, it is worth mentioning that a
Registers' implementation is expected to type check the data according to the

@nacnudus

nacnudus Apr 20, 2018

I don't think the apostrophe is needed on "Registers'"

to the state rolled back.


### Examples

@nacnudus

nacnudus Apr 20, 2018

Should these examples be valid patches?

@arnau

arnau Apr 20, 2018

Author Collaborator

They should, yes. If I haven't messed it up, they are 😱

@nacnudus

nacnudus Apr 20, 2018

Haha, I don't think you've messed them up. But I think the spec should require an assert-root-hash in the first and final lines.

arnau added some commits Apr 5, 2018

Translate from multiple docs
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Remove alternative media type
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Add type checking section
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Add multi-item example
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Note on REST API
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Reword root hash argument
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Add thoughts on a new resource that is what RSF represents
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Fix typo
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Reword add-item command as suggested by @nacnudus
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>

@arnau arnau force-pushed the rsf-spec branch from 8f4530d to 3ed0474 Apr 26, 2018

append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = "user" / "system"
key = alphanum / %x2D / %x5F

@michaelabenyohai

michaelabenyohai May 3, 2018

I'm not sure I follow what alphanum / %x2D / %x5F means. Can you explain?

@arnau

arnau May 3, 2018

Author Collaborator

alphanum is defined a few lines below as alphanum = ALPHA / DIGIT, so the key definition is: "any ASCII letter (A-Z, a-z) or digit (0-9), or a hyphen (%x2D, -), or an underscore (%x5F, _)".

ALPHA and DIGIT are defined in https://tools.ietf.org/html/rfc5234#appendix-B.1
%x2D and %x5F are the hexadecimal codes for hyphen and underscore respectively, and / defines an OR.

@michaelabenyohai

michaelabenyohai May 3, 2018

Ah I missed the definition of alphanum so that makes sense now.

Does this imply there is only one character in a key? i.e. it's either a single ASCII letter or digit, a hyphen or an underscore. It doesn't appear to say that a key can be made up of several of these things, but I might still be reading it wrong.

@arnau

arnau May 3, 2018

Author Collaborator

HA, very good point. I have to add multiple chars :)

I'm wondering if the first char has special treatment in the current implementation (e.g. can a key start with a hyphen: -foo?). If not, this needs further restriction.

log = command *(CRLF command) [CRLF]
command = add-item / append-entry / assert-root-hash
assert-root-hash = %s"assert-root-hash" HTAB hash

@michaelabenyohai

michaelabenyohai May 3, 2018

Does %s mean case sensitive? If so, why do we only use it for "add-item" etc and not for things like "user"? Not that I think we've ever specified whether these things are case sensitive or not.

@arnau

arnau May 3, 2018

Author Collaborator

Good point. Yes %s means case sensitive and all tokens that should be strictly in lower case should be prepended with that. I'll amend them 👍
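
To make the case sensitivity concrete, here is a small Python sketch that splits a log following the quoted `log` rule literally (commands separated by CRLF, with an optional trailing CRLF) and compares command names exactly. `split_log` is a hypothetical helper and does not validate the arguments of each command.

```python
COMMANDS = ("add-item", "append-entry", "assert-root-hash")


def split_log(text: str):
    """Yield (name, arguments) for each command in an RSF log."""
    if text.endswith("\r\n"):
        text = text[:-2]  # the trailing CRLF is optional in the grammar
    for line in text.split("\r\n"):
        name, *arguments = line.split("\t")
        if name not in COMMANDS:  # exact match, so "ADD-ITEM" is rejected
            raise ValueError(f"unknown command: {name!r}")
        yield name, arguments
```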

@arnau

arnau May 3, 2018

Author Collaborator

Fixed in a new commit.


Asserts that the provided root hash is the same as the one computed from the
current entry log as defined in the [Digital Proofs][digital-proofs]
specification.

@michaelabenyohai

michaelabenyohai May 3, 2018

Should we specify that, although system entries are included in RSF, an assert-root-hash command currently ignores them? It only covers user entries.

@arnau

arnau May 3, 2018

Author Collaborator

Yes, absolutely. I will amend this too 👍

##### Arguments

1. The `type` of the entry determines if the entry belongs to the data log
(`user`) or to the metadata log (`system`).

@michaelabenyohai

michaelabenyohai May 3, 2018

This implies that the data log and the metadata log are separate things. Is this intentional? I know they are kind of separate currently (e.g. system entries are ignored in root-hashes) but they do all appear in the same "log" in the RSF. I guess this kind of does correctly explain how things are now (even if we want to change them).

@arnau

arnau May 3, 2018

Author Collaborator

I think this RFC should document how things are right now. When we change it, we will have another RFC that explains the change and can refer to the original RFC as its starting point.

@nacnudus

nacnudus May 3, 2018

The logs are intertwined because a data item must conform to the current schema derived from the previous system entries. So even though system entries aren't in the root hashes, the RSF is still invalid if they are reordered.

@arnau

arnau May 4, 2018

Author Collaborator

You are right, there is a level of checking that guarantees consistency cross log 👍

* It is illegal to have broken references. An `append-entry` must reference an
item previously introduced by an `add-item` command.
* The item in the `add-item` command must always be in the canonical form.

@michaelabenyohai

michaelabenyohai May 3, 2018

Should we mention anything about identical consecutive append-entry commands currently being illegal?

i.e. this is illegal

append-entry	user	GB	2010-11-12T13:14:15Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
append-entry	user	GB	2010-11-12T13:14:50Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6

This is ok

append-entry	user	GB	2010-11-12T13:14:15Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
append-entry	user	GB	2010-11-12T13:14:30Z	sha-256:490636974f8087e4518d222eba08851dd3e2b85095f2b1427ff6ecd3fa482435
append-entry	user	GB	2010-11-12T13:14:50Z	sha-256:08bef0039a4f0fb52f3a5ce4b97d7927bf159bc254b8881c45d95945617237f6
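
A sketch of how such a check might look in Python. It assumes "consecutive" means consecutive entries for the same key, which is one reading of the two examples above; the thread does not say whether entries for other keys in between would change the outcome. `find_repeated_entries` is a hypothetical helper.

```python
def find_repeated_entries(entries):
    """Return positions of entries that repeat the previous entry for a key.

    `entries` is a list of (key, item_hash) pairs in log order. Positions
    are 1-based and relative to this list.
    """
    latest = {}  # key -> item hash of the most recent entry for that key
    repeated = []
    for position, (key, item_hash) in enumerate(entries, start=1):
        if latest.get(key) == item_hash:
            repeated.append(position)
        latest[key] = item_hash
    return repeated
```

Under that assumption the first example is flagged at its second entry, while the second example passes because a different item hash sits between the two identical ones.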

@arnau

arnau May 3, 2018

Author Collaborator

With the recent changes, is this still the situation? I'll need clarification on this.

@michaelabenyohai

michaelabenyohai May 3, 2018

This is currently how it is implemented, yes. ORJ won't let you load RSF that looks like the first example. However, government-service register will currently produce illegal RSF, because we accidentally allowed these duplicate entries into the register before coding the rule.

@nacnudus

nacnudus May 3, 2018

The rule is good hygiene but should it be enforced in the spec?

@arnau

arnau May 4, 2018

Author Collaborator

The goal of the spec is to allow different implementations to be built independently and be compatible with each other. If we don't enforce the same rules, a valid RSF in one tool could be invalid in another one. To me, this is a spec failure.

@arnau

arnau May 4, 2018

Author Collaborator

We can discuss whether the rule should stop parsing (and thus roll back the entire patch), skip the duplicate entry entirely, or raise a warning and keep parsing.

Given that the current implementation fails, the initial spec should reflect that exact behaviour.

* It is illegal to have orphan items. An `add-item` must have at least one
`append-entry` referencing to the item.
* It is illegal to have broken references. An `append-entry` must reference an
item previously introduced by an `add-item` command.

@michaelabenyohai

michaelabenyohai May 3, 2018

In the current implementation, if a corresponding add-item does not appear in the RSF file then it checks to see if the item already exists in the register. If it does, then the RSF file is deemed valid, otherwise it's rejected. This means that an RSF file that has an "orphan entry" could actually be valid if it is a patch on an existing register.

The inverse is not true for add-item. If an add-item is in an RSF file, the corresponding append-entry must appear in the same RSF file.
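
The behaviour described in this comment could be sketched as follows. It is an illustration under assumptions, not the ORJ code: commands are assumed to be tab-split tuples, `;` is assumed as the hash list separator, and `item_hash` stands in for whatever hashing the implementation applies to an item. All names are hypothetical.

```python
def check_references(commands, known_item_hashes, item_hash):
    """Validate item references for one RSF file and return error strings.

    `known_item_hashes` is the set of item hashes already stored in the
    register that the file is applied to.
    """
    added, referenced, errors = set(), set(), []
    for command in commands:
        if command[0] == "add-item":
            added.add(item_hash(command[1]))
        elif command[0] == "append-entry":
            for h in command[4].split(";"):  # assumed list separator
                referenced.add(h)
                # Valid if added earlier in this file OR already in the register.
                if h not in added and h not in known_item_hashes:
                    errors.append(f"broken reference: {h}")
    # An add-item must be referenced by an append-entry in the same file.
    for h in sorted(added - referenced):
        errors.append(f"orphan item: {h}")
    return errors
```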

@arnau

arnau May 3, 2018

Author Collaborator

Ah! I misunderstood this behaviour. I'll amend 👍

arnau added some commits May 3, 2018

Fix ABNF case sensitive tokens
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Add download-rsf range endpoints
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Fix the assert-root-hash behaviour
Explains that the current behaviour uses only data entries to generate
the root hash which leaves metadata entries unasserted.

Also makes explicit that the numbering for data and metadata logs are
independent.

Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Fix ABNF for entry keys
Ensures keys have one or more characters where the first is
alphanumeric.

Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Add identical consecutive rule
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
Fix rule for orphan entries
Signed-off-by: Arnau Siches <arnau.siches@digital.cabinet-office.gov.uk>
@MatMoore

left a comment

This makes sense to me, and I think it's good to specify the grammar and semantics of RSF, but I'd rather the registers spec didn't require RSF as a way to mint or download data. Can we leave out the REST API section entirely, and address it as an ADR in openregisters-java?

It feels like there are already a lot of different ways of downloading data from a register, and I'm wary about adding any more endpoints that do similar things to existing ones, because it makes the API harder to use. At the moment RSF is an internal thing and API users shouldn't need to understand it.

I like @michaelabenyohai's idea of combining with the download-register endpoint as the RSF could just be requested through content negotiation.

append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = %s"user" / %s"system"
key = alphanum *(alphanum / %x2D / %x5F)
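
For readers less used to ABNF, the amended `key` rule in this hunk corresponds to the regular expression below. This is a Python sketch of the rule as quoted here only; `is_valid_key` is a hypothetical helper, and the rule itself may still change.

```python
import re

# key = alphanum *(alphanum / %x2D / %x5F)
# One ASCII letter or digit, then any number of letters, digits, "-" or "_".
KEY_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def is_valid_key(key: str) -> bool:
    """True when the whole string matches the quoted `key` rule."""
    return KEY_PATTERN.fullmatch(key) is not None
```

Under this rule `GB` and `a-b_c1` are valid, while `-foo`, the empty string and `a/b` are not.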

@MatMoore

MatMoore Aug 6, 2018

I think this depends on #22
Also: what about forward slashes?

@arnau

arnau Aug 10, 2018

Author Collaborator

It does indeed. I need to amend that once #22 is accepted.

append-entry = %s"append-entry" HTAB type HTAB key HTAB timestamp HTAB hash-list
type = %s"user" / %s"system"
key = alphanum *(alphanum / %x2D / %x5F)
hash-list = hash *(list-separator hash)
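
A minimal Python sketch of reading one `append-entry` line according to the rules in this hunk. The hunk does not show the definition of `list-separator`, so the `;` default below is an assumption, and `AppendEntry` and `parse_append_entry` are hypothetical names. Key, timestamp and hash syntax are not validated here.

```python
from typing import List, NamedTuple


class AppendEntry(NamedTuple):
    entry_type: str
    key: str
    timestamp: str
    hashes: List[str]


def parse_append_entry(line: str, list_separator: str = ";") -> AppendEntry:
    """Split one append-entry line into its tab-separated arguments."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5 or fields[0] != "append-entry":
        raise ValueError("not an append-entry command")
    _, entry_type, key, timestamp, hash_list = fields
    if entry_type not in ("user", "system"):  # %s tokens are case-sensitive
        raise ValueError(f"unknown entry type: {entry_type!r}")
    return AppendEntry(entry_type, key, timestamp, hash_list.split(list_separator))
```

In the common case the resulting `hashes` list holds a single item hash.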

@MatMoore

MatMoore Aug 6, 2018

In the RSF I've looked at, this has just been a single item hash. When would this be a list?

@arnau

arnau Aug 10, 2018

Author Collaborator

The case is theoretical: it arises when you have an index. Indexes are experimental and under review to assess whether the benefits outweigh the (perceived) complexity.


### Media type

The current media type is `application/uk-gov-rsf`. It should change to

@MatMoore

MatMoore Aug 6, 2018

As this is specific to ORJ, can we say that the media type of RSF is application/vnd.rsf but application/uk-gov-rsf may also be used for legacy reasons? Then we can fix that at any time without raising another RFC.
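
On the client side, accepting both names could look like the sketch below. `is_rsf_media_type` is a hypothetical helper, and it assumes media type parameters (anything after `;`) can be ignored for this purpose.

```python
RSF_MEDIA_TYPE = "application/vnd.rsf"
LEGACY_RSF_MEDIA_TYPE = "application/uk-gov-rsf"


def is_rsf_media_type(content_type: str) -> bool:
    """True for the proposed or the legacy RSF media type."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    essence = content_type.split(";", 1)[0].strip().lower()
    return essence in (RSF_MEDIA_TYPE, LEGACY_RSF_MEDIA_TYPE)
```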

@arnau

arnau Aug 10, 2018

Author Collaborator

Sounds reasonable 👍

@arnau arnau changed the title RSF version 1 RFC0006: RSF version 1 Aug 9, 2018

@arnau

Collaborator Author

commented Aug 10, 2018

@MatMoore although I may agree RSF is not normative, I'd like to have it in the spec as an example of how the reference implementation has approached register serialisation.
