More granular key storage/sync #137

Closed
andrewgdotcom opened this issue Jul 2, 2021 · 6 comments

@andrewgdotcom

Having a standard format for groups of packets smaller than a full TPK would help solve several problems with keyserver sync. At the moment, the smallest unit of data used by keyservers is the TPK, but these are highly mutable objects and it is possible for two fully-updated keyservers, intentionally or otherwise, to disagree on the canonical form of a given TPK. This can be due to local blacklisting, abusive sigs, oversized packets, lack of canonical ordering, or many other possibilities.

It should be possible to break up a TPK into more tightly-defined "differential" public keys (DPKs), immutable in themselves but aggregatable to reconstitute the full TPK. These could then be treated as atomic changes, each of which could be accepted or rejected by a given keyserver as per its local policy.

Keyservers would then sync DPKs rather than TPKs, and reconstitute TPKs on demand.

A DPK should contain:

  1. a primary key
  2. one first-party sig made by that primary key
  3. all packets signed over by that sig (recursively)

Examples:

  1. a primary key, an attestation, the third-party sig that the attestation is over, and the UID that the sig is over.
  2. a primary key, an sbind, and the subkey that the sbind is over
  3. a primary key, and a self-sig
  4. a primary key, and a revocation (and nothing else)

(The above assumes that #135 has already been done).
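
As a rough sketch of the shape described above (not an agreed format), a DPK could be modelled as a primary key, exactly one first-party sig, and whatever that sig covers. The `Packet` and `DPK` types, the `kind` labels and the `decompose` helper are all hypothetical names invented for illustration; real packets would be parsed OpenPGP packets rather than placeholder bytes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Packet:
    kind: str    # illustrative label, e.g. "pubkey", "uid", "subkey", "sig"
    body: bytes  # placeholder bytes standing in for the raw packet


@dataclass(frozen=True)
class DPK:
    primary: Packet              # 1. the primary key
    self_sig: Packet             # 2. exactly one first-party sig
    covered: Tuple[Packet, ...]  # 3. packets that sig signs over, recursively


def decompose(primary: Packet,
              signed_groups: Iterable[Tuple[Packet, Iterable[Packet]]]) -> List[DPK]:
    """Emit one DPK per (self_sig, covered_packets) pair."""
    return [DPK(primary, sig, tuple(covered)) for sig, covered in signed_groups]


primary = Packet("pubkey", b"K")
uid = Packet("uid", b"alice@example.org")
subkey = Packet("subkey", b"S")
third_party_sig = Packet("sig", b"3pc")

dpks = decompose(primary, [
    (Packet("sig", b"attestation"), [third_party_sig, uid]),  # example 1
    (Packet("sig", b"sbind"), [subkey]),                      # example 2
    (Packet("sig", b"self-sig"), []),                         # example 3
    (Packet("sig", b"revocation"), []),                       # example 4
])
```

Each resulting object is immutable, so a keyserver could accept or reject it as a unit.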

We could treat bare revocation packets as a special case of a DPK, for compatibility with other usage in the wild [1].

The back end storage would need to be refactored to store individual packets rather than full TPKs. A TPK or DPK would then exist in storage as a set of indexes into the table of packets. If two sigs existed with the same functional intent, only the most recent would be stored, and older ones discarded.
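
To make the storage idea concrete, here is a minimal in-memory sketch, assuming packets are deduplicated by a SHA-256 digest of their raw bytes. The `PacketStore` class and its method names are hypothetical, and the "only keep the most recent sig with the same functional intent" rule is deliberately left out.

```python
import hashlib
from typing import Dict, List


class PacketStore:
    """Append-only packet table; keys are stored as lists of indexes into it."""

    def __init__(self) -> None:
        self.packets: List[bytes] = []      # the packet table
        self.index: Dict[bytes, int] = {}   # packet digest -> table position
        self.keys: Dict[str, List[int]] = {}  # fingerprint -> packet indexes

    def add_packet(self, raw: bytes) -> int:
        """Store a packet once and return its position in the table."""
        digest = hashlib.sha256(raw).digest()
        if digest not in self.index:
            self.index[digest] = len(self.packets)
            self.packets.append(raw)
        return self.index[digest]

    def store_tpk(self, fingerprint: str, raw_packets: List[bytes]) -> None:
        """Deconstruct an uploaded TPK into packets and record their indexes."""
        refs = self.keys.setdefault(fingerprint, [])
        for raw in raw_packets:
            i = self.add_packet(raw)
            if i not in refs:
                refs.append(i)

    def load_tpk(self, fingerprint: str) -> List[bytes]:
        """Reconstitute the TPK on demand from its index list."""
        return [self.packets[i] for i in self.keys[fingerprint]]


store = PacketStore()
store.store_tpk("FPR", [b"key", b"uid", b"selfsig"])
store.store_tpk("FPR", [b"key", b"uid", b"selfsig", b"newsig"])  # later upload
```

The second upload adds only the one new packet to the table.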

We could then (temporarily?) run two forms of recon in parallel, using the standard protocol - one based on TPKs as normal, and one based on DPKs on a new port, say 11372 (we would need to maintain two PTrees etc). The recon listening on the new port would advertise its capability in the handshake, so it should be sufficient to just point its peers at the new port instead of the old one. Lookups for DPKs could be performed on the standard port, perhaps with a different endpoint or query parameter to avoid confusion.

Dumps, uploads etc. would work as before, but when writing to the database we would transparently deconstruct each TPK into its components.

[1] https://gitlab.com/openpgp-wg/rfc4880bis/-/issues/14

@dkg

dkg commented Jul 7, 2021

I like the concepts here, but i'm not certain how they'd play out in practice. The three main cases i can see for this DPK concept would be:

  • internal storage
  • transmission between peers
  • calculations for sync

I think i see how it works for calculations for sync, but using an actual DPK seems inefficient for either of the two former cases.

The common thing in all of your example DPKs is of course the primary key. So a certificate with two subkeys and three user IDs, two of which have attested third-party certifications, would decompose into 7 DPKs.

Thinking about internal storage, surely there'd be no reason to store the (uncompressable) public primary key 7 times.
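
The count works out as follows, assuming one binding sig per subkey, one self-sig per user ID, and one attestation per user ID that has attested certifications (the function name is just for illustration):

```python
def count_dpks(subkeys: int, user_ids: int, attested_user_ids: int) -> int:
    # One DPK per first-party sig: a binding sig for each subkey, a self-sig
    # for each user ID, and an attestation for each attested user ID.
    return subkeys + user_ids + attested_user_ids


n = count_dpks(subkeys=2, user_ids=3, attested_user_ids=2)
print(n)  # 7 DPKs, each carrying its own copy of the primary key
```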

Thinking about transmitting: you might want to transmit some subset of those DPKs. Consider a syncing peer that knows about all the subkeys and all the user IDs for the example certificate described above, but who hasn't seen any third-party certifications yet. Is there some reason that you'd want to transmit the primary public key twice in order to share both sets of third-party certifications with that peer?

In the context of a redesigned synchronization of a certificate for which both parties already know the primary pubkey, there might be no need to transfer the pubkey at all, just an index or fingerprint in the context of the sync exchange to be able to tell which key you're talking about!

The other common element conceptually in each of your proposed DPKs is a signature from the primary key itself. This is the property that @teythoon has been calling "self-sovereignty". I think it's right on target. So the transferable elements are in some sense just the signature and anything beyond the pubkey that is covered by the signature, if the context already implies the pubkey. We could call such an object a "stripped DPK".
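
A minimal sketch of that idea, assuming a DPK is a list of raw packets with the primary key first. The helper names are hypothetical, and the SHA-256 `key_ref` is only a stand-in for a real OpenPGP fingerprint.

```python
import hashlib
from typing import Dict, List, Tuple


def key_ref(primary: bytes) -> str:
    # Stand-in identifier; a real implementation would use the fingerprint.
    return hashlib.sha256(primary).hexdigest()


def strip_dpk(dpk: List[bytes]) -> Tuple[str, List[bytes]]:
    """Replace the leading primary key packet with a reference to it."""
    primary, *rest = dpk
    return key_ref(primary), rest


def restore_dpk(ref: str, rest: List[bytes],
                known_keys: Dict[str, bytes]) -> List[bytes]:
    """Rebuild the full DPK, refusing fragments whose primary key is unknown."""
    if ref not in known_keys:
        raise KeyError(f"unknown primary key {ref}")
    return [known_keys[ref], *rest]


primary = b"primary-key-packet"
full = [primary, b"uid", b"self-sig"]
ref, stripped = strip_dpk(full)
known = {key_ref(primary): primary}
```

The receiver can only verify the stripped form if the context supplies the pubkey, which is why `restore_dpk` fails on an unknown reference.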

A few notes about your proposed examples:

> a primary key, an attestation, the third-party sig that the attestation is over, and the UID that the sig is over.

I assume "attestation" here means 1pa3pc. Note that the design of 1pa3pc is such that a single attestation can attest to more than one third-party certification. To do this safely, you'd still need a canonical ordering of the attested third-party certifications; this could be either the order in which their digests appear in the attestation, or some other ordering.
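
The first ordering option could look roughly like this, assuming the attestation exposes its digests as a list and using SHA-256 over the raw certification as a stand-in for however the digest is really computed (the function name is hypothetical):

```python
import hashlib
from typing import List


def order_by_attestation(attested_digests: List[bytes],
                         certs: List[bytes]) -> List[bytes]:
    """Return the attested certifications in the order their digests appear."""
    by_digest = {hashlib.sha256(c).digest(): c for c in certs}
    # Certifications whose digest is not in the attestation are left out.
    return [by_digest[d] for d in attested_digests if d in by_digest]


certs = [b"cert-from-carol", b"cert-from-bob", b"cert-from-dave"]
attested = [hashlib.sha256(c).digest()
            for c in (b"cert-from-bob", b"cert-from-carol")]
ordered = order_by_attestation(attested, certs)
```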

> a primary key, an sbind, and the subkey that the sbind is over

For signing-capable subkeys, this presumably also requires the cross-sig (as an embedded packet in the subkey binding signature)

> a primary key, and a self-sig

For a direct key signature (e.g. stating preferences or expiration of the primary key itself), this is sufficient. For a self-sig over a specific User ID or User Attribute, presumably the DPK would need to also include the UID or UAT packet itself.

> a primary key, and a revocation (and nothing else)

This is the (as-yet-unstandardized) structure that is currently exported by keys.openpgp.org or other hagrid installations. Interestingly, the "stripped DPK" of this example is also just a single standalone revocation packet, which is another as-yet-unstandardized structure that is widely referred to as a "revocation certificate".


Even if you were to transmit these differential objects when syncing, you might find it to be more efficient to collapse them together (e.g. if you have multiple self-sigs over a given user ID that the peer does not have, you don't want to have to transmit the user ID twice). The resulting collapsed object for transmission would resemble an OpenPGP certificate with some elements omitted.

@andrewgdotcom
Author

> Thinking about internal storage, surely there'd be no reason to store the (uncompressable) public primary key 7 times.

Agreed, I only intended DPKs to be a transmission format, not a storage format. As stated in the OP, backend storage would be decomposed into individual packets. TPKs and DPKs would be structured as arrays of indexes into the packet table (cf. Unix files as arrays of indexes into the "inode table"). Since we are still dealing with an append-only datastore, we shouldn't need to count references.

> Consider a syncing peer that knows about all the subkeys and all the user IDs for the example certificate described above, but who hasn't seen any third-party certifications yet. Is there some reason that you'd want to transmit the primary public key twice in order to share both sets of third-party certifications with that peer?

I agree this cries out for deduplication; however, it seems unavoidable if we accept that a) the transmissible form (TPK or DPK) should be checkable for self-consistency as a standalone object, and b) it is the receiver of the data that chooses which items to request, based on the results of the recon comparison, and has no way to know in advance if deduplication is possible.

If we don't accept a) then we potentially find ourselves ingesting garbled or orphaned data fragments that can't be reassembled into a usable format. And b) seems fixable only by implementing PKS-style push updates. "Stripped DPKs" would be more efficient on the wire, but there has to be some other way of ensuring that the referenced primary key actually exists, otherwise a "stripped DPK" would be a vector for unverifiable garbage (the same applies to "revocation certificates").

> Note that the design of 1pa3pc is such that a single attestation can attest to more than one third-party certification. To do this safely, you'd still need a canonical ordering of the attested third-party certifications; this could be either the order in which their digests appear in the attestation, or some other ordering.

Your proposed ordering feels correct, it avoids an unnecessary sort. :-)

@teythoon

To be honest, I don't understand the benefit of the deconstruction into DPKs.

Every TPK can be naturally deconstructed into individual components, by taking the primary key, the component (like user id, or subkey), and any related signatures. The certificate holder retains sovereignty because every self-signature is authenticated by the primary key, and third party certifications are attested to by the primary key using 1pa3pc. Individual components can be joined by concatenating their packet sequences, dropping the primary key from all but the first joined component, and redoing the canonicalization (which should deduplicate components).
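
For illustration only, the join could be sketched like this, assuming each component is a list of raw packets of the form primary key, component packet, signatures. This is not Sequoia's API; `join_components` is a hypothetical helper, and the grouping step is a crude stand-in for real canonicalization.

```python
from typing import Dict, Iterable, List


def join_components(components: Iterable[List[bytes]]) -> List[bytes]:
    """Join components, keeping the primary key once and merging duplicates."""
    primary = None
    sigs_by_component: Dict[bytes, List[bytes]] = {}  # keeps insertion order
    for key, component, *sigs in components:
        if primary is None:
            primary = key
        elif key != primary:
            raise ValueError("components belong to different primary keys")
        bucket = sigs_by_component.setdefault(component, [])
        for sig in sigs:
            if sig not in bucket:
                bucket.append(sig)
    if primary is None:
        return []
    joined = [primary]
    for component, sigs in sigs_by_component.items():
        joined.append(component)
        joined.extend(sigs)
    return joined


K = b"primary"
merged = join_components([
    [K, b"uid1", b"sig1"],
    [K, b"subkey", b"sbind"],
    [K, b"uid1", b"sig2", b"sig1"],  # repeats uid1 and sig1
])
```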

These kinds of destructuring and joining operations are supported today by Sequoia, without additional constructs.

@andrewgdotcom
Author

@teythoon I think we're talking about the same thing here. The advantage of DPKs is that they would be small, well-defined objects (with well-behaved checksums) that are both meaningful and amenable to comparison, and would greatly simplify the process of keyserver sync. They would be intermediate objects between packets, which are meaningless by themselves, and TPKs, which are disordered, bloated, and difficult to compare.

@andrewgdotcom
Author

I've thought about this some more, and I think I have a recon v2 solution that has all the advantages of DPKs but doesn't require actual decomposition into DPKs at any point.

The biggest problem with sync as currently implemented stems from object identifiers being opaque hashes. If we assigned object IDs to individual packets, but constructed those IDs so that they embedded the primary key's (fingerprint|long-id), then we could sync atomic diffs without requiring any novel data structures. The receiving side could truncate the reconciled object IDs to discover their associated primary keys, and fetch any updates by (deduplicated) primary key ID.
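
A minimal sketch of such an ID, assuming a layout of 20-byte v4 fingerprint followed by a SHA-256 digest of the packet. The layout and function names are hypothetical, and the sketch ignores any size limits the recon protocol places on elements.

```python
import hashlib
from typing import Iterable, List

FPR_LEN = 20  # length of a v4 fingerprint in bytes


def object_id(fingerprint: bytes, packet: bytes) -> bytes:
    """Hypothetical layout: primary key fingerprint, then a packet digest."""
    if len(fingerprint) != FPR_LEN:
        raise ValueError("expected a 20-byte fingerprint")
    return fingerprint + hashlib.sha256(packet).digest()


def keys_to_fetch(missing_ids: Iterable[bytes]) -> List[bytes]:
    """Truncate reconciled object IDs to fingerprints, deduplicated in order."""
    return list(dict.fromkeys(oid[:FPR_LEN] for oid in missing_ids))


alice = bytes.fromhex("aa" * FPR_LEN)
bob = bytes.fromhex("bb" * FPR_LEN)
missing = [
    object_id(alice, b"new-sig-1"),
    object_id(alice, b"new-sig-2"),
    object_id(bob, b"new-uid"),
]
to_fetch = keys_to_fetch(missing)
```

Three missing packets collapse into two key fetches, one per primary key.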

Immediate advantages would include:

  1. ptrees would be append-only; no object IDs would be deleted during normal operation
  2. wrong-way updates would be eliminated
  3. canonical packet ordering would no longer be an issue
  4. no db changes would be necessary

Future improvements would then be possible:

  1. if the object ID also embedded the packet type and length, keyservers could tell which changes were undesirable (e.g. UAT packets, unattested 3rd party sigs, spam) and refrain from requesting them (this would require fake recon)
  2. (alternatively?) use separate ptrees for each packet type, so keyservers could ignore the existence of unwanted packet types
  3. ability to request/serve individual packets over hkp, to minimise data transfer
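
The first of these could be sketched as follows, assuming the ID layout is fingerprint, 1-byte packet tag, 4-byte length, then a digest. The layout, the `wanted` policy and the 8192-byte limit are all made up for illustration; tag 17 is the OpenPGP User Attribute packet tag.

```python
import hashlib
import struct
from typing import FrozenSet, Tuple

FPR_LEN = 20
UAT_TAG = 17  # OpenPGP packet tag for User Attribute packets


def object_id(fingerprint: bytes, tag: int, packet: bytes) -> bytes:
    """Hypothetical layout: fingerprint | tag (1 byte) | length (4 bytes) | digest."""
    header = struct.pack(">BI", tag, len(packet))
    return fingerprint + header + hashlib.sha256(packet).digest()


def parse(oid: bytes) -> Tuple[bytes, int, int]:
    """Recover fingerprint, packet tag and packet length from an object ID."""
    tag, length = struct.unpack(">BI", oid[FPR_LEN:FPR_LEN + 5])
    return oid[:FPR_LEN], tag, length


def wanted(oid: bytes,
           blocked_tags: FrozenSet[int] = frozenset({UAT_TAG}),
           max_len: int = 8192) -> bool:
    """Local policy check made before requesting the packet at all."""
    _, tag, length = parse(oid)
    return tag not in blocked_tags and length <= max_len


fpr = bytes.fromhex("aa" * FPR_LEN)
uid_id = object_id(fpr, 13, b"alice@example.org")  # User ID packet
uat_id = object_id(fpr, UAT_TAG, b"\x00" * 100)    # User Attribute packet
big_sig_id = object_id(fpr, 2, b"\x00" * 10000)    # oversized signature
```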

I suggested above using a separate port for recon v2; however, this would require manual configuration changes by server operators, which would hinder adoption. Better IMO to use the recon-as-server advertisement to encourage recon-as-client to negotiate a protocol upgrade. We can't use conflux.recon.flags, as these indicate hard forks, but we could use conflux.recon.version, which has been decoupled from the code version in hockeypuck and is currently defined in the default config file to be backwards-compatible with SKS. I would suggest that conflux.recon.version and conflux.recon.filters be hardcoded in future so that they can be relied upon to accurately indicate the server's feature set, instead of being open to operator misconfiguration, as occasionally happens now.

@andrewgdotcom
Author

An improved version of this proposal is discussed in https://github.com/hockeypuck/hockeypuck/wiki/HIP-2:-SKS-v2-protocol
