proposal: consider adding a chain storage interface to ctfe #691
Hi Roland. You're right that the certificate chains can entail a significant amount of duplicate data, so your rationale for coming up with a scheme to deduplicate them makes a lot of sense to me. We don't have the bandwidth to do this work ourselves, but we're open to pull requests that implement this behaviour. As Trillian and CTFE are both used elsewhere, this constrains what a solution can look like.
Your proposal sounds good to me: some kind of key is serialized to ExtraData, which is then parsed by CTFE, which uses this key to load the cert chain from a new storage backend. This will require CTFE to be configured with a new backend storage, and this backend should not be required if the new flag is not provided. I don't have strong opinions on what this key format or backend should be, but as a guiding pragmatic principle I propose that we start with the minimal thing that works for you, and we can make it more generic as other users need it. I think we're on the same page here, but let me know if you think we're not :-)
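A minimal sketch of that shape, using entirely hypothetical names (`ChainStorage`, `buildExtraData`) since no API has been agreed on:

```go
// Hypothetical sketch of the flag-gated behaviour described above; none of
// these names exist in the CTFE codebase today.

// ChainStorage is the new, optional backend for deduplicated chains.
type ChainStorage interface {
	// Put stores the serialized chain and returns a short key for it.
	Put(chainBytes []byte) (key []byte, err error)
	// Get returns the serialized chain previously stored under key.
	Get(key []byte) (chainBytes []byte, err error)
}

// buildExtraData decides what CTFE hands to Trillian as ExtraData.
func buildExtraData(chainBytes []byte, cs ChainStorage) ([]byte, error) {
	if cs == nil {
		// Flag not provided, no backend configured: keep today's
		// behaviour and store the full serialized chain in ExtraData.
		return chainBytes, nil
	}
	// Backend configured: store the chain there, and put only its key
	// into ExtraData for CTFE to resolve on reads.
	return cs.Put(chainBytes)
}
```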
Hi Roland. I'd also like to +1 this idea and the suggestions Martin proposed. One thing to add:

TL;DR: I think that you should store some identifier in ExtraData. The leaf hash may seem to uniquely identify a leaf, but I can see a couple of duplication issues, depending on whether you use the Merkle leaf hash or Trillian's concept of "identity hash".
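For reference, the fields in play on Trillian's `LogLeaf` message (paraphrased from `trillian.proto`; the comments are my interpretation, not the upstream documentation):

```go
// Paraphrased excerpt of Trillian's LogLeaf; field comments are interpretive.
type LogLeaf struct {
	MerkleLeafHash   []byte // RFC 6962-style hash of LeafValue in the tree
	LeafValue        []byte // the leaf itself; for CT, the (pre)cert entry
	ExtraData        []byte // opaque blob; today CTFE stores the chain here
	LeafIndex        int64  // position in the log once sequenced
	LeafIdentityHash []byte // Trillian's dedup identity for the leaf content
}
```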
Thanks for the input, Pavel. My instinct would be to consider using a CAS (content-addressed store) for the certificate chains. Essentially: hash the bytes you originally stuck in ExtraData, and that's your key. Store the map of {Hash(chainBytes) -> chainBytes} somewhere. If you do have a lot of duplicate chains, then the number of entries in the CAS will be small. When you read a range of certs for get-entries, you only need to do a lookup for each unique chain.
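A minimal sketch of that keying scheme, assuming SHA-256 as the hash (the comment above doesn't prescribe one):

```go
// Content-addressed storage for chains: the key is just the hash of the
// bytes that would otherwise have gone into ExtraData, so identical chains
// collapse into a single entry. In-memory map for illustration only.
package main

import (
	"crypto/sha256"
	"fmt"
)

type chainCAS map[[sha256.Size]byte][]byte

func (c chainCAS) Put(chainBytes []byte) [sha256.Size]byte {
	key := sha256.Sum256(chainBytes)
	c[key] = chainBytes // idempotent: a duplicate chain rewrites the same entry
	return key
}

func (c chainCAS) Get(key [sha256.Size]byte) ([]byte, bool) {
	chainBytes, ok := c[key]
	return chainBytes, ok
}

func main() {
	cas := chainCAS{}
	k1 := cas.Put([]byte("intermediate||root"))
	k2 := cas.Put([]byte("intermediate||root")) // same chain -> same key
	fmt.Println(k1 == k2, len(cas))             // true 1
}
```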
Sounds good; we (Let's Encrypt) are definitely interested in seeing this get done, so we're likely to spend engineering time working on it at some point in the near future. All of your feedback makes sense. The one main question I have is whether we'd want to design this as a generic interface with a MySQL implementation (similarly to how Trillian has the LogStorage interface and then a handful of implementations), or just dive straight in with a MySQL implementation and call it a day (for now, anyway)?
For my part, I would recommend the latter (dive straight into whatever backend you're intending to use). First, it's difficult to craft an interface around a single implementation, and second, there's little to be gained by trying (in fact, it usually leads to premature generalisation). We can refactor later, especially if there's no interface that needs to be maintained.
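For illustration, the MySQL-first route could start with a single hash-keyed table; this DDL is a sketch under assumed names, not from any existing implementation:

```go
// Illustrative only: neither the table nor the column names come from any
// existing CTFE or Trillian schema.
const createChainTable = `
CREATE TABLE IF NOT EXISTS IssuanceChain (
  ChainHash  VARBINARY(32) NOT NULL, -- e.g. SHA-256 of the serialized chain
  ChainBytes LONGBLOB      NOT NULL, -- the serialized intermediates + root
  PRIMARY KEY (ChainHash)
);`
```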
Thanks @rolandshoemaker for the initiative. Here are the requirements and proposal.

Requirements

Proposal

Trillian
The pull requests to deduplicate the issuance chain in ExtraData have all been merged.
This is an extended version of google/trillian#1968 with more concrete numbers (and probably in the right place this time).
Currently CTFE sends both the leaf certificate and the rest of the chain (intermediates and root) to Trillian for storage. While the leaf is necessary for Merkle tree operations (as part of the tree leaf), the rest of the chain is not, and is stored as an opaque blob in ExtraData. Unlike the leaf, which is unique, the chain is mostly duplicated data, since the number of possible chains is typically (relatively) small, especially when a CA is using a specific log for a single hierarchy. Assuming most chains have three entries (a leaf, an intermediate, and a root) and that certificates average around 1.5 KB, a LeafData row ends up being about 5 KB, with the chain in ExtraData taking up 3 KB of that (ignoring indexes and storage engine overhead).
Pulling this data out of the Trillian storage backend would make it easier to deduplicate, which would provide a significant storage win, and likely a performance win as well, because the database will be able to keep more rows in memory/hot cache if they are smaller. I think the main benefit here would be in sequencing operations and get-entries-type operations.
Some concrete numbers for the Oak 2020 Let's Encrypt log: the database for the log is around ~3 TB for ~420 million entries. Looking through all the entries in the log, the chains, excluding leaves, make up around 1 TB of data (again, this ignores the database overhead, which would add some percentage to this number), of which only 1.5 MB are unique certificates. Deduplicating would be a ~33% storage win (which is not an insignificant cost saving).
I think there are two obvious approaches to this: use ExtraData to store some kind of identifier (or set of identifiers) that links to entries in a chain storage backend, or store nothing in ExtraData and use the leaf identity hash as the key that links a leaf to its chain in the chain storage backend. Either way, when CTFE stores a chain it would decompose the chain into leaf + rest, send the rest to the chain storage backend, and send the leaf to Trillian. In the other direction, once a leaf or set of leaves is returned, CTFE would reconstruct the chains by querying the storage backend. The one downside is that this would increase the number of database queries: for a get-entries call we currently only need N calls to the Trillian database, but with two storage backends you would need N+1 in the best case, or N*2 in the worst, depending on query/schema design.
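A sketch of the corresponding read path for get-entries, again with hypothetical names; the single batched lookup over distinct keys is what gives the N+1 best case, while one lookup per leaf would be the N*2 worst case:

```go
// Hypothetical read path: resolve the unique chain keys found in one page
// of ExtraData values with a single batched query. getBatch stands in for
// a batched variant of the chain storage backend's Get; it does not exist
// anywhere yet.
func reconstructChains(extraData [][]byte,
	getBatch func(keys [][]byte) (map[string][]byte, error)) (map[string][]byte, error) {
	// Collect the distinct keys: many leaves share the same chain, so this
	// set is expected to be much smaller than the page itself.
	unique := make(map[string][]byte)
	for _, key := range extraData {
		unique[string(key)] = key
	}
	keys := make([][]byte, 0, len(unique))
	for _, key := range unique {
		keys = append(keys, key)
	}
	// One extra query per page, regardless of how many leaves reference
	// each chain: N Trillian reads + 1 chain read = N+1.
	return getBatch(keys)
}
```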
I haven't specified a formal API in this proposal because I want to gauge interest in the overall concept first; if there is interest, I will try to spend some time coming up with a more concrete proposal.