
ctclone Fails with Incomplete CT Log Responses #1044

Closed
tired-engineer opened this issue Dec 27, 2023 · 5 comments · Fixed by #1055

@tired-engineer
Contributor

tired-engineer commented Dec 27, 2023

Issue Description

The ctclone tool fails when a Certificate Transparency (CT) log returns fewer entries than requested. This becomes a problem during periodic ctclone runs and near the ends of logs: at certain points, the position in the cloned log aligns in such a way that the CT log returns fewer than the expected number of entries.

Specific Example

An instance of this issue can be observed with Sectigo's Sabre 2025h2 log. With a "negotiated" page size of 256, the log returns only 165 entries for a particular range, aligning the position to 52992 (207 * 256), a multiple of the log's page size, and ctclone fails with the following error:

worker 0: Retryable error getting data: "LeafFetcher.Batch(52827, 256): wanted 256 leaves but got 165"
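For illustration, here is a minimal sketch, not taken from the ctclone codebase, of the behaviour the log appears to exhibit: it clamps a get-entries response at the end of the page the start index falls in. The function name and the clamping rule are assumptions, but the arithmetic reproduces the 165-entry response above.

```go
package main

import "fmt"

// coercedCount models a CT log that truncates get-entries responses at its
// page boundaries: a request starting mid-page is only served up to the end
// of the page that the start index falls in.
func coercedCount(start, requested, pageSize uint64) uint64 {
	pageEnd := (start/pageSize + 1) * pageSize // first index after the current page
	if start+requested <= pageEnd {
		return requested
	}
	return pageEnd - start
}

func main() {
	// Requesting 256 entries from index 52827 against a 256-entry page:
	// the log stops at 52992 (207 * 256) and returns only 165 entries.
	fmt.Println(coercedCount(52827, 256, 256)) // prints 165
}
```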

Additional Context

Further details about this behavior in CT logs can be found in the discussion on Let's Encrypt's forum regarding the 'coerced get-entries' feature.

Important Note

The proposed solution covers only the cases where the log's actual page size is a multiple of the batch size requested by ctclone, so that the negotiated batches line up with the log's page boundaries. If the actual page size does not follow this pattern, additional adjustments may be required.
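As a concrete illustration of that constraint (a sketch only, not code from the proposed fix):

```go
package clone

// batchAligns reports whether batches of the requested size line up with the
// log's page boundaries, i.e. whether the page size is a whole multiple of
// the batch size. This mirrors the assumption described above.
func batchAligns(logPageSize, batchSize uint64) bool {
	return batchSize != 0 && logPageSize%batchSize == 0
}
```

For example, batchAligns(256, 128) is true, while batchAligns(256, 100) is false and would need the extra adjustments mentioned above.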

Edited: formatting.

@mhutchinson
Contributor

Thanks for the well-written bug report and the attempted fix in #1045. This sounds like it should be an easy fix, but I think that appearance is deceptive. I wasn't aware of any logs trying to coerce alignment at specific indices, and the design of the cloner doesn't lend itself well to readjusting to these hints; the current design uses a very greedy strategy in the batch.go Bulk method, where it starts goroutine workers with statically computed ranges to fetch.

I think the BatchFetch mechanism will need some way to elegantly report up that it only got partial results. The cloner will then need to cancel other fetches in progress and restart fetching, using a start index based on the last result fetched from the incomplete range. To give confidence that this works, we'd probably want to add some kind of functional test that emulates a log performing this coercion.
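Purely as a sketch of what that partial-result reporting could look like (the type and field names below are assumptions, not part of the existing ctclone code), an error value could carry enough information for the cloner to know where to resume:

```go
package clone

import "fmt"

// ErrPartialBatch is a hypothetical error a batch fetcher could return when
// the log hands back fewer leaves than requested, so the cloner can cancel
// in-flight workers and restart from where fetching actually stopped.
type ErrPartialBatch struct {
	Start uint64 // first index that was requested
	Want  uint64 // number of leaves requested
	Got   uint64 // number of leaves actually returned
}

func (e ErrPartialBatch) Error() string {
	return fmt.Sprintf("partial batch at %d: wanted %d leaves but got %d", e.Start, e.Want, e.Got)
}

// NextIndex is the index the cloner would resume fetching from.
func (e ErrPartialBatch) NextIndex() uint64 {
	return e.Start + e.Got
}
```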

I'm open to receiving PRs for this. If none are forthcoming I'll try to make time for this but it's unlikely to happen in the next couple of weeks, at least.

@tired-engineer
Contributor Author

I agree; I actually got caught up in that "deception" and still haven't fixed the issue.

Another option could be to run just one request first, check it for alignment, and spin up the workers according to the received result (roughly sketched below).
In the approach you offer, I think it's also important to implement cleanup of cancelled or finished fetchers, as they might otherwise break the subsequent insertion transactions.
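A rough sketch of that probe-first idea (the names here are hypothetical; fetchBatch stands in for whatever fetcher the cloner actually uses):

```go
package clone

// fetchBatch stands in for the real leaf fetcher: it requests count leaves
// starting at start and reports how many the log actually returned.
type fetchBatch func(start, count uint64) (got uint64, err error)

// probeAlignment issues a single request before any workers are started and
// uses the size of the response to pick the index the parallel workers should
// begin at, so that subsequent batches line up with the log's page boundaries.
func probeAlignment(fetch fetchBatch, start, batchSize uint64) (uint64, error) {
	got, err := fetch(start, batchSize)
	if err != nil {
		return 0, err
	}
	// If the log truncated the response, start+got already sits on a page
	// boundary; if it returned a full batch, this is simply the next batch start.
	return start + got, nil
}
```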

What do you think would be the preferred way to approach the issue?

@mhutchinson
Copy link
Contributor

I like your proposal to have special-casing only on startup. Once a full chunk has been fetched, it can enter full parallel mode as it currently does, and die on incomplete chunks.

The reasons I like this are:

  • it's simpler than trying to realign mid-run
  • if the whole process is run by some orchestrator (e.g. docker), it can be configured to restart the process on death; dying and restarting will fix any transient alignment problems when the startup special-casing reruns

I think this approach is worth investigating. It seems strictly better than the current behaviour. Is this something you'd be able to contribute, @tired-engineer?

@tired-engineer
Contributor Author

Hey, @mhutchinson,

Thanks for getting back to this issue and for the feedback. I will give it a try this week and hopefully will be able to contribute.

@tired-engineer
Contributor Author

Sent the first implementation for review, a bit later than expected. I think it covers more or less everything we've discussed on this thread.

mhutchinson added a commit that referenced this issue Mar 20, 2024
… Logs (#1055)

Adapt ctclone to handle the 'coerced get-entries' feature by:
 1. issuing just one request first
 2. checking the number of returned results
 3. either continuing with the remaining workers, or stopping fetching, storing the incomplete tile, and exiting, delegating to the orchestration system to restart the process.

Fixes #1044

---------

Co-authored-by: Martin Hutchinson <mhutchinson@gmail.com>