
ctclone Fails with Incomplete CT Log Responses #1044

Closed
tired-engineer opened this issue Dec 27, 2023 · 5 comments · Fixed by #1055

@tired-engineer
Contributor

tired-engineer commented Dec 27, 2023

Issue Description

The ctclone tool fails when a Certificate Transparency (CT) log returns fewer entries than requested. This becomes a problem during periodic ctclone runs and near the ends of logs: at certain points, the position in the cloned log aligns in such a way that the CT log returns fewer than the expected number of entries.

Specific Example

An instance of this issue can be observed with Sectigo's Sabre 2025h2 log. With a "negotiated" page size of 256, the log returns only 165 entries for a particular range, aligning the position to 52992 (207 * 256), a multiple of the log's page size, and ctclone fails with the following error:

worker 0: Retryable error getting data: "LeafFetcher.Batch(52827, 256): wanted 256 leaves but got 165"
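For illustration, here is a minimal sketch, not taken from the ctclone codebase, of the behaviour the log appears to exhibit: it clamps a get-entries response at the end of the page the start index falls in. The function name and the clamping rule are assumptions, but the arithmetic reproduces the 165-entry response above.

```go
package main

import "fmt"

// coercedCount models a CT log that truncates get-entries responses at its
// page boundaries: a request starting mid-page is only served up to the end
// of the page that the start index falls in.
func coercedCount(start, requested, pageSize uint64) uint64 {
	pageEnd := (start/pageSize + 1) * pageSize // first index after the current page
	if start+requested <= pageEnd {
		return requested
	}
	return pageEnd - start
}

func main() {
	// Requesting 256 entries from index 52827 against a 256-entry page:
	// the log stops at 52992 (207 * 256) and returns only 165 entries.
	fmt.Println(coercedCount(52827, 256, 256)) // prints 165
}
```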

Additional Context

Further details about this behavior in CT logs can be found in the discussion on Let's Encrypt's forum regarding the 'coerced get-entries' feature.

Important Note

The proposed solution covers only the cases where the log's actual page size is a multiple of the batch size requested by ctclone, so that the negotiated batches line up with the log's page boundaries. If the actual page size does not follow this pattern, additional adjustments may be required.
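As a concrete illustration of that constraint (a sketch only, not code from the proposed fix):

```go
package clone

// batchAligns reports whether batches of the requested size line up with the
// log's page boundaries, i.e. whether the page size is a whole multiple of
// the batch size. This mirrors the assumption described above.
func batchAligns(logPageSize, batchSize uint64) bool {
	return batchSize != 0 && logPageSize%batchSize == 0
}
```

For example, batchAligns(256, 128) is true, while batchAligns(256, 100) is false and would need the extra adjustments mentioned above.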

Edited: formatting.

@mhutchinson
Contributor

Thanks for the well-written bug report and the attempted fix in #1045. This sounds like it should be an easy fix, but I think that appearance is deceptive. I wasn't aware of any logs trying to coerce alignment at specific indices, and the design of the cloner doesn't lend itself well to readjusting to these hints; the current design uses a very greedy strategy in the batch.go Bulk method, where it starts goroutine workers with statically computed ranges to fetch.

I think the BatchFetch mechanism will need some way to elegantly report up that it only got partial results. The cloner will then need to cancel other fetches in progress and restart fetching, using a start index based on the last result fetched from the incomplete range. To give confidence that this works, we'd probably want to add some kind of functional test that emulates a log performing this coercion.
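Purely as a sketch of what that partial-result reporting could look like (the type and field names below are assumptions, not part of the existing ctclone code), an error value could carry enough information for the cloner to know where to resume:

```go
package clone

import "fmt"

// ErrPartialBatch is a hypothetical error a batch fetcher could return when
// the log hands back fewer leaves than requested, so the cloner can cancel
// in-flight workers and restart from where fetching actually stopped.
type ErrPartialBatch struct {
	Start uint64 // first index that was requested
	Want  uint64 // number of leaves requested
	Got   uint64 // number of leaves actually returned
}

func (e ErrPartialBatch) Error() string {
	return fmt.Sprintf("partial batch at %d: wanted %d leaves but got %d", e.Start, e.Want, e.Got)
}

// NextIndex is the index the cloner would resume fetching from.
func (e ErrPartialBatch) NextIndex() uint64 {
	return e.Start + e.Got
}
```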

I'm open to receiving PRs for this. If none are forthcoming I'll try to make time for this but it's unlikely to happen in the next couple of weeks, at least.

@tired-engineer
Contributor Author

I agree; I actually got caught up in that "deception" and still haven't fixed the issue.

Another option could be to run just one request first, check it for alignment, and spin up the workers according to the received result (roughly sketched below).
In the approach you offer, I think it's also important to implement cleanup of cancelled or finished fetchers, as they might otherwise break the subsequent insertion transactions.
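A rough sketch of that probe-first idea (the names here are hypothetical; fetchBatch stands in for whatever fetcher the cloner actually uses):

```go
package clone

// fetchBatch stands in for the real leaf fetcher: it requests count leaves
// starting at start and reports how many the log actually returned.
type fetchBatch func(start, count uint64) (got uint64, err error)

// probeAlignment issues a single request before any workers are started and
// uses the size of the response to pick the index the parallel workers should
// begin at, so that subsequent batches line up with the log's page boundaries.
func probeAlignment(fetch fetchBatch, start, batchSize uint64) (uint64, error) {
	got, err := fetch(start, batchSize)
	if err != nil {
		return 0, err
	}
	// If the log truncated the response, start+got already sits on a page
	// boundary; if it returned a full batch, this is simply the next batch start.
	return start + got, nil
}
```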

What do you think would be the preferred way to approach the issue?

@mhutchinson
Copy link
Contributor

I like your proposal to have special-casing only on startup. Once a full chunk has been fetched, it can enter full parallel mode as it currently does, and die on incomplete chunks.

The reasons I like this are:

  • it's simpler than trying to realign mid-run
  • if the whole process is run by some orchestrator (e.g. docker), it can be configured to restart the process on death; dying and restarting will fix any transient alignment problems when the startup special-casing reruns

I think this approach is worth investigating. It seems strictly better than the current behaviour. Is this something you'd be able to contribute, @tired-engineer?

@tired-engineer
Contributor Author

Hey, @mhutchinson,

Thanks for getting back to this issue and for the feedback. I will give it a try this week and hopefully will be able to contribute.

@tired-engineer
Contributor Author

Sent the first implementation for review, a bit later than expected. I think it covers more or less everything we've discussed on this thread.

mhutchinson added a commit that referenced this issue Mar 20, 2024
… Logs (#1055)

Adapt ctclone to handle the 'coerced get-entries' feature by:
 1. issuing just one request first
 2. checking the number of returned results
 3. either continuing with the remaining workers, or stopping fetching, storing the incomplete tile, and exiting, delegating to the orchestration system to restart the process.

Fixes #1044

---------

Co-authored-by: Martin Hutchinson <mhutchinson@gmail.com>