Edge Gateway CF Worker system errors #51

Open
vasco-santos opened this issue Apr 22, 2022 · 5 comments
Labels
kind/bug A bug in existing code (including security flaws) P2 Medium: Good to have, but can wait until someone steps up pi/sre-kickoff

Comments

@vasco-santos
Member

vasco-santos commented Apr 22, 2022

We have been seeing Cloudflare (CF) Worker errors in the wild: "Network Connection Lost", together with sporadic "KV GET failed: 403 Forbidden" and "KV GET failed: 501 Not Implemented" errors (we do a KV get for every request).

Some details of one of the requests: worker_id_string _5IO2WMTFx + cf_ray_string 6ff44daab9ee5a55

Per conversations in the Cloudflare Developers Discord, we are trying a temporary solution recommended at https://discord.com/channels/595317990191398933/779390076219686943/967047749881192468 to work around these errors (#50).

We should look into the root cause with the CF Workers team.

@vasco-santos vasco-santos added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Apr 22, 2022
@fabiolrodriguez fabiolrodriguez self-assigned this Apr 26, 2022
@dchoi27 dchoi27 added P2 Medium: Good to have, but can wait until someone steps up pi/sre-kickoff and removed need/triage Needs initial labeling and prioritization labels Apr 28, 2022
@olizilla
Contributor

olizilla commented May 5, 2022

@vasco-santos can you quote the recommended solution here in the issue please.

@vasco-santos
Member Author

> @vasco-santos can you quote the recommended solution here in the issue please.

Not really a fancy solution:

We had a single get on the KV namespace for each request. We have millions of requests each day worldwide, so we had to implement a recursive fail-safe pattern.
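For readers without access to the Discord thread, here is a minimal sketch of what a recursive retry around a KV read might look like. It is not the code from the Discord message or from #50. The `KVLike` interface, the `getWithRetry` name, and the retry count and delay are all assumptions made for illustration.

```typescript
// Minimal shape of a KV binding for this sketch (assumption): a get(key)
// that resolves to the stored string, or null when the key is absent.
interface KVLike {
  get(key: string): Promise<string | null>;
}

// Hypothetical helper: retry a failing KV read recursively, then give up.
async function getWithRetry(
  kv: KVLike,
  key: string,
  retries = 3,
  delayMs = 50
): Promise<string | null> {
  try {
    return await kv.get(key);
  } catch (err) {
    // Out of attempts: surface the last error to the caller.
    if (retries <= 0) throw err;
    // Short pause before trying again.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    return getWithRetry(kv, key, retries - 1, delayMs);
  }
}
```

With `retries = 3` a read is attempted at most four times. A missing key (`null`) is returned immediately and is not treated as a failure.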

We did this, and it looks like it helped, but with ~15M req/24h we start hitting these issues in CF. We need to talk with them and try to get this fixed at the root. In the meantime, a temporary solution could be replicating the KV, so that we have something like 3 different KV namespaces with the same content, and then randomly choosing one of them per request to check the denylist. The cron job would need to update the content in all 3 of them.
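The replica idea could be sketched roughly as follows. This is an illustration only: `KVLike`, `pickReplica` and `isDenied` are hypothetical names, and it assumes a key being present in the denylist KV means "denied".

```typescript
// Minimal shape of a KV binding for this sketch (assumption).
interface KVLike {
  get(key: string): Promise<string | null>;
}

// Pick one replica uniformly at random. `random` is injectable so the
// choice can be made deterministic in tests.
function pickReplica<T>(replicas: T[], random: () => number = Math.random): T {
  if (replicas.length === 0) throw new Error("no replicas configured");
  return replicas[Math.floor(random() * replicas.length)];
}

// Check the denylist against a single, randomly chosen replica.
// Assumption: any non-null value stored under the key means "denied".
async function isDenied(
  replicas: KVLike[],
  key: string,
  random: () => number = Math.random
): Promise<boolean> {
  const value = await pickReplica(replicas, random).get(key);
  return value !== null;
}
```

Each request still performs one KV read, but the reads are spread across the replicas. The result is only correct if the cron job keeps all replicas in sync.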

@vasco-santos
Member Author

There is another angle we can take here. We currently do this check before going to the cache, because we may have cached a bad response.

Considering the cache hit rate is really high, we can greatly reduce the number of requests to KV if we turn things around and go to the cache first. We then just need to remove things from the cache when they are added to KV: https://developers.cloudflare.com/workers/runtime-apis/cache/#delete.
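A rough sketch of the reordered flow, to make the saving concrete: the KV read happens only on a cache miss. The interfaces and the `cacheFirst` and `fetchOrigin` names are hypothetical stand-ins, not the gateway's actual code. In a Worker the cache would be the Cache API and the keys would be requests rather than plain strings.

```typescript
// Simplified stand-ins for the Workers cache and KV bindings (assumptions).
interface CacheStore<R> {
  match(key: string): Promise<R | undefined>;
  put(key: string, value: R): Promise<void>;
}
interface KVLike {
  get(key: string): Promise<string | null>;
}

type Result<R> =
  | { source: "cache" | "origin"; body: R }
  | { source: "denied" };

// Cache first: the denylist in KV is only consulted on a cache miss.
async function cacheFirst<R>(
  key: string,
  cache: CacheStore<R>,
  denylist: KVLike,
  fetchOrigin: (key: string) => Promise<R>
): Promise<Result<R>> {
  const cached = await cache.match(key);
  if (cached !== undefined) return { source: "cache", body: cached }; // no KV read

  if ((await denylist.get(key)) !== null) return { source: "denied" };

  const body = await fetchOrigin(key);
  await cache.put(key, body);
  return { source: "origin", body };
}
```

This ordering is only safe if cached entries are removed whenever their key is added to the denylist. Otherwise a denied response keeps being served from the cache.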

For this, we would need a protected (JWT token) route /cache/delete in the worker gateway (or API), and we would need to automate this call whenever we update KV.
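As a sketch of the handler logic behind such a route: check a bearer token, then delete the given URLs from the cache. Everything here is hypothetical (`handleCacheDelete`, `CacheLike`, `TokenVerifier`, the status codes). JWT verification is left as an injected function rather than implemented.

```typescript
// Stand-in for the part of the cache used here (assumption): delete(url)
// resolves to true when an entry was removed.
interface CacheLike {
  delete(url: string): Promise<boolean>;
}

// Hypothetical JWT check, injected so that this sketch stays library-free.
type TokenVerifier = (token: string) => Promise<boolean>;

// Returns the HTTP status a hypothetical /cache/delete route would answer with.
async function handleCacheDelete(
  authorization: string | null,
  urls: string[],
  cache: CacheLike,
  verifyToken: TokenVerifier
): Promise<number> {
  // Expect an "Authorization: Bearer <token>" header value.
  const token = authorization?.startsWith("Bearer ")
    ? authorization.slice("Bearer ".length)
    : null;
  if (token === null || !(await verifyToken(token))) return 401;

  // Remove every listed URL from the cache.
  await Promise.all(urls.map((url) => cache.delete(url)));
  return 204;
}
```

The job that updates KV would call this route with the URLs of newly denylisted content. One thing worth checking in the linked docs is how far a Cache API delete reaches, since it may only affect the data center where the Worker runs.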

Thoughts @olizilla ?

@olizilla
Contributor

olizilla commented Sep 2, 2022

Yes that seems reasonable. If it's in the cache, let's respond ASAP! If it's bad, clear it out of the cache!

@vasco-santos
Member Author

An update here: Cloudflare KV seems to have scaled and we have not seen these errors for a few weeks now, even with larger scale on our side. Keeping the retry for now does not hurt though.

Even so, I think we should track the work for the cache optimization described above. I created an issue for it in the new, more appropriate repo: web3-storage/reads#44

4 participants