Report request errors to gridscale #180

twiebe · 2021-02-05T13:54:57Z

To improve stability of our services, we need to identify issues and bottlenecks first.

In the past, when encountering issues in API reliability that were temporary in nature, we have implemented retry mechanisms into gsclient-go and the gridscale-terraform-plugin where applicable to improve the stability of the overall process. Also, underlying issues have been identified and, where possible, solved accordingly.

These retry mechanisms have the side effect that, once in place, they mask underlying issues. To improve quality for everyone, we need to make those issues visible again.

This can be tackled from either the client side or the server side. For the latter, the APIs would provide metrics of request success. This does not make all issues visible though. In front of the actual APIs, there is an application server. In front of that, a web server. In front of that, a reverse proxy. In front of that, a firewall. In front of that, a router. Etc. Issues in these layers, or in the rolling deployment of the APIs themselves, would not be visible in API-level metrics. Instead, we want our SDKs to report errors to us - with the ability to opt out.

gsclient-go has been selected as the first prototype, mainly because it is used by Terraform as well. Terraform, in turn, makes use of higher concurrency than many other clients do and tends to trigger deficiencies in provisioning more readily.

The idea is to send these reports to a publicly available Sentry instance, covering networking errors (e.g. connection reset) and provisioning errors (e.g. storage creation error) alike. To investigate issues, I presume we need:

X-Request-Id, if available
Error message/classifier
Accessed resource type URL (e.g. /objects/servers) for aggregation
Access method (GET, POST, PATCH, DELETE, etc) for aggregation
Time
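To make the field list above concrete, here is a minimal sketch in Go of what such a report could look like. Everything in it is an assumption for illustration: the `ErrorReport` type, its field names, and the `resourceType` helper are hypothetical and not part of gsclient-go. The helper assumes that individual objects are addressed by a UUID path segment and drops those segments so that errors aggregate by resource type.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// ErrorReport is a hypothetical payload holding the fields listed above.
// It is not an existing gsclient-go type.
type ErrorReport struct {
	RequestID    string    // value of X-Request-Id, empty if not available
	Message      string    // error message or classifier
	ResourceType string    // aggregated path, e.g. /objects/servers
	Method       string    // GET, POST, PATCH, DELETE, ...
	Time         time.Time // when the error was observed
}

// uuidPattern matches a canonical UUID path segment (assumption: objects
// are identified by UUIDs in the request path).
var uuidPattern = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// resourceType removes UUID segments from a request path so that requests
// against different objects of the same type share one aggregation key.
func resourceType(path string) string {
	segments := strings.Split(path, "/")
	kept := segments[:0] // filter in place, reusing the backing array
	for _, s := range segments {
		if uuidPattern.MatchString(s) {
			continue
		}
		kept = append(kept, s)
	}
	return strings.Join(kept, "/")
}

func main() {
	report := ErrorReport{
		RequestID:    "", // not every failure has a response to read it from
		Message:      "connection reset by peer",
		ResourceType: resourceType("/objects/servers/690de890-13c0-4e76-8a01-e10ba8786e53"),
		Method:       "GET",
		Time:         time.Now().UTC(),
	}
	fmt.Println(report.Method, report.ResourceType) // prints "GET /objects/servers"
}
```

Note that the sketch deliberately carries no request body, headers, or credentials, which keeps the report limited to the fields named in the list.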

These reports shall also be sent when a request is retried. They shall not block the SDK and, as such, shall be sent asynchronously. Users must be able to opt out.

What are your thoughts?


nvthongswansea · 2021-02-05T18:11:06Z

I wonder if we should publish our Sentry information (such as connection info).

bkircher · 2021-02-06T12:49:11Z

What are your thoughts?

If the goal is to get backtraces from failed API requests, I think it should happen on the server side of the API, not on the client side. Reasons:

It is much simpler to implement. Adding Sentry to a couple of services on the backend can be done with only a few lines and can just be always on. Not much configuration handling needed. No opt-in/-out skirmish.
It is more accurate, since events can be sent more reliably when network conditions on the server side are known. The stack traces also carry much more information (you would see where it failed, in what environment, et cetera). If you collect traces on the client, you essentially just get the status code, a short error message, and maybe a request ID (that you still have to manually map to traces collected on the server side).
It is safer for users. If we added this to the clients, we would have to provide users with a way to opt in or out. We would have to document why we are doing telemetry and what data is collected. I guess we cannot guarantee with Sentry that no secrets are included in the collected data, and the possibility that the data is personal data in the sense of GDPR would mean extra overhead on our side to clean up that information on the Sentry side before it is actually stored.

So really, tl;dr, if the goal is getting good traces on failed API requests, it is better done on the server side, where requests are actually processed, rather than in client libraries or applications.

twiebe changed the title ~~Report request errors to gridscale sentry~~ Report request errors to gridscale Feb 5, 2021

bkircher added the discussion Things that need a decision label Feb 7, 2021
