Report request errors to gridscale #180

Open
twiebe opened this issue Feb 5, 2021 · 2 comments
Labels
discussion Things that need a decision

Comments

@twiebe
Member

twiebe commented Feb 5, 2021

To improve stability of our services, we need to identify issues and bottlenecks first.

In the past, when we encountered API reliability issues that were temporary in nature, we implemented retry mechanisms in gsclient-go and the gridscale-terraform-plugin where applicable to improve the stability of the overall process. Underlying issues have also been identified and, where possible, solved.

These retry mechanisms have the side effect that, once in place, they mask underlying issues. To improve quality for everyone, we need to make those issues visible again.

This can be tackled from either the client or the server side. For the latter, the APIs would provide metrics of request success. This does not make all issues visible, though. In front of the actual APIs there is an application server; in front of that, a web server; in front of that, a reverse proxy; in front of that, a firewall; in front of that, a router; and so on. Issues in these layers, or in the rolling deployment of the APIs themselves, would not be visible in API-level metrics. Instead, we want our SDKs to report errors to us, with opt-out functionality.

gsclient-go has been selected as the first prototype, mainly because it is used by Terraform as well. Terraform, in turn, uses higher concurrency than many other clients do and tends to trigger deficiencies in provisioning more readily.

The idea is to send these reports to a publicly available Sentry instance, covering networking errors (e.g. connection reset) and provisioning errors (e.g. storage creation error) alike. To investigate issues, I presume we need:

  • X-Request-Id, if available
  • Error message/classifier
  • Accessed resource type URL (e.g. /objects/servers) for aggregation
  • Access method (GET, POST, PATCH, DELETE, etc.) for aggregation
  • Time
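
The fields above could be carried in a small struct. The following is only a sketch: `ErrorReport`, its field names, and `resourceType` are hypothetical and not part of gsclient-go, and it assumes that object identifiers in request paths are UUIDs, so that replacing them with a placeholder leaves a resource type URL suitable for aggregation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// ErrorReport is a hypothetical report payload (not a gsclient-go type).
type ErrorReport struct {
	RequestID    string    // value of X-Request-Id; empty if none was returned
	Message      string    // error message or classifier
	ResourceType string    // e.g. /objects/servers, with object IDs replaced
	Method       string    // GET, POST, PATCH, DELETE, ...
	Time         time.Time // when the error was observed
}

// Assumption: object IDs appear in paths as canonical UUID segments.
var uuidSegment = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// resourceType drops any query string and replaces UUID path segments with
// "{id}", so that errors on different objects of one type aggregate together.
func resourceType(path string) string {
	if i := strings.IndexByte(path, '?'); i >= 0 {
		path = path[:i]
	}
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if uuidSegment.MatchString(s) {
			segments[i] = "{id}"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	r := ErrorReport{
		RequestID:    "", // not every failure has a response to read it from
		Message:      "connection reset by peer",
		ResourceType: resourceType("/objects/servers/690de890-13c0-4e76-8a01-e10ba8786e53"),
		Method:       "PATCH",
		Time:         time.Now().UTC(),
	}
	fmt.Println(r.Method, r.ResourceType)
}
```

Stripping identifiers also keeps object UUIDs out of the report, which narrows what leaves the user's machine.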

These reports shall also be sent when a request is retried. They shall not block the SDK and, as such, shall be sent asynchronously. Users must be able to opt out.
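
One way to meet the non-blocking and opt-out requirements is a buffered queue drained by a background goroutine. This is a minimal sketch, not an existing gsclient-go API: `Reporter`, `Report`, and the `send` callback are hypothetical names, and dropping reports when the queue is full is an assumed policy.

```go
package main

import (
	"fmt"
	"sync"
)

// Report is a hypothetical, reduced report payload.
type Report struct {
	Method, ResourceType, Message string
}

// Reporter queues reports and delivers them on a background goroutine.
type Reporter struct {
	enabled bool
	queue   chan Report
	wg      sync.WaitGroup
}

// NewReporter starts a worker that passes every queued report to send.
// With enabled == false (the user opted out) no worker is started.
func NewReporter(enabled bool, buffer int, send func(Report)) *Reporter {
	r := &Reporter{enabled: enabled, queue: make(chan Report, buffer)}
	if !enabled {
		return r
	}
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		for rep := range r.queue {
			send(rep)
		}
	}()
	return r
}

// Report enqueues rep without ever blocking the caller. It returns false
// if reporting is disabled or the queue is full (the report is dropped).
// It must not be called after Close.
func (r *Reporter) Report(rep Report) bool {
	if !r.enabled {
		return false
	}
	select {
	case r.queue <- rep:
		return true
	default:
		return false
	}
}

// Close stops the worker after it has delivered all queued reports.
// It must be called at most once.
func (r *Reporter) Close() {
	if !r.enabled {
		return
	}
	close(r.queue)
	r.wg.Wait()
}

func main() {
	var sent []Report
	r := NewReporter(true, 16, func(rep Report) { sent = append(sent, rep) })
	// A retry loop would call Report for each failed attempt, not only the last.
	r.Report(Report{Method: "POST", ResourceType: "/objects/storages", Message: "connection reset by peer"})
	r.Close() // waits for delivery, so reading sent afterwards is safe
	fmt.Println(len(sent))
}
```

Dropping on a full queue trades completeness for the guarantee that a slow or unreachable reporting endpoint cannot stall API calls.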

What are your thoughts?

@twiebe changed the title from "Report request errors to gridscale sentry" to "Report request errors to gridscale" on Feb 5, 2021
@nvthongswansea
Member

I wonder whether we should publish our Sentry information (such as connection info).

@bkircher
Contributor

bkircher commented Feb 6, 2021

What are your thoughts?

If the goal is to get backtraces from failed API requests, I think it should happen on the server side of the API, not on the client side. Reasons:

  1. It is much simpler to implement. Adding Sentry to a couple of services on the backend can be done with only a few lines and can just be always on. Not much configuration handling needed. No opt-in/-out skirmish.
  2. It is more accurate, since events can be sent more reliably as network conditions on the server side are known. The stack traces are also much more informative (you would see where it failed, in what environment, et cetera). If you collect traces on the client, you essentially just get the status code, a short error message, and maybe a request ID (that you still have to map manually to traces collected on the server side).
  3. It is safer for users. If we added this to the clients, we would have to provide users with a way to opt in or out. We would have to document why we are doing telemetry and what data is collected. I guess we cannot guarantee with Sentry that no secrets are included in the collected data, and the possibility that the data is personal data in the sense of GDPR would mean extra overhead on our side to clean up that information on the Sentry side before it is actually stored.

So really, tl;dr: if the goal is getting good traces on failed API requests, it is better done on the server side, where requests are actually processed, rather than in client libraries or applications.

@bkircher added the "discussion" label (Things that need a decision) on Feb 7, 2021

3 participants