To improve stability of our services, we need to identify issues and bottlenecks first.
In the past, when encountering issues in API reliability that were temporary in nature, we have implemented retry mechanisms into gsclient-go and the gridscale-terraform-plugin where applicable to improve the stability of the overall process. Also, underlying issues have been identified and, where possible, solved accordingly.
These retry mechanisms have the side effect that, once in place, they mask underlying issues. To improve quality for everyone, we need to make those issues visible again.
This can be tackled from either the client side or the server side. For the latter, the APIs would provide metrics of request success. This does not make all issues visible, though. In front of the actual APIs, there is an application server. In front of that, a web server. In front of that, a reverse proxy. In front of that, a firewall. In front of that, a router. Etc. Issues in these layers, or in the rolling deployment of the APIs themselves, would not be visible in API-level metrics. Instead, we want our SDKs to report errors to us - with an opt-out functionality.
gsclient-go has been selected as the first prototype, mainly because it is used by Terraform as well. Terraform, in turn, makes use of higher concurrency than many other clients do and tends to trigger deficiencies in provisioning more readily.
The idea is to send these reports to a publicly available Sentry instance, covering networking errors (e.g. connection reset) and provisioning errors (e.g. storage creation error) alike. To investigate issues, I presume we need:
- X-Request-Id, if available
- Error message/classifier
- Accessed resource type URL (e.g. /objects/servers) for aggregation
- Access method (GET, POST, PATCH, DELETE, etc.) for aggregation
- Time
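To make the list above concrete, here is a minimal sketch of what such a report could look like in Go. Everything named here is hypothetical and not part of gsclient-go: the `ErrorReport` type, the `newReport` and `resourceType` helpers, and the example host. It also assumes that object IDs appear in the path as UUIDs and can be stripped to obtain the resource type.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"regexp"
	"time"
)

// ErrorReport holds the fields proposed above. The type and its field names
// are hypothetical; they are not part of gsclient-go.
type ErrorReport struct {
	RequestID    string    // value of X-Request-Id; empty if no response was received
	Message      string    // error message or classifier
	ResourceType string    // e.g. /objects/servers, with object IDs stripped
	Method       string    // GET, POST, PATCH, DELETE, ...
	Time         time.Time // when the error was observed
}

// uuidSegment matches a path segment that is a UUID.
// Assumption: object IDs show up in request paths as UUIDs.
var uuidSegment = regexp.MustCompile(
	`/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)

// resourceType removes UUID segments from a path so that errors for
// different objects of the same type aggregate under one key.
func resourceType(path string) string {
	return uuidSegment.ReplaceAllString(path, "")
}

// newReport builds a report from a failed request. resp may be nil, which is
// the case for networking errors such as a connection reset.
func newReport(req *http.Request, resp *http.Response, err error, now time.Time) ErrorReport {
	r := ErrorReport{
		Message:      err.Error(),
		ResourceType: resourceType(req.URL.Path),
		Method:       req.Method,
		Time:         now,
	}
	if resp != nil {
		r.RequestID = resp.Header.Get("X-Request-Id")
	}
	return r
}

func main() {
	// The host below is a placeholder, not a real endpoint.
	req, err := http.NewRequest(http.MethodPatch,
		"https://api.example.invalid/objects/servers/690de890-13c0-4e76-8a01-e10ba8786e53", nil)
	if err != nil {
		panic(err)
	}
	report := newReport(req, nil, errors.New("connection reset by peer"), time.Now())
	fmt.Printf("%s %s: %s\n", report.Method, report.ResourceType, report.Message)
}
```

Stripping the IDs keeps the aggregation key small and avoids sending object identifiers along with the report.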
These reports shall also be sent when a request is retried. They shall not block the SDK and, as such, shall be sent asynchronously. Users must be able to opt out.
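The non-blocking and opt-out requirements could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the `Reporter` and `Report` types, the `GSCLIENT_DISABLE_ERROR_REPORTS` environment variable, and the `send` callback (a stand-in for the actual Sentry transport) are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// Report is a hypothetical, trimmed-down error report.
type Report struct {
	Method, ResourceType, Message string
}

// Reporter hands reports to a background goroutine so that the caller
// (the SDK request path, including retries) is never blocked.
type Reporter struct {
	enabled bool
	queue   chan Report
	done    chan struct{}
}

// NewReporter starts the background sender. send stands in for whatever
// transport would deliver the report.
func NewReporter(enabled bool, buffer int, send func(Report)) *Reporter {
	r := &Reporter{
		enabled: enabled,
		queue:   make(chan Report, buffer),
		done:    make(chan struct{}),
	}
	go func() {
		defer close(r.done)
		for rep := range r.queue {
			send(rep)
		}
	}()
	return r
}

// Report enqueues a report without blocking. It returns false if the user
// opted out or if the queue is full, in which case the report is dropped.
// It must not be called after Close.
func (r *Reporter) Report(rep Report) bool {
	if !r.enabled {
		return false
	}
	select {
	case r.queue <- rep:
		return true
	default:
		return false // queue full: drop rather than block the SDK
	}
}

// Close stops accepting reports and waits until queued ones have been sent.
func (r *Reporter) Close() {
	close(r.queue)
	<-r.done
}

func main() {
	// Hypothetical opt-out switch: reporting is on unless the variable is set.
	_, optedOut := os.LookupEnv("GSCLIENT_DISABLE_ERROR_REPORTS")

	r := NewReporter(!optedOut, 16, func(rep Report) {
		fmt.Printf("report: %s %s: %s\n", rep.Method, rep.ResourceType, rep.Message)
	})
	// Would be called on every failed attempt, including ones that are retried.
	r.Report(Report{Method: "POST", ResourceType: "/objects/storages", Message: "storage creation error"})
	r.Close()
}
```

Dropping reports when the queue is full trades completeness for the guarantee that reporting never slows down provisioning; whether that trade-off is acceptable is open for discussion.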
What are your thoughts?
twiebe changed the title from "Report request errors to gridscale sentry" to "Report request errors to gridscale" on Feb 5, 2021
If the goal is to get backtraces from failed API requests, I think it should happen on the server side of the API, not on the client side. Reasons:
It is much simpler to implement. Adding Sentry to a couple of services on the backend can be done with only a few lines and can just be always on. Not much configuration handling needed. No opt-in/-out skirmish.
It is more accurate, since events can be sent more reliably as network conditions on the server-side are known. Also the information depth of the stack traces is much deeper (you would see where it failed, in what environment, et cetera). If you collect traces on the client you essentially just get the status code, a short error message, and maybe a request ID (that you still have to manually map to server-side collected traces).
It is safer for users. If we added this to the clients, we would have to provide users with a way to opt in or out. We would have to document why we are doing telemetry and what data is collected. I guess we cannot guarantee with Sentry that no secrets are included in the collected data, and the possibility that the data is personal data in the sense of the GDPR would mean extra overhead on our side to clean up that information on the Sentry side before it is actually stored.
So really, tl;dr: if the goal is getting good traces on failed API requests, it is better done on the server side, where requests are actually processed, rather than in client libraries or applications.