Clean up detection Sentry events + tests #5833
Otherwise, it can remain unclear in the diagnostics whether it was InstallationV2 or InstallationV2CacheBust that timed out.
The current production logs show two types of verification timeouts:
* service_error: "Unhandled Browserless response status: 408" (vast majority of cases)
* service_error: :timeout (only a few cases)

The latter happens when we hit the Req receive_timeout (endpoint_timeout + 2s). I've seen Browserless not respect the timeout param from time to time, so it's better to keep the timeout logic "in-house" only.
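To make the two timeout shapes concrete, here is a minimal sketch of how both could be recognized as timeouts, along with the receive_timeout arithmetic mentioned above. The module name, function names, and the plain-map diagnostics shape are assumptions for illustration; they are not taken from the PR.

```elixir
defmodule TimeoutSketch do
  # Hypothetical helper, not code from this PR.
  # Both service_error shapes quoted from the production logs count as timeouts.
  def timeout?(%{service_error: "Unhandled Browserless response status: 408"}), do: true
  def timeout?(%{service_error: :timeout}), do: true
  def timeout?(_diagnostics), do: false

  # The Req receive_timeout is described as endpoint_timeout + 2s.
  # Milliseconds are assumed here as the unit.
  def receive_timeout(endpoint_timeout_ms), do: endpoint_timeout_ms + 2_000
end
```

With this, `TimeoutSketch.receive_timeout(10_000)` gives 12_000, so the in-house deadline fires two seconds after Browserless should have given up on its own.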
...but still consider them "unhandled" for telemetry, also notifying Sentry and logging the warning.
Also rename current liveview modules and routes, removing the v2 suffix
Also fix dockerignore and elixir.yml referencing a wrong priv path
1931b19 to c2163df
String.contains?(extra, "net::") -> failure(:client_issue)
String.contains?(String.downcase(extra), "execution context") -> failure(:client_issue)
nitpick, non-blocking: This remap from :browserless_client_error to :client_issue and :unknown_issue loses some information in the process. At this point, extra is known, but once we map it to :unknown_issue, it gets thrown away and in the checks module we go to the case
_unknown_failure ->
{true, true, "Unknown failure"}
It seems there are actually two codes that we want to emit from the Detection check: :browserless_client_error and :browserless_client_error_silenced.
The information doesn't get lost. These error-grouping atoms are just for determining which kind of telemetry we want. In any case, if it's an interesting case, the whole diagnostics struct (e.g. %{..., service_error: %{code: :browserless_client_error, extra: "whatever message"}}, ...) is logged and captured by Sentry as well.
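A rough sketch of the point being made: the grouping atom only selects the telemetry branch, while the diagnostics struct itself is passed along unchanged for logging. The module, the `group/1` and `report/1` functions, and the returned map are hypothetical; only the two string checks mirror the diff above.

```elixir
defmodule DiagnosticsSketch do
  # Hypothetical stand-in for the real diagnostics struct.
  defstruct service_error: nil

  # The grouping atom only decides which telemetry to emit.
  def group(%__MODULE__{service_error: %{code: :browserless_client_error, extra: extra}}) do
    cond do
      String.contains?(extra, "net::") -> :client_issue
      String.contains?(String.downcase(extra), "execution context") -> :client_issue
      true -> :unknown_issue
    end
  end

  def group(%__MODULE__{}), do: :unknown_issue

  # The whole struct, including `extra`, is still available for logging.
  def report(%__MODULE__{} = diagnostics) do
    %{group: group(diagnostics), logged: inspect(diagnostics)}
  end
end
```

Even when `group/1` returns :unknown_issue, the `logged` value still carries the original `extra` message.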
def interpret(%__MODULE__{service_error: %{code: code}}, _url)
    when code in [:domain_not_found, :invalid_url] do
  failure(:client_issue)
nitpick, non-blocking: this does not seem like a client issue (the client being the browser driven by Browserless.io); it's an invalid input issue: we don't even start the browser in these scenarios
Fair point, thanks! I've renamed it to :customer_website_issue, which should make it clearer. 04942e6
Changes
This PR cleans up Sentry reporting for detection. Doing so, we're also changing the way we observe detection failures. Instead of "unhandled vs handled" (which makes sense in verification), a detection result is either a success or a failure. The telemetry events are also renamed accordingly in this PR.
A detection failure can be two things - an issue with the customer website, or an issue on our side. The issues on our side are further split into Browserless issues vs our own. Hence the telemetry bit is branched out quite a lot, but I think it should help us prioritize any future issues better.
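The branching described above could look roughly like the sketch below. Only :customer_website_issue appears in this PR's discussion; the :browserless_issue and :internal_issue atoms, the module name, and the event names are all assumptions made for illustration.

```elixir
defmodule DetectionTelemetrySketch do
  # Hypothetical event prefix; the PR's real event names are not shown here.
  @prefix [:sketch, :detection]

  # A detection result is either a success or a failure.
  def event(:success), do: @prefix ++ [:success]

  # A failure is either a customer website issue...
  def event({:failure, :customer_website_issue}),
    do: @prefix ++ [:failure, :customer_website]

  # ...or an issue on our side, split into Browserless vs our own.
  def event({:failure, :browserless_issue}),
    do: @prefix ++ [:failure, :ours, :browserless]

  def event({:failure, :internal_issue}),
    do: @prefix ++ [:failure, :ours, :internal]
end
```

Keeping the split in the event name rather than in metadata is one possible design choice; it lets each branch be alerted on separately.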
Tests
Changelog
Documentation
Dark mode