Create metric to record upload failures #948
Create metric to record upload failures #948brizental wants to merge 5 commits intomozilla:masterfrom
Conversation
mdboom
left a comment
There was a problem hiding this comment.
This looks great.
As for timeouts, I guess those will get recorded as "recoverable" errors now? My gut thinks that's an important class of error to separate out, but I could be convinced otherwise. There are probably a whole other array of errors that could happen, and we have to draw the line somewhere at how fine-grained we want to be about them.
| /// * `uuid` - The UUID of the ping in question. | ||
| /// * `status` - The upload result. | ||
| pub fn process_ping_upload_response(&self, uuid: &str, status: UploadResult) { | ||
| status.record_error(self); |
There was a problem hiding this comment.
Note-worthy (before I give this a full look): The Python bindings won't initialize a full Glean anymore in the upload process.
Therefore Python will be unable to report any of those errors
(As this code here is in src/lib.rs only, it will all continue to work and not crash or anything)
There was a problem hiding this comment.
Yes I saw that and it is the reason why I put it in Glean and not in the UploadManager. We don't want to record anything in case there is no Glean object, correct?
There was a problem hiding this comment.
Good catch, @badboy. We could create a bug for that and not block this -- I still wonder if a good way around that would be to have a pure-Rust uploading binary (which would actually do a full init so it could record these things).
There was a problem hiding this comment.
Let's file a bug to add support for upload error recording to Python.
There was a problem hiding this comment.
badboy
left a comment
There was a problem hiding this comment.
Minimal documentation nits only.
| type: labeled_counter | ||
| description: | ||
| Counts the number of times pings fail to upload | ||
| and the reason why they failed. |
There was a problem hiding this comment.
Maybe it's worth mentioning that (obviously) these numbers are reported after the fact?
That is: if a metrics ping upload fails and gets retried it still won't have this in there, only the next one will.
There was a problem hiding this comment.
It could be confusing mention this, because this metric will count the failures for all ping uploads, not only the metrics ping. It could be misleading to mention this situation
if a metrics ping upload fails and gets retried it still won't have this in there, only the next one will.
because it only applies for the metrics ping.
There was a problem hiding this comment.
Hm, @mdboom, got an idea how to phrase this?
I'd like us to start being a bit more explicit of when and where to expect this metric
There was a problem hiding this comment.
It's not great -- this is the best I could come up with. What do you think?
Counts the number of ping upload failures, by type of failure.
This includes failures for all ping types, though the counts appear in the next successfully sent `metrics` ping.
|
This android failure on CI is probably unrelated since it failed on a commit where I only changed description text in the |
It's also only on TaskCluster, can be ignored for now. |
Related to Bug 1589124
I am not sure adding all this logic to the
UploadResultis the best idea. I saw theerror_recordingmodule, but that is very specific to the metric errors we may get. I also thought about adding this to each binding, but that would mean a lot of repeated code and also if a user decides to implement the uploader themselves they would have to know about this metric.I also didn't add a label for timeout errors in this PR and I think we should discuss whether it should be added at all, since it would mean yet another different result variant for the
UploadResultenum. If we decide that yes we should have a new Timeout variant I can add it in a different PR.