Skip to content

Create metric to record upload failures #948

Closed
brizental wants to merge 5 commits intomozilla:masterfrom
brizental:1589124
Closed

Create metric to record upload failures #948
brizental wants to merge 5 commits intomozilla:masterfrom
brizental:1589124

Conversation

@brizental
Copy link
Copy Markdown
Contributor

Related to Bug 1589124

I am not sure adding all this logic to the UploadResult is the best idea. I saw the error_recording module, but that is very specific to the metric errors we may get. I also thought about adding this to each binding, but that would mean a lot of repeated code and also if a user decides to implement the uploader themselves they would have to know about this metric.

I also didn't add a label for timeout errors in this PR and I think we should discuss whether it should be added at all, since it would mean yet another different result variant for the UploadResult enum. If we decide that yes we should have a new Timeout variant I can add it in a different PR.

@brizental brizental requested review from badboy and mdboom June 8, 2020 12:36
@auto-assign auto-assign Bot requested a review from Dexterp37 June 8, 2020 12:36
@brizental brizental removed the request for review from Dexterp37 June 8, 2020 12:36
Copy link
Copy Markdown
Contributor

@mdboom mdboom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great.

As for timeouts, I guess those will get recorded as "recoverable" errors now? My gut thinks that's an important class of error to separate out, but I could be convinced otherwise. There are probably a whole other array of errors that could happen, and we have to draw the line somewhere at how fine-grained we want to be about them.

Comment thread glean-core/src/upload/result.rs Outdated
Comment thread glean-core/src/lib.rs Outdated
/// * `uuid` - The UUID of the ping in question.
/// * `status` - The upload result.
pub fn process_ping_upload_response(&self, uuid: &str, status: UploadResult) {
status.record_error(self);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note-worthy (before I give this a full look): The Python bindings won't initialize a full Glean anymore in the upload process.
Therefore Python will be unable to report any of those errors
(As this code here is in src/lib.rs only, it will all continue to work and not crash or anything)

Copy link
Copy Markdown
Contributor Author

@brizental brizental Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I saw that and it is the reason why I put it in Glean and not in the UploadManager. We don't want to record anything in case there is no Glean object, correct?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, @badboy. We could create a bug for that and not block this -- I still wonder if a good way around that would be to have a pure-Rust uploading binary (which would actually do a full init so it could record these things).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file a bug to add support for upload error recording to Python.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brizental brizental requested a review from mdboom June 8, 2020 15:44
@brizental brizental requested a review from badboy June 9, 2020 09:04
Copy link
Copy Markdown
Member

@badboy badboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimal documentation nits only.

Comment thread glean-core/metrics.yaml Outdated
type: labeled_counter
description:
Counts the number of times pings fail to upload
and the reason why they failed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth mentioning that (obviously) these numbers are reported after the fact?
That is: if a metrics ping upload fails and gets retried it still won't have this in there, only the next one will.

Copy link
Copy Markdown
Contributor Author

@brizental brizental Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be confusing mention this, because this metric will count the failures for all ping uploads, not only the metrics ping. It could be misleading to mention this situation

if a metrics ping upload fails and gets retried it still won't have this in there, only the next one will.

because it only applies for the metrics ping.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, @mdboom, got an idea how to phrase this?
I'd like us to start being a bit more explicit of when and where to expect this metric

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not great -- this is the best I could come up with. What do you think?

Counts the number of ping upload failures, by type of failure.
This includes failures for all ping types, though the counts appear in the next successfully sent `metrics` ping.

Comment thread glean-core/src/lib.rs
Comment thread glean-core/src/metrics/labeled.rs
Comment thread glean-core/src/upload/result.rs
@brizental
Copy link
Copy Markdown
Contributor Author

This android failure on CI is probably unrelated since it failed on a commit where I only changed description text in the metrics.yaml.

@badboy
Copy link
Copy Markdown
Member

badboy commented Jun 10, 2020

This android failure on CI is probably unrelated since it failed on a commit where I only changed description text in the metrics.yaml.

It's also only on TaskCluster, can be ignored for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants