
Need monitoring/alerting to check whether Knative prow jobs run properly #15

Closed
jessiezcc opened this issue Jul 18, 2018 · 29 comments

@jessiezcc
Contributor

/area test-and-release
/kind dev

Expected Behavior
When a Prow job fails to run properly, we should get an alert/notification automatically.

Actual Behavior
We currently find out about failing job runs manually.

@steuhs
Contributor

steuhs commented Jul 18, 2018

Do we also want a dashboard that shows whether the HEAD run of each job is successful?

@steuhs
Contributor

steuhs commented Jul 18, 2018

Through what channel should we get the alert? Email, or maybe a Slack bot?

@jessiezcc
Contributor Author

Kubernetes already has a dashboard for job status, doesn't it? Slack or GitHub is good since we want community visibility. It would be nice to auto-create an issue and notify the OWNERS.

@steuhs
Contributor

steuhs commented Jul 19, 2018

@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

@steuhs
Contributor

steuhs commented Jul 19, 2018

I am not sure GitHub would be a good channel. The only place I can think of to post the status is the Issues section. Do we want to use the Issues section in this repo to record all the job failures in other repos?

@adrcunha
Contributor

adrcunha commented Sep 5, 2018

I suggest using Stackdriver, at least as an initial solution; this way we can have some monitoring up and running ASAP.

@steuhs
Contributor

steuhs commented Sep 22, 2018

It looks like Prow has its own way of reporting: https://github.com/kubernetes/test-infra/tree/master/prow/report
I am trying to see if we can build on top of what they have.

@steuhs
Contributor

steuhs commented Oct 1, 2018

Yutong is working on Prow's reporting feature (https://github.com/kubernetes/test-infra/tree/master/prow/report). I am trying to see if, and to what extent, we can use that feature.

@steuhs
Contributor

steuhs commented Oct 1, 2018

Looks like this package has a template and functions to post ProwJob issues on GitHub: https://github.com/kubernetes/test-infra/blob/master/prow/report/report.go

@cjwagner
Contributor

cjwagner commented Oct 1, 2018

> @jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

What exactly are you referring to? I think we have mechanisms to achieve everything listed on this issue except for reporting job failures to Slack.

@steuhs
Contributor

steuhs commented Oct 1, 2018

@cjwagner I think you are talking about the status contexts that show up at the bottom of each PR page. Aren't those limited to presubmit checks? There are also postsubmit and periodic jobs we want to monitor, I believe.

@cjwagner
Contributor

cjwagner commented Oct 1, 2018

That is just one of the mechanisms we have. We have configurable email alerting available through Testgrid, and we can display the status of the last run of a job with SVG badges.

What exactly are you trying to report on?

@steuhs
Contributor

steuhs commented Dec 13, 2018

@cjwagner Who is working on those reporting features you mentioned? I'd like to get more detail on what has been implemented and what is planned.

@cjwagner
Contributor

Those are Testgrid features, so @michelle192837 is the one who implemented them. Please refer to the documentation first, though; it describes the features and how to use them: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts

@adrcunha
Contributor

Testgrid e-mail alerting is already enabled by #261. We want lower-level job monitoring so we can act faster when something goes wrong.

@cjwagner
Contributor

According to that PR body, you configured Testgrid to report only after 3 consecutive failures. If you reported on the first failure, that would have the effect you want, right? Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

@adrcunha
Contributor

> Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

That's correct. Example: suppose we push a bad Prow config and the cron or pull jobs don't run. Currently we have no way of knowing that unless someone stumbles upon it and reports it (e.g., pull test jobs never finish for your PR).

@krzyzacy

Why would it never finish? A timeout on your ProwJob should work, right?

If you change some presubmit jobs in your config, you probably always want to manually trigger them on a PR, right? For example, we have a little playground in k/k: kubernetes/kubernetes#46662

@adrcunha
Contributor

Maybe that was just a bad example. But the idea, as Cole put it clearly, is to have some sort of monitoring in place so we become aware of issues with our Prow jobs (presubmits, postsubmits, crons, etc.) much faster than the time it takes for Testgrid to update and for us to check it and realize that something is not right.

That's the motivation. I'll leave the details to Stephen, who's working on this issue.

@steuhs
Contributor

steuhs commented Jan 8, 2019

@adrcunha I read https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts. It seems to me that we can use TestGrid with num_failures_to_alert set to 1. With that change, I don't see any use case where TestGrid would be too slow: we don't need to rely on periodic jobs; we can monitor postsubmit jobs to get the failure report as soon as it happens. Please correct me if there is a use case where we cannot use TestGrid's alerting mechanism. @michelle192837, please provide your opinion as well.
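
For concreteness, a minimal sketch of what such a dashboard tab entry could look like, following the field names in the linked email-alerts docs. The dashboard name, tab name, test group, mail address, and values below are placeholders, and the exact field placement should be checked against the Testgrid config schema:

```yaml
dashboards:
- name: knative-serving                 # placeholder dashboard name
  dashboard_tab:
  - name: continuous                    # placeholder tab name
    test_group_name: ci-knative-serving-continuous    # placeholder test group
    alert_options:
      alert_mail_to_addresses: "oncall@example.com"   # placeholder address
    num_failures_to_alert: 1            # alert on the first failed run
    alert_stale_results_hours: 3        # also alert if no new results appear
```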

@adrcunha
Contributor

adrcunha commented Jan 8, 2019

num_failures_to_alert will only report failed tests, not broken jobs (ones that never report test status, for example). Also, it has a delay of up to 2h due to Testgrid updates. This issue is about monitoring the jobs, not test failures.

@michelle192837

michelle192837 commented Jan 8, 2019

Adriano is correct that you'll have to deal with the TestGrid update delay either way (though it's a lot less than 2h in the worst case for external instances; more like a 30-minute delay with bad luck). That said, it does seem like broken jobs should time out and report at some point, producing a failed result, so that the only delay you have to deal with is the update delay.

So I guess if the problem is 'I want to know when my Prow jobs are failing', you can get that (subject to TestGrid's update cycles) with TestGrid alerting. If it's a potential misconfiguration in Prow, that seems like something that should be caught by Prow presubmit tests? And if Prow jobs are staying up forever, it seems like that should be fixed with a timeout.

ETA: That said, let me know if I'm missing something here. num_failures_to_alert = 1 on a dashboard might be a good first step either way.
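
For reference, a rough sketch of how a job-level timeout could be expressed in a Prow job config, assuming pod utility decoration is enabled. The job name, interval, image, command, and durations below are placeholders, not the actual Knative job definitions:

```yaml
periodics:
- name: ci-knative-example-continuous   # placeholder job name
  interval: 1h
  decorate: true
  decoration_config:
    timeout: 2h        # fail the run instead of letting it hang forever
    grace_period: 15m  # time allowed for cleanup after the timeout fires
  spec:
    containers:
    - image: gcr.io/example-project/test-runner:latest   # placeholder image
      command:
      - ./test/e2e-tests.sh
```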

@adrcunha
Contributor

adrcunha commented Jan 9, 2019

Timeouts are already in place, and num_failures_to_alert is already set up (but won't work until we have our own Testgrid backend). On top of that, we want the quickest possible way to identify when Prow jobs are misbehaving; we don't want to wait 30 minutes, or 2h, or for a user to report issues on Slack. Scenarios include bad configs (secrets, ACLs), k8s pod failures, resource exhaustion, etc. Less frequent jobs (like the nightly releases or the playground update) are more concerning, since we tend to realize they're broken too late in the game when we rely only on Testgrid (even if the report is automated).

@michelle192837

Mm, fair enough.

@steuhs
Contributor

steuhs commented Jan 16, 2019

@adrcunha I discussed this with @cjwagner. Bad configs such as an invalid container address or wrong secrets will leave the job in the pending state, and there is no way to tell whether there is any real issue while it is pending (see kubernetes/test-infra#9694). We cannot change that without changing the design and code of Kubernetes itself. To avoid a long wait we can shorten the timeout for the pending state to 20 minutes or so; we can tune the number by looking at the historical duration of the pending state.
Cole also mentioned that Prow rarely fails before a job starts; it only happens a few times a year, for different reasons.
For the particular case of checking whether the Knative Docker images are missing or inaccessible, there is an internal task assigned to @tcnghia.
Considering the three factors mentioned above, I think the most cost-effective approach is to shorten the timeout for the pending state and monitor Prow job status changes.
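
As an illustration only, and assuming Prow's plank section is where the pending timeout lives (as pod_pending_timeout), the proposed 20-minute cap would be a one-line change in the central Prow config; the value is the one suggested above and should be tuned against historical pending durations:

```yaml
plank:
  pod_pending_timeout: 20m   # abort and report ProwJobs whose pods stay in Pending this long
```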

@adrcunha
Contributor

  1. I agree about shortening the timeout.
  2. What's your proposal for monitoring the Prow job status change?
  3. Please clarify and propose how the mentioned task about missing/non-accessible Knative images can be used for monitoring/alerting.
  4. Please rule out Stackdriver as a good monitoring solution before we reduce this to a simple timeout reduction, which still relies heavily on human checks.

@steuhs
Contributor

steuhs commented Jan 16, 2019

@adrcunha
2. & 4. We don't necessarily need to rule out Stackdriver, because the status change can be monitored there. Indeed, monitoring the status change there seems to be a better alternative compared to using Crier; I am doing the investigation. One potential advantage of using Stackdriver would be the alerting system it provides (see the sketch after this list).
3. I mean that Nghia is working on "Cloud Run GKE prober should fail when release images are missing" (b/120081643), so for those kinds of failures we will have a separate solution.
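
To make the Stackdriver option concrete, here is only a rough sketch of an alerting policy over a hypothetical log-based metric that counts failed/errored ProwJob status changes. The metric name, filter, project, notification channel, and the exact gcloud invocation (something like `gcloud alpha monitoring policies create --policy-from-file=...`) are all assumptions to be verified:

```yaml
# Hypothetical alerting policy; "prowjob_failures" is an assumed log-based
# metric derived from Prow's controller/pod logs, not an existing one.
displayName: "Prow job failures"
combiner: OR
conditions:
- displayName: "ProwJobs reporting failure or error"
  conditionThreshold:
    filter: 'metric.type = "logging.googleapis.com/user/prowjob_failures"'
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_SUM
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID   # placeholder channel
```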

@adrcunha
Contributor

3 doesn't apply.
For 4, it looks like I wasn't clear. I indeed meant "do NOT rule out Stackdriver unless it's proven that it doesn't help". I advocated using Stackdriver as an easy monitoring solution from day 1.

jessiezcc added this to the M2 milestone Feb 26, 2019
@srinivashegde86
Contributor

Should be handled by the knative-monitoring proposal

Cynocracy pushed a commit to Cynocracy/test-infra that referenced this issue Jul 22, 2020