Need monitoring/alerting to check whether Knative prow jobs run properly #15
Do we also want a dashboard that shows whether the HEAD run of each job is successful?
Through what channel should we get the alert? Email, or maybe a Slack bot?
Kubernetes already has a dashboard for job status, doesn't it? Slack or GitHub is good since we want community visibility. It would be nice to auto-create an issue and notify OWNERS.
@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.
I am not sure GitHub would be a good channel. The only place I can think of for posting the status is the Issues section. Do we want to use the Issues section in this repo to record all the job failures in other repos?
I suggest using Stackdriver, at least as an initial solution; this way we can have some monitoring up and running ASAP.
It looks like Prow has its own reporting mechanism: https://github.com/kubernetes/test-infra/tree/master/prow/report
Yutong is working on Prow's reporting feature (https://github.com/kubernetes/test-infra/tree/master/prow/report). I am trying to see whether, and to what extent, we can use that feature.
Looks like this package has a template and functions for posting ProwJob results to GitHub: https://github.com/kubernetes/test-infra/blob/master/prow/report/report.go
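For context, a minimal sketch of how a job opts into that reporting in the Prow job config; the job name and image are hypothetical, and the `report` package posts a GitHub status context for each run unless `skip_report` is set:

```yaml
presubmits:
  knative/serving:
  - name: pull-knative-serving-unit-tests  # hypothetical job name
    agent: kubernetes
    skip_report: false  # let prow/report post a status context on the PR
    spec:
      containers:
      - image: gcr.io/knative-tests/prow-tests  # hypothetical test image
```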
What exactly are you referring to? I think we have mechanisms to achieve everything listed on this issue except for reporting job failures to Slack.
@cjwagner I think you are talking about the status context that shows up at the end of each PR page. Isn't that limited to presubmit checks? There are also postsubmit and periodic jobs we want to monitor, I believe.
That is just one of the mechanisms we have. We have configurable email alerting available through Testgrid, and we can display the status of the last run of a job with SVG badges. What exactly are you trying to report on?
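For reference, Prow's Deck component serves those badges from a `/badge.svg` endpoint; assuming the Knative Prow instance exposes the same endpoint, a job's last-run status could be embedded like this (the URL and job name are illustrative):

```
![ci-knative-serving-continuous](https://prow.knative.dev/badge.svg?jobs=ci-knative-serving-continuous)
```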
@cjwagner Who is working on those reporting features you mentioned? I'd like to get more detail on what has been implemented and what is planned.
Those are Testgrid features, so @michelle192837 is the one who implemented them. Please refer to the documentation first though; it describes the features and how to use them: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts
Testgrid e-mail alerting is already enabled by #261. We want lower-level job monitoring so we can act faster when something goes wrong.
According to that PR body, you configured Testgrid to report only after 3 consecutive failures. If you reported on the first failure, that would have the effect you want, right? Or are you saying that Testgrid's update period itself is too slow for some use case that you have?
That's correct. Example: suppose we push a bad Prow config and the cron or pull jobs don't run. Currently we have no way of knowing that unless someone stumbles upon it and reports it (e.g., pull test jobs never finish for your PR).
Why would it never finish? A timeout on your ProwJob should handle that, right? And if you change some presubmit jobs in your config, you probably always want to trigger them manually on a PR first, right? For example, we have a little playground in k/k, like kubernetes/kubernetes#46662.
Maybe that was just a bad example. But the idea, as Cole put it clearly, is to have some sort of monitoring in place so we become aware of issues with our Prow jobs (presubmits, postsubmits, crons, etc.) much faster than the time it takes for Testgrid to update, plus the time it takes for us to check it and realize that something is wrong. That's the motivation. I'll leave the details to Stephen, who's working on this issue.
@adrcunha I read https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts. It seems to me that we can use TestGrid with num_failures_to_alert set to 1. With that change, I don't see any use case where TestGrid would be too slow: we don't need to rely on periodic jobs, and we can monitor postsubmit jobs to get the failure report as soon as it happens. Please correct me if there's a use case that TestGrid's alerting mechanism cannot cover. @michelle192837 Please provide your opinion as well.
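For illustration, a minimal sketch of what that would look like in the Testgrid config, assuming the field names from the documentation linked above; the test group, dashboard, and email address are hypothetical:

```yaml
test_groups:
- name: ci-knative-serving-continuous      # hypothetical test group
  gcs_prefix: knative-prow/logs/ci-knative-serving-continuous
  num_failures_to_alert: 1                 # alert on the first failure
  alert_stale_results_hours: 24            # also alert if results stop arriving

dashboards:
- name: knative-serving
  dashboard_tab:
  - name: continuous
    test_group_name: ci-knative-serving-continuous
    alert_options:
      alert_mail_to_addresses: "oncall@example.com"  # hypothetical address
```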
Adriano is correct that you'll have to deal with the TestGrid update delay either way (though it's a lot less than 2h in the worst case for external instances; more like a 30-minute delay with bad luck). That said, it does seem like broken jobs should time out and report at some point, producing a failed result, so that the only delay you have to deal with is the update delay. So I guess if the problem is "I want to know when my Prow jobs are failing", you can get that (subject to TestGrid's update cycles) with TestGrid alerting. If it's a potential misconfiguration issue with Prow, that seems like it should be handled by Prow presubmit tests. And if Prow jobs are staying up forever, that should be fixed with a timeout. ETA: That said, let me know if I'm missing something here. num_failures_to_alert = 1 on a dashboard might be a good first step either way.
Timeouts are already in place.
Mm, fair enough.
@adrcunha I discussed with @cjwagner; bad configs, such as an invalid container address or wrong secrets, result in a pending state, and there is no way to tell whether there is a real issue while a job is pending (see kubernetes/test-infra#9694). We cannot change that without changing the design and code of Kubernetes itself. To avoid long wait times we can shorten the timeout for the pending state to 20 minutes or so; we can tune that number by looking at the historical duration of pending states.
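A hedged sketch of where such timeouts would live in Prow's config.yaml, assuming Plank's `pod_pending_timeout` option (the 20m value is the proposal above, and the config is abbreviated; a real Plank section has more fields):

```yaml
plank:
  pod_pending_timeout: 20m   # fail jobs stuck in the pending state
  default_decoration_configs:
    "*":
      timeout: 2h            # hard cap on a decorated job's total runtime
```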
@adrcunha 3 doesn't apply.
Should be handled by the knative-monitoring proposal.
/area test-and-release
/kind dev
Expected Behavior
When a Prow job fails to run properly, we should get an alert/notification automatically.
Actual Behavior
Currently we find out about failing job runs manually.