
Need monitoring/alerting to check whether Knative prow jobs run properly #15

Closed
jessiezcc opened this issue Jul 18, 2018 · 29 comments

@jessiezcc
Contributor

/area test-and-release
/kind dev

Expected Behavior
When a Prow job fails to run properly, we should get an alert/notification automatically.

Actual Behavior
We currently find out about failing job runs manually.

@steuhs
Contributor

steuhs commented Jul 18, 2018

Do we also want a dashboard that shows whether the HEAD run of each job is successful?

@steuhs
Contributor

steuhs commented Jul 18, 2018

Through what channel should we get the alert? Email, or maybe a Slack bot?

@jessiezcc
Contributor Author

Kubernetes already has a dashboard for job status, doesn't it? Slack or GitHub is good since we want community visibility. It would be nice to auto-create an issue and notify the OWNERS.

@steuhs
Contributor

steuhs commented Jul 19, 2018

@jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

@steuhs
Contributor

steuhs commented Jul 19, 2018

I am not sure GitHub would be a good channel. The only place I can think of to post the status is the Issues section. Do we want to use the Issues section in this repo to record all the job failures in other repos?

@adrcunha
Contributor

adrcunha commented Sep 5, 2018

I suggest using Stackdriver, at least as an initial solution; this way we can have some monitoring up and running ASAP.

@steuhs
Contributor

steuhs commented Sep 22, 2018

It looks like Prow has its own way of reporting: https://github.com/kubernetes/test-infra/tree/master/prow/report
I am trying to see if we can build on top of what they have.

@steuhs
Contributor

steuhs commented Oct 1, 2018

Yutong is working on Prow's reporting feature (https://github.com/kubernetes/test-infra/tree/master/prow/report). I am trying to see if, and to what extent, we can use that feature.

@steuhs
Contributor

steuhs commented Oct 1, 2018

Looks like this package has a template and functions to post ProwJob issues on GitHub: https://github.com/kubernetes/test-infra/blob/master/prow/report/report.go

@cjwagner
Contributor

cjwagner commented Oct 1, 2018

> @jessiezcc I chatted with Sen; there is no dashboard that does what I asked above.

What exactly are you referring to? I think we have mechanisms to achieve everything listed on this issue except for reporting job failures to Slack.

@steuhs
Contributor

steuhs commented Oct 1, 2018

@cjwagner I think you are talking about the status contexts that show up at the bottom of each PR page. Aren't those limited to presubmit checks? There are also postsubmit and periodic jobs we want to monitor, I believe.

@cjwagner
Contributor

cjwagner commented Oct 1, 2018

That is just one of the mechanisms we have. We have configurable email alerting available through Testgrid, and we can display the status of the last run of a job with SVG badges.

What exactly are you trying to report on?

@steuhs
Contributor

steuhs commented Dec 13, 2018

@cjwagner Who is working on those reporting features you mentioned? I'd like to get more detail on what has been implemented and what is planned.

@cjwagner
Contributor

Those are Testgrid features, so @michelle192837 is the one who implemented them. Please refer to the documentation first, though; it describes the features and how to use them: https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts

@adrcunha
Contributor

Testgrid e-mail alerting is already enabled by #261. We want lower-level job monitoring so we can act faster when something goes wrong.

@cjwagner
Contributor

According to that PR body, you configured Testgrid to report only after 3 consecutive failures. If you reported on the first failure, that would have the effect you want, right? Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

@adrcunha
Contributor

> Or are you saying that Testgrid's update period itself is too slow for some use case that you have?

That's correct. Example: suppose we push a bad Prow config and the cron or pull jobs don't run. Currently we have no way of knowing that unless someone stumbles upon it and reports it (e.g., pull test jobs never finish for your PR).

@krzyzacy

Why would it never finish? A timeout on your ProwJob should work, right?

If you change some presubmit jobs in your config, you probably always want to manually trigger them on a PR, right? For example, we have a little playground in k/k: kubernetes/kubernetes#46662

@adrcunha
Contributor

Maybe that was just a bad example. But the idea, as Cole put it clearly, is to have some sort of monitoring in place so we become aware of issues with our Prow jobs (presubmits, postsubmits, crons, etc.) much faster than the time it takes for Testgrid to update and for us to check it and realize that something is not right.

That's the motivation. I'll leave the details to Stephen, who's working on this issue.

@steuhs
Contributor

steuhs commented Jan 8, 2019

@adrcunha I read https://github.com/kubernetes/test-infra/tree/master/testgrid#email-alerts. It seems to me that we can use TestGrid with num_failures_to_alert set to 1. With that change, I don't see any use case where TestGrid would be too slow: we don't need to rely on periodic jobs; we can monitor postsubmit jobs to get the failure report as soon as it happens. Please correct me if there is a use case where we cannot use TestGrid's alerting mechanism. @michelle192837, please provide your opinion as well.
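
For concreteness, a minimal sketch of what such a dashboard tab entry could look like, following the field names in the linked email-alerts docs. The dashboard name, tab name, test group, mail address, and values below are placeholders, and the exact field placement should be checked against the Testgrid config schema:

```yaml
dashboards:
- name: knative-serving                 # placeholder dashboard name
  dashboard_tab:
  - name: continuous                    # placeholder tab name
    test_group_name: ci-knative-serving-continuous    # placeholder test group
    alert_options:
      alert_mail_to_addresses: "oncall@example.com"   # placeholder address
    num_failures_to_alert: 1            # alert on the first failed run
    alert_stale_results_hours: 3        # also alert if no new results appear
```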

@adrcunha
Contributor

adrcunha commented Jan 8, 2019

num_failures_to_alert will only report failed tests, not broken jobs (ones that never report test status, for example). Also, it has a delay of up to 2h due to Testgrid updates. This issue is about monitoring the jobs, not test failures.

@michelle192837

michelle192837 commented Jan 8, 2019

Adriano is correct that you'll have to deal with the TestGrid update delay either way (though it's a lot less than 2h in the worst case for external instances; more like a 30-minute delay with bad luck). That said, it does seem like broken jobs should time out and report at some point, producing a failed result, so that the only delay you have to deal with is the update delay.

So I guess if the problem is 'I want to know when my Prow jobs are failing', you can get that (subject to TestGrid's update cycles) with TestGrid alerting. If it's a potential misconfiguration in Prow, that seems like something that should be caught by Prow presubmit tests? And if Prow jobs are staying up forever, it seems like that should be fixed with a timeout.

ETA: That said, let me know if I'm missing something here. num_failures_to_alert = 1 on a dashboard might be a good first step either way.
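
For reference, a rough sketch of how a job-level timeout could be expressed in a Prow job config, assuming pod utility decoration is enabled. The job name, interval, image, command, and durations below are placeholders, not the actual Knative job definitions:

```yaml
periodics:
- name: ci-knative-example-continuous   # placeholder job name
  interval: 1h
  decorate: true
  decoration_config:
    timeout: 2h        # fail the run instead of letting it hang forever
    grace_period: 15m  # time allowed for cleanup after the timeout fires
  spec:
    containers:
    - image: gcr.io/example-project/test-runner:latest   # placeholder image
      command:
      - ./test/e2e-tests.sh
```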

@adrcunha
Contributor

adrcunha commented Jan 9, 2019

Timeouts are already in place, and num_failures_to_alert is already set up (but won't work until we have our own Testgrid backend). On top of that, we want the quickest possible way to identify when Prow jobs are misbehaving; we don't want to wait 30 minutes, or 2h, or for a user to report issues on Slack. Scenarios include bad configs (secrets, ACLs), k8s pod failures, resource exhaustion, etc. Less frequent jobs (like the nightly releases or the playground update) are more concerning, since we tend to realize they're broken too late in the game when we rely only on Testgrid (even if the report is automated).

@michelle192837

Mm, fair enough.

@steuhs
Contributor

steuhs commented Jan 16, 2019

@adrcunha I discussed this with @cjwagner. Bad configs such as an invalid container address or wrong secrets will leave the job in the pending state, and there is no way to tell whether there is any real issue while it is pending (see kubernetes/test-infra#9694). We cannot change that without changing the design and code of Kubernetes itself. To avoid a long wait we can shorten the timeout for the pending state to 20 minutes or so; we can tune the number by looking at the historical duration of the pending state.
Cole also mentioned that Prow rarely fails before a job starts; it only happens a few times a year, for different reasons.
For the particular case of checking whether the Knative Docker images are missing or inaccessible, there is an internal task assigned to @tcnghia.
Considering the three factors mentioned above, I think the most cost-effective approach is to shorten the timeout for the pending state and monitor Prow job status changes.
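
As an illustration only, and assuming Prow's plank section is where the pending timeout lives (as pod_pending_timeout), the proposed 20-minute cap would be a one-line change in the central Prow config; the value is the one suggested above and should be tuned against historical pending durations:

```yaml
plank:
  pod_pending_timeout: 20m   # abort and report ProwJobs whose pods stay in Pending this long
```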

@adrcunha
Contributor

  1. I agree about shortening the timeout.
  2. What's your proposal for monitoring the Prow job status change?
  3. Please clarify and propose how the mentioned task about missing/non-accessible Knative images can be used for monitoring/alerting.
  4. Please rule out Stackdriver as a good monitoring solution before we reduce this to a simple timeout reduction, which still relies heavily on human checks.

@steuhs
Contributor

steuhs commented Jan 16, 2019

@adrcunha
2. & 4. We don't necessarily need to rule out Stackdriver, because the status change can be monitored there. Indeed, monitoring the status change there seems to be a better alternative compared to using Crier; I am doing the investigation. One potential advantage of using Stackdriver would be the alerting system it provides (see the sketch after this list).
3. I mean that Nghia is working on "Cloud Run GKE prober should fail when release images are missing" (b/120081643), so for those kinds of failures we will have a separate solution.
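
To make the Stackdriver option concrete, here is only a rough sketch of an alerting policy over a hypothetical log-based metric that counts failed/errored ProwJob status changes. The metric name, filter, project, notification channel, and the exact gcloud invocation (something like `gcloud alpha monitoring policies create --policy-from-file=...`) are all assumptions to be verified:

```yaml
# Hypothetical alerting policy; "prowjob_failures" is an assumed log-based
# metric derived from Prow's controller/pod logs, not an existing one.
displayName: "Prow job failures"
combiner: OR
conditions:
- displayName: "ProwJobs reporting failure or error"
  conditionThreshold:
    filter: 'metric.type = "logging.googleapis.com/user/prowjob_failures"'
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_SUM
notificationChannels:
- projects/PROJECT_ID/notificationChannels/CHANNEL_ID   # placeholder channel
```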

@adrcunha
Contributor

3 doesn't apply.
For 4, it looks like I wasn't clear. I indeed meant "do NOT rule out Stackdriver unless it's proven that it doesn't help". I advocated using Stackdriver as an easy monitoring solution from day 1.

jessiezcc added this to the M2 milestone Feb 26, 2019
@srinivashegde86
Contributor

Should be handled by the knative-monitoring proposal

Cynocracy pushed a commit to Cynocracy/test-infra that referenced this issue Jul 22, 2020