we have no alerts for label_sync failing to run #9121

Closed
spiffxp opened this issue Aug 22, 2018 · 10 comments
Labels
area/label_sync Issues or PRs related to code in /label_sync kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯

@spiffxp
Member

spiffxp commented Aug 22, 2018

/kind cleanup
/area label_sync

ref: #9054

While trying to deploy a new copy of the label_sync cronjob, I noticed that it had hung a few days ago. An invalid label description caused it to hang, and no more copies were run subsequently. How could we alert on this kind of situation?

I'm thinking of a pattern where logs are uploaded to GCS and consumed by testgrid, just like all of our other maintenance jobs, and an alert is set up to fire if we see a failure (or no success within a certain time).
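
A minimal sketch of the alert condition described above (fire on failure, or on no success within a certain time). The `RunResult` record, the `alert_reason` function, and the window value are hypothetical names chosen for illustration; they are not part of testgrid or label_sync.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class RunResult(NamedTuple):
    """One finished run of a job (hypothetical record, not a testgrid type)."""
    finished_at: datetime
    passed: bool


def alert_reason(results: List[RunResult], now: datetime,
                 max_success_age: timedelta) -> Optional[str]:
    """Return why an alert should fire, or None if the job looks healthy."""
    if not results:
        return "no results recorded"
    latest = max(results, key=lambda r: r.finished_at)
    if not latest.passed:
        return "latest run failed"
    # The latest run passed, so at least one success exists.
    newest_success = max(r.finished_at for r in results if r.passed)
    if now - newest_success > max_success_age:
        return "no success within the allowed window"
    return None
```

A hung job produces no new results at all, so only the "no success within the allowed window" branch would catch the situation described in this issue; checking for failures alone would not.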

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. area/label_sync Issues or PRs related to code in /label_sync labels Aug 22, 2018
@stevekuznetsov
Contributor

stevekuznetsov commented Aug 22, 2018

Can we migrate it to a periodic job to give it an entry in an infrastructure TestGrid view along with the commenters, etc.? I'd imagine peribolos and other tools would join it there.

@fejta
Contributor

fejta commented Aug 22, 2018

> periodic job

We need to create a cluster key that targets a cluster with trusted credentials, and write automation to ensure that only sig-testing jobs can use it.

@spiffxp
Member Author

spiffxp commented Oct 5, 2018

/milestone v1.13
We should be able to do this now by following the model used by peribolos.

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Oct 5, 2018
@stevekuznetsov
Contributor

Unfortunately, the TestGrid approach isn't useful for us, and we have the same problem! Maybe we could have a Prometheus alert if we publish per-job failure metrics?
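
As a rough sketch of what publishing per-job failure metrics could look like, the snippet below renders failure counts in the Prometheus text exposition format using only the standard library. The metric name `periodic_job_failures_total` and the `job_name` label are assumptions made for illustration; nothing in this thread says which names an actual exporter would use.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for a label value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_failure_metrics(failures_by_job: Dict[str, int]) -> str:
    """Render per-job failure counts as Prometheus exposition text.

    The metric and label names here are hypothetical.
    """
    lines = [
        "# HELP periodic_job_failures_total Failed runs per periodic job.",
        "# TYPE periodic_job_failures_total counter",
    ]
    for name in sorted(failures_by_job):
        label = escape_label_value(name)
        count = failures_by_job[name]
        lines.append(f'periodic_job_failures_total{{job_name="{label}"}} {count}')
    return "\n".join(lines) + "\n"
```

An alerting rule could then watch how this counter grows per job. Note that a failure counter alone would not catch a job that has stopped running entirely, which is the case this issue started from.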

@spiffxp
Member Author

spiffxp commented Jan 3, 2019

/milestone clear

I'm opting to punt on this if the testgrid approach isn't good enough. Feel free to add it back to v1.14 if you're inclined to take the approach you suggested, @stevekuznetsov.

@k8s-ci-robot k8s-ci-robot removed this from the v1.13 milestone Jan 3, 2019
@stevekuznetsov
Contributor

You guys have the background to run it as a periodic now, so I think that should work ...

@spiffxp
Member Author

spiffxp commented Jan 3, 2019

/milestone v1.14

OK, let's see if we can get to this

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Jan 3, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2019
@stevekuznetsov
Contributor

It's running as a periodic now after @cjwagner changed it so alerting on it in testgrid should not be hard?

@cjwagner
Member

> It's running as a periodic now after @cjwagner changed it so alerting on it in testgrid should not be hard?

Already done:

- name: ci-test-infra-label-sync
  gcs_prefix: kubernetes-jenkins/logs/ci-test-infra-label-sync
  alert_stale_results_hours: 12
  num_failures_to_alert: 6 # Runs every 1h. Alert when it's been failing for 6 hours

We could probably tighten the alert window now.
/shrug
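
To make the current alert window concrete, here is a small worked calculation. It assumes `num_failures_to_alert` counts consecutive failed runs and `alert_stale_results_hours` is the age at which missing results trigger an alert; the `detection_delays` helper and the "tighter" values are hypothetical, not a proposed change.

```python
from typing import Dict


def detection_delays(run_interval_hours: float,
                     num_failures_to_alert: int,
                     alert_stale_results_hours: float) -> Dict[str, float]:
    """Approximate hours before each kind of alert would fire."""
    return {
        # N consecutive failures at one run per interval.
        "failing_hours": num_failures_to_alert * run_interval_hours,
        # A job that stops reporting is noticed once results are this old.
        "stale_hours": alert_stale_results_hours,
    }


# Values from the config above: hourly runs, 6 failures, 12h staleness.
current = detection_delays(1, 6, 12)

# Hypothetical tighter window, for comparison only.
tighter = detection_delays(1, 2, 3)
```

Under these assumptions the config as written notices a failing job after about 6 hours and a hung job after about 12, so the staleness setting is the one that bounds detection of the hang reported in this issue.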

@k8s-ci-robot k8s-ci-robot added the ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯ label Apr 15, 2019