check_site cron job does not remove non-existent sites/slivertools from datastore #187

Open
nkinkade opened this issue Apr 23, 2019 · 0 comments

Comments

@nkinkade

Currently, if a pod gets retired and is removed from all M-Lab configurations, it lives on in the mlab-ns datastore. This hasn't historically been an issue because the mlab-ns status for these non-existent sites will always be offline, since they have no monitoring data, and as such mlab-ns would never direct any clients to these pods.

However, acting on a bug report/tip from @sbs, @evfirerob discovered that if the Github Maintenance Exporter (GMX) ever had a record for the site/node, our mlab-ns Prometheus queries will report these non-existent sites as online. Here is a case in point to make this easier to explain:

Some months ago we retired the site LOS01. At some point during the retirement process we put the site into GMX maintenance mode. Once the site was fully retired, the Github issue that put the pod into GMX maintenance mode was closed. Closing the issue set the GMX timeseries for LOS01 to a value of 0 (i.e., it was taken out of maintenance mode). All is good... except that GMX state never goes away (the golang prometheus client does not allow you to delete timeseries), and our mlab-ns Prom queries were still returning 0 for LOS01 based on nothing more than the GMX maintenance value being 0.

The quick resolution to this is to simply make sure that all Site and SliverTool mlab-ns datastore records are removed for retired sites. There is actually a step in the current decommissioning documentation touching on the fact that entries for retired sites remain in mlab-ns. We should update that section to direct the operator to remove those entries.

A possible longer-term solution would be to have the check_site cron job make sure the datastore is in sync with reality by deleting any removed sites. Some thoughts:

  • This is more work in the existing mlab-ns code base, which we probably don't want.
  • Adding removal code to an automated cron job could be seen as dangerous in that it could delete more than we want via some bug.
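
To make the second concern concrete, here is a rough sketch of how such a sync step could guard against deleting too much. It is not the existing check_site code: the function name `find_stale_sites`, the `max_deletions` cap, and the site IDs other than LOS01 are all hypothetical, and the actual datastore delete is deliberately left out.

```python
from typing import Iterable, Set


class UnsafeSyncError(Exception):
    """Raised when a sync looks too risky to apply automatically."""


def find_stale_sites(datastore_sites: Iterable[str],
                     configured_sites: Iterable[str],
                     max_deletions: int = 2) -> Set[str]:
    """Return site IDs present in the datastore but absent from the config.

    Hypothetical helper: it only computes what would be deleted, and refuses
    to return a result when the deletion set looks suspicious.
    """
    in_datastore = set(datastore_sites)
    configured = set(configured_sites)

    # An empty config is far more likely a failed fetch than a real state,
    # and would otherwise mark every site in the datastore as stale.
    if not configured:
        raise UnsafeSyncError("configured site list is empty; refusing to sync")

    stale = in_datastore - configured

    # Sites are normally retired one at a time, so a large deletion set
    # suggests a bug or bad input rather than real retirements.
    if len(stale) > max_deletions:
        raise UnsafeSyncError(
            "%d sites would be deleted, cap is %d" % (len(stale), max_deletions))

    return stale


if __name__ == "__main__":
    # LOS01 is the retired site from this issue; the other IDs are made up.
    print(sorted(find_stale_sites(["LGA01", "LOS01", "NUQ02"],
                                  ["LGA01", "NUQ02"])))
```

With a cap like this, a bug that empties or truncates the configured site list would cause the job to fail loudly instead of wiping the datastore. The same check would presumably apply to SliverTool records as well.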
@nkinkade nkinkade changed the title check_site cron job should remove non-existent sites/slivertools from datastore check_site cron job does not remove non-existent sites/slivertools from datastore Apr 24, 2019
@autolabel autolabel bot removed the review/triage label May 6, 2019