`check_site` cron job does not remove non-existent sites/slivertools from datastore #187

nkinkade · 2019-04-23T20:00:35Z

Currently, if a pod gets retired and is removed from all M-Lab configurations, it lives on in the mlab-ns datastore. This hasn't historically been an issue because the mlab-ns status for these non-existent sites will always be offline, since they have no monitoring data, and so mlab-ns would never direct any clients to these pods.

However, acting on a bug report/tip from @sbs, @evfirerob discovered that if the GitHub Maintenance Exporter (GMX) ever had a record for the site/node, our mlab-ns Prometheus queries will report these non-existent sites as online. Here is a case in point to make this easier to explain:

Some months ago we retired the site LOS01. During the process of retiring the site, we at some point put it into GMX maintenance mode. Once the site was fully retired, the GitHub issue that put the pod into GMX maintenance mode was closed. Closing the issue set the GMX timeseries for LOS01 to a value of 0 (i.e., it was taken out of maintenance mode). All is good... except that GMX state never goes away (the golang prometheus client does not allow you to delete timeseries), and our mlab-ns Prom queries were still returning 0 for LOS01 based on nothing more than the GMX maintenance value being 0.

The quick resolution to this is to simply make sure that all Site and SliverTool mlab-ns datastore records are removed for retired sites. There is actually a step in the current decommissioning documentation touching on the fact that entries for retired sites remain in mlab-ns. We should update that section to direct the operator to remove those entries.

A possible longer term solution would be to have the check_site cron job make sure the datastore is in sync with reality by deleting any removed sites. Some thoughts:

- This is more work in the existing mlab-ns code base, which we probably don't want.
- Adding removal code to an automated cron job could be seen as dangerous in that it could delete more than we want via some bug.
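To make the trade-off above concrete, here is a minimal sketch of how such a sync step could guard against over-deletion. It is not taken from the mlab-ns code base: `find_stale_sites`, its parameters, and the `max_fraction` threshold are all hypothetical names chosen for illustration, and the actual removal of `Site` and `SliverTool` records is deliberately left out.

```python
def find_stale_sites(datastore_sites, configured_sites, max_fraction=0.25):
    """Return site IDs that exist in the datastore but not in the configuration.

    Hypothetical helper, not part of mlab-ns. It only computes the list of
    candidates; deleting the matching records is left to the caller.
    """
    datastore_sites = set(datastore_sites)
    configured_sites = set(configured_sites)

    # An empty configuration most likely means the fetch failed, not that
    # every site was retired, so refuse to propose any deletions.
    if not configured_sites:
        raise ValueError("configured site list is empty; refusing to delete")

    stale = datastore_sites - configured_sites

    # Safety cap (an assumed policy): bail out if a single run would remove
    # more than max_fraction of all sites currently in the datastore.
    if len(stale) > max_fraction * len(datastore_sites):
        raise ValueError(
            "%d of %d sites look stale; refusing to delete"
            % (len(stale), len(datastore_sites))
        )

    return sorted(stale)


datastore = {"los01", "lga01", "lax01", "nuq02", "ord01"}
configured = {"lga01", "lax01", "nuq02", "ord01"}
stale_sites = find_stale_sites(datastore, configured)
```

With the sample sets above, only the retired site is flagged. The two guards are one possible answer to the "could delete more than we want" concern: a bug or outage upstream that empties or shrinks the configured list causes the job to fail loudly instead of deleting records.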

autolabel bot added the review/triage label Apr 23, 2019

nkinkade changed the title ~~check_site cron job should remove non-existent sites/slivertools from datastore~~ check_site cron job does not remove non-existent sites/slivertools from datastore Apr 24, 2019

pboothe added backlog P2 labels May 6, 2019

autolabel bot removed the review/triage label May 6, 2019
