Currently, if a pod is retired and removed from all M-Lab configurations, it lives on in the mlab-ns datastore. Historically this has not been an issue: with no monitoring data, the mlab-ns status for these non-existent sites was always offline, so mlab-ns would never direct any clients to these pods.
However, acting on a bug report/tip from @sbs, @evfirerob discovered that if the GitHub Maintenance Exporter (GMX) ever had a record for the site/node, our mlab-ns Prometheus queries will report these non-existent sites as online. Here is a case in point to make explaining this easier:
Some months ago we retired the site LOS01. At some point during the retirement process we put the site into GMX maintenance mode. Once the site was fully retired, the GitHub issue that put the pod into GMX maintenance mode was closed. Closing the issue set the GMX timeseries for LOS01 to a value of 0 (i.e., it was taken out of maintenance mode). All is good... except that GMX state never goes away (the golang prometheus client does not allow you to delete timeseries), and our mlab-ns Prom queries were still returning 0 for LOS01 based on nothing more than the GMX maintenance value being 0.
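The failure mode in the LOS01 case can be sketched in a few lines. This is only an illustration, not the actual mlab-ns Prometheus query: the function names, the dict-based stand-ins for timeseries, and the site `xyz01` are all hypothetical. It contrasts a status check that looks only at the GMX maintenance value with one that also requires live monitoring data.

```python
def naive_status(site, gmx_maintenance):
    """Report a site as online whenever its GMX maintenance value is 0.

    This mirrors the bug: a stale GMX series with value 0 is enough
    to make a retired site look online.
    """
    return "online" if gmx_maintenance.get(site) == 0 else "offline"


def guarded_status(site, gmx_maintenance, monitoring_up):
    """Also require a healthy monitoring signal before reporting online.

    A site with no monitoring data at all is treated as offline.
    """
    in_maintenance = gmx_maintenance.get(site, 0) != 0
    has_healthy_monitoring = monitoring_up.get(site) == 1
    if has_healthy_monitoring and not in_maintenance:
        return "online"
    return "offline"


# Hypothetical data: los01 is retired, so only its stale GMX series remains.
# xyz01 is an invented, still-active site with monitoring data.
gmx_maintenance = {"los01": 0, "xyz01": 0}
monitoring_up = {"xyz01": 1}

print(naive_status("los01", gmx_maintenance))
print(guarded_status("los01", gmx_maintenance, monitoring_up))
print(guarded_status("xyz01", gmx_maintenance, monitoring_up))
```

In this sketch the naive check reports the retired site as online, while the guarded check reports it as offline because no monitoring data exists for it.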
The quick resolution to this is to simply make sure that all Site and SliverTool mlab-ns datastore records are removed for retired sites. There is actually a step in the current decommissioning documentation touching on the fact that entries for retired sites remain in mlab-ns. We should update that section to direct the operator to remove those entries.
A possible longer term solution would be to have the check_site cron job make sure the datastore is in sync with reality by deleting any removed sites. Some thoughts:
- This is more work in the existing mlab-ns code base, which we probably don't want.
- Adding removal code to an automated cron job could be seen as dangerous in that it could delete more than we want via some bug.
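One way to limit the risk of over-deletion is to cap how many sites a single run may remove. The following is a minimal sketch under stated assumptions, not existing `check_site` code: `find_stale_sites`, its parameters, and the `max_deletions` threshold are hypothetical, and the inputs are plain lists of site names rather than real datastore entities.

```python
def find_stale_sites(datastore_sites, configured_sites, max_deletions=2):
    """Return sites present in the datastore but absent from configuration.

    Refuses to proceed if more than max_deletions sites would be removed,
    on the assumption that a large diff signals a bug or a bad config
    fetch rather than a real batch of retirements.
    """
    stale = sorted(set(datastore_sites) - set(configured_sites))
    if len(stale) > max_deletions:
        raise RuntimeError(
            "Refusing to delete %d sites (limit is %d): %s"
            % (len(stale), max_deletions, ", ".join(stale))
        )
    return stale


# Hypothetical usage: los01 is retired, xyz01 is an invented active site.
print(find_stale_sites(["los01", "xyz01"], ["xyz01"]))
```

With a cap like this, an empty or truncated configuration list would raise an error instead of wiping the datastore, and an operator could then handle unusually large cleanups by hand.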
nkinkade changed the title from "check_site cron job should remove non-existent sites/slivertools from datastore" to "check_site cron job does not remove non-existent sites/slivertools from datastore" on Apr 24, 2019