Improve Logging and System Monitoring #109
Labels
Maintenance
Maintain function or increase stability
Pipeline: Admin
Administration tasks; may touch multiple pipeline areas, but not clearly owned by any of them
Projects
Monitoring Dashboard
Currently we have a dashboard here. But:
(a) it just doesn't look very good.
(b) the access restrictions are painful. Users must have a GCP account which has explicit access to our project. Preferably, the dashboard would just be public.
(c) it contains the info I (Troy) generally want to see, but I'm sure it could benefit from changes/expansion based on input from others (developers and users).
Perhaps we should consider Grafana (Using Google Cloud Monitoring in Grafana).
Logging
Currently the broker code only uses google.cloud.logging. We should instead use standard Python logging with a GCP handler.
VM uptime checks
We should develop and implement a better system for these checks (perhaps using Terraform or GCP Workflows). Currently, VM status is checked twice per day for RUNNING or TERMINATED, as expected for the time of day. This is done by a Cloud Function (check_cue_response) triggered by a cron job via a Pub/Sub message. It logs errors as CRITICAL and triggers the alerts described below.
Alerting
Some alerting for serious errors has been set up manually, mostly in relation to uptime checks for the VMs. We should think through which errors really need to be alerted on and implement those using a better (TBD) system.
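To make the Logging and VM uptime proposals above more concrete, here is a minimal sketch of how a status check might log through the standard Python logging module, with a Cloud Logging handler attached only when it is available. This is not the existing check_cue_response code. The logger name, the schedule window, and the function names (attach_gcp_handler, expected_status, check_vm_status) are all hypothetical, and the real "expected for the time of day" rule would need to replace the placeholder window.

```python
import logging
from datetime import datetime, time

# Hypothetical logger name; the broker may organize its loggers differently.
logger = logging.getLogger("broker.uptime")

# ASSUMPTION: placeholder window during which VMs are expected to be RUNNING.
# The real schedule is not stated in this issue.
RUN_WINDOW_START = time(2, 0)
RUN_WINDOW_END = time(14, 0)


def attach_gcp_handler(log: logging.Logger) -> bool:
    """Attach a Cloud Logging handler to a standard logger, if possible.

    Returns False (and leaves the logger untouched) when the
    google-cloud-logging package or GCP credentials are unavailable,
    so the same code still runs locally.
    """
    try:
        import google.cloud.logging
        from google.cloud.logging.handlers import CloudLoggingHandler

        client = google.cloud.logging.Client()
        log.addHandler(CloudLoggingHandler(client))
        return True
    except Exception:  # broad on purpose for this sketch
        return False


def expected_status(now: datetime) -> str:
    """Return the status a VM should have at the given time of day."""
    in_window = RUN_WINDOW_START <= now.time() < RUN_WINDOW_END
    return "RUNNING" if in_window else "TERMINATED"


def check_vm_status(vm_name: str, actual: str, now: datetime) -> bool:
    """Compare the actual status with the expected one; log CRITICAL on mismatch."""
    expected = expected_status(now)
    if actual != expected:
        logger.critical(
            "VM %s has status %s, expected %s", vm_name, actual, expected
        )
        return False
    logger.info("VM %s has expected status %s", vm_name, actual)
    return True
```

In a Cloud Function, attach_gcp_handler(logger) would be called once at import time. The rest of the code depends only on the standard logging module, so it can be tested without GCP access.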