Improve Logging and System Monitoring #109

troyraen · 2021-11-03T19:39:15Z

Monitoring Dashboard

Currently we have a dashboard here. But:
(a) it just doesn't look very good.
(b) the access restrictions are painful. Users must have a GCP account which has explicit access to our project. Preferably, the dashboard would just be public.
(c) it contains the info I (Troy) generally want to see, but I'm sure it could benefit from changes/expansion based on input from others (developers and users).
Perhaps we should consider Grafana (Using Google Cloud Monitoring in Grafana).

Logging

Currently the broker code only uses google.cloud.logging. We should instead use standard Python logging with a GCP handler.
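A minimal sketch of what "standard Python logging with a GCP handler" could look like. The helper name get_logger and the fallback to a StreamHandler are assumptions for illustration, not existing broker code. It relies on CloudLoggingHandler from the google-cloud-logging client library and only tries to build one when no handler is passed in.

```python
import logging
from typing import Optional


def get_logger(name: str, handler: Optional[logging.Handler] = None) -> logging.Logger:
    """Return a standard-library logger with a single handler attached.

    Hypothetical helper. If no handler is given, try to build a Cloud Logging
    handler; fall back to a plain StreamHandler when the client library or
    credentials are unavailable (e.g., during local development).
    """
    if handler is None:
        try:
            import google.cloud.logging
            from google.cloud.logging.handlers import CloudLoggingHandler

            client = google.cloud.logging.Client()
            handler = CloudLoggingHandler(client)
        except Exception:  # ImportError, missing credentials, etc.
            handler = logging.StreamHandler()

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        logger.addHandler(handler)
    return logger
```

Broker modules would then call logger = get_logger(__name__) and use logger.info(...), logger.critical(...), etc., so the code itself depends only on the standard logging API and the GCP-specific part stays in one place.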

VM uptime checks

We should develop and implement a better system for these checks (perhaps using Terraform or GCP Workflows). Currently, VM status is checked twice per day: each VM should be either RUNNING or TERMINATED, depending on the time of day. The check is done by a Cloud Function (check_cue_response) triggered by a cron job via a Pub/Sub message. It logs errors as CRITICAL and triggers the alerts described below.
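For discussion, the core of the current check can be sketched roughly as below. This is not the code of check_cue_response. The schedule window, function names, and the way statuses are passed in are all assumptions, and fetching the real statuses from the Compute Engine API is left out.

```python
import logging
from datetime import time
from typing import Dict, List

logger = logging.getLogger("vm_uptime_check")

# Hypothetical schedule: VMs are expected to be RUNNING inside this window
# and TERMINATED outside it. The real hours are not stated in this issue.
RUNNING_START = time(2, 0)
RUNNING_END = time(14, 0)


def expected_status(now: time) -> str:
    """Return the status a VM should have at the given time of day."""
    return "RUNNING" if RUNNING_START <= now < RUNNING_END else "TERMINATED"


def check_vms(statuses: Dict[str, str], now: time) -> List[str]:
    """Log CRITICAL for each VM in an unexpected state and return their names."""
    expected = expected_status(now)
    unexpected = []
    for name, status in statuses.items():
        if status != expected:
            logger.critical("VM %s is %s but expected %s", name, status, expected)
            unexpected.append(name)
    return unexpected
```

Keeping the comparison separate from the status lookup would make the check easy to unit test and to move to whatever system replaces the Cloud Function.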

Alerting

Some alerting for serious errors has been set up manually, mostly in relation to uptime checks for the VMs. We should think through which errors really need to be alerted on and implement those using a better (TBD) system.


troyraen · 2022-01-06T18:17:37Z

Note to my future self:

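Try this (from meeting with Ross?):

import logging.config  # also makes the top-level logging module available
from lib import settings

settings.init()
# Configure logging from the config file named in settings
logging.config.fileConfig(settings.logging_conf_file)
logger = logging.getLogger(__name__)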
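For reference, a self-contained sketch of the fileConfig approach. The contents of the file behind settings.logging_conf_file are not given here, so the config below (a root logger with one console handler) is purely an assumed example, written to a temporary file just so the snippet runs on its own.

```python
import logging
import logging.config
import os
import tempfile

# Assumed example config; the real logging_conf_file may look different.
EXAMPLE_CONF = """
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=simple
args=(sys.stderr,)

[formatter_simple]
format=%(asctime)s %(name)s %(levelname)s %(message)s
"""


def load_example_config() -> logging.Logger:
    """Write the example config to a temp file and load it with fileConfig."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
        f.write(EXAMPLE_CONF)
        path = f.name
    try:
        # Keep loggers created before this call enabled.
        logging.config.fileConfig(path, disable_existing_loggers=False)
    finally:
        os.remove(path)
    return logging.getLogger(__name__)


logger = load_example_config()
logger.info("logging configured from file")
```

A GCP handler could be added as another entry under [handlers], which would keep the choice of handler in configuration rather than in the broker code.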

troyraen mentioned this issue Nov 21, 2021

Organizing Issues #114

Closed

troyraen added this to Backlog in Broker Pipeline Nov 22, 2021

troyraen moved this from Backlog to To do in Broker Pipeline Nov 22, 2021

troyraen added the labels Maintenance (maintain function or increase stability) and Pipeline: Admin (administration tasks; may touch multiple pipeline areas, but not clearly owned by any of them) Nov 24, 2021

troyraen changed the title ~~Improve Logging~~ Improve Logging and System Monitoring Jan 6, 2022

troyraen added a commit that referenced this issue Mar 3, 2022

link to #109 in the docs

f6b26eb

troyraen mentioned this issue Sep 26, 2022

Replace the Metadata Collector with BigQuery subscriptions #172

Open

8 tasks
