Improve Logging and System Monitoring #109

Open
troyraen opened this issue Nov 3, 2021 · 1 comment
Labels
Maintenance (Maintain function or increase stability)
Pipeline: Admin (Administration tasks; may touch multiple pipeline areas, but not clearly owned by any of them)

Comments

Collaborator

troyraen commented Nov 3, 2021

Monitoring Dashboard

We currently have a dashboard here, but:
(a) It doesn't look very good.
(b) The access restrictions are painful. Users must have a GCP account with explicit access to our project. Preferably, the dashboard would simply be public.
(c) It contains the info I (Troy) generally want to see, but I'm sure it could benefit from changes and expansion based on input from others (developers and users).
Perhaps we should consider Grafana (see "Using Google Cloud Monitoring in Grafana").

Logging

Currently the broker code only uses google.cloud.logging. We should instead use standard Python logging with a GCP handler.
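
A minimal sketch of what that could look like, assuming the CloudLoggingHandler class from the google-cloud-logging package. The helper names make_handler and get_logger and the log name "broker" are hypothetical, not existing broker code. The fallback to a plain StreamHandler is an addition so the sketch also runs where the package or GCP credentials are unavailable.

```python
import logging


def make_handler(use_gcp=True, log_name="broker"):
    """Return a Cloud Logging handler if possible, else a StreamHandler.

    `log_name` is a hypothetical log name chosen for illustration.
    """
    if use_gcp:
        try:
            import google.cloud.logging
            from google.cloud.logging.handlers import CloudLoggingHandler

            client = google.cloud.logging.Client()
            return CloudLoggingHandler(client, name=log_name)
        except Exception:
            # Package missing or no credentials: fall back to local output.
            pass
    return logging.StreamHandler()


def get_logger(name, use_gcp=True):
    """Return a standard-library logger with exactly one handler attached."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.addHandler(make_handler(use_gcp))
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = get_logger("broker.example", use_gcp=False)
    log.info("module code only ever talks to the standard logging API")
```

With this shape, broker modules would call only logging.getLogger(...) and the choice of handler stays in one place, which should make local testing easier.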

VM uptime checks

We should develop and implement a better system for these checks (perhaps using Terraform or GCP Workflows). Currently, VM status is checked twice per day to confirm that each VM is RUNNING or TERMINATED, whichever is expected for the time of day. This is done by a Cloud Function (check_cue_response) triggered by a cron job via a Pub/Sub message. It logs errors as CRITICAL, which triggers the alerts described below.
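
The core logic of such a check can be sketched as follows. This is not the actual check_cue_response code: the schedule window, the function names, and the VM name in the usage example are all hypothetical, and the real function obtains the VM status from GCP rather than taking it as an argument.

```python
import logging
from datetime import time

logger = logging.getLogger("vm_uptime_check")

# Hypothetical window (UTC) during which the VMs are expected to be running.
RUN_START = time(2, 0)
RUN_END = time(14, 0)


def expected_status(now):
    """Return the status a VM should have at wall-clock time `now`."""
    return "RUNNING" if RUN_START <= now < RUN_END else "TERMINATED"


def check_vm(vm_name, actual_status, now):
    """Compare actual and expected status; log CRITICAL on a mismatch."""
    expected = expected_status(now)
    ok = actual_status == expected
    if not ok:
        logger.critical(
            "VM %s is %s but expected %s", vm_name, actual_status, expected
        )
    return ok


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    check_vm("example-vm", "TERMINATED", time(3, 0))  # logs a CRITICAL record
```

Keeping the comparison in a pure function like this would let whichever scheduler replaces the current setup reuse it unchanged.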

Alerting

Some alerting for serious errors has been set up manually, mostly in relation to uptime checks for the VMs. We should think through which errors really need to be alerted on and implement those using a better (TBD) system.

@troyraen troyraen added this to Backlog in Broker Pipeline Nov 22, 2021
@troyraen troyraen moved this from Backlog to To do in Broker Pipeline Nov 22, 2021
@troyraen troyraen added Maintenance Maintain function or increase stability Pipeline: Admin Administration tasks; may touch multiple pipeline areas, but not clearly owned by any of them labels Nov 24, 2021
@troyraen troyraen changed the title Improve Logging Improve Logging and System Monitoring Jan 6, 2022
Collaborator Author

troyraen commented Jan 6, 2022

Note to my future self:

Try this (from mtg with Ross?)

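import logging
import logging.config  # submodule must be imported explicitly
from lib import settings

settings.init()
# Load handlers and formatters from the file named in the project settings
logging.config.fileConfig(settings.logging_conf_file)
logger = logging.getLogger(__name__)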
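
For reference, here is a self-contained sketch of what a file consumed by logging.config.fileConfig might contain. The contents are an assumption for illustration, not the project's actual logging_conf_file. The file is written to a temporary path only so the example runs on its own.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical config: one console handler on the root logger.
CONF = """
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=simple
args=(sys.stderr,)

[formatter_simple]
format=%(asctime)s %(name)s %(levelname)s %(message)s
"""

with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
    f.write(CONF)
    conf_path = f.name

# Keep loggers created before this call enabled.
logging.config.fileConfig(conf_path, disable_existing_loggers=False)
os.remove(conf_path)

logger = logging.getLogger(__name__)
logger.info("logging configured from file")
```

Swapping the console handler for a GCP handler would then be a change to the config file rather than to the broker code.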
