CHT Sync monitoring #84

Closed
andrablaj opened this issue Apr 12, 2024 · 4 comments
Assignees
Labels
Priority: 2 - Medium Normal priority

Comments

@andrablaj
Member

Set up and document a monitoring solution for CHT Sync (CHT Watchdog), together with relevant metrics and alerts.

@mrjones-plip
Contributor

mrjones-plip commented May 17, 2024

I think one of the key metrics to have, from a "is my dashboard up to date" perspective, is a "last sequence ID synced from couchdb" for all databases being synced. I suggest we need this as a key feature for launch.

Based on how Watchdog monitors this today, a drop-in replacement would be a /metrics HTTP endpoint that looks like this (I chose "logstash" as the metric name, but it can be whatever good name makes sense):

# HELP logstash_progress_sequence cht-sync backlog.
# TYPE logstash_progress_sequence counter
logstash_progress_sequence{cht_instance="cht.example.com",db="_users",job="db_targets",target="postgres.example.com"} 4
logstash_progress_sequence{cht_instance="cht.example.com",db="medic",job="db_targets",target="postgres.example.com"} 232
logstash_progress_sequence{cht_instance="cht.example.com",db="medic-logs",job="db_targets",target="postgres.example.com"} 21
logstash_progress_sequence{cht_instance="cht.example.com",db="medic-sentinel",job="db_targets",target="postgres.example.com"} 130
logstash_progress_sequence{cht_instance="cht.example.com",db="medic-users-meta",job="db_targets",target="postgres.example.com"} 6
# HELP scrape_duration_seconds How long it took to scrape the target in seconds
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{job="db_targets",target="postgres.example.com"} 0.000498091
# HELP up 1 if the target is reachable, or 0 if the scrape failed
# TYPE up gauge
up{job="db_targets",target="postgres.example.com"} 1

If exposing it as a Prometheus native endpoint is too hard, then simply mirroring the SQL Schema used in couch2pg will be fine. Here's the couchdb_progress schema:

CREATE TABLE
  public.couchdb_progress (
    seq character varying NULL,
    source character varying NOT NULL
  );

And here are 4 example rows. Note that each row tells you which CHT Core instance is being synced, which database it is, the sequence count and the sequence ID. The sequence IDs are truncated for brevity; the real ones are much longer:

"132-g1AAAAOReJyV0s1N-SNIP-tdx7tu4XR6D2hQ"	"cht.example.com/medic"
"20-g1AAAANheJyV0s9Nw-SNIP-cs5OlfYfl5HuKQ"	"cht.example.com/medic-logs"
"3-g1AAAANBeJyV0s9Nwz-SNIP-lX0qHS_QDY5OzD"	"cht.example.com/medic-users-meta"
"72-g1AAAAOReJyd0s1Nw-SNIP-XB7blz_Q_E5fc2"	"cht.example.com/medic-sentinel"
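
As a rough illustration of how those four pieces of information could be pulled out of one `couchdb_progress` row, here is a small sketch. `ProgressRow` and `parse_progress_row` are hypothetical names, and the sketch assumes `seq` has the form `<count>-<opaque id>` and `source` has the form `<instance>/<db>`, as in the example rows above.

```python
from typing import NamedTuple, Optional


class ProgressRow(NamedTuple):
    cht_instance: str          # e.g. "cht.example.com"
    db: str                    # e.g. "medic"
    seq_count: Optional[int]   # numeric prefix of seq, None if seq is NULL
    seq_id: Optional[str]      # full sequence ID as stored


def parse_progress_row(seq, source):
    """Split one (seq, source) row into instance, database and sequence count.

    `seq` may be None because the column is nullable in the schema above.
    """
    # The database name is whatever follows the last "/" in source.
    cht_instance, _, db = source.rpartition("/")
    if seq is None:
        return ProgressRow(cht_instance, db, None, None)
    # The sequence count is the integer before the first "-".
    count, _, _ = seq.partition("-")
    return ProgressRow(cht_instance, db, int(count), seq)


if __name__ == "__main__":
    print(parse_progress_row(
        "132-g1AAAAOReJyV0s1N-SNIP-tdx7tu4XR6D2hQ", "cht.example.com/medic"
    ))
```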

@andrablaj andrablaj added this to the CHT Sync Production milestone May 30, 2024
@andrablaj andrablaj added the Priority: 3 - Low Can be bumped from the release label May 30, 2024
@witash witash self-assigned this Jul 16, 2024
@witash witash added Priority: 2 - Medium Normal priority and removed Priority: 3 - Low Can be bumped from the release labels Jul 16, 2024
@witash
Contributor

witash commented Jul 26, 2024

Having the metrics scraped from SQL is convenient, since most of what we need is already there or is easy to add.
There are a lot of possible configurations for accessing the metrics:
We can keep the exporter in Watchdog, but that requires direct database access, which partners may not want or which may not be possible.
I've added sql_exporter to the helm chart to export metrics from within the cluster, but that then requires ingress, which otherwise may not be needed.

For the metrics themselves:

  1. Sequence from couchdb_progress for each couch2pg instance.
  2. last_sequence - current_sequence. This can only be approximate, but it shows more directly how many changes are not yet synced. Since couch2pg is the one updating it, it isn't accurate if couch2pg is down, so we also need:
  3. couch2pg liveness for each instance. This uses the last update from couchdb_progress; if it is too far out of date, we can assume couch2pg is not running for some reason.
  4. dbt run stats. dbt can save run statistics to the database, which may be useful for seeing whether dbt runs are taking a long time.
  5. dbt root table last update - current time. This is the most direct metric for "is my dashboard up to date", although there can still be some delay between the root table and the dashboards.

Potentially we could do (dashboard last update - current time), but that would have to be dynamic somehow.
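
To make metrics 2, 3 and 5 concrete, here is a minimal sketch of the arithmetic involved. The function names, the 15-minute liveness threshold and the clamping of negative values to zero are all assumptions for illustration; they are not taken from the CHT Sync code. The inputs are assumed to have been read already (sequence counts as integers, timestamps as timezone-aware datetimes).

```python
from datetime import datetime, timedelta, timezone


def pending_changes(last_sequence, current_sequence):
    """Approximate count of changes not yet synced; None if either value is unknown."""
    if last_sequence is None or current_sequence is None:
        return None
    # The two values are read at different times, so clamp at zero (assumption).
    return max(last_sequence - current_sequence, 0)


def is_couch2pg_alive(last_update, now, max_age=timedelta(minutes=15)):
    """Treat couch2pg as running if its progress row was updated recently enough.

    max_age is a hypothetical threshold; pick one that suits the deployment.
    """
    return now - last_update <= max_age


def dbt_latency_seconds(root_table_last_update, now):
    """Seconds elapsed since the dbt root table was last updated."""
    return (now - root_table_last_update).total_seconds()


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(pending_changes(232, 200))
    print(is_couch2pg_alive(now - timedelta(minutes=5), now))
    print(dbt_latency_seconds(now - timedelta(seconds=90), now))
```

Returning `None` when a sequence is unknown keeps "no data" distinct from "nothing pending", so an alert on a large backlog does not fire, or stay silent, for the wrong reason.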

@mrjones-plip
Contributor

Looking good @witash! Agree that remote DB permissions vs. exposing ingress is a tricky choice to make. I defer to the eco team on how best to proceed, but I suspect that leaving it in the DB would be fine, as this is the status quo with couch2pg, and we can always improve it later.

Whichever route we go, be sure we end up with the URL of the CHT instance in the stats! This is critical for multi-tenant CHT Sync deployments, which I think MoH KE wanted.

witash added a commit that referenced this issue Jul 30, 2024
witash added a commit that referenced this issue Jul 30, 2024
witash added a commit that referenced this issue Aug 2, 2024
witash added a commit that referenced this issue Aug 2, 2024
witash added a commit that referenced this issue Aug 2, 2024
witash added a commit that referenced this issue Aug 2, 2024
witash added a commit that referenced this issue Aug 5, 2024
witash added a commit that referenced this issue Aug 5, 2024
* feat(#84): add optional sql exporter and ingress

* feat(#84): adding pending and update time to couch2pg

* feat(#84): adding dbt monitoring queries

* chore(#84): fix lints

* chore(#84): fix tests

* feat(#84): separate request to get pending, and null if unknown

* chore(#84): adding tests

* feat(#84): better query for dbt_latency

* chore(#84): fixing lint

* chore(#84): fixing tests

* chore(#84): adding upgrade script
@andrablaj
Member Author

@witash this issue was moved to Done. If done, can you please close the ticket?

Projects
Status: Done
Development

No branches or pull requests

3 participants