CHT Sync monitoring #84

andrablaj · 2024-04-12T15:00:47Z

Set up and document a monitoring solution for CHT Sync (CHT Watchdog), together with relevant metrics and alerts.

mrjones-plip · 2024-05-17T03:33:31Z

I think one of the key metrics to have, from an "is my dashboard up to date" perspective, is a "last sequence ID synced from couchdb" for all databases being synced. I suggest we need this as a key feature for launch.

Based on how Watchdog monitors this today, a drop-in replacement would be a /metrics HTTP endpoint that looks like this (I chose "logstash" as the metric name, but it can be whatever good name makes sense):

# HELP logstash_progress_sequence cht-sync backlog.
# TYPE logstash_progress_sequence counter
logstash_progress_sequence{cht_instance="cht.example.com",db="_users",job="db_targets",target="postgres.example.com"} 4
logstash_progress_sequence{cht_instance="cht.example.com",db="medic",job="db_targets",target="postgres.example.com"} 232
logstash_progress_sequence{cht_instance="cht.example.com",db="medic-logs",job="db_targets",target="postgres.example.com"} 21
logstash_progress_sequence{cht_instance="cht.example.com",db="medic-sentinel",job="db_targets",target="postgres.example.com"} 130
logstash_progress_sequence{cht_instance="cht.example.com",db="medic-users-meta",job="db_targets",target="postgres.example.com"} 6
# HELP scrape_duration_seconds How long it took to scrape the target in seconds
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{job="db_targets",target="postgres.example.com"} 0.000498091
# HELP up 1 if the target is reachable, or 0 if the scrape failed
# TYPE up gauge
up{job="db_targets",target="postgres.example.com"} 1
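
As a rough illustration of the endpoint body proposed above, here is a minimal Python sketch that renders per-database sequence counts in the Prometheus text exposition format. The function name render_progress_metrics and the shape of its input are assumptions made for this example, not part of CHT Sync or Watchdog; the metric name and labels are copied from the sample output.

```python
def render_progress_metrics(progress, target, job="db_targets"):
    """Render sequence counts in the Prometheus text exposition format.

    `progress` maps (cht_instance, db) -> integer sequence count.
    Hypothetical helper: label values are assumed to contain no quotes,
    backslashes or newlines, so no escaping is done here.
    """
    name = "logstash_progress_sequence"
    lines = [
        f"# HELP {name} cht-sync backlog.",
        f"# TYPE {name} counter",
    ]
    # Sort so the output is stable between scrapes.
    for (instance, db), seq in sorted(progress.items()):
        labels = (
            f'cht_instance="{instance}",db="{db}",'
            f'job="{job}",target="{target}"'
        )
        # Doubled braces produce the literal { } around the label set.
        lines.append(f"{name}{{{labels}}} {seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        ("cht.example.com", "medic"): 232,
        ("cht.example.com", "_users"): 4,
    }
    print(render_progress_metrics(sample, target="postgres.example.com"), end="")
```

Keeping cht_instance as a label on every sample is what lets one endpoint serve several CHT instances at once.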

If exposing it as a Prometheus-native endpoint is too hard, then simply mirroring the SQL schema used in couch2pg will be fine. Here's the couchdb_progress schema:

CREATE TABLE
  public.couchdb_progress (
    seq character varying NULL,
    source character varying NOT NULL
  );

And here are 4 example rows. Note that each row lets you know which CHT Core instance is being maintained, which database it is, the sequence count and the sequence ID. The sequence IDs are truncated for brevity; the real ones are much longer:

"132-g1AAAAOReJyV0s1N-SNIP-tdx7tu4XR6D2hQ"	"cht.example.com/medic"
"20-g1AAAANheJyV0s9Nw-SNIP-cs5OlfYfl5HuKQ"	"cht.example.com/medic-logs"
"3-g1AAAANBeJyV0s9Nwz-SNIP-lX0qHS_QDY5OzD"	"cht.example.com/medic-users-meta"
"72-g1AAAAOReJyd0s1Nw-SNIP-XB7blz_Q_E5fc2"	"cht.example.com/medic-sentinel"

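To make the row layout above concrete, here is a small Python sketch that splits one couchdb_progress row into the instance, the database and the numeric sequence count. The Progress type and parse_progress_row are hypothetical names for this example; the only facts taken from the thread are that seq may be NULL, that it starts with a number followed by a dash, and that source looks like "host/db".

```python
from typing import NamedTuple, Optional


class Progress(NamedTuple):
    cht_instance: str
    db: str
    seq_count: Optional[int]  # None when seq is NULL or has no numeric prefix
    seq: Optional[str]        # the full sequence string as stored


def parse_progress_row(seq, source):
    """Split a (seq, source) row from couchdb_progress into its parts."""
    # "cht.example.com/medic" -> ("cht.example.com", "medic")
    instance, _, db = source.rpartition("/")
    if not seq:
        return Progress(instance, db, None, None)
    # "132-g1AAAA..." -> "132"; anything else yields None for the count.
    count, sep, _ = seq.partition("-")
    seq_count = int(count) if sep and count.isdigit() else None
    return Progress(instance, db, seq_count, seq)


if __name__ == "__main__":
    row = ("132-g1AAAAOReJyV0s1N-SNIP-tdx7tu4XR6D2hQ", "cht.example.com/medic")
    print(parse_progress_row(*row))
```
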
witash · 2024-07-26T16:26:57Z

Having metrics scraped from SQL is convenient, since most of what we need is there already or is easy to add.
There are a lot of possible configurations for accessing the metrics:
We can keep the exporter in Watchdog, but that requires direct database access, which partners may not want or which may not be possible.
I added sql_exporter to the Helm chart to export metrics from within the cluster, but then it requires ingress, which otherwise may not be needed.

For the metrics themselves:

Sequence from couchdb_progress for each couch2pg instance.
last_sequence - current_sequence. This can only be approximate, but it shows more directly how many changes are not yet synced. Since couch2pg is the one updating it, it isn't accurate if couch2pg is down, so also:
couch2pg liveness for each instance. This uses the last update from couchdb_progress; if it is too far out of date, we can assume couch2pg is not running for some reason.
dbt run stats. dbt can save run statistics to the DB, which may be useful to see if dbt runs are taking long.
dbt root table last update - current time. This is the most direct metric for "is my dashboard up to date". There can still be some delay between the root table and the dashboards.

Potentially we could do (dashboard last update - current time), but that would have to be dynamic somehow.
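
The arithmetic behind the metrics listed above can be sketched in a few lines of Python. All function names and the 10-minute liveness threshold are assumptions chosen for illustration; the actual queries live in the commits referenced on this issue and may differ.

```python
from datetime import datetime, timedelta, timezone


def pending_changes(last_sequence, current_sequence):
    """Approximate number of changes not yet synced, or None if unknown."""
    if last_sequence is None or current_sequence is None:
        return None
    # Clamp at zero in case the two values were read at different moments.
    return max(last_sequence - current_sequence, 0)


def is_alive(last_update, now, max_age=timedelta(minutes=10)):
    """Treat couch2pg as running if its progress row was updated recently.

    max_age is a hypothetical threshold, not a value from the thread.
    """
    return (now - last_update) <= max_age


def latency_seconds(root_table_last_update, now):
    """Seconds elapsed since the dbt root table was last updated."""
    return (now - root_table_last_update).total_seconds()


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(pending_changes(232, 200))
    print(is_alive(now - timedelta(minutes=3), now))
    print(latency_seconds(now - timedelta(seconds=90), now))
```

Returning None rather than 0 when a sequence is unknown keeps "nothing pending" distinguishable from "could not tell".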

mrjones-plip · 2024-07-29T13:49:17Z

Looking good @witash! Agreed that remote DB permissions vs. exposing ingress is a tricky choice to make. I defer to the eco team on how best to proceed, but I suspect that leaving it in the DB would be fine, as this is the status quo with couch2pg, and we can always improve it later.

Whichever route we go, be sure we end up with the URL of the CHT instance in the stats! This is critical for multi-tenant CHT Sync deployments, which I think MoH KE wanted.

andrablaj · 2024-08-12T09:45:22Z

@witash this issue was moved to Done. If done, can you please close the ticket?

andrablaj added this to the CHT Sync Production milestone May 30, 2024

andrablaj added the "Priority: 3 - Low" (can be bumped from the release) label May 30, 2024

witash self-assigned this Jul 16, 2024

witash added the "Priority: 2 - Medium" (normal priority) label and removed the "Priority: 3 - Low" (can be bumped from the release) label Jul 16, 2024

witash added a commit that referenced this issue Jul 26, 2024

feat(#84): add optional sql exporter and ingress

246cfd8

witash added a commit that referenced this issue Jul 29, 2024

feat(#84): adding pending and update time to couch2pg

b096574

witash added a commit that referenced this issue Jul 30, 2024

feat(#84): adding dbt monitoring queries

101f2d1

witash added a commit that referenced this issue Jul 30, 2024

chore(#84): fix lints

80974c9

witash added a commit that referenced this issue Jul 30, 2024

chore(#84): fix tests

e676313

witash added a commit that referenced this issue Jul 31, 2024

feat(#84): separate request to get pending, and null if unknown

31c01a4

witash added a commit that referenced this issue Aug 2, 2024

chore(#84): adding tests

e29a803

witash added a commit that referenced this issue Aug 2, 2024

feat(#84): better query for dbt_latency

c33db73

witash added a commit that referenced this issue Aug 2, 2024

chore(#84): fixing lint

256a1b1

witash added a commit that referenced this issue Aug 2, 2024

chore(#84): fixing tests

3753985

witash added a commit that referenced this issue Aug 5, 2024

chore(#84): adding upgrade script

9c7b6bc

witash closed this as completed Aug 12, 2024

witash mentioned this issue Aug 13, 2024

Add a way to know how many documents are yet to be synced #76

Closed
