Monitoring sentinel when "Transitions are disabled" is difficult #8162

mrjones-plip · 2023-03-31T16:11:08Z

Describe the bug
On a production instance we're seeing /api/v2/monitoring return an ever-growing value for the sentinel backlog. The sentinel container logs, however, suggest that the backlog is up to date.

To Reproduce
Steps to reproduce the behavior:

1. Deploy an instance with CHT 4.1.0
2. Have activities that require sentinel transitions
3. Ensure transitions are actually getting processed but /api/v2/monitoring shows the backlog going up

Expected behavior
/api/v2/monitoring and the actual backlog are in sync
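
For anyone wanting to flag this symptom automatically, here is a minimal sketch that checks whether the backlog keeps climbing across several monitoring samples. It assumes each sample is the parsed JSON of a /api/v2/monitoring response and that the backlog lives at `sentinel.backlog`; that field path and the helper name `isBacklogGrowing` are assumptions for illustration, not something confirmed in this thread.

```javascript
// Hypothetical helper: true when every sample's backlog is larger than the previous one.
// ASSUMPTION: the backlog is exposed as `sentinel.backlog` in each parsed response.
const isBacklogGrowing = (samples) => {
  const values = samples.map(sample => sample?.sentinel?.backlog);
  // Need at least two numeric readings to talk about a trend.
  if (values.length < 2 || values.some(value => typeof value !== 'number')) {
    return false;
  }
  return values.every((value, i) => i === 0 || value > values[i - 1]);
};

// Example: three readings taken some minutes apart.
const readings = [
  { sentinel: { backlog: 10 } },
  { sentinel: { backlog: 25 } },
  { sentinel: { backlog: 40 } },
];
console.log(isBacklogGrowing(readings));
```

A strictly increasing series is a deliberately crude signal; a busy but healthy instance can also show short bursts of growth, so the sampling window matters.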

Logs

2023-03-31 14:48:11 DEBUG: Initiating all tasks
2023-03-31 14:48:11 INFO: Task dueTasks started
2023-03-31 14:48:11 INFO: Task reminders started
2023-03-31 14:48:11 INFO: Task replications started
2023-03-31 14:48:11 INFO: Task outbound started
2023-03-31 14:48:11 INFO: Task purging started
2023-03-31 14:48:11 INFO: Task backgroundCleanup started
2023-03-31 14:48:11 INFO: Task reminders completed
2023-03-31 14:48:11 INFO: Task replications completed
2023-03-31 14:48:11 INFO: Task outbound completed
2023-03-31 14:48:11 INFO: Task purging completed
2023-03-31 14:48:11 INFO: Task dueTasks completed
2023-03-31 14:48:11 INFO: Background cleanup batch: 95881 -> 95881 (0)
2023-03-31 14:48:11 INFO: Task backgroundCleanup completed

Environment

Instance: MoH Mali CHW
Browser: NA
Client platform: NA
App: sentinel
Version: 4.1.0

Additional context
App Monitoring showed (private GH issue) that this happened EXACTLY after we upgraded from 4.0.1 to 4.1.0 on Mar 9th:

Possibly related to the last time this bug happened in #7113 ?


kennsippell · 2023-04-04T05:39:53Z

2023-04-03 22:30:21 ERROR: Failed loading transition "muting" 
2023-04-03 22:30:21 ERROR: Error: Configuration error. Config must define have a 'muting.mute_forms' array defined.
    at Object.init (/home/kenn/cht-core/shared-libs/transitions/src/transitions/muting.js:181:13)
    at Object.loadTransition [as _loadTransition] (/home/kenn/cht-core/shared-libs/transitions/src/transitions/index.js:184:16)
    at /home/kenn/cht-core/shared-libs/transitions/src/transitions/index.js:149:12
    at Array.forEach (<anonymous>)
    at Object.loadTransitions (/home/kenn/cht-core/shared-libs/transitions/src/transitions/index.js:139:25)
    at Object.loadTransitions (/home/kenn/cht-core/sentinel/src/transitions.js:10:20)
    at /home/kenn/cht-core/sentinel/src/config.js:64:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
2023-04-03 22:30:21 WARN: Disabled transition "mark_for_outbound" 
2023-04-03 22:30:21 INFO: Loading transition "create_user_for_contacts" 
2023-04-03 22:30:21 ERROR: Transitions are disabled until the above configuration errors are fixed

This can be fixed in app configuration by disabling the muting transition. There are no muting forms for this project.

ERROR: Transitions are disabled until the above configuration errors are fixed

Is sentinel doing nothing in this state? What is the purpose of having sentinel run but do nothing like this? The log output makes things look pretty healthy. Would a better pattern be to fail fast? Or could we at least re-print this error each 5 minutes so this isn't a needle in a haystack?
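
To make the "re-print this error" idea concrete, here is a minimal sketch of a component that remembers the errors seen while loading transitions and logs them again on a timer. Every name in it (`createErrorResurfacer`, `record`, `resurface`) is hypothetical and it is not the cht-core implementation; it only assumes a logger object with an `error` method.

```javascript
// Hypothetical sketch: keep the transition loading errors and re-log them periodically.
// None of these names come from cht-core.
const createErrorResurfacer = (logger, intervalMs = 5 * 60 * 1000) => {
  const errors = [];
  let timer = null;

  // Logs every stored error again, followed by the summary line. Returns the error count.
  const resurface = () => {
    if (!errors.length) {
      return 0;
    }
    errors.forEach(message => logger.error(message));
    logger.error('Transitions are disabled until the above configuration errors are fixed');
    return errors.length;
  };

  return {
    record: (message) => { errors.push(message); },
    clear: () => { errors.length = 0; },
    resurface,
    start: () => {
      if (!timer) {
        timer = setInterval(resurface, intervalMs);
        timer.unref(); // do not keep the process alive only for this timer
      }
    },
    stop: () => {
      clearInterval(timer);
      timer = null;
    },
  };
};

// Usage: record errors at config load time, then start the timer.
const resurfacer = createErrorResurfacer(console);
resurfacer.record('Failed loading transition "muting"');
resurfacer.start();
resurfacer.stop();
```

Clearing the stored errors whenever the configuration reloads successfully would keep stale messages from being repeated.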

dianabarsan · 2023-04-04T08:11:18Z

Is sentinel doing nothing in this state?

This is correct: Sentinel does no transition processing in this state. The backlog value is accurate.
The scheduler log that @mrjones-plip linked is related to a completely separate part of Sentinel, which is working correctly even with invalid transition config.

dianabarsan · 2023-04-04T08:12:00Z

Or could we at least re-print this error each 5 minutes so this isn't a needle in a haystack?

I think this is a great suggestion. Having a way to quickly check whether sentinel is indeed processing transitions over documents is very important.

My preferred solution for this is monitoring service logs, which is on the Infrastructure Focus Group OKRs as a goal for the near future.

kennsippell · 2023-04-04T15:00:04Z

Another option: a boolean flag in the monitoring API that reports whether sentinel is processing docs?

monitoring service logs, which is on the Infrastructure Focus Group OKRs as a goal for the near future

Are there more details on what this means? Even a high-level vision would be helpful.

dianabarsan · 2023-04-04T15:11:31Z

boolean flag for is sentinel processing docs in monitoring api?

That works too. Adding a health check to Sentinel has been something we've discussed several times in the past. Right now, Sentinel doesn't communicate its state except for the two "seq" docs, which are already used by the monitoring API.
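
As a rough illustration of the boolean flag idea, here is a sketch that derives a small status object from the list of errors hit while loading transitions. The function name and the `transitions_enabled` and `transition_errors` fields are hypothetical; nothing like them is confirmed to exist in the monitoring API, and how Sentinel would actually hand this state to API is left open.

```javascript
// Hypothetical sketch: summarise sentinel health from the backlog and loader errors.
// ASSUMPTION: `loadErrors` is an array of Error objects or strings collected at load time.
const buildSentinelStatus = (backlog, loadErrors = []) => ({
  backlog,
  // Transitions only run when the configuration loaded without errors.
  transitions_enabled: loadErrors.length === 0,
  transition_errors: loadErrors.map(err => (err && err.message) || String(err)),
});

// Usage: a healthy instance versus one with a broken muting config.
console.log(buildSentinelStatus(0, []));
console.log(buildSentinelStatus(95881, [new Error('Configuration error.')]));
```

With a flag like this, a growing backlog and `transitions_enabled: false` together would point straight at a configuration problem instead of a slow Sentinel.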

Are there more details on what this mean?

So far, nothing is settled. High level vision would be that all apps logs (api, sentinel, haproxy, nginx, couchdb, etc) are scanned for errors and other significant entries, with filtered entries being funneled into a digestible dashboard of sorts.

mrjones-plip · 2023-04-04T23:23:22Z

Thanks for the clean up of title and labels of this ticket @kennsippell - much appreciated!

kennsippell · 2023-04-06T19:15:27Z

Ready for AT on 8162-logs-errors-when-trans-disabled

To test this - update an app_settings.json with an unknown or crashing transition. Look at sentinel logs and verify there is a descriptive error. Wait 5 minutes and verify the descriptive error is reprinted.

mrjones-plip · 2023-04-06T21:48:35Z

@kennsippell - to be a bit more explicit for QA on how to test, "unknown or crashing transition" could be achieved with smang in app_settings.json , yeah?

  "transitions": {
    "smang": true
  }
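
To show why a made-up name like `smang` counts as an "unknown transition", here is a minimal sketch of the kind of check involved: compare the names enabled in `transitions` against a list of known transitions. The helper `findUnknownTransitions` and the treatment of any truthy value as "enabled" are simplifying assumptions, not the actual cht-core loader logic.

```javascript
// Hypothetical helper: list enabled transition names that are not in the known set.
// ASSUMPTION: any truthy value under `transitions` means the transition is enabled.
const findUnknownTransitions = (configured, available) =>
  Object.keys(configured || {})
    .filter(name => configured[name] && !available.includes(name));

// Usage with the snippet above: "smang" is not a real transition.
const settings = { transitions: { smang: true, muting: true } };
const known = ['muting', 'mark_for_outbound', 'create_user_for_contacts'];
console.log(findUnknownTransitions(settings.transitions, known)); // [ 'smang' ]
```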

ngaruko · 2023-05-04T07:51:20Z

How hard would it be to add a small e2e test to the transitions tests, @kennsippell? It would go a long way towards our new quality assistance regime.

ngaruko · 2023-05-06T01:22:35Z

Testing details

Config: Default
Environment: Local
Platform: WebApp
Browser: Chrome

Test scenario

Add 'unknown transition' in app_settings.tasks.rules as in this comment and upload settings
Check error in sentinel logs

Fixed on `8162-logs-errors-when-trans-disabled` >> Log in feedback doc

Sentinel logs

Test passed successfully. 🎉

…disabled (#8167) When sentinel transitions are disabled the error gets buried deep in the logs. This resurfaces the error through a new scheduled task. #8162

kennsippell · 2023-05-11T21:10:28Z

Merged. Spoke with @ngaruko about the e2e test suggestion. He said he'd look at existing tests and file a followup ticket as needed.

mrjones-plip added Type: Bug Fix something that isn't working as intended Affects: 4.1.0 labels Mar 31, 2023

kennsippell added the Regression Affects a feature that worked in a previous release label Mar 31, 2023

kennsippell mentioned this issue Mar 31, 2023

Show 0 for sentinel backlog for version 4.1 medic/cht-app-monitoring-data-ingest#72

Closed

kennsippell self-assigned this Apr 4, 2023

kennsippell changed the title ~~Monitoring sentinel backlog values are wrong~~ Monitoring sentinel when "Transitions are disabled" is difficult Apr 4, 2023

kennsippell added Type: Improvement Make something better and removed Type: Bug Fix something that isn't working as intended Regression Affects a feature that worked in a previous release Affects: 4.1.0 labels Apr 4, 2023

kennsippell mentioned this issue Apr 5, 2023

feat(#8162): Resurface sentinel errors which cause transitions to be disabled #8167

Merged

5 tasks

kennsippell added the retro-action-item label Apr 6, 2023

dianabarsan added this to the 4.3.0 milestone Apr 25, 2023

ngaruko self-assigned this May 2, 2023

kennsippell closed this as completed May 11, 2023

ngaruko mentioned this issue May 11, 2023

Add e2e test for sentinel transition error log #8235

Open
