Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitoring sentinel when "Transitions are disabled" is difficult #8162

Closed
mrjones-plip opened this issue Mar 31, 2023 · 11 comments
Closed

Monitoring sentinel when "Transitions are disabled" is difficult #8162

mrjones-plip opened this issue Mar 31, 2023 · 11 comments
Assignees
Labels
retro-action-item retro-action-item Type: Improvement Make something better
Milestone

Comments

@mrjones-plip
Copy link
Contributor

Describe the bug
On a production instance we're seeing the /api/v2/monitoring return an ever growing value for sentinel backlog. The sentinel container logs are showing that the backlog is up to do date though.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy an instance with CHT 4.1.0
  2. Have activities that require sentinel transitions
  3. Ensure transitions are actually getting processed but /api/v2/monitoring shows backlog going up

Expected behavior
/api/v2/monitoring and actual back log are in sync

Logs

2023-03-31 14:48:11 DEBUG: Initiating all tasks
2023-03-31 14:48:11 INFO: Task dueTasks started
2023-03-31 14:48:11 INFO: Task reminders started
2023-03-31 14:48:11 INFO: Task replications started
2023-03-31 14:48:11 INFO: Task outbound started
2023-03-31 14:48:11 INFO: Task purging started
2023-03-31 14:48:11 INFO: Task backgroundCleanup started
2023-03-31 14:48:11 INFO: Task reminders completed
2023-03-31 14:48:11 INFO: Task replications completed
2023-03-31 14:48:11 INFO: Task outbound completed
2023-03-31 14:48:11 INFO: Task purging completed
2023-03-31 14:48:11 INFO: Task dueTasks completed
2023-03-31 14:48:11 INFO: Background cleanup batch: 95881 -> 95881 (0)
2023-03-31 14:48:11 INFO: Task backgroundCleanup completed

Environment

  • Instance: MoH Mali CHW
  • Browser: NA
  • Client platform: NA
  • App: sentinel
  • Version: 4.1.0

Additional context
App Monitoring showed (private GH issue) that this happened EXACTLY after we upgrade from 4.0.1 to 4.1.0 on Mar 9th:

image

Possibly related to the last time this bui happened in #7113 ?

@mrjones-plip mrjones-plip added Type: Bug Fix something that isn't working as intended Affects: 4.1.0 labels Mar 31, 2023
@kennsippell kennsippell added the Regression Affects a feature that worked in a previous release label Mar 31, 2023
@kennsippell
Copy link
Member

kennsippell commented Apr 4, 2023

2023-04-03 22:30:21 ERROR: Failed loading transition "muting" 
2023-04-03 22:30:21 ERROR: Error: Configuration error. Config must define have a 'muting.mute_forms' array defined.
    at Object.init (/home/kenn/cht-core/shared-libs/transitions/src/transitions/muting.js:181:13)
    at Object.loadTransition [as _loadTransition] (/home/kenn/cht-core/shared-libs/transitions/src/transitions/index.js:184:16)
    at /home/kenn/cht-core/shared-libs/transitions/src/transitions/index.js:149:12
    at Array.forEach (<anonymous>)
    at Object.loadTransitions (/home/kenn/cht-core/shared-libs/transitions/src/transitions/index.js:139:25)
    at Object.loadTransitions (/home/kenn/cht-core/sentinel/src/transitions.js:10:20)
    at /home/kenn/cht-core/sentinel/src/config.js:64:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
2023-04-03 22:30:21 WARN: Disabled transition "mark_for_outbound" 
2023-04-03 22:30:21 INFO: Loading transition "create_user_for_contacts" 
2023-04-03 22:30:21 ERROR: Transitions are disabled until the above configuration errors are fixed

This can be fixed in app configuration by disabling the muting transition. There are no muting forms for this project.

ERROR: Transitions are disabled until the above configuration errors are fixed

Is sentinel doing nothing in this state? What is the purpose of having sentinel run but do nothing like this? The log output makes things look pretty healthy. Would a better pattern be to fail fast? Or could we at least re-print this error each 5 minutes so this isn't a needle in a haystack?

@kennsippell kennsippell self-assigned this Apr 4, 2023
@dianabarsan
Copy link
Member

dianabarsan commented Apr 4, 2023

Is sentinel doing nothing in this state?

This is correct, Sentinel is doing no transition processing in this state. The backlog value is true.
The scheduler log that @mrjones-plip linked is related to a completely separate part of Sentinel, which is working correctly even with invalid transition config.

@dianabarsan
Copy link
Member

dianabarsan commented Apr 4, 2023

Or could we at least re-print this error each 5 minutes so this isn't a needle in a haystack?

I think this is a great suggestion. Having a way to quickly check whether sentinel is indeed processing transitions over documents is very important.

My preferred solution for this is monitoring service logs, which is on the Infrastructure Focus Group OKRs as a goal for the near future.

@kennsippell
Copy link
Member

kennsippell commented Apr 4, 2023

Another option is a boolean flag for is sentinel processing docs in monitoring api?

monitoring service logs, which is on the Infrastructure Focus Group OKRs as a goal for the near future

Are there more details on what this mean? Even high level vision would be helpful.

@dianabarsan
Copy link
Member

dianabarsan commented Apr 4, 2023

boolean flag for is sentinel processing docs in monitoring api?

That works too. Adding a health check to Sentinel has been something we've discussed several times in the past. Right now, Sentinel doesn't communicate its state except for the two "seq" docs, which are already used by the monitoring API.

Are there more details on what this mean?

So far, nothing is settled. High level vision would be that all apps logs (api, sentinel, haproxy, nginx, couchdb, etc) are scanned for errors and other significant entries, with filtered entries being funneled into a digestible dashboard of sorts.

@kennsippell kennsippell changed the title Monitoring sentinel backlog values are wrong Monitoring sentinel when "Transitions are disabled" is difficult Apr 4, 2023
@kennsippell kennsippell added Type: Improvement Make something better and removed Type: Bug Fix something that isn't working as intended Regression Affects a feature that worked in a previous release Affects: 4.1.0 labels Apr 4, 2023
@mrjones-plip
Copy link
Contributor Author

Thanks for the clean up of title and labels of this ticket @kennsippell - much appreciated!

@kennsippell
Copy link
Member

Ready for AT on 8162-logs-errors-when-trans-disabled

To test this - update an app_settings.json with an unknown or crashing transition. Look at sentinel logs and verify there is a descriptive error. Wait 5 minutes and verify the descriptive error is reprinted.

@kennsippell kennsippell added the retro-action-item retro-action-item label Apr 6, 2023
@mrjones-plip
Copy link
Contributor Author

@kennsippell - to be a bit more explicit for QA on how to test, "unknown or crashing transition" could be achieved with smang in app_settings.json , yeah?

  "transitions": {
    "smang": true
  }

@dianabarsan dianabarsan added this to the 4.3.0 milestone Apr 25, 2023
@ngaruko ngaruko self-assigned this May 2, 2023
@ngaruko
Copy link
Contributor

ngaruko commented May 4, 2023

How hard would it be to add a small e2e tests to transitions tests @kennsippell ? It would go along way towards our new quality assistance regime.

@ngaruko
Copy link
Contributor

ngaruko commented May 6, 2023

Testing details

Config: Default
Environment: Local
Platform: WebApp
Browser: Chrome


Test scenario

  1. Add 'unkown transition' in app_settings.tasks.rules as in this comment and upload settings
  2. Check error in sentinel logs

Fixed on 8162-logs-errors-when-trans-disabled >> Log in feedback doc

Sentinel logs image

Test passed successfully. 🎉

kennsippell added a commit that referenced this issue May 11, 2023
…disabled (#8167)

When sentinel transitions are disabled the error gets buried deep in the logs. This resurfaces the error through a new scheduled task.

#8162
@kennsippell
Copy link
Member

Merged. Spoke with @ngaruko about the e2e test suggestion. He said he'd look at existing tests and file a followup ticket as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
retro-action-item retro-action-item Type: Improvement Make something better
Projects
Status: Done
Development

No branches or pull requests

4 participants