[RELENG-158] Bug 1679162 - Split failed task log parsing into multiple queues and workers #6891
Ok, so. As of 0eb5d21, the unit tests are green, and I think it's a better fix than the previous route, which is at https://github.com/escapewindow/treeherder/tree/logparsing-broken , specifically https://github.com/escapewindow/treeherder/commit/55a4c0ed54c29dd5fa70aed3d2cc5ebd782cf920 . The problem with 55a4c0 is that parse_logs looks asynchronous, but it waits for the log parsing task to exit, which makes it effectively synchronous. No? Also, I wasn't able to green up 4 unit tests, which tells me either I broke something, or the unit tests need to be changed to match the new code. What I like better about 0eb5d21 is that we're still running async. @camd, let me know what you think?
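To make the async concern concrete, here is a minimal sketch using only the standard library rather than the actual task code in either commit. `parse_log`, `schedule_and_wait`, and `schedule_only` are hypothetical names: the first helper blocks on each task right after submitting it (so the work is serialized), while the second only schedules and lets the caller collect results later.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parse_log(name):
    # Hypothetical stand-in for a log parsing task; sleeps to simulate work.
    time.sleep(0.05)
    return f"parsed {name}"


def schedule_and_wait(executor, names):
    # Submitting and immediately blocking on .result() serializes the tasks,
    # even though they are nominally handed to a worker pool.
    return [executor.submit(parse_log, name).result() for name in names]


def schedule_only(executor, names):
    # Submitting everything up front returns right away; the tasks run
    # concurrently and the caller decides when (or whether) to wait.
    return [executor.submit(parse_log, name) for name in names]
```

With two names and two workers, `schedule_and_wait` takes roughly two task durations, while `schedule_only` returns almost immediately.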
Codecov Report
@@ Coverage Diff @@
## master #6891 +/- ##
==========================================
+ Coverage 88.12% 88.17% +0.05%
==========================================
Files 286 285 -1
Lines 13413 13398 -15
==========================================
- Hits 11820 11814 -6
+ Misses 1593 1584 -9
Green! I'm going to optimistically take this out of draft.
Nice! I will review this shortly. :)
This is now deployed to Prototype and looks like it's working well so far. The only thing is that the older |
Makes sense. I pushed a fix. I think this will work -- we'll parse both the json and raw from the old queue in this worker, but hopefully we'll have a limited number of old messages we'll have to parse. |
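Roughly, the idea of one worker handling both log types from the old queue could be sketched like this. It is an illustration under assumptions, not the code in the pushed fix: the two parser functions and `handle_old_queue_message` are hypothetical names, and only the log names `errorsummary_json` and `live_backing_log` and the "raise the first exception at the end" pattern are taken from the diff in this PR.

```python
def parse_errorsummary_json(job_id):
    # Hypothetical parser for the JSON error summary.
    return ("errorsummary_json", job_id)


def parse_live_backing_log(job_id):
    # Hypothetical parser for the raw live backing log.
    return ("live_backing_log", job_id)


# One table covering both log types, so a message from the old queue
# can be handled no matter which kind of log it refers to.
PARSERS = {
    "errorsummary_json": parse_errorsummary_json,
    "live_backing_log": parse_live_backing_log,
}


def handle_old_queue_message(job_id, log_names):
    parsed = []
    first_exception = None
    for name in log_names:
        parser = PARSERS.get(name)
        if parser is None:
            continue  # Unknown log type: skip it.
        try:
            parsed.append(parser(job_id))
        except Exception as exc:
            # Keep going so one bad log doesn't block the others.
            if first_exception is None:
                first_exception = exc
    if first_exception:
        raise first_exception
    return parsed
```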
# Handles the log parsing tasks scheduled by `worker_store_pulse_data` as part of job ingestion.
worker_log_parser: REMAP_SIGTERM=SIGQUIT newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q log_parser --concurrency=7
worker_log_parser_fail: REMAP_SIGTERM=SIGQUIT newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q log_parser_fail --concurrency=1
worker_log_parser_fail_sheriffed: REMAP_SIGTERM=SIGQUIT newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q log_parser_fail_raw_sheriffed --concurrency=1
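For readers following along, the queue split could be pictured with a small selection function like the one below. This is a sketch under assumptions: `choose_queue` and its parameters are hypothetical, and apart from `log_parser` and `log_parser_fail_raw_sheriffed` (which appear in the Procfile lines above) the generated queue names, including the `unsheriffed` suffix, are guesses at a naming pattern rather than names confirmed by this PR.

```python
def choose_queue(job_failed, log_format, sheriffed):
    # Successful jobs keep using the general-purpose queue.
    if not job_failed:
        return "log_parser"
    if log_format not in ("json", "raw"):
        raise ValueError(f"unexpected log format: {log_format!r}")
    # Assumed naming pattern: log_parser_fail_<format>_<sheriffed|unsheriffed>.
    suffix = "sheriffed" if sheriffed else "unsheriffed"
    return f"log_parser_fail_{log_format}_{suffix}"
```

Each resulting queue can then get its own worker line and concurrency setting, as in the Procfile entries above.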
I had mentioned including the log_parser_fail queue here until it is drained. However, to avoid needing a follow-up PR, we can just stop the ingestion of new tasks till the queues are drained. Then resume after deploy. We will just need to remember to do this for each instance.
At least, this is especially critical for production. I suppose the others don't matter much. But it would be a good test of the process to do this on stage and prototype too.
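The "wait until the queues are drained" step could be scripted along these lines. It is only a sketch: `wait_until_drained` is a hypothetical helper, and `get_depth` is an assumed callable that returns the number of pending messages for a queue name (how to obtain that number depends on the broker and is not shown here).

```python
import time


def wait_until_drained(get_depth, queues, poll_seconds=1.0,
                       timeout_seconds=600.0, sleep=time.sleep,
                       clock=time.monotonic):
    # Poll until every queue reports zero pending messages.
    # Returns True once drained, False if the timeout expires first.
    deadline = clock() + timeout_seconds
    while True:
        pending = {}
        for queue in queues:
            depth = get_depth(queue)
            if depth > 0:
                pending[queue] = depth
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The intended order would be: stop ingestion of new tasks, call this for the old queues, deploy, then resume ingestion, repeating the same steps for each instance.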
Yeah, sorry about that. I was mid-review when I went to lunch and neglected to publish these comments first. :). Thanks for pivoting.
No worries, thank you for your help!
if first_exception:
    raise first_exception

if "errorsummary_json" in completed_names and "live_backing_log" in completed_names:
This removal will leave some dead code. I'll investigate and give some links to code to remove with this.
On second thought, I think I should just bite the bullet and create a separate PR to remove this autoclassify and crossreference code. That way it won't gum up your PR and we could more easily resurrect it later if we ever decide to. Let's leave your changes as-is and I'll remove the dead code in a separate PR. I'll have a few things to check to make sure we don't break anything.
Ok! I think I've addressed all your review comments, then. Should we bake 5970092 in prototype?
Yep, I'll push your new code to prototype now. :)
camd left a comment
What can I say, man. You did an awesome job here. :). Thanks for doing this! It seems to be running great on prototype. I'm going to merge it, in just a few, but I'll test our queue draining dance on prototype first.
Thanks!!