Skip to content

[RELENG-158] Bug 1679162 - Split failed task log parsing into multiple queues and workers#6891

Merged
camd merged 6 commits into
mozilla:masterfrom
escapewindow:logparsing
Dec 5, 2020
Merged

[RELENG-158] Bug 1679162 - Split failed task log parsing into multiple queues and workers#6891
camd merged 6 commits into
mozilla:masterfrom
escapewindow:logparsing

Conversation

@escapewindow
Copy link
Copy Markdown
Contributor

No description provided.

@escapewindow escapewindow changed the title Logparsing [RELENG-158] Log parsing Dec 3, 2020
Comment thread treeherder/log_parser/tasks.py Outdated
Comment thread treeherder/etl/jobs.py Outdated
@escapewindow escapewindow changed the title [RELENG-158] Log parsing [RELENG-158] Bug 1679162 - Log parsing Dec 3, 2020
Comment thread treeherder/etl/jobs.py Outdated
@escapewindow
Copy link
Copy Markdown
Contributor Author

Ok, so. As of 0eb5d21, the unit tests are green, and I think it's a better fix than the previous route, which is at https://github.com/escapewindow/treeherder/tree/logparsing-broken , specifically https://github.com/escapewindow/treeherder/commit/55a4c0ed54c29dd5fa70aed3d2cc5ebd782cf920 .

The problem with 55a4c0 is that it looks like a) parse_logs is synchronous, but it waits for the log parsing task to exit, making it synchronous. No? Also, I wasn't able to green up 4 unit tests, which tells me either I broke something, or the unit tests need to be changed to match the new code.

What I like better about 0eb5d21 is we're still running async parse_logs, but we call it with a single log per call (and separate queues). And the tests pass locally. I think this should work, but I don't know.

@camd, let me know what you think?

@codecov-io
Copy link
Copy Markdown

codecov-io commented Dec 4, 2020

Codecov Report

Merging #6891 (5970092) into master (f9b40f7) will increase coverage by 0.05%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6891      +/-   ##
==========================================
+ Coverage   88.12%   88.17%   +0.05%     
==========================================
  Files         286      285       -1     
  Lines       13413    13398      -15     
==========================================
- Hits        11820    11814       -6     
+ Misses       1593     1584       -9     
Impacted Files Coverage Δ
treeherder/config/settings.py 88.70% <ø> (ø)
treeherder/log_parser/tasks.py 63.01% <ø> (+2.29%) ⬆️
treeherder/etl/jobs.py 92.81% <93.33%> (-0.35%) ⬇️
treeherder/autoclassify/tasks.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9b40f7...5970092. Read the comment docs.

@escapewindow escapewindow marked this pull request as ready for review December 4, 2020 01:05
@escapewindow escapewindow requested a review from a team as a code owner December 4, 2020 01:05
@escapewindow
Copy link
Copy Markdown
Contributor Author

Green! I'm going to optimistically take this out of draft.

@camd
Copy link
Copy Markdown
Collaborator

camd commented Dec 4, 2020

Green! I'm going to optimistically take this out of draft.

Nice! I will review this shortly. :)

@camd camd temporarily deployed to treeherder-prototype December 4, 2020 19:08 Inactive
@camd
Copy link
Copy Markdown
Collaborator

camd commented Dec 4, 2020

This is now deployed to Prototype and looks like it's working well so far. The only thing is that the older log_parser_fail queue had two orphaned messages because we stopped processing it in favor of the new queues. I totally forgot about that. We'll need to process the old queues (add them to the worker in the -Q section) temporarily until the queue is drained, then we can remove it in a follow-up PR.

@escapewindow
Copy link
Copy Markdown
Contributor Author

This is now deployed to Prototype and looks like it's working well so far. The only thing is that the older log_parser_fail queue had two orphaned messages because we stopped processing it in favor of the new queues. I totally forgot about that. We'll need to process the old queues (add them to the worker in the -Q section) temporarily until the queue is drained, then we can remove it in a follow-up PR.

Makes sense. I pushed a fix. I think this will work -- we'll parse both the json and raw from the old queue in this worker, but hopefully we'll have a limited number of old messages we'll have to parse.

@escapewindow escapewindow changed the title [RELENG-158] Bug 1679162 - Log parsing [RELENG-158] Bug 1679162 - Split error log parsing into multiple queues and workers Dec 4, 2020
@escapewindow escapewindow changed the title [RELENG-158] Bug 1679162 - Split error log parsing into multiple queues and workers [RELENG-158] Bug 1679162 - Split failed task log parsing into multiple queues and workers Dec 4, 2020
Comment thread Procfile Outdated
# Handles the log parsing tasks scheduled by `worker_store_pulse_data` as part of job ingestion.
worker_log_parser: REMAP_SIGTERM=SIGQUIT newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q log_parser --concurrency=7
worker_log_parser_fail: REMAP_SIGTERM=SIGQUIT newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q log_parser_fail --concurrency=1
worker_log_parser_fail_sheriffed: REMAP_SIGTERM=SIGQUIT newrelic-admin run-program celery worker -A treeherder --without-gossip --without-mingle --without-heartbeat -Q log_parser_fail_raw_sheriffed --concurrency=1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had mentioned including the log_parser_fail queue here until it is drained. However, to avoid needing a follow-up PR, we can just stop the ingestion of new tasks till the queues are drained. Then resume after deploy. We will just need to remember to do this for each instance.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least, this is especially critical for production. I suppose the others don't matter much. But a good test of the process to do this on stage and prototype too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. I can drop 54a7d54 then.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry about that. I was mid-review when I went to lunch and neglected to publish these comments first. :). Thanks for pivoting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, thank you for your help!

Comment thread app.json Outdated
Comment thread lints/queuelint.py Outdated
if first_exception:
raise first_exception

if "errorsummary_json" in completed_names and "live_backing_log" in completed_names:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removal will leave some dead code. I'll investigate and give some links to code to remove with this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I think I should just bite the bullet and create a separate PR to remove this autoclassify and crossreference code. That way it won't gum up your PR and we could more easily resurrect it later if we ever decide to. Let's leave your changes as-is and I'll remove the dead code in a separate PR. I'll have a few things to check to make sure we don't break anything.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I think I've addressed all your review comments, then. Should we bake 5970092 in prototype?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll push your new code to prototype now. :)

@camd camd temporarily deployed to treeherder-prototype December 4, 2020 22:44 Inactive
Copy link
Copy Markdown
Collaborator

@camd camd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What can I say, man. You did an awesome job here. :). Thanks for doing this! It seems to be running great on prototype. I'm going to merge it, in just a few, but I'll test our queue draining dance on prototype first.

Thanks!!

@camd camd temporarily deployed to treeherder-prototype December 4, 2020 23:56 Inactive
@camd camd merged commit 3cc5cea into mozilla:master Dec 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants