Optimize job details loading with empty disk cache #3144

mdoucha · 2020-06-01T14:49:10Z

Each snippet of console output is stored in a separate file in the job result directory. Collect these tiny files into module result JSON files at the end of each job. This can speed up server-side job details JSON generation up to 5 times when the disk cache is empty.

On my local OpenQA instance, this patch speeds up details_ajax JSON generation for ltp_syscalls jobs as follows:

empty cache: from 3.3 seconds to 750ms
full cache: from 750ms to 450ms

Martchus

The idea makes sense. I had this idea actually myself and also measured the performance gain but didn't implement it because it isn't that simple (see review comments). Therefore I haven't had the time to implement it "properly" but it might be worth taking the time instead of having to deal with the possible downsides of this "simple" approach.

lib/OpenQA/Schema/Result/JobModules.pm

lib/OpenQA/Schema/Result/Jobs.pm

lib/OpenQA/Schema/Result/JobModules.pm

mdoucha · 2020-06-03T13:35:33Z

Test result cleanup reworked as minion job. I'll add a separate commit for the step view and deduplication.

Martchus

Thanks. It looks generally good, just a few points.

t/14-grutasks.t

lib/OpenQA/Task/Job/FinalizeResults.pm

lib/OpenQA/Schema/Result/JobModules.pm

codecov · 2020-06-03T14:53:05Z

Codecov Report

❗ No coverage uploaded for pull request base (master@1c62833). Click here to learn what that means.
The diff coverage is 96.87%.

@@            Coverage Diff            @@
##             master    #3144   +/-   ##
=========================================
  Coverage          ?   91.42%           
=========================================
  Files             ?      212           
  Lines             ?    12980           
  Branches          ?        0           
=========================================
  Hits              ?    11867           
  Misses            ?     1113           
  Partials          ?        0

Impacted Files	Coverage Δ
lib/OpenQA/WebAPI/Plugin/Gru.pm	`84.94% <ø> (ø)`
lib/OpenQA/Schema/Result/JobModules.pm	`93.96% <95.23%> (ø)`
lib/OpenQA/Schema/Result/Jobs.pm	`97.88% <100.00%> (ø)`
lib/OpenQA/Task/Job/FinalizeResults.pm	`100.00% <100.00%> (ø)`
lib/OpenQA/Utils.pm	`97.14% <100.00%> (ø)`
lib/OpenQA/WebAPI/Controller/Step.pm	`92.59% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c62833...22dfa75. Read the comment docs.

lib/OpenQA/Schema/Result/JobModules.pm

kraih · 2020-06-04T09:47:13Z

lib/OpenQA/Schema/Result/JobModules.pm

+        rename $tmpfile->to_string, $file->to_string;
+    };
+
+    eval { $tmpfile->remove if -e $tmpfile; };


Don't just pre-emptively use eval. If you're expecting a specific exception here then handle that case gracefully, but don't hide everything.

I think this eval still deserves a handler - even if it would just log the message. And I'm nevertheless wondering whether we should mark the Minion job as failed in the end when any of these error handler were entered. Otherwise we'll probably never notice when there's a problem.

The second eval is just a post-fail cleanup. It does nothing if the first eval succeeds. Do we really need log_error "Cleaning up after the error that was already logged above also failed"?

Mh… ok in this case one of the error would likely be sufficient.

That explanation doesn't make much sense to me. What is the actual purpose of this cleanup code? Why would the temp file still exist here? Why do you expect removing it to fail? It looks to me like if the first eval succeeds, then this would do nothing since the tempfile is gone. Do you expect the first eval to fail and leave a temp file behind that needs to be cleaned up? Why is the cleanup code not in the $@ block then? Why would the process be unable to remove it? Do you expect permissions to change? Why not do a -w check then?

I'm probably missing something here, but i can't imagine there is a good reason to have an eval without exception handling.

Point 1 seems irrelevant here, it's just a temp file. And point 3 you still left unanswered, what exception do you expect that makes the code fail but that's totally irrelevant. If the file system is that broken that you can't delete files you created that would seem like something i'd like to see as an error for in the job result (instead of just finding garbage files without any clue where they came from). That leaves only point 2, and that's just the reason for the ->remove call, not a justification to wrap it in an eval.

If the code can't delete files, you WILL see it in the log. It'll be the exact same error that was already logged when $tmpfile->spurt() or rename() failed. So there's no point logging it twice. Especially when you can expect hundreds of duplicate errors because I'm deliberately keeping the minion job going until it cycles through all test modules.

The reason for wrapping $tmpfile->remove() in an eval block is that I don't want to terminate the minion job if cleanup fails. The failure could be a temporary hickup.

tempfile() creates the file with permissions 0600. I'd have to run chmod() in addition to rename(). And no, I will NOT use move_to() here.

Then all of your problems would be solved by using a scope guard there. No need for ugly hacks.

Then all of your problems would be solved by using a scope guard there. No need for ugly hacks.

And risk that the scope guard behavior changes in the future and it starts throwing bogus errors when it fails to remove the renamed file? No, thanks. Do you understand that the only thing your "improvements" would add to the code is technical debt? The code is working as intended. Period.

That's a silly line of argumentation, how do you know the $file->sibling call above you did not wrap in an eval won't change? I control that method upstream, i could change it right now just to mess with you. I won't of course, but all the methods we are talking about here are controlled by me upstream, and i'm just trying to tell you how to use them in the safest way possible.

Anyway, it doesn't matter because scope guards work exactly in the way you need them to.

$ perl -MMojo::Util=scope_guard -E 'eval { my $guard = scope_guard sub { warn "warning from scope guard"; die "exception from scope guard" }; die "real exception" }; warn "eval error: $@" if $@;' warning from scope guard at -e line 1. eval error: real exception at -e line 1.

Just because you don't understand a concept yet that doesn't make it technical debt. Scope guards are a standard Perl pattern, and i'm happy to explain it to you in more detail if needed.

That's a silly line of argumentation, how do you know the $file->sibling call above you did not wrap in an eval won't change? I control that method upstream, i could change it right now just to mess with you. I won't of course, but all the methods we are talking about here are controlled by me upstream, and i'm just trying to tell you how to use them in the safest way possible.

I would write this PR exactly the same way even if I myself were the upstream maintainer of the Mojo library. The code is as simple as it can be and any extra abstractions would only make it more complex.

lib/OpenQA/Schema/Result/Jobs.pm

mdoucha · 2020-06-04T10:40:39Z

BTW, I ran into some os-autoinst bugs yesterday while testing Unicode handling in this PR. When a command executed on SUT prints non-ASCII UTF-8 output, it'll show up correctly in serial_terminal.txt but wait_serial() will return binary garbage that was most likely passed through encode('UTF-8', $str) twice too many.

mdoucha · 2020-06-04T15:05:15Z

I've added error handling to the finalize task, obsolete file deletion task and moved text_data loading to JobModules::results. One small problem: enqueue(... => {delay => x}) doesn't work. The task gets executed immediately on my instance.

kraih · 2020-06-04T16:11:38Z

lib/OpenQA/Task/Job/FinalizeResults.pm

+}
+
+sub _finalize_results {
+    my ($app, $task, $job_id) = @_;


That should actually be $job instead of $task (it's a Minion::Job object).

Or use $minion_job to better distinguish from openQA jobs (which are more commonly "the job" in the context of openQA).

I'm already using $job for the OpenQA job. I don't see what's the problem with calling the minion job $task.

In Minion the term "task" has a specific meaning. It's like a class name for the job, while the jobs are instances of it. There can only be one task "finalize_job_results", but there can be many "finalize_job_results" jobs in the queue. Hope that explanation makes sense.

So $minion_job would be quite a bit better.

lib/OpenQA/Schema/Result/JobModules.pm

lib/OpenQA/Schema/Result/Jobs.pm

t/14-grutasks.t

mdoucha · 2020-06-05T11:00:05Z

Finalize jobs will now be marked as failed if any of the modules doesn't get properly finalized. Also added a comment explaining the enqueue() conditions in Jobs::store_column().

lib/OpenQA/Schema/Result/JobModules.pm

lib/OpenQA/Task/Job/FinalizeResults.pm

Martchus · 2020-06-26T13:15:46Z

@mdoucha told me he will spent no further time on this PR so I'm taking over. It would be a waste to throw the code away at this point and this is actually a beneficial optimization.

okurz

I realized you add new functionality that is not covered by tests. As you are on it I strongly recommend to cover the complete patch

Each snippet of console output is stored in a separate file in the job result directory. Collect these tiny files into module result JSON files at the end of each job. This can speed up server-side job details JSON generation up to 5 times when the disk cache is empty.

Martchus · 2020-06-29T08:48:21Z

According to the checks our criteria for test coverage is reached. However, I haven't even tested this sufficiently manually and checked whether the unit tests are sufficient.

lib/OpenQA/Schema/Result/JobModules.pm

Scope guards are now used correctly.

lib/OpenQA/Task/Job/FinalizeResults.pm

Martchus · 2020-06-29T12:36:16Z

The existing tests covered everything except the error handling which is now covered as well. I also tested it locally on a real job and it works.

okurz

awesome! Let's see if mergify merges after having two approvals but checkstyle fails :)

lib/OpenQA/Schema/Result/JobModules.pm

Martchus

I'll have to fix the points mentioned by @mdoucha.

…ils loading

* Remove $collect_text flag and instead cover that case within tests * Handle errors when reading the text file * Improve coding style

* Test error cases as well * Check the minion job state

lib/OpenQA/Schema/Result/JobModules.pm

…n finalized

I guess it should be good enough now. The last commit is hard to test but the code should be straight forward enough.

AdamWill · 2020-09-15T22:26:28Z

I think this may have broken stale job detection. Generally speaking it seems dangerous to have DB schema code assuming whatever 'app' is using it will have a $self->gru, if I'm understanding what OpenQA::App->singleton does correctly?

okurz · 2020-09-16T04:37:15Z

Do you mean that the conditional execution of $app->gru->enqueue(finalize_job_results... does not ensure that the job ever finishes?

Martchus · 2020-09-16T05:04:52Z

#3389 will fix it implicitly because I amended the condition to make tests pass. However, the solution isn't ideal. Maybe it makes sense to initialize the gru plugin within the scheduler.

kraih · 2020-09-16T09:15:07Z

Since gru is part of the model it's an easy argument to make that it should be shared between all webui micro services.

Martchus · 2020-09-16T09:17:26Z

@kraih Yes, I've created #3397 for that.

Martchus requested changes Jun 2, 2020

View reviewed changes

mdoucha force-pushed the jobresult-speed branch from 33ce7e3 to 999c89f Compare June 3, 2020 13:30

Martchus requested changes Jun 3, 2020

View reviewed changes

t/14-grutasks.t Outdated Show resolved Hide resolved

lib/OpenQA/Task/Job/FinalizeResults.pm Outdated Show resolved Hide resolved

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

kraih previously requested changes Jun 4, 2020

View reviewed changes

kraih mentioned this pull request Jun 4, 2020

Turn two schema functions into methods #3158

Merged

mdoucha force-pushed the jobresult-speed branch 2 times, most recently from eee61a4 to 6e5e3e9 Compare June 4, 2020 15:02

kraih reviewed Jun 4, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

Martchus reviewed Jun 4, 2020

View reviewed changes

lib/OpenQA/Schema/Result/Jobs.pm Show resolved Hide resolved

Martchus reviewed Jun 5, 2020

View reviewed changes

t/14-grutasks.t Outdated Show resolved Hide resolved

mdoucha force-pushed the jobresult-speed branch from 6e5e3e9 to b0a2d12 Compare June 5, 2020 10:52

kraih reviewed Jun 5, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

kraih reviewed Jun 5, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

kraih reviewed Jun 5, 2020

View reviewed changes

lib/OpenQA/Task/Job/FinalizeResults.pm Outdated Show resolved Hide resolved

mdoucha force-pushed the jobresult-speed branch from b0a2d12 to f4b3f3e Compare June 5, 2020 15:29

Martchus force-pushed the jobresult-speed branch from f4b3f3e to 084d6e8 Compare June 26, 2020 13:16

okurz approved these changes Jun 26, 2020

View reviewed changes

okurz requested changes Jun 26, 2020

View reviewed changes

mdoucha added 3 commits June 29, 2020 10:46

Unify test module text output loading

f807945

Add minion task to remove duplicate result text files

f15c02e

kraih reviewed Jun 29, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

kraih reviewed Jun 29, 2020

View reviewed changes

lib/OpenQA/Task/Job/FinalizeResults.pm Outdated Show resolved Hide resolved

Martchus force-pushed the jobresult-speed branch from a7566b4 to c9c4737 Compare June 29, 2020 12:32

Martchus approved these changes Jun 29, 2020

View reviewed changes

okurz approved these changes Jun 29, 2020

View reviewed changes

Martchus force-pushed the jobresult-speed branch from c9c4737 to f118d75 Compare June 29, 2020 13:36

mdoucha commented Jun 29, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

mdoucha commented Jun 29, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Show resolved Hide resolved

mdoucha commented Jun 29, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved

Martchus previously requested changes Jun 29, 2020

View reviewed changes

Martchus added 5 commits June 29, 2020 16:49

Simplify and improve code for finalizing results to optimize job deta…

076014a

…ils loading

Improve gathering job module results

41bfe60

* Remove $collect_text flag and instead cover that case within tests * Handle errors when reading the text file * Improve coding style

Improve tests for finalizing job results

5a66701

* Test error cases as well * Check the minion job state

Make details JSON globally readable when finalizing job module results

f806e03

Skip reading text files when reading job modules for label carry over

06ed3c2

Martchus force-pushed the jobresult-speed branch from f118d75 to 2271115 Compare June 29, 2020 16:21

mdoucha commented Jun 30, 2020

View reviewed changes

lib/OpenQA/Schema/Result/JobModules.pm Show resolved Hide resolved

Martchus force-pushed the jobresult-speed branch from 2271115 to 9344968 Compare June 30, 2020 09:45

Avoid errors when serving job module results while they have just bee…

22dfa75

…n finalized

Martchus force-pushed the jobresult-speed branch from 9344968 to 22dfa75 Compare June 30, 2020 09:47

Martchus approved these changes Jun 30, 2020

View reviewed changes

mergify bot merged commit bedf7a0 into os-autoinst:master Jun 30, 2020

mdoucha deleted the jobresult-speed branch September 16, 2020 08:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize job details loading with empty disk cache #3144

Optimize job details loading with empty disk cache #3144

mdoucha commented Jun 1, 2020

Martchus left a comment

mdoucha commented Jun 3, 2020

Martchus left a comment

codecov bot commented Jun 3, 2020 •

edited

Loading

kraih Jun 4, 2020

Martchus Jun 5, 2020

mdoucha Jun 5, 2020

Martchus Jun 5, 2020

kraih Jun 5, 2020 •

edited

Loading

mdoucha Jun 15, 2020

kraih Jun 15, 2020

mdoucha Jun 15, 2020

kraih Jun 15, 2020

mdoucha Jun 15, 2020

mdoucha commented Jun 4, 2020

mdoucha commented Jun 4, 2020 •

edited

Loading

kraih Jun 4, 2020

Martchus Jun 4, 2020

mdoucha Jun 5, 2020

kraih Jun 5, 2020

kraih Jun 5, 2020

mdoucha commented Jun 5, 2020

Martchus commented Jun 26, 2020

okurz left a comment

Martchus commented Jun 29, 2020

Martchus commented Jun 29, 2020

okurz left a comment

Martchus left a comment

AdamWill commented Sep 15, 2020

okurz commented Sep 16, 2020

Martchus commented Sep 16, 2020

kraih commented Sep 16, 2020

Martchus commented Sep 16, 2020

Optimize job details loading with empty disk cache #3144

Optimize job details loading with empty disk cache #3144

Conversation

mdoucha commented Jun 1, 2020

Martchus left a comment

Choose a reason for hiding this comment

mdoucha commented Jun 3, 2020

Martchus left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 3, 2020 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kraih Jun 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdoucha commented Jun 4, 2020

mdoucha commented Jun 4, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdoucha commented Jun 5, 2020

Martchus commented Jun 26, 2020

okurz left a comment

Choose a reason for hiding this comment

Martchus commented Jun 29, 2020

Martchus commented Jun 29, 2020

okurz left a comment

Choose a reason for hiding this comment

Martchus left a comment

Choose a reason for hiding this comment

AdamWill commented Sep 15, 2020

okurz commented Sep 16, 2020

Martchus commented Sep 16, 2020

kraih commented Sep 16, 2020

Martchus commented Sep 16, 2020

codecov bot commented Jun 3, 2020 •

edited

Loading

kraih Jun 5, 2020 •

edited

Loading

mdoucha commented Jun 4, 2020 •

edited

Loading