Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize job details loading with empty disk cache #3144

Merged
merged 9 commits into from
Jun 30, 2020

Conversation

mdoucha
Copy link
Contributor

@mdoucha mdoucha commented Jun 1, 2020

Each snippet of console output is stored in a separate file in the job result directory. Collect these tiny files into module result JSON files at the end of each job. This can speed up server-side job details JSON generation up to 5 times when the disk cache is empty.

On my local OpenQA instance, this patch speeds up details_ajax JSON generation for ltp_syscalls jobs as follows:

  • empty cache: from 3.3 seconds to 750ms
  • full cache: from 750ms to 450ms

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea makes sense. I had this idea actually myself and also measured the performance gain but didn't implement it because it isn't that simple (see review comments). Therefore I haven't had the time to implement it "properly" but it might be worth taking the time instead of having to deal with the possible downsides of this "simple" approach.

lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
lib/OpenQA/Schema/Result/Jobs.pm Outdated Show resolved Hide resolved
lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
@mdoucha
Copy link
Contributor Author

mdoucha commented Jun 3, 2020

Test result cleanup reworked as minion job. I'll add a separate commit for the step view and deduplication.

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It looks generally good, just a few points.

t/14-grutasks.t Outdated Show resolved Hide resolved
lib/OpenQA/Task/Job/FinalizeResults.pm Outdated Show resolved Hide resolved
lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jun 3, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@1c62833). Click here to learn what that means.
The diff coverage is 96.87%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #3144   +/-   ##
=========================================
  Coverage          ?   91.42%           
=========================================
  Files             ?      212           
  Lines             ?    12980           
  Branches          ?        0           
=========================================
  Hits              ?    11867           
  Misses            ?     1113           
  Partials          ?        0           
Impacted Files Coverage Δ
lib/OpenQA/WebAPI/Plugin/Gru.pm 84.94% <ø> (ø)
lib/OpenQA/Schema/Result/JobModules.pm 93.96% <95.23%> (ø)
lib/OpenQA/Schema/Result/Jobs.pm 97.88% <100.00%> (ø)
lib/OpenQA/Task/Job/FinalizeResults.pm 100.00% <100.00%> (ø)
lib/OpenQA/Utils.pm 97.14% <100.00%> (ø)
lib/OpenQA/WebAPI/Controller/Step.pm 92.59% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c62833...22dfa75. Read the comment docs.

kraih
kraih previously requested changes Jun 4, 2020
lib/OpenQA/Schema/Result/JobModules.pm Outdated Show resolved Hide resolved
rename $tmpfile->to_string, $file->to_string;
};

eval { $tmpfile->remove if -e $tmpfile; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't just pre-emptively use eval. If you're expecting a specific exception here then handle that case gracefully, but don't hide everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this eval still deserves a handler - even if it would just log the message. And I'm nevertheless wondering whether we should mark the Minion job as failed in the end when any of these error handler were entered. Otherwise we'll probably never notice when there's a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second eval is just a post-fail cleanup. It does nothing if the first eval succeeds. Do we really need log_error "Cleaning up after the error that was already logged above also failed"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh… ok in this case one of the error would likely be sufficient.

Copy link
Member

@kraih kraih Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That explanation doesn't make much sense to me. What is the actual purpose of this cleanup code? Why would the temp file still exist here? Why do you expect removing it to fail? It looks to me like if the first eval succeeds, then this would do nothing since the tempfile is gone. Do you expect the first eval to fail and leave a temp file behind that needs to be cleaned up? Why is the cleanup code not in the $@ block then? Why would the process be unable to remove it? Do you expect permissions to change? Why not do a -w check then?

I'm probably missing something here, but i can't imagine there is a good reason to have an eval without exception handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point 1 seems irrelevant here, it's just a temp file. And point 3 you still left unanswered, what exception do you expect that makes the code fail but that's totally irrelevant. If the file system is that broken that you can't delete files you created that would seem like something i'd like to see as an error for in the job result (instead of just finding garbage files without any clue where they came from). That leaves only point 2, and that's just the reason for the ->remove call, not a justification to wrap it in an eval.

If the code can't delete files, you WILL see it in the log. It'll be the exact same error that was already logged when $tmpfile->spurt() or rename() failed. So there's no point logging it twice. Especially when you can expect hundreds of duplicate errors because I'm deliberately keeping the minion job going until it cycles through all test modules.

The reason for wrapping $tmpfile->remove() in an eval block is that I don't want to terminate the minion job if cleanup fails. The failure could be a temporary hickup.

tempfile() creates the file with permissions 0600. I'd have to run chmod() in addition to rename(). And no, I will NOT use move_to() here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then all of your problems would be solved by using a scope guard there. No need for ugly hacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then all of your problems would be solved by using a scope guard there. No need for ugly hacks.

And risk that the scope guard behavior changes in the future and it starts throwing bogus errors when it fails to remove the renamed file? No, thanks. Do you understand that the only thing your "improvements" would add to the code is technical debt? The code is working as intended. Period.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a silly line of argumentation, how do you know the $file->sibling call above you did not wrap in an eval won't change? I control that method upstream, i could change it right now just to mess with you. I won't of course, but all the methods we are talking about here are controlled by me upstream, and i'm just trying to tell you how to use them in the safest way possible.

Anyway, it doesn't matter because scope guards work exactly in the way you need them to.

$ perl -MMojo::Util=scope_guard -E 'eval { my $guard = scope_guard sub { warn "warning from scope guard"; die "exception from scope guard" }; die "real exception" }; warn "eval error: $@" if $@;'
warning from scope guard at -e line 1.
eval error: real exception at -e line 1.

Just because you don't understand a concept yet that doesn't make it technical debt. Scope guards are a standard Perl pattern, and i'm happy to explain it to you in more detail if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a silly line of argumentation, how do you know the $file->sibling call above you did not wrap in an eval won't change? I control that method upstream, i could change it right now just to mess with you. I won't of course, but all the methods we are talking about here are controlled by me upstream, and i'm just trying to tell you how to use them in the safest way possible.

I would write this PR exactly the same way even if I myself were the upstream maintainer of the Mojo library. The code is as simple as it can be and any extra abstractions would only make it more complex.

lib/OpenQA/Schema/Result/Jobs.pm Outdated Show resolved Hide resolved
@mdoucha
Copy link
Contributor Author

mdoucha commented Jun 4, 2020

BTW, I ran into some os-autoinst bugs yesterday while testing Unicode handling in this PR. When a command executed on SUT prints non-ASCII UTF-8 output, it'll show up correctly in serial_terminal.txt but wait_serial() will return binary garbage that was most likely passed through encode('UTF-8', $str) twice too many.

@mdoucha mdoucha force-pushed the jobresult-speed branch 2 times, most recently from eee61a4 to 6e5e3e9 Compare June 4, 2020 15:02
@mdoucha
Copy link
Contributor Author

mdoucha commented Jun 4, 2020

I've added error handling to the finalize task, obsolete file deletion task and moved text_data loading to JobModules::results. One small problem: enqueue(... => {delay => x}) doesn't work. The task gets executed immediately on my instance.

}

sub _finalize_results {
my ($app, $task, $job_id) = @_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should actually be $job instead of $task (it's a Minion::Job object).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or use $minion_job to better distinguish from openQA jobs (which are more commonly "the job" in the context of openQA).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm already using $job for the OpenQA job. I don't see what's the problem with calling the minion job $task.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Minion the term "task" has a specific meaning. It's like a class name for the job, while the jobs are instances of it. There can only be one task "finalize_job_results", but there can be many "finalize_job_results" jobs in the queue. Hope that explanation makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So $minion_job would be quite a bit better.

t/14-grutasks.t Outdated Show resolved Hide resolved
@mdoucha
Copy link
Contributor Author

mdoucha commented Jun 5, 2020

Finalize jobs will now be marked as failed if any of the modules doesn't get properly finalized. Also added a comment explaining the enqueue() conditions in Jobs::store_column().

@Martchus
Copy link
Contributor

@mdoucha told me he will spent no further time on this PR so I'm taking over. It would be a waste to throw the code away at this point and this is actually a beneficial optimization.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized you add new functionality that is not covered by tests. As you are on it I strongly recommend to cover the complete patch

Each snippet of console output is stored in a separate file in the job result
directory. Collect these tiny files into module result JSON files at the end
of each job. This can speed up server-side job details JSON generation up to
5 times when the disk cache is empty.
@Martchus
Copy link
Contributor

According to the checks our criteria for test coverage is reached. However, I haven't even tested this sufficiently manually and checked whether the unit tests are sufficient.

@kraih kraih dismissed their stale review June 29, 2020 10:10

Scope guards are now used correctly.

@Martchus
Copy link
Contributor

The existing tests covered everything except the error handling which is now covered as well. I also tested it locally on a real job and it works.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome! Let's see if mergify merges after having two approvals but checkstyle fails :)

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to fix the points mentioned by @mdoucha.

* Remove $collect_text flag and instead cover that case within tests
* Handle errors when reading the text file
* Improve coding style
* Test error cases as well
* Check the minion job state
@Martchus Martchus dismissed their stale review June 30, 2020 10:58

I guess it should be good enough now. The last commit is hard to test but the code should be straight forward enough.

@mergify mergify bot merged commit bedf7a0 into os-autoinst:master Jun 30, 2020
@AdamWill
Copy link
Contributor

I think this may have broken stale job detection. Generally speaking it seems dangerous to have DB schema code assuming whatever 'app' is using it will have a $self->gru, if I'm understanding what OpenQA::App->singleton does correctly?

@okurz
Copy link
Member

okurz commented Sep 16, 2020

Do you mean that the conditional execution of $app->gru->enqueue(finalize_job_results... does not ensure that the job ever finishes?

@Martchus
Copy link
Contributor

#3389 will fix it implicitly because I amended the condition to make tests pass. However, the solution isn't ideal. Maybe it makes sense to initialize the gru plugin within the scheduler.

@mdoucha mdoucha deleted the jobresult-speed branch September 16, 2020 08:08
@kraih
Copy link
Member

kraih commented Sep 16, 2020

Since gru is part of the model it's an easy argument to make that it should be shared between all webui micro services.

@Martchus
Copy link
Contributor

@kraih Yes, I've created #3397 for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants