New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save machine memory dumps #621
Conversation
Added a mock save_state method on qemu driver (also is present on backend/baseclass.pm) which stops the vm, uses a migration and saves a compressed image of the running machine. Test should explicitly call this method for now. It should be introduced in testbase.pm and tests having at least fatal flag should call this. WIP * Verify that the upload will work. * Add command line used to spawn the VM. * Add tests on both sides, openQA and os-autoinst.
6584ba8
to
5378b4a
Compare
554ea97
to
5d8c450
Compare
Changed the code from hpm commands to qmp commands, to make the code a bit more readable, also qmp commands allow for better handling of states of the migration. Integration tests needed, since when called alone isotovideo will work fine, but when running behind the worker qemu can stall on state "setup" for the migration.
5d8c450
to
5d3f9c7
Compare
@agraf followed your recomendations... both sleeps are fixed. |
Please write your commit messages in present imperative tense. Also see http://chris.beams.io/posts/git-commit/ as a helpful guide how to write good commit messages. You can reference the progress issue which describes the feature request. |
|
||
save_state; | ||
|
||
if backend supports it, save machine state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be extended heavily :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation has been updated and extended :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice. Please see my inline comments. Please also squash commits using git rebase --interactive
(dheidler can help on questions).
diag "Sucessfully saved disk #1."; | ||
|
||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too many blank lines
format => "qcow2" | ||
}); | ||
|
||
diag "Sucessfully saved disk #1."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't you include the argument $args->{disk}
here, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is fixed now.
$rsp = $self->handle_qmp_command({execute => "query-migrate"}); | ||
|
||
diag "QEMU MIGRATING BYTES TOTAL: \t" . $rsp->{return}->{ram}->{total}; | ||
diag "QEMU MIGRATING BYTES REMAINING: \t" . $rsp->{return}->{ram}->{remaining}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why capital letters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly debugging purposes. Changing that to:
diag "Migrating bytes \t" . $rsp->{return}->{ram}->{total};
diag "Migrating remaining bytes: \t" . $rsp->{return}->{ram}->{remaining};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about now?... Looks better :)
$rsp = $self->handle_qmp_command( | ||
{ | ||
execute => "migrate_set_speed", | ||
arguments => {value => "4095m"}}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
|
||
sub save_memory_dump { | ||
|
||
my $filename = shift || ref($autotest::current_test); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more commonly we try to use
my ($filename) = @_;
$filename //= ref($autotest::current_test);
What do you think about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case using ||
makes sense. We will most probably not use 0 as a filename and with ||
we filter out empty strings.
So I'm for
my ($filename) = @_;
$filename ||= ref($autotest::current_test);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to stick too to || over //.
bmwqemu::diag("Trying to save machine drives"); | ||
bmwqemu::load_vars(); | ||
|
||
#TODO: right now, we're saving all the disks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just delete the TODO:
. If it's an issue we should track it in progress tickets. If it's not an issue it will never be done and it does not harm :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted the TODO, but also modified the comment a bit.
within the test, but only from the post_fail_hook. So that memory and disk dumps | ||
can be extracted without any risk of data changing. | ||
|
||
It is assumed that the test writer will call freeze_vm before getting the memory or disk dumps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if we don't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc was changed to explain why freeze_vm should be called when doing both, disk and memory dump.
471998c
to
515bc51
Compare
|
||
=cut | ||
|
||
sub freeze_vm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this need an unfreeze too to be able to interact with the VM after we saved it off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the point where the test fails, it will stop the machine afterwards. Right now if the test is flagged as fatal, the backend will shut down the machine after the test's post fail hook is finished, hence no need to keep running the test.
{ | ||
execute => "migrate", | ||
arguments => { | ||
uri => sprintf("exec:gzip -c > ulogs/%s-vm-memory-dump.gz", $args->{filename}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No pigz/pixz with nice level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet :(, would add a new dependency to servers, gzip is available everywhere... at least for the PoC, will be enough... Will discuss later with @coolo and see what he thinks about adding it to the backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on top of that: we have enough problems with CPU stalls - and we have no time pressure at this point. The test is failed anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why putting it on a friendly nice level is critical. And once you've done that, you can as well use as much cpu power as you have spare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have 16 video encoders running with nice 19.
BTW: @foursixnine you need to unblock every additional tool used in the apparmor profiles in the openQA package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So how about we move the video encoders to nice 18 and have the compression be parallel with nice 19?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. this feature is implemented because we don't expect the developer debugging the failure available. If she was, we would have preferred to just keep the VM running.
And if the developer is not available, we have all time on earth to dump and upload. The CPU of the qemu process is spare (as the VM is frozen), so use it fully and free up the worker slot.
|
||
sub save_memory_dump { | ||
my ($args) = @_; | ||
$args->{filename} = $args->{filename} || ref($autotest::current_test); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can stlil use ||= :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done :) Looks beautiful now.
515bc51
to
4d1c3da
Compare
4d1c3da
to
8088132
Compare
There's currently a bug, that triggers when there's nothing on the pool directory, and leaves the machine migration stalled. I still need to hunt that bug down. It is related to how the handle_qmp_command works. More on that later... |
Please delete the disabled code about the migration speed (keep it in a local branch). Besides this, I think we are good to go. IMHO we can worry about the empty pool dir issue later |
8088132
to
c168c15
Compare
@okurz I just removed the comments. Somehow the travis build failed... but while installing stuff via apt... |
occassionally travis checks fail, in this case because apt-get couldn't install stuff. Any project admin of |
8088132
to
5a38428
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next time please think about some tests :-)
Add save_storage_drive method to test api and freeze_vm method aswell, while the first is needed in case the test writer would explicitly require the test to save the hard drives present in the system, the latter is needed to pause/freeze the vm, so that the memory dumps or disk dumps are not affected, in case something in between happens. Rename save_state to save_memory_dump to improve code readability. Improve the documentation for methods mentioned above and some minor code changes aswell. Currently depends on bsc#1008148
5a38428
to
ba627c4
Compare
Finally i found the root of the problem with the stalled migrations, bsc#1008148. Also added a check to verify that the migration is able to run, otherwise the method will die. Now we're able to merge and use this feature as soon as the qemu gets patched. |
you probably mean "as soon as qemu is patched, the factory package of qemu got updated, maintenance updates for an update of qemu will be created and accepted into supported released products, updates applied and rolled out on all our productive instances". So: expect the fix to be available in some months from now and better live with a workaround for the time being :-) |
Fixed it for you..shouldn't be THAT long to get qemu maintenance updates out with @agraf and @foursixnine working together on it |
Bruce already got the patch provided by @agraf so it's all matter of waiting now. |
we don't have to wait with deploying this though - only with using it :) |
This can be merged, right? |
@aaannz From my side yes |
Added the necessary code to perform an online migration of the machine.
After the vmstate.gz file has been generated, the machine can be restored or migrated by appending: -incoming "exec: gzip -c -d vmstate.gz" at the end of the qemu line. By grepping autoinst.txt looking for starting qemu service: the line to spawn qemu can be found.
This change is