Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save machine memory dumps #621

Merged
merged 4 commits into from Nov 15, 2016

Conversation

foursixnine
Copy link
Member

@foursixnine foursixnine commented Oct 13, 2016

Added the necessary code to perform an online migration of the machine.

After the vmstate.gz file has been generated, the machine can be restored or migrated by appending: -incoming "exec: gzip -c -d vmstate.gz" at the end of the qemu line. By grepping autoinst.txt looking for starting qemu service: the line to spawn qemu can be found.


This change is Reviewable

Added a mock save_state method on qemu driver (also is present on
backend/baseclass.pm) which stops the vm, uses a migration and
saves a compressed image of the running machine. Test should
explicitly call this method for now. It should be introduced in
testbase.pm and tests having at least fatal flag should call this.

WIP
* Verify that the upload will work.
* Add command line used to spawn the VM.
* Add tests on both sides, openQA and os-autoinst.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.234% when pulling 5378b4a on foursixnine:feature/save-vm-state into 8e22325 on os-autoinst:master.

@foursixnine foursixnine force-pushed the feature/save-vm-state branch 3 times, most recently from 554ea97 to 5d8c450 Compare October 13, 2016 13:28
Changed the code from hpm commands to qmp commands, to make the
code a bit more readable, also qmp commands allow for better
handling of states of the migration.

Integration tests needed, since when called alone isotovideo
will work fine, but when running behind the worker qemu can stall
on state "setup" for the migration.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 26.294% when pulling 5d8c450 on foursixnine:feature/save-vm-state into 8e22325 on os-autoinst:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 26.294% when pulling 5d3f9c7 on foursixnine:feature/save-vm-state into 8e22325 on os-autoinst:master.

@foursixnine
Copy link
Member Author

@agraf followed your recomendations... both sleeps are fixed.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 26.294% when pulling 5d3f9c7 on foursixnine:feature/save-vm-state into 8e22325 on os-autoinst:master.

@okurz
Copy link
Member

okurz commented Oct 13, 2016

Please write your commit messages in present imperative tense. Also see http://chris.beams.io/posts/git-commit/ as a helpful guide how to write good commit messages. You can reference the progress issue which describes the feature request.


save_state;

if backend supports it, save machine state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be extended heavily :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation has been updated and extended :)

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.209% when pulling 471998c on foursixnine:feature/save-vm-state into 8e22325 on os-autoinst:master.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice. Please see my inline comments. Please also squash commits using git rebase --interactive (dheidler can help on questions).

diag "Sucessfully saved disk #1.";

}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many blank lines

format => "qcow2"
});

diag "Sucessfully saved disk #1.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't you include the argument $args->{disk} here, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is fixed now.

$rsp = $self->handle_qmp_command({execute => "query-migrate"});

diag "QEMU MIGRATING BYTES TOTAL: \t" . $rsp->{return}->{ram}->{total};
diag "QEMU MIGRATING BYTES REMAINING: \t" . $rsp->{return}->{ram}->{remaining};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why capital letters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly debugging purposes. Changing that to:

diag "Migrating bytes     \t" . $rsp->{return}->{ram}->{total};
diag "Migrating remaining bytes:   \t" . $rsp->{return}->{ram}->{remaining};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about now?... Looks better :)

$rsp = $self->handle_qmp_command(
{
execute => "migrate_set_speed",
arguments => {value => "4095m"}});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?


sub save_memory_dump {

my $filename = shift || ref($autotest::current_test);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more commonly we try to use

my ($filename) = @_;
$filename //= ref($autotest::current_test);

What do you think about it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case using || makes sense. We will most probably not use 0 as a filename and with || we filter out empty strings.
So I'm for

my ($filename) = @_;
$filename ||= ref($autotest::current_test);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to stick too to || over //.

bmwqemu::diag("Trying to save machine drives");
bmwqemu::load_vars();

#TODO: right now, we're saving all the disks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just delete the TODO:. If it's an issue we should track it in progress tickets. If it's not an issue it will never be done and it does not harm :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted the TODO, but also modified the comment a bit.

within the test, but only from the post_fail_hook. So that memory and disk dumps
can be extracted without any risk of data changing.

It is assumed that the test writer will call freeze_vm before getting the memory or disk dumps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if we don't?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc was changed to explain why freeze_vm should be called when doing both, disk and memory dump.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.199% when pulling 515bc51 on foursixnine:feature/save-vm-state into 8e22325 on os-autoinst:master.


=cut

sub freeze_vm {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need an unfreeze too to be able to interact with the VM after we saved it off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the point where the test fails, it will stop the machine afterwards. Right now if the test is flagged as fatal, the backend will shut down the machine after the test's post fail hook is finished, hence no need to keep running the test.

{
execute => "migrate",
arguments => {
uri => sprintf("exec:gzip -c > ulogs/%s-vm-memory-dump.gz", $args->{filename}),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No pigz/pixz with nice level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet :(, would add a new dependency to servers, gzip is available everywhere... at least for the PoC, will be enough... Will discuss later with @coolo and see what he thinks about adding it to the backend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on top of that: we have enough problems with CPU stalls - and we have no time pressure at this point. The test is failed anyway.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why putting it on a friendly nice level is critical. And once you've done that, you can as well use as much cpu power as you have spare.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have 16 video encoders running with nice 19.

BTW: @foursixnine you need to unblock every additional tool used in the apparmor profiles in the openQA package

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how about we move the video encoders to nice 18 and have the compression be parallel with nice 19?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. this feature is implemented because we don't expect the developer debugging the failure available. If she was, we would have preferred to just keep the VM running.

And if the developer is not available, we have all time on earth to dump and upload. The CPU of the qemu process is spare (as the VM is frozen), so use it fully and free up the worker slot.


sub save_memory_dump {
my ($args) = @_;
$args->{filename} = $args->{filename} || ref($autotest::current_test);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can stlil use ||= :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :) Looks beautiful now.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.193% when pulling 4d1c3da on foursixnine:feature/save-vm-state into e1e471a on os-autoinst:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.209% when pulling 8088132 on foursixnine:feature/save-vm-state into e1e471a on os-autoinst:master.

@foursixnine
Copy link
Member Author

There's currently a bug, that triggers when there's nothing on the pool directory, and leaves the machine migration stalled. I still need to hunt that bug down. It is related to how the handle_qmp_command works. More on that later...

@okurz
Copy link
Member

okurz commented Oct 20, 2016

Please delete the disabled code about the migration speed (keep it in a local branch). Besides this, I think we are good to go. IMHO we can worry about the empty pool dir issue later

@foursixnine
Copy link
Member Author

@okurz I just removed the comments. Somehow the travis build failed... but while installing stuff via apt...

@okurz
Copy link
Member

okurz commented Oct 20, 2016

occassionally travis checks fail, in this case because apt-get couldn't install stuff. Any project admin of os-autoinst could retrigger the build or you just force push again after trivial changes.
another point: I guess you can rename the PR subject and remove the "PoC:"

@foursixnine foursixnine changed the title PoC: Save machine memory dumps Save machine memory dumps Oct 20, 2016
@foursixnine foursixnine force-pushed the feature/save-vm-state branch 2 times, most recently from 8088132 to 5a38428 Compare October 21, 2016 07:52
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.209% when pulling 5a38428 on foursixnine:feature/save-vm-state into e1e471a on os-autoinst:master.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next time please think about some tests :-)

Add save_storage_drive method to test api and freeze_vm method aswell,
while the first is needed in case the test writer would explicitly
require the test to save the hard drives present in the system, the
latter is needed to pause/freeze the vm, so that the memory dumps or
disk dumps are not affected, in case something in between happens.

Rename save_state to save_memory_dump to improve code readability.

Improve the documentation for methods mentioned above and some minor
code changes aswell.

Currently depends on bsc#1008148
@foursixnine
Copy link
Member Author

Finally i found the root of the problem with the stalled migrations, bsc#1008148.

Also added a check to verify that the migration is able to run, otherwise the method will die. Now we're able to merge and use this feature as soon as the qemu gets patched.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.14% when pulling ba627c4 on foursixnine:feature/save-vm-state into 4e2d106 on os-autoinst:master.

@okurz
Copy link
Member

okurz commented Nov 3, 2016

Now we're able to merge and use this feature as soon as the qemu gets patched.

you probably mean "as soon as qemu is patched, the factory package of qemu got updated, maintenance updates for an update of qemu will be created and accepted into supported released products, updates applied and rolled out on all our productive instances". So: expect the fix to be available in some months from now and better live with a workaround for the time being :-)

@sysrich
Copy link
Member

sysrich commented Nov 3, 2016

So: expect the fix to be available in some months WEEKS from now and better live with a workaround for the time being :-)

Fixed it for you..shouldn't be THAT long to get qemu maintenance updates out with @agraf and @foursixnine working together on it

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 26.155% when pulling c6af1d0 on foursixnine:feature/save-vm-state into 4e2d106 on os-autoinst:master.

@foursixnine
Copy link
Member Author

Bruce already got the patch provided by @agraf so it's all matter of waiting now.

@coolo
Copy link
Contributor

coolo commented Nov 8, 2016

we don't have to wait with deploying this though - only with using it :)

@aaannz
Copy link
Contributor

aaannz commented Nov 15, 2016

This can be merged, right?

@foursixnine
Copy link
Member Author

foursixnine commented Nov 15, 2016

@aaannz From my side yes

@coolo coolo merged commit 1081cea into os-autoinst:master Nov 15, 2016
@foursixnine foursixnine deleted the feature/save-vm-state branch June 25, 2018 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants