
Memory dump #7517

Merged · 15 commits merged into kubevirt:main on May 27, 2022

Conversation


@ShellyKa13 ShellyKa13 commented Apr 4, 2022

What this PR does / why we need it:
This PR implements a virtctl command for taking a memory dump of a VM into a PVC.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Add virtctl Memory Dump command

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Apr 4, 2022
Comment on lines +1175 to +1187
l.domainModifyLock.Lock()
defer l.domainModifyLock.Unlock()
Member:

It seems like we needed this lock whenever we're performing a get+set on the domain spec, but I'm uncertain whether we want to hold it for the duration of the core dump. Holding this lock could block virt-handler's reconcile loop if the same VM is synced again while the dump is taking place.

Contributor Author:

If you think it's safe to remove it, I will.

Member:

The way you restructured this, with the lock not being held during the dump, looks good now.
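
A minimal sketch of that pattern, with all names illustrative rather than taken from the PR: the lock guards only the short check-and-set on shared state, and the long-running dump happens after it is released, so a concurrent reconcile of the same VM is not blocked.

package sketch

import (
	"errors"
	"sync"
)

type dumpManager struct {
	domainModifyLock sync.Mutex
	dumpInProgress   bool
}

// memoryDump takes the lock only for the check-and-set on shared state,
// then performs the slow libvirt dump (doDump) without holding it.
func (m *dumpManager) memoryDump(doDump func() error) error {
	m.domainModifyLock.Lock()
	if m.dumpInProgress {
		m.domainModifyLock.Unlock()
		return errors.New("memory dump already in progress")
	}
	m.dumpInProgress = true
	m.domainModifyLock.Unlock()

	// The long-running core dump runs here with the lock released,
	// so the reconcile loop can still sync the same VM concurrently.
	err := doDump()

	m.domainModifyLock.Lock()
	m.dumpInProgress = false
	m.domainModifyLock.Unlock()
	return err
}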

Comment on lines 1143 to 1131
memoryDumpInfo := &api.DomainMemoryDumpInfo{
DumpTimestamp: &now,
}
Member:

What happens if the memory dump fails? Will that get reported?

Contributor Author:

I planned to add error handling
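
A hedged sketch of what that error handling could record (field and helper names are hypothetical, not necessarily what the PR ended up with): the dump info carries either a completion timestamp or a failure reason, so a failed dump is reported rather than silently dropped.

package sketch

import "time"

// Hypothetical shape for reporting the outcome of a dump.
type domainMemoryDumpInfo struct {
	DumpTimestamp *time.Time
	Completed     bool
	Failed        bool
	FailureReason string
}

// recordDumpResult stamps the dump info with the result of the dump attempt.
func recordDumpResult(info *domainMemoryDumpInfo, dumpErr error) {
	now := time.Now()
	info.DumpTimestamp = &now
	if dumpErr != nil {
		info.Failed = true
		info.FailureReason = dumpErr.Error()
		return
	}
	info.Completed = true
}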

Comment on lines 439 to 441
case domainUpdate := <-domainUpdateChan:
guestEvent.domainMemoryDumpInfo = domainUpdate.MemoryDumpInfo
eventCallback(domainConn, domainCache, libvirtEvent{}, n, deleteNotificationSent, guestEvent, vmi)
Member:

I'm unsure how this works. It doesn't appear that the memory dump info is persisted on the Domain, so if this event isn't processed as expected by virt-handler, I think we'd never get another chance to obtain the result of the memory dump.

The mechanism we have for storing persistent info on the domainSpec is the domainSpec.Metadata field.

Contributor Author:

This is a really good point! I'm fixing it so that whether the memory dump completed can be obtained from the memoryDump call, by keeping the info of the last memory dump on the LibvirtDomainManager struct.
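
A rough sketch of that idea (struct and field names are illustrative): the manager remembers the outcome of the last dump keyed by target file, so virt-handler can re-query the result instead of depending on a single event being delivered.

package sketch

import "sync"

type lastDumpInfo struct {
	targetFile string
	completed  bool
	failureMsg string
}

type domainManagerSketch struct {
	mu       sync.Mutex
	lastDump *lastDumpInfo
}

// setLastDump records the outcome of the most recent dump.
func (m *domainManagerSketch) setLastDump(info lastDumpInfo) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastDump = &info
}

// lastDumpFor returns the recorded outcome if it matches the requested file,
// so a repeated call for the same file reports completion or failure.
func (m *domainManagerSketch) lastDumpFor(targetFile string) (*lastDumpInfo, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.lastDump != nil && m.lastDump.targetFile == targetFile {
		return m.lastDump, true
	}
	return nil, false
}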

// dump to the given pvc
// +nullable
// +optional
MemoryDumpRequest *VirtualMachineMemoryDumpRequest `json:"memoryDumpRequest,omitempty" optional:"true"`
Member:

What happens if a dump is requested while one is in progress? Should this be a list like VolumeRequests?

Contributor Author:

As long as the previous request hasn't completed, a new request is denied; there isn't much point in running one memory dump right after another.
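
A minimal sketch of that check (types and phase names are illustrative): a new request is rejected unless the existing one has reached a terminal phase.

package sketch

import "fmt"

type memoryDumpRequest struct {
	ClaimName string
	Phase     string // illustrative values: "InProgress", "Completed", "Failed"
}

// validateNewDumpRequest denies a new request while a previous one is still running.
func validateNewDumpRequest(existing *memoryDumpRequest) error {
	if existing == nil {
		return nil // no previous request, allow
	}
	switch existing.Phase {
	case "Completed", "Failed":
		return nil
	default:
		return fmt.Errorf("memory dump to PVC %s has not completed yet; request denied", existing.ClaimName)
	}
}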

// DumpTimestamp is the time when the memory dump occurred
DumpTimestamp *metav1.Time `json:"dumpTimestamp,omitempty"`
// VolumeName is the name of the volume the memory was dumped to
VolumeName string `json:"volumeName,omitempty"`
Member:

Should this be claimName?

Contributor Author:

sure

@ShellyKa13 ShellyKa13 changed the title [WIP] Memory dump Memory dump Apr 6, 2022
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2022
@ShellyKa13 (Contributor Author):

@davidvossel @mhenriks I updated the PR. The code now works after some manual testing. I'd appreciate another review.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2022
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2022
@ShellyKa13 (Contributor Author):

/retest


message MemoryDumpResponse {
Response response = 1;
bool completed = 2;
Contributor:

why not have the more flexible string memoryDumpResponse = 2; (like all other responses) instead of a bool?

Contributor Author:

I don't understand the question. I'm using the Response like all the other Responses, but adding a bool to indicate whether the memory dump completed. The Response already has a bool for success and a string message, but success is not the same as completed: the dump request can succeed because it was triggered successfully, and only after that will the command report completed once the async request has finished.

Contributor:

All I said is that bool is limited (you will rarely see it here, or in responses in general) and you may later need a response which is more informative than false.

Contributor Author:

I see. Completion is pretty much a yes-or-no question, and I do update the message in the Response when there was an error, which may be why completion is false, so I think that covers it.
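
To illustrate the distinction being made here (a sketch; the field names follow the proto excerpt above, everything else is illustrative): Response.Success means the dump was triggered without error, Completed means the asynchronous dump has actually finished, and Response.Message carries the details when it has not.

package sketch

import "fmt"

type response struct {
	Success bool
	Message string
}

type memoryDumpResponse struct {
	Response  response
	Completed bool
}

// interpret shows how a caller would distinguish "triggered" from "finished".
func interpret(r memoryDumpResponse) error {
	if !r.Response.Success {
		return fmt.Errorf("memory dump request failed: %s", r.Response.Message)
	}
	if !r.Completed {
		// Triggered successfully, but the async dump has not finished yet;
		// Message may explain why completion is still false.
		fmt.Println("memory dump not completed yet:", r.Response.Message)
		return nil
	}
	fmt.Println("memory dump completed")
	return nil
}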

return "", err
}
}
return directoryPath, err
Contributor:

why err and not nil?

Contributor Author:

This function is pretty much a copy of GetFileSystemDiskTargetPathFromHostView with a small change :) I can beautify it

return "", err
}
if !exists && create {
err = os.Mkdir(directoryPath, 0750)
Contributor:

if err := os.Mkdir(directoryPath, 0750); err != nil { ... } with a locally scoped err is the preferable style for such checks.
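
A small sketch of the helper written in that style (the function name and the exists check are illustrative): use a locally scoped err for the Mkdir call and return nil explicitly on success.

package sketch

import (
	"errors"
	"os"
)

// ensureDumpDir creates the target directory if it does not exist yet,
// using the locally scoped err form suggested above, and returns nil
// (rather than a stale err) on success.
func ensureDumpDir(directoryPath string) (string, error) {
	if _, err := os.Stat(directoryPath); err != nil {
		if !errors.Is(err, os.ErrNotExist) {
			return "", err
		}
		if err := os.Mkdir(directoryPath, 0750); err != nil {
			return "", err
		}
	}
	return directoryPath, nil
}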

// Verify the content is still on the pvc
verifyMemoryDumpOutput(memoryDumpPVC, previousOutput, true)
},
Entry("calling endpoint directly", memoryDumpVMSubresource, removeMemoryDumpVMSubresource),
Contributor:

Entry("[test_id:8499]calling endpoint directly"

verifyMemoryDumpOutput(memoryDumpPVC, previousOutput, true)
},
Entry("calling endpoint directly", memoryDumpVMSubresource, removeMemoryDumpVMSubresource),
Entry("using virtctl", memoryDumpVirtctl, removeMemoryDumpVirtctl),
Contributor:

Entry("[test_id:8500]using virtctl"

verifyMemoryDumpOutput(memoryDumpPVC, previousOutput, true)
})

It("Run memory dump to a pvc, remove and run memory dump to different pvc", func() {
@duyanyan (Contributor), Apr 13, 2022:

[test_id:8503]

Entry("using virtctl", memoryDumpVirtctl, removeMemoryDumpVirtctl),
)

It("Run multiple memory dumps", func() {
Contributor:

[test_id:8502]

verifyMemoryDumpOutput(memoryDumpPVC2, previousOutput, true)
})

It("Run memory dump, stop vm and remove memory dump", func() {
Contributor:

[test_id:8506]

verifyMemoryDumpOutput(memoryDumpPVC, previousOutput, true)
})

It("Run memory dump, stop vm start vm", func() {
Contributor:

[test_id:8515]

@mhenriks (Member) left a comment:

Played around with this for a bit and it seems to be working very well! Great job!

I have a couple of simple questions. Mostly I'm wondering whether we want to annotate/label the target PVC at all; it's possible for it to outlive the VM.

// PersistentVolumeClaimVolumeSource represents a reference to a PersistentVolumeClaim in the same namespace.
// Directly attached to the virt launcher
// +optional
PersistentVolumeClaim *PersistentVolumeClaimVolumeSource `json:"persistentVolumeClaim,omitempty"`
Member:

I wonder if this struct would be better, since I doubt there will be any memory dump targets other than a PVC:

type MemoryDumpVolumeSource struct {
    PersistentVolumeClaimVolumeSource `json:",inline"`
}

Contributor Author:

Yeah, that might be a good suggestion.
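
For illustration, this is roughly how the wrapper type would slot into the volume source union (a sketch with simplified stand-in types, not KubeVirt's actual definitions):

package sketch

// Simplified stand-in for the Kubernetes PVC volume source referenced above.
type persistentVolumeClaimVolumeSource struct {
	ClaimName string
	ReadOnly  bool
}

// memoryDumpVolumeSource wraps the PVC source, leaving room for
// dump-specific fields later without changing the volume union.
type memoryDumpVolumeSource struct {
	persistentVolumeClaimVolumeSource
}

// volumeSource sketches how both variants would sit side by side.
type volumeSource struct {
	PersistentVolumeClaim *persistentVolumeClaimVolumeSource
	MemoryDump            *memoryDumpVolumeSource
}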

@@ -228,6 +228,23 @@ func (v *vm) Migrate(name string, migrateOptions *v1.MigrateOptions) error {
return v.restClient.Put().RequestURI(uri).Body(optsJson).Do(context.Background()).Error()
}

func (v *vm) MemoryDump(name string, memoryDumpRequest *v1.VirtualMachineMemoryDumpRequest) error {
Member:

why is this on VM and not VMI?

Contributor Author:

Hmm, it's true that the memory dump is basically an action on the VMI, and there is a check that the VMI exists and is running. But the action starts by updating the VM, since eventually the memory dump info will live only on the VM and not on the VMI (the output outlives the VMI). Also, the counter-action of memory dump, remove-memory-dump, acts solely on the VM and can happen even if the VMI doesn't exist.
Do you think the subresource should be on the VMI, with the function it calls updating the VM status rather than the VMI?

Member:

Maybe both VM and VMI should have memorydump endpoints?

Contributor Author:

Both of which would do the same thing, i.e. update the VM status with the memory dump request?

Contributor Author:

If you mean doing a memory dump without the whole part of updating the VM, I guess it's possible, but then the volume would still be mounted to the VMI. In any case, maybe I can add it later as an improvement, not in this PR.
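
For context, the new client method presumably mirrors the Migrate method shown in the diff above; a hedged sketch of that pattern (the vmSubresourceURL constant, the API version constant, and the "memorydump" subresource path are assumptions, not verified against the PR):

func (v *vm) MemoryDump(name string, memoryDumpRequest *v1.VirtualMachineMemoryDumpRequest) error {
	// Same PUT-to-subresource pattern as Migrate above; only the body
	// and the subresource name differ.
	body, err := json.Marshal(memoryDumpRequest)
	if err != nil {
		return err
	}
	uri := fmt.Sprintf(vmSubresourceURL, v1.ApiStorageVersion, v.namespace, name, "memorydump")
	return v.restClient.Put().RequestURI(uri).Body(body).Do(context.Background()).Error()
}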

// ClaimName is the name of the pvc the memory was dumped to
ClaimName string `json:"claimName,omitempty"`
// TargetFileName is the name of the memory dump output
TargetFileName string `json:"targetFileName,omitempty"`
Member:

I wonder if it makes sense to somehow also encode some/all of this information as annotations on the target pvc? It doesn't appear that the pvc is annotated at all now.

Contributor Author:

That is an option. If it's alright with you, I'll add it in a different PR so this PR can be merged soon :)

@akalenyu (Contributor), Apr 14, 2022:

Yeah, #7517 (comment) sounds nice. I recently added labels to restore PVCs in #7533 so we could have some queryable metrics about them.
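
A sketch of what that follow-up could look like (the annotation keys and helper are illustrative, since the PVC labeling/annotating was deferred to a separate PR):

package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// annotateDumpPVC records which VM the dump came from and where the output
// file lives, so the PVC stays self-describing even after the VM is deleted.
func annotateDumpPVC(ctx context.Context, client kubernetes.Interface, namespace, claimName, vmName, targetFileName string) error {
	pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if pvc.Annotations == nil {
		pvc.Annotations = map[string]string{}
	}
	// Illustrative annotation keys, not ones KubeVirt actually uses.
	pvc.Annotations["kubevirt.io/memory-dump-source-vm"] = vmName
	pvc.Annotations["kubevirt.io/memory-dump-file"] = targetFileName
	_, err = client.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}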

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2022
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2022
@ShellyKa13 (Contributor Author):

/retest

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2022
The command will be called from virt-handler and will trigger a core dump in libvirt with the memory-only flag. There will always be only one memory dump command at a time. Once the dump is complete it will save the information of the last dump, so a subsequent call for the same file will indicate the completion or error of the last dump.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
The command calls a subresource endpoint. The VM status is then patched with the memory dump request.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
The PVC of the memory dump should be mounted as a directory so the memory can be dumped into it; it doesn't have a disk to mount.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
This volume source doesn't have a matching disk; that needs to be taken into account in the checks and verifications.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
The VM status is updated with a memory dump request in the binding phase. That triggers an update of the VM and VMI volumes with the memory dump volume source, and the phase changes to in progress. During that time the memory dump volume is mounted and the dump is performed. Once it is done or has failed, the volume status in the VMI is updated with a timestamp or an error, and the state moves to unmounting, which updates the VMI volume list to no longer contain the memory dump volume source and so unmounts the PVC. Once it is removed from the VMI volume status list, the phase is updated to completed.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
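
To make the flow described in this commit easier to follow, here is a compact sketch of the phase progression (the phase names are taken from the commit text as plain strings; the real code would use typed constants):

package sketch

// nextPhase sketches the progression described above:
// binding -> in progress (volume added, dump running) ->
// unmounting (dump done or failed, volume being removed) -> completed.
func nextPhase(current string, dumpFinished, volumeRemoved bool) string {
	switch current {
	case "binding":
		return "in progress"
	case "in progress":
		if dumpFinished {
			return "unmounting"
		}
	case "unmounting":
		if volumeRemoved {
			return "completed"
		}
	}
	return current
}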
The VMI gets an add-volume update; adjustments were added to identify the memory dump volume source as a hotplug. The volume is added to the volume status list the same way a hotplug is. Once it is attached to the virt-launcher, the phase is updated to mounted. Once that phase is set, the memory dump is triggered and the phase is updated to in progress. The virt-handler reconcile loop then keeps checking for memory dump completion; once completed, it updates the phase and the timestamp of the memory dump, which is received by the VM watch and triggers the unmount of the memory dump volume source.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
The command will dissociate the memory dump: it will remove the memory dump from the volumes list and the memory dump request from the VM status.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Add a validation that the PVC size is large enough to contain the memory dump. The size should be the VM memory plus a hard-coded overhead of 100Mi, which should be enough based on some checking and test examples.
Also moved the PVC validation from virtctl to the subresource.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
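
A sketch of the size check this describes, using the Kubernetes resource.Quantity API (the exact overhead constant and error wording are illustrative):

package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// validateDumpPVCSize checks that the PVC can hold the guest memory plus a
// hard-coded 100Mi overhead, as described in the commit message above.
func validateDumpPVCSize(pvcSize, vmMemory resource.Quantity) error {
	required := vmMemory.DeepCopy()
	required.Add(resource.MustParse("100Mi"))
	if pvcSize.Cmp(required) < 0 {
		return fmt.Errorf("PVC size %s is smaller than the required %s (VM memory + 100Mi overhead)",
			pvcSize.String(), required.String())
	}
	return nil
}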
…umeSource

Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
* unite memory-dump and remove-memory-dump into the same command with an action argument "get" or "remove"
* make the claim-name flag optional in case we want to dump to the same PVC that is already associated with the VM
* move the phase assignment from virtctl to the subresource
* make sure a remove can't be triggered while another remove is in progress

Signed-off-by: Shelly Kagan <skagan@redhat.com>
Instead of polling for memory dump completion via the same memory dump command, updates about completion or failure are now reported on the domain.
Adjusted code and unit tests accordingly.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 24, 2022
@mhenriks (Member):

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label May 24, 2022
@davidvossel (Member) left a comment:

/approve

excellent work!

@kubevirt-bot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2022
@kubevirt-commenter-bot:

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@kubevirt-bot kubevirt-bot merged commit 5bf9394 into kubevirt:main May 27, 2022