Allows driverABI to be saveable #9650

Closed. luiscape wants to merge 1 commit.

luiscape (Contributor) commented Nov 7, 2023

Patches the `driverABI` struct so that it is saveable, allowing containers started with `-nvproxy` to be checkpointed. The downside of this patch is:

"[...] this would imply that the container must be restored on a host with the same nvidia driver version." (comment)

Closes: #9649

cc @ayushr2

This allows containers started with `-nvproxy` to be checkpointed, essentially ignoring the state of the `driverABI`. The downside of this path is the driver-version constraint quoted above.
Comment on lines +125 to +128
frontendIoctl   map[uint32]frontendIoctlHandler   `state:"nosave"`
uvmIoctl        map[uint32]uvmIoctlHandler        `state:"nosave"`
controlCmd      map[uint32]controlCmdHandler      `state:"nosave"`
allocationClass map[uint32]allocationClassHandler `state:"nosave"`
Collaborator:

Hmm, can we drop the `nosave` annotations? Otherwise these maps will be nil after restore. Are the annotations required? Function pointers should be savable.
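
To make the concern about nil maps concrete, here is a minimal, self-contained Go sketch. It does not use gVisor's state package: `toyABI`, `handler`, and `toyRestore` are hypothetical names, and the reflection-based copy only imitates the visible effect of a `state:"nosave"` tag (the field is skipped on save, so it comes back as its zero value). The version string is a placeholder.

```go
package main

import (
	"fmt"
	"reflect"
)

// handler is a stand-in for nvproxy's handler function types.
type handler func(arg uint32) string

// toyABI loosely mirrors the shape of the struct in the diff.
// It is NOT gVisor's driverABI.
type toyABI struct {
	Version  string
	Handlers map[uint32]handler `state:"nosave"`
}

// toyRestore copies every field of src into a fresh value, except fields
// tagged `state:"nosave"`, which keep their zero value. This is only an
// imitation of the tag's effect, not how gVisor saves or restores state.
func toyRestore(src toyABI) toyABI {
	var dst toyABI
	sv := reflect.ValueOf(src)
	dv := reflect.ValueOf(&dst).Elem()
	for i := 0; i < sv.NumField(); i++ {
		if sv.Type().Field(i).Tag.Get("state") == "nosave" {
			continue // skipped: the field stays nil in dst
		}
		dv.Field(i).Set(sv.Field(i))
	}
	return dst
}

func main() {
	orig := toyABI{
		Version:  "example-1.0", // placeholder version string
		Handlers: map[uint32]handler{1: func(uint32) string { return "ok" }},
	}
	restored := toyRestore(orig)
	_, found := restored.Handlers[1] // reading a nil map is safe; it just misses
	fmt.Println(restored.Version, restored.Handlers == nil, found)
}
```

In this toy model the restored value keeps its version string but has no handlers at all, so every lookup misses.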

Contributor Author:

Checkpointing panics if I don't add `nosave` to these function maps (error logs attached). All of the errors are similar:

unknown object (nvproxy.frontendIoctlHandler)(0xc28380): ...
unknown object (nvproxy.controlCmdHandler)(0xc2a420): ...
unknown object (nvproxy.allocationClassHandler)(0xc36780): ...
unknown object (nvproxy.uvmIoctlHandler)(0xc3cdc0): ...

gpu_checkpointing.log

Collaborator:

Hmm, such a restore would be useless, as the restored ABI would be essentially empty (all of its handler maps would be nil).
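
One general way to avoid restoring an empty ABI is to save only a plain-data key (such as the driver version) and rebuild the function tables after restore. The Go sketch below illustrates that idea under stated assumptions: `toyProxy`, `toyABI`, `abiBuilders`, and `afterRestore` are hypothetical names, the version key is a placeholder, and this thread does not say whether #9653 takes this approach.

```go
package main

import (
	"errors"
	"fmt"
)

// handler is a stand-in for nvproxy's handler function types.
type handler func(arg uint32) string

// toyABI holds only function tables, so this sketch treats it as unsaveable.
type toyABI struct {
	handlers map[uint32]handler
}

// abiBuilders maps a driver version string to a constructor for its tables.
// The key and the handler below are made up for illustration.
var abiBuilders = map[string]func() *toyABI{
	"example-1.0": func() *toyABI {
		return &toyABI{handlers: map[uint32]handler{
			1: func(uint32) string { return "handled by example-1.0" },
		}}
	},
}

// toyProxy keeps the version as plain saveable data. abi models a field
// that was not saved and is therefore nil right after restore.
type toyProxy struct {
	version string
	abi     *toyABI
}

// afterRestore rebuilds abi from the saved version. It fails if no tables
// are registered for that version.
func (p *toyProxy) afterRestore() error {
	build, ok := abiBuilders[p.version]
	if !ok {
		return errors.New("no ABI registered for driver version " + p.version)
	}
	p.abi = build()
	return nil
}

func main() {
	p := &toyProxy{version: "example-1.0"} // state as it looks after restore
	if err := p.afterRestore(); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	fmt.Println(p.abi.handlers[1](0))
}
```

In this sketch a restore with an unregistered version fails explicitly rather than continuing with nil tables, which is one way to surface a driver-version mismatch at restore time.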

ayushr2 (Collaborator) commented Nov 7, 2023

Could you see if #9653 fixes your issue?

luiscape (Contributor Author) commented Nov 7, 2023

@ayushr2 #9653 fixes my issue. I'm able to checkpoint and restore without errors.

luiscape closed this Nov 7, 2023

Successfully merging this pull request may close these issues.

Unable to checkpoint container with -nvproxy after the introduction of driverABI