Unable to checkpoint container with -nvproxy after the introduction of driverABI #9649

Closed
luiscape opened this issue Nov 7, 2023 · 3 comments
Labels
area: gpu Issue related to sandboxed GPU access type: bug Something isn't working

Comments

Contributor

luiscape commented Nov 7, 2023

Description

Similar to #9363, the driverABI struct doesn't implement SaverLoader.

I applied a patch similar to the one in #9385 and can now checkpoint containers with -nvproxy successfully (still testing restore; patch below). I'm happy to submit a PR, but I'm wondering whether this makes sense and what the implications of not saving this state are.

The patch would be made here.

// +stateify savable
type driverABI struct {
	frontendIoctl   map[uint32]frontendIoctlHandler   `state:"nosave"`
	uvmIoctl        map[uint32]uvmIoctlHandler        `state:"nosave"`
	controlCmd      map[uint32]controlCmdHandler      `state:"nosave"`
	allocationClass map[uint32]allocationClassHandler `state:"nosave"`

	useRmAllocParamsV535 bool
}

Does this make sense?
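
To illustrate what skipping the handler tables means for restore, here is a minimal standalone sketch. It does not use gVisor's stateify; it uses the standard library's encoding/gob, which ignores unexported fields, as a loose stand-in for `state:"nosave"`. The names `abiSketch`, `handler`, `roundTrip`, and `rebuild` are hypothetical and not part of nvproxy. The point is only that function-valued tables come back nil after a round trip and have to be repopulated by code, while the plain flag survives.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// handler is a hypothetical stand-in for nvproxy's ioctl handler types.
type handler func(arg uint32) string

// abiSketch loosely mirrors the shape of driverABI: a table of function
// values plus one plain flag.
type abiSketch struct {
	// Unexported, so gob skips it. This only approximates `state:"nosave"`.
	frontendIoctl map[uint32]handler

	// Exported, so gob serializes it.
	UseRmAllocParamsV535 bool
}

// rebuild repopulates the handler table. The entry is made up for the example.
func (a *abiSketch) rebuild() {
	a.frontendIoctl = map[uint32]handler{
		0x2a: func(arg uint32) string { return fmt.Sprintf("frontend ioctl %#x", arg) },
	}
}

// roundTrip encodes and decodes the struct, standing in for checkpoint/restore.
func roundTrip(in abiSketch) (abiSketch, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return abiSketch{}, err
	}
	var out abiSketch
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	orig := abiSketch{UseRmAllocParamsV535: true}
	orig.rebuild()

	restored, err := roundTrip(orig)
	if err != nil {
		panic(err)
	}
	// The table was not saved, so it is nil until rebuilt; the flag survived.
	fmt.Println(restored.frontendIoctl == nil, restored.UseRmAllocParamsV535)

	restored.rebuild()
	fmt.Println(restored.frontendIoctl[0x2a](1))
}
```

In this sketch the tables are cheap to rebuild because they are fixed by code rather than by runtime state, which is the assumption behind not saving them.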

@luiscape luiscape added the type: bug Something isn't working label Nov 7, 2023
Collaborator

ayushr2 commented Nov 7, 2023

This makes sense. The driver ABI should be savable. Happy to review your PR.

Although this would imply that the container must be restored on a host with the same nvidia driver version. If the driver version can change, then the ABI would need to be rebuilt (which requires extra work).

@ayushr2 ayushr2 added the area: gpu Issue related to sandboxed GPU access label Nov 7, 2023
Contributor Author

luiscape commented Nov 7, 2023

Although this would imply that the container must be restored on a host with the same nvidia driver version.

Gotcha. This is true in our case (for the most part :) ).

luiscape added a commit to luiscape/gvisor that referenced this issue Nov 7, 2023
This allows containers started with `-nvproxy` to be checkpointed,
essentially ignoring the state of the `driverABI`. The downside of
this path is

"[...] this would imply that the container must be restored on a host with the
same nvidia driver version."

Closes: google#9649 (comment)

cc @ayushr2
Contributor Author

luiscape commented Nov 7, 2023

Submitted the patch here. Thanks a lot for the review.

copybara-service bot pushed a commit that referenced this issue Dec 12, 2023
We do not restore GPU state on restore. Any nvproxy state is also not restored.
So nvproxy.objsLive should not be saved.
After this change, we release all live objects on checkpoint.

Updates #9363, #9649, #9767

PiperOrigin-RevId: 590271331
copybara-service bot pushed a commit that referenced this issue Dec 12, 2023
Note that GPU state is not restored. This tests that the sandbox is restored
and the GPUs are accessible and functional after restore.

Updates #9363, #9649, #9767

PiperOrigin-RevId: 590256769
copybara-service bot pushed a commit that referenced this issue Dec 13, 2023
We do not restore GPU state on restore. Any nvproxy state is also not restored.
So nvproxy.objsLive should not be saved.
After this change, we release all live objects on checkpoint.

Updates #9363, #9649, #9767

PiperOrigin-RevId: 590661915
copybara-service bot pushed a commit that referenced this issue Dec 13, 2023
Note that GPU state is not restored. This tests that the sandbox is restored
and the GPUs are accessible and functional after restore.

Updates #9363, #9649, #9767

PiperOrigin-RevId: 590679227