Restoring a checkpointed container into an existing namespace is not possible #1786

Closed
adrianreber opened this issue Apr 18, 2018 · 4 comments

Comments

@adrianreber
Contributor

I am currently trying to fix (or rather, implement) the missing functionality to restore a container into an existing network namespace. I already had a short discussion with @avagin on IRC but wanted to use this place to hopefully arrive at a correct solution.

I am starting a container and I want it to join an existing network namespace:

	{
		"type": "network",
		"path": "/run/netns/test"
	},

This works just like it should. The container is running and uses the network namespace 'test' as specified above. My next step is to checkpoint the container, which works, and then to restore the container, which seems to work. Upon closer inspection I see that the restored container is running in a network namespace, but not in the one I specified ('test'). Instead it runs in a network namespace created by CRIU during restore. I would probably not have detected the problem if it had been a PID namespace, but as I have set up a veth pair with an IP address in the 'test' network namespace, it became clear that the CRIU-restored container is running in another network namespace.

With older versions of CRIU the veths were restored correctly, but the latest CRIU version (from the criu-dev branch) seems to have a problem. That, however, is not the problem I am trying to understand right now.

To me it seems wrong that runc is told to use the network namespace 'test', and I configured it correctly, yet CRIU simply uses another namespace.

Luckily CRIU is prepared for cases like this. It has the possibility to join an existing namespace:

  -J|--join-ns NS:{PID|NS_FILE}[,OPTIONS]
			Join existing namespace and restore process in it.
			Namespace can be specified as either pid or file path.
			OPTIONS can be used to specify parameters for userns:
			    user:PID,UID,GID

This is also exported via CRIU's RPC interface so it can easily be used with the following patch:

@@ -1161,6 +1162,17 @@ func (c *linuxContainer) Restore(process *Process, criuOpts *CriuOpts) error {
                },
        }
 
+       nsPath := c.config.Namespaces.PathOf(configs.NEWNET)
+       if nsPath != "" {
+               // A network namespace path has been set. Tell CRIU to restore
+               // the processes into that namespace. This only works if the
+               // processes have been dumped *without* a network namespace.
+               join := new(criurpc.JoinNamespace)
+               join.Ns = proto.String("net")
+               join.NsFile = proto.String(c.config.Namespaces.PathOf(configs.NEWNET))
+               req.Opts.JoinNs = append(req.Opts.JoinNs, join)
+       }
+
        for _, m := range c.config.Mounts {
                switch m.Device {
                case "bind":

The problem with this approach is that it only works if I do the checkpoint from within the network namespace (nsenter -n -t <PID> runc checkpoint). So during checkpointing it is important that CRIU does not record any information about the network namespace; otherwise CRIU will create a new network namespace on restore even with the join option.

I tried a few things already, but I would like to agree on the right way to do it.

I had a look at LXC, and it seems LXC does not touch namespaces at all during checkpoint and restore and leaves it all to CRIU. This does not seem to be an option for runc, as it offers the possibility to specify via 'path' which network namespace to use.

Right now I am seeing two approaches:

  1. Do the checkpoint with CRIU after runc joined all the existing namespaces. My current test is based on the 'exec' functionality. I have to ignore the mount namespace to be able to use the CRIU from the host, but this does not yet work. I am not even sure this can work at all. Right now the problem is that runc passes the file descriptor of the checkpoint destination directory to CRIU but CRIU cannot access the necessary information via /proc when running inside runc's namespaces. Not sure yet where the problem is as I am not joining the PID and mount namespace.
  2. Tell CRIU to ignore namespaces during checkpoint or restore (or both).

Right now I think option 2 is the right one, but I wanted to get feedback from @avagin before continuing.

I also think this is probably not only a network namespace problem but a problem for all namespaces, as the ones specified in the configuration file are not joined after restore.

@avagin
Contributor

avagin commented Apr 23, 2018

We have an engine which says which resources are external and don't need to be dumped.
https://criu.org/External_resources

I think we can add a new type of external resources, which is called ns.
on dump: --external ns[inode_id]:NAME
on restore: --external NAME:/proc/pid/ns/xxx

@adrianreber
Contributor Author

@avagin Your proposal sounds good. I will work on adding that to CRIU and runc.

It is not totally clear to me yet how it should work during dump. I would expect that CRIU either dumps the process with the existing namespace information or ignores the namespace information. So something like:

--external 'ns[net]:none' during dump to ignore it.

During restore we already support this via --join-ns in CRIU and that seems to work.

Or we could combine the functionality of --join-ns and --empty-ns all into --external.

From my point of view criu restore --external ns[net]:/run/netns/namespace_name would be the same as criu restore --join-ns net:/run/netns/namespace_name

The option --empty-ns could be --external ns[pid]:none (or --external ns[pid]:empty).

And this discussion does not really belong in the runc bug tracker... But as I opened it here I am continuing it here as it was triggered by runc and also needs runc integration.

@avagin
Contributor

avagin commented Apr 24, 2018

--join-ns means something slightly different: ALL tasks are restored in the specified ns. An external ns is a namespace which should be neither dumped nor restored, and we can mark more than one namespace as external. For example, task A lives in netns 1 and task B lives in netns 2. On dump and restore, we can set both namespaces as external.
criu dump --external ns[netns1_ino]:netns1 --external ns[netns2_ino]:netns2
criu restore --external ns[netns1]://run/netns/netns1 --external ns[netns2]://run/netns/netns2

@adrianreber
Contributor Author

So, in your example, 'netns1' is just a CRIU-internal label which we use to identify that namespace during restore. Yes, that sounds like it should work.

adrianreber added a commit to adrianreber/runc that referenced this issue Jul 24, 2018
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.

If the network namespace is defined like

	{
		"type": "network",
		"path": "/run/netns/test"
	}

there is the expectation that the restored container is again running in
the network namespace specified with 'path'.

This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.

This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.

With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.

Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.

runc still falls back to the old behavior if CRIU older than 3.11 is
installed.

Fixes opencontainers#1786

Related to containers/podman#469

Signed-off-by: Adrian Reber <areber@redhat.com>
adrianreber added a commit to adrianreber/runc that referenced this issue Aug 14, 2018
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.

If the network namespace is defined like

	{
		"type": "network",
		"path": "/run/netns/test"
	}

there is the expectation that the restored container is again running in
the network namespace specified with 'path'.

This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.

This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.

With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.

Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.

runc still falls back to the old behavior if CRIU older than 3.11 is
installed.

Fixes opencontainers#1786

Related to containers/podman#469

Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!

Signed-off-by: Adrian Reber <areber@redhat.com>