Proposal: container introspection #8427

Open
tobert opened this Issue Oct 6, 2014 · 63 comments


tobert commented Oct 6, 2014

I am working on building Docker images for Apache Cassandra. I have a working PoC application wrapper that can automatically configure Cassandra to run inside Docker. The first thing I found missing with this app was a way to introspect the allowed amount of memory from inside the container so I can set the JVM heap up correctly.
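The heap-sizing step the wrapper performs can be sketched like this (a minimal illustration only; the `jvm_heap_flags` helper and the 50% fraction are my own assumptions, not Cassandra's actual tuning logic):

```python
def jvm_heap_flags(limit_bytes, fraction=0.5):
    """Derive -Xms/-Xmx from a cgroup memory limit.

    Hypothetical helper: gives the JVM a fixed fraction of the
    container's limit, with a small floor so tiny limits still boot.
    """
    heap_mb = max(64, int(limit_bytes * fraction) // (1024 * 1024))
    return ["-Xms%dm" % heap_mb, "-Xmx%dm" % heap_mb]

# e.g. with `docker run -m 2048M`, the limit reads as 2147483648 bytes:
flags = jvm_heap_flags(2 * 1024 ** 3)  # -> ["-Xms1024m", "-Xmx1024m"]
```

The hard part is the first argument: there is currently no reliable way to read that limit from inside the container.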

When I went searching for available solutions on the web, I found a few mailing list threads and related issues, but nothing stands out as a clear winner so here we are.

Related links:

Related issues: #7472 #7255 #1270 #3778

Requirements

  • read-only from the container (security, simplicity)
  • values that change on the host are reflected in the container's view (dynamic)
  • cannot break existing containers
  • should be universal; public images can rely on it

Approaches

Each of these approaches has a different set of tradeoffs. I'm willing to write the code for whichever the core team decides is best. My preference would be for something that can be provided as a basic service to every container managed by Docker. My goal is to be able to create containers that rely on this information and have them work without change in as many Docker environments as possible. For example, I hope my Cassandra containers can work equally well in a full-on CoreOS setup as they do in a simple boot2docker instance.

Environment Variables

This is easily the least complex solution. Before starting a container, Docker would add some DOCKER_ environment variables to the container's first process. These could propagate or be blocked by the init process at the user's choice.

The big downside is that it is impossible to change environment variables after a process starts. I also don't know of a way to ensure they don't get munged by processes inside the containers, so they aren't really read-only. That said, they also don't provide a vector into the host OS so maybe that's OK.

IPC key/value

IPC namespaces are already in place, so it should be fairly easy to provide a SysV shared-memory segment with key/value pairs in it. This could be updated from the host at any time and would be fast.

Since the interface into the container would be shared memory, that means tools would need to be available and then everything gets even more complicated. Moving on ...

See shm_overview(7).

REST over HTTP

The inspiration for this comes from EC2's metadata API. Every EC2 VM can route to a link-local address of 169.254.169.254 that runs a REST API. This is what tools like ohai and facter use to learn about EC2 VMs. The link-local address can be hard-coded in scripts and has good support in every programming language.

I have a working PoC for the REST approach that simply exposes memory.memory_limit_in_bytes as GET /memory/memory.memory_limit_in_bytes. A more complete implementation would follow REST best practices and might choose to expose fewer Linux-specific semantics.
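A hedged sketch of the route-to-cgroup mapping such a PoC implies (the mount point, path validation, and function name are my assumptions, not the actual PoC code):

```python
CGROUP_ROOT = "/sys/fs/cgroup"  # assumed v1 mount point

def route_to_cgroup_file(route):
    """Map a 'GET /<subsystem>/<key>' route to a host-side cgroup file.

    Sketch only: a real server would also resolve the requesting
    container's cgroup path and restrict which keys are visible.
    """
    subsystem, _, key = route.strip("/").partition("/")
    if not subsystem or not key or ".." in route:
        raise ValueError("bad route: %r" % route)
    return "%s/%s/%s" % (CGROUP_ROOT, subsystem, key)

# route_to_cgroup_file("/memory/memory.limit_in_bytes")
#   -> "/sys/fs/cgroup/memory/memory.limit_in_bytes"
```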

dummy / link-local

My PoC currently works with a dummy interface + link-local IP per-container (with a PR for libcontainer to enable dummy network strategy). Docker could inject this interface into every container except those that use --net host or share with another container. If those features are important, this is a no-go on being universal.

link-local alias on docker0

The service could also be run bound to the bridge on a link-local address. This works with very little configuration in my setup, but would have to be very careful in how it validates packet origin in order to avoid leaking container data across containers. It also falls apart when people use non-standard bridge setups, which is a deal-breaker.

AF_UNIX

I avoided AF_UNIX at first because of how many times I've had to fix daemons whose socket was unlinked by accident. That said, AF_UNIX is probably the best option since it's easy to verify exactly which container made a request. Perhaps setting chattr +i on the socket file will be good enough to prevent the common problems. Some HTTP clients don't support AF_UNIX, but support for it is not uncommon either. As long as curl works, I think most users will be happy with this.
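To illustrate that HTTP over AF_UNIX needs nothing exotic, here is a self-contained sketch using only Python's standard library (the /memory/limit_in_bytes route and its value are made up for the demo; the proposal's socket path would be something like /dev/container.sock):

```python
import http.client
import http.server
import os
import socket
import socketserver
import tempfile
import threading

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP client that connects over an AF_UNIX socket instead of TCP."""
    def __init__(self, path):
        super().__init__("localhost")
        self._path = path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._path)

class MetadataHandler(http.server.BaseHTTPRequestHandler):
    # Toy metadata endpoint: one plain-text value per route (demo data).
    VALUES = {"/memory/limit_in_bytes": b"2147483648"}

    def do_GET(self):
        body = self.VALUES.get(self.path)
        self.send_response(200 if body else 404)
        self.send_header("Content-Length", str(len(body or b"")))
        self.end_headers()
        if body:
            self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

class UnixHTTPServer(socketserver.UnixStreamServer):
    def get_request(self):
        # accept() on AF_UNIX returns no peer address; fake one so the
        # HTTP handler machinery stays happy.
        request, _ = super().get_request()
        return request, ("localhost", 0)

sock_path = os.path.join(tempfile.mkdtemp(), "container.sock")
server = UnixHTTPServer(sock_path, MetadataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = UnixHTTPConnection(sock_path)
conn.request("GET", "/memory/limit_in_bytes")
limit = conn.getresponse().read()
server.shutdown()
```

The equivalent shell-side consumer would be `curl --unix-socket /dev/container.sock http://localhost/memory/limit_in_bytes`.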

related: https://github.com/cpuguy83/lestrade

Filesystem

Since Linux applications typically use /proc and /sys for system introspection, this is the most natural choice, but it is also fairly complex to implement.

The two big options for filesystems are FUSE and bind mounts. libvirt-lxc provides a FUSE interface for /proc/meminfo that seems to work out OK, but many are not comfortable with the size and complexity of the FUSE API. FUSE can do the job; the question is whether it's OK to have this as a requirement in every container. Since Docker already relies on FUSE, maybe this isn't an issue?

A read-only bind mount into the container can provide the same information. Assuming that RO bind mounts are safe enough security-wise, Docker could write out and maintain all the relevant metadata on the host side then bind mount each container's tree into that container read-only. Care would have to be taken to do transactional (write + link) updates to metadata files. Docker would also end up maintaining the filesystem tree on disk which would be tedious. It is fairly easy to test though.
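The transactional "write + link" update mentioned above is the classic atomic-replace pattern; a sketch (the metadata path in the example is hypothetical):

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` atomically: write a temp file in the same
    directory, fsync it, then rename() over the target. rename(2)
    is atomic within a filesystem, so readers in the container see
    either the old or the new contents, never a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

# e.g. atomic_write("<metadata-dir>/memory.limit_in_bytes", b"2147483648\n")
```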

My worry is that providing a POSIX-like filesystem will mean that POSIX semantics need to be preserved. Making the fs read-only does remove a lot of the nastier problems though.

Not emulating /proc would remove a lot of compatibility issues, at the cost of giving up the ability to run existing scripts unchanged.

IMNSHO

Personally, I'm leaning towards AF_UNIX + REST. The main reason is that it has the least complexity overall while providing well-known semantics to users. It can be accessed using readily available tools and libraries. It doesn't have to be HTTP+JSON. I like those because of tool availability. A memcached-like protocol would also be fine since it can still be accessed with tools like busybox nc, but for now I'd like protocol design to be something to decide after the big decisions are made.

Edits:

  • add link to Fabio Kung's blog post
  • s/tmpfs/bind mounts/ since it could be any fs
  • added some links to related issues from comments
  • add link to Digital Ocean's new metadata API

erikh commented Oct 6, 2014

I think HTTP+JSON over a reserved namespace (like EC2) is the way to roll. Docker users could easily write small tools (as they have to now for processing things like links) to process the JSON data. This provides a lot of flexibility in data structures and keeps the API backend consistent, presuming we can mark parts of the API read-only and other parts not.


lalyos commented Oct 7, 2014

The allowed amount of memory was probably the first configuration value you missed, but I bet it won't be the last.

So as a more general solution I would use etcd or consul to store configuration. Both have very straightforward HTTP REST APIs.

But if you need file based configuration use confd:

Manage local application configuration files using templates and data from etcd or consul


tobert commented Oct 7, 2014

@lalyos this is purely about container instance specific information and should not include things about other containers or applications.

Put another way, not every Docker user is going to run etcd or consul, but every Docker instance running the JVM should be able to find out how much memory is available in order to set heap bounds correctly.


jeremyeder commented Oct 7, 2014

We're tracking this in RHBZ1111316 (it's private, sorry, no idea why). But we've been thinking about it too, and I've copy/pasted my feedback below. I should say that after I posted this, @rhatdan mentioned that he agreed with http://fabiokung.com/2014/03/13/memory-inside-linux-containers/ that the kernel should handle it, and says that /proc should be made namespace aware.

I agree on that. My suggestions (below) were on a much shorter timeline.

"""
Probably re-title this BZ as "Container self-introspection", and I'll float some possible solutions:

  • cAdvisor (or cockpit?) provides an API for resource management. This brings in a dependency on either of those to solve container self-introspection.
  • Making nsinit (or similar) work inside a container. nsinit provides what everyone wants. We just can't get at it from inside the container at the moment.

We have a bit of chicken/egg problem in that nsinit reads an instance-specific container.json file that only gets created after the container starts.

  • Can't use/recommend -v /sys/fs/cgroup because it exposes all cgroup stuff, same as /proc/meminfo.
"""

So, I think HTTP+JSON could expose the container's own container.json (what nsinit spits out). Then again, this puts another network interface into each container. If we could get at it via localhost:port, that would be pretty clean.


tobert commented Oct 7, 2014

@jeremyeder would you be opposed to a /dev/container.sock interface instead of localhost? That would remove any chance of port conflict.

The big side-effect of AF_UNIX I like is that the daemon that injects the sockets knows exactly which container a given socket belongs to, instead of having to parse IPs etc. It's also external to the container, so it's unkillable from inside the container, which increases overall reliability.


dysinger commented Oct 7, 2014

I have my fleet units set up to publish the container's 'inspect' output to etcd about 5 seconds after launch. This lets any VM in the cluster look at a container's IP address & other information:

[{
    "Args": [],
    "Config": {
        "AttachStderr": true,
        "AttachStdin": true,
        "AttachStdout": true,
        "Cmd": [
            "/bin/bash"
        ],
        "CpuShares": 0,
        "Cpuset": "",
        "Domainname": "",
        "Entrypoint": null,
        "Env": [],
        "ExposedPorts": {},
        "Hostname": "8c0be1b7d472",
        "Image": "knewton/ubuntu",
        "Memory": 0,
        "MemorySwap": 0,
        "NetworkDisabled": false,
        "OnBuild": null,
        "OpenStdin": true,
        "PortSpecs": null,
        "StdinOnce": true,
        "Tty": true,
        "User": "",
        "Volumes": {},
        "WorkingDir": ""
    },
    "Created": "2014-10-07T00:30:55.55810483Z",
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "HostConfig": {
        "Binds": [
            "/tmp/ubuntu-keyring:/tmp/ubuntu-keyring"
        ],
        "CapAdd": null,
        "CapDrop": null,
        "ContainerIDFile": "",
        "Devices": [],
        "Dns": null,
        "DnsSearch": null,
        "Links": null,
        "LxcConf": [],
        "NetworkMode": "bridge",
        "PortBindings": {},
        "Privileged": false,
        "PublishAllPorts": false,
        "RestartPolicy": {
            "MaximumRetryCount": 0,
            "Name": ""
        },
        "VolumesFrom": null
    },
    "HostnamePath": "/var/lib/docker/containers/8c0be1b7d4726257371dfc30b19e1ada4e0825b0e4a7765e28c670aeb6ccb2cf/hostname",
    "HostsPath": "/var/lib/docker/containers/8c0be1b7d4726257371dfc30b19e1ada4e0825b0e4a7765e28c670aeb6ccb2cf/hosts",
    "Id": "8c0be1b7d4726257371dfc30b19e1ada4e0825b0e4a7765e28c670aeb6ccb2cf",
    "Image": "fb97468afe34a7b5b833dadba36ed0335ecdfb13d801cafb0dce9683b2b1ac5f",
    "MountLabel": "",
    "Name": "/silly_sinoussi",
    "NetworkSettings": {
        "Bridge": "docker0",
        "Gateway": "172.17.42.1",
        "IPAddress": "172.17.0.3",
        "IPPrefixLen": 16,
........

tobert commented Oct 7, 2014

@dysinger that's pretty cool and sounds like a great approach for private clusters.

This will be similar but has some additional constraints. Multi-tenant shops like OpenShift should not expose so much implementation-private information. Single-node setups like developer laptops running boot2docker need to have something available without additional components.

I'm aiming for something that works for standalone, PaaS, and in-house usage. e.g. a developer wants to try Cassandra so they docker run tobert/cassandra. I don't think there should be an extra step if they want to say docker run -m 2048M tobert/cassandra. An entrypoint should be able to find out its allocated memory/cpu/disk/network/etc. without having to bolt something onto Docker.

@tobert tobert changed the title Proposal: container intropsection Proposal: container introspection Oct 7, 2014


phemmer commented Oct 7, 2014

Kinda surprised there was no global introspection issue until now. We have lots of other more specific ones: #7472 #7255 #1270 #3778 and probably others.

As for how to expose the data to the container, I would only consider 2 possibilities (both of which were mentioned in the original description)

  1. 'proc'-like filesystem. This would be extremely easy to consume inside the container. Personally this is my favorite. While yes, docker would have to maintain the data, it already maintains the info which you retrieve via docker inspect. It would just need to map that structure to files, very basic.
  2. http. Similar to the AWS EC2 metadata service. However I should note that the EC2 metadata service is not REST. A simple curl http://169.254.169.254/latest/instance-id returns a simple 1 line text response. It doesn't even have a trailing newline, and there's no JSON to dig through. This makes it very easy to work with in scripts.
As for whether this is made available via a static link-local address (on the host, not per-container; I don't see why you would want this per-container) or a unix domain socket, I think both have advantages and disadvantages.

bdha commented Oct 7, 2014

Personally I think this is a great feature, however it ends up getting implemented. My $0.02:

I would argue for the HTTP endpoint, to mimic EC2, Joyent, RAX, et al, but also because it doesn't lock you into a system with some specific concept of how /proc behaves. Anyone can bring their own tools to the rodeo, and the feature remains agnostic.

I would agree with @phemmer that returning simple strings with no structure makes them easy to utilize.

If JSON is the way to go, you might also consider offering userspace tools to emulate simple strings behavior while serving more complex data structures from the API directly (mdata-get foo ; mdata-get foo.bar.baz).


bgrant0607 commented Oct 7, 2014

Lack of dynamic updates rules out environment variables.

Shared memory is not practical in all environments.

Ideally the mechanism could potentially be made universally available for all Linux containers, not just Docker. Linux conventions, such as the FHS and LSB (e.g., /proc, /dev, /etc, /var, /run), could be extended to cover our needs.

The mechanism should be accessible to simple Linux applications, including C programs and shell scripts, not just modern web programming languages.

Cloud providers do use HTTP APIs, but while they want to use OS-independent approaches, they aren't concerned about coupling VM images to their APIs, nor are they concerned about image size and complexity.

It should be possible to efficiently watch for updates (to at least slowly changing information) without polling, or be notified somehow.

Also, ideally a daemon wouldn't strictly be required (for standalone execution scenarios), and, ideally, we wouldn't break legacy (non-containerized) applications without good cause, since that would just slow down migration to containers.


rhatdan commented Oct 7, 2014

This is not really a container issue; it is more about cgroups in general. I am discussing this with kernel guys now, and they keep pointing out that processes can figure this (memory) out by looking at /proc/self/cgroup and using that to discover that the information is available in /sys/fs/cgroup/memory/user.slice/memory.stat.

I am arguing for something like /proc/self/meminfo.

My problem with the /sys/fs/cgroup solution is that it involves leaking information into the container and requires everyone to mount /sys/fs/cgroup into their containers and chroot. Also, there is no guarantee where /sys/fs/cgroup is going to be mounted.
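The two-step lookup described above can be sketched as follows (the sample content is illustrative; the real result depends on where the host mounted the cgroup hierarchy, which is exactly the problem being pointed out):

```python
def memory_cgroup_dir(proc_self_cgroup, cgroup_root="/sys/fs/cgroup"):
    """Resolve the memory cgroup directory for the current process.

    Each line of /proc/self/cgroup is '<id>:<controllers>:<path>';
    this sketch assumes the v1 hierarchy is mounted at `cgroup_root`.
    """
    for line in proc_self_cgroup.splitlines():
        _, controllers, path = line.split(":", 2)
        if "memory" in controllers.split(","):
            return "%s/memory%s" % (cgroup_root, path)
    return None

sample = "4:memory:/user.slice\n2:cpu,cpuacct:/user.slice"
# memory_cgroup_dir(sample) -> "/sys/fs/cgroup/memory/user.slice"
```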


tobert commented Oct 7, 2014

Excellent points.

@phemmer I'm not sure why I conflated JSON into the EC2 metadata API. I think being able to FOO=$(curl $URL) is very useful. We can always provide something like a /inspect route that gives you all the key/values in a single JSON object.

@bdha I'm leaning towards key/value routes specifically to avoid needing tools to parse it.

@bgrant0607 I've never been a fan of LSB but sometimes it's right. If you get some free time would you like to take a pass at suggesting a URL structure?

Watching values could be implemented in either HTTP or filesystems. For HTTP it could be a long poll or possibly websockets with messages on update. On the filesystem it would mean making sure inotify/fsnotify work correctly.
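Client-side, the long-poll variant amounts to something like this (a sketch; `fetch` stands in for whatever GET the final protocol defines, and a real server-side long poll would block in the request instead of sleeping in the client):

```python
import time

def wait_for_change(fetch, last, timeout=30.0, interval=0.5):
    """Poll `fetch()` until its value differs from `last` or the
    timeout expires; returns the latest value either way."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value != last or time.monotonic() >= deadline:
            return value
        time.sleep(interval)
```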

I would like to avoid needing a daemon - ideally this would be a subsystem of Docker or libcontainer. I'm going to build it as a daemon for now to get things working and keep the code structured such that it can be integrated easily when the core team is ready for it.


tobert commented Oct 7, 2014

@rhatdan I also don't like mounting cgroups into containers. It's too leaky.

The daemon I'm working on reads directly from /sys/fs/cgroups and presents it into the container under a similar path. It's a one-way connection and there's no way for the guest to access random values.


rhatdan commented Oct 7, 2014

But your solution does nothing to help the processes which run in a cgroup and do not use namespaces or other container technology.

In RHEL7 and Fedora, systemd runs httpd within a cgroup. If I set MemLimit=50m in the unit file, then the cgi scripts and apache processes have no way of knowing the limit without figuring out that it is listed in /sys/fs/cgroup/...

I have seen proposals that this information should be a query of systemd, or a library that reads /proc/self/cgroup, the mount table, and finally /sys/fs/cgroup/*/memory.stat.

But for generations this has been done by looking at /proc for all of the information. To me this seems obvious to be a kernel issue, but not sure kernel people will agree.


tobert commented Oct 7, 2014

@rhatdan I don't see any reason why systemd couldn't provide the same API and socket we're talking about.

I agree it would be nice if the kernel made /proc namespace aware, but even if that happens it won't be available widely for a year or two at least, so we still have a problem to solve.

@fabiokung has already done a lot of solid research on the kernel side: http://fabiokung.com/2014/03/13/memory-inside-linux-containers/

I think the scope should be limited to Docker for now. That said, the AF_UNIX API seems like it has the best portability, since it would work equally well on a host without namespaces or even non-Linux hosts (e.g. zones, jails).


fabiokung commented Oct 7, 2014

I'm glad to see this discussion re-starting. Unfortunately I dropped the ball and haven't had a chance to experiment more on it.

I did experiment with a custom "cgroup aware" /proc filesystem though (not FUSE, it's a kernel module + small kernel patch), here: https://github.com/fabiokung/procg/

Unfortunately, it is very unlikely that anything like this will ever be accepted upstream. Kernel developers don't want to make any changes to the already messy /proc filesystem.


bgrant0607 commented Oct 7, 2014

Note that our entire libcontainer team is out this week, and LPC is next week. We should get their input re. cgroup and other kernel interfaces.

/cc @rjnagal @vmarmol @vishh @yinghan @thockin


bgrant0607 commented Oct 7, 2014

/proc/self/cgroup proposal: https://lkml.org/lkml/2014/7/17/584


yinghan commented Oct 7, 2014

Especially feedback from the cgroup maintainer Tejun Heo, who hasn't commented on the patch yet.


tobert commented Oct 7, 2014

The kernel cgroup proposal looks great, but it doesn't solve the problem being discussed here. It won't be available in most shops until it lands in LTS distros. It also doesn't address other parts of metadata that may be important such as container id and network addresses.


rhatdan commented Oct 8, 2014

I have a different pull request to pass in the container ID as an environment variable, container_uuid (well, the first 32 characters). systemd would use this for creation of /etc/machine-id, which would allow you to connect journald on the host to journald running within a container.

#7685


rhatdan commented Oct 8, 2014

I believe a better solution would be to have the "cgroup" statistics directory available as /proc/self/cgroup rather than having it be the name of the cgroup. This data pertains to the process, and having to look it up in /sys/fs/cgroup makes userspace overly complicated. Plus, the data available in /sys/fs/cgroup is a big information leak in a multi-tenant environment.


jbeda commented Oct 8, 2014

A big con to using AF_UNIX sockets is that domain sockets are very poorly supported in Java. You cannot open a unix domain socket without installing third-party JNI libraries.

As the communication between the hosting environment (host level stuff like cgroups and beyond) becomes richer it will become more dynamic and having a mechanism that can be easily accessed from all environments is going to be important.


tobert commented Oct 8, 2014

@jbeda I didn't realize it was that bad. I'll read up on JVM+AF_UNIX today.


tobert commented Oct 8, 2014

@rhatdan that all sounds great, but when is RHEL7 going to ship a kernel with /proc/self/cgroup?

Do you think we can support current kernels with FUSE?


bgrant0607 commented Oct 8, 2014

OK, HTTP is virtually ubiquitous these days. We could go with an HTTP-based solution, but it shouldn't be inherently Docker-specific. It could carry Docker-specific (or Kubernetes-specific) information, assuming the API had an adequate namespacing mechanism (e.g., hierarchical or at least pseudo-hierarchical keys) and a means to inject arbitrary metadata (which we also requested from Docker - #6839).

Internally, we have a user-space resource-management agent layered on top of the internal version of lmctfy. It serves resource information and dynamically tunes cgroup settings in order to manage isolation. cAdvisor was created for a similar purpose. We already expected to run that everywhere. The current cAdvisor API probably isn't exactly what we'd want, but we could create a v2 API.

/cc @rjnagal @vmarmol @vishh @thockin


rhatdan commented Oct 8, 2014

@tobert I would be happy with a FUSE-based solution similar to what libvirt-lxc did; the question I have is whether or not docker would want it.


jbeda commented Oct 8, 2014

I'd love to explore a FUSE solution too -- something like /cloud


nazar-pc commented Jan 4, 2016

I was pointed here from another issue. My case involves networking environment. Here is quote from #18699 (comment):

I think something like /proc/docker/networks/<network_name>/containers/<container_id>, /proc/docker/networks/<network_name>/hosts/<hostname>/<container_id>, /proc/docker/networks/<network_name>/aliases/<alias>/<container_id> (where target file contains IP(s) of container within specified network) and/or similar can be provided by Docker engine itself so that it will not overlap with neither /etc/hosts nor DNS itself, but would allow any container to inspect configuration of network(s) it has access to and respond to changes using tools like inotify.

A REST API is great, but in simpler containers a filesystem-like solution is much more convenient to use from Bash scripts than parsing JSON.


dreamcat4 commented Jan 4, 2016

filesystem-like solution is much more convenient to use in BASH scripts instead of parsing JSON.

Sure. However, if you are stuck, you can parse JSON with the standalone program jq, calling it directly from within your script.


cpuguy83 commented Jan 4, 2016

Providing something that speaks the docker API, with access scoped to just that container, is probably the best route here, since there are already lots of tools that can handle this.


icecrime commented Sep 10, 2016


dpwrussell commented Jul 24, 2017

This thread largely seems concerned with memory, networking and similar, but I thought I would add that it would also be extremely useful to be able to introspect the image details from within the container. Potentially this is implicit to the contributors in this thread, but just in case...

A use case I am currently facing is containers which transmit datasets to object storage; these datasets should include a manifest with (among other things) the details of the docker image used to send them there. At present it seems impossible to introspect this information.


bradjones1 commented Jul 24, 2017

@dpwrussell Would labels and environment variables (e.g., ENV instructions in your Dockerfile) be helpful there?


dpwrussell commented Jul 24, 2017

@bradjones1 To my knowledge one can't access labels from inside the docker container, and I don't think it is possible to encode the digest of the image into the image itself (a chicken-or-egg problem, no?).


bradjones1 commented Jul 25, 2017

@dpwrussell You're correct, I don't believe you can introspect labels natively (however container orchestration tools like Rancher provide this via an HTTP endpoint, for instance) but you could pass the same label data in as environment variables as well. I am doing this to maintain a record of the git reference used to build the image, for instance. So on build, it's:

docker build --build-arg gitref=`git describe --tags --always` .

and in your Dockerfile:

ARG gitref=unknown
ENV GITREF $gitref
LABEL gitref $gitref

dpwrussell commented Jul 25, 2017

@bradjones1 Ah, yes, I see. That would work in that direction, but when someone comes to use a dataset with the gitref in its manifest, I don't think it is possible to do a reverse lookup by label to find the appropriate docker image to open it with.

Unfortunately in my case it is not just a matter of tracking the software version used, but of automated reproducibility.

With regard to the orchestration tools' endpoints, I am building these for deployment into unknown environments (maybe a scientist's laptop, maybe an ECS cluster, etc.), so I won't be able to rely on that.

Introspection of the image details would fix this completely and it sounds like these requirements would be easy compared to some of the other ones being discussed here.

Thanks for the suggestions though.


dlitz commented Apr 15, 2018

One unfortunate side effect of EC2's HTTP metadata endpoint is that it provides no way to securely supply a secret (e.g. a TLS or SSH private key) to a container such that it can't be leaked to an unprivileged process inside the container. An attacker controlling some vulnerable PHP code (or whatever) running chrooted in the VM could still just query http://169.254.169.254/ and learn all sorts of information it's not supposed to have, and there's no way to tell Amazon to disable that endpoint once we're done using it.

AF_UNIX sockets and inherited file descriptors don't have this problem, since they can be deleted/closed/chmodded. (They also don't involve doing weird things with the network stack, which I suspect would be much uglier with Linux containers than with full x86 virtualization.)


junneyang commented Jul 12, 2018

no solution for such a long time, really disappointed



junneyang commented Jul 12, 2018

Mesos/Marathon injects environment variables into the container: PORT, PORT0, PORT1, ...
ref: https://mesosphere.github.io/marathon/docs/task-environment-vars.html#host-ports

I think this is the best solution.


kurtwheeler commented Aug 3, 2018

I have a similar use case to @dpwrussell's, which is to help ensure scientific reproducibility. To make sure there's no way we're running a different container than we think we are, we'd like to record the image ID at runtime. He seems to have hit the nail on the head with:

This thread largely seems concerned with memory, networking and similar, but I thought I would add that it would also be extremely useful to be able to introspect the image details from within the container.

Is there some security reason that docker cannot put these details somewhere into the container when it starts up? The docker agent knows what image it is using to launch containers, so to put the details about that image somewhere into the container seems like it shouldn't be particularly difficult. I guess I'm curious if this issue should be marked as WONTFIX?


erikh commented Aug 3, 2018

Let's get real here: at least 10 ways have been suggested to do this, and it's a 4-year-old ticket with no implemented patch for any of them.

probably not going to happen.


erikh commented Aug 3, 2018

If people could agree on a limited scope it'd probably go a long way, but everyone wants to re-invent the wheel when cloud companies are doing this every day for millions of users.

anyways, $0.02 from someone who was pitching it internally when I worked there from day 1.


cpuguy83 commented Aug 3, 2018

The latest approach was to support templating in things like env vars... the main (only?) concern with it is being able to inspect this in the authz layer so authz plugins can reject templating certain values.

I think if we can provide a helper for inspecting these values to plugin authors it would go a long way... but I'm not the only one to convince.
