Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

virtcontainers: Add support for ephemeral volumes #307

Closed
wants to merge 1 commit into from

Conversation

harche
Copy link
Contributor

@harche harche commented May 14, 2018

Ephemeral volumes should not be passed at 9pfs mounts.
They should be created inside the VM.

This patch disables ephemeral volumes from getting
mounted as 9pfs from the host and instead a corresponding
tmpfs is created inside the VM.

Fixes : #61

Signed-off-by: Harshal Patil harshal.patil@in.ibm.com

@harche
Copy link
Contributor Author

harche commented May 14, 2018

There will be the corresponding PR on agent project.

@harche
Copy link
Contributor Author

harche commented May 14, 2018

Corresponding PR in agent, kata-containers/agent#236

return nil, err
var ephiVol = false
splitSourceSlice := strings.Split(m.Source, "/")
if len(splitSourceSlice) > 2 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify the expected format of the original Source str in a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@egernst A typical ephemeral storage path looks like, /var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume

I will update the code with this sample path in code comments.

Copy link
Member

@egernst egernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One ask for an added comment, and small nit on commit message: I think

Ephemeral volumes should not be passed at 9pfs mounts.

should have s/at/as

@egernst
Copy link
Member

egernst commented May 15, 2018

@WeiZhang555 @amshinde - PTAL

@WeiZhang555
Copy link
Member

WeiZhang555 commented May 15, 2018

@harche Sorry I didn't participate the discussion before. Need to confirm the cases first, I see in the issue #61 , you mentioned the volume is backed by tmpfs

# mount |grep 366c3a75
tmpfs on /var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~secret/default-token-2w6d4 type tmpfs (rw,relatime)

So in this PR, you only handled kubernetes.io~empty-dir but let the kubernetes.io~secret/ go, I guess it's because we need to write data to "kubernetes.io~secret"from host so we should keep it bind-mounted.

The question here is, are we sure there's no scenario we will write data to kubernetes.io~empty-dir from host? Because if this happens it will fail after we merge this. And it will also fail when container tries to write to K8S~empty-dir and someone wants to read from host side.

Another concern is, is this code too "K8S" specific?

splitSourceSlice := strings.Split(m.Source, "/")
if len(splitSourceSlice) > 2 {
k8sStorageType := splitSourceSlice[len(splitSourceSlice)-2]
if k8sStorageType == "kubernetes.io~empty-dir" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted on kata-containers/agent#236, I am concerned that although this might work today, since the code is parsing a path name it could easily break if k8s decide to change their naming for some reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is also what I am worrying about now, it's a bit too "K8S" specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the detection to kata cli to keep the knowledge out of virtcontainers? We can add a tmpfs device type for it in container DeviceInfos and kata cli is responsible for detecting ephemeral volumes and construct proper deviceinfo there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeiZhang555 any solution to handle ephemeral volumes will be very specific to the k8s. Ephemeral volumes are k8s concept after all. But we are interested in bringing them in kata because it helps to increase the isolation with the host. If you have a scenario where you need to share the data between say, k8s init and regular containers of the pod then those containers can share the data amongst themselves without explicitly putting it on the host filesystem.

@jodh-intel ephemeral volumes are define in k8s yaml at pod level like this,

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
  labels:
    app: myapp
spec:
  containers:
  # Application Container
  - name: myapp-container
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
    imagePullPolicy: IfNotPresent 
    image: localhost:5000/app_image:latest
    command: ['sh']
  initContainers:
  - name: data-container
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
    imagePullPolicy: IfNotPresent 
    image: localhost:5000/data_image:latest
    command: ['sh', '-c', 'cp /test.gpg /cache/']
  volumes:
  - name: cache-volume
    emptyDir:
      medium: "Memory"

So far the SanboxConfig.Volumes (https://github.com/kata-containers/runtime/blob/master/virtcontainers/sandbox.go#L444) doens't register this volume type.

This is the reason I resorted to parsing the OCI mounts of the invidual container mounts to infer if there are any k8s ephemeral volumes.

@katabuilder
Copy link

PSS Measurement:
Qemu: 142172 KB
Proxy: 6730 KB
Shim: 13117 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2010968 KB

@@ -67,6 +68,19 @@ func ReverseString(s string) string {
return string(r)
}

// From the given path to the storage determine
// if it's a kubernetes ephemeral storage
func IsEphemeralStroage(path string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is too k8s specific to be in the virtcontainers library. Please move it to kata cli and use a new device type to describe the ephemeral storage in container config.

Copy link
Contributor Author

@harche harche May 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bergwolf Need some clarification so just to make sure we are on the same page.

When you say to use a new DeviceType, do you mean the one declared here - https://github.com/kata-containers/runtime/blob/master/virtcontainers/device/config/config.go#L19 ?

Also, you suggested using container config to describe the storage. Staying within the cli package I could imagine the new device can be added at this point, https://github.com/kata-containers/runtime/blob/master/cli/create.go#L261 (by appending to contConfig.DeviceInfos). Because after that point the code jumps into the virtcontainer library.

Coming back to the DeviceType, it's used in the driver implementations in virtcontainer library https://github.com/kata-containers/runtime/tree/master/virtcontainers/device/drivers

All the uses of the DeviceType including its declaration are in virtcontainers only. So this isn't helping us taking the code away from virtcontainers and putting it in cli package.

To me, this seems like an overkill to implement all those methods in those drivers when we aren't even going to use any real device from the host.

Maybe I am missing something, would you like to elaborate more on what you have in mind here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your understand is correct. I asked for it because this is the device/storage model of virtcontainers. We want all devices/storage to follow the same flow. ephemeral is just another storage. I don't think we should make it a special case and handle it differently.

Besides, IsEphemeralStroage is totally bound to k8s. You cannot create an ephemeral storage unless the source path has strict k8s empty dir pattern. That is not flexible at all. We want a generic API that one can create ephemeral storage with any source path. In fact, for ephemeral storage, the source path should not matter at all.

Copy link
Contributor Author

@harche harche May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bergwolf The current code in this PR which uses pb.Storage.Mountpoint is implemented as per our discussion here, #61 (comment)

I think the only thing we should address here is to remove the strong coupling with kubernetes. If we somehow are able to detect if the given path is for the ephemeral storage without having to explicitly depend upon kubernetes specific paths then we should be good.

What if we loop over the container mounts, and match with the /proc/mounts of the host to see if that path is backed by tmpfs? If it is, then I think we can safely assume it's for ephemeral storage. This will remove our dependency on k8s specific paths.

A typical characteristic of an ephemeral storage would be,

  1. A directory on the host (may or may not be backed by tmpfs)
  2. This directory is bind mounted to all the containers of the same pod so that they can share the data.

@bergwolf @egernst @amshinde @WeiZhang555 @jodh-intel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code in this PR which uses pb.Storage.Mountpoint is implemented as per our discussion here, #61 (comment)

That comment only describes the fact that the tmpfs storage is already supported by the agent grpc protocol. You have done a great job to implement it in kata agent. But the runtime PR does not satisfy the virtcontainer API we would like to merge.

At the runtime API level, we want an explicit API to let callers specify a tmpfs storage and its mountpoint. It can be a better workaround to check /proc/mounts for tmpfs storage, but again, the detection needs to be done at the kata cli level and the virtcontainers API need to be explicit about using a tmpfs storage, instead of trying to do the auto-detection internally.

IOW, what you did is an API change and we need to be explicit about using a tmpfs storage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harche

A typical characteristic of an ephemeral storage would be,

  1. A directory on the host (may or may not be backed by tmpfs)
  2. This directory is bind mounted to all the containers of the same pod so that they can share the data.

I wonder if this is a trusted detection point, I think ephemeral storage has another difference with normal tmpfs storage that a normal tmpfs storage DOES allow you to share contents between host and guest but ephemeral storage DOESN'T .

So if user wants to share host contents with all containers in pod and this volume happens to base on tmpfs(I think this is also valid use case), the case could be broken.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to catch up here, and trying to summarize, we don't have a reliable way to detect this specific use case, right ?
The reason being: a volume backed by tmpfs could be either a directory with some data the host want to share with the container, or an ephemeral volume that is an empty directory.

What about checking both the tmpfs and the content of the source path provided ?

Now that being said, I realize that ephemeral volume support in Kata Containers is more an improvement thing since it's already supported if two containers link to the same ephemeral volume on the host via 9p, right @harche ?
Following this, do we really need this "improvement" since it seems to me that we'll have to rely on strong assumptions to detect the case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use ephemeral volumes regularly for sharing the data between containers of the pod (especially between init and regular containers). Some of the applications are sensitive to the security and this is where we would love to use kata in our kubernetes environment for the added isolation it provides. With upcoming techonologies like Intel's TME and MKTME, IBM Ultravisor or AMD memory encryption we can take the isolation provided by kata to the next level.

Using those technologies you will be able to encrypt the memory pages of the guest, this is where this patch will enable kata to play a crucial role in providing the strongest isolation any runtime can provide for security sensitive kubernetes workloads. Since you will place the ephemeral volume inside the guest, which in turn has encrypted memory pages, even a compromised host won't be able to read the data containers are sharing between them.

I strongly believe this work aligns with the kata's mission to provide the more secure environment for the container workloads.

@WeiZhang555 @bergwolf @sboeuf @amshinde

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation @harche. This makes sense to support this case, but it's annoying that is something we have to detect based on the path name (as mentioned by @WeiZhang555)

for _, m := range c.mounts {
if utils.IsEphemeralStroage(m.Source) {
filename := filepath.Join(ephemeralPath, filepath.Base(m.Source))
teststorage := &grpc.Storage{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

teststorage is not a good variable name :)

Copy link
Contributor Author

@harche harche May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) sorry. I will clean it and update the patch.

@amshinde
Copy link
Member

@harche Are ephemeral volumes meant to be shared, that is created once for the sandbox and bind-mounted by the rest of the containers?

@harche
Copy link
Contributor Author

harche commented May 24, 2018

@amshinde Yes.

@WeiZhang555
Copy link
Member

I don't feel very good about this "improvement" though it can be useful in K8S scenarios.
All the judgement of "ephemeral volumes" is based on the file path which is not a really good decision, this is extremely customized code and isn't generic at all, even after move the code from virtcontainers/ to cli/.
I'm leaning to -1 on this, unless we find a better way to detect ephemeral volume

@harche harche changed the title virtcontainers: Add support for ephemeral volumes [WIP] virtcontainers: Add support for ephemeral volumes Jun 1, 2018
@harche
Copy link
Contributor Author

harche commented Jun 1, 2018

Edited subject to work in progress since we are yet to arrive at the consensus.

@harche
Copy link
Contributor Author

harche commented Jun 1, 2018

What happened here? it says merged?

@grahamwhaley
Copy link
Contributor

I think that 'merged' a little way down the thread was for a PR that referenced this one. This one now says 'closed', by you @harche .

@harche harche deleted the tmpfs branch June 1, 2018 10:44
@harche harche restored the tmpfs branch June 1, 2018 10:44
@harche harche reopened this Jun 1, 2018
@harche
Copy link
Contributor Author

harche commented Jun 1, 2018

We use ephemeral volumes regularly for sharing the data between containers of the pod (especially between init and regular containers). Some of the applications are sensitive to the security and this is where we would love to use kata in our kubernetes environment for the added isolation it provides. With upcoming techonologies like Intel's TME and MKTME, IBM Ultravisor or AMD memory encryption we can take the isolation provided by kata to the next level.

Using those technologies you will be able to encrypt the memory pages of the guest, this is where this patch will enable kata to play a crucial role in providing the strongest isolation any runtime can provide for security sensitive kubernetes workloads. Since you will place the ephemeral volume inside the guest, which in turn has encrypted memory pages, even a compromised host won't be able to read the data containers are sharing between them.

I strongly believe this work aligns with the kata's mission to provide the more secure environment for the container workloads.

@WeiZhang555 @bergwolf @sboeuf @amshinde @jodh-intel

@harche
Copy link
Contributor Author

harche commented Jun 1, 2018

@bergwolf I have moved ephemeral volume detection to cli package. Let me know what do you think.

@katabuilder
Copy link

PSS Measurement:
Qemu: 159832 KB
Proxy: 6715 KB
Shim: 8945 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1996880 KB

return nil, err
epheVol := false
for _, devInfo := range c.config.DeviceInfos {
if devInfo.DevType == "e" && devInfo.HostPath == m.Source {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define the new device type in device/config/config.go, and compare m.Source with devInfo.ContainerPath instead.

And for epheVol, you don't need to put it in the shared mounts. Let me explain how it works:

  1. on kata cli side, add a device to container config, where devinfo.type is epheVol, and a Mount to container config, where m.Source == devinfo.ContainerPath
  2. in kata runtime, skip epheVol by NOT adding it to sharedMounts
  3. in kata_agent.go, create tmpfs storage where its mountpoint is devinfo.ContainerPath

This would follow the same storage/device devinfo semantics as other types of devices/storages.

Copy link
Member

@bergwolf bergwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! It looks good overall. Just one comment inline.

Driver: "ephemeral",
Source: "tmpfs",
Fstype: "tmpfs",
MountPoint: filename,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC you need "/ephemeralPath/" + filepath.Base(devInfo.ContainerPath) here, to match line 744 below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. Turns out I was looping over a wrong slice. Instead of looping over ociSpec.Mounts I was looping over newMounts. The reason it worked during testing because I kept the source path intact, which meant it created a path like, /var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume inside a VM.

I will update the PR with the fix.

)

// GenericDevice refers to a device that is neither a VFIO device or block device.
type EphemeralDevice struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll mark this as "need change" in case we forget and merge it in accidently.

@katacontainersbot
Copy link
Contributor

PSS Measurement:
Qemu: 145911 KB
Proxy: 4712 KB
Shim: 8848 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007308 KB

@harche
Copy link
Contributor Author

harche commented Jun 15, 2018

@WeiZhang555 @sboeuf @bergwolf I have updated the PR

@harche harche changed the title [WIP] virtcontainers: Add support for ephemeral volumes virtcontainers: Add support for ephemeral volumes Jun 15, 2018
@harche
Copy link
Contributor Author

harche commented Jun 15, 2018

Removing [WIP] from the title.

@@ -38,6 +38,20 @@ func TestFileExists(t *testing.T) {
fmt.Sprintf("File %q should exist", file))
}

func TestIsEphemeralStroage(t *testing.T) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/TestIsEphemeralStroage/TestIsEphemeralStorage/

cli/utils.go Outdated
splitSourceSlice := strings.Split(path, "/")
if len(splitSourceSlice) > 1 {
storageType := splitSourceSlice[len(splitSourceSlice)-2]
if storageType == "kubernetes.io~empty-dir" {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define a global variable for kubernetes.io~empty-dir

@@ -418,6 +418,7 @@ func (c *Container) createContainersDirs() error {
// available when we will need to unmount those mounts.
func (c *Container) mountSharedDirMounts(hostSharedDir, guestSharedDir string) ([]Mount, error) {
var sharedDirMounts []Mount
MntLoop:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this kind of syntax, it's not really Go friendly.

// provided by k8s, instead at later stage we are going to
// create a tmpfs inside a VM and use it to bind mount to the
// containers of the pod.
continue MntLoop
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to avoid the goto here, please use something like this:

                loopOver := false
                for _, devInfo := range c.config.DeviceInfos {
			if devInfo.DevType == "e" && devInfo.ContainerPath == m.Source {
				// We are not going to bind mount host ephemeral directory
				// provided by k8s, instead at later stage we are going to
				// create a tmpfs inside a VM and use it to bind mount to the
				// containers of the pod.
				loopOver = true
                                break
			}
		}

                if loopOver {
                            continue
                }

Copy link

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @harche ! LGTM

@sboeuf
Copy link

sboeuf commented Jun 18, 2018

@WeiZhang555 please give it another review !

var devInfos []config.DeviceInfo
for _, devInfo := range c.config.DeviceInfos {
if devInfo.DevType == "e" {
filename := filepath.Join(ephemeralPath, filepath.Base(devInfo.ContainerPath))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Base guaranteed to be unique here? If not, I would append this with a randomly generated string to make sure there are no conflicts.

Copy link
Contributor Author

@harche harche Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Base is guaranteed to be unique. In k8s ephemeral volume is defined by,

  volumes:
  - name: cache-volume
    emptyDir:
      medium: "Memory"

In the pod yaml.

The Base comes from name of the volume which is unique within the pod (which in the example above would be cache-volume)

@amshinde
Copy link
Member

amshinde commented Jun 18, 2018

LGTM overall.
Thanks @harche!

You will need to refactor the code introduced in createContainer and mountSharedDirMounts in separate functions to reduce the cyclomatic complexity. Static checks are failing right now. You will then need to rebase on master to fix a CI issue we had recently.

Approved with PullApprove

Ephemeral volumes should not be passed at 9pfs mounts.
They should be created inside the VM.

This patch disables ephemeral volumes from getting
mounted as 9pfs from the host and instead a corresponding
tmpfs is created inside the VM.

Fixes : kata-containers#61

Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
@katacontainersbot
Copy link
Contributor

PSS Measurement:
Qemu: 144210 KB
Proxy: 4737 KB
Shim: 8776 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007308 KB

@harche
Copy link
Contributor Author

harche commented Jun 19, 2018

@amshinde code in mountSharedDirMounts already has the cyclomatic complexity of 15 (without my changes). Any new condition will bump it to 16 and above (15 is the cut off right now). I need to have at least one if condition there to decide if that mount source should be skipped or not.

For createContainer too cyclomatic complexity is at 15 without my changes, so this is too on the edge.

Probably I have to refactor beyond my changes.

@WeiZhang555
Copy link
Member

WeiZhang555 commented Jun 19, 2018

The "ephemeral device" design is still weird to me, because it's a fake device, there's no real device type associated with it and it's not even related to device, it's only about "Mount".

@bergwolf I know you suggested this design, I am thinking to just use "Mount" info instead of adding a "Device" type.
For example, from client side, it modify the ephemeral mount to new Mount:

Mount {
"Source": "/var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume",
"Destination": "/container/ephemeral",
"Type": "ephemeral",
}

Then it will be translated into a grpcStorage:

grpc.Storage{
 				Driver:     kataEphemeralDevType,
 				Source:     "tmpfs",
 				Fstype:     "tmpfs",
 				MountPoint: filepath.Join(ephemeralPath, filepath.Base(mount.Source)),
 			}

and a modified OCI spec with modified Mount field to kata-agent as:

Mount {
"Source": filepath.Join(ephemeralPath, filepath.Base(mount.Source)),,
"Destination": "/container/ephemeral",
"Type": "bind",
}

So we can omit a fake device type, instead we introduce a new fake storage type "ephemeral" similar to "tmpfs" or "proc".
My suggestion isn't about saving code lines, it's just because I think the fake device type looks weird and easily confuse users.

What do you think? I didn't try but I think the new design could work. @bergwolf @harche

@sboeuf
Copy link

sboeuf commented Jun 19, 2018

I like the proposal @WeiZhang555. I think using Storage in this case is a better option than Device.

@bergwolf
Copy link
Member

@WeiZhang555 It feels weird because Storage was renamed to Device :)

The main reason I suggested the device description instead of relying only on container Mounts is that there is no sandbox level storage structure that can describe a storage without a real device. There is much in common between storage and device and there is also subtle difference. If we want to exclude fake devices from the device infrastructure, we need a storage structure to allow proper tracking of those storages w/o real devices at the sandbox level.

IWO, we either treat storage the same as device and allow fake devices, or we define storage and device separately and clearly.

For ephemeral storage, your proposal does work and it is actually what @harche did in the first place. But still, it misses a sandbox level description. If we ever want to check for ephemeral storage sharing, we have to iterate through all containers' config in the sandbox.

If we choose to exclude storage w/o real devices from the device handling infrastructure, we need to at least file an issue to say so and that we will add the missing storage abstraction at sandbox level. And then this PR can move forward with your proposal and @harche's original implementation.

@tallclair
Copy link

Sorry I'm late to the party, but I think I can provide a bit more context on the Kubernetes side. Long term, the CRI will need to change to really support this case (I've started those conversations, but AFAIK there hasn't been any progress). Without the full k8s support, I have a few concerns with pushing forward with this approach:

  • Using tmpfs means that writes to the emptyDir volume will consume pod memory. How is this memory accounted? We don't yet have a notion of pod overhead, so I don't think there's a way to reserve memory for these volumes. An alternative approach is to use a block or "image device" (I'm don't know the correct terminology) that provides disk-backed storage that is opaque to the host.
  • How does this interact non-emptyDir volume types that delegate to the emptydir volume handler? This includes: secrets, configmaps, projected, gitRepo, downwardAPI
  • We'll eventually want to treat the container's writeable layer the same way (opaque to the host)

Considering all this, I think this should be configureable, and maybe treated as an experimental feature (have per-pod opt-in, as well as a runtime-level configuration to enable it). This also solves the issue of whether the filepath to the volume could ever change, since that could be part of the config.

That was a bit of a ramble, but hopefully it helps. I'm very excited about this feature, thanks for working on it!

@WeiZhang555
Copy link
Member

@tallclair 's suggestion is valuable, I think make this configurable is a good idea, we can configure the K8S "ephemeral" volume pattern into the configure file in case the ephemeral volume path changed in higher version of K8S.

@harche
Copy link
Contributor Author

harche commented Jul 3, 2018

For people following this issue, I have created a PR that doesn't use devices - #458

@harche
Copy link
Contributor Author

harche commented Jul 3, 2018

@tallclair Ephemeral volumes will consume the memory allocated for pod VM. But attaching the block storage to provision ephemeral volumes defeats the main motivation to have this feature in first place. I have explained in this, #307 (comment), in detail on why we need to make sure the ephemeral volume that we provisioned needs to remain in VM memory.

TLDR; Upcoming memory encryption technologies allow us to encrypt VM pages and gaurd against a compromised host from reading that memory. We inject our ephemeral volume in that VM memory so that even a compromised host can't read what containers of that pod are sharing with each other.

@sboeuf
Copy link

sboeuf commented Jul 9, 2018

Replaced by #458, this can be closed.

@sboeuf sboeuf closed this Jul 9, 2018
@tallclair
Copy link

@harche couldn't the attached block storage just be encrypted? That said, I think we're currently very far away from being able to protect VM pods running a compromised node, and it will only be possible by taking a holistic look at Kubernetes, possibly rethinking some fundamental concepts. I'm not sure it's worth designing individual features for this use case without the full picture, and I'm worried this is the security equivalent of premature optimization (premature hardening?).

zklei pushed a commit to zklei/runtime that referenced this pull request Jun 13, 2019
This updates grpc-go vendor package to v1.11.3 release, to fix server.Stop()
handling so that server.Serve() does not wait blindly.

Full commit list:
d11072e (tag: v1.11.3) Change version to 1.11.3
d06e756 clientconn: add support for unix network in DialContext. (kata-containers#1883)
452c2a7 Change version to 1.11.3-dev
d89cded (tag: v1.11.2) Change version to 1.11.2
98ac976 server: add grpc.Method function for extracting method from context (kata-containers#1961)
0f5fa28 Change version to 1.11.2-dev
1e2570b (tag: v1.11.1) Change version to 1.11.1
d28faca client: Fix race when using both client-side default CallOptions and per-call CallOptions (kata-containers#1948)
48b7669 Change version to 1.11.1-dev
afc05b9 (tag: v1.11.0) Change version to 1.11.0
f2620c3 resolver: keep full unparsed target string if scheme in parsed target is not registered (kata-containers#1943)
9d2250f status: rename Status to GRPCStatus to avoid name conflicts (kata-containers#1944)
2756956 status: Allow external packages to produce status-compatible errors (kata-containers#1927)
0ff1b76 routeguide: reimplement distance calculation
dfbefc6 service reflection can lookup enum, enum val, oneof, and field symbols (kata-containers#1910)
32d9ffa Documentation: Fix broken link in rpc-errors.md (kata-containers#1935)
d5126f9 Correct Go 1.6 support policy (kata-containers#1934)
5415d18 Add documentation and example of adding details to errors (kata-containers#1915)
57640c0 Allow storing alternate transport.ServerStream implementations in context (kata-containers#1904)
031ee13 Fix Test: Update the deadline since small deadlines are prone to flakes on Travis. (kata-containers#1932)
2249df6 gzip: Add ability to set compression level (kata-containers#1891)
8124abf credentials/alts: Remove the enable_untrusted_alts flag (kata-containers#1931)
b96718f metadata: Fix bug where AppendToOutgoingContext could modify another context's metadata (kata-containers#1930)
738eb6b fix minor typos and remove grpc.Codec related code in TestInterceptorCanAccessCallOptions (kata-containers#1929)
211a7b7 credentials/alts: Update ALTS "New" APIs (kata-containers#1921)
fa28bef client: export types implementing CallOptions for access by interceptors (kata-containers#1902)
ec9275b travis: add Go 1.10 and run vet there instead of 1.9 (kata-containers#1913)
13975c0 stream: split per-attempt data from clientStream (kata-containers#1900)
2c2d834 stats: add BeginTime to stats.End (kata-containers#1907)
3a9e1ba Reset ping strike counter right before sending out data. (kata-containers#1905)
90dca43 resolver: always fall back to default resolver when target does not follow URI scheme (kata-containers#1889)
9aba044 server: Convert all non-status errors to codes.Unknown (kata-containers#1881)
efcc755 credentials/alts: change ALTS protos to match the golden version (kata-containers#1908)
0843fd0 credentials/alts: fix infinite recursion bug [in custom error type] (kata-containers#1906)
207e276 Fix test race: Atomically access minConnecTimout in testing environment. (kata-containers#1897)
3ae2a61 interop: Add use_alts flag to client and server binaries (kata-containers#1896)
5190b06 ALTS: Simplify "New" APIs (kata-containers#1895)
7c5299d Fix flaky test: TestCloseConnectionWhenServerPrefaceNotReceived (kata-containers#1870)
f0a1202 examples: Replace context.Background with context.WithTimeout (kata-containers#1877)
a1de3b2 alts: Change ALTS proto package name (kata-containers#1886)
2e7e633 Add ALTS code (kata-containers#1865)
583a630 Expunge error codes that shouldn't be returned from library (kata-containers#1875)
2759199 Small spelling fixes (unknow -> unknown) (kata-containers#1868)
12da026 clientconn: fix a typo in GetMethodConfig documentation (kata-containers#1867)
dfa1834 Change version to 1.11.0-dev (kata-containers#1863)
46fd263 benchmarks: add flag to benchmain to use bufconn instead of network (kata-containers#1837)
3926816 addrConn: Report underlying connection error in RPC error (kata-containers#1855)
445b728 Fix data race in TestServerGoAwayPendingRPC (kata-containers#1862)
e014063 addrConn: keep retrying even on non-temporary errors (kata-containers#1856)
484b3eb transport: fix race causing flow control discrepancy when sending messages over server limit (kata-containers#1859)
6c48c7f interop test: Expect io.EOF from stream.Send() (kata-containers#1858)
08d6261 metadata: provide AppendToOutgoingContext interface (kata-containers#1794)
d50734d Add status.Convert convenience function (kata-containers#1848)
365770f streams: Stop cleaning up after orphaned streams (kata-containers#1854)
7646b53 transport: support stats.Handler in serverHandlerTransport (kata-containers#1840)
104054a Fix connection drain error message (kata-containers#1844)
d09ec43 Implement unary functionality using streams (kata-containers#1835)
37346e3 Revert "Add WithResolverUserOptions for custom resolver build options" (kata-containers#1839)
424e3e9 Stream: do not cancel ctx created with service config timeout (kata-containers#1838)
f9628db Fix lint error and typo (kata-containers#1843)
0bd008f stats: Fix bug causing trailers-only responses to be reported as headers (kata-containers#1817)
5769e02 transport: remove unnecessary rstReceived (kata-containers#1834)
0848a09 transport: remove redundant check of stream state in Write (kata-containers#1833)
c22018a client: send RST_STREAM on client-side errors to prevent server from blocking (kata-containers#1823)
82e9f61 Use keyed fields for struct initializers (kata-containers#1829)
5ba054b encoding: Introduce new method for registering and choosing codecs (kata-containers#1813)
4f7a2c7 compare atomic and mutex performance in case of contention. (kata-containers#1788)
b71aced transport: Fix a data race when headers are received while the stream is being closed (kata-containers#1814)
46bef23 Write should fail when the stream was done but context wasn't cancelled. (kata-containers#1792)
10598f3 Explain target format in DialContext's documentation (kata-containers#1785)
08b7bd3 gzip: add Name const to avoid typos in usage (kata-containers#1804)
8b02d69 remove .please-update (kata-containers#1800)
1cd2346 Documentation: update broken wire.html link in metadata package. (kata-containers#1791)
6913ad5 Document that all errors from RPCs are status errors (kata-containers#1782)
8a8ac82 update const order (kata-containers#1770)
e975017 Don't set reconnect parameters when the server has already responded. (kata-containers#1779)
7aea499 credentials: return Unavailable instead of Internal for per-RPC creds errors (kata-containers#1776)
c998149 Avoid copying headers/trailers in unary RPCs unless requested by CallOptions (kata-containers#1775)
8246210 Update version to 1.10.0-dev (kata-containers#1777)
17c6e90 compare atomic and mutex performance for incrementing/storing one variable (kata-containers#1757)
65c901e Fix flakey test. (kata-containers#1771)
7f2472b grpclb: Remove duplicate init() (kata-containers#1764)
09fc336 server: fix bug preventing Serve from exiting when Listener is closed (kata-containers#1765)
035eb47 Fix TestGracefulStop flakiness (kata-containers#1767)
2720857 server: fix race between GracefulStop and new incoming connections (kata-containers#1745)
0547980 Notify parent ClientConn to re-resolve in grpclb (kata-containers#1699)
e6549e6 Add dial option to set balancer (kata-containers#1697)
6610f9a Fix test: Data race while resetting global var. (kata-containers#1748)
f4b5237 status: add Code convenience function (kata-containers#1754)
47bddd7 vet: run golint on _string files (kata-containers#1749)
45088c2 examples: fix concurrent map accesses in route_guide server (kata-containers#1752)
4e393e0 grpc: fix deprecation comments to conform to standard (kata-containers#1691)
0b24825 Adjust keepalive paramenters in the test such that scheduling delays don't cause false failures too often. (kata-containers#1730)
f9390a7 fix typo (kata-containers#1746)
6ef45d3 fix stats flaky test (kata-containers#1740)
98b17f2 relocate check for shutdown in ac.tearDown() (kata-containers#1723)
5ff10c3 fix flaky TestPickfirstOneAddressRemoval (kata-containers#1731)
2625f03 bufconn: allow readers to receive data after writers close (kata-containers#1739)
b0e0950 After sending second goaway close conn if idle. (kata-containers#1736)
b8cf13e Make sure all goroutines have ended before restoring global vars. (kata-containers#1732)
4742c42 client: fix race between server response and stream context cancellation (kata-containers#1729)
8fba5fc In gracefull stop close server transport only after flushing status of the last stream. (kata-containers#1734)
d1fc8fa Deflake tests that rely on Stop() then Dial() not reconnecting (kata-containers#1728)
dba60db Switch balancer to grpclb when at least one address is grpclb address (kata-containers#1692)
ca1b23b Update CONTRIBUTING.md to CNCF CLA
2941ee1 codes: Add UnmarshalJSON support to Code type (kata-containers#1720)
ec61302 naming: Fix build constraints for go1.6 and go1.7 (kata-containers#1718)
b8191e5 remove stringer and go generate (kata-containers#1715)
ff1be3f Add WithResolverUserOptions for custom resolver build options (kata-containers#1711)
580defa Fix grpc basics link in route_guide example (kata-containers#1713)
b7dc71e Optimize codes.String() method using a switch instead of a slice of indexes (kata-containers#1712)
1fc873d Disable ccBalancerWrapper when it is closed (kata-containers#1698)
bf35f1b Refactor roundrobin to support custom picker (kata-containers#1707)
4308342 Change parseTimeout to not handle non-second durations (kata-containers#1706)
be07790 make load balancing policy name string case-insensitive (kata-containers#1708)
cd563b8 protoCodec: avoid buffer allocations if proto.Marshaler/Unmarshaler (kata-containers#1689)
61c6740 Add comments to ClientConn/SubConn interfaces to indicate new methods may be added (kata-containers#1680)
ddbb27e client: backoff before reconnecting if an HTTP2 server preface was not received (kata-containers#1648)
a4bf341 use the request context with net/http handler (kata-containers#1696)
c6b4608 transport: fix race sending RPC status that could lead to a panic (kata-containers#1687)
00383af Fix misleading default resolver scheme comments (kata-containers#1703)
a62701e Eliminate data race in ccBalancerWrapper (kata-containers#1688)
1e1a47f Re-resolve target when one connection becomes TransientFailure (kata-containers#1679)
2ef021f New grpclb implementation (kata-containers#1558)
10873b3 Fix panics on balancer and resolver updates (kata-containers#1684)
646f701 Change version to 1.9.0-dev (kata-containers#1682)

Fixes: kata-containers#307

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet