
Conversation

@chakri-nelluri (Contributor) commented May 22, 2017:

Add proposal for volume spec reconstruction.
Addresses: kubernetes/kubernetes#44737

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on May 22, 2017.
@chakri-nelluri (Contributor, author):

@saad-ali @jsafrane @rootfs PTAL when you get a chance.

A review thread was opened on this part of the proposal:

* Versioning the meta-data is offloaded to the plugin.

## Meta-data location:
Store the meta-data file <volume name>.json in the plugin path.

Member:
The controller-manager can run on several machines in hot-standby HA, where one controller-manager is the master (and runs the controllers) and the other controller-managers just wait for the master to die so they can quickly take over. If you store the metadata on the filesystem, it won't be available to a new master on another machine. I am afraid this metadata must be stored in the API server.

Contributor Author:
The information will be stored as part of the MountDevice call, which is only executed on worker nodes. Do we need this information and spec reconstruction in the controller-manager code path too?

Member:
oh, ok, I think we don't need it in the controller-manager, @tsmetana or @jingxu97 can confirm

Member:
And would you please add a note that this will be stored only on nodes and not in the controller-manager?

Contributor Author:
Sure will do.

Another review comment:
Can we just introduce a standard 'provider_info' key in JSON format and let the plugin determine what it needs to store there on its own? In addition, it would be up to the driver to reconnect if needed, and it would utilize that provider_info data.

@saad-ali self-assigned this on May 26, 2017.
@saad-ali (Member):

CC @kubernetes/sig-storage-feature-requests @kubernetes/sig-storage-misc

Option 1:
Store the volume object source in JSON format. Example: for the GCE plugin, store "GCEPersistentDiskVolumeSource".

Option 2:
Store the Persistent Volume spec / Volume Source in JSON format. The Persistent Volume spec is stored if the volume is backed by a Persistent Volume object; the Volume Source is stored if the volume source is inlined in the Pod spec. We can use different filenames, ~pv.json and ~vs.json, to distinguish these objects (see the sketch after these options).

Option 3:
Implement a per-plugin-specific API to store the meta-data relevant to the plugin. We can provide a sensible default using Option 2/Option 1 if a plugin does not implement these APIs.
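
For concreteness, a minimal sketch of what persisting the Option 2 metadata on a node could look like. Only the ~pv.json / ~vs.json naming idea comes from the option above; the package, helper names, and exact directory layout are illustrative assumptions, not part of the proposal.

```go
// Illustrative sketch only: persist the PV spec (or the inline VolumeSource) as JSON
// under the plugin path so kubelet can reload it after a restart.
package volmeta

import (
	"encoding/json"
	"os"
	"path/filepath"

	v1 "k8s.io/api/core/v1"
)

// SavePVSpec stores the PersistentVolume spec for a PV-backed volume,
// e.g. <pluginDir>/<volumeName>~pv.json.
func SavePVSpec(pluginDir, volumeName string, pv *v1.PersistentVolume) error {
	data, err := json.Marshal(pv)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(pluginDir, volumeName+"~pv.json"), data, 0600)
}

// SaveVolumeSource stores an inline volume source taken from the pod spec,
// e.g. <pluginDir>/<volumeName>~vs.json.
func SaveVolumeSource(pluginDir, volumeName string, vs *v1.VolumeSource) error {
	data, err := json.Marshal(vs)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(pluginDir, volumeName+"~vs.json"), data, 0600)
}
```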

I think option 3 is a non-starter. That will lead to fragmentation and runs counter to our end goal of providing the k8s kubelet a reliable way to reconstruct lost internal state.

Option 1 is insufficient for all the information we may need, and even option 2 does not go far enough.

The information we need for kubelet volume reconstruction is spread across 3 API objects:

  1. PV Object
  • This includes the PersistentVolumeSource, which we must have for PV-defined volumes in order to run GetVolumeName(...), and the PV name, which is required for generating the InnerVolumeSpecName.
  • It also includes things like AccessMode and Capacity, which are likely not useful for kubelet volume reconstruction.
  2. PVC Object
  • Most likely not useful for kubelet volume reconstruction.
  3. Pod API Object
  • Contains the Pod.Spec.Volumes[x].VolumeSource along with the user-defined Pod.Spec.Volumes[x].Name -- both are required to run GetVolumeName(...) and to generate the InnerVolumeSpecName and OuterVolumeSpecName.
  • Also contains Pod.Spec.Containers[x].VolumeMounts, which is likely not useful for kubelet volume reconstruction.

A 4th option I did not see proposed is to figure out which fields Kubernetes needs and store only those. I admit that would be more difficult to implement and harder to maintain (you need to worry about backwards compatibility, you need to find a way to encode volume-plugin-specific information, etc.).

So, it sounds like the easiest option would be to just save the entirety of the involved objects (pod and PV) so that we are sure to have everything we need at reconstruction time. At a minimum we'd need to store some part of the Pod API object and the PV object (if the volume is not a direct-reference volume). But, like you pointed out, we need to worry about versioning, and the easiest way around that would be to store the entire object (including version). Not sure if this is kosher, however, especially for pod objects.
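
A rough sketch of the "store the entire object, including version" idea, assuming plain JSON and the core/v1 types; a real implementation would more likely go through the versioned apimachinery codecs, and the helper name here is hypothetical.

```go
// Illustrative sketch only: checkpoint a whole API object, stamping TypeMeta so the
// file records the apiVersion/kind it was written with and can be decoded
// version-aware later.
package volmeta

import (
	"encoding/json"
	"os"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CheckpointPV serializes the full PersistentVolume object, version included.
func CheckpointPV(path string, pv *v1.PersistentVolume) error {
	pv.TypeMeta = metav1.TypeMeta{APIVersion: "v1", Kind: "PersistentVolume"}
	data, err := json.MarshalIndent(pv, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0600)
}
```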

@k8s-ci-robot (Contributor):

@saad-ali: These labels do not exist in this repository: sig/storage, sig/storage.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@chakri-nelluri (Contributor, author):

@shilpamayanna PTAL

@chakri-nelluri (Contributor, author):

Thanks @saad-ali. I prefer storing the PV object too, but if the spec is inlined in the pod spec, we have to store the entire pod spec. I don't like that part. Let's pick one before it's too late.
@rootfs @jsafrane any preferences?

By the way, @shilpamayanna is working on implementing this support.

@chakri-nelluri (Contributor, author):

@saad-ali PTAL. Waiting on you for LGTM.
I am on PTO this week and @shilpamayanna is working on this.

@saad-ali (Member) commented Jun 6, 2017:

Freeze date for 1.7 was June 1, so we missed the boat. Let's prioritize it high for 1.8.

In the meantime please make sure that any bug fixes that need to go in for 1.7 to mitigate kubernetes/kubernetes#44737 are in the master/1.7 branches.

@chakri-nelluri (Contributor, author):

@saad-ali PTAL. Waiting on you for LGTM.
@shilpamayanna is back from PTO. She is planning to work on it this week.

@saad-ali (Member):

Spoke with @thockin for some ideas. His suggestion was to use Kubernetes Component Config, which gives us versioned APIs, backward compatibility, deprecation policies, documentation, etc. for arbitrary configuration files. He was also leaning towards a per-volume hook.

My concern with a per-volume hook is whether it would be sufficient to capture all the information that the volume code would need (PV name, pod volume name, etc.). I feel like we may need to store the entire PV, the PVC, and much (if not all) of the pod (volumes, mounts, etc.). Tim's suggestion was to store only what we need and create our own versioned API to do that. The problem with building our own API is that it is one more thing to maintain, and as the external k8s volume API (or internal volume state management) changes, we'd need to make changes (and maintain backwards compatibility) in this layer.

@shilpamayanna what are your thoughts?

+@msau42

@childsb (Contributor) commented Aug 15, 2017:

Kubelet checkpointing is a similar concept for all of kubelet: kubernetes/kubernetes#489

@jingxu97 (Contributor) commented Aug 15, 2017:

There are a few things I want to discuss based on the meeting.

1. Disabling the reconstruct function for some volume plugins

I think disabling reconstructVolume is still needed for some plugins.
For example, in glusterfs:
a. The ConstructVolumeSpec function in glusterfs.go uses the volumeName as both the endpoint and the path, which is wrong (sketched below).
https://github.com/kubernetes/kubernetes/blob/release-1.6.3/pkg/volume/glusterfs/glusterfs.go#L197
b. The reconstructVolume function in the reconciler will try to call plugin.NewMounter() to get the mounter. This NewMounter function will in turn call the API server to get the endpoints object, which will always fail. https://github.com/kubernetes/kubernetes/blob/release-1.6.3/pkg/volume/glusterfs/glusterfs.go#L143
Because of this you can see many error messages in the log saying "glusterfs: failed to get endpoints". Also, since reconstructVolume is called periodically, this wastes API calls, because the endpoint name is fake.

So I think it would be better to disable reconstructVolume for glusterfs and some others (I will double-check).
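
For reference, the problematic pattern described in (a) looks roughly like the following; this is a paraphrase of the linked 1.6.3 code, not the exact source, and the standalone function and package name are illustrative.

```go
package volmeta

import (
	v1 "k8s.io/api/core/v1"
)

// constructGlusterfsSpecFromName mirrors the pattern of the linked 1.6.3
// ConstructVolumeSpec (paraphrased): the volume is rebuilt from nothing but the
// volume name, so both the endpoints name and the path end up as fake values.
func constructGlusterfsSpecFromName(volumeName string) *v1.Volume {
	return &v1.Volume{
		Name: volumeName,
		VolumeSource: v1.VolumeSource{
			Glusterfs: &v1.GlusterfsVolumeSource{
				EndpointsName: volumeName, // not a real endpoints object name
				Path:          volumeName, // not the real gluster path either
			},
		},
	}
}
```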

2. UniqueVolumeName issue

We also checked that, for non-attachable volumes, the unique volume name can actually be fully reconstructed correctly. It only uses the pod name, plugin name, and volume spec name, which are also used in constructing the mount path. https://github.com/kubernetes/kubernetes/blob/886e04f1fffbb04faf8a9f9ee141143b2684ae68/pkg/volume/util/volumehelper/volumehelper.go#L66
So there will be no issue matching the unique name with the state of the controller, and reconstruction should work. (glusterfs has an issue because it requires the endpoints object from the API server to construct the mounter, but that is not actually needed for unmounting.)
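
A simplified sketch of that name composition, based on the linked volumehelper code; the exact format and the function name here may differ from the real helper.

```go
package volmeta

import "fmt"

// uniqueVolumeNameForNonAttachableVolume mirrors the idea in the linked
// volumehelper code (simplified): the unique name is derived only from the plugin
// name, the pod's unique name, and the volume spec name, all of which kubelet can
// recover from the mount path without any API server call.
func uniqueVolumeNameForNonAttachableVolume(pluginName, podName, volumeSpecName string) string {
	return fmt.Sprintf("%s/%s-%s", pluginName, podName, volumeSpecName)
}
```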

For attachable volume plugins, most plugins work fine, like GCE PD and AWS EBS. There was an issue with flex volume, which was fixed by kubernetes/kubernetes#46136.

3. Reconstructing state at the master controller

On the master controller side, we recovered state from the node status VolumesAttached field (kubernetes/kubernetes#39732). There might be corner cases where the node status is not in sync with the real world.

Now I am wondering what important issues we haven't addressed. Please comment if I missed any points. Thanks!

@shilpamayanna:

For now, we will not implement GetVolumeName in flex volume, to avoid volumes getting unmounted, and we will update the flex volume documentation with the limitations for inline volumes in the pod spec.

@k8s-github-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Aug 15, 2017.
@k8s-github-robot:

This PR hasn't been active in 90 days. Closing this PR. Please reopen if you would like to work towards merging this change, if/when the PR is ready for the next round of review.

cc @chakri-nelluri @saad-ali

You can add the 'keep-open' label to prevent this from happening again, or add a comment to keep it open for another 90 days.


