-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add dev guide explaining lvmo components [skip ci] #101
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: leelavg The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see the design writeup, thanks!
- *csiDriver*: Reconciles topolvm CSI Driver | ||
- *topolvmController*: Reconciles topolvm controller plugin | ||
- *lvmVG*: Reconciles volume groups from LVMCluster CR | ||
- *openshiftSccs*: Manages SCCs when the operator is run in Openshift |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator creates the SCCs? Security best practices is that the SCCs should be defined statically and not managed by an operator. For example, Rook defines its SCC in its upstream manifest here. However, downstream doesn't have a way to define the SCCs in OLM, so the workaround was to have the operator do it. But upstream really should define the SCC in a standalone manifest so the operator doesn't need high privileges to create SCCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- i may be wrong here, does rook being deployed standalone in downstream, isn't that setup by meta operators which also takes care of SCCs?
- if indeed rook upstream isn't setting SCCs from code, it does have to deploy necessary RBAC required to access SCC api irrespective of upstream/downstream, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- i may be wrong here, does rook being deployed standalone in downstream, isn't that setup by meta operators which also takes care of SCCs?
Downstream Rook is only deployed as part of ODF and the OCS operator creates the necessary SCC since Rook does not create its own.
- if indeed rook upstream isn't setting SCCs from code, it does have to deploy necessary RBAC required to access SCC api irrespective of upstream/downstream, isn't it?
The Rook operator does not ever create SCCs or RBAC. It's all defined in its helm chart (for helm users), or the common.yaml for other upstream users. So the admin has full control to grant the operator only exactly the permissions needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should this be done for downstream outside of the operator? Bundles do not allow us to include SCCs so we decided to do this in the operator in order to make deployment easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Downstream does require the operator to do it. I'm just suggesting that for upstream there should be an option for the operator not to do it, so the SCC can be created separately and avoid giving the operator the extra privileges to create SCCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can look into that.
|
||
- *topolvmStorageClass* resource units creates and manages all the storage | ||
classes corresponding to the deviceClasses in the LVMCluster | ||
- Storage Class name is generated with a prefix "topolvm-" added to name of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a storage class name be specified without the topolvm- prefix being added? Admins generally like to control the names of their storage classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- if admin doesn't want to use a pre-created storage class they are free to create a new storage class and not use pre-created SC at all
- we aren't setting any SC as default but just creating SC's corresponding to device classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding a property on the device class something like storageClassName
that allows them to override the generated storage class name? It's just a suggestion for a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention here is to provide a 1-click install so there is a working storage setup the admin can use from day 1, similar to how OCS does this. They can create their own sc later if they choose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream users appreciate customizing from the first install. That's the beauty of CRs is that it allows for the customization from the start. Admins don't generally want to have extra/unused/uncustomizable storage classes. 1-click install is great for the common case, but requiring the 1-click install doesn't work well for the advanced users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good point and can be taken up in later improvements. Out first priority was to have a working operator and one of the requirements was a one-click installation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes certainly, common cases first and this is just a suggestion for the future.
and creates the required volume groups on the individual nodes based on the | ||
specified deviceSelector and nodeSelector. | ||
- The corresponding CRs forms the basis of `vgManager` unit to create volume | ||
groups and create lvmd config file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the lvmd config file generated on each node? Does the vgManager have a hostPath mounted? Could you add more details about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- doc did explain about hostPath in the initial version of this PR
- maybe I misunderstood that comment and removed that info from vg-manager doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbalacha do you want to see the mention of hostPath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think having the hostpath information here would be useful.
- *vgManager*: Responsible for creation of Volume Groups | ||
- *topolvmStorageClass*: Manages storage class life cycle based on | ||
devicesClasses in LVMCluster CR | ||
- The LVMO creates an LVMVolumeGroup CR for each deviceClass in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the point I still don't understand... Why doesn't the admin create the LVMVolumeGroup CR directly? I don't see any description of what we're solving by embedding them in the cluster CR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- same as docs: add usage guide for lvmo [skip ci] #100 (comment)
- it's mostly about having to update status on the CR by multiple controllers which'll result into parallel CR update requests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want the operator to have the ability to generate the LVMVolumeGroup CRs based on the LVMCluster CR. This allows us better control and validation as discussed earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's mostly about having to update status on the CR by multiple controllers which'll result into parallel CR update requests
The problem of updating the status seems the same either way. I would agree that the operator should own updating the status on the LVMVolumeGroup CRs, but that doesn't mean the admin can't create those CRs.
We want the operator to have the ability to generate the LVMVolumeGroup CRs based on the LVMCluster CR. This allows us better control and validation as discussed earlier.
What control and validation is needed? These are the details I'm missing. It seems any meaningful validation must happen on nodes by the vgmanager, not in the operator. The operator doesn't have insight to storage actually available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be something as simple as making sure there are not 2 entries trying to use up all the disks on all the nodes. It allows the operator to perform a first level validation before the vgmanager controller starts to create the VGs on the nodes. There could be other validations that we may perform in the future as the operator grows. IMO it is better to prevent issues than to try to fix things especially when dealing with local storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The vgmanager will process the LVMVolumeGroup CRs serially, right? This means that if there is a conflict, the first one processed will win and the later ones may not have any devices available to add to its VG. This doesn't seem like an error condition, it just seems like the reality that if the admin wants multiple volume groups they will need to define the volume groups wisely. Why should conflicts be an error condition rather than a first-one-wins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple nodes and if multiple CRs are created at the same time, can we guarantee that all the nodes will process them in the same order?
If different nodes process the CRs in a different order, we cannot know which VGs can exist on which nodes. As the Storageclasses will have been created by then, potentially PVs can be created and then cleaning up becomes difficult. Hence I want to have a level of control on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API server will notify controllers about the CRs in the order of creation, so I expect the ordering will be consistent.
As discussed on the call, I'm ok to keep this behavior for now until we give this more time to bake. In 4.10 it's anyway simplified with a single VG supported. I also don't see upgrades being affected negatively if we do decide later to remove the VGs from the LVMCluster CR. The LVMVolumeGroup CRs would have already been created by the operator, and the operator could just ignore the VGs in the LVMCluster after that.
I can see if that if we have compelling validation error-checking scenarios for clusters with multiple VGs we could keep this design. But fundamentally I still recommend not including the VGs in the LVMCluster CR for these reasons:
- Bundling multiple resources into a single CR and forcing it to be a singleton is not a common K8s pattern I would expect on CRDs. A volume group is a logical unit, and therefore each VG makes sense as a single CR instance.
- Validation must be done on the nodes anyway by the vgmanager where the devices can be fully validated. The operator just can't fully validate available devices.
- Instead of assuming multiple VGs would cause conflicts as an error condition, allow the first VG to consume the device. The later VG will simply not find the available device. The later VG won't be in error, it just might not have devices available to fulfill storage requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on the call, I do not see any advantage to having the admin create the LVMVolumeGroup CRs. The actual interface exposed to the admin is the deviceClass - that is what we are internally mapping to lvm volume groups. We could decide to change this completely in the future.
The additional level of validation is something that makes things easier with setup especially when targeting hundreds of nodes. The node level validation done by the Vg manager is a different level of validation.
A volume group is a logical unit, and therefore each VG makes sense as a single CR instance.
Yes, but that is not something the user needs to know. As far as they are concerned they get dynamic local storage and the exposing the lower layers to them is not really required. The LVMCluster CR will be fine tuned in future to provide this experience.
Instead of assuming multiple VGs would cause conflicts as an error condition, allow the first VG to consume the device. The later VG will simply not find the available device. The later VG won't be in error, it just might not have devices available to fulfill storage requests.
If there are failures in reconciliation for one VG or the reconcile requests are, for some reason, sent in a different order (network issues etc), this can lead to different VGs on different nodes. Since cleaning up is not easy, it makes more sense to prevent such conflicts as early as possible.
As discussed, this is the approach we are following now. Any further decisions will be based on user experience and feedback and will be taken up later.
Signed-off-by: Leela Venkaiah G <lgangava@redhat.com>
@leelavg: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Signed-off-by: Leela Venkaiah G lgangava@redhat.com