[BUG] Longhorn 1.2.0 - wrong volume permissions inside container / broken fsGroup #2964
Comments
I can confirm this happens for automatically provisioned volumes as well (e.g. via StatefulSet volumeClaimTemplates).
StatefulSet example:
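The StatefulSet example itself is not preserved above; a minimal one of the kind being described might look like the following sketch (the name, image, and storage class are illustrative assumptions, not taken from the original report):

```yaml
# Hypothetical repro: a non-root StatefulSet with fsGroup set,
# using a dynamically provisioned Longhorn volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: fsgroup-repro          # illustrative name
spec:
  serviceName: fsgroup-repro
  replicas: 1
  selector:
    matchLabels:
      app: fsgroup-repro
  template:
    metadata:
      labels:
        app: fsgroup-repro
    spec:
      securityContext:
        runAsUser: 1000        # non-root user
        fsGroup: 1000          # should chown the volume mount to gid 1000
      containers:
      - name: app
        image: busybox:1.33    # illustrative image
        command: ["sh", "-c", "touch /data/ok && sleep 3600"]
        volumeMounts:
        - mountPath: /data
          name: data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
```

On an affected 1.2.0 cluster, the `touch` would fail with "permission denied" because the volume is not chowned to group 1000.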
I seem to be hitting this as well. Verified using the repro examples above. Workloads that were working fine yesterday on Longhorn 1.1 are now throwing "permission denied" on their Longhorn volumes with 1.2.
Has anyone found a workaround for this issue, or can we expect an emergency patch? This seems to be a showstopper for me and is currently blocking us. Every pod deployed now gives a message similar to:
Agreed, this is a showstopper. I had to roll back to 1.1.x. Fortunately this was only a dev environment, so it's no biggie to blow away the storage.
Thanks, guys! We are investigating this issue.
@bkupidura @HubbeKing @mstrent @bdobsonca Are the problematic volumes newly created after upgrading to Longhorn v1.2.0, or did they already exist since Longhorn v1.1.2?
Our QA, @khushboo-rancher, confirmed that this problem only happens with newly created volumes.
Workaround: manually add a new flag to the csi-provisioner deployment (the manifest later in this thread shows it as `--default-fstype=ext4`).
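As a sketch, applying the workaround means editing the csi-provisioner Deployment in `longhorn-system` so its container args include the flag (the fragment below is illustrative; only the last arg is added, the rest match the manifest shown later in this thread):

```yaml
# Fragment of the csi-provisioner container spec in longhorn-system;
# the final arg is the manually added workaround flag.
args:
  - --v=2
  - --csi-address=$(ADDRESS)
  - --timeout=1m50s
  - --leader-election
  - --leader-election-namespace=$(POD_NAMESPACE)
  - --default-fstype=ext4   # workaround: provision PVs with an explicit fsType
```

Note that a commenter further down reports the bug persisting for them even with this workaround in place.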
Root cause: newly provisioned PVs are created with an empty fsType (the upgraded csi-provisioner no longer applies a default), and kubelet skips the fsGroup ownership change for volumes without an fsType.
Pre Ready-For-Testing Checklist
Test steps:
Verified with Longhorn master and v1.2.x-head. Validation: Pass.
Is pointing people to a workaround enough if 1.2.1 is still weeks away? This is a fatal enough flaw that I'd think 1.2 should either be pulled or re-released with the fix.
@mstrent While this issue indeed has high impact:
Retagging or re-releasing a version is generally a bad idea, since there is no way to upgrade from and to the same version, and it won't help any existing users who have already hit the bug. Lots of things would get mixed up if we chose to do that. Sorry for the inconvenience. v1.2.1 will be there soon.
A retag is usually a no-go, but you can still cut a release. My point is that you are a storage provider, and you should be able to release quick bug fixes like this. 1.2.x is for patch releases; they shouldn't need to be planned for, just release!
This bug still exists even with the workaround in #2964 (comment) or a similar one.
Deployment for csi-provisioner (`$ k get deployment/csi-provisioner -n longhorn-system -o yaml`):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    driver.longhorn.io/kubernetes-version: v1.21.4+k3s1
    driver.longhorn.io/version: v1.2.0
    longhorn.io/last-applied-tolerations: '[]'
  creationTimestamp: "2021-09-15T00:42:04Z"
  generation: 2
  labels:
    app: csi-provisioner
    longhorn.io/managed-by: longhorn-manager
  name: csi-provisioner
  namespace: longhorn-system
  resourceVersion: "3983839"
  uid: c12397e0-415d-4061-805b-ff49808c602c
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: csi-provisioner
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: csi-provisioner
    spec:
      containers:
      - args:
        - --v=2
        - --csi-address=$(ADDRESS)
        - --timeout=1m50s
        - --leader-election
        - --leader-election-namespace=$(POD_NAMESPACE)
        - --default-fstype=ext4
        env:
        - name: ADDRESS
          value: /csi/csi.sock
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.2
        imagePullPolicy: IfNotPresent
        name: csi-provisioner
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /csi/
          name: socket-dir
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: longhorn-service-account
      serviceAccountName: longhorn-service-account
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/lib/kubelet/plugins/driver.longhorn.io
          type: DirectoryOrCreate
        name: socket-dir
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2021-09-15T00:42:45Z"
    lastUpdateTime: "2021-09-15T00:42:45Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-09-15T00:42:04Z"
    lastUpdateTime: "2021-09-15T00:58:38Z"
    message: ReplicaSet "csi-provisioner-669c8cc698" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
```
#1210 (comment): Culprit: multipathd
I also encountered this bug. I tried to install HashiCorp Vault through a Helm chart. The chart does not allow custom fsGroup settings, and I can see that the main directory is not owned by the vault user.
@GeroL |
In my opinion, if you can't/won't ship a quick patch release, 1.2.0 should be pulled, as it's broken in a non-obvious way. In my case I thought there was some kind of issue with the GitLab Helm chart and wasted literally hours on this; I'm afraid I won't be alone. Looking forward to upgrading to 1.2.1.
Describe the bug
After upgrading Longhorn to 1.2.0, some containers are unable to start correctly (e.g. Prometheus).
The root cause looks like wrong Longhorn volume permissions inside the container when the container is not running as root.
Even with fsGroup specified, permissions are not set for the volume.
To Reproduce
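The original reproduction steps are not preserved here; a minimal PVC-plus-pod pair along these lines should exercise the bug on a Longhorn-backed volume (names and image are illustrative assumptions):

```yaml
# Hypothetical minimal repro: a non-root pod with fsGroup on a Longhorn PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc              # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod              # illustrative name
spec:
  securityContext:
    runAsUser: 1000            # non-root user
    fsGroup: 1000              # expected to chown /data to gid 1000
  containers:
  - name: test
    image: busybox:1.33        # illustrative image
    command: ["sh", "-c", "id && ls -ld /data && touch /data/ok"]
    volumeMounts:
    - mountPath: /data
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: repro-pvc
```

On an affected cluster, `ls -ld /data` would show the mount still owned by root:root and the `touch` would fail.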
Expected behavior
When fsGroup is provided, it should be used to chown the destination mount.
Environment:
Additional context