
[Bug] 0.25.0 maxscale statefulset failing to start #381

Closed

pasztorl opened this issue Feb 13, 2024 · 29 comments · Fixed by #427
Labels
bug Something isn't working

Comments

@pasztorl

pasztorl commented Feb 13, 2024

Documentation

  • [x] I acknowledge that I have read the relevant documentation.

Describe the bug

I'm testing the latest operator by creating a MariaDB replication setup with MaxScale. The MaxScale StatefulSet pods fail to start because of a permission problem.
Here is the log output from the maxscale pod:

2024-02-13 13:53:51   notice : MaxScale will be run in the terminal process.
2024-02-13 13:53:51   notice : /etc/config/..2024_02_13_13_53_43.4123823925/maxscale.cnf.d does not exist, not reading.
2024-02-13 13:53:51   notice : /var/lib/maxscale/maxscale.cnf.d does not exist, not reading.
2024-02-13 13:53:51   notice : Using up to 2.29GiB of memory for query classifier cache
2024-02-13 13:53:51   notice : syslog logging is disabled.
2024-02-13 13:53:51   notice : maxlog logging is enabled.
2024-02-13 13:53:51   error  : Failed to create directory '/var/lib/maxscale/maxscale.cnf.d': 13, Permission denied
2024-02-13 13:53:51   alert  : Can't access '/var/lib/maxscale/maxscale.cnf.d'.: No such file or directory

I see that the pod has a volume mounted at this directory; the problem is that it tries to create a directory inside this mountpoint as the maxscale user, which is not allowed by the mountpoint's filesystem permissions.

Here is the StatefulSet spec that the operator created for MaxScale:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test1-mariadb-maxscale
  namespace: test1-mariadb
  ...
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: test1-mariadb-maxscale
      app.kubernetes.io/name: maxscale
  template:
    metadata:
      name: test1-mariadb-maxscale
      namespace: test1-mariadb
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: test1-mariadb-maxscale
        app.kubernetes.io/name: maxscale
    spec:
      volumes:
        - name: config
          secret:
            secretName: test1-mariadb-maxscale-config
            defaultMode: 420
      containers:
        - name: maxscale
          image: mariadb/maxscale:23.08
          command:
            - maxscale
          args:
            - '--config'
            - /etc/config/maxscale.cnf
            - '-dU'
            - maxscale
            - '-l'
            - stdout
          ports:
            - name: admin
              containerPort: 8989
              protocol: TCP
          resources: {}
          volumeMounts:
            - name: storage
              mountPath: /var/lib/maxscale
            - name: config
              mountPath: /etc/config
          livenessProbe:
            httpGet:
              path: /
              port: 8989
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 8989
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: test1-mariadb-maxscale
      serviceAccount: test1-mariadb-maxscale
      automountServiceAccountToken: false
      securityContext: {}
      schedulerName: default-scheduler
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: storage
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Mi
        volumeMode: Filesystem
  serviceName: test1-mariadb-maxscale-internal
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  revisionHistoryLimit: 10

I've also tried image: mariadb/maxscale:23.08.4 with the same result.

MariaDB was created with this manifest:

apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
  name: test1-mariadb
  namespace: test1-mariadb
spec:
  ephemeralStorage: false
  image: mariadb:11.2.3
  maxScale:
    connection:
      port: 3306
      secretName: test1-mariadb-connection
    enabled: true
    kubernetesService:
      type: ClusterIP
    replicas: 1
  maxScaleRef:
    name: test1-mariadb-maxscale
    namespace: test1-mariadb
  metrics:
    enabled: true
    exporter:
      image: prom/mysqld-exporter:v0.15.1
      port: 9104
    passwordSecretKeyRef:
      key: password
      name: test1-mariadb-metrics-password
    serviceMonitor: {}
    username: test1-mariadb-metrics
  port: 3306
  primaryService:
    type: ClusterIP
  replicas: 3
  replication:
    enabled: true
    primary:
      automaticFailover: true
      podIndex: 0
    probesEnabled: false
    replica:
      connectionRetries: 10
      connectionTimeout: 10s
      gtid: CurrentPos
      syncTimeout: 10s
      waitPoint: AfterSync
    syncBinlog: true
  rootEmptyPassword: false
  rootPasswordSecretKeyRef:
    key: password
    name: test1-mariadb-root-password
  secondaryService:
    type: ClusterIP
  service:
    type: ClusterIP
  volumeClaimTemplate:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
@pasztorl pasztorl added the bug Something isn't working label Feb 13, 2024
@pasztorl
Author

I tried to set up an initContainer (to do the chown) inside the MaxScaleSpec. The CRD accepted it, but the operator seems to ignore it.

@DrZoidberg09

Same problem here

@K-MeLeOn
Contributor

K-MeLeOn commented Feb 18, 2024

Mhh, strange, same problem, even with podSecurityContext set to:

maxScale:
   podSecurityContext:
      runAsUser: 0

@pasztorl
Author

But another problem (related to the operator) is that it ignores the initContainer.

@mmontes11
Member

mmontes11 commented Feb 18, 2024

Hey there! Thanks for reporting.

As you mentioned, it looks like a problem with the image that can be mitigated with an initContainer. For now I would solve it on the operator side until the image gets fixed.

Adding support for initContainers to MaxScale would be a one-liner, just add it to this builder:

func (b *Builder) maxscalePodTemplate(mxs *mariadbv1alpha1.MaxScale, labels map[string]string) (*corev1.PodTemplateSpec, error) {

Like we currently do for MariaDB:

func (b *Builder) mariadbPodTemplate(mariadb *mariadbv1alpha1.MariaDB, labels map[string]string) (*corev1.PodTemplateSpec, error) {

Contributions welcome! I will do it myself before the next release if needed 👍🏻
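
From the user side, a rough sketch of what the spec could look like once this lands (the field name and placement are assumptions here, not the final API; it would mirror what MariaDB already supports):

apiVersion: mariadb.mmontes.io/v1alpha1
kind: MaxScale
metadata:
  name: test1-mariadb-maxscale
spec:
  # assumed field, analogous to the existing MariaDB initContainers support
  initContainers:
    - name: volume-hack
      image: busybox
      command:
        - /bin/sh
        - -c
        - chown -R 998:996 /var/lib/maxscale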

@mmontes11
Member

Out of curiosity, which storage are you using?

@K-MeLeOn
Contributor

K-MeLeOn commented Feb 18, 2024

Mhh, strange, same problem, even with podSecurityContext set to:

maxScale:
   podSecurityContext:
      runAsUser: 0

It works now. I use Longhorn for local storage, but as far as I can see, I need a minimum of 300Mi of storage and the security context set for it to work (same as needed for Galera with Longhorn). Hope it can help.

  maxScale:
    enabled: true
    config:
      volumeClaimTemplate:
        resources:
          requests:
            storage: 300Mi
        storageClassName: longhorn-encrypted-global-noretain
        accessModes:
          - ReadWriteMany
    podSecurityContext:
      runAsUser: 0 # For galera, caused by longhorn storage permission

@mmontes11
Member

@K-MeLeOn it does, thank you!

Actually, some Longhorn users reported the same error with Galera:

* https://github.com/mariadb-operator/mariadb-operator/blob/main/docs/GALERA.md#permission-denied-writing-galera-configuration

Not a Longhorn user myself unfortunately; maybe there is something we could do to prevent this? Will take a closer look at the Longhorn docs.

@K-MeLeOn
Contributor

K-MeLeOn commented Feb 18, 2024

@K-MeLeOn it does, thank you!

Actually, some Longhorn users reported the same error with Galera:

* https://github.com/mariadb-operator/mariadb-operator/blob/main/docs/GALERA.md#permission-denied-writing-galera-configuration

Not a Longhorn user myself unfortunately; maybe there is something we could do to prevent this? Will take a closer look at the Longhorn docs.

My favorite longhorn hack is to add a busybox image to grant permission to the right user or group.
It's ugly but it works:

- name: volume-hack
  image: busybox
  command:
    - /bin/sh
    - -c
    - chown -R USER_ID:GROUP_ID /var/lib/maxscale
  securityContext:
    runAsUser: 0

@mmontes11
Member

mmontes11 commented Feb 18, 2024

My favorite longhorn hack is to add a busybox image to grant permission to the right user or group.
It's ugly but it works

This will be doable after we add support for initContainers in MaxScale in this release. Maybe we can consider adding it by default from the operator until the image gets fixed.

@DrZoidberg09

Unfortunately, runAsUser: 0 did not work for me. I tried it with Hetzner-CSI (RWO) and with NFS-CSI (RWX), same result.

@K-MeLeOn
Contributor

K-MeLeOn commented Feb 18, 2024

Unfortunately, runAsUser: 0 did not work for me. I tried it with Hetzner-CSI (RWO) and with NFS-CSI (RWX), same result.

Do you have a minimum of 300Mi of storage set on your volumeClaimTemplate with Longhorn?

I haven't tried it with Hetzner CSI, but its minimum volume size is 10GB, so it should theoretically work.

@DrZoidberg09

Yes, I used your example from above for NFS and 10 Gi for Hetzner. No success.

@K-MeLeOn
Contributor

Mmmh, after many recreations I got the error again with RWO; with RWX it's working. This is strange behavior, I think we need to wait for the initContainer update.

@mmontes11
Member

mmontes11 commented Feb 18, 2024

Very strange indeed. It would be very much appreciated if someone could try the following steps:

  • Downscale mariadb-operator to avoid collisions:

    kubectl scale deployment mariadb-operator --replicas=0

  • Edit the MaxScale StatefulSet and add the following initContainer:

    - name: volume-hack
      image: busybox
      command:
        - /bin/sh
        - -c
        - chown -R USER_ID:GROUP_ID /var/lib/maxscale
      securityContext:
        runAsUser: 0

  • Wait for the StatefulSet rolling upgrade

I guess it should work, but just to confirm.

@K-MeLeOn
Contributor

K-MeLeOn commented Feb 18, 2024

[...]

I guess it should work, but just to confirm.

It works with:

      initContainers:
        - name: volume-hack
          image: busybox
          command:
            - /bin/sh
            - -c
            - chown -R 998:996 /var/lib/maxscale
          resources: {}
          volumeMounts:
            - name: storage
              mountPath: /var/lib/maxscale

Looking up the user and group ID:

less /etc/passwd

(screenshot of the /etc/passwd output)

Applying the initContainers config to the StatefulSet.

StatefulSet status:
(screenshot)

Folder tree:
(screenshot)

MaxScale log:
(screenshot)

But as far as I can see, /usr/lib64/maxscale is not owned by maxscale:maxscale, so permissions probably need to be hacked there too (I'm not sure; the modules load successfully, so maybe it isn't necessary to write to this directory?):

      initContainers:
        - name: volume-hack
          image: busybox
          command:
            - /bin/sh
            - -c
            - chown -R 998:996 /var/lib/maxscale && chown -R 998:996 /usr/lib64/maxscale
          resources: {}
          volumeMounts:
            - name: storage
              mountPath: /var/lib/maxscale

@pasztorl
Author

pasztorl commented Feb 18, 2024

I guess it should work, but just to confirm.

Works for me, thanks! My storage is ceph/rbd/ext4.

After the init container finished, I started the operator. The MariaDB and MaxScale CRs went to Ready, and I can see the new config in the maxscale pod's log.

Now the problem is I can't log in to the admin console with the password from the adminPasswordSecretKeyRef. I also tried to use maxctrl and got permission denied.

There is a /var/lib/maxscale/passwd file with the admin user and a password hash.

@DrZoidberg09

DrZoidberg09 commented Feb 19, 2024

Same problem here again. However, it is not a matter of the password. I'm using mariadb-operator / (the maxscale-admin secret password).

The service is exposed via an Ingress; {URL} = my URL (removed for privacy reasons).

However, when I log in, I immediately get logged out with this error in the console:

ERROR [https://{URL}/js/app~06837ae4.65c68fab.js:1:fetchLoggedInUserAttrs]
AxiosError: "Request failed with status code 401"
  code: "ERR_BAD_REQUEST"
  config: { timeout: 0, xsrfCookieName: "XSRF-TOKEN", xsrfHeaderName: "X-XSRF-TOKEN", … }
  request: XMLHttpRequest { readyState: 4, timeout: 0, withCredentials: false, … }
  response: { data: {…}, status: 401, statusText: "", … }
  stack:
    o@https://{URL}/js/app~2a42e354.252e49c0.js:1:1881
    X@https://{URL}/js/app~2a42e354.252e49c0.js:1:22807
    h@https://{URL}/js/app~2a42e354.252e49c0.js:1:25959

@mmontes11
Member

Thanks a lot for testing this @K-MeLeOn ! Very much appreciated 🙏🏻 I can confidently add the initContainer for the next release now, just as a temporary measure before the image gets fixed.

I'm not sure; the modules load successfully, so maybe it isn't necessary to write to this directory?

Not entirely sure about this, I will ask and add the permissions in the initContainer if needed as you suggested.

@mmontes11
Member

mmontes11 commented Feb 19, 2024

After the init container finished, I started the operator. The MariaDB and MaxScale CRs went to Ready, and I can see the new config in the maxscale pod's log.

Good news @pasztorl! If the resources are in Ready status, we are good to go. I will keep this issue open until we add the initContainer.

Now the problem is I can't log in to the admin console with the password from the adminPasswordSecretKeyRef. I also tried to use maxctrl and got permission denied.

@pasztorl / @DrZoidberg09

That's a different problem, could you please file another issue for it? The operator uses these credentials to create resources in the MaxScale API, so if you don't see any error log, they should be valid. Be sure to use the mariadb-operator user and the following password:

kubectl get secret maxscale-admin -o jsonpath="{.data.password}" | base64 -d

Also, take into account that if you are accessing MaxScale via an Ingress controller, the headers might have been modified, which might result in a 401. Please try to access the MaxScale instance directly via a port-forward first to understand where the problem is.

@pasztorl
Author

I'm connecting directly to the admin port, not using an Ingress (yet).
I also used the password from that secret, as you mentioned above.

Here is the maxscale log:

...
2024-02-18 22:41:43   notice : Service 'rconn-master-router' started (3/3)
2024-02-18 22:41:43   notice : MaxScale started with 8 worker threads.
2024-02-18 22:41:44   notice : Read 11 user@host entries from 'mariadb-0' for service 'rconn-slave-router'.
2024-02-18 22:41:44   notice : Read 11 user@host entries from 'mariadb-0' for service 'rconn-master-router'.
2024-02-18 22:41:44   notice : Read 11 user@host entries from 'mariadb-0' for service 'rw-router'.
2024-02-18 22:41:58   warning: Authentication failed for 'admin', using password. Request: GET /v1/servers
2024-02-18 22:42:18   warning: Authentication failed for 'admin', using password. Request: GET /v1/servers
2024-02-18 22:43:41   notice : [mariadbmon] Selecting new primary server.
2024-02-18 22:43:41   notice : [mariadbmon] Setting 'mariadb-0' as primary.
2024-02-18 22:44:11   warning: Authentication failed for 'admin', using password. Request: GET /v1/servers
2024-02-18 22:44:19   notice : [mariadbmon] Selecting new primary server.
2024-02-18 22:44:19   notice : [mariadbmon] Setting 'mariadb-0' as primary.
2024-02-18 22:45:14   warning: Authentication failed for 'mariadb-operator', using password. Request: GET /servers
2024-02-18 22:45:14   warning: Authentication failed for 'mariadb-operator', using password. Request: GET /monitors/mariadbmon-monitor
2024-02-18 22:45:14   warning: Authentication failed for 'mariadb-operator', using password. Request: GET /services
2024-02-18 22:45:14   warning: Authentication failed for 'mariadb-operator', using password. Request: GET /listeners
2024-02-18 22:45:14   warning: Authentication failed for 'mariadb-operator', using password. Request: GET /users/inet/mariadb-operator
2024-02-18 22:45:14   warning: Authentication failed for 'admin', using password. Request: POST /users/inet
...

So when the maxscale StatefulSet was created I stopped the operator, and once the permissions were fixed I started it again. Is it possible that the operator could not continue its job and missed something?
Of course I can create another issue, but if the problem is that the operator was interrupted mid-job because of the permission problem, it relates to this issue, so I'll leave the comment here. Once this is fixed I'll drop the whole setup, start again from the beginning, and see whether that solves this problem too. Is that ok?

@pasztorl
Author

Update: I dropped the namespace and recreated the MariaDB CR. When the MaxScale StatefulSet was created I stopped (kill -STOP) the operator, fixed the permissions, then ran kill -CONT on the operator process.
I still can't access it with the "admin" user, but I can log in with the mariadb-operator user using the admin password from the admin secret. So it looks like this secret holds the password for the mariadb-operator user, not for the admin user?

Related logs from the maxscale pod:

2024-02-19 10:06:01   notice : Create network user 'mariadb-operator'
2024-02-19 10:06:01   notice : Deleted network user 'admin'

In the operator logs:

{"level":"info","ts":1708337161.7916281,"msg":"Configuring admin in MaxScale Pod","controller":"maxscale","controllerGroup":"mariadb.mmontes.io","controllerKind":"MaxScale","MaxScale":{"name":"mariadb-maxscale","namespace":"mariadb-test"},"namespace":"mariadb-test","name":"mariadb-maxscale","reconcileID":"0e36cf39-afb0-4d9d-9b2f-3bc71203b0a6","pod":"mariadb-maxscale-0"}
{"level":"info","ts":1708337161.8184333,"msg":"Initializing MaxScale Pod","controller":"maxscale","controllerGroup":"mariadb.mmontes.io","controllerKind":"MaxScale","MaxScale":{"name":"mariadb-maxscale","namespace":"mariadb-test"},"namespace":"mariadb-test","name":"mariadb-maxscale","reconcileID":"0e36cf39-afb0-4d9d-9b2f-3bc71203b0a6","pod":"mariadb-maxscale-0"}

@mmontes11
Member

mmontes11 commented Feb 19, 2024

So it looks like this secret holds the password for the mariadb-operator user, not for the admin user?

You got it right. The secret contains the password, not the user itself:
https://github.com/mariadb-operator/mariadb-operator/blob/main/docs/API_REFERENCE.md#maxscaleauth

adminUsername specifies the username with admin privileges. adminPasswordSecretKeyRef contains a reference to the password for that user.
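
For reference, a minimal sketch of how that maps onto the spec (the auth block and the secret name/key here are placeholders; check the API reference above for the exact fields):

auth:
  adminUsername: mariadb-operator
  adminPasswordSecretKeyRef:
    name: maxscale-admin
    key: password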

The logs above are expected ☝🏻 .

@K-MeLeOn
Contributor

K-MeLeOn commented Feb 19, 2024

Thanks a lot for testing this @K-MeLeOn ! Very much appreciated 🙏🏻 I can confidently add the initContainer for the next release now, just as a temporary measure before the image gets fixed.

I'm not sure; the modules load successfully, so maybe it isn't necessary to write to this directory?

Not entirely sure about this, I will ask and add the permissions in the initContainer if needed as you suggested.

Oops, I just realized this part can't work, because /usr/lib64/maxscale is part of the container image itself, not a mounted volume, so the busybox init container can't touch it. There's already a module in there (libpp_sqlite.so):

 && chown -R 998:996 /usr/lib64/maxscale

@mmontes11
Member

We have added support for initContainers in this PR:

The operator adds one init container that changes the permissions of /var/lib/maxscale, so it works properly with all storage providers. Kudos to @K-MeLeOn!
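
Roughly speaking, the rendered StatefulSet ends up with something along these lines (a sketch based on the hack above, not the exact spec the operator generates):

initContainers:
  - name: init
    image: busybox
    command:
      - /bin/sh
      - -c
      - chown -R 998:996 /var/lib/maxscale
    volumeMounts:
      - name: storage
        mountPath: /var/lib/maxscale
    securityContext:
      runAsUser: 0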

Closing! 🙏🏻 This will be released in v0.0.26, feel free to reopen if you are still facing issues.

@lwj5

lwj5 commented Feb 29, 2024

Hi all,

Could we use fsGroup instead?

For CSI/storage providers that do not support fsGroup, we could allow the user to add a volume initContainer. This avoids the need for a chown -R and an additional container for most providers.

This is what the Bitnami Helm charts use as well, with volume initContainers disabled by default.

To allow MaxScale to work, just use:

      securityContext:
        fsGroup: 996
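
In CR terms that would presumably be the pod-level security context, something like (a sketch; the exact placement of the field is an assumption):

maxScale:
  enabled: true
  podSecurityContext:
    fsGroup: 996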

@mmontes11
Member

mmontes11 commented Mar 2, 2024

Hey @lwj5 !

Thanks for your suggestion. I've managed to spin up a MaxScale without the chown initContainer by adding the following to the spec:

  podSecurityContext:
    fsGroup: 996

I am using the Synology CSI driver, which by default has fsGroupPolicy = ReadWriteOnceWithFSType:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.san.synology.com
spec:
  attachRequired: true
  fsGroupPolicy: ReadWriteOnceWithFSType
  podInfoOnMount: true
  requiresRepublish: false
  storageCapacity: false
  volumeLifecycleModes:
  - Persistent

For context, the CSI driver specification allows delegating the volume permission change (driven by the pod's fsGroup) to the kubelet:
https://kubernetes-csi.github.io/docs/support-fsgroup.html

Most CSI drivers nowadays support compatible values of fsGroupPolicy by default:

Not sure about Longhorn though, see the following issues:

I think it will be more sensible to default to fsGroup instead of an initContainer running as root. The latter implies extra container orchestration and could be seen as a red flag in terms of security. In any case, user-defined initContainers are now supported!

I will be moving forward with this change and I will add a troubleshooting section in the MaxScale docs.

Thanks! Let me know what you think.
