This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/nextcloud] Image stuck at Initializing NextCloud... when PVC is attached #22920

Closed
mikeyGlitz opened this issue Jun 24, 2020 · 16 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@mikeyGlitz

Describe the bug

When the helm chart is bringing up NextCloud, the application does not get past the log message

Initializing Nextcloud 17.0.7...

Version of Helm and Kubernetes:


helm: v3.2.1
kubernetes: v1.18.4+k3s1

Which chart:

stable/nextcloud

What happened:

Namespace is created.
Helm creates persistent-volume-claim
Helm instantiates MariaDB using bitnami/mariadb chart
Helm instantiates Nextcloud container
Nextcloud container starts
Nextcloud container does not get past

Initializing Nextcloud 17.0.7...

What you expected to happen:

Nextcloud was supposed to finish initialization
Nextcloud files were supposed to be copied with correct permissions to the PVC

How to reproduce it (as minimally and precisely as possible):

Install the charts with the following commands:

helm install nfs stable/nfs-client-provisioner --namespace=nas \
  --set nfs.server=x.x.x.x --set nfs.path=/mnt/external

helm install files stable/nextcloud -f values.yaml --namespace=nextcloud

values.yaml

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: traefik
    cert-manager.io/cluster-issuer: cluster-issuer
    traefik.ingress.kubernetes.io/redirect-entry-point: https
    traefik.frontend.passHostHeader: "true"
  tls:
    - secretName: nextcloud-app-tls
      hosts:
        - files.haus.net
nextcloud:
  host: files.haus.net
  username: admin
  password: P@$$w0rd!
internalDatabase:
  enabled: false
mariadb:
  enabled: yes
  password: P@$$w0rd!
  user: nextcloud
  name: nextcloud
persistence:
  enabled: yes
  storageClass: nfs-client
  size: 1Ti
@11jwolfe2

11jwolfe2 commented Jun 29, 2020

I have also been trying to get this install to work with a PV and PVC, with no luck. If I do it without a PV and PVC it works, but as soon as I enable the PV it says the nextcloud directory isn't found, so I create the directory. Then it says "Error: failed to create subPath directory for volumeMount "nextcloud-data" of container "nextcloud"". Does anyone have any ideas about this?

@derdrdirk

I am having the same issue. I also use nfs-client as the storageClass, which might be the cause of this bug? IIRC I used a manually created PV some time back and it worked.

Have you figured out how to make this work?

@almahmoud

almahmoud commented Jun 30, 2020

Not sure if we are having the same issue, but I will detail my investigation so far into using persistence.existingClaim, in case it helps people progress in their own investigations and/or the context helps someone more knowledgeable provide some help, as I have only worked with k8s for a year or so.

From what I could see, the container creation process errors out with:

Error: failed to start container "nextcloud": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/49c19090-14d6-4bee-b774-ca24b0ddd259/volume-subpaths/jun30third-nextcloud-data-pv/nextcloud/0\\\" to rootfs \\\"/var/lib/docker/overlay2/40dca10bcad3a57d61d35d40d0bd897f6d2322c3a5d9f615d2a90a38d7fe4cd5/merged\\\" at \\\"/var/lib/docker/overlay2/40dca10bcad3a57d61d35d40d0bd897f6d2322c3a5d9f615d2a90a38d7fe4cd5/merged/var/www\\\" caused \\\"no such file or directory\\\"\"": unknown

I looked on the node at the time of the directory creation, and a few things to note:

  • the source directory always exists and the path is correct
  • the destination directory up to (.*)/merged is created when the container is being spun up, but I could never see the merged directory inside it (I didn't have the container ID beforehand, so I relied on watch commands and manually looking on the node; I can't guarantee it was never there, only that I never saw it)

The only lead I've found so far as to why this might be happening is kubernetes/kubernetes#61545 (comment), and the following comment links kubernetes/kubernetes#61563 (comment). My guess is that this is related to the second issue in the last comment (i.e. kubernetes/kubernetes#61545), given that the config mounts are nested inside the directory mount. However, since the error is on subpath /nextcloud/0 of the container (which I have verified is the root subpath), this might not be true, but it is my best lead so far.

I'm currently poking at this by manually changing the specifications to see if any configuration works (i.e. trying different variations of the mount path nesting to see if I can get it to start up manually before figuring out how to correct the chart). In the meantime, if anyone else finds a solution and/or it seems I'm going down the wrong trail, please let me know!

Update: it is not the configmap causing this in my case, it's the nested mounts: https://github.com/helm/charts/blob/master/stable/nextcloud/templates/deployment.yaml#L289. Additionally, the problem only appears after the first restart: it seems the mounting works the first time, but once things get written to the volumes and the container restarts, the bind mounts fail for the new container with the above error. This problem might be specific to our storage class (we're using an RClone CSI driver which FUSE-mounts an S3 bucket) and different from yours, although I haven't tried it with an NFS layer on top yet to confirm. This does seem to be different from what you're seeing, though (sorry for hijacking your issue).

In case this comes up for anyone else: the current workaround is keeping only the root directory mount (which is enough to back up everything else, as the other paths are nested inside it), and that seems to fix the problem.
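
For illustration, a rough sketch of what that workaround looks like against the chart's deployment template (the volume name, mount path, and subPath below are assumptions based on the linked template, not verbatim chart output):

volumeMounts:
  # keep only the root mount; everything else lives underneath it
  - name: nextcloud-data      # assumed volume name
    mountPath: /var/www
    subPath: root
  # nested mounts such as /var/www/html, /var/www/html/data and
  # /var/www/html/config are dropped; they sit under the root mount anyway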

@11jwolfe2

Okay, I got it working! I am using an OpenMediaVault NFS share for all of my persistent volumes. I set the shares up with the following options, and it now works without any issues using the regular helm install, no extra steps required.

Settings for the NFS share (an example /etc/exports entry is sketched below):

  • rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000
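
For illustration, that option string corresponds to an /etc/exports entry along these lines (the export path and client range here are placeholders, not values taken from this thread):

/srv/nfs/kubernetes  10.0.0.0/8(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)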

@almahmoud

almahmoud commented Jul 1, 2020

It also works with nfs-server-provisioner (https://github.com/helm/charts/tree/master/stable/nfs-server-provisioner) with expected values.

Specific values we're using

helm install nfs-provisioner stable/nfs-server-provisioner \
    --namespace myns \
    --set persistence.enabled=true \
    --set persistence.storageClass="ebs" \
    --set persistence.size=100Gi \
    --set storageClass.create=true \
    --set storageClass.reclaimPolicy="Delete" \
    --set storageClass.allowVolumeExpansion=true

and NextCloud snippet:

persistence:
  enabled: true
  storageClass: nfs
  accessMode: "ReadWriteMany"

I'll open a separate issue for the existingClaim problem

@mikeyGlitz
Author

Okay, I got it working! I am using an OpenMediaVault NFS share for all of my persistent volumes. I set the shares up with the following options, and it now works without any issues using the regular helm install, no extra steps required.

Settings for nfs share.

  • rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000

Tried changing the line in my /etc/exports and it didn't fix the problem.

@mikeyGlitz
Author

mikeyGlitz commented Jul 4, 2020

Using the following snippets:

nfs-client-provisioner.values.yaml

nfs:
  mountOptions:
    - nfsvers=4
  server: 172.16.0.1
  path: /mnt/external

I updated my nextcloud values with the new value persistence.accessMode=ReadWriteMany.
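
For reference, a minimal sketch of the updated persistence block (the other values mirror the values.yaml from the issue description; the accessMode line is the addition):

persistence:
  enabled: true
  storageClass: nfs-client
  accessMode: ReadWriteMany
  size: 1Ti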

Also didn't work.

I have the following directories in my volume:

drwxrwxrwx 9 root     root 4096 Jul  4 01:04 ./
drwxr-xr-x 7 root     root 4096 Jul  4 01:09 ../
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 config/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 custom_apps/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 data/
drwxrwxrwx 8 www-data root 4096 Jul  4 01:08 html/
drwxrwxrwx 4 root     root 4096 Jul  4 01:04 root/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 themes/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 tmp/

@stale

stale bot commented Aug 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 3, 2020
@tomhouweling1987

Got the same problem. Tested with versions 17.0.0-apache and 19.0.1-apache. Also seeing that the dirs are owned by root:root.
When we deploy without a PVC, the installation works.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 6, 2020
@jesussancheztellomm

jesussancheztellomm commented Aug 11, 2020

Using nfs-client-provisioner works, but the main problem is that the initial rsync takes around 5 minutes to complete (at least in my tests using GCP Filestore). You can look at the entrypoint.sh file:

rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/

If you disable the readiness and liveness probes in the values, it works.
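
A hedged sketch of that values change, assuming the chart exposes livenessProbe.enabled and readinessProbe.enabled toggles (check the values.yaml of your chart version before relying on these names):

# assumed value names; verify against the chart's values.yaml
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false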

❯ k logs nextcloud-5756597dbc-nhg5m
Initializing nextcloud 17.0.8.1 ...
Initializing finished
New nextcloud instance
Installing with PostgreSQL database
starting nextcloud installation
Nextcloud was successfully installed
setting trusted domains…
System config value trusted_domains => 1 set to string XXXXXXXX
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.192.149.41. Set the 'ServerName' directive globally to suppress this message
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.192.149.41. Set the 'ServerName' directive globally to suppress this message
[Tue Aug 11 08:54:50.097547 2020] [mpm_prefork:notice] [pid 1] AH00163: Apache/2.4.38 (Debian) PHP/7.3.21 configured -- resuming normal operations
[Tue Aug 11 08:54:50.097621 2020] [core:notice] [pid 1] AH00094: Command line: 'apache2 -D FOREGROUND'

I've tried some alternatives to that rsync, but since there are a lot of small files to copy, I haven't found any improvement.

Any ideas?

@timtorChen

timtorChen commented Aug 17, 2020

The log looks stuck at Initializing Nextcloud 17.0.7... because the rsync process is extremely slow (for my local NFS it is about 1.5 MB/s; you can show the progress with rsync --info=progress2). Even worse, the liveness probe will continuously fail and the pod eventually goes into CrashLoopBackOff.

As a workaround, like jesussancheztellomm, I disable the liveness probe for the first installation and re-enable it after the installation finishes.

Maybe we can refer to nextcloud/docker#968.
It will not solve the problem of slow NFS transfer speed (I still have no idea why it is so slow...), but a stateless application image may remove the rsync step entirely.

@tomhouweling1987

@timtorChen I can confirm: when I disabled the liveness probe it took 11 minutes to sync. I also tried it with an S3 storage backend, and it took just seconds to sync.

So I looked deeper into my NFS setup, and we are using sync instead of async because we do not want to lose any data. I didn't test it with an async connection.

@billimek
Collaborator

The nextcloud chart has migrated to a new repo. Can you please raise the issue over there? https://github.com/nextcloud/helm

@somerandow

Opened an issue over on the new repo and tried to summarize some of the info from this discussion.

@stale

stale bot commented Oct 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 3, 2020
@stale

stale bot commented Oct 24, 2020

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Oct 24, 2020