This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/nextcloud] resilience against restarts during container "setup" #14090

Closed
mrtndwrd opened this issue May 23, 2019 · 1 comment · Fixed by #14164

Comments

@mrtndwrd
Contributor

Describe the bug

Kubernetes tends to restart pods that it thinks are failing. In my setup (a single-node K8s cluster), the database sometimes takes a while to become available, so Nextcloud stays stuck in its "installation phase", where it copies files, generates database tables, and so on. When this takes too long, Kubernetes starts firing liveness probes at the NC container, which naturally fail because it hasn't been set up yet. The pod then gets restarted while Nextcloud is in the middle of "installation", which leads to very unpredictable situations.

Version of Helm and Kubernetes:

# helm version
Client: &version.Version{SemVer:"v2.13.1", GitCommit:"618447cbf203d147601b4b9bd7f8c37a5d39fbb4", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.13.1", GitCommit:"618447cbf203d147601b4b9bd7f8c37a5d39fbb4", GitTreeState:"clean"}
# kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Which chart:

stable/nextcloud

What happened:

  1. I installed the chart
  2. Nextcloud couldn't connect to the database:
# kubectl logs nc-pod -p --timestamps
2019-05-23T07:34:43.687241549Z Initializing nextcloud 15.0.2.0 ...
2019-05-23T07:34:56.852129418Z Initializing finished
2019-05-23T07:34:56.852199469Z New nextcloud instance
2019-05-23T07:34:56.852376626Z Installing with MySQL database
2019-05-23T07:34:56.852376626Z starting nextcloud installation
2019-05-23T07:35:27.274728683Z Error while trying to create admin user: Failed to connect to the database: An exception occured in driver: SQLSTATE[HY000] [2002] Connection timed out
2019-05-23T07:35:27.275699769Z  -> 
2019-05-23T07:35:27.293547933Z retrying install...
2019-05-23T07:36:00.820052371Z Error while trying to create admin user: Failed to connect to the database: An exception occured in driver: SQLSTATE[HY000] [2002] Connection timed out
2019-05-23T07:36:00.820186706Z  -> 
2019-05-23T07:36:00.834889339Z retrying install...
2019-05-23T07:36:34.195172937Z Error while trying to create admin user: Failed to connect to the database: An exception occured in driver: SQLSTATE[HY000] [2002] Connection timed out
2019-05-23T07:36:34.195215587Z  -> 
2019-05-23T07:36:34.209778729Z retrying install...

  3. K8s killed the Nextcloud pod:
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 23 May 2019 07:37:17 +0000
      Finished:     Thu, 23 May 2019 07:37:18 +0000
  4. A new NC pod started, and didn't recognise that installation had not completely finished:
# kubectl logs nc-pod --timestamps
2019-05-23T07:36:40.379109775Z AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.42.0.12. Set the 'ServerName' directive globally to suppress this message
2019-05-23T07:36:40.427522104Z AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.42.0.12. Set the 'ServerName' directive globally to suppress this message
2019-05-23T07:36:40.47235789Z [Thu May 23 07:36:40.472046 2019] [mpm_prefork:notice] [pid 1] AH00163: Apache/2.4.25 (Debian) PHP/7.2.14 configured -- resuming normal operations
2019-05-23T07:36:40.472573658Z [Thu May 23 07:36:40.472520 2019] [core:notice] [pid 1] AH00094: Command line: 'apache2 -D FOREGROUND'
2019-05-23T07:37:13.794369446Z 213.108.108.187 - - [23/May/2019:07:37:13 +0000] "GET /status.php HTTP/1.1" 200 1538 "-" "kube-probe/1.13"
2019-05-23T07:37:18.546352647Z 213.108.108.187 - - [23/May/2019:07:37:18 +0000] "GET /status.php HTTP/1.1" 200 1544 "-" "kube-probe/1.13"
  5. All seems well now, because the liveness probes respond, but then I try to run an occ command:
# kubectl logs integration-pod  --timestamps
2019-05-23T07:37:31.735543221Z Nextcloud is not installed - only a limited number of commands are available
2019-05-23T07:37:31.763774985Z 
2019-05-23T07:37:31.764516522Z                                          
2019-05-23T07:37:31.764623529Z   Command "app:install" is not defined.  
2019-05-23T07:37:31.764779342Z                                          
2019-05-23T07:37:31.764788838Z   Did you mean one of these?             
2019-05-23T07:37:31.76480513Z       app:check-code                     
2019-05-23T07:37:31.76480513Z       maintenance:install                
2019-05-23T07:37:31.76480513Z                                          
2019-05-23T07:37:31.764817529Z 

What you expected to happen:

In short, I expect the container to be resilient to pod restarts, because Kubernetes could terminate a pod at any moment.

How to reproduce it (as minimally and precisely as possible):

  1. Install the NC helm chart
  2. Before the NC pod is ready, kill it
  3. The restarted pod will "act weird" (the exact behaviour is hard to predict, because it depends on when exactly you kill it)

Anything else we need to know:

I think part of the solution could be to "install" the Nextcloud core into an emptyDir instead of putting it on the same persistent volume as the rest of your data (plugins & uploads). Ideally, I would hope we can avoid copying the code to a persistent volume altogether.
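To illustrate the idea, a pod spec along these lines would put the copied core on an emptyDir (discarded on restart) while user data stays on the PVC. This is only a sketch; the volume and claim names are hypothetical, not the chart's actual values:

```yaml
# Sketch only: a half-finished copy of the Nextcloud core is thrown away
# on pod restart, while user data survives on the persistent volume.
volumes:
  - name: nextcloud-core          # hypothetical name
    emptyDir: {}
  - name: nextcloud-data          # hypothetical name
    persistentVolumeClaim:
      claimName: nextcloud-data   # hypothetical claim name
containers:
  - name: nextcloud
    volumeMounts:
      - name: nextcloud-core
        mountPath: /var/www/html
      - name: nextcloud-data
        mountPath: /var/www/html/data
```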

Would there be any downsides to my proposed solution? Could it fix the issue? Are there other solutions?
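One other direction could be to have the container block until the database is reachable before starting the install, so the retry loop in the logs above never races against an unreachable host. A minimal sketch of such a wait helper (this is not the chart's actual entrypoint, and `DB_HOST` is a placeholder):

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds or a maximum number of
# attempts is reached. An entrypoint could call this before the install.
wait_for() {
  max_tries=$1; shift
  tries=0
  until "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "gave up after $tries tries: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Hypothetical usage: block until MySQL accepts TCP connections.
# wait_for 60 nc -z "$DB_HOST" 3306
```

A delay like this does not remove the need for restart-safety, but it would avoid the most common trigger in this report.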

@chrisingenhaag
Collaborator

Hi @mrtndwrd ,
I'm currently working on making the health-checks in the chart configurable. With that, you will at least be able to set higher delays.

On the other hand, you're totally right: this should not lead to a broken setup. But I'm not sure it would help to move the actual Nextcloud application to an emptyDir volume. It would just hide the underlying problem, namely that the docker-entrypoint script is not able to produce a consistent application from its sync if interrupted.
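As an illustration of the configurable health-check idea, probe values like these could be raised while the database comes up (the `/status.php` path appears in the probe logs above; the numbers here are illustrative, not chart defaults):

```yaml
livenessProbe:
  httpGet:
    path: /status.php
    port: http
  initialDelaySeconds: 120   # illustrative: give the install time to finish
  periodSeconds: 15
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /status.php
    port: http
  initialDelaySeconds: 60    # illustrative
```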
