
Deleting a Database resource causes operator process to crash #35

Closed
wms opened this issue Jan 15, 2020 · 24 comments
Labels
bug Something isn't working

Comments

@wms

wms commented Jan 15, 2020

First off, thanks for this Operator - it's ideal for our use-case of dynamically provisioning a database for dynamically created 'preview' instances of our in-development applications.

I am experiencing the following issue using 'vanilla' PostgreSQL (not on AWS or RDS):

  • Create a Postgres resource with .spec.dropOnDelete: true
  • Ensure that the db is created correctly on the PostgreSQL server
  • Delete the resource
  • The Operator process crashes with the following log entry:
{"level":"info","ts":1579085139.616076,"logger":"controller_postgres","msg":"Reconciling Postgres","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
2020/01/15 10:45:39 failed to connect to PostgreSQL server: pq: database "example-temp-db" does not exist
  • However, the DB has been deleted from the PostgreSQL database server

Since the DB does get deleted, it kind-of works, but has the following annoyances:

  1. The Operator is now stuck in a crash loop, emitting the same two log entries before each crash
  2. The Postgres resource I originally wanted to delete is now 'stuck' in a Pending state - I guess it's waiting for the finalizer to do its job (and the operator always crashes before it can signal success)

Happy to provide further input or give you some supervised hands-on time with the affected dev cluster to help identify the cause of this.

@arnarg

arnarg commented Jan 15, 2020

What I think is happening: the operator tries to remove the finalizer after deleting the database, but that fails for some reason, so it requeues the request. Then, when it tries to drop the roles again, it has to connect to the database it just deleted. This should be handled more gracefully.

https://github.com/movetokube/postgres-operator/blob/master/pkg/controller/postgres/postgres_controller.go#L103-L130

Is there no other error in the log?

@arnarg arnarg added the bug Something isn't working label Jan 15, 2020
@wms
Author

wms commented Jan 15, 2020

@arnarg Thanks for looking into this - no others as far as I can tell - but it could be that some error was encountered on a previous run and is now 'lost' behind many, many restarts.

If it's of any further help, I've set the POSTGRES_DEFAULT_DATABASE env var on the Operator deployment to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and would not need the resource-specified DB to exist in order to operate.

@arnarg

arnarg commented Jan 15, 2020

Yeah that's probably the case. You can manually edit the Postgres object and remove the finalizers list and kubernetes will finish deleting the object. Then you can start using the operator again normally. Has this happened more than once?

I've set the POSTGRES_DEFAULT_DATABASE env var on the Operator deployment to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and would not need the resource-specified DB to exist in order to operate.

The thing is, in order to drop a role we first need to assign all of its objects to the operator's user, and to do that we need to connect to the database where the objects are.
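That cleanup maps to a per-database SQL sequence. A sketch of how the statements could be composed, with hypothetical helper and role names (the operator's actual SQL may differ):

```go
package main

import "fmt"

// dropRoleStatements returns the SQL a cleanup routine would run while
// connected to the database that still holds the role's objects: reassign
// ownership to the operator's user, drop whatever remains, then drop the
// role itself. (Hypothetical helper for illustration.)
func dropRoleStatements(role, operatorUser string) []string {
	return []string{
		fmt.Sprintf(`REASSIGN OWNED BY "%s" TO "%s";`, role, operatorUser),
		fmt.Sprintf(`DROP OWNED BY "%s";`, role),
		fmt.Sprintf(`DROP ROLE "%s";`, role),
	}
}

func main() {
	for _, stmt := range dropRoleStatements("example-temp-db-user", "postgres_operator") {
		fmt.Println(stmt)
	}
}
```

REASSIGN OWNED and DROP OWNED only affect objects in the current database, which is why the connection to the resource-specified database is needed even when POSTGRES_DEFAULT_DATABASE points at the maintenance DB.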

@wms
Author

wms commented Jan 15, 2020 via email

@arnarg

arnarg commented Jan 15, 2020

Yes, this behavior is consistent.

Can you then recreate this and capture the logs before the first crash?

Are you saying that the current implementation of dropping databases is
fundamentally broken, or is there still a viable solution?

If my theory is correct, then removing the finalizer is failing for some reason; it just needs to be handled more gracefully. But first I'll need to be able to recreate this myself or see some logs.

What does the ClusterRole in use by the pod look like?

@wms
Author

wms commented Jan 15, 2020

Ah ha! I tinkered with the Operator's Pod template so that it hangs around for a while after crashing. After repeating the process, I witnessed this in the logs:

{"level":"info","ts":1579092991.258128,"logger":"controller_postgres","msg":"Dropped database example-temp-db","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
{"level":"error","ts":1579092991.2638128,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"postgres-controller","request":"tmt/example-temp-db","error":"Operation cannot be fulfilled on postgres.db.movetokube.com \"example-temp-db\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.1.10/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.1.10/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190221213512-86fb29eff628/pkg/u
til/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190221213512-86fb29eff628/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190221213512-86fb29eff628/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1579092992.264065,"logger":"controller_postgres","msg":"Reconciling Postgres","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
2020/01/15 12:56:32 failed to connect to PostgreSQL server: pq: database "example-temp-db" does not exist

At this point, the process has exited and would be restarted, entering the crash loop described before.

The ClusterRole this Pod is running with is effectively the same as the example given in the deploy directory:

rules:
  - apiGroups:
      - ''
    resources:
      - pods
      - services
      - endpoints
      - persistentvolumeclaims
      - events
      - configmaps
      - secrets
    verbs:
      - '*'
  - apiGroups:
      - apps
    resources:
      - deployments
      - daemonsets
      - replicasets
      - statefulsets
    verbs:
      - '*'
  - apiGroups:
      - apps
    resourceNames:
      - dev-postgresql-provisioner-ext-postgresql-operator
    resources:
      - deployments/finalizers
    verbs:
      - update
  - apiGroups:
      - db.movetokube.com
    resources:
      - '*'
    verbs:
      - '*'

@arnarg

arnarg commented Jan 15, 2020

Interesting. You get the error Operation cannot be fulfilled on postgres.db.movetokube.com "example-temp-db": the object has been modified; please apply your changes to the latest version and try again when trying to remove the finalizer. The Kubernetes API is not happy about something. The Postgres object will never be garbage collected if we can't remove the finalizer.
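That error is the API server's optimistic-concurrency check: an Update carrying a resourceVersion older than the stored one is rejected as a Conflict. A toy model of the check (not the real API machinery, just the semantics):

```go
package main

import (
	"errors"
	"fmt"
)

// updateObject mimics the API server's optimistic-concurrency check: an
// Update whose resourceVersion no longer matches the stored object is
// rejected with a Conflict, which is exactly the error in the log above.
// (Toy model for illustration only.)
func updateObject(storedVersion, submittedVersion string) error {
	if submittedVersion != storedVersion {
		return errors.New("the object has been modified; please apply your changes to the latest version and try again")
	}
	return nil
}

func main() {
	// The controller read resourceVersion "41", but something else (e.g. a
	// GitOps tool) has since bumped the stored object to "42".
	if err := updateObject("42", "41"); err != nil {
		fmt.Println("conflict:", err)
	}
}
```

So the question becomes: what is mutating the Postgres object between the controller's read and its finalizer-removing write?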

What version of Kubernetes are you running?

@wms
Author

wms commented Jan 15, 2020

Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

@wms
Author

wms commented Jan 15, 2020

I'm not super-knowledgeable on RBAC setup; is it possible that .rules[2].resourceNames is incorrect? In the examples given in deploy, it's simply ext-postgresql-operator. However, I'm looking at use-cases where we can provision against different PostgreSQL servers based on the tier the instance is running on (e.g. an in-cluster, shared PostgreSQL instance for dev, another for pre-prod, and RDS for prod).

Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK to name it by instance instead?

@arnarg

arnarg commented Jan 15, 2020

OK, I have the same version of EKS running, so I might be able to recreate this situation.

This error seems to suggest that between the time the controller starts processing the request and gets the Postgres object from the API, and the time it removes the finalizer and tries to update the object, something else is changing the Postgres object. Is there something else in your cluster that modifies the Postgres object?

Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK to name it by instance instead?

It should be the name of the deployment for postgres-operator. But that rule is for the deployment finalizer and is therefore not related to the finalizer in the Postgres object; the last rule covers that.

@wms
Author

wms commented Jan 15, 2020

Quite possibly - the Postgres resource is defined in a Helm chart that is installed into the cluster via Argo CD.

@wms
Author

wms commented Jan 15, 2020

I'll create and delete a Postgres resource outside of these tools to see if the behaviour is any different.

@wms
Author

wms commented Jan 15, 2020

Confirmed: it's likely Argo modifying the resource in some way that the operator doesn't like. Manually creating and then deleting a resource works as expected, with no crash loop.

@arnarg

arnarg commented Jan 15, 2020

OK, incidentally we also use ArgoCD, but I had never included a Postgres resource in a chart yet. Thanks for reporting this; I want to get this use case sorted as well.

@wms
Author

wms commented Jan 15, 2020

Thanks for sounding me out and helping to isolate the cause so quickly.

@hitman99
Member

I'm glad you guys sorted this out

@arnarg

arnarg commented Jan 16, 2020

I ran some tests with and without ArgoCD. I didn't manage to crash the operator though but I did sometimes see the error Operation cannot be fulfilled on postgres.db.movetokube.com "my-db": the object has been modified; please apply your changes to the latest version and try again.

I noticed that when deleting the Postgres object in ArgoCD a foregroundDeletion finalizer was added, while deleting with kubectl delete postgres my-db does not add one. I don't think this is the cause, but I just thought I'd add it here.

The controller uses a cached getter client, so it could be getting a cached Postgres object with an older resourceVersion than what is stored in Kubernetes' datastore. I think this is the more likely cause.

I think what we should do is add a check, when dropping a role, that the database where we think the role still owns some resources actually exists before we try to connect. That way, if removing the finalizer fails, it can be retried in the next reconcile loop without crashing. We might even want to send a patch instead of an update to remove the finalizer.
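The two proposed changes can be sketched as follows, with hypothetical names (the actual fix lives in branch bugfix/35): an existence check against pg_database run on the maintenance connection, and finalizer removal sent as a JSON merge patch, which carries no resourceVersion and so cannot hit the "object has been modified" conflict:

```go
package main

import "fmt"

// databaseExistsQuery is the check to run against the maintenance database
// (e.g. the one named by POSTGRES_DEFAULT_DATABASE) before attempting to
// connect to the target database during cleanup.
const databaseExistsQuery = `SELECT EXISTS (SELECT 1 FROM pg_database WHERE datname = $1);`

// finalizerRemovalPatch is a JSON merge patch body that clears the
// finalizers list. Unlike an Update of the whole object, a merge patch
// does not assert a resourceVersion, so a concurrent modification by
// another client (e.g. ArgoCD) cannot make it fail with a Conflict.
func finalizerRemovalPatch() []byte {
	return []byte(`{"metadata":{"finalizers":null}}`)
}

func main() {
	fmt.Println(databaseExistsQuery)
	fmt.Println(string(finalizerRemovalPatch()))
}
```

In controller-runtime terms this corresponds to calling the client's Patch method with a merge patch instead of Update; the sketch above only shows the payloads involved.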

@wms
Author

wms commented Jan 16, 2020

Commenting from the peanut gallery: it seems that patching the current in-cluster resource to remove the finalizer would be more sensible than sending an update based on the current in-memory version of the resource.

@arnarg

arnarg commented Jan 19, 2020

@wms As I can't reproduce the crash it's a bit difficult to test this on my end. I have made some changes in branch bugfix/35 that I believe fixes this bug.

Are you able to build this and test?

@wms
Author

wms commented Jan 19, 2020 via email

@wms
Author

wms commented Jan 19, 2020

Yeah! Looks like that new Patch logic does the trick, and no crashes! I've attached the operator's logs for this test if you're curious...

postgresql-operator-logs.txt

@arnarg

arnarg commented Jan 19, 2020

Great!

Regarding that last error postgres.db.movetokube.com "devdb-test-deletes" not found: I've been seeing this as well on my EKS cluster. It seems to be garbage collecting objects even though there is a finalizer attached. AFAIK that's not how things are supposed to work and seems to be a bug with EKS.

The operator is trying to remove the finalizer but Kubernetes has already garbage collected the object, so it doesn't exist anymore. This error does not matter at all if dropping the roles and database was successful, but if anything goes wrong and the request needs to be requeued, the remaining objects will be left in the database.

EDIT: Disregard most of this comment. What is happening is that it's trying to patch the status after the finalizer has been removed and kubernetes has garbage collected it.

@wms
Author

wms commented Jan 19, 2020

@arnarg Thanks for taking the time to diagnose and fix this, I really appreciate it! I figured there was some weird out-of-order stuff happening here, but I did check on the PostgreSQL server that the DB was really gone, and it was!

At this time, I only intend to use this operator for our short-lived, non-production tier(s) to allow us to dynamically create and destroy databases on a single server. However, I'll continue to keep an eye on this and let you know if we encounter any strange situations where the operator thinks the DB is gone but the drop actually failed.

@arnarg

arnarg commented Feb 9, 2020

The fix for this was released with version 0.4.2.

@arnarg arnarg closed this as completed Feb 9, 2020