
Deleting a Database resource causes operator process to crash #35

Closed
wms opened this issue Jan 15, 2020 · 24 comments
Labels
bug Something isn't working

Comments

@wms

wms commented Jan 15, 2020

First off, thanks for this Operator - it's ideal for our use-case of dynamically provisioning a database for dynamically created 'preview' instances of our in-development applications.

I am experiencing the following issue using 'vanilla' PostgreSQL (not on AWS or RDS):

  • Create a Postgres resource with .spec.dropOnDelete: true
  • Ensure that the db is created correctly on the PostgreSQL server
  • Delete the resource
  • The Operator process crashes with the following log entry:
{"level":"info","ts":1579085139.616076,"logger":"controller_postgres","msg":"Reconciling Postgres","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
2020/01/15 10:45:39 failed to connect to PostgreSQL server: pq: database "example-temp-db" does not exist
  • However, the DB has been deleted from the PostgreSQL database server

Since the DB does get deleted, it kind-of works, but has the following annoyances:

  1. The Operator is now stuck in a crash loop, emitting the same two log entries before each crash
  2. The Postgres resource I originally wanted to delete is now 'stuck' in a Pending state - I guess it's waiting for the finalizer to do its job (and the operator always crashes before it can signal success)

Happy to provide further input or give you some supervised hands-on time with the affected dev cluster to help identify the cause of this.

@arnarg

arnarg commented Jan 15, 2020

What I think is happening: the operator tries to remove the finalizer after deleting the database, but that fails for some reason, so it requeues the request. Then, when it tries to drop the roles again, it has to connect to the database it just deleted. This should be handled more gracefully.

https://github.com/movetokube/postgres-operator/blob/master/pkg/controller/postgres/postgres_controller.go#L103-L130

Is there no other error in the log?

@arnarg arnarg added the bug Something isn't working label Jan 15, 2020
@wms
Author

wms commented Jan 15, 2020

@arnarg Thanks for looking into this - no others as far as I can tell - but it could be that some error was encountered on a previous run and is now 'lost' behind many, many restarts.

If it's of any further help, I've set the POSTGRES_DEFAULT_DATABASE env var on the Operator deployment to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and would not need the resource-specified DB to exist in order to operate.

@arnarg

arnarg commented Jan 15, 2020

Yeah that's probably the case. You can manually edit the Postgres object and remove the finalizers list and kubernetes will finish deleting the object. Then you can start using the operator again normally. Has this happened more than once?

I've set the POSTGRES_DEFAULT_DATABASE env var on the Operator deployment to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and would not need the resource-specified DB to exist in order to operate.

The thing is, in order to drop a role we first need to assign all of its objects to the operator's user, and to do that we need to connect to the database where the objects are.
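That cleanup maps to a per-database SQL sequence. A sketch of how the statements could be composed, with hypothetical helper and role names (the operator's actual SQL may differ):

```go
package main

import "fmt"

// dropRoleStatements returns the SQL a cleanup routine would run while
// connected to the database that still holds the role's objects: reassign
// ownership to the operator's user, drop whatever remains, then drop the
// role itself. (Hypothetical helper for illustration.)
func dropRoleStatements(role, operatorUser string) []string {
	return []string{
		fmt.Sprintf(`REASSIGN OWNED BY "%s" TO "%s";`, role, operatorUser),
		fmt.Sprintf(`DROP OWNED BY "%s";`, role),
		fmt.Sprintf(`DROP ROLE "%s";`, role),
	}
}

func main() {
	for _, stmt := range dropRoleStatements("example-temp-db-user", "postgres_operator") {
		fmt.Println(stmt)
	}
}
```

REASSIGN OWNED and DROP OWNED only affect objects in the current database, which is why the connection to the resource-specified database is needed even when POSTGRES_DEFAULT_DATABASE points at the maintenance DB.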

@wms
Author

wms commented Jan 15, 2020 via email

@arnarg

arnarg commented Jan 15, 2020

Yes, this behavior is consistent.

Can you then recreate this and capture the logs before the first crash?

Are you saying that the current implementation of dropping databases is
fundamentally broken, or is there still a viable solution?

If my theory is correct, then removing the finalizer is failing for some reason; it just needs to be handled more gracefully. But first I'll need to be able to recreate this myself or see some logs.

What does the ClusterRole in use by the pod look like?

@wms
Author

wms commented Jan 15, 2020

Ah ha! I tinkered with the Operator's Pod template so that it hangs around for a while after crashing. After repeating the process, I witnessed this in the logs:

{"level":"info","ts":1579092991.258128,"logger":"controller_postgres","msg":"Dropped database example-temp-db","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
{"level":"error","ts":1579092991.2638128,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"postgres-controller","request":"tmt/example-temp-db","error":"Operation cannot be fulfilled on postgres.db.movetokube.com \"example-temp-db\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.1.10/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.1.10/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190221213512-86fb29eff628/pkg/u
til/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190221213512-86fb29eff628/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190221213512-86fb29eff628/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1579092992.264065,"logger":"controller_postgres","msg":"Reconciling Postgres","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
2020/01/15 12:56:32 failed to connect to PostgreSQL server: pq: database "example-temp-db" does not exist

At this point, the process has exited and would be restarted, entering the crash loop described before.

The ClusterRole this Pod is running with is effectively the same as the example given in the deploy directory:

rules:
  - apiGroups:
      - ''
    resources:
      - pods
      - services
      - endpoints
      - persistentvolumeclaims
      - events
      - configmaps
      - secrets
    verbs:
      - '*'
  - apiGroups:
      - apps
    resources:
      - deployments
      - daemonsets
      - replicasets
      - statefulsets
    verbs:
      - '*'
  - apiGroups:
      - apps
    resourceNames:
      - dev-postgresql-provisioner-ext-postgresql-operator
    resources:
      - deployments/finalizers
    verbs:
      - update
  - apiGroups:
      - db.movetokube.com
    resources:
      - '*'
    verbs:
      - '*'

@arnarg

arnarg commented Jan 15, 2020

Interesting. You get the error Operation cannot be fulfilled on postgres.db.movetokube.com "example-temp-db": the object has been modified; please apply your changes to the latest version and try again when trying to remove the finalizer. The Kubernetes API is not happy about something. The Postgres object will never be garbage collected if we can't remove the finalizer.
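That error is the API server's optimistic-concurrency check: an Update carrying a resourceVersion older than the stored one is rejected as a Conflict. A toy model of the check (not the real API machinery, just the semantics):

```go
package main

import (
	"errors"
	"fmt"
)

// updateObject mimics the API server's optimistic-concurrency check: an
// Update whose resourceVersion no longer matches the stored object is
// rejected with a Conflict, which is exactly the error in the log above.
// (Toy model for illustration only.)
func updateObject(storedVersion, submittedVersion string) error {
	if submittedVersion != storedVersion {
		return errors.New("the object has been modified; please apply your changes to the latest version and try again")
	}
	return nil
}

func main() {
	// The controller read resourceVersion "41", but something else (e.g. a
	// GitOps tool) has since bumped the stored object to "42".
	if err := updateObject("42", "41"); err != nil {
		fmt.Println("conflict:", err)
	}
}
```

So the question becomes: what is mutating the Postgres object between the controller's read and its finalizer-removing write?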

What version of Kubernetes are you running?

@wms
Author

wms commented Jan 15, 2020

Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

@wms
Author

wms commented Jan 15, 2020

I'm not super-knowledgeable on RBAC setup; is it possible that .rules[2].resourceNames is incorrect? In the examples given in deploy, it's simply ext-postgresql-operator. However, I'm looking at use-cases where we can provision against different PostgreSQL servers based on the tier the instance is running on (e.g. an in-cluster, shared PostgreSQL instance for dev, another for pre-prod, and RDS for prod).

Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK to name it by instance instead?

@arnarg

arnarg commented Jan 15, 2020

OK, I have the same version of EKS running, so I might be able to recreate this situation.

This error seems to suggest that between the time the controller starts processing the request and gets the Postgres object from the API, and the time it removes the finalizer and tries to update the object, something else is changing the Postgres object. Is there something else in your cluster that modifies the Postgres object?

Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK to name it by instance instead?

It should be the name of the deployment for postgres-operator. But that rule is for the deployment finalizer and is therefore not related to the finalizer in the Postgres object; the last rule covers that.

@wms
Author

wms commented Jan 15, 2020

Quite possibly - the Postgres resource is defined in a Helm chart that is installed into the cluster via Argo CD.

@wms
Author

wms commented Jan 15, 2020

I'll create and delete a Postgres resource outside of these tools to see if the behaviour is any different.

@wms
Author

wms commented Jan 15, 2020

Confirmed: it's likely Argo modifying the resource in some way that the operator doesn't like. Manually creating and then deleting a resource works as expected, with no crash loop.

@arnarg

arnarg commented Jan 15, 2020

OK, incidentally we also use ArgoCD, but I had never included a Postgres resource in a chart yet. Thanks for reporting this; I want to get this use case sorted as well.

@wms
Author

wms commented Jan 15, 2020

Thanks for sounding me out and helping to isolate the cause so quickly.

@hitman99
Member

I'm glad you guys sorted this out

@arnarg

arnarg commented Jan 16, 2020

I ran some tests with and without ArgoCD. I didn't manage to crash the operator though but I did sometimes see the error Operation cannot be fulfilled on postgres.db.movetokube.com "my-db": the object has been modified; please apply your changes to the latest version and try again.

I noticed that when deleting the Postgres object in ArgoCD a foregroundDeletion finalizer was added, while deleting with kubectl delete postgres my-db does not add one. I don't think this is the cause, but I just thought I'd add it here.

The controller uses a cached getter client, so it could be getting a cached Postgres object with an older resourceVersion than what is stored in Kubernetes' datastore. I think this is the more likely cause.

I think what we should do is add a check, when dropping a role, that the database where we think the role still owns some resources actually exists before we try to connect. That way, if removing the finalizer fails, it can be retried in the next reconcile loop without crashing. We might even want to send a patch instead of an update to remove the finalizer.
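The two proposed changes can be sketched as follows, with hypothetical names (the actual fix lives in branch bugfix/35): an existence check against pg_database run on the maintenance connection, and finalizer removal sent as a JSON merge patch, which carries no resourceVersion and so cannot hit the "object has been modified" conflict:

```go
package main

import "fmt"

// databaseExistsQuery is the check to run against the maintenance database
// (e.g. the one named by POSTGRES_DEFAULT_DATABASE) before attempting to
// connect to the target database during cleanup.
const databaseExistsQuery = `SELECT EXISTS (SELECT 1 FROM pg_database WHERE datname = $1);`

// finalizerRemovalPatch is a JSON merge patch body that clears the
// finalizers list. Unlike an Update of the whole object, a merge patch
// does not assert a resourceVersion, so a concurrent modification by
// another client (e.g. ArgoCD) cannot make it fail with a Conflict.
func finalizerRemovalPatch() []byte {
	return []byte(`{"metadata":{"finalizers":null}}`)
}

func main() {
	fmt.Println(databaseExistsQuery)
	fmt.Println(string(finalizerRemovalPatch()))
}
```

In controller-runtime terms this corresponds to calling the client's Patch method with a merge patch instead of Update; the sketch above only shows the payloads involved.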

@wms
Author

wms commented Jan 16, 2020

Commenting from the peanut gallery: it seems that patching the current in-cluster resource to remove the finalizer would be more sensible than sending an update based on the current in-memory version of the resource.

@arnarg

arnarg commented Jan 19, 2020

@wms As I can't reproduce the crash it's a bit difficult to test this on my end. I have made some changes in branch bugfix/35 that I believe fixes this bug.

Are you able to build this and test?

@wms
Author

wms commented Jan 19, 2020 via email

@wms
Author

wms commented Jan 19, 2020

Yeah! Looks like that new Patch logic does the trick, and no crashes! I've attached the operator's logs for this test if you're curious...

postgresql-operator-logs.txt

@arnarg

arnarg commented Jan 19, 2020

Great!

Regarding that last error postgres.db.movetokube.com "devdb-test-deletes" not found: I've been seeing this as well on my EKS cluster. It seems to be garbage collecting objects even though there is a finalizer attached. AFAIK that's not how things are supposed to work and seems to be a bug with EKS.

The operator is trying to remove the finalizer but Kubernetes has already garbage collected the object, so it doesn't exist anymore. This error does not matter at all if dropping the roles and database was successful, but if anything goes wrong and the request needs to be requeued, the remaining objects will be left in the database.

EDIT: Disregard most of this comment. What is happening is that it's trying to patch the status after the finalizer has been removed and kubernetes has garbage collected it.

@wms
Author

wms commented Jan 19, 2020

@arnarg Thanks for taking the time to diagnose and fix this, I really appreciate it! I figured there was some weird out-of-order stuff happening here, but I did check on the PostgreSQL server that the DB was really gone, and it was!

At this time, I only intend to use this operator for our short-lived, non-production tier(s) to allow us to dynamically create and destroy databases on a single server. However, I'll continue to keep an eye on this and let you know if we encounter any strange situations where the operator thinks the DB is gone but the drop actually failed.

@arnarg

arnarg commented Feb 9, 2020

The fix for this was released with version 0.4.2.

@arnarg arnarg closed this as completed Feb 9, 2020