Deleting a Database resource causes operator process to crash #35
What I think is happening is that the operator tried to remove the finalizer after deleting the database but that failed for some reason so it requeued the request. But then when it tries to drop the roles again it has to connect to the database it just deleted. This should be handled more gracefully. Is there no other error in the log? |
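@arnarg Thanks for looking into this - no others as far as I can tell - but it could be that some error was encountered on a previous run and is now 'lost' behind many, many restarts. If it's of any further help, I've set the env. var POSTGRES_DEFAULT_DATABASE on the Operator deployment to postgres. I was assuming that with this set, the Operator would always connect to the maintenance DB and not need the resource-specified DB to exist in order to operate. |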
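Yeah that's probably the case. You can manually edit the Postgres object and remove the finalizers list and kubernetes will finish deleting the object. Then you can start using the operator again normally. Has this happened more than once?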
The thing is, in order to drop roles we need to first assign all of its objects to operator's user and in order to do that we need to connect to the database where the objects are. |
Yes, this behavior is consistent. I can get out of the blockage by manually
removing the finalizer, but it's hardly ideal.
Are you saying that the current implementation of dropping databases is
fundamentally broken, or is there still a viable solution?
…On Wed, Jan 15, 2020, 11:18 AM Arnar ***@***.***> wrote:
Yeah that's probably the case. You can manually edit the Postgres object
and remove the finalizers list and kubernetes will finish deleting the
object. Then you can start using the operator again normally. Has this
happened more than once?
I've set the env. var on the Operator deployment to
POSTGRES_DEFAULT_DATABASE to postgres -- I was assuming that with this set,
the Operator would always connect to the maintenance DB and not need the
resource-specified DB to exist in order to operate.
The thing is, in order to drop roles we need to first assign all of its
objects to operator's user and in order to do that we need to connect to
the database where the objects are.
|
Can you then recreate this and capture the logs before the first crash?
If my theory is correct, then removing the finalizer is failing for some reason and the operator just needs to handle this more gracefully. But first I'll need to be able to recreate this myself or see some logs. What does the |
Ah ha! I tinkered with the Operator's Pod template so that it hangs around for a while after crashing. After repeating the process, I witnessed this in the logs:
At this point, the process has exited and would be restarted, entering the crash loop described before. The rules:
- apiGroups:
  - ''
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  - daemonsets
  - replicasets
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - apps
  resourceNames:
  - dev-postgresql-provisioner-ext-postgresql-operator
  resources:
  - deployments/finalizers
  verbs:
  - update
- apiGroups:
  - db.movetokube.com
  resources:
  - '*'
  verbs:
  - '*'
|
Interesting. You get the error What version of Kubernetes are you running? |
|
I'm not super-knowledgeable on RBAC setup; is it possible that Does |
Ok I have the same version of This error seems to suggest that in the time the controller starts processing the request and gets the
It should be the name of the deployment for |
Quite possibly - the |
I'll create and delete a |
Confirmed, it's likely Argo modifying the resource in some way that the operator doesn't like. Manually creating and then deleting a Resource works as expected with no crash loop. |
Ok, incidentally we also use ArgoCD but I had never included a |
Thanks for sounding me out and helping to isolate the cause so quickly. |
I'm glad you guys sorted this out |
I ran some tests with and without ArgoCD. I didn't manage to crash the operator, but I did sometimes see the error
I noticed that when deleting the
The controller uses a cached getter client. It could be getting a cached
I think what we should do, when dropping a role, is check that the database where we think the role still owns some resources actually exists before we try to connect. In that case it should be able to remove the finalizer in the next reconcile loop if it fails, without crashing. We might even want to send a patch instead of an update to remove the finalizer. |
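The existence check proposed in this comment could look roughly like the Go sketch below. It is not the operator's code: the roleDropper interface, its method names, and the fakePG test double are all hypothetical, invented only to show the control flow. In a real client, DatabaseExists might run something like SELECT 1 FROM pg_database WHERE datname = $1 against the maintenance database; that is an assumption too.

```go
package main

import "fmt"

// roleDropper is a hypothetical stand-in for the operator's PostgreSQL
// client. The method names are illustrative, not the operator's API.
type roleDropper interface {
	DatabaseExists(database string) (bool, error)
	ReassignOwned(database, role string) error
	DropRole(role string) error
}

// dropRoleSafely reassigns the role's objects only when the database that
// held them still exists, and then drops the role. If the database is
// already gone there is nothing to connect to, so that step is skipped.
func dropRoleSafely(pg roleDropper, database, role string) error {
	exists, err := pg.DatabaseExists(database)
	if err != nil {
		return fmt.Errorf("checking database %q: %w", database, err)
	}
	if exists {
		if err := pg.ReassignOwned(database, role); err != nil {
			return fmt.Errorf("reassigning objects of %q: %w", role, err)
		}
	}
	return pg.DropRole(role)
}

// fakePG records calls so the flow can be observed without a server.
type fakePG struct {
	databases map[string]bool
	calls     []string
}

func (f *fakePG) DatabaseExists(database string) (bool, error) {
	return f.databases[database], nil
}

func (f *fakePG) ReassignOwned(database, role string) error {
	f.calls = append(f.calls, "reassign "+role+" in "+database)
	return nil
}

func (f *fakePG) DropRole(role string) error {
	f.calls = append(f.calls, "drop "+role)
	return nil
}

func main() {
	// The database was already dropped on an earlier reconcile.
	pg := &fakePG{databases: map[string]bool{}}
	if err := dropRoleSafely(pg, "preview_db", "preview_owner"); err != nil {
		panic(err)
	}
	fmt.Println(pg.calls)
}
```

With this shape, a requeued request that arrives after the database has been dropped goes straight to dropping the role instead of failing on a connection to a database that no longer exists.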
Commenting from the peanut gallery, it seems that patching the current in-cluster resource to remove the finalizer would be more sensible than sending an update against the current in-memory version of the resource. |
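As a rough illustration of the patch idea, the Go sketch below builds a JSON merge patch body that touches only metadata.finalizers. The function name and the finalizer string are made up for the example, and how the operator actually sends the patch is not shown in this thread, so treat this as one possible shape under those assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finalizerPatch builds a JSON merge patch that replaces metadata.finalizers
// with the current list minus the entry to remove. No other field of the
// object appears in the patch body.
func finalizerPatch(current []string, remove string) ([]byte, error) {
	kept := make([]string, 0, len(current))
	for _, f := range current {
		if f != remove {
			kept = append(kept, f)
		}
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{"finalizers": kept},
	}
	return json.Marshal(patch)
}

func main() {
	// "example.com/finalizer" is a hypothetical finalizer name.
	body, err := finalizerPatch(
		[]string{"example.com/finalizer", "other/finalizer"},
		"example.com/finalizer",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Because the body names only the finalizers list, changes that another controller made to the rest of the resource are not part of the request, which is the property the comment above is after. A merge patch does replace the whole list, though, so a finalizer added concurrently by something else could still be overwritten.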
Sure, I'll do that and let you know the result.
…On Sun, Jan 19, 2020 at 4:34 PM Arnar ***@***.***> wrote:
@wms <https://github.com/wms> As I can't reproduce the crash it's a bit
difficult to test this on my end. I have made some changes in branch
bugfix/35 <https://github.com/movetokube/postgres-operator/tree/bugfix/35>
that I believe fixes this bug.
Are you able to build this and test?
|
Yeah! Looks like that new Patch logic does the trick, and no crashes! I've attached the operator's logs for this test if you're curious... |
Great! Regarding that last error The operator is trying to remove the finalizer but kubernetes has already garbage collected the object, so it doesn't exist anymore. This error does not matter at all if dropping the roles and database were successful but if anything goes wrong and the request needs to be requeued, the remaining objects will be left in the database. EDIT: Disregard most of this comment. What is happening is that it's trying to patch the status after the finalizer has been removed and kubernetes has garbage collected it. |
@arnarg Thanks for taking the time to diagnose and fix this, I really appreciate it! I figured there was some weird out-of-order stuff happening here but did check in the PostgreSQL server to check that the DB was really gone, and it was! At this time, I only intend to use this operator for our short-lived, non-production tier(s) to allow us to dynamically create and destroy databases on a single server. However, I'll continue to keep an eye on this and let you know if we encounter any strange situations where the operator thinks the DB is gone but failed. |
The fix for this was released with version 0.4.2. |
First off, thanks for this Operator - it's ideal for our use-case of dynamically provisioning a database for dynamically created 'preview' instances of our in-development applications.
I am experiencing the following issue using 'vanilla' PostgreSQL (not on AWS or RDS):
Postgres resource with .spec.dropOnDelete: true
Since the DB does get deleted, it kind-of works, but has the following annoyances:
Postgres resource I wanted to delete originally is now 'stuck' in a Pending state - I guess it's waiting for the Finalizer to do its job (and always crashes before it can signal success)
Happy to provide further input or give you some supervised hands-on time with the affected dev cluster to help identify the cause of this.