Harbor can't recover when the connection to the database drops. #14856
In addition, the job service also does not recover once the connection to the DB is temporarily dropped.
Hey 👋🏼 I was chatting about the same issue with @ywk253100 in Slack; we're experiencing the same problem in both core and jobservice.

In our configuration, the Postgres database is deployed as a StatefulSet (using Zalando's postgres-operator) and Harbor is configured to connect via the ClusterIP Service. After a DB failover/switchover, the Service points to the now-active primary Postgres Pod, and testing connectivity via the Service works just fine. I believe the issue is that the clients (core, jobservice) still try to connect via the "old" session and fail; they need to re-establish a new connection when this condition happens. Happy to provide more information/logs/etc.
Same setup that we have as well.
I have a similar feeling. The DB is online and definitely working, but the job service still holds on to a non-existent handle or something. We have an RDS DB outside our K8s cluster.
@Vad1mo Is your database also an external one? @dkulchinsky I cannot reproduce this issue in my env with the latest master code.
@ywk253100 we're still on v2.1, and I assume master is v2.2? Perhaps there were related changes there; could you test the same with v2.1? When you say
Hitting this as well, using an RDS DB |
I'm also seeing this issue in a configuration where harbor v2.2.1 connects through pgbouncer as the connection pooler. |
Could you share the architecture of the DB and the config of the pgbouncer so that we can reproduce this problem? Thanks.
@ywk253100, we are on an external Postgres DB similar to AWS Aurora. On the jobservice, logging to the DB needs to be enabled. I'll try to reproduce the error by dropping the idle connection. What you can also do is reduce the idle and max connections, for example to 1 and 10.
I have the same issue with an Azure Postgres database and Harbor 2.2.1; the only workaround I found is to restart the core pods.
@ywk253100 if you want to debug this, I can provide you with a cluster with remote code debugging access so we could take a look at it together. Ping me on Slack so we can set this up! How does that sound?
We have the same issue: Azure Postgres and Harbor 2.2.1, with similar log output from core & jobservice. @Vad1mo are you also using Azure Redis? (Which tier, and which region?)
@lukasmrtvy we are using jobservice to forward logs to Postgres, and the error only happens when this feature is enabled. Logging to stdout or to a file does not cause this issue.
We encountered this issue a couple of times. There are a lot of logs like this: The server is not responsive in that state! A manual restart of the core service fixes the issue for some time, but because of it there is a time window when our application is not available.

We encountered the same issue in our Java apps, and we fixed it using connection validators (provided by the WildFly server). A connection validator checks all database connections in the pool at specified intervals and tries to execute very simple statements, like

In my opinion Harbor requires a similar solution, or at least it should remove connections from the pool after some time. I suspect that after a while all database connections in the pool can be broken. I also saw that a similar issue was fixed for MySQL:
I spent a few hours resolving this issue; @jowko's recommendation to set the

References
In the long run, a switch to pgx would be the ideal outcome, so please upvote #15209. I'll provide a patch and a test image with the fix on top of 2.3 for you to try.
I found out that the Harbor exporter is also affected by this issue. I don't know if the above fix will cover this case. There are a lot of logs such as this:
When the connection to the database is lost (unreliable connectivity), Harbor can't recover by itself, and the health check doesn't notice that there is a problem. Restarting the pod solves the issue immediately, until the next connectivity problem.
This problem occurs multiple times a day.
I am quite confident that this isn't a DB issue, as other Harbor instances in the same network keep on working fine and the DB is basically idling.
I think the health check should
Harbor 2.2.1 - Don't know if that is also happening on older versions. (This is a new setup)