Harbor can't recover when the connection to the database drops. #14856
In addition, the job service also does not recover once the connection to the DB is temporarily dropped.
Hey 👋🏼 I was chatting about the same issue with @ywk253100 in Slack; we're experiencing the same problem in both core and jobservice.

In our configuration, the Postgres database is deployed as a StatefulSet (using Zalando's postgres-operator) and Harbor is configured to connect via the ClusterIP Service. After a DB failover/switchover, the Service points to the now-active primary Postgres Pod, and testing connectivity via the Service works just fine. I believe the issue is that the clients (core, jobservice) still try to connect via the "old" session and fail; they need to re-establish a new connection when this condition happens. Happy to provide more information/logs/etc.
Same setup that we have as well.
I have a similar feeling. The DB is online and definitely working, but the job service still holds on to a non-existent handle or something. We have an RDS DB outside our K8s cluster.
@Vad1mo Is your database also an external one? @dkulchinsky I cannot reproduce this issue in my env with the latest master code.
@ywk253100 we're still on v2.1, and I assume master is v2.2? Perhaps there were related changes there; could you test the same with v2.1? When you say
Hitting this as well, using an RDS DB |
I'm also seeing this issue in a configuration where harbor v2.2.1 connects through pgbouncer as the connection pooler. |
Could you share the architecture of the DB and the config of the pgbouncer so that we can reproduce this problem? Thanks.
@ywk253100, we are on an external Postgres DB similar to AWS Aurora. On the jobservice, logging to the DB needs to be enabled. I'll try to reproduce the error by dropping the idle connection. What you can also do is reduce the idle and max connections, for example to 1 and 10.
I have the same issue with an Azure Postgres database and Harbor 2.2.1; the only workaround I found is to restart the core pods.
@ywk253100 if you want to debug this, I can provide you with a cluster with remote code debugging access so we could take a look at it together. Ping me on Slack so we can set this up! How does that sound?
We have the same issue: Azure Postgres and Harbor 2.2.1, with similar log output from core & jobservice. @Vad1mo are you also using Azure Redis? (Which tier, and which region?)
@lukasmrtvy we are using jobservice to forward logs to Postgres, and the error only happens when this feature is enabled. Logging to stdout or to a file does not cause this issue.
We encountered this issue a couple of times. There are a lot of logs like this: The server is not responsive in that state! A manual restart of the core service fixes the issue for some time, but because of it there is a time window when our application is not available.

We encountered the same issue in our Java apps, and we fixed it using connection validators (provided by the WildFly server). A connection validator checks all database connections in the pool at specified intervals and tries to execute very simple statements, like

In my opinion Harbor requires a similar solution, or at least it should remove connections from the pool after some time. I suspect that after a while all database connections in the pool can be broken. I also saw that a similar issue was fixed for MySQL:
I spent a few hours resolving this issue; @jowko's recommendation to set the

References
In the long run, a switch to pgx would be the ideal outcome, so please upvote #15209. I'll provide a patch and a test image with the fix on top of 2.3 for you to try.
I found out that the Harbor exporter is also affected by this issue. I don't know if the above fix will cover this case. There are a lot of logs such as this:
When the connection to the database is lost (unreliable connectivity), Harbor can't recover by itself, and the health check doesn't notice that there is a problem. Restarting the pod solves the issue immediately, until the next connectivity problem.
This problem occurs multiple times a day.
I am quite confident that this isn't a DB issue, as other Harbor instances in the same network keep on working fine and the DB is basically idling.
I think the health check should
Harbor 2.2.1 - Don't know if that is also happening on older versions. (This is a new setup)