receiver/postgresql: improve connection management #30831

Closed
kevinnoel-be opened this issue Jan 29, 2024 · 7 comments
@kevinnoel-be
Contributor

Component(s)

receiver/postgresql

Is your feature request related to a problem? Please describe.

We've been trying to enable the postgresql receiver on our fleet but encountered a "log spam" issue due to how this receiver manages connections, i.e. it connects and disconnects on each scrape, per database.
All of our PostgreSQL instances are configured, for security/compliance reasons, with log_connections=on & log_disconnections=on, which makes this receiver very verbose, especially at this scale (2000+ PostgreSQL instances).

For example, here is an excerpt of PG logs generated during one basic test scrape:

PG logs
2024-01-29 17:09:50.965 UTC [55] LOG:  connection received: host=172.18.0.1 port=33724
2024-01-29T17:09:50.984870001Z 2024-01-29 17:09:50.984 UTC [55] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:09:50.984992982Z 2024-01-29 17:09:50.984 UTC [55] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:09:50.993220875Z 2024-01-29 17:09:50.991 UTC [56] LOG:  connection received: host=172.18.0.1 port=33734
2024-01-29T17:09:50.993247245Z 2024-01-29 17:09:50.992 UTC [57] LOG:  connection received: host=172.18.0.1 port=33738
2024-01-29T17:09:50.999776857Z 2024-01-29 17:09:50.999 UTC [56] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:09:50.999804090Z 2024-01-29 17:09:50.999 UTC [56] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:09:51.004496467Z 2024-01-29 17:09:51.004 UTC [57] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:09:51.004515817Z 2024-01-29 17:09:51.004 UTC [57] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:09:51.008993235Z 2024-01-29 17:09:51.008 UTC [57] LOG:  disconnection: session time: 0:00:00.016 user=otelu database=postgres host=172.18.0.1 port=33738
2024-01-29T17:09:51.010942781Z 2024-01-29 17:09:51.010 UTC [58] LOG:  connection received: host=172.18.0.1 port=33740
2024-01-29T17:09:51.015009416Z 2024-01-29 17:09:51.014 UTC [58] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:09:51.015032320Z 2024-01-29 17:09:51.014 UTC [58] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:09:51.027208786Z 2024-01-29 17:09:51.027 UTC [59] LOG:  connection received: host=172.18.0.1 port=33750
2024-01-29T17:09:51.031809257Z 2024-01-29 17:09:51.031 UTC [59] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:09:51.031827320Z 2024-01-29 17:09:51.031 UTC [59] LOG:  connection authorized: user=otelu database=otel2
2024-01-29T17:09:51.057199872Z 2024-01-29 17:09:51.057 UTC [59] LOG:  disconnection: session time: 0:00:00.030 user=otelu database=otel2 host=172.18.0.1 port=33750
2024-01-29T17:09:51.057529056Z 2024-01-29 17:09:51.057 UTC [56] LOG:  disconnection: session time: 0:00:00.066 user=otelu database=postgres host=172.18.0.1 port=33734
2024-01-29T17:09:51.057537967Z 2024-01-29 17:09:51.057 UTC [58] LOG:  disconnection: session time: 0:00:00.046 user=otelu database=postgres host=172.18.0.1 port=33740
2024-01-29T17:09:51.059634469Z 2024-01-29 17:09:51.057 UTC [55] LOG:  disconnection: session time: 0:00:00.093 user=otelu database=postgres host=172.18.0.1 port=33724
2024-01-29T17:10:00.965055493Z 2024-01-29 17:10:00.964 UTC [61] LOG:  connection received: host=172.18.0.1 port=60680
2024-01-29T17:10:00.969273549Z 2024-01-29 17:10:00.969 UTC [61] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:10:00.969294007Z 2024-01-29 17:10:00.969 UTC [61] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:10:00.973458607Z 2024-01-29 17:10:00.973 UTC [62] LOG:  connection received: host=172.18.0.1 port=60696
2024-01-29T17:10:00.974047716Z 2024-01-29 17:10:00.973 UTC [63] LOG:  connection received: host=172.18.0.1 port=60700
2024-01-29T17:10:00.977926682Z 2024-01-29 17:10:00.977 UTC [62] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:10:00.977940513Z 2024-01-29 17:10:00.977 UTC [62] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:10:00.980855639Z 2024-01-29 17:10:00.980 UTC [63] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:10:00.980869061Z 2024-01-29 17:10:00.980 UTC [63] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:10:00.987216600Z 2024-01-29 17:10:00.985 UTC [63] LOG:  disconnection: session time: 0:00:00.011 user=otelu database=postgres host=172.18.0.1 port=60700
2024-01-29T17:10:00.987237072Z 2024-01-29 17:10:00.986 UTC [64] LOG:  connection received: host=172.18.0.1 port=60716
2024-01-29T17:10:00.990905486Z 2024-01-29 17:10:00.990 UTC [64] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:10:00.990924449Z 2024-01-29 17:10:00.990 UTC [64] LOG:  connection authorized: user=otelu database=postgres
2024-01-29T17:10:01.003217141Z 2024-01-29 17:10:01.002 UTC [65] LOG:  connection received: host=172.18.0.1 port=60722
2024-01-29T17:10:01.007543161Z 2024-01-29 17:10:01.006 UTC [65] LOG:  connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100)
2024-01-29T17:10:01.007561344Z 2024-01-29 17:10:01.006 UTC [65] LOG:  connection authorized: user=otelu database=otel2
2024-01-29T17:10:01.035387631Z 2024-01-29 17:10:01.033 UTC [65] LOG:  disconnection: session time: 0:00:00.030 user=otelu database=otel2 host=172.18.0.1 port=60722
2024-01-29T17:10:01.035411643Z 2024-01-29 17:10:01.033 UTC [64] LOG:  disconnection: session time: 0:00:00.046 user=otelu database=postgres host=172.18.0.1 port=60716
2024-01-29T17:10:01.035414531Z 2024-01-29 17:10:01.033 UTC [62] LOG:  disconnection: session time: 0:00:00.060 user=otelu database=postgres host=172.18.0.1 port=60696
2024-01-29T17:10:01.035980287Z 2024-01-29 17:10:01.033 UTC [61] LOG:  disconnection: session time: 0:00:00.069 user=otelu database=postgres host=172.18.0.1 port=60680

This prevents us from rolling out this receiver more widely, as we can neither drop these logs nor disable those log settings on the PostgreSQL side.

Describe the solution you'd like

Connection reuse/pooling to avoid these spikes of logs on each scrape interval across our fleet, e.g. by using a connection pool. This could be an opt-in configuration, of course.

Describe alternatives you've considered

Forking the receiver in our internal custom build

Additional context

To get an idea, I ran the integration test locally with both PostgreSQL settings log_connections & log_disconnections enabled:

  • using the current code with lib/pq, with no reuse of connections:
    • 12 connection authorized: user=otelu audit logs
    • the same count for disconnections
  • a (very rough) code change using connection pooling with jackc/pgx/v5/pgxpool (a sketch of this approach is shown below):
    • 4 connection authorized: user=otelu audit logs
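
For illustration, here is a minimal sketch of what per-database connection reuse with jackc/pgx/v5/pgxpool could look like; the poolCache type, its fields, and the connection-string handling are hypothetical and not the receiver's actual code:

```go
// Illustrative sketch only: reuse one pgxpool.Pool per database across scrapes
// instead of opening and closing a connection on every scrape.
package pooling

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5/pgxpool"
)

// poolCache is a hypothetical cache of pools keyed by database name.
type poolCache struct {
	pools map[string]*pgxpool.Pool
}

// getPool returns an existing pool for the database or creates one lazily.
func (c *poolCache) getPool(ctx context.Context, connStr, db string) (*pgxpool.Pool, error) {
	if p, ok := c.pools[db]; ok {
		return p, nil
	}
	cfg, err := pgxpool.ParseConfig(fmt.Sprintf("%s dbname=%s", connStr, db))
	if err != nil {
		return nil, err
	}
	// Keep the pool small; scraping is a short periodic burst of queries.
	cfg.MaxConns = 2
	p, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		return nil, err
	}
	c.pools[db] = p
	return p, nil
}

// shutdown closes all pools, e.g. from the receiver's Shutdown hook.
func (c *poolCache) shutdown() {
	for _, p := range c.pools {
		p.Close()
	}
}
```

Because pools stay open between scrapes, PostgreSQL would only emit connection/disconnection audit logs when a pool grows, recycles connections, or shuts down, rather than on every scrape.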
kevinnoel-be added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Jan 29, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

djaglowski removed the needs triage (New item requiring triage) label on Jan 29, 2024
@djaglowski
Member

This seems like a reasonable improvement to me. @kevinnoel-be, are you interested in submitting a PR for this?

@kevinnoel-be
Contributor Author

@djaglowski
We would prefer to have this upstream, so yes, I'd be willing to take a stab at it.
I'd be curious how you see these changes happening:

  • keep the current behavior (even the lib/pq module?) by default and optionally enable pooling?
  • enable pooling by default and replace lib/pq altogether?
  • should we expose some/most/all of the pooling config? In any case at least min/max pool size, I'd guess; the callbacks are not useful here.

@djaglowski
Member

Thanks @kevinnoel-be, I'll assign the issue to you and look forward to your PR.

  • keep the current behavior (even the lib/pq module?) by default and optionally enable pooling?
  • enable pooling by default and replace lib/pq altogether?

Since this is a substantial change to the internals of the receiver, I think we must manage this transition with a feature gate. Basically, we'll have to give the user a choice between the libraries. When the gate is alpha, lib/pq is used by default, but optionally you can use the other. Then once this has been validated by some users, we can bump the gate to beta, which makes the other library default, but still leaves an escape hatch to switch back to lib/pq. Once this has been well validated, we can mark the gate as stable and eventually remove it.
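
For reference, the collector's featuregate package is the usual mechanism for this kind of transition. A minimal sketch of how such a gate might be registered follows; the gate ID and description are illustrative assumptions, not the identifier the eventual PR will use:

```go
// Illustrative sketch only: registering a feature gate for the lib/pq -> pgx transition.
// The gate ID and description below are assumptions for illustration.
package pooling

import "go.opentelemetry.io/collector/featuregate"

var connectionPoolGate = featuregate.GlobalRegistry().MustRegister(
	"receiver.postgresql.connectionPool", // hypothetical gate ID
	featuregate.StageAlpha,               // alpha: disabled by default, opt-in via --feature-gates
	featuregate.WithRegisterDescription("Use pooled connections instead of per-scrape lib/pq connections."),
)

// isPoolingEnabled would be checked wherever the receiver opens database connections.
func isPoolingEnabled() bool {
	return connectionPoolGate.IsEnabled()
}
```

While the gate is at alpha, users would opt in with the collector's --feature-gates flag; promoting it to beta flips the default, as described above.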

should we expose some/most/all of the pooling config

We can work out details here but I think the following guidelines should apply:

  1. Since all these settings are naturally related, we should group them under a subsection of the config (e.g. connection_pool; see the sketch after this list).
  2. We should start with minimal settings but can expose additional settings over time as necessary.
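
As a rough illustration of the first guideline, the pooling settings could live under a dedicated connection_pool sub-config. The field names and types below are assumptions for discussion, not a committed schema:

```go
// Illustrative sketch only: grouping pooling options under a `connection_pool`
// sub-config. Field names and types are assumptions, not the receiver's actual schema.
package pooling

import "time"

type Config struct {
	// ... existing postgresql receiver settings ...

	// ConnectionPool would only be honored when the pooling feature gate is enabled.
	ConnectionPool ConnectionPoolConfig `mapstructure:"connection_pool"`
}

type ConnectionPoolConfig struct {
	MaxIdleTime time.Duration `mapstructure:"max_idle_time"` // close idle connections after this duration
	MaxLifetime time.Duration `mapstructure:"max_lifetime"`  // recycle connections after this duration
	MinConns    int32         `mapstructure:"min_conns"`     // minimum pooled connections per database
	MaxConns    int32         `mapstructure:"max_conns"`     // maximum pooled connections per database
}
```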

@kevinnoel-be
Contributor Author

Thanks @djaglowski, I'll take those into account when creating the PR.

Another question that will drive which version of pgx we can use for these changes: I see that this receiver is marked as supporting PostgreSQL 9.6+ in the README. Do we still want to keep that much backwards compatibility, given that this is quite an old version now (list of PostgreSQL versions here)? It seems the oldest still-supported version is PostgreSQL 12, with ~9 months of support remaining.

@djaglowski
Member

@kevinnoel-be, I think it's probably reasonable to drop support for older versions, but this should be a separate issue so others have a chance to comment.

@kevinnoel-be
Contributor Author

@djaglowski I've created #30923 to open the discussion around the version support.

djaglowski pushed a commit that referenced this issue Mar 4, 2024
**Description:**
Reuse connections created per database (configured or discovered), instead of the current behavior of creating & closing a connection per database on each scrape.

**Link to tracking Issue:**

#30831

**Testing:**
Updated unit & integration tests. Also ran multiple scenarios locally:
- no feature gate specified (default): current behavior maintained, connections created/closed per database on each scrape
- connection pool feature gate enabled, no connection pool config specified (default): reduced number of connections created/closed
- connection pool feature gate enabled, connection pool config tweaked: connections created on the first scrape & closed when the configured lifetime is reached or the collector shuts down

**Documentation:**
- changelog
- README entry for the feature gate & the related optional configuration

**Note**
Checking internally to get the CLA signed.
breedx-splk pushed a commit to breedx-splk/opentelemetry-collector-contrib that referenced this issue Mar 4, 2024
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024