Skip to content

Over reporting clients waiting statistic when a checkout attempt is done against bad instance #67

@drdrsh

Description

@drdrsh

Describe the bug

     database      |   user    | cl_active | cl_waiting | cl_cancel_req | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us |  pool_mode
-------------------+-----------+-----------+------------+---------------+-----------+---------+---------+-----------+----------+---------+------------+-------------
 shard_0_primary   | main_user |         0 |      57466 |             0 |         0 |       0 |       0 |         0 |        0 |       0 |          0 | transaction
 shard_0_replica_0 | main_user |         6 |          8 |             0 |         0 |       2 |       0 |         0 |        0 |       0 |          0 | transaction
 shard_0_replica_1 | main_user |         1 |      57258 |             0 |         0 |       0 |       0 |         0 |        0 |       0 |          0 | transaction
(3 rows)

To Reproduce
Create a pool with one primary and one replica then kill the replica

docker-compose.yml

version: '3'
services:
  pgp:
    image: postgres:12-bullseye
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
      POSTGRES_DB: main_db
      POSTGRES_USER: main_user
  pgr0:
    image: postgres:12-bullseye
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
      POSTGRES_DB: main_db
      POSTGRES_USER: main_user

  pgcat:
    build: https://github.com/levkk/pgcat.git#main
    restart: always
    command: ["pgcat", "/etc/pgcat/pgcat.toml"]
    environment:
      - RUST_LOG=debug
    ports:
      - 6000:5432
    links:
      - pgp
      - pgr0
    volumes:
      - "./config.toml:/etc/pgcat/pgcat.toml"

  app:
    image: ruby
    links:
      - pgcat
    volumes:
      - "./entrypoint.rb:/app/entrypoint.rb"
    command: bash -c "gem install pg && ruby /app/entrypoint.rb"

config.toml

[general]
host = "0.0.0.0"
port = 5432
pool_size = 2
pool_mode = "transaction"
connect_timeout = 5000
healthcheck_timeout = 500
ban_time = 1 # Seconds
statsd_address = ''
[user]
name = "main_user"
password = ""
[shards]
[shards.0]
servers = [
    [ "pgp", 5432, "primary" ],
    [ "pgr0", 5432, "replica" ]
]
database = "main_db"
[query_router]
default_role = "replica"
query_parser_enabled = false
primary_reads_enabled = true
sharding_function = "pg_bigint_hash"

entrypoint.rb

require 'pg'
$stdout.sync = true

def work
    conn = PG::connect("postgres://main_user:@pgcat:5432/main_db")
    conn.async_exec("SET SERVER ROLE TO 'replica'")
    conn.async_exec("SELECT 1")
    conn.close
rescue Exception => e
    puts "Encountered #{e}"
end

sleep 5
loop { 10.times.map { Thread.new { work } }.map(&:join) }

After that you can run
docker compse stop pgp
and run
watch -n 1 'psql "postgres://main_user:@localhost:6000/pgbouncer" -c "SHOW POOLS;"' to watch the waiting_clients explode

I think this is because we report the client as waiting early in the codepath and exit the codepath without reporting the client as disconnecting or active
https://github.com/levkk/pgcat/blob/main/src/pool.rs#L212

Expected behavior
We shouldn't count a client as waiting on an instance unless we actually consider it for the query.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions