Over reporting clients waiting statistic when a checkout attempt is done against bad instance

**Describe the bug**
```
     database      |   user    | cl_active | cl_waiting | cl_cancel_req | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us |  pool_mode
-------------------+-----------+-----------+------------+---------------+-----------+---------+---------+-----------+----------+---------+------------+-------------
 shard_0_primary   | main_user |         0 |      57466 |             0 |         0 |       0 |       0 |         0 |        0 |       0 |          0 | transaction
 shard_0_replica_0 | main_user |         6 |          8 |             0 |         0 |       2 |       0 |         0 |        0 |       0 |          0 | transaction
 shard_0_replica_1 | main_user |         1 |      57258 |             0 |         0 |       0 |       0 |         0 |        0 |       0 |          0 | transaction
(3 rows)
```

**To Reproduce**
Create a pool with one primary and one replica then kill the replica

`docker-compose.yml`
```
version: '3'
services:
  pgp:
    image: postgres:12-bullseye
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
      POSTGRES_DB: main_db
      POSTGRES_USER: main_user
  pgr0:
    image: postgres:12-bullseye
    environment:
      POSTGRES_HOST_AUTH_METHOD: trust
      POSTGRES_DB: main_db
      POSTGRES_USER: main_user

  pgcat:
    build: https://github.com/levkk/pgcat.git#main
    restart: always
    command: ["pgcat", "/etc/pgcat/pgcat.toml"]
    environment:
      - RUST_LOG=debug
    ports:
      - 6000:5432
    links:
      - pgp
      - pgr0
    volumes:
      - "./config.toml:/etc/pgcat/pgcat.toml"

  app:
    image: ruby
    links:
      - pgcat
    volumes:
      - "./entrypoint.rb:/app/entrypoint.rb"
    command: bash -c "gem install pg && ruby /app/entrypoint.rb"

```
`config.toml`
```
[general]
host = "0.0.0.0"
port = 5432
pool_size = 2
pool_mode = "transaction"
connect_timeout = 5000
healthcheck_timeout = 500
ban_time = 1 # Seconds
statsd_address = ''
[user]
name = "main_user"
password = ""
[shards]
[shards.0]
servers = [
    [ "pgp", 5432, "primary" ],
    [ "pgr0", 5432, "replica" ]
]
database = "main_db"
[query_router]
default_role = "replica"
query_parser_enabled = false
primary_reads_enabled = true
sharding_function = "pg_bigint_hash"
```
`entrypoint.rb`
```
require 'pg'
$stdout.sync = true

def work
    conn = PG::connect("postgres://main_user:@pgcat:5432/main_db")
    conn.async_exec("SET SERVER ROLE TO 'replica'")
    conn.async_exec("SELECT 1")
    conn.close
rescue Exception => e
    puts "Encountered #{e}"
end

sleep 5
loop { 10.times.map { Thread.new { work } }.map(&:join) }
```


After that you can run
`docker compse stop pgp`
and run
`watch -n 1 'psql "postgres://main_user:@localhost:6000/pgbouncer" -c "SHOW POOLS;"'` to watch the waiting_clients explode

I think this is because we report the client as waiting early in the codepath and exit the codepath without reporting the client as disconnecting or active
https://github.com/levkk/pgcat/blob/main/src/pool.rs#L212


**Expected behavior**
We shouldn't count a client as waiting on an instance unless we actually consider it for the query.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Over reporting clients waiting statistic when a checkout attempt is done against bad instance #67

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Over reporting clients waiting statistic when a checkout attempt is done against bad instance #67

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions