Skip to content

fix(gateway): deduplicate PlatformConn to prevent content multiplication#12

Closed
hrygo wants to merge 1 commit into
mainfrom
fix-duplicate-pc-entry
Closed

fix(gateway): deduplicate PlatformConn to prevent content multiplication#12
hrygo wants to merge 1 commit into
mainfrom
fix-duplicate-pc-entry

Conversation

@hrygo
Copy link
Copy Markdown
Owner

@hrygo hrygo commented Apr 19, 2026

Summary

Fixes #11 — Feishu streaming card content duplication bug where responses appear 2x, 4x, 8x... multiplied after N messages.

Root Causes

  1. JoinPlatformSession creates duplicate pcEntry wrappersBridge.Handle calls JoinPlatformSession on every incoming message, but each call creates a new &pcEntry{pc: pc} map entry. Since each pcEntry is a distinct pointer (map key), N messages result in N event routing entries, causing N-fold content duplication.

  2. Close() silently dropped during PhaseCreating — When a worker finishes quickly, the done event arrives while EnsureCard is still in PhaseCreating (doing HTTP round-trips). Close() tries to transition to PhaseCompleted, but this transition was not allowed from PhaseCreating, so the final content flush was lost.

Changes

  • internal/gateway/hub.go: Added deduplication check in JoinPlatformSession — iterates existing session entries to find an existing pcEntry wrapping the same PlatformConn, returning early if found.

  • internal/messaging/feishu/streaming.go:

    • Added PhaseCompleted to PhaseCreating's allowed transitions in the phase state machine
    • Added defensive check in EnsureCard: after the card message is sent, if Close() was already called (phase is PhaseCompleted), immediately flush optimized content via IM patch and return

Test Plan

  • make test-short — all packages pass
  • make lint — 0 issues
  • Manual verification: send multiple Feishu messages, confirm no content duplication
  • Manual verification: send a message that triggers a very fast response, confirm content renders correctly

…content multiplication

Bridge.Handle calls JoinPlatformSession on every message, creating a new
pcEntry wrapper each time. Since each pcEntry is a distinct map key, N
messages result in N copies of every event being routed to the platform
connection (2x, 4x, 8x... duplication).

Additionally, when a worker finishes quickly, the done event can arrive
while EnsureCard is still in PhaseCreating. Close() tried to transition
to PhaseCompleted but this was not allowed from PhaseCreating, so the
final content flush was silently dropped.

Root causes:
- JoinPlatformSession did not deduplicate identical PlatformConns
- PhaseCreating did not allow transition to PhaseCompleted

Fixes #11
@hrygo hrygo closed this Apr 19, 2026
@hrygo hrygo deleted the fix-duplicate-pc-entry branch April 19, 2026 01:12
hrygo pushed a commit that referenced this pull request May 26, 2026
WARN #5/#16 — Add migrations-postgres/README.md explaining gaps
  (003=SQLite PRAGMAs, 008=SQLite event store optimize — PG-only skip)

WARN #24 — Strip trailing semicolon before appending RETURNING id
  in eventstore turns.insert PG rebind (prevents syntax error)

WARN #12 — Update env.example and config.yaml DSN examples
  from sslmode=disable → sslmode=prefer
hrygo added a commit that referenced this pull request May 27, 2026
* ✨ feat(db): add PostgreSQL dual-database support via dbutil.Dialect abstraction

Add opt-in PostgreSQL backend while preserving SQLite as the default.
A thin dbutil.Dialect layer (5 methods, 120-line Rebind state machine)
isolates all SQL dialect differences — no ORM, no existing interface changes.

Architecture:
- internal/dbutil/ — Dialect type + Rebind($1..$N) state machine + DB wrapper
- DBConfig split into Driver + SQLiteConfig + PostgresConfig sub-structs
- sqlutil.WriteMu becomes no-op on PG (MVCC handles concurrency natively)
- 9 PG migration files in migrations-postgres/ alongside SQLite originals
- 5 PG Store implementations (session/cron/eventstore/chat_access/api_key)
- gateway_run.go branches on db.driver: "sqlite" | "postgres"

Key design decisions:
- Dialect is a string constant type, not interface
- Rebind uses 6-state automaton handling string literals, quotes, $$, comments
- Existing Store interfaces zero-change
- Phase 0 extracted 3 missing interfaces: ChatAccessStorer, APIKeyUserStorer, DBExecutor

Stats: 46 files, +2103/-71, go build clean, 33/34 test suites pass

Closes: #487

* 📝 docs: improve AGENTS.md coverage and remove stale line counts

- Create AGENTS.md for internal/cron/ (timer engine, 3 schedule types,
  YAML import, backoff retry, attached session dispatch)
- Create AGENTS.md for internal/dbutil/ (WriteMu serialization, PRAGMA
  tuning, dialect abstraction, rebind)
- Add missing bot_registry.go and config.go to messaging/AGENTS.md
- Remove line counts from STRUCTURE sections across all subdirectory
  AGENTS.md files to prevent staleness (7 files affected)

* 🐛 fix(db): fix 5 issues in PostgreSQL dual-database support

- Fix nil interface trap in akStore constructors (typed nil pointer
  stored in interface caused panic in nil checks)
- Eliminate double PG connection in NewPGStore (accept shared *dbutil.DB
  instead of opening its own)
- Wire PG admin store through DI (export NewAPIKeyUserPGStore, pass via
  GatewayDeps/APIKeyStore)
- Fix openPostgres DSN source (use cfg.DSN() instead of cfg.Path) and
  honor Postgres.MaxOpenConns config
- Fix turns table success column type from INTEGER to BOOLEAN

* 🐛 fix(db): address 6 code review findings

F1 - Prevent double-close in gatewayStores: make PGStore.Close() a no-op
     (gatewayStores.close() already handles s.db.Close())

F2 - Fix Validate() for PG-only configs: guard db.path check with
     driver=sqlite gate, check both legacy Path and SQLite.Path

F3 - Fix BeginTx context cancellation: remove defer cancel() that
     violated database/sql contract for PG eventstore transactions

F4 - Fix SQLite init: use cfg.SQLite.Path with cfg.Path fallback
     in dbutil/openSQLite() and session/stores.go

F5 - Fix CLI OpenStore: branch on db.driver to support PostgreSQL
     cron commands (client.go, cron_cmd.go, cron_history.go)

F6 - Fix migration 010: replace DROP TABLE IF EXISTS with
     CREATE TABLE IF NOT EXISTS pattern

Also: add jackc/pgx/v5 stdlib import to sqlutil/driver.go
      update AGENTS.md PostgreSQL status line
      update config tests for structured DBConfig validation

* 🐛 fix(db): address PR #490 review blocking issues

Fix 2 ship-blocking issues from hotplex-ai review:

1. nil cache invalidator — APIKeyUserPGStore now exported and accepts
   dbResolver via SetInvalidator(); gateway_run.go wires it after init
2. apikey_pg_store create() hardcoded $N — uses dialect.Rebind() now

Plus 3 WARN fixes:
- Remove dead code var _ = (*sql.DB)(nil) from dialect.go
- Wrap error in apikey_pg_store get() with context message
- Export apiKeyUserPGStore → APIKeyUserPGStore

* 🎨 style(db): address PR #490 WARN items

- Use testify/require in rebind_test.go (was t.Errorf) — wraps table-driven
  tests in t.Run + require.Equal/require.True
- Add t.Parallel() to all db_test.go test functions
- Add TestDialectConstantsSync — compile-time check that dbutil and
  sqlutil dialect constants match
- Update session/AGENTS.md: pgstore.go stub → pg_store.go full PG impl
- Merge dual switch in migrate.go into single switch
- Add ConnMaxLifetime(5min) + PingContext validation on PG pool open
- Log warning when using default DSN (sslmode=disable)
- Remove dead code var _ = (*sql.DB)(nil) from dialect.go

* 🐛 fix(brain): remove dead nil checks in extractor_test.go

NewClaudeCodeExtractor() and NewOpenCodeExtractor() always allocate
and return non-nil pointers. The nil checks triggered SA5011 false
positives in staticcheck.

Remove the dead nil-check branch and unused extractor variable.

* 🎨 style(db): address PR #490 round-3 WARN items

W2 - Wrap errors in session PGStore GetExpiredMaxLifetime/GetExpiredIdle
     with fmt.Errorf (was raw err, inconsistent with DeleteTerminated)

W3 - Add t.Parallel() to all 9 test functions in rebind_test.go
     (pure string functions, safe to parallelize)

W4 - Extract DBConfig.EffectiveSQLitePath() to eliminate legacy path
     fallback duplication in dbutil/db.go + session/stores.go

W7 - Set ConnMaxIdleTime(5min) in openPostgres alongside
     ConnMaxLifetime (was infinite → stale connections on PG restart)

W12 - Add sync.Once to APIKeyUserPGStore.SetInvalidator()
     (prevents data race on hot-reload)

* 🐳 feat(docker): add PostgreSQL init + multi-DB Docker Compose setup

- Add docker/postgres-init.sql — creates hotplex DB and pgcrypto extension
- Add docker/docker-entrypoint.sh — dual-mode entrypoint (gateway or cron)
- Update Dockerfile — multi-stage build, pgx driver, healthcheck
- Update docker-compose.yml — postgres service, healthcheck, env vars
- Update docker-compose.prod.yml — production PG config with volume
- Update configs/env.example — add PG DSN and db.driver examples
- Update .dockerignore — exclude sql files from build context

* 🔒 fix(db): address PR #490 5th-round MUST FIX items

MUST FIX 1 — Default PG DSN sslmode=disable → sslmode=prefer
- PostgresConfig.DSN() default now uses sslmode=prefer (was =disable)
- Eliminate duplicate default DSN in dbutil/db.go openPostgres
  (cfg.DSN() already provides the default; the hasDefaultDSN check
  was dead code since DSN() never returned empty)
- Detect default via cfg.Postgres.ConnStr == "" instead

MUST FIX 2 — CLAUDE.md documentation sync
- Session: add pg_store.go (PostgreSQL persistence)
- sqlutil/: mention jackc/pgx/v5 PG driver + WriteMu PG no-op
- Add new dbutil/ entry (Dialect, Rebind, BoolValue, DB wrapper)
  to support module list

WARN — SetInvalidator via type assertion
- Add SetInvalidator() to APIKeyUserStorer interface
- Implement on both SQLite and PG stores
- Replace type assertion in gateway_run.go with interface call

* 🐳 feat(docker): add dedicated PG compose file + refine Docker configs

- Add docker-compose.pg.yml — PostgreSQL-only stack for testing
- Refine Dockerfile, docker-compose.yml, docker-compose.prod.yml
- Update docker-entrypoint.sh for PG DSN env injection

* ✅ test(db): add sqlmock tests for all 5 PG stores (round 6 MF1)

MF1 - 5 PG store test files, 24 test cases, zero → covered

New files:
- internal/session/pg_store_test.go       (6 tests)
- internal/cron/pg_store_test.go          (6 tests)
- internal/eventstore/pg_store_test.go    (3 tests)
- internal/admin/apikey_pg_store_test.go  (4 tests)
- internal/messaging/chat_access_pg_store_test.go (5 tests)

MF3 - Remove root user override from docker-compose.yml
      (Dockerfile already uses USER hotplex + COPY --chown)

Pattern: go-sqlmock + testify/require + t.Parallel() + regexp.QuoteMeta
Coverage: success paths + error paths (NotExist, duplicate, ErrNoRows)

* chore: update Dockerfile and docker-compose.prod.yml

* chore: update Dockerfile and docker-compose.prod.yml

* 🔧 fix(db): address PR #490 WARN items (round 6)

WARN #5/#16 — Add migrations-postgres/README.md explaining gaps
  (003=SQLite PRAGMAs, 008=SQLite event store optimize — PG-only skip)

WARN #24 — Strip trailing semicolon before appending RETURNING id
  in eventstore turns.insert PG rebind (prevents syntax error)

WARN #12 — Update env.example and config.yaml DSN examples
  from sslmode=disable → sslmode=prefer

* fix(db): fix PostgreSQL int4 overflow, banner display, and config binding

- Fix INTEGER→BIGINT for timestamp columns in PG migrations (002,005,007,009)
- Add migration 012 to ALTER existing tables with BIGINT timestamps
- Fix startup banner to show "PostgreSQL" instead of SQLite path when using PG
- Add BindEnv for db.driver and db.postgres.* config fields
- Fix ConnMaxIdleTime from 5min to 3min
- Add EffectiveMaxOpenConns bridge method
- Add Makefile dev-pg target for PostgreSQL dev environment
- Fix docker-compose security and PG config issues

* feat(db): add db-stats skill manual with go:embed integration

Add database awareness and statistics analysis manual (db-stats-skill-manual.md):
- 4-step database detection: process → env vars (incl. MAKEFLAGS) → .env → config
- Complete schema reference for all 6 tables (SQLite/PG type differences)
- 9 categories of analytics SQL templates with index-aware optimization
- Fix 7 SQL issues found in audit: index prevention, sort order, PG BOOLEAN, JOIN optimization

Integrate via go:embed pattern (matching cron skill manual):
- internal/dbutil/skill.go: embed + SkillManual()
- gateway_run.go: release to ~/.hotplex/skills/db-stats.md on startup
- META-COGNITION.md: §8 B-channel directive for mandatory pre-read

* refactor: streamline db-stats META-COGNITION entry and add conflict rule

* refactor: scope db-stats directive to HotPlex operational data only

* fix: address PR #490 review items — security, correctness, fail-fast

- P0: Remove POSTGRES_ vars from envsubst to prevent password leaking into on-disk config
- IsUniqueViolation: replace fragile string matching with pgx type assertion (errors.As)
- apiKeyUserStore: add SQLite-only doc comment, fix LastInsertId error handling
- openPostgres: fail-fast when DSN is empty instead of using insecure default
- pgStore SetInvalidator: replace sync.Once with mutex to avoid silently dropping updates

* fix: address PR #490 round-7 review — correctness, security, dead code

- Fix env var name: HOTPLEX_DB_POSTGRES_CONNSTR → HOTPLEX_DB_POSTGRES_DSN
- Add mutex to apiKeyUserStore (parity with pgStore, prevents data race)
- Wrap pgStore.create error with fmt.Errorf for consistency
- Add Effective* bridge methods for all SQLite pragma config fields
- pragma.go: use Effective* methods instead of flat legacy fields
- envsubst: replace prefix grep with explicit allowlist (YAML injection fix)
- Remove dead openPostgresDB from sqlutil (dbutil.Open is the PG path)
- Add writeMu nil comment for PG path in gatewayStores
- Config.Validate: add PG DSN required check when driver=postgres
- .dockerignore: restore configs/*.yaml exclusion, keep config.yaml

* fix: remove redundant ON CONFLICT SET id in UpsertByName

PG preserves conflict row columns not in SET clause; id = cron_jobs.id
was a no-op. state and created_at kept as explicit runtime-state guards.

* fix: address PR #490 round-8 review — writeMu serialization, CLI writeMu, migration safety, DSN cleanup

P1: apiKeyUserStore write/create/update/delete now wrapped with writeMu.WithLock() for SQLite serialization
P1: CLI cron path creates writeMu instead of passing nil to session/cron/event stores
P1: PG migration 009 adds IF NOT EXISTS for idempotent re-runs
P2: PostgresConfig.DSN() returns empty string when unconfigured instead of misleading default
P3: EffectiveWALMode zero-value ambiguity documented

* fix: address PR #490 round-9 review — PG BOOLEAN scan, CLI migrations, envsubst cleanup

P1: Add scanJobRowPG for PostgreSQL BOOLEAN→bool scanning (pgx returns bool, not int)
P1: CLI OpenStore PG path now runs goose migrations before creating stores
P2: Remove HOTPLEX_DB_POSTGRES_DSN from envsubst allowlist (Viper handles it)
P2: NewWriteMu empty dialect defaults to SQLite for consistent nil-safe behavior

* fix: migration 012 Down path — prevent BIGINT truncation to INTEGER

The Down migration used TYPE INTEGER which would silently truncate
Unix ms timestamps (~1.7×10¹² exceeds int4 max). Replaced with no-op
since reverting to INTEGER is unsafe.

* fix: address PR #490 round-10 review — wrap all bare return nil, err with fmt.Errorf

P2: Add fmt.Errorf("...: %w", err) context to 8 bare error returns across
dbutil, cron, eventstore, and session PG stores per CLAUDE.md error convention.

---------

Co-authored-by: 黄飞虹 <aaronwong1989@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming card flush 重复内容:done 事件后未停止 flush 导致回复内容被重复 4 倍

2 participants