5 changes: 3 additions & 2 deletions architecture/README.md
@@ -10,8 +10,9 @@ These files carry **no frontmatter** — they are prose, dated by git.

## Capabilities

- [retry.md](retry.md) — `postgres_retry`, the async tenacity decorator and its
cause-chain retry predicate.
- [retry.md](retry.md) — `postgres_retry`, the async tenacity decorator.
- [retriable.md](retriable.md) — `is_retriable` / `RETRIABLE_ASYNCPG_ERRORS`,
the pure retriable-error predicate and cause-chain walk.
- [connections.md](connections.md) — `build_connection_factory`, multi-host
load balancing and failover.
- [dsn.md](dsn.md) — `build_db_dsn` / `is_dsn_multihost`, DSN parsing and
13 changes: 13 additions & 0 deletions architecture/glossary.md
@@ -0,0 +1,13 @@
# Glossary

The project's ubiquitous language — the domain terms that code, specs, and
capability pages share. Living prose, no frontmatter, dated by git. Each entry is
a term, what it *is* (not what it does), and the synonyms to avoid. No
implementation detail; this is a glossary, not a spec.

**Retriable error**:
A PostgreSQL failure transient enough to retry the operation unchanged: a
serialization failure or a lost connection. The operation may succeed if retried
without modification, because the failure does not reflect a logical error in the
request itself.
_Avoid_: transient error, recoverable error
56 changes: 56 additions & 0 deletions architecture/retriable.md
@@ -0,0 +1,56 @@
# Retriable

`db_retry/retriable.py` classifies a PostgreSQL exception as retriable — a pure
predicate that can be tested in memory without a live database.

## `is_retriable`

```python
def is_retriable(exception: BaseException) -> bool: ...
```

Returns `True` if `exception`, or any exception in its `__cause__`/`__context__`
chain, is a `sqlalchemy.exc.DBAPIError` whose `.orig` is set and whose
`.orig.__cause__` is an instance of one of the
[`RETRIABLE_ASYNCPG_ERRORS`](#retriable_asyncpg_errors).

The predicate is **pure**: no logging, no side effects. `postgres_retry` wraps it
in `_log_and_decide`, which adds the two debug lines and feeds the result to
tenacity.

## `RETRIABLE_ASYNCPG_ERRORS`

```python
RETRIABLE_ASYNCPG_ERRORS = (asyncpg.SerializationError, asyncpg.PostgresConnectionError)
```

The two asyncpg error classes that make a `DBAPIError` retriable:

| Class | SQLSTATE | Meaning |
|---|---|---|
| `asyncpg.SerializationError` | `40001` | Serialization failure — transaction conflicted with a concurrent write; retry may succeed. |
| `asyncpg.PostgresConnectionError` | class `08` (e.g. `08000`, `08003`) | Lost or refused connection — transient network or server state; retry may reconnect. |

`StatementCompletionUnknownError` (`40003`) is **not** included: when the
statement's outcome is unknown, a blind retry risks duplicating a write that
already committed. An unknown outcome is therefore classified as not retriable.
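
The inclusion rule above is an `isinstance` check against a tuple, so subclasses of a listed class match while sibling classes do not. The sketch below illustrates that with stand-in classes only. Every `Fake…` name is hypothetical, and the hierarchy (the excluded class sharing a parent with the serialization error) is an assumption about the driver's layout that this example neither imports nor verifies.

```python
# Stand-in hierarchy (hypothetical names); the real classes live in asyncpg.
class FakePostgresError(Exception):
    pass


class FakeTransactionRollbackError(FakePostgresError):
    pass


class FakeSerializationError(FakeTransactionRollbackError):
    pass


class FakeStatementCompletionUnknownError(FakeTransactionRollbackError):
    pass


class FakePostgresConnectionError(FakePostgresError):
    pass


class FakeConnectionDoesNotExistError(FakePostgresConnectionError):
    pass


# Same shape as RETRIABLE_ASYNCPG_ERRORS: a plain tuple of classes.
FAKE_RETRIABLE_ERRORS = (FakeSerializationError, FakePostgresConnectionError)


def is_retriable_class(error: BaseException) -> bool:
    # isinstance() with a tuple matches listed classes and their subclasses only.
    return isinstance(error, FAKE_RETRIABLE_ERRORS)


samples = [
    FakeSerializationError(),
    FakeConnectionDoesNotExistError(),      # subclass of a listed class
    FakeStatementCompletionUnknownError(),  # sibling of a listed class
    FakePostgresError(),                    # parent of the listed classes
]
verdicts = {type(error).__name__: is_retriable_class(error) for error in samples}
```

Listing leaf classes rather than a shared parent is what keeps lookalike errors out of the retriable set.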

## Cause-chain walk

`is_retriable` does not inspect only the top exception — it walks the chain:

1. At each link, follow `__cause__` if it is set (an explicit `raise … from …`);
otherwise fall back to `__context__` (an implicit exception chain).
2. Guard against cycles with a `seen` set of `id()`s — if the same exception
object appears twice, the walk terminates.
3. Return `True` at the **first** link that is a retriable `DBAPIError`.

The walk matters because `DBAPIError` is often re-wrapped before it surfaces.
For example, advanced-alchemy's `wrap_sqlalchemy_exception()` raises a
`RepositoryError` (or `IntegrityError`) with the real `DBAPIError` attached as
`__cause__`; without the walk, the decorator would see only the outer wrapper and
give up.
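
The walk can be exercised without a database. The following is a minimal sketch using stand-in classes in place of `sqlalchemy.exc.DBAPIError`, the asyncpg errors, and the outer wrapper. Every `Fake…` name and both helper functions are hypothetical, and only the `.orig` attribute of the DBAPI error is modelled.

```python
from __future__ import annotations


class FakeSerializationError(Exception):
    """Stand-in for a retriable driver error."""


class FakeDBAPIError(Exception):
    """Stand-in for sqlalchemy.exc.DBAPIError; only `.orig` is modelled."""

    def __init__(self, orig: BaseException | None) -> None:
        super().__init__("dbapi failure")
        self.orig = orig


class FakeRepositoryError(Exception):
    """Stand-in for an outer wrapper raised by a repository layer."""


FAKE_RETRIABLE_ERRORS = (FakeSerializationError,)


def is_retriable_link(exception: BaseException) -> bool:
    return (
        isinstance(exception, FakeDBAPIError)
        and exception.orig is not None
        and isinstance(exception.orig.__cause__, FAKE_RETRIABLE_ERRORS)
    )


def is_retriable(exception: BaseException) -> bool:
    current: BaseException | None = exception
    seen: set[int] = set()
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # cycle guard
        if is_retriable_link(current):
            return True
        current = current.__cause__ or current.__context__
    return False


def make_dbapi_error(cause: BaseException) -> FakeDBAPIError:
    # Hypothetical helper: hang `cause` off the driver-level `.orig` exception.
    orig = Exception("driver-level error")
    orig.__cause__ = cause
    return FakeDBAPIError(orig)


def rewrap(inner: BaseException) -> FakeRepositoryError:
    # Hypothetical helper: wrap `inner` the way `raise ... from ...` would.
    try:
        raise FakeRepositoryError("wrapped") from inner
    except FakeRepositoryError as outer:
        return outer


wrapped_retriable = rewrap(make_dbapi_error(FakeSerializationError()))
wrapped_other = rewrap(ValueError("not a database error"))

# Two exceptions that point at each other: the `seen` set stops the walk.
first, second = Exception("first"), Exception("second")
first.__cause__, second.__cause__ = second, first

results = (
    is_retriable(wrapped_retriable),
    is_retriable(wrapped_other),
    is_retriable(first),
)
```

Here the re-wrapped retriable error is found one link down, the unrelated wrapper yields `False`, and the cyclic chain terminates with `False` instead of looping.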

## Related

- [retry.md](retry.md) — `postgres_retry`, which consumes `is_retriable` via `_log_and_decide`.
- [glossary.md](glossary.md) — **Retriable error** definition.
31 changes: 7 additions & 24 deletions architecture/retry.md
@@ -25,35 +25,18 @@ is read per invocation, not frozen at decoration.

Each call builds a `tenacity.AsyncRetrying` with:

- `stop=stop_after_attempt(retries or get_retries_number())`
- `stop=stop_after_attempt(retries if retries is not None else get_retries_number())`
- `wait=wait_exponential_jitter()` — exponential backoff with jitter
- `retry=retry_if_exception(_retry_handler)` — the predicate below
- `retry=retry_if_exception(_log_and_decide)` — delegates to
[`is_retriable`](retriable.md); `_log_and_decide` adds the two debug log lines
(`"postgres_retry, retrying"` / `"postgres_retry, giving up on retry"`) around
the pure predicate
- `reraise=True` — the **original** exception propagates after the last attempt,
not tenacity's `RetryError`
- `before=before_log(logger, DEBUG)` — debug log before each attempt
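
The two `stop=` spellings in the list above differ only when `retries` is falsy but not `None`. A small sketch of that difference, where `DEFAULT_RETRIES` is a hypothetical stand-in for the value returned by `get_retries_number()`:

```python
from __future__ import annotations

DEFAULT_RETRIES = 3  # hypothetical stand-in for get_retries_number()


def attempts_with_or(retries: int | None) -> int:
    # `or` falls back to the default for any falsy value, including 0.
    return retries or DEFAULT_RETRIES


def attempts_with_none_check(retries: int | None) -> int:
    # Only a missing value (None) falls back to the default.
    return retries if retries is not None else DEFAULT_RETRIES


comparison = {
    retries: (attempts_with_or(retries), attempts_with_none_check(retries))
    for retries in (None, 0, 5)
}
```

With the `is not None` form an explicit `retries=0` reaches `stop_after_attempt` unchanged instead of being silently replaced by the default. How tenacity treats that value is outside this sketch.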

## What counts as retriable

`_is_retriable_dbapi_error` returns `True` only for a `sqlalchemy.exc.DBAPIError`
whose `.orig` is set and whose `.orig.__cause__` is an
`asyncpg.SerializationError` (SQLSTATE `40001`) or
`asyncpg.PostgresConnectionError` (class `08`, e.g. `08000`/`08003`). This
deliberately excludes lookalikes such as `StatementCompletionUnknownError`
(`40003`), where the statement's outcome is unknown and a blind retry is unsafe.

## Cause-chain walk

`_retry_handler` does not inspect only the raised exception — it walks the
`__cause__`/`__context__` chain (following `__cause__` first, then
`__context__`), guarding against cycles with a `seen` set of `id()`s, and
returns `True` as soon as any link is a retriable `DBAPIError`.

The walk matters because the `DBAPIError` is often re-raised inside another
exception. For example advanced-alchemy's `wrap_sqlalchemy_exception()` surfaces
it as `RepositoryError`/`IntegrityError` with the real `DBAPIError` hanging off
`__cause__`; the walk lets the retry still fire. Both retry and give-up paths
emit a debug log.

## Related

- [settings.md](settings.md) — where the default attempt count comes from.
- [retriable.md](retriable.md) — the retriable-error predicate: error taxonomy,
cause-chain walk, and cycle guard.
25 changes: 25 additions & 0 deletions db_retry/retriable.py
@@ -0,0 +1,25 @@
import asyncpg
from sqlalchemy.exc import DBAPIError


RETRIABLE_ASYNCPG_ERRORS = (asyncpg.SerializationError, asyncpg.PostgresConnectionError)


def _is_retriable_link(exception: BaseException) -> bool:
    return (
        isinstance(exception, DBAPIError)
        and exception.orig is not None
        and isinstance(exception.orig.__cause__, RETRIABLE_ASYNCPG_ERRORS)
    )


def is_retriable(exception: BaseException) -> bool:
    """Walk __cause__/__context__; True if any link is a retriable DBAPIError."""
    current: BaseException | None = exception
    seen: set[int] = set()
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if _is_retriable_link(current):
            return True
        current = current.__cause__ or current.__context__
    return False
27 changes: 6 additions & 21 deletions db_retry/retry.py
@@ -2,34 +2,19 @@
import logging
import typing

import asyncpg
import tenacity
from sqlalchemy.exc import DBAPIError

from db_retry import settings
from db_retry.retriable import is_retriable


logger = logging.getLogger(__name__)


def _is_retriable_dbapi_error(exception: BaseException) -> bool:
    return (
        isinstance(exception, DBAPIError)
        and exception.orig is not None
        and isinstance(exception.orig.__cause__, (asyncpg.SerializationError, asyncpg.PostgresConnectionError))
    )


def _retry_handler(exception: BaseException) -> bool:
    current: BaseException | None = exception
    seen: set[int] = set()
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if _is_retriable_dbapi_error(current):
            logger.debug("postgres_retry, retrying")
            return True
        current = current.__cause__ or current.__context__

def _log_and_decide(exception: BaseException) -> bool:
    if is_retriable(exception):
        logger.debug("postgres_retry, retrying")
        return True
    logger.debug("postgres_retry, giving up on retry")
    return False

@@ -57,7 +42,7 @@ async def wrapped_method(*args: P.args, **kwargs: P.kwargs) -> T:
retryer = tenacity.AsyncRetrying(
    stop=tenacity.stop_after_attempt(retries if retries is not None else settings.get_retries_number()),
    wait=tenacity.wait_exponential_jitter(),
    retry=tenacity.retry_if_exception(_retry_handler),
    retry=tenacity.retry_if_exception(_log_and_decide),
    reraise=True,
    before=tenacity.before_log(logger, logging.DEBUG),
)
135 changes: 135 additions & 0 deletions planning/changes/2026-06-26.01-retriable-error-seam/design.md
@@ -0,0 +1,135 @@
---
summary: Extracted the retriable-error predicate into a pure is_retriable(exc) -> bool in db_retry/retriable.py, enabling in-memory classification tests; postgres_retry now consumes it via _log_and_decide.
---

# Design: Give the retriable-error predicate its own seam

## Summary

The decision "is this exception worth retrying?" is the deepest logic in the
package — it unwraps a SQLAlchemy `DBAPIError`, classifies the underlying
asyncpg error, and walks the `__cause__`/`__context__` chain to find a retriable
link even when re-wrapped. Today it lives in two private functions in
`retry.py` (`_is_retriable_dbapi_error`, `_retry_handler`) reachable only
through `@postgres_retry`, so its only test surface is a live Postgres that
raises a chosen SQLSTATE. This change moves the seam: a pure
`is_retriable(exc) -> bool` in a new `db_retry/retriable.py`, tested directly
with in-memory exception chains built from real types. `postgres_retry` becomes
a thin consumer of the seam; the integration suite shrinks to proving the
wiring.

## Motivation

- The retriability logic is deep but sits behind the wrong seam — **the
interface is not the test surface**. The classification matrix in
`tests/test_retry.py` (which SQLSTATEs retry, re-wrapped chains, attempt
counts) round-trips a real database via a `CREATE FUNCTION raise_error()`
stored proc to exercise a pure predicate.
- The SQLSTATE/asyncpg taxonomy ("serialization failures and lost connections
are transient; `40002` is not") is an inline `isinstance` tuple buried in a
boolean expression — no name, no single place to change.
- Verified the in-memory approach works: a real `sqlalchemy.exc.DBAPIError`
whose `.orig.__cause__` is a real `asyncpg.SerializationError`, optionally
re-wrapped in another exception via `__cause__`, drives the current predicate
to the correct verdict with no database.

## Non-goals

- No behaviour change to what counts as retriable — same SQLSTATE classes in,
same out.
- No change to the public surface: `is_retriable` is an **internal seam**, not
added to `__init__.py`'s `__all__`. The package keeps its five public symbols.
- No change to the retry loop's logging output — the two debug lines are
preserved, only relocated.
- Not modelling the taxonomy as richer data (SQLSTATE codes / rationale as a
structure) — the *why* lives in `architecture/retriable.md`; the *what* is a
named constant holding a plain tuple of asyncpg classes.

## Design

### 1. New module `db_retry/retriable.py`

A named constant and one pure public function (plus a private per-link helper):

```python
RETRIABLE_ASYNCPG_ERRORS = (asyncpg.SerializationError, asyncpg.PostgresConnectionError)

def is_retriable(exception: BaseException) -> bool:
    """Walk __cause__/__context__; True if any link is a retriable DBAPIError."""
    current: BaseException | None = exception
    seen: set[int] = set()
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if _is_retriable_link(current):
            return True
        current = current.__cause__ or current.__context__
    return False
```

`is_retriable` is **pure** — no logging, no side effects. `_is_retriable_link`
is the current `_is_retriable_dbapi_error` body, checking `DBAPIError` →
`.orig is not None` → `isinstance(.orig.__cause__, RETRIABLE_ASYNCPG_ERRORS)`.
The cycle guard (`seen` set of `id()`s) and the cause-first/context-second walk
order are preserved exactly.

### 2. `retry.py` consumes the seam

`postgres_retry` imports `is_retriable` and wraps it in a thin local predicate
that carries the relocated logging:

```python
def _log_and_decide(exception: BaseException) -> bool:
    if is_retriable(exception):
        logger.debug("postgres_retry, retrying")
        return True
    logger.debug("postgres_retry, giving up on retry")
    return False

# Passed as a keyword argument in the tenacity.AsyncRetrying(...) call:
retry=tenacity.retry_if_exception(_log_and_decide)
```

`_is_retriable_dbapi_error` and `_retry_handler` are deleted from `retry.py`.
The `retriable.py` module keeps no logger.
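
The split between a pure predicate and a logging wrapper can be checked with the standard library alone. This is a sketch under stated assumptions: the `is_retriable` below is a hypothetical stand-in that treats `TimeoutError` as retriable, and `ListHandler` and the logger name are illustrative, not part of the package.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that collects formatted messages in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("postgres_retry_sketch")  # illustrative name
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def is_retriable(exception: BaseException) -> bool:
    # Stand-in predicate: pure, no logging. The real one walks the cause chain.
    return isinstance(exception, TimeoutError)


def _log_and_decide(exception: BaseException) -> bool:
    # Same shape as the wrapper above: log the verdict, return it unchanged.
    if is_retriable(exception):
        logger.debug("postgres_retry, retrying")
        return True
    logger.debug("postgres_retry, giving up on retry")
    return False


decisions = [_log_and_decide(TimeoutError()), _log_and_decide(ValueError())]
```

Each call emits exactly one debug line, so a test of the wrapper can assert on `handler.messages` while tests of the predicate need no logging setup at all.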

### 3. Architecture promotion

- **New** `architecture/retriable.md` — documents `is_retriable`, the
`RETRIABLE_ASYNCPG_ERRORS` taxonomy, the cause-chain walk + cycle guard, and
**why `40003` (`StatementCompletionUnknownError`) is excluded** (outcome
unknown, blind retry unsafe). The cause-chain prose moves here from
`retry.md`.
- **Edit** `architecture/retry.md` — drop the "What counts as retriable" /
"Cause-chain walk" sections; leave a one-line pointer to `retriable.md` and
note the predicate is wired in via `_log_and_decide`.
- **Edit** `architecture/README.md` — add the `retriable.md` capability row.
- **New** `architecture/glossary.md` — authored lazily (first term):
**Retriable error**.

## Testing

- **New** `tests/test_retriable.py` — in-memory, no database. A small helper
builds a `DBAPIError` whose `.orig.__cause__` is a given asyncpg error.
Parametrized matrix covering: `SerializationError` (40001) → retriable;
`PostgresConnectionError` and a subclass (08000/08003) → retriable; a
non-retriable `PostgresError` → not; a re-wrapped chain (advanced-alchemy
style, real `DBAPIError` hung off a `RepositoryError.__cause__`) → retriable;
a `__context__`-only link → retriable; a `__cause__` cycle → terminates and
returns the right verdict; a bare non-DBAPI exception → not.
- **Edit** `tests/test_retry.py` — keep exactly two integration cases (one
retriable `40001` → retries, asserts attempt count; one non-retriable
`40002` → no retry) proving the decorator wires the predicate into tenacity
and `reraise=True` surfaces the original error. The exhaustive matrix and the
advanced-alchemy case move to `test_retriable.py`.
- `just lint-ci` passes; `just test` (Docker Postgres) green for the trimmed
integration cases.

## Risk

- **Low. Behaviour-preserving refactor.** The predicate body, walk order, and
cycle guard are copied verbatim; the named constant is the same tuple. Risk is
an accidental semantic drift during the move — mitigated by writing
`test_retriable.py` first (TDD) against the documented matrix, and by keeping
the two integration cases as a live-Postgres backstop on the wiring.
- **Logging relocation** could drop or duplicate a debug line — mitigated by
`_log_and_decide` reproducing both lines verbatim and `retriable.py` carrying
no logger.