periodic_data_archiving deletes accumulated backlog in a single unbatched transaction (first-run trap) #1125

mgradalska · 2026-06-11T07:54:14Z


What's happening

The periodic worker task periodic_data_archiving deletes stale rows from six tables (APILog, Event, TracingRecord, Tracking, Shipment, Order). For five of those six tables, the deletion is implemented as an unbounded queryset.delete() in a single transaction.

The first run after deployment is the most dramatic manifestation: the task tries to delete the entire accumulated backlog (potentially millions of rows) in one transaction. Steady-state operation can hit the same pattern at sufficiently high traffic, too: a busy APILog table can accumulate enough rows between two runs that even a routine daily archive becomes an unsafe single delete.

The resulting symptom is memory exhaustion: the worker process loads the full set of IDs and risks OOM and pod restarts before the deletion completes.

The archiving code

modules/events/karrio/server/events/task_definitions/base/archiving.py.

The APILog deletion is representative of the unbatched path (line 49):

```python
api_logs_deleted = utils.failsafe(lambda: api_log_data.delete()[0]) or 0
```

api_log_data is core.APILog.objects.filter(requested_at__lt=log_retention) - i.e. every API log older than the retention window, unbounded. The same shape is used for Event directly, and the helpers _bulk_delete_tracking_data, _bulk_delete_shipment_data, and _bulk_delete_order_data all load the full ID list into memory before calling .delete() on it in one go.

The one exception is _bulk_delete_tracing_data, which iterates with a BATCH_SIZE = 1000 loop and deletes in chunks. That helper is a worked example of the correct pattern - it just hasn't been propagated to the other five tables.
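The issue does not quote the body of _bulk_delete_tracing_data, so the following is not that helper. It is a minimal, standalone sketch of the same batching idea, using the standard-library sqlite3 module and a hypothetical api_log table (id, requested_at) in place of the Django models. Each loop iteration deletes at most batch_size rows inside its own transaction and stops once a batch comes back short, so no transaction ever covers more than one batch.

```python
import sqlite3

BATCH_SIZE = 1000  # the batch size the issue quotes for _bulk_delete_tracing_data


def delete_expired_api_logs(
    conn: sqlite3.Connection, cutoff: str, batch_size: int = BATCH_SIZE
) -> int:
    """Delete rows of the hypothetical api_log table with requested_at < cutoff.

    Each loop iteration deletes at most `batch_size` rows in its own
    transaction, so transaction size is bounded by the batch, not the backlog.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    total = 0
    while True:
        # `with conn:` commits when the block exits normally and rolls back on error.
        with conn:
            cur = conn.execute(
                "DELETE FROM api_log WHERE id IN ("
                "  SELECT id FROM api_log WHERE requested_at < ? ORDER BY id LIMIT ?"
                ")",
                (cutoff, batch_size),
            )
        total += cur.rowcount
        # A short batch means nothing older than the cutoff is left.
        if cur.rowcount < batch_size:
            return total


# Demo: 2,500 expired rows and 10 recent ones; ISO dates compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_log (id INTEGER PRIMARY KEY, requested_at TEXT NOT NULL)")
with conn:
    conn.executemany(
        "INSERT INTO api_log (requested_at) VALUES (?)",
        [("2025-06-01",)] * 2500 + [("2026-06-01",)] * 10,
    )
deleted = delete_expired_api_logs(conn, "2026-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM api_log").fetchone()[0]
print(deleted, remaining)  # 2500 10
```

Pushing the batch selection into a subquery means no ID list is built in Python at all. MySQL does not accept this form (it rejects LIMIT inside an IN subquery), so a portable variant would first select one batch of IDs and then delete exactly those.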

Why this is a problem

- Unbounded transaction size. queryset.delete() on a multi-million-row queryset runs as a single long-lived transaction, holding locks for its entire duration. On a live database it competes with concurrent traffic the whole time.
- No memory ceiling. Loading every matching PK into a single list (as the _bulk_delete_* helpers other than tracing do) scales linearly with backlog size. On large tables this is unbounded memory growth in the worker process.
- Scales with traffic, not just backlog. First-run-after-deployment is the most acute case (the entire accumulated history goes in one transaction), but high-traffic deployments hit the same pattern in steady state: APILog accumulates one row per API request, so a daily archive batch on a busy instance can still be unsafe in a single transaction. The bug isn't a one-time bootstrap concern.
- The right pattern already exists in the same file. _bulk_delete_tracing_data does batched deletion correctly. The fix isn't a new design; it's propagating an existing one.
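To put the "scales with traffic" point in numbers, the sketch below estimates how many rows one archiving run must delete, using the issue's premise of one APILog row per API request and assuming the task runs daily. The request rate, history length, and retention period are hypothetical inputs, not measurements from any Karrio deployment, and traffic is assumed constant.

```python
SECONDS_PER_DAY = 86_400
BATCH_SIZE = 1000  # the batch size the issue quotes for _bulk_delete_tracing_data


def rows_per_run(requests_per_second: int, days_covered: int) -> int:
    """Rows one run must delete, assuming one APILog row per request at a constant rate."""
    return requests_per_second * SECONDS_PER_DAY * days_covered


def batches(rows: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of bounded delete batches needed (integer ceiling division)."""
    return -(-rows // batch_size)


# Hypothetical instance: 20 requests/s, 180 days of history, 30-day retention.
RATE, HISTORY_DAYS, RETENTION_DAYS = 20, 180, 30

first_run = rows_per_run(RATE, HISTORY_DAYS - RETENTION_DAYS)  # whole backlog at once
steady_state = rows_per_run(RATE, 1)  # rows that expired since yesterday's run

print(f"first run:    {first_run:,} rows -> {batches(first_run):,} batches")
print(f"steady state: {steady_state:,} rows -> {batches(steady_state):,} batches")
# first run:    259,200,000 rows -> 259,200 batches
# steady state: 1,728,000 rows -> 1,728 batches
```

Under these assumptions, even the steady-state daily run is a 1.7-million-row delete when done in one statement, while batching caps each transaction at 1,000 rows regardless of which case applies.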

Suggested direction

Apply the _bulk_delete_tracing_data batching pattern to the other five deletion paths. Each batch should be deleted in its own transaction (or at least its own SQL statement) so locks and WAL accumulation are bounded per batch rather than per backlog.
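One way to propagate the pattern to all five paths without duplicating the loop is a single helper that takes the per-table pieces as arguments. The sketch below is framework-agnostic and hypothetical: delete_in_batches and its parameters are not names from Karrio. Assuming it were wired to Django, fetch_ids could be `lambda n: list(queryset.values_list("pk", flat=True)[:n])`, delete_ids could be `lambda ids: Model.objects.filter(pk__in=ids).delete()[0]`, and atomic could be `django.db.transaction.atomic`, which would give each batch its own transaction. Django's delete count includes cascaded rows in related tables, which is why the helper sums the returned counts instead of len(ids).

```python
from contextlib import nullcontext
from typing import Callable, ContextManager, Sequence


def delete_in_batches(
    fetch_ids: Callable[[int], Sequence[int]],
    delete_ids: Callable[[Sequence[int]], int],
    batch_size: int = 1000,
    atomic: Callable[[], ContextManager] = nullcontext,
) -> int:
    """Delete matching rows in bounded batches, one transaction per batch.

    fetch_ids(n) returns at most n primary keys of rows that still match the
    archiving filter; delete_ids(ids) deletes exactly those rows and returns the
    number deleted; atomic() opens the per-batch transaction.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    total = 0
    while True:
        # At most batch_size IDs are held at once, provided fetch_ids honours n.
        ids = list(fetch_ids(batch_size))
        if not ids:
            return total
        with atomic():
            deleted = delete_ids(ids)
        if deleted == 0:
            # Guard against looping forever if the fetched rows cannot be deleted.
            raise RuntimeError("batch deleted no rows; stopping")
        total += deleted
        if len(ids) < batch_size:
            return total  # short batch: the backlog is exhausted


# Usage with an in-memory stand-in for one table (IDs 1..2500, all expired).
stale = set(range(1, 2501))
batch_sizes = []


def fetch_stale(n):
    return sorted(stale)[:n]


def delete_stale(ids):
    batch_sizes.append(len(ids))
    stale.difference_update(ids)
    return len(ids)


deleted = delete_in_batches(fetch_stale, delete_stale, batch_size=1000)
print(deleted, batch_sizes)  # 2500 [1000, 1000, 500]
```

Keeping the loop in one place means the five call sites differ only in the queryset they pass, so the batch size and transaction boundary cannot drift between tables.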

Environment

Karrio version: 2026.1.31.

g-builder-0 · 2026-06-23T21:28:40Z


I'd like to work on this. I can see that _bulk_delete_tracing_data already uses the correct BATCH_SIZE = 1000 pattern, so the fix is needed for the deletions in the remaining five tables. Before I open a PR, I wanted to confirm that this still needs to be worked on and whether there are any constraints I should be aware of.
