periodic_data_archiving deletes accumulated backlog in a single unbatched transaction (first-run trap) #1125
Unanswered
mgradalska asked this question in Q&A
Replies: 1 comment
I'd like to work on this. I can see _bulk_delete_tracing_data already uses the correct BATCH_SIZE = 1000 pattern, so a fix is needed for deletions in the remaining 5 tables. Before I open a PR, I wanted to confirm that this still needs to be worked on and that there are no constraints I should be aware of.
What's happening
The periodic worker task periodic_data_archiving deletes stale rows from six tables (APILog, Event, TracingRecord, Tracking, Shipment, Order). For five of those six tables, the deletion is implemented as an unbounded queryset.delete() in a single transaction.

Steady-state operation can also hit the same pattern at sufficiently high traffic - a busy APILog table can accumulate enough rows in a single retention window to overflow even a normal daily archive. The first-run-after-deployment case is just the most dramatic manifestation: the task tries to delete the entire accumulated backlog (potentially millions of rows) in one transaction.

The resulting symptom is memory exhaustion: the worker process loads the full set of IDs and risks OOM and pod restarts before the deletion completes.

The archiving code
modules/events/karrio/server/events/task_definitions/base/archiving.py.

The APILog deletion is representative of the unbatched path (line 49): api_log_data is core.APILog.objects.filter(requested_at__lt=log_retention) - i.e. every API log older than the retention window, unbounded. The same shape is used for Event directly, and the helpers _bulk_delete_tracking_data, _bulk_delete_shipment_data, and _bulk_delete_order_data all load the full ID list into memory before calling .delete() on it in one go.

The one exception is _bulk_delete_tracing_data, which iterates with a BATCH_SIZE = 1000 loop and deletes in chunks. That helper is a worked example of the correct pattern - it just hasn't been propagated to the other five tables.

Why this is a problem
- queryset.delete() on a multi-million-row queryset runs as a single long-lived transaction, holding locks for its entire duration. On a live database it competes with concurrent traffic the whole time.
- Loading the full ID list into memory (as the _bulk_delete_* helpers other than tracing do) scales linearly with backlog size. On large tables this is unbounded memory growth in the worker process.
- APILog accumulates one row per API request, so a daily archive batch on a busy instance can still be unsafe in a single transaction. The bug isn't a one-time bootstrap concern.
- _bulk_delete_tracing_data does batched deletion correctly. The fix isn't a new design - it's propagating an existing one.

Suggested direction
Apply the _bulk_delete_tracing_data batching pattern to the other five deletion paths. Each batch should be deleted in its own transaction (or at least its own SQL statement) so locks and WAL accumulation are bounded per batch rather than per backlog.

Environment
Karrio version: 2026.1.31.
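
For illustration, here is a minimal sketch of the "one bounded batch per transaction" shape described under Suggested direction. It uses Python's standard-library sqlite3 as a stand-in for the Django ORM so that it runs on its own. The delete_in_batches helper, the table name passed to it, and the id / requested_at columns are assumptions made for this example and are not Karrio's code; only BATCH_SIZE = 1000 is taken from the issue.

```python
import sqlite3

BATCH_SIZE = 1000  # batch size the issue cites for _bulk_delete_tracing_data


def delete_in_batches(conn, table, cutoff, batch_size=BATCH_SIZE):
    """Delete rows with requested_at < cutoff, at most batch_size per transaction.

    Hypothetical helper: `table` must have `id` and `requested_at` columns.
    The table name is interpolated into the SQL text, so it must come from
    trusted code, never from user input.
    """
    total = 0
    while True:
        # `with conn` commits on success and rolls back on an exception,
        # so every batch is its own short transaction.
        with conn:
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE id IN "
                f"(SELECT id FROM {table} WHERE requested_at < ? LIMIT ?)",
                (cutoff, batch_size),
            )
        if cursor.rowcount == 0:
            # Nothing older than the cutoff is left.
            return total
        total += cursor.rowcount
```

In a Django codebase the same shape would typically fetch one batch of primary keys with values_list("pk", flat=True)[:BATCH_SIZE], delete them with filter(pk__in=ids).delete(), and wrap each batch in transaction.atomic(), since Django does not allow .delete() on a sliced queryset. Whether that matches what _bulk_delete_tracing_data does line for line has not been checked here.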