How I Learned to Stop Worrying and Love atomic()
Banking Blunders and Concurrency Challenges
How to deal with concurrency, money, and log-structured data in the Django ORM
PyGotham Talk Page
(sorry for the overused "how I learned..." title)
You can view the slides here:
What do Superman III, Hackers, and Office Space all have in common? Find out in this talk, along with some concurrency, database integrity, and financial data safety fundamentals. This is a technical talk covering the core principles of database integrity and data safety; it assumes familiarity with the Django ORM.
Did you know every Django app already behaves like a distributed system, even when it’s running on only one server? In this talk I’ll go over some of the distributed systems & database fundamentals that you’ll need to understand when building a Python project that handles sensitive data. We’ll focus on intermediate and advanced usage of the Django ORM, but many of the concepts apply equally well to SQLAlchemy and other Python ORMs.
We’ll go over:
- how to use immutable append-only logs to order events across your system (log-structured storage)
- the importance of accurate timestamps, and why that's difficult in distributed systems
- the meaning of transaction isolation levels
- how and when to use locking (pessimistic concurrency)
- how and when to use atomic compare-and-swaps (optimistic concurrency)
- how and when to use a hybrid optimistic/pessimistic approach like MVCC
- type safety for currencies, and math operations to avoid
- new distributed-SQL databases like Spanner, TiDB, and CockroachDB
- transaction lifecycles when doing async processing with django-channels
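As a taste of the optimistic-concurrency item above, here's a minimal compare-and-swap sketch using the standard-library sqlite3 module (the table, column names, and `withdraw_cas` helper are made up for illustration). In the Django ORM the same pattern is typically a filtered `.update()` whose return value tells you whether your write actually landed:

```python
import sqlite3

# Toy accounts table with a version counter for optimistic concurrency.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 0)")

def withdraw_cas(conn, account_id, amount):
    """Withdraw `amount`, but only if nobody touched the row since we read it."""
    balance, version = conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    if balance < amount:
        return False
    # The "compare" is the WHERE version = ? clause; the "swap" only happens
    # if it still matches. rowcount == 0 means we lost the race and must retry.
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = ? WHERE id = ? AND version = ?",
        (balance - amount, version + 1, account_id, version),
    )
    return cur.rowcount == 1

print(withdraw_cas(conn, 1, 30))  # True: the version matched, balance is now 70
```

The rough Django equivalent would be `Account.objects.filter(pk=pk, version=expected).update(balance=new_balance, version=expected + 1)`, checking whether the update affected one row.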
We spent the last two years building an online poker engine based on Django + channels, and we have plenty of stories about our failures and discoveries to share along the way. Come learn about all the ways it’s possible to screw up when handling sensitive data, and how to avoid them!
One thing I didn't cover in the talk is that SQL can do something called "gap locking": if you have an index on a given column and you perform a .select_for_update() with a filter, it won't just lock the rows that match the filter, it will also prevent any new rows matching the filter from being inserted while the lock is held. This lets you effectively lock append-only tables without needing to lock the entire table.
(Thanks to Sam Kimbrel for telling me about this feature in the hall after the talk)
from django.contrib.auth.models import User
from django.db import models, transaction
from django.db.models import Q

class BalanceTransfer(models.Model):
    src = models.ForeignKey(User, on_delete=models.PROTECT, related_name="transfers_out")
    dst = models.ForeignKey(User, on_delete=models.PROTECT, related_name="transfers_in")
    amt = models.DecimalField(max_digits=20, decimal_places=2)
    timestamp = models.DateTimeField(auto_now_add=True)

...

with transaction.atomic():
    lock = BalanceTransfer.objects.select_for_update().filter(Q(src=user) | Q(dst=user))
    ...
    # no new rows matching Q(src=user) | Q(dst=user) can be added during this time,
    # meaning we can safely do some logic without the user's balance being changed
    # by new transfers added to the table
    withdraw(user, 5000)
Gap locking only works on columns that are indexed (all the columns in the filter you're trying to lock must be covered together by an index). By default, all Django ForeignKey columns are indexed, so you only need to worry about that if you want to gap lock a field like