Paper: Highly Available Transactions: Virtues and Limitations
Highly Available Transactions: Virtues and Limitations (Bailis et al.). A very recent but excellent paper.
Paper: Crew Resource Management
Crew Resource Management: a Positive Change for the Fire Service. Best article-length resource I've been able to find so far; it can probably replace the current Wikipedia link.
Book: The Field Guide to Understanding Human Error
Recommended in this reading list. Maybe this turns out to be a better fit than #4.
Some good URLs on this that I know of: Kafka: A Distributed Messaging System for Log Processing (Kreps et al.); The Log: What every software engineer should know about real-time data's unifying ...
The Dynamo paper is easy to understand and a good intro to distributed, eventually consistent databases. http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
Paper: Automatic Management of Partitioned, Replicated Search Services
Automatic Management of Partitioned, Replicated Search Services (Leibert et al.). A nice, short, practical paper on managing a replicated search service in production.
Post: How to lose $172,222 a second for 45 minutes
An educational article and post-mortem document: http://pythonsweetness.tumblr.com/post/64740079543/how-to-lose-172-222-a-second-for-45-minutes