Posts
- Resilience Engineering: Part I, Part II (Allspaw)
- Systems Engineering: a Great Definition (Allspaw)
- Chaos Monkey Released Into The Wild (Bennett and Tseitlin)
- Some Rules for Engineering and Operations (Black)
- Service Level Disagreements Part I, Part II (Black)
- My Philosophy on Alerting (Ewaschuk)
- You Can’t Sacrifice Partition Tolerance (Hale)
- Customer Trust (Hamilton)
- Observations on Errors, Corrections, & Trust of Dependent Systems (Hamilton)
- Game Day Exercises at Stripe: Learning from kill -9 (Hedlund)
- Life Beyond Distributed Transactions: An Apostate’s Opinion (Helland)
- Notes on Distributed Systems for Young Bloods (Hodges)
- The Network is Reliable (Kingsbury)
- The Trouble with Clocks (Kingsbury)
- Call Me Maybe: Final Thoughts (Kingsbury)
- Getting Real About Distributed Systems Reliability (Kreps)
- The Log: What every software engineer should know about real-time data's unifying abstraction (Kreps)
- Incident Response at Heroku (McGranaghan)
- On HTTP Load Testing (Nottingham)
- Observability at Twitter (Watson)
- Stevey’s Google Platforms Rant (Yegge)
- Incuriosity Will Kill Your Infrastructure (?)