## Services Engineering Reading List

* A reading list for services engineering, with a focus on cloud infrastructure services
* Most content is on applied distributed systems and systems operations
* Please send suggestions to [@mmcgrana](https://twitter.com/mmcgrana) or [open an issue](https://github.com/mmcgrana/services-engineering/issues)

#### Papers

* [Making Reliable Distributed Systems in the Presence of Software Errors](http://www.erlang.org/download/armstrong_thesis_2003.pdf) (Armstrong)
* [Highly Available Transactions: Virtues and Limitations](http://www.bailis.org/papers/hat-vldb2014.pdf) (Bailis et al.)
* [The Incident Command System](http://www.high-reliability.org/files/The_Incident_Command_System.pdf) (Bigley and Roberts)
* [The Chubby Lock Service for Loosely Coupled Distributed Systems](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf) (Burrows)
* [Bigtable: a Distributed Storage System for Structured Data](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/chang06bigtable.pdf) (Chang et al.)
* [Spanner: Google’s Globally-Distributed Database](http://research.google.com/archive/spanner-osdi2012.pdf) (Corbett et al.)
* [Dynamo: Amazon’s Highly Available Key-Value Store](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) (DeCandia et al.)
* [MapReduce: Simplified Data Processing on Large Clusters](http://research.google.com/archive/mapreduce-osdi04.pdf) (Dean and Ghemawat)
* [The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf) (Ghemawat et al.)
* [On Designing and Deploying Internet Scale Services](http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf) (Hamilton)
* [Kafka: A Distributed Messaging System for Log Processing](http://research.microsoft.com/en-us/UM/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf) (Kreps et al.)
* [The Unified Logging Infrastructure for Data Analytics at Twitter](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) (Lee et al.)
* [Scaling Big Data Mining Infrastructure: The Twitter Experience](http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf) (Lin and Ryaboy)
* [Dremel: Interactive Analysis of Web-Scale Datasets](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf) (Melnik et al.)
* [Out of the Tar Pit](http://shaffner.us/cs/papers/tarpit.pdf) (Moseley and Marks)
* [In Search of an Understandable Consensus Algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf) (Ongaro and Ousterhout)
* [Failure Trends in a Large Disk Drive Population](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf) (Pinheiro et al.)
* [Fallacies of Distributed Computing Explained](http://www.rgoarchitects.com/Files/fallacies.pdf) (Rotem-Gal-Oz)
* [F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business](http://research.google.com/pubs/archive/38125.pdf) (Shute et al.)
* [Dapper, A Large Scale Distributed Systems Tracing Infrastructure](http://research.google.com/pubs/archive/36356.pdf) (Sigelman et al.)
* [Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) (Zaharia et al.)
* [The Human Side of Postmortems](https://docs.google.com/file/d/0Byl4UKRYLErDVlJMNDNjaThiR2M/edit) (Zwieback)
* [Crew Resource Management: a Positive Change for the Fire Service](http://www.iaff.org/06news/NearMissKit/6.%20Crew%20Resource%20Management/CRM.pdf)

#### Posts

* [Resilience Engineering: Part I](http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/), [Part II](http://www.kitchensoap.com/2012/06/18/resilience-engineering-part-ii-lenses/) (Allspaw)
* [Systems Engineering: a Great Definition](http://www.kitchensoap.com/2011/07/18/systems-engineering-great-definition/) (Allspaw)
* [Some Rules for Engineering and Operations](http://blog.b3k.us/2012/01/24/some-rules.html) (Black)
* [Service Level Disagreements Part I](http://blog.b3k.us/2009/07/15/service-level-disagreements.html), [Part II](http://blog.b3k.us/2009/07/16/service-level-disagreements-2.html) (Black)
* [My Philosophy on Alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.whsaboyw21nk) (Ewaschuk)
* [You Can’t Sacrifice Partition Tolerance](http://codahale.com/you-cant-sacrifice-partition-tolerance/) (Hale)
* [Customer Trust](http://perspectives.mvdirona.com/2013/01/15/CustomerTrust.aspx) (Hamilton)
* [Observations on Errors, Corrections, & Trust of Dependent Systems](http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx) (Hamilton)
* [Life Beyond Distributed Transactions: An Apostate’s Opinion](http://cs.brown.edu/courses/cs227/archives/2012/papers/weaker/cidr07p15.pdf) (Helland)
* [Notes on Distributed Systems for Young Bloods](http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/) (Hodges)
* [The Network is Reliable](http://aphyr.com/posts/288-the-network-is-reliable) (Kingsbury)
* [The Trouble with Timestamps](http://aphyr.com/posts/299-the-trouble-with-timestamps) (Kingsbury)
* [Call Me Maybe: Final Thoughts](http://aphyr.com/posts/286-call-me-maybe-final-thoughts) (Kingsbury)
* [Getting Real About Distributed Systems Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability) (Kreps)
* [The Log: What every software engineer should know about real-time data's unifying abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying) (Kreps)
* [Incident Response at Heroku](https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku) (McGranaghan)
* [On HTTP Load Testing](http://www.mnot.net/blog/2011/05/18/http_benchmark_rules) (Nottingham)
* [Observability at Twitter](https://blog.twitter.com/2013/observability-at-twitter) (Watson)
* [Stevey’s Google Platforms Rant](https://plus.google.com/112678702228711889851/posts/eVeouesvaVX) (Yegge)

#### Presentations

* [Design, Lessons, and Advice from Building Distributed Systems at Google](http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf) (Dean)
* [Service Design Best Practices](http://www.mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20090226.pdf) (Hamilton)

#### Books

* [The Field Guide To Understanding Human Error](http://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265) (Dekker)
* [Agile Retrospectives: Making Good Teams Great](http://www.amazon.com/Agile-Retrospectives-Making-Teams-Great/dp/0977616649) (Derby et al.)
* [Better: A Surgeon’s Notes on Performance](http://www.amazon.com/dp/0312427654) (Gawande)
* [The Checklist Manifesto: How to Get Things Right](http://www.amazon.com/The-Checklist-Manifesto-ebook/dp/B0030V0PEW) (Gawande)
* [High Performance Browser Networking](http://chimera.labs.oreilly.com/books/1230000000545/index.html) (Grigorik)
* [Resilience Engineering in Practice](http://www.amazon.com/Resilience-Engineering-Practice-Ashgate-Studies/dp/1409410358/) (Hollnagel et al.)
* [Effective Monitoring and Alerting](http://www.amazon.com/Effective-Monitoring-Alerting-For-Operations/dp/1449333524) (Ligus)
* [Release It!: Design and Deploy Production-Ready Software](http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213) (Nygard)
* [The Challenger Launch Decision](http://www.amazon.com/The-Challenger-Launch-Decision-Technology/dp/0226851761) (Vaughan)
* [Managing the Unexpected](http://www.amazon.com/gp/product/B004IK9U4U) (Weick and Sutcliffe)

#### Research Groups

* [Berkeley AMP Lab](https://amplab.cs.berkeley.edu/)
* [Berkeley Database Group](http://db.cs.berkeley.edu/w/)
* [Google Research](http://research.google.com/)
* [Microsoft Systems Research](http://research.microsoft.com/en-US/groups/sr/default.aspx)
* [Twitter Research](https://engineering.twitter.com/research)

#### Conferences

* [Monitorama](http://monitorama.com/)
* [Ricon](http://ricon.io/)
* [Surge](http://surge.omniti.com/)
* [Velocity](http://velocityconf.com/)