## Services Engineering Reading List

* A reading list for services engineering, with a focus on cloud infrastructure services
* Most content is on applied distributed systems and systems operations
* WIP: please send suggestions to [@mmcgrana](https://twitter.com/mmcgrana) or [open a GitHub issue](https://github.com/mmcgrana/services-engineering/issues)

#### Papers

* [Making Reliable Distributed Systems in the Presence of Software Errors](http://www.erlang.org/download/armstrong_thesis_2003.pdf) (Armstrong)
* [Highly Available Transactions: Virtues and Limitations](http://www.bailis.org/papers/hat-vldb2014.pdf) (Bailis et al.)
* [The Chubby Lock Service for Loosely Coupled Distributed Systems](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf) (Burrows)
* [Bigtable: a Distributed Storage System for Structured Data](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/chang06bigtable.pdf) (Chang et al.)
* [Spanner: Google’s Globally-Distributed Database](http://research.google.com/archive/spanner-osdi2012.pdf) (Corbett et al.)
* [Dynamo: Amazon’s Highly Available Key-Value Store](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) (DeCandia et al.)
* [MapReduce: Simplified Data Processing on Large Clusters](http://research.google.com/archive/mapreduce-osdi04.pdf) (Dean and Ghemawat)
* [The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf) (Ghemawat et al.)
* [On Designing and Deploying Internet Scale Services](http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf) (Hamilton)
* [Dremel: Interactive Analysis of Web-Scale Datasets](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf) (Melnik et al.)
* [Out of the Tar Pit](http://shaffner.us/cs/papers/tarpit.pdf) (Moseley and Marks)
* [In Search of an Understandable Consensus Algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf) (Ongaro and Ousterhout)
* [Failure Trends in a Large Disk Drive Population](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf) (Pinheiro et al.)
* [Fallacies of Distributed Computing Explained](http://www.rgoarchitects.com/Files/fallacies.pdf) (Rotem-Gal-Oz)
* [F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business](http://research.google.com/pubs/archive/38125.pdf) (Shute et al.)
* [Dapper, A Large Scale Distributed Systems Tracing Infrastructure](http://research.google.com/pubs/archive/36356.pdf) (Sigelman et al.)
* [Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) (Zaharia et al.)
* [Crew Resource Management: a Positive Change for the Fire Service](http://www.iaff.org/06news/NearMissKit/6.%20Crew%20Resource%20Management/CRM.pdf)

#### Posts

* [Resilience Engineering: Part I](http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/), [Part II](http://www.kitchensoap.com/2012/06/18/resilience-engineering-part-ii-lenses/) (Allspaw)
* [Systems Engineering: a Great Definition](http://www.kitchensoap.com/2011/07/18/systems-engineering-great-definition/) (Allspaw)
* [Some Rules for Engineering and Operations](http://blog.b3k.us/2012/01/24/some-rules.html) (Black)
* [Service Level Disagreements Part I](http://blog.b3k.us/2009/07/15/service-level-disagreements.html), [Part II](http://blog.b3k.us/2009/07/16/service-level-disagreements-2.html) (Black)
* [Design, Lessons, and Advice from Building Distributed Systems at Google](http://odbms.org/download/dean-keynote-ladis2009.pdf) (Dean)
* [My Philosophy on Alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.whsaboyw21nk) (Ewaschuk)
* [You Can’t Sacrifice Partition Tolerance](http://codahale.com/you-cant-sacrifice-partition-tolerance/) (Hale)
* [Customer Trust](http://perspectives.mvdirona.com/2013/01/15/CustomerTrust.aspx) (Hamilton)
* [Observations on Errors, Corrections, & Trust of Dependent Systems](http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx) (Hamilton)
* [Life Beyond Distributed Transactions: An Apostate’s Opinion](http://cs.brown.edu/courses/cs227/archives/2012/papers/weaker/cidr07p15.pdf) (Helland)
* [Notes on Distributed Systems for Young Bloods](http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/) (Hodges)
* [The Network is Reliable](http://aphyr.com/posts/288-the-network-is-reliable) (Kingsbury)
* [The Trouble with Timestamps](http://aphyr.com/posts/299-the-trouble-with-timestamps) (Kingsbury)
* [Call Me Maybe: Final Thoughts](http://aphyr.com/posts/286-call-me-maybe-final-thoughts) (Kingsbury)
* [Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability) (Kreps)
* [On HTTP Load Testing](http://www.mnot.net/blog/2011/05/18/http_benchmark_rules) (Nottingham)
* [Observability at Twitter](https://blog.twitter.com/2013/observability-at-twitter) (Watson)
* [Stevey’s Google Platforms Rant](https://plus.google.com/112678702228711889851/posts/eVeouesvaVX) (Yegge)

#### Presentations

* [Service Design Best Practices](http://www.mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20090226.pdf) (Hamilton)

#### Books

* [The Field Guide To Understanding Human Error](http://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265) (Dekker)
* [Agile Retrospectives: Making Good Teams Great](http://www.amazon.com/Agile-Retrospectives-Making-Teams-Great/dp/0977616649) (Derby et al.)
* [Better: A Surgeon’s Notes on Performance](http://www.amazon.com/dp/0312427654) (Gawande)
* [The Checklist Manifesto: How to Get Things Right](http://www.amazon.com/The-Checklist-Manifesto-ebook/dp/B0030V0PEW) (Gawande)
* [High Performance Browser Networking](http://chimera.labs.oreilly.com/books/1230000000545/index.html) (Grigorik)
* [Resilience Engineering in Practice](http://www.amazon.com/Resilience-Engineering-Practice-Ashgate-Studies/dp/1409410358/) (Hollnagel et al.)
* [Effective Monitoring and Alerting](http://www.amazon.com/Effective-Monitoring-Alerting-For-Operations/dp/1449333524) (Ligus)
* [The Challenger Launch Decision](http://www.amazon.com/The-Challenger-Launch-Decision-Technology/dp/0226851761) (Vaughan)
* [Managing the Unexpected](http://www.amazon.com/gp/product/B004IK9U4U) (Weick and Sutcliffe)

#### Research Groups

* [Berkeley AMP Lab](https://amplab.cs.berkeley.edu/)
* [Berkeley Database Group](http://db.cs.berkeley.edu/w/)
* [Google Research](http://research.google.com/)
* [Microsoft Systems Research](http://research.microsoft.com/en-US/groups/sr/default.aspx)

#### Conferences

* [Ricon](http://ricon.io/)
* [Surge](http://surge.omniti.com/)
* [Velocity](http://velocityconf.com/)

#### Courseware

* [University of Illinois CS 525: Advanced Distributed Systems](http://courses.engr.illinois.edu/cs525/sp2011/sched.htm)