Literature Review for Fault Detection in Distributed Systems

Monitoring is Dead: Long Live Monitoring

Abstract

Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software now consists of large distributed systems made up of many non-uniform, interacting components, while the core functionality of monitoring systems has stagnated. Furthermore, the people responsible for monitoring and operating these systems often lack expert knowledge of them. In this talk, we will explore how our current monitoring capabilities are failing us and discuss how we can build systems that are both reliable and observable while making our lives (or the lives of the people responsible for operating them in production) easier.

References

  1. Fischer, M., Lynch, N., and Paterson, M. Impossibility of Distributed Consensus with One Faulty Process. in Journal of the Association for Computing Machinery, Vol. 32, No. 2, April 1985, pp. 374-382.
  2. Lamport, L., Shostak, R., and Pease, M. The Byzantine Generals Problem. in ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
  3. Poledna, S., Burns, A., Wellings, A., and Barrett, P. Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems. in IEEE Transactions on Computers, Vol. 49, No. 2, February 2000, pp. 100-111.
  4. Videla, A. Failure Modes in Distributed Systems. in his blog, December 2013.

Further Reading

Tools