This repository has been archived by the owner on Aug 23, 2020. It is now read-only.

add involved topics & impact
hjacobs committed Feb 3, 2019
1 parent 2df38bb commit f924344
Showing 2 changed files with 37 additions and 2 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.*
38 changes: 36 additions & 2 deletions README.md
@@ -3,23 +3,57 @@
A compiled list of links to public failure stories related to Kubernetes.
Most recent publications on top.

* [Kubernetes Load Balancer Configuration - Beware when draining nodes - DevOps Hof - blog post 2019](https://www.devops-hof.de/kubernetes-load-balancer-konfiguration-beware-when-draining-nodes/)
* involved: GCP Load Balancer, externalTrafficPolicy, ingress-nginx
* impact: total ingress traffic outage
* [On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019](https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df)
* involved: on-premise, Kafka, large cluster, Consul
* impact: development environment outage
* [Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018](https://www.slideshare.net/try_except_/running-kubernetes-in-production-a-million-ways-to-crash-your-cluster-devopscon-munich-2018)
* involved: AWS, Ingress, CronJob, etcd, flannel, Docker, CPU throttling
* impact: production outages
* [Outages? Downtime? - Veracode - blog post 2018](https://sethmccombs.github.io/work/2018/12/03/Outages.html)
* involved: AWS, AWS IAM, region migration, kubespray, Terraform, pod CIDR
* impact: QA/dev cluster outage
* [NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018](https://keepingitclassless.net/2018/12/december-4---nre-labs-outage-post-mortem/)
* involved: GCP, kubeadm, etcd, Terraform, livenessProbe
* impact: production outage
* [A Perfect DNS Storm - Toyota Connected - blog post 2018](https://www.adammargherio.com/a-perfect-dns-storm/)
* involved: Azure, DNS, ndots 5, Alpine musl libc
* impact: DNS resolution failures
* [Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018](https://itnext.io/kubernetes-and-the-menace-elb-the-tale-of-an-outage-c00bef678fc0)
* involved: AWS, kube-aws, ELB dynamic IPs, API server, kubelet, NotReady nodes
* impact: 15-minute cluster outage
* [Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018](https://www.youtube.com/watch?v=tA8Sr3Nsx1I)
* involved: AWS, kops, HAProxy, livenessProbe, DNS, too many open files
* impact: unknown outages, DNS errors
* [AirMap Platform Service Outage - AirMap - incident report 2018](https://www.airmap.com/incident-180719/)
* involved: Azure, NotReady nodes, kubelet PLEG, CNI
* impact: production AirMap platform outage
* [Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018](https://www.youtube.com/watch?v=OUYTNywPk-s)
* involved:
* impact:
* [101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018](https://www.youtube.com/watch?v=likHm-KHGWQ)
* involved:
* impact:
* [101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017](https://www.youtube.com/watch?v=xZO9nx6GBu0)
* involved:
* impact:
* [Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017](https://community.monzo.com/t/resolved-current-account-payments-may-fail-major-outage-27-10-2017/26296/95)
* involved: AWS, etcd, Linkerd, NullPointerException, services without endpoints
* impact: major production outage, full platform outage, current account payments fail
* [Search and Reporting Outage - Universe - incident report 2017](http://status.universe.com/incidents/115n3vxqwzcf)
* involved: Job, RestartPolicy, consume node resources
* impact: production Universe search and reporting outage
* [Our First Kubernetes Outage - Saltside - blog post 2017](https://engineering.saltside.se/our-first-kubernetes-outage-c6b9249cfd3a)
* involved: AWS, kops, Helm, DeadNode, resource exhaustion
* impact: nonproduction cluster outage
* [Our Failure Migrating to Kubernetes - Saltside - blog post 2017](https://engineering.saltside.se/our-failure-migrating-to-kubernetes-25c28e6dd604)
* involved: AWS, kops, ELB, BackendConnectionErrors, LoadBalancer service
* impact: aborted application migration
* [SaleMove US System Issue - SaleMove - incident report 2017](https://status.salemove.com/incidents/xf6cr710yrzn)
* involved: AWS, ELB dynamic IPs, DNS A record for master, API server
* impact: production issues with SaleMove US System

# Why

@@ -28,7 +62,7 @@ Its ecosystem is constantly evolving and adding even more layers (service mesh,
Considering this environment, we don't hear enough real-world horror stories to learn from each other!
This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to
learn from others and reduce the unknown unknowns of running Kubernetes in production.
For more information, [see the blog post](https://srcco.de/posts/kubernetes-failure-stories.html).


# Contributing
