add involved topics & impact

hjacobs · Feb 3, 2019 · f924344
1 parent 2df38bb

Showing 2 changed files with 37 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+.*
diff --git a/README.md b/README.md
@@ -3,23 +3,57 @@
 A compiled list of links to public failure stories related to Kubernetes.
 Most recent publications on top.
 
-* [Kubernetes Load Balancer Configuration – Beware when draining nodes - DevOps Hof - blog post 2019](https://www.devops-hof.de/kubernetes-load-balancer-konfiguration-beware-when-draining-nodes/)
+* [Kubernetes Load Balancer Configuration - Beware when draining nodes - DevOps Hof - blog post 2019](https://www.devops-hof.de/kubernetes-load-balancer-konfiguration-beware-when-draining-nodes/)
+    * involved: GCP Load Balancer, externalTrafficPolicy, ingress-nginx
+    * impact: total ingress traffic outage
 * [On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019](https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df)
+    * involved: on-premise, Kafka, large cluster, Consul
+    * impact: development environment outage
 * [Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018](https://www.slideshare.net/try_except_/running-kubernetes-in-production-a-million-ways-to-crash-your-cluster-devopscon-munich-2018)
+    * involved: AWS, Ingress, CronJob, etcd, flannel, Docker, CPU throttling
+    * impact: production outages
 * [Outages? Downtime? - Veracode - blog post 2018](https://sethmccombs.github.io/work/2018/12/03/Outages.html)
+    * involved: AWS, AWS IAM, region migration, kubespray, Terraform, pod CIDR
+    * impact: QA/dev cluster outage
 * [NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018](https://keepingitclassless.net/2018/12/december-4---nre-labs-outage-post-mortem/)
+    * involved: GCP, kubeadm, etcd, Terraform, livenessProbe
+    * impact: production outage
 * [A Perfect DNS Storm - Toyota Connected - blog post 2018](https://www.adammargherio.com/a-perfect-dns-storm/)
+    * involved: Azure, DNS, ndots 5, Alpine musl libc
+    * impact: DNS resolution failures
 * [Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018](https://itnext.io/kubernetes-and-the-menace-elb-the-tale-of-an-outage-c00bef678fc0)
+    * involved: AWS, kube-aws, ELB dynamic IPs, API server, kubelet, NotReady nodes
+    * impact: 15-minute cluster outage
 * [Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018](https://www.youtube.com/watch?v=tA8Sr3Nsx1I)
+    * involved: AWS, kops, HAProxy, livenessProbe, DNS, too many open files
+    * impact: unknown outages, DNS errors
 * [AirMap Platform Service Outage - AirMap - incident report 2018](https://www.airmap.com/incident-180719/)
+    * involved: Azure, NotReady nodes, kubelet PLEG, CNI
+    * impact: production AirMap platform outage
 * [Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018](https://www.youtube.com/watch?v=OUYTNywPk-s)
+    * involved:
+    * impact:
 * [101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018](https://www.youtube.com/watch?v=likHm-KHGWQ)
+    * involved:
+    * impact:
 * [101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017](https://www.youtube.com/watch?v=xZO9nx6GBu0)
+    * involved:
+    * impact:
 * [Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017](https://community.monzo.com/t/resolved-current-account-payments-may-fail-major-outage-27-10-2017/26296/95)
+    * involved: AWS, etcd, Linkerd, NullPointerException, services without endpoints
+    * impact: major production outage, full platform outage, current account payments fail
 * [Search and Reporting Outage - Universe - incident report 2017](http://status.universe.com/incidents/115n3vxqwzcf)
+    * involved: Job, RestartPolicy, consume node resources
+    * impact: production Universe search and reporting outage
 * [Our First Kubernetes Outage - Saltside - blog post 2017](https://engineering.saltside.se/our-first-kubernetes-outage-c6b9249cfd3a)
+    * involved: AWS, kops, Helm, DeadNode, resource exhaustion
+    * impact: nonproduction cluster outage
 * [Our Failure Migrating to Kubernetes - Saltside - blog post 2017](https://engineering.saltside.se/our-failure-migrating-to-kubernetes-25c28e6dd604)
+    * involved: AWS, kops, ELB, BackendConnectionErrors, LoadBalancer service
+    * impact: aborted application migration
 * [SaleMove US System Issue - SaleMove - incident report 2017](https://status.salemove.com/incidents/xf6cr710yrzn)
+    * involved: AWS, ELB dynamic IPs, DNS A for master, API server
+    * impact: production issues with SaleMove US System
 
 # Why
 
@@ -28,7 +62,7 @@ Its ecosystem is constantly evolving and adding even more layers (service mesh,
 Considering this environment, we don't hear enough real-world horror stories to learn from each other!
 This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to
 learn from others and reduce the unknown unknowns of running Kubernetes in production.
-For more information, [see the blog post](https://srcco.de/posts/kubernetes-failure-stories.html). 
+For more information, [see the blog post](https://srcco.de/posts/kubernetes-failure-stories.html).
 
 
 # Contributing