Add document on writing good e2e tests. #20041

Merged (1 commit, Feb 26, 2016)
2 changes: 2 additions & 0 deletions docs/devel/e2e-tests.md
@@ -42,6 +42,8 @@ The e2e tests in kubernetes are built atop of [Ginkgo](http://onsi.github.io/gin

The purpose of *this* document is to serve as a primer for developers who are looking to execute or add tests using a local development environment.

Before writing new tests or making substantive changes to existing tests, you should also read [Writing Good e2e Tests](writing-good-e2e-tests.md).

## Building and Running the Tests

There are a variety of ways to run e2e tests, but we aim to converge on a single canonical way: `hack/e2e.go`.
264 changes: 264 additions & 0 deletions docs/devel/writing-good-e2e-tests.md
@@ -0,0 +1,264 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Writing good e2e tests for Kubernetes #

## Patterns and Anti-Patterns ##

### Goals of e2e tests ###

Beyond the obvious goal of providing end-to-end system test coverage,
there are a few less obvious goals that you should bear in mind when
designing, writing and debugging your end-to-end tests. In
particular, "flaky" tests, which pass most of the time but fail
intermittently for difficult-to-diagnose reasons, are extremely
costly: they blur our regression signals and slow down our automated
merge queue. Up-front time and effort spent designing your test to be
reliable is very well spent. Bear in mind that we have hundreds of
tests, each running in dozens of different environments, and if any
test in any test environment fails, we have to assume that we
potentially have some sort of regression. So if a significant number
of tests fail even only 1% of the time, basic statistics dictates
that we will almost never have a "green" regression indicator. Stated
another way, a test that is only 99% reliable is just about useless
in the harsh reality of a CI environment. In fact it's worse than
useless, because not only does it fail to provide a reliable
regression indicator, it also costs a lot of subsequent debugging
time and delays merges.

#### Debuggability ####

If your test fails, it should report the reason for the failure in
as much detail as possible in its output. "Timeout" is not a useful
error message. "Timed out after 60 seconds waiting for pod xxx to
enter running state, still in pending state" is much more useful to
someone trying to figure out why your test failed and what to do
about it. Specifically,
[assertion](https://onsi.github.io/gomega/#making-assertions) code
like the following generates rather useless errors:

```
Expect(err).NotTo(HaveOccurred())
```

Rather,
[annotate](https://onsi.github.io/gomega/#annotating-assertions) your assertion with something like this:

```
Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
```

On the other hand, overly verbose logging, particularly of non-error
conditions, can make it unnecessarily difficult to figure out whether
a test failed and, if so, why. So don't log lots of irrelevant stuff
either.
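
As a concrete sketch of what such an error might look like (the helper below is invented for this document and is not part of the e2e framework), a wait loop can record the last state it observed and include it, together with the timeout and the object being waited on, in the error it returns:

```
package e2eexample

import (
	"fmt"
	"time"
)

// waitForPodPhase is a hypothetical helper, not part of the real framework:
// it polls getPhase until the pod reports wantPhase, and on timeout returns
// an error that says what it was waiting for, for how long, and what the
// last observed state was, rather than a bare "timed out".
func waitForPodPhase(podName, wantPhase string, timeout time.Duration, getPhase func() (string, error)) error {
	deadline := time.Now().Add(timeout)
	lastPhase := "<unknown>"
	for time.Now().Before(deadline) {
		phase, err := getPhase()
		if err == nil {
			if phase == wantPhase {
				return nil
			}
			lastPhase = phase
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("timed out after %v waiting for pod %s to enter %s state, still in %s state",
		timeout, podName, wantPhase, lastPhase)
}
```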

#### Ability to run in non-dedicated test clusters ####

To reduce end-to-end delay and improve resource utilization when
running e2e tests, we try, where possible, to run large numbers of
tests in parallel against the same test cluster. This means that:

1. You should avoid making any assumption (implicit or explicit) that
your test is the only thing running against the cluster. For example,
assuming that your test can run a pod on every node in a cluster is
not safe, as some other tests, running at the same time as yours,
might have saturated one or more nodes in the cluster. Similarly,
running a pod in the system namespace and assuming that this will
increase the pod count in that namespace by exactly one is not safe,
as some other test might be creating or deleting pods in the system
namespace at the same time as your test. If you do legitimately need
to write a test like that, make sure to label it
["\[Serial\]"](e2e-tests.md#kinds_of_tests) so that it is easy to
identify and is not run in parallel with any other tests (see the
sketch after this list).
1. You should avoid doing things to the cluster that make it
difficult for other tests to reliably do what they're trying to do at
the same time. For example, rebooting nodes, disconnecting network
interfaces, or upgrading cluster software as part of your test is
likely to violate the assumptions that other tests might have made
about a reasonably stable cluster environment. If you need to write
such tests, please label them as
["\[Disruptive\]"](e2e-tests.md#kinds_of_tests) so that they are easy
to identify and are not run in parallel with other tests.
1. You should avoid making assumptions about the Kubernetes API that
are not part of the API specification, as your tests will break as
soon as these assumptions become invalid. For example, relying on
specific Events, Event reasons or Event messages will make your tests
very brittle.
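
The sketch below (the test names and bodies are invented; only the labelling convention is taken from this document) shows that such labels are simply embedded in the Ginkgo `Describe`/`It` strings, so labelled tests can be filtered out of parallel runs:

```
package e2eexample

import . "github.com/onsi/ginkgo"

// Tests carrying "[Serial]" or "[Disruptive]" in their names can be
// excluded from parallel runs using Ginkgo's focus/skip regexp flags.
var _ = Describe("Scheduling [Serial]", func() {
	It("should be able to place a pod on every node", func() {
		// Assumes exclusive use of the cluster, hence [Serial].
	})
})

var _ = Describe("Nodes [Disruptive]", func() {
	It("should recover after a node reboot", func() {
		// Reboots nodes, so must not run alongside other tests.
	})
})
```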

#### Speed of execution ####

We have hundreds of e2e tests, some of which we run in serial, one
after the other. If each test takes just a few minutes
to run, that very quickly adds up to many, many hours of total
execution time. We try to keep such total execution time down to a
few tens of minutes at most. Therefore, try (very hard) to keep the
execution time of your individual tests below 2 minutes, ideally
shorter than that. Concretely, adding inappropriately long 'sleep'
statements or other gratuitous waits to tests is a killer. If under
normal circumstances your pod enters the running state within 10
seconds, and 99.9% of the time within 30 seconds, it would be
gratuitous to wait 5 minutes for this to happen. Rather, just fail
after 30 seconds, with a clear error message as to why your test
failed (e.g. "Pod x failed to become ready after 30 seconds, it
usually takes 10 seconds"). If you do have a truly legitimate reason
for waiting longer than that, or writing a test which takes longer
than 2 minutes to run, comment very clearly in the code why this is
necessary, and label the test as
["\[Slow\]"](e2e-tests.md#kinds_of_tests), so that it's easy to
identify and avoid in test runs that are required to complete
timeously (for example those that are run against every code
submission before it is allowed to be merged).
Note that completing within, say, 2 minutes only when the test
passes is not generally good enough. Your test should also fail in a
reasonable time. We have seen tests that, for example, wait up to 10
minutes for each of several pods to become ready. Under good
conditions these tests might pass within a few seconds, but if the
pods never become ready (e.g. due to a system regression) they take a
very long time to fail and typically cause the entire test run to time
out, so that no results are produced. Again, this is a lot less
useful than a test that fails reliably within a minute or two when the
system is not working correctly.
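
A minimal sketch of such a bounded wait, using Gomega's `Eventually` as in our other e2e tests (the pod name, the `podPhase` poller and the exact timeout values here are invented for illustration):

```
package e2eexample

import (
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// Pods usually start within ~10s; 30s is a generous upper bound. Never
// pad this out to several minutes "just in case".
const podStartTimeout = 30 * time.Second

// podPhase is a placeholder; a real test would query the API server.
func podPhase(podName string) func() string {
	return func() string { return "Running" }
}

var _ = Describe("Pod startup [example only]", func() {
	It("should enter the Running state within the time budget", func() {
		podName := "example-pod"
		// Poll every 2s, but give up (with a descriptive failure) after
		// podStartTimeout rather than blocking the whole suite.
		Eventually(podPhase(podName), podStartTimeout, 2*time.Second).Should(
			Equal("Running"),
			"pod %s failed to become ready after %v, it usually takes about 10 seconds",
			podName, podStartTimeout)
	})
})
```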

#### Resilience to relatively rare, temporary infrastructure glitches or delays ####

Remember that your test will be run many thousands of
times, at different times of day and night, probably on different
cloud providers, under different load conditions. And often the
underlying state of these systems is stored in eventually consistent
data stores. So, for example, if a resource creation request is
theoretically asynchronous, even if you observe it to be practically
synchronous most of the time, write your test to assume that it's
asynchronous (e.g. make the "create" call, and poll or watch the
resource until it's in the correct state before proceeding).
Similarly, don't assume that API endpoints are 100% available.
They're not. Under high load conditions, API calls might temporarily
fail or time-out. In such cases it's appropriate to back off and retry
a few times before failing your test completely (in which case make
the error message very clear about what happened, e.g. "Retried
http://... 3 times - all failed with xxx"). Use the standard
retry mechanisms provided in the libraries detailed below.
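
Here is a minimal sketch of that back-off-and-retry pattern (the URL, attempt count and back-off values are invented; in real tests, prefer the shared retry and wait helpers in the e2e utils library described below):

```
package e2eexample

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry retries an HTTP GET with exponential back-off. The final
// error records how many attempts were made and the last failure seen,
// so the test output makes clear exactly what happened.
func getWithRetry(url string, attempts int) (*http.Response, error) {
	var lastErr error
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected status %q", resp.Status)
		}
		time.Sleep(backoff)
		backoff *= 2 // back off a little more before each retry
	}
	return nil, fmt.Errorf("retried GET %s %d times - all failed, last error: %v",
		url, attempts, lastErr)
}
```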

### Some concrete tools at your disposal ###

Obviously most of the above goals apply to many tests, not just
yours. So we've developed a set of reusable test infrastructure,
libraries and best practices to help you do the right thing, or at
least do the same thing as other tests, so that if that turns out to
be the wrong thing it can be fixed in one place, not hundreds.
> **Contributor comment:** Maybe add a pointer that (1) opening issues for
> specific better tooling is welcome, and (2) code implementing that tooling
> is even more welcome.
>
> **Author:** Done (at the end of the section).


Here are a few pointers:

+ [E2e Framework](../../test/e2e/framework.go):
Familiarise yourself with this test framework and how to use it.
Amongst other things, it automatically creates uniquely named namespaces
within which your tests can run to avoid name clashes, and reliably
automates cleaning up the mess after your test has completed (it
just deletes everything in the namespace). This helps to ensure
that tests do not leak resources. Note that deleting a namespace
(and by implication everything in it) is currently an expensive
operation. So the fewer resources you create, the less cleaning up
the framework needs to do, and the faster your test (and other
tests running concurrently with yours) will complete. Your tests
should always use this framework. Trying other home-grown
approaches to avoiding name clashes and resource leaks has proven
to be a very bad idea.

> **Member comment:** I am not sure if it's worth noting here, but soak tests
> that generate a lot of content in a namespace will have a negative impact
> on other tests that run in parallel, given the cost of namespace deletion,
> which (until we have better APIs in etcd) does a 1x1 DELETE call to a
> resource for all resources other than Events.
>
> **Author:** Added some notes to that effect.

+ [E2e utils library](../../test/e2e/util.go):
This handy library provides tons of reusable code for a host of
commonly needed test functionality, including waiting for resources
to enter specified states, safely and consistently retrying failed
operations, usefully reporting errors, and much more. Make sure
that you're familiar with what's available there, and use it.
Likewise, if you come across a generally useful mechanism that's
not yet implemented there, add it so that others can benefit from
your brilliance. In particular, pay attention to the variety of
timeout- and retry-related constants at the top of that file. Always
try to reuse these constants rather than dreaming up your own
values. Even if the values there are not precisely what you would
like to use (timeout periods, retry counts etc), the benefit of
having them be consistent and centrally configurable across our
entire test suite typically outweighs your personal preferences.
+ **Follow the examples of stable, well-written tests:** Some of our
existing end-to-end tests are better written and more reliable than
others. A few examples of well-written tests include:
[Replication Controllers](../../test/e2e/rc.go),
[Services](../../test/e2e/service.go),
[Reboot](../../test/e2e/reboot.go).
+ [Ginkgo Test Framework](https://github.com/onsi/ginkgo): This is the
test library and runner upon which our e2e tests are built. Before
you write or refactor a test, read the docs and make sure that you
understand how it works. In particular, be aware that every test is
uniquely identified and described (e.g. in test reports) by the
concatenation of its `Describe` clause and nested `It` clauses.
So for example `Describe("Pods", ...)` with a nested `It("should be
scheduled with cpu and memory limits")` produces a sane test
identifier and descriptor `Pods should be scheduled with cpu and
memory limits`, which makes it clear what's being tested, and hence
what's not working if it fails (a minimal code sketch of this naming
appears after the examples below). Other good examples include:

```
CAdvisor should be healthy on every node
```

and

```
Daemon set should run and stop complex daemon
```

By contrast
(and these are real examples), the following are less good test
descriptors:

```
KubeProxy should test kube-proxy
```

and

```
Nodes [Disruptive] Network when a node becomes unreachable
[replication controller] recreates pods scheduled on the
unreachable node AND allows scheduling of pods on a node after
it rejoins the cluster
```

An improvement might be

```
Unreachable nodes are evacuated and then repopulated upon rejoining [Disruptive]
```
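
For reference, here is a minimal sketch of how such identifiers arise from the `Describe`/`It` nesting (the spec bodies are empty placeholders):

```
package e2eexample

import . "github.com/onsi/ginkgo"

// Ginkgo reports each spec as the concatenation of its Describe text and
// its nested It text, so these placeholder specs appear in test output as:
//
//   Pods should be scheduled with cpu and memory limits
//   CAdvisor should be healthy on every node
var _ = Describe("Pods", func() {
	It("should be scheduled with cpu and memory limits", func() {
		// test body elided
	})
})

var _ = Describe("CAdvisor", func() {
	It("should be healthy on every node", func() {
		// test body elided
	})
})
```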

Note that opening issues for specific better tooling is welcome, and
code implementing that tooling is even more welcome :-).


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/writing-good-e2e-tests.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->