Fix bad config in flaky test documentation and add script to help check for flakes #4338

Merged 1 commit on Feb 11, 2015
docs/devel/flaky-tests.md (24 changes: 18 additions & 6 deletions)
@@ -11,7 +11,7 @@ There is a testing image ```brendanburns/flake``` up on the docker hub. We will use this image to test our fix.

Create a replication controller with the following config:
```yaml
-id: flakeController
+id: flakecontroller
kind: ReplicationController
apiVersion: v1beta1
desiredState:
@@ -41,14 +41,26 @@ labels:

```./cluster/kubectl.sh create -f controller.yaml```
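
Once the controller is created, it can be worth confirming that the pods actually came up before waiting for test runs to accumulate. A minimal sketch, assuming the resources can be listed with the usual ```replicationcontrollers``` and ```pods``` names; the output format depends on your kubectl version:

```sh
# Hedged sketch: list the controller and its pods to confirm they were created.
./cluster/kubectl.sh get replicationcontrollers
./cluster/kubectl.sh get pods
```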

-This will spin up 100 instances of the test. They will run to completion, then exit, the kubelet will restart them, eventually you will have sufficient
-runs for your purposes, and you can stop the replication controller by setting the ```replicas``` field to 0 and then running:
+This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
+You can examine the recent runs of the test by calling ```docker ps -a``` and looking for tasks that exited with non-zero exit codes. Unfortunately, docker ps -a only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
+You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:

```sh
-./cluster/kubectl.sh update -f controller.yaml
-./cluster/kubectl.sh delete -f controller.yaml
+echo "" > output.txt
+for i in {1..4}; do
+  echo "Checking kubernetes-minion-${i}"
+  echo "kubernetes-minion-${i}:" >> output.txt
+  gcloud compute ssh "kubernetes-minion-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
```
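
Since ```docker ps -a``` only keeps the exit status of the last 15-20 containers per image, it can help to re-run the check on a schedule rather than by hand. A minimal sketch, assuming the script above has been saved as an executable file named check-flakes.sh (a hypothetical name):

```sh
# Hedged sketch: re-run the flake check every 10 minutes (interval chosen arbitrarily)
# so exit statuses are captured before docker discards them.
while true; do
  ./check-flakes.sh   # the node-by-node "docker ps -a" loop shown above
  sleep 600
done
```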

-Now examine the machines with ```docker ps -a``` and look for tasks that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replica controller)
+Eventually you will have sufficient runs for your purposes. At that point you can stop and delete the replication controller by running:

+```sh
+./cluster/kubectl.sh stop replicationcontroller flakecontroller
+```

+If you do a final check for flakes with ```docker ps -a```, ignore tasks that exited -1, since that's what happens when you stop the replication controller.
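
One way to do that final pass, reusing the output.txt produced by the script above; this is a sketch only, since the exact status strings docker prints can vary between versions:

```sh
# Hedged sketch: show containers that exited non-zero, ignoring the -1 statuses
# left behind when the replication controller is stopped.
grep "Exited (" output.txt | grep -v "Exited (0)" | grep -v "Exited (-1)"
```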

Happy flake hunting!