
Flaky test cases in docker tests #27

Closed
nroi opened this issue Dec 6, 2020 · 6 comments · Fixed by #55

Comments

@nroi
Owner

nroi commented Dec 6, 2020

The problem seems to be that, for some reason, during our integration tests, all services can suddenly stall completely: they are completely frozen and do not respond to HTTP requests or execute any code at all. It only takes a few seconds or a few hundred milliseconds until they "unfreeze", but this is still enough to make test cases fail.

The problem has been analyzed on the timeouts-wip branch; this is what is known so far:

  • The problem is related to Docker, not to the Rust program. To verify this, we have added an NGINX service to our docker-compose.yml and written a bash script test_stalling.sh: we can start our docker services with ./docker-compose and then run the test_stalling.sh script (a sketch of the idea follows after this list). It may take more than one attempt to reproduce the issue, but test_stalling.sh should eventually output something like >>> Threshold exceeded: 1.625723038

  • The problem is not restricted to a single docker service: it seems that all docker services are stalling completely. We can run the following two statements in two parallel bash sessions:

    while true; do curl -f -s http://127.0.0.1:8088 > /dev/null && date +%s%N; sleep .2; done | ./analyze_nanos.py
    while true; do curl -f -s http://127.0.0.1:8099 > /dev/null && date +%s%N; sleep .2; done | ./analyze_nanos.py

    and we will see that the error occurs in both bash sessions at the same time.

  • It's not just the docker services started with this particular compose file that stall: if we start a third NGINX server outside of the docker-compose setup, for example like so:

    docker container run -it --rm -p 8077:80 nginx:1.19.1-alpine

    then we can verify that this NGINX instance stalls at exactly the same moment at which the other NGINX instances stall.
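
The test_stalling.sh script itself is not reproduced here; as a minimal sketch of the idea, assuming the script simply times each request against a fixed threshold (the port, threshold, and polling interval below are assumptions), it could look roughly like this:

    #!/usr/bin/env bash
    # Hypothetical sketch, not the actual test_stalling.sh: time each request to the
    # NGINX container and report whenever a single request exceeds the threshold.
    threshold=1.0
    while true; do
        start=$(date +%s.%N)
        curl -f -s http://127.0.0.1:8088 > /dev/null
        end=$(date +%s.%N)
        elapsed=$(echo "$end - $start" | bc -l)
        if (( $(echo "$elapsed > $threshold" | bc -l) )); then
            echo ">>> Threshold exceeded: $elapsed"
        fi
        sleep 0.2
    done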

@nroi
Owner Author

nroi commented Dec 6, 2020

The problem seems to be that, while our docker containers are running, certain actions can only be executed with a delay. For example, if NGINX is running on localhost – even without docker – then the following will show that connection establishment is sometimes delayed when our docker-compose tests are running:

while true; do echo -n '' | nc -N 127.0.0.1 80; date +%s%N; sleep 0.1; done | ./analyze_nanos.py

@nroi
Owner Author

nroi commented Dec 6, 2020

It seems to be unrelated to sockets:

while true; do echo -n '' | date +%s%N; sleep 0.1; done | ./analyze_nanos.py

This gave the following output while the docker tests were running:

>>> Threshold exceeded at 2020-12-06 18:45:54.307145: 1.493492608
>>> Threshold exceeded at 2020-12-06 18:45:58.851240: 4.544129866

@nroi
Owner Author

nroi commented Dec 6, 2020

Another example:

while true; do date +%s%N; sleep 0.1; done | ./analyze_nanos.py
>>> Threshold exceeded at 2020-12-06 19:11:48.709007: 2.51274226

@nroi
Owner Author

nroi commented Dec 6, 2020

while true; do strace --absolute-timestamps=format:unix,precision:ns /bin/true  2>&1; done | ./analyze_nanos.py
>>> Threshold exceeded at 2020-12-06 21:32:27.167433: 19.471243619918823
1607286727.695725213 +++ exited with 0 +++

1607286747.166968830 execve("/bin/true", ["/bin/true"], 0x7ffc168da2e8 /* 61 vars */) = 0

So even spawning /bin/true stalls: the gap between the previous exit (1607286727.69…) and the next execve (1607286747.16…) is roughly 19.47 seconds, which matches the reported value.

nroi added a commit that referenced this issue May 16, 2021
Fixes #27

It seems that the issue was caused by an overloaded file system: Copying
large amounts of data inside the docker containers would cause spikes in
IO usage which, in turn, caused freezes, which then caused test cases to
fail because we're using timeouts in Flexo.

Storing the packages in a tmpfs means that the files are kept in RAM, not
on disk. This is more of a workaround, and it has the disadvantage that
running the test cases with docker now requires substantial amounts of
RAM: the test cases will most likely not run successfully with less than
32 GB of RAM.
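
As an illustration of the tmpfs workaround (this is not the actual change from #55; the image name, mount path, port, and size below are assumptions), a directory can be kept in RAM with docker's --tmpfs flag; in a docker-compose file, the equivalent is a tmpfs entry under the service definition:

    # Hypothetical example: keep the package directory on a tmpfs so that writes
    # go to RAM instead of the disk. Image name, path, port and size are assumptions.
    docker container run -it --rm \
      --tmpfs /var/cache/flexo/pkg:rw,size=24g \
      -p 7878:7878 flexo-test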
@nroi nroi closed this as completed in #55 May 16, 2021
@aude
Contributor

aude commented May 21, 2021

For what it's worth, I've been experiencing freezes in the service lately (haven't run the test suite).

I've not been able to get any response over HTTP, and the log shows nothing happening. Even restarting flexo doesn't work immediately. Then, after maybe one second to two minutes, it just starts responding again as if nothing happened.

Haven't looked into it, so could definitely be something local, though thought I'd mention anyway just in case.

@nroi nroi mentioned this issue May 21, 2021
@nroi
Owner Author

nroi commented May 21, 2021

> For what it's worth, I've been experiencing freezes in the service lately (haven't run the test suite).

I'm pretty confident that the freezes described in this issue were caused by the docker setup, not by any bug in the actual Flexo code, and I'm also confident that this issue has now been fixed. So what you're describing is probably a different issue.

I've created a separate issue from your description. If you find out anything more about this issue, such as a pattern where it always occurs in a specific situation, please let me know, because these sorts of things are difficult to troubleshoot.
