
Flaky test cases in docker tests #27

Closed
nroi opened this issue Dec 6, 2020 · 6 comments · Fixed by #55

Comments

@nroi
Owner

nroi commented Dec 6, 2020

The problem seems to be that, for some reason, during our integration tests, all services can suddenly stall completely: they are completely frozen and do not respond to HTTP requests or execute any code at all. It only takes a few seconds or a few hundred milliseconds until they "unfreeze", but this is still enough to make test cases fail.

The problem has been analyzed on the timeouts-wip branch; this is what is known so far:

  • The problem is related to Docker, not to the Rust program. To verify this, we have added an NGINX service to our docker-compose.yml and written a bash script test_stalling.sh: we can start our docker services with ./docker-compose and then run the test_stalling.sh script (a sketch of the idea follows after this list). It may take more than one attempt to reproduce the issue, but test_stalling.sh should eventually output something like >>> Threshold exceeded: 1.625723038

  • The problem is not restricted to a single docker service: it seems that all docker services are stalling completely. We can run the following two statements in two parallel bash sessions:

    while true; do curl -f -s http://127.0.0.1:8088 > /dev/null && date +%s%N; sleep .2; done | ./analyze_nanos.py
    while true; do curl -f -s http://127.0.0.1:8099 > /dev/null && date +%s%N; sleep .2; done | ./analyze_nanos.py

    and we will see that the error occurs in both bash sessions at the same time.

  • It's not just the docker services started with this particular compose file that stall: if we start a third NGINX server outside of the docker-compose setup, for example like so:

    docker container run -it --rm -p 8077:80 nginx:1.19.1-alpine

    then we can verify that this NGINX instance stalls at exactly the same moment at which the other NGINX instances stall.
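
The test_stalling.sh script itself is not reproduced here; as a minimal sketch of the idea, assuming the script simply times each request against a fixed threshold (the port, threshold, and polling interval below are assumptions), it could look roughly like this:

    #!/usr/bin/env bash
    # Hypothetical sketch, not the actual test_stalling.sh: time each request to the
    # NGINX container and report whenever a single request exceeds the threshold.
    threshold=1.0
    while true; do
        start=$(date +%s.%N)
        curl -f -s http://127.0.0.1:8088 > /dev/null
        end=$(date +%s.%N)
        elapsed=$(echo "$end - $start" | bc -l)
        if (( $(echo "$elapsed > $threshold" | bc -l) )); then
            echo ">>> Threshold exceeded: $elapsed"
        fi
        sleep 0.2
    done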

@nroi
Owner Author

nroi commented Dec 6, 2020

The problem seems to be that, while our docker containers are running, certain actions can only be executed with a delay. For example, if NGINX is running on localhost – even without docker – then the following will show that connection establishment is sometimes delayed when our docker-compose tests are running:

while true; do echo -n '' | nc -N 127.0.0.1 80; date +%s%N; sleep 0.1; done | ./analyze_nanos.py

@nroi
Owner Author

nroi commented Dec 6, 2020

It seems to be unrelated to sockets:

while true; do echo -n '' | date +%s%N; sleep 0.1; done | ./analyze_nanos.py

This gave the following output while the docker tests were running:

>>> Threshold exceeded at 2020-12-06 18:45:54.307145: 1.493492608
>>> Threshold exceeded at 2020-12-06 18:45:58.851240: 4.544129866

@nroi
Owner Author

nroi commented Dec 6, 2020

Another example:

while true; do date +%s%N; sleep 0.1; done | ./analyze_nanos.py
>>> Threshold exceeded at 2020-12-06 19:11:48.709007: 2.51274226

@nroi
Owner Author

nroi commented Dec 6, 2020

while true; do strace --absolute-timestamps=format:unix,precision:ns /bin/true  2>&1; done | ./analyze_nanos.py
>>> Threshold exceeded at 2020-12-06 21:32:27.167433: 19.471243619918823
1607286727.695725213 +++ exited with 0 +++

1607286747.166968830 execve("/bin/true", ["/bin/true"], 0x7ffc168da2e8 /* 61 vars */) = 0

So even spawning /bin/true stalls: the gap between the previous exit (1607286727.69…) and the next execve (1607286747.16…) is roughly 19.47 seconds, which matches the reported value.

nroi added a commit that referenced this issue May 16, 2021
Fixes #27

It seems that the issue was caused by an overloaded file system: Copying
large amounts of data inside the docker containers would cause spikes in
IO usage which, in turn, caused freezes, which then caused test cases to
fail because we're using timeouts in Flexo.

Storing the packages in a tmpfs means that the files are kept in RAM, not
on disk. This is more of a workaround, and it has the disadvantage that
running the test cases with docker now requires substantial amounts of
RAM: the test cases will most likely not run successfully with less than
32 GB of RAM.
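
As an illustration of the tmpfs workaround (this is not the actual change from #55; the image name, mount path, port, and size below are assumptions), a directory can be kept in RAM with docker's --tmpfs flag; in a docker-compose file, the equivalent is a tmpfs entry under the service definition:

    # Hypothetical example: keep the package directory on a tmpfs so that writes
    # go to RAM instead of the disk. Image name, path, port and size are assumptions.
    docker container run -it --rm \
      --tmpfs /var/cache/flexo/pkg:rw,size=24g \
      -p 7878:7878 flexo-test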
@nroi nroi closed this as completed in #55 May 16, 2021
@aude
Contributor

aude commented May 21, 2021

For what it's worth, I've been experiencing freezes in the service lately (haven't run the test suite).

I've not been able to get any response over HTTP, and the log shows nothing happening. Even restarting flexo doesn't work immediately. Then, after maybe one second to two minutes, it just starts responding again as if nothing happened.

Haven't looked into it, so could definitely be something local, though thought I'd mention anyway just in case.

@nroi nroi mentioned this issue May 21, 2021
@nroi
Owner Author

nroi commented May 21, 2021

> For what it's worth, I've been experiencing freezes in the service lately (haven't run the test suite).

I'm pretty confident that the freezes described in this issue were caused by the docker setup, not by any bug in the actual Flexo code, and I'm also confident that this issue has now been fixed. So what you're describing is probably a different issue.

I've created a separate issue from your description. If you find out anything more about this issue, such as a pattern where it always occurs in a specific situation, please let me know, because these sorts of things are difficult to troubleshoot.
