Flaky test cases in docker tests #27
The problem seems to be that, while our docker containers are running, certain actions are sometimes executed only after a delay. For example, if NGINX is running on localhost (even without docker), the following shows that connection establishment is sometimes delayed while our docker-compose tests are running:

while true; do echo -n '' | nc -N 127.0.0.1 80; date +%s%N; sleep 0.1; done | ./analyze_nanos.py
It seems to be unrelated to sockets: the following command, which does not open any socket, also showed delays while the docker tests were running:

while true; do echo -n '' | date +%s%N; sleep 0.1; done | ./analyze_nanos.py
Another example:

while true; do date +%s%N; sleep 0.1; done | ./analyze_nanos.py
And one more:

while true; do strace --absolute-timestamps=format:unix,precision:ns /bin/true 2>&1; done | ./analyze_nanos.py
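The analyze_nanos.py script itself is not reproduced in this thread. Presumably it reads one nanosecond-resolution timestamp per line and reports unusually large gaps between consecutive lines; a rough shell equivalent of that assumed behavior, with a made-up threshold:

```shell
# Hypothetical stand-in for analyze_nanos.py (the real script is not shown
# in the thread): read one ns-resolution timestamp per line and report any
# gap between consecutive timestamps that exceeds a threshold (0.5 s here).
analyze_nanos() {
  awk -v thresh_ns=500000000 '
    NR > 1 && $1 - prev > thresh_ns {
      printf "Threshold exceeded: %.9f\n", ($1 - prev) / 1e9
    }
    { prev = $1 }'
}

# Usage, mirroring the loops above:
#   while true; do date +%s%N; sleep 0.1; done | analyze_nanos
```

With a detector like this, an iteration that should take ~0.1 s but takes over half a second shows up as a single reported gap.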
Fixes #27. It seems that the issue was caused by an overloaded file system: copying large amounts of data inside the docker containers caused spikes in IO usage, which in turn caused freezes, which then caused test cases to fail because we're using timeouts in Flexo. Storing the packages in a tmpfs means that the files are kept in RAM, not on disk. This is more of a workaround, and it has the added disadvantage that running the test cases with docker now requires substantial amounts of RAM: the test cases will most likely not run successfully with less than 32 GB.
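The fix keeps the test packages on a tmpfs. A compose fragment illustrating the idea (the service name and mount path are assumptions, not taken from the actual docker-compose.yml):

```yaml
# Hypothetical fragment: mount the package directory as a tmpfs so that
# copying packages never touches the disk (service name and path are guesses).
services:
  flexo-server:
    tmpfs:
      - /tmp/flexo-packages
```

Since tmpfs contents live in RAM, this trades disk IO pressure for memory usage, which is why the RAM requirement above is so high.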
For what it's worth, I've been experiencing freezes in the service lately (I haven't run the test suite). I've not been able to get any response over HTTP, and the log shows that nothing happens. Even restarting flexo doesn't help immediately. Then, after maybe 1 s to 2 min, it just starts responding again as if nothing had happened. I haven't looked into it, so it could well be something local, but I thought I'd mention it just in case.
I'm pretty confident that the freezes described in this issue were caused by the docker setup rather than by any bug in the actual Flexo code, and I'm also confident that this issue has now been fixed. So what you're describing is probably a different issue; I've created a separate issue from your description. If you learn anything more about it, such as a pattern where it always occurs in a specific situation, please let me know, because these sorts of things are difficult to troubleshoot.
The problem seems to be that, for some reason, all services can suddenly stall completely during our integration tests: they are entirely frozen and neither respond to HTTP requests nor execute any code at all. It only takes a few hundred milliseconds to a few seconds until they "unfreeze", but this still causes failing test cases.
The problem has been analyzed on the timeouts-wip branch; this is what is known so far:

The problem is related to Docker, not to the Rust program. To verify this, we have added an NGINX service to our docker-compose.yml and written a bash script, test_stalling.sh: we can start our docker services with ./docker-compose and then run the test_stalling.sh script. It may take more than one attempt to reproduce the issue, but the test_stalling script should eventually output something like:

>>> Threshold exceeded: 1.625723038
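The test_stalling.sh script itself lives on the timeouts-wip branch and is not reproduced here. A minimal sketch of the idea, assuming it simply times a trivial probe on each iteration and reports slow ones (the threshold and the probe command are guesses):

```shell
# exceeds_threshold THRESHOLD CMD...: run CMD and report the elapsed wall
# time if it exceeds THRESHOLD seconds. Hypothetical sketch of the logic
# behind test_stalling.sh; the real script is on the timeouts-wip branch.
exceeds_threshold() {
  threshold=$1; shift
  start=$(date +%s.%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" -v t="$threshold" \
    'BEGIN { d = e - s; if (d > t) printf ">>> Threshold exceeded: %.9f\n", d }'
}

# The stalling test would then loop over a cheap probe, e.g.:
#   while true; do
#     exceeds_threshold 1.0 sh -c "echo -n '' | nc -N 127.0.0.1 80"
#     sleep 0.1
#   done
```

Any iteration in which the probe freezes for longer than the threshold produces a line matching the output shown above.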
The problem is not restricted to a single docker service: it seems that all docker services stall at once. If we run two such probes in two parallel bash sessions, we will see the error occur in both bash sessions at the same time.
It's not just the docker services started from this particular compose file that are stalling. If we start a third NGINX server outside of the docker-compose file, we can verify that this NGINX instance stalls at the exact same moment as the other NGINX instances.
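The exact command used to start the third NGINX instance is not preserved in the thread. One plausible way to start such an instance outside the compose file (the image and host port here are assumptions):

```shell
# Hypothetical: run a third NGINX container that is not managed by the
# compose file, on a host port the compose services do not use.
docker run --rm -d --name stall-probe -p 8090:80 nginx
```

Probing this independent instance alongside the compose-managed ones is what shows that the stalls affect all containers simultaneously, not just those from one compose project.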