[PROPOSAL] Fix Flaky Github Actions That Use OpenSearch docker container #62

Open
nhtruong opened this issue Oct 6, 2022 · 3 comments

@nhtruong

nhtruong commented Oct 6, 2022

What is the problem you are trying to solve?

There is a bug from opensearch-build that causes the OpenSearch container to occasionally fail as soon as it's booted up. This has caused integration tests on the JavaScript client to fail intermittently, and I've been informed that other repos' workflows are also facing this issue.

Even though the chance of an OpenSearch container crashing is only 1 out of 50 (per my benchmarks running thousands of such jobs), the chance of this bug failing a workflow is quite high when you run compatibility tests that stand up a few dozen OpenSearch instances. This has caused every other Push/Pull-Request to fail the Action check, which then requires an admin to rerun the failed jobs. This is not a good experience for either the contributors or the admins.
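
To put rough numbers on this: if each container fails independently with probability p, a workflow that starts n containers sees at least one failure with probability 1 - (1 - p)^n. The sketch below runs that arithmetic for p = 1/50. The instance counts are illustrative only, the independence between containers is an assumption, and workflow_failure_pct is a hypothetical helper name.

```shell
# Probability (in percent) that at least one of n containers hits the bug,
# assuming each one fails independently with probability p (an assumption).
workflow_failure_pct() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (1 - (1 - p) ^ n) * 100 }'
}

# Illustrative instance counts; "a few dozen" is not pinned down above.
for n in 12 24 36 48; do
  echo "$n instances: $(workflow_failure_pct 0.02 "$n")% chance of at least one crash"
done
```

At around 36 instances the result is roughly one half, which is in line with every other Push/Pull-Request failing the check.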

What else have you found out about this problem?

  • I've only seen this happen on workflows that use docker-compose up. Workflows that run gradlew and docker run do not seem to be affected. [Verification needed]
  • This issue only happens AFTER the container has started successfully, so you won't see any error from Docker during docker-compose up.
  • This bug can either crash the container or terminate the OpenSearch service within the container. In the latter case, the container will still appear to be running just fine.
  • When this happens, you will see the following messages in the container's logs:
    Killing opensearch process 10
    Killing performance analyzer process 11
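
Since the container can look healthy from the outside while the OpenSearch process inside it is gone, it helps to check the container state and the logs together. A minimal sketch, assuming the kill message shows up in the last 10 log lines; classify_opensearch_log and report_container are hypothetical helper names, and the container name is whatever your setup uses:

```shell
# Classify a container's log (read from stdin).
# Prints "killed" if the message from this issue is present, "ok" otherwise.
classify_opensearch_log() {
  if grep -q "Killing opensearch process"; then
    echo "killed"
  else
    echo "ok"
  fi
}

# Report the container state and the log verdict side by side, because the
# container can keep running after the OpenSearch process has been killed.
# Argument: the container name (for example opensearch_opensearch_1).
report_container() {
  name="$1"
  running=$(docker inspect --format '{{.State.Running}}' "$name")
  verdict=$(docker logs --tail 10 "$name" 2>&1 | classify_opensearch_log)
  echo "container=$name running=$running log=$verdict"
}
```

For example, report_container opensearch_opensearch_1 printing running=true together with log=killed would correspond to the second failure mode described above.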

What are you proposing?

Simply restarting the container will bring the OpenSearch service back to life. So, there are a couple of workarounds that we can apply to these flaky workflows:

1. Grep for the Killing message:

Run the following script after the container has been stood up (you can add this to the makefile after docker-compose up):

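# Check up to 3 times, 30s apart. Uses a plain word list because {1..3} only expands in bash, and make runs recipes with /bin/sh by default.
for i in 1 2 3; do \
	sleep 30; \
	if docker logs --tail 10 opensearch_opensearch_1 2>&1 | grep -q "Killing opensearch process"; then \
		echo "Restarting OpenSearch Container..."; \
		docker restart opensearch_opensearch_1; \
	else break; fi; \
done;
sleep 30;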

This is a quick and dirty workaround. You can copy-paste this script into your workflow step or makefile (after replacing opensearch_opensearch_1 with your container's name, of course), and it will just work.

2. Autoheal + Auto Restart:

  • When this bug crashes the container, we can use Docker's auto restart feature to bring it back up. Add restart: always to the service definition in your docker-compose file:

    services:
      opensearch:
        restart: always
  • When this bug crashes the OpenSearch service but leaves the container running, we can use a combination of Healthcheck and Autoheal to restart the container when it's unhealthy:

    • Add Healthcheck to Dockerfile. For example:

      HEALTHCHECK --start-period=20s --interval=5s --retries=2 --timeout=1s \
        CMD if [ "$SECURE_INTEGRATION" != "true" ]; \
          then curl --fail localhost:9200/_cat/health; \
          else curl --fail -k https://localhost:9200/_cat/health -u admin:admin; fi

      Note that the HEALTHCHECK CMD is specific to your OpenSearch instance, so you will have to work it out for your own workflow. The container also needs some time before it is ready for the first health check, so I'd recommend a --start-period of at least 20 seconds. If you're using any env vars in your CMD, remember to define them in services.<service_name>.environment, instead of services.<service_name>.build.args, in the docker-compose file, because the command runs inside the container at runtime, NOT during docker build.

    • Add Autoheal as a service running alongside the OpenSearch container. For example, to define Autoheal in docker-compose.yml:

      services:
        opensearch:
          ...
        autoheal:
          restart: always
          image: willfarrell/autoheal
          environment:
            - AUTOHEAL_CONTAINER_LABEL=all
            - AUTOHEAL_START_PERIOD=30
            - AUTOHEAL_INTERVAL=5
            - AUTOHEAL_DEFAULT_STOP_TIMEOUT=30
          volumes:
            - /var/run/docker.sock:/var/run/docker.sock

      In this example, I use AUTOHEAL_CONTAINER_LABEL=all, which means Autoheal will try to restart all unhealthy containers instead of only those with services.<service_name>.labels.autoheal=true. I opted not to use the label feature of Autoheal because I couldn't make it work consistently on my Ubuntu workflows (it works fine on my local Mac env). Also note that AUTOHEAL_START_PERIOD and AUTOHEAL_DEFAULT_STOP_TIMEOUT should be greater than the time it takes for the first possible unhealthy status to be reported.

    • Add ample sleep time after the OpenSearch and Autoheal containers are stood up, so that Autoheal has the chance to restart the OpenSearch container at least once. This avoids a race condition between these 2 containers and your tests. For this example, I'd use sleep 60;

    This workaround is more involved, but it has the added benefit of also solving other kinds of intermittent failures, not just the one caused by said bug.
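
    Once the fixed sleep has passed, you can also confirm what the healthcheck is actually reporting before the tests start. The sketch below is an optional extra check and not a replacement for the sleep, since the crash happens after a successful start. health_status and wait_until_healthy are hypothetical helper names, and the container name, attempt count, and delay are placeholders:

```shell
# Print the health status Docker reports for a container:
# "starting", "healthy", or "unhealthy". Requires a HEALTHCHECK to be defined.
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Poll until the container reports "healthy" or the attempts run out.
# Arguments: container name, max attempts, seconds between attempts.
# Returns 0 once healthy, 1 if it never became healthy.
wait_until_healthy() {
  name="$1"
  attempts="$2"
  delay="$3"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(health_status "$name")" = "healthy" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

    For example, running sleep 60 and then wait_until_healthy opensearch_opensearch_1 12 5 gives the container up to one more minute to report healthy, and fails the step if it never does.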

I benchmarked both solutions on over 700 jobs each, and they all passed.

@dblock
Member

dblock commented Oct 6, 2022

@nhtruong I really think we're wasting our time retrying and restarting the containers; we should fix the root cause. Want to try writing a matrix job that runs enough containers in a loop/parallel to reproduce this semi-consistently, and collect logs from the OpenSearch instance that doesn't start? There's an error in there, I'm almost sure.

@nhtruong
Author

nhtruong commented Oct 6, 2022

@dblock For sure. Lemme look for ways to grab better logs than the default container logs, which only show

  Killing opensearch process 10
  Killing performance analyzer process 11

@dblock
Member

dblock commented Oct 11, 2022

@nhtruong So we like opensearch-project/opensearch-js#304? Let's document how to do that everywhere else? Can we reuse some of those GH workflows? Do we need a doc on integration testing?
