This repository has been archived by the owner on Jan 29, 2022. It is now read-only.
Processors in Airflow are stuck in "Running" state #812
After more investigation, I confirmed that this issue doesn't happen with the
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on Apr 28, 2017:
There's a bug with the CeleryExecutor and DockerOperator where long-running tasks get stuck in a "running" state even after the Docker container has finished. I reported the bug at https://issues.apache.org/jira/browse/AIRFLOW-1131. Until that's fixed, we use the LocalExecutor as a workaround. Fixes opentrials/opentrials#812
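Switching executors is a one-line change in `airflow.cfg`. The sketch below is illustrative, not the actual opentrials-airflow configuration; in particular, the connection string is a made-up example:

```ini
[core]
# Run tasks in local subprocesses of the scheduler instead of Celery workers
executor = LocalExecutor
# LocalExecutor needs a database that allows concurrent access, e.g. Postgres
# (hypothetical connection string, shown for completeness)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres/airflow
```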
vitorbaptista added a second commit with the same message to opentrials/opentrials-airflow that referenced this issue on Apr 28, 2017.
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on Apr 28, 2017:
It's needed for the Scheduler to be able to run Docker containers (with the `LocalExecutor`, the scheduler runs the tasks itself). opentrials/opentrials#812
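A common way to give the scheduler container access to Docker is mounting the host's Docker socket into it. This is an assumption about what the commit did, not a quote from it; the service and image names below are hypothetical:

```yaml
scheduler:
  image: opentrials/opentrials-airflow  # hypothetical image name
  volumes:
    # Let the containerized scheduler talk to the host's Docker daemon
    - /var/run/docker.sock:/var/run/docker.sock
```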
Unfortunately, this is still happening in production.
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on May 17, 2017:
Airflow's DockerOperator has an issue with long-running tasks: it doesn't detect their exit status, keeping the tasks stuck in a "running" state even after they're finished (see https://issues.apache.org/jira/browse/AIRFLOW-1131). This commit implements DockerCLIOperator, which uses the Docker CLI executable instead of the Docker API that DockerOperator uses. I have tested it locally and it seems to work around the issues we've been having; however, we'll only be sure after testing in production. This commit also changes the EUCTR processor task to use the new DockerCLIOperator so we can try it. If it works, we'll change the helpers to use DockerCLIOperator instead of Airflow's DockerOperator. opentrials/opentrials#812
vitorbaptista added two more commits with the same message to opentrials/opentrials-airflow that referenced this issue on May 17, 2017.
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on May 19, 2017:
We need to set the server's Docker API version when using the client; otherwise we get an error like: "Error response from daemon: client is newer than server (client API version: 1.24, server API version: 1.23)". opentrials/opentrials#812
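The docker-py bindings (which DockerOperator uses) let you pin the API version either through the client constructor or through the `DOCKER_API_VERSION` environment variable. A minimal sketch, where the version number 1.23 simply mirrors the server version from the error message above:

```python
import os

# docker-py honours DOCKER_API_VERSION when building a client from the
# environment, so pinning it to the server's version avoids the
# "client is newer than server" error quoted above.
os.environ["DOCKER_API_VERSION"] = "1.23"

# Equivalent constructor form (requires the docker package; shown as a
# comment so this sketch stays self-contained):
#   import docker
#   client = docker.APIClient(base_url="unix://var/run/docker.sock",
#                             version="1.23")
```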
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on May 19, 2017:
The environment passed to `subprocess.Popen()` needs to have all values as strings, and all strings must be byte strings. We solved this by removing `None` values, converting everything else to a string, and encoding as UTF-8. The other issue was that we were passing env variables to Docker like `--env "FOO=$FOO"`, expecting that `$FOO` would take the contents of the `FOO` env variable; it didn't. Instead of running the command directly, we had to run it as `/bin/bash -c "command"`, so bash would expand the env variables. The last issue was that we weren't logging the STDERR output, which was fixed by simply piping it to STDOUT. opentrials/opentrials#812
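The cleanup described above can be sketched as a pair of small helpers. The names are illustrative, not the actual opentrials-airflow code, and the byte-string requirement is a Python 2 constraint (on Python 3, plain `str` values are fine):

```python
def sanitize_env(env):
    """Prepare an environment mapping for subprocess.Popen():
    drop None values, then coerce keys and values to UTF-8 byte strings."""
    return {
        str(key).encode("utf-8"): str(value).encode("utf-8")
        for key, value in env.items()
        if value is not None
    }


def wrap_in_bash(command):
    # Running through `/bin/bash -c` makes bash expand $FOO from the
    # environment; passing `--env "FOO=$FOO"` straight to docker would not.
    return ["/bin/bash", "-c", command]
```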
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on Jun 6, 2017:
We worked around the bug https://issues.apache.org/jira/browse/AIRFLOW-1131 by creating the new DockerCLIOperator, so we can return to using the CeleryExecutor. Reverts 99106c3. opentrials/opentrials#812
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on Jun 6, 2017:
The DockerOperator has a bug where it fails to notice when a long-running container has stopped, keeping the Airflow tasks in a "running" state (see https://issues.apache.org/jira/browse/AIRFLOW-1131). To work around this, I wrote the DockerCLIOperator, which, instead of using the `docker` Python bindings, runs Docker directly from a `subprocess`. After a couple of weeks of testing with EUCTR, this has been working fine, so now we're using this operator everywhere. opentrials/opentrials#812
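A rough sketch of the idea, assuming made-up function names rather than the actual DockerCLIOperator implementation:

```python
import subprocess


def build_docker_command(image, command, environment=None):
    # Build the `docker run` argv; env variables are passed explicitly.
    argv = ["docker", "run", "--rm"]
    for name, value in (environment or {}).items():
        argv += ["--env", "%s=%s" % (name, value)]
    return argv + [image] + command


def run_docker_command(argv):
    # Pipe STDERR into STDOUT so everything lands in the task log,
    # then surface a non-zero exit status as a task failure.
    process = subprocess.Popen(argv, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT)
    for line in process.stdout:
        print(line.decode("utf-8", errors="replace"), end="")
    if process.wait() != 0:
        raise RuntimeError("docker exited with status %d" % process.returncode)
```

Because the exit status comes straight from the `docker` CLI process, the operator no longer depends on the Docker API attach/wait behaviour that AIRFLOW-1131 describes.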
vitorbaptista added a commit to opentrials/opentrials-airflow that referenced this issue on Jul 21, 2017:
We also limit the memory usage to 350 MB. We were facing a few containers starving, so I'm hoping these changes will fix that. I haven't added resource limits to the "main" containers (worker, webserver, etc.) either, because I couldn't find out how to do that in the Docker Cloud stack file. opentrials/opentrials#812
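With a CLI-based operator, the limit is just another flag on the run command. The helper name below is illustrative, and `350m` mirrors the 350 MB mentioned above:

```python
def docker_run_with_memory_limit(image, command, memory="350m"):
    # --memory caps the container's RAM; Docker OOM-kills the container's
    # processes if they exceed the limit.
    return ["docker", "run", "--rm", "--memory=%s" % memory, image] + command
```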
This is an important issue that is currently breaking our entire data pipeline. For example, see https://airflow.opentrials.net/admin/airflow/log?task_id=processor_nct&dag_id=run_all_processors&execution_date=2017-04-12T10:51:51.861748. Notice that Airflow runs `bash -c "airflow run ..."` twice, with the latest one raising the error `Recorded pid 163 is not a descendant of the current pid 252`. I opened an issue about this at https://issues.apache.org/jira/browse/AIRFLOW-1131, including a small reproducible case. I have tested locally using the SequentialExecutor and couldn't reproduce the issue, but I'm not sure whether that's because the issue doesn't happen with the SequentialExecutor or because of some problem with how I ran the test.