
Jobs run with GKEOperator need get_logs=False, otherwise job is likely to fail unless constantly logging to standard out #844

Closed
wlach opened this issue Jan 22, 2020 · 2 comments

Comments

@wlach
Contributor

wlach commented Jan 22, 2020

I noticed this while working on adding the missioncontrol-etl job (#840), but apparently this happened with the probe scraper as well.

tl;dr: a job must print something to standard out / error every 30 seconds or so, or else it will fail with a mysterious IncompleteRead error:

https://issues.apache.org/jira/browse/AIRFLOW-3534

I'm not sure if there's an easy / good workaround here. The function that's causing the problem is read_namespaced_pod_log, which (AFAICT) uses a persistently open HTTP connection to Kubernetes to stream the log under the hood:

https://github.com/apache/airflow/blob/c890d066965aa9dbf3016f41cfae45e9a084478a/airflow/kubernetes/pod_launcher.py#L173
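For reference, here's a minimal sketch of that kind of streaming call using the kubernetes Python client directly (this is an illustration, not Airflow's actual code; the pod name and namespace are placeholders):

```python
# Sketch: streaming pod logs via the kubernetes Python client, similar in
# spirit to what pod_launcher does. "my-pod" / "default" are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# follow=True + _preload_content=False returns a raw urllib3 response whose
# underlying HTTP connection stays open while the pod keeps logging. If the
# pod is silent for too long, the connection can time out and the read fails
# with IncompleteRead -- there's no built-in way to resume from where it left off.
resp = v1.read_namespaced_pod_log(
    name="my-pod",
    namespace="default",
    follow=True,
    _preload_content=False,
)
for line in resp:
    print(line.decode("utf-8"), end="")
```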

I did some spelunking in the Kubernetes Python client repository + issue tracker, and to be honest it doesn't seem like this type of use case is really taken into account by the API. There is no way to pick up the logs again in the event of a timeout or similar, see for example this issue comment:

kubernetes-client/python#199 (comment)

The workaround is just to not fetch the logs (get_logs=False) and rely on Stackdriver logging instead. This is pretty non-ideal: it significantly increases the amount of filtering/spelunking you need to do when something goes wrong. Filing this issue for internal visibility, as it's a pretty serious gotcha.
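For the record, a minimal sketch of what the workaround looks like in a DAG (import path and arguments assume Airflow 1.10's contrib operators; the project/cluster/image values are placeholders):

```python
# Sketch of the workaround: disable log fetching on the operator and rely on
# Stackdriver instead. Project, cluster, and image values are placeholders,
# and `dag` is assumed to be defined elsewhere in the file.
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

missioncontrol_etl = GKEPodOperator(
    task_id="missioncontrol_etl",
    name="missioncontrol-etl",
    project_id="my-gcp-project",
    location="us-central1-a",
    cluster_name="my-gke-cluster",
    namespace="default",
    image="gcr.io/my-gcp-project/missioncontrol-etl:latest",
    # The important part: don't stream pod logs back through the Kubernetes
    # API, so a quiet job can't be killed by the IncompleteRead failure.
    get_logs=False,
    dag=dag,
)
```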

@wlach
Contributor Author

wlach commented Jan 24, 2020

Thought I had solved it with this, but it doesn't actually fix the issue: wlach/airflow@e7ae01a

Will make further comments on my investigation in their issue tracker, starting with: https://issues.apache.org/jira/browse/AIRFLOW-3534?focusedCommentId=17023334&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17023334

@brihati

brihati commented Apr 10, 2020

Can you check in your logs whether your task is marked as a zombie? If it is, increase scheduler_zombie_task_threshold from the default of 5 minutes to something ~n minutes. When no logs are printed, it seems the worker doesn't send a heartbeat to the DB, and the scheduler marks the task as failed after scheduler_zombie_task_threshold minutes.
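If the zombie threshold is the culprit, bumping it is just a config change; a sketch of what that could look like (the value is in seconds, and 600 is only an example):

```
# airflow.cfg -- sketch only; default is 300 (5 minutes), 600 is an example.
# Can also be set via the env var AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD.
[scheduler]
scheduler_zombie_task_threshold = 600
```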
