
Jobs run with GKEOperator need get_logs=False, otherwise the job is likely to fail unless it constantly logs to standard out #844

@wlach

Description


I noticed this while working on adding the missioncontrol-etl job (#840), but apparently this happened with the probe scraper as well.

tl;dr: a job must print something to standard out / standard error every 30 seconds or so, or else it will fail with a mysterious IncompleteRead error:

https://issues.apache.org/jira/browse/AIRFLOW-3534
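To illustrate what "constantly logging" means in practice, here's a minimal heartbeat sketch a job could run; the interval and message are made up for illustration, not values taken from Airflow or Kubernetes:

```python
import threading
import time


def start_heartbeat(interval_seconds=25):
    """Print a keepalive line to stdout on a background thread.

    The 25-second interval is a guess chosen to stay under the ~30-second
    window mentioned above; it is not an official timeout value.
    """
    def beat():
        while True:
            print("heartbeat: still working...", flush=True)
            time.sleep(interval_seconds)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread


start_heartbeat()
# ... long-running, otherwise-silent work goes here ...
```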

I'm not sure if there's an easy or good workaround here. The function causing the problem is read_namespaced_pod_log, which (AFAICT) uses a persistently open HTTP connection to the Kubernetes API to read the log under the hood:

https://github.com/apache/airflow/blob/c890d066965aa9dbf3016f41cfae45e9a084478a/airflow/kubernetes/pod_launcher.py#L173
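For reference, this is roughly the shape of that call as made through the kubernetes Python client (a sketch only; the pod name, namespace, and container here are placeholders, not what pod_launcher.py actually passes):

```python
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside a cluster
v1 = client.CoreV1Api()

# follow=True with _preload_content=False returns a streaming response backed by
# a single long-lived HTTP connection; if the pod stays silent long enough, that
# connection times out and the read fails partway through (IncompleteRead).
logs = v1.read_namespaced_pod_log(
    name="my-pod",          # placeholder pod name
    namespace="default",    # placeholder namespace
    container="base",       # placeholder container name
    follow=True,
    tail_lines=10,
    _preload_content=False,
)
for line in logs:
    print(line.decode("utf-8"), end="")
```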

I did some spelunking in the kubernetes python repository and issue tracker, and to be honest it doesn't seem like this type of use case is really taken into account by the API. There is no way to pick up the logs again in the event of a timeout or similar; see, for example, this issue comment:

kubernetes-client/python#199 (comment)

The workaround is just to not fetch the logs (get_logs=False) and rely on Stackdriver logging instead. This is pretty non-ideal: it significantly increases the amount of filtering/spelunking you'd need to do when something goes wrong. Filing this issue for internal visibility, as it's a pretty serious gotcha.
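Concretely, the workaround looks something like this in a DAG (a rough sketch: the import path assumes Airflow 1.10's contrib layout, and the DAG settings, project, cluster, and image values are placeholders rather than the actual missioncontrol-etl configuration):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

dag = DAG(
    "missioncontrol_etl",                  # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

missioncontrol_etl = GKEPodOperator(
    task_id="missioncontrol_etl",          # placeholder task id
    name="missioncontrol-etl",
    project_id="my-gcp-project",           # placeholder project
    location="us-central1-a",              # placeholder zone
    cluster_name="my-gke-cluster",         # placeholder cluster
    namespace="default",
    image="gcr.io/my-project/missioncontrol-etl:latest",  # placeholder image
    # Don't stream pod logs over the long-lived connection; read them in
    # Stackdriver instead.
    get_logs=False,
    dag=dag,
)
```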
