Can you check in your logs whether your task is marked as a zombie? If it is, increase scheduler_zombie_task_threshold from the default of 5 minutes to something around ~n minutes. When logs are not printed, it seems the worker doesn't send a heartbeat to the DB, and the scheduler marks the task as failed after scheduler_zombie_task_threshold minutes.
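If zombie detection is indeed the cause, the setting lives in the [scheduler] section of airflow.cfg. A sketch of the change (the value shown is illustrative; pick one longer than your tasks' quiet periods):

```ini
[scheduler]
# Default is 300 seconds (5 minutes); raise it so tasks that are
# healthy but silent are not reaped as zombies.
scheduler_zombie_task_threshold = 1800
```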
I noticed this while working on adding the missioncontrol-etl job (#840), but apparently this happened with the probe scraper as well.
tl;dr: a job must print something to standard output or standard error every 30 seconds or so, or else it will fail with a mysterious IncompleteRead error: https://issues.apache.org/jira/browse/AIRFLOW-3534
I'm not sure there's an easy or good workaround here. The function causing the problem is read_namespaced_pod_log, which (AFAICT) holds a single persistently open HTTP connection to Kubernetes to read the log under the hood:
https://github.com/apache/airflow/blob/c890d066965aa9dbf3016f41cfae45e9a084478a/airflow/kubernetes/pod_launcher.py#L173
I did some spelunking in the kubernetes python repository and issue tracker, and to be honest this use case doesn't really seem to be taken into account by the API. There is no way to pick up the logs again in the event of a timeout or similar; see for example this issue comment:
kubernetes-client/python#199 (comment)
The workaround is just to skip fetching the logs and rely on Stackdriver logging instead. This is pretty non-ideal: it significantly increases the amount of filtering/spelunking you'd need to do when something goes wrong. Filing this issue for internal visibility, as it's a pretty serious gotcha.