
Jobs run with GKEOperator need get_logs=False, otherwise job is likely to fail unless constantly logging to standard out #844

Closed
wlach opened this issue Jan 22, 2020 · 2 comments

Comments

@wlach
Contributor

wlach commented Jan 22, 2020

I noticed this while working on adding the missioncontrol-etl job (#840), but apparently this happened with the probe scraper as well.

tl;dr: a job must print something to standard out / error every 30 seconds or so, or else it will fail with a mysterious IncompleteRead error:

https://issues.apache.org/jira/browse/AIRFLOW-3534

I'm not sure if there's an easy / good workaround here. The function that's causing the problem is read_namespaced_pod_log, which (AFAICT) uses a persistently open HTTP connection to Kubernetes to stream the log under the hood:

https://github.com/apache/airflow/blob/c890d066965aa9dbf3016f41cfae45e9a084478a/airflow/kubernetes/pod_launcher.py#L173
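For reference, here's a minimal sketch of that kind of streaming call using the kubernetes Python client directly (this is an illustration, not Airflow's actual code; the pod name and namespace are placeholders):

```python
# Sketch: streaming pod logs via the kubernetes Python client, similar in
# spirit to what pod_launcher does. "my-pod" / "default" are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# follow=True + _preload_content=False returns a raw urllib3 response whose
# underlying HTTP connection stays open while the pod keeps logging. If the
# pod is silent for too long, the connection can time out and the read fails
# with IncompleteRead -- there's no built-in way to resume from where it left off.
resp = v1.read_namespaced_pod_log(
    name="my-pod",
    namespace="default",
    follow=True,
    _preload_content=False,
)
for line in resp:
    print(line.decode("utf-8"), end="")
```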

I did some spelunking in the Kubernetes Python client repository + issue tracker, and to be honest it doesn't seem like this type of use case is really taken into account by the API. There is no way to pick up the logs again in the event of a timeout or similar, see for example this issue comment:

kubernetes-client/python#199 (comment)

The workaround is just to not fetch the logs (get_logs=False) and rely on Stackdriver logging instead. This is pretty non-ideal: it significantly increases the amount of filtering/spelunking you need to do when something goes wrong. Filing this issue for internal visibility, as it's a pretty serious gotcha.
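For the record, a minimal sketch of what the workaround looks like in a DAG (import path and arguments assume Airflow 1.10's contrib operators; the project/cluster/image values are placeholders):

```python
# Sketch of the workaround: disable log fetching on the operator and rely on
# Stackdriver instead. Project, cluster, and image values are placeholders,
# and `dag` is assumed to be defined elsewhere in the file.
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

missioncontrol_etl = GKEPodOperator(
    task_id="missioncontrol_etl",
    name="missioncontrol-etl",
    project_id="my-gcp-project",
    location="us-central1-a",
    cluster_name="my-gke-cluster",
    namespace="default",
    image="gcr.io/my-gcp-project/missioncontrol-etl:latest",
    # The important part: don't stream pod logs back through the Kubernetes
    # API, so a quiet job can't be killed by the IncompleteRead failure.
    get_logs=False,
    dag=dag,
)
```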

@wlach
Contributor Author

wlach commented Jan 24, 2020

Thought I had solved it with this, but it doesn't actually fix the issue: wlach/airflow@e7ae01a

Will make further comments on my investigation in their issue tracker, starting with: https://issues.apache.org/jira/browse/AIRFLOW-3534?focusedCommentId=17023334&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17023334

@brihati

brihati commented Apr 10, 2020

Can you check in your logs whether your task is marked as a zombie? If it is, increase scheduler_zombie_task_threshold from the default of 5 minutes to something ~n minutes. When no logs are printed, it seems the worker doesn't send a heartbeat to the DB, and the scheduler marks the task as failed after scheduler_zombie_task_threshold minutes.
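If the zombie threshold is the culprit, bumping it is just a config change; a sketch of what that could look like (the value is in seconds, and 600 is only an example):

```
# airflow.cfg -- sketch only; default is 300 (5 minutes), 600 is an example.
# Can also be set via the env var AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD.
[scheduler]
scheduler_zombie_task_threshold = 600
```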
