Logs --follow process times out after 4 minutes #106
Comments
The process that follows the training logs of an ongoing training job should not time out after 4 minutes. Instead, the log follow process should complete after the training job itself is finished. This behavior is necessary to enable chaining up commands to create machine learning pipelines, where subsequent commands require the output data of the training job whose logs are being "followed", as in our ART notebook. This commit reinstates the log follow behavior from before the merge of PR IBM#79.
@ckadner Apologies for the late response; I've been on vacation. The lines in question are part of a larger commit that changed how GetTrainedModelLogs works. I traced the original DLaaS commit, from yours truly, and I think my intention was to not rely on a long-term stream being held open, but to be able to re-open a new stream starting from where the old one left off if the connection terminates. But I think the
ctx, cancel := context.WithTimeout(context.Background(), time.Minute * 4)
defer cancel()
section needs to be moved inside the for loop, just above
query := trainedModelLogRequestToTrainerQuery(req, rindex, oldEndpointInternalPageSize)
So, before we merge #107, can you give this a try first, to see if it's a better solution?
@ckadner will you get a chance to try ^^^ today?
@sboagibm -- your suggested fix appears to work. I will update this PR accordingly.
* Remove 4 minute timeout for log follow process (#106)
The process that follows the training logs of an ongoing training job should not time out after 4 minutes. Instead, the log follow process should complete after the training job itself is finished. This behavior is necessary to enable chaining up commands to create machine learning pipelines, where subsequent commands require the output data of the training job whose logs are being "followed", as in our ART notebook. This commit reinstates the log follow behavior from before the merge of PR #79.
* Updates suggested by sboagibm
Intention was to not rely on a long-term stream being held open, but to be able to re-open a new stream starting from where the old one left off if the connection terminates.
Closed with #107
It used to be that the FfDL CLI command to follow the logs of an ongoing training job
$FFDL_CMD logs --follow ${MODEL_ID}
would tail the training logs until completion of the training job. The logs --follow process returned control only after the training job was complete. This was a useful feature when chaining up commands to create a semi-automated machine learning pipeline, where subsequent commands require the output data of the training job whose logs are being "followed". We have a small example of such a training pipeline in our ART notebook, which is currently broken.
That behavior changed with the merge of PR #79. Now the
$FFDL_CMD logs --follow ${MODEL_ID}
process terminates after 4 minutes, usually before the training job is completed, which causes the failure of subsequent processes that depend on training output data.
Code change causing the regression:
https://github.com/IBM/FfDL/pull/79/files?utf8=%E2%9C%93&diff=split&w=1#diff-7376976023aba3c29977b24e4794f938R1406
Command output showing the behavior:
The output below shows a prematurely aborted logs --follow process with an apparent 4-minute timeout. Notice the time stamps:
Wed Jun 27 19:19:34 UTC 2018: Running training job
-> date/time training starts
Wed Jun 27 12:23:35 PDT 2018
-> date/time just after the logs job returns (after 4 min)