Logs --follow process times out after 4 minutes #106

Closed
ckadner opened this issue Jun 28, 2018 · 4 comments


ckadner commented Jun 28, 2018

The FfDL CLI command for following the logs of an ongoing training job, $FFDL_CMD logs --follow ${MODEL_ID}, used to tail the training logs until the training job completed. The logs --follow process returned control only after the training job was complete. This was useful for chaining commands into a semi-automated machine learning pipeline, where subsequent commands require the output data of the training job whose logs are being "followed". We have a small example of such a training pipeline in our ART notebook, which is currently broken.

That behavior changed with the merge of PR #79. Now the $FFDL_CMD logs --follow ${MODEL_ID} process terminates after 4 minutes, usually before the training job has completed, which causes subsequent processes that depend on the training output data to fail.

Code change causing the regression:

https://github.com/IBM/FfDL/pull/79/files?utf8=%E2%9C%93&diff=split&w=1#diff-7376976023aba3c29977b24e4794f938R1406

-	var ctx context.Context
-	var cancel context.CancelFunc
-	logr.Debugf("follow is %t", req.Follow)
-	if req.Follow {
-		ctx, cancel = context.WithTimeout(context.Background(), 10*(time.Hour*24))
-	} else {
-		ctx, cancel = context.WithTimeout(context.Background(), 5*time.Second)
-	}
+	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*4)
 	defer cancel()

Command output showing the behavior:

Showing prematurely aborted logs --follow process with apparent 4 min timeout.

$FFDL_CMD train manifest.yml model.zip
Deploying model with manifest 'manifest.yml' and model file 'model.zip'...
Model ID: training-uLQ7ZMDmR
OK
$FFDL_CMD logs --follow training-uLQ7ZMDmR  &&  date
Getting model training logs for 'training-uLQ7ZMDmR'...
Training with training/test data at:
  DATA_DIR: /mnt/data/training-data-bbe28e19-4fba-4e29-af5f-564f0e0d3f53
  MODEL_DIR: /job/model-code
  TRAINING_JOB: 
  TRAINING_COMMAND: pip3 install keras; python3 convolutional_keras.py --data ${DATA_DIR}/mnist.npz
...
Wed Jun 27 19:19:34 UTC 2018: Running training job
...
Train on 54000 samples, validate on 6000 samples
Epoch 1/1
  128/54000 [..............................] - ETA: 5:21 - loss: 2.2977 - acc: 0.1562
  256/54000 [..............................] - ETA: 4:37 - loss: 2.2591 - acc: 0.1758
...
45184/54000 [========================>.....] - ETA: 39s - loss: 0.3311 - acc: 0.8972
45312/54000 [========================>.....] - ETA: 39s - loss: 0.3305 - acc: 0.8974
45440/54000 [========================>.....] - ETA: 38s - loss: 0.3299 - acc: 0.8976

Wed Jun 27 12:23:35 PDT 2018

Notice the time stamps (12:23:35 PDT is 19:23:35 UTC):
Wed Jun 27 19:19:34 UTC 2018: Running training job -> date/time training starts
Wed Jun 27 12:23:35 PDT 2018 -> date/time just after the logs job returns (after 4 min)

ckadner added a commit to ckadner/FfDL that referenced this issue Jun 28, 2018
The process that follows the training logs of an ongoing training job
should not time out after 4 minutes. Instead, the log follow process
should complete after the training job itself is finished.

This behavior is necessary to enable chaining up commands to create
machine learning pipelines, where subsequent commands require the output
data of the training job whose logs are being "followed" like in our
ART notebook.

This commit reinstates the log follow behavior prior to the merge of PR IBM#79

sboagibm commented Jul 2, 2018

@ckadner Apologies for the late response, as I've been on vacation.

The lines in question are part of a larger commit that changed how GetTrainedModelLogs works. I traced the original DLaaS commit, from yours truly, and I think my intention was to not rely on a long-term stream being held open, but to be able to re-open a new stream starting from where the old one left off if the connection terminates. But I think the

		ctx, cancel := context.WithTimeout(context.Background(), time.Minute * 4)
		defer cancel()

section needs to be moved inside the for loop, just above

query := trainedModelLogRequestToTrainerQuery(req, rindex, oldEndpointInternalPageSize)

So, before we merge #107, can you give this a try first, to see if it's a better solution?


sboagibm commented Jul 3, 2018

@ckadner will you get a chance to try ^^^ today?


ckadner commented Jul 4, 2018

@sboagibm -- your suggested fix appears to work. I will update this PR accordingly.

sboagibm pushed a commit that referenced this issue Jul 5, 2018
* Remove 4 minute timeout for log follow process (#106)

The process that follows the training logs of an ongoing training job
should not time out after 4 minutes. Instead, the log follow process
should complete after the training job itself is finished.

This behavior is necessary to enable chaining up commands to create
machine learning pipelines, where subsequent commands require the output
data of the training job whose logs are being "followed" like in our
ART notebook.

This commit reinstates the log follow behavior prior to the merge of PR #79

* Updates suggested by sboagibm

Intention was to not rely on a long term stream being held open, 
but to be able to re-open a new stream starting from where the 
old left off, if the connection terminates.

Tomcli commented Jul 5, 2018

Closed with #107

Tomcli closed this as completed Jul 5, 2018