Logs --follow process times out after 4 minutes #106

ckadner · 2018-06-28T09:50:36Z

Previously, the FfDL CLI command to follow the logs of an ongoing training job, $FFDL_CMD logs --follow ${MODEL_ID}, would tail the training logs until the training job completed. The logs --follow process returned control only after the training job was complete. This was useful when chaining commands into a semi-automated machine learning pipeline, where subsequent commands require the output data of the training job whose logs are being "followed". We have a small example of such a training pipeline in our ART notebook, which is currently broken.

That behavior changed with the merge of PR #79. Now the $FFDL_CMD logs --follow ${MODEL_ID} process terminates after 4 minutes, usually before the training job is completed, which causes subsequent processes that depend on the training output data to fail.

Code change causing the regression:

https://github.com/IBM/FfDL/pull/79/files?utf8=%E2%9C%93&diff=split&w=1#diff-7376976023aba3c29977b24e4794f938R1406

-	var ctx context.Context
-	var cancel context.CancelFunc
-	logr.Debugf("follow is %t", req.Follow)
-	if req.Follow {
-		ctx, cancel = context.WithTimeout(context.Background(), 10*(time.Hour*24))
-	} else {
-		ctx, cancel = context.WithTimeout(context.Background(), 5*time.Second)
-	}
+	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*4)
 	defer cancel()
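
To make the effect of that one-line change concrete, here is a minimal, scaled-down sketch (milliseconds instead of minutes). It is not the FfDL code: followLogs and followWithDeadline are hypothetical stand-ins, and the "job" is simulated with a ticker. It only illustrates that a single context.WithTimeout around the whole follow operation ends the follow when the deadline passes, whether or not the job is done.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// followLogs is a hypothetical stand-in for the log-follow loop. It delivers
// one simulated log line per tick until the job has produced totalLines lines
// or ctx expires. It returns the number of lines delivered and whether the
// job was followed to completion.
func followLogs(ctx context.Context, totalLines int, tick time.Duration) (int, bool) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	delivered := 0
	for delivered < totalLines {
		select {
		case <-ctx.Done():
			return delivered, false // deadline hit before the job finished
		case <-ticker.C:
			delivered++
		}
	}
	return delivered, true
}

// followWithDeadline wraps the whole follow operation in one deadline,
// mirroring the shape of the changed line in the diff (assumed simplification).
func followWithDeadline(timeoutMs, totalLines, tickMs int) (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return followLogs(ctx, totalLines, time.Duration(tickMs)*time.Millisecond)
}

func main() {
	// A job that needs about 200ms, followed with a 50ms deadline: cut short.
	n, done := followWithDeadline(50, 20, 10)
	fmt.Printf("short deadline: delivered=%d finished=%t\n", n, done)

	// The same job with a deadline longer than the job: followed to the end.
	n, done = followWithDeadline(2000, 20, 10)
	fmt.Printf("long deadline: delivered=%d finished=%t\n", n, done)
}
```

In the first call the follow returns early with finished=false, which corresponds to the CLI returning control while training is still running.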

Command output showing the behavior:

Showing prematurely aborted logs --follow process with apparent 4 min timeout.

$FFDL_CMD train manifest.yml model.zip

Deploying model with manifest 'manifest.yml' and model file 'model.zip'...
Model ID: training-uLQ7ZMDmR
OK

$FFDL_CMD logs --follow training-uLQ7ZMDmR  &&  date

Getting model training logs for 'training-uLQ7ZMDmR'...
Training with training/test data at:
  DATA_DIR: /mnt/data/training-data-bbe28e19-4fba-4e29-af5f-564f0e0d3f53
  MODEL_DIR: /job/model-code
  TRAINING_JOB: 
  TRAINING_COMMAND: pip3 install keras; python3 convolutional_keras.py --data ${DATA_DIR}/mnist.npz
...
Wed Jun 27 19:19:34 UTC 2018: Running training job
...
Train on 54000 samples, validate on 6000 samples
Epoch 1/1
  128/54000 [..............................] - ETA: 5:21 - loss: 2.2977 - acc: 0.1562
  256/54000 [..............................] - ETA: 4:37 - loss: 2.2591 - acc: 0.1758
...
45184/54000 [========================>.....] - ETA: 39s - loss: 0.3311 - acc: 0.8972
45312/54000 [========================>.....] - ETA: 39s - loss: 0.3305 - acc: 0.8974
45440/54000 [========================>.....] - ETA: 38s - loss: 0.3299 - acc: 0.8976

Wed Jun 27 12:23:35 PDT 2018

Notice the time stamps:
Wed Jun 27 19:19:34 UTC 2018: Running training job -> date/time training starts
Wed Jun 27 12:23:35 PDT 2018 -> date/time just after the logs job returns (19:23:35 UTC, about 4 min after training started)


The process that follows the training logs of an ongoing training job should not time out after 4 minutes. Instead, the log follow process should complete after the training job itself is finished. This behavior is necessary to enable chaining commands into machine learning pipelines, where subsequent commands require the output data of the training job whose logs are being "followed", as in our ART notebook. This commit reinstates the log follow behavior from before the merge of PR IBM#79.

sboagibm · 2018-07-02T15:34:14Z

@ckadner Apologies for the late response, as I've been on vacation.

The lines in question are part of a larger commit that changed how GetTrainedModelLogs works. I traced the original DLaaS commit, from yours truly, and I think my intention was to not rely on a long-term stream being held open, but to be able to re-open a new stream starting from where the old one left off if the connection terminates. But I think the

		ctx, cancel := context.WithTimeout(context.Background(), time.Minute * 4)
		defer cancel()

section needs to be moved inside the for loop, just above

query := trainedModelLogRequestToTrainerQuery(req, rindex, oldEndpointInternalPageSize)

So, before we merge #107, can you give this a try first, to see if it's a better solution?
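
A rough sketch of the idea described above, under stated assumptions: each pass through the loop gets its own fresh timeout, and a new stream resumes from the position where the previous one stopped. This is not the actual GetTrainedModelLogs code or the change merged in #107. streamFrom and followWithResume are hypothetical names, the stream is simulated with a ticker, and the durations are milliseconds rather than minutes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// streamFrom is a hypothetical stand-in for one log stream. It delivers
// simulated lines starting at position pos until the job's totalLines have
// been delivered or ctx expires, and returns the new position.
func streamFrom(ctx context.Context, pos, totalLines int, tick time.Duration) int {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for pos < totalLines {
		select {
		case <-ctx.Done():
			return pos // this stream timed out; the caller may re-open
		case <-ticker.C:
			pos++
		}
	}
	return pos
}

// followWithResume creates the timeout inside the loop, so the deadline
// bounds a single stream rather than the whole follow operation. It returns
// the number of lines delivered and the number of streams opened.
func followWithResume(perStreamMs, totalLines, tickMs int) (int, int) {
	pos, streams := 0, 0
	for pos < totalLines {
		ctx, cancel := context.WithTimeout(context.Background(),
			time.Duration(perStreamMs)*time.Millisecond)
		pos = streamFrom(ctx, pos, totalLines, time.Duration(tickMs)*time.Millisecond)
		// Called explicitly here: a defer inside a loop would not run until
		// the enclosing function returns, so contexts would pile up.
		cancel()
		streams++
	}
	return pos, streams
}

func main() {
	lines, streams := followWithResume(50, 20, 10)
	fmt.Printf("delivered=%d lines using %d streams\n", lines, streams)
}
```

With these numbers no single stream can outlive its 50ms deadline, yet the follow still reaches the end of the simulated job by opening several streams in sequence.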

sboagibm · 2018-07-03T14:20:31Z

@ckadner will you get a chance to try ^^^ today?

ckadner · 2018-07-04T08:36:39Z

@sboagibm -- your suggested fix appears to work. I will update this PR accordingly.

* Remove 4 minute timeout for log follow process (#106)

  The process that follows the training logs of an ongoing training job should not time out after 4 minutes. Instead, the log follow process should complete after the training job itself is finished. This behavior is necessary to enable chaining commands into machine learning pipelines, where subsequent commands require the output data of the training job whose logs are being "followed", as in our ART notebook. This commit reinstates the log follow behavior from before the merge of PR #79.

* Updates suggested by sboagibm

  Intention was to not rely on a long-term stream being held open, but to be able to re-open a new stream starting from where the old one left off if the connection terminates.

Tomcli · 2018-07-05T16:26:33Z

Closed with #107

ckadner mentioned this issue Jun 28, 2018

Remove 4 minute timeout for log follow process #107

Merged

Tomcli closed this as completed Jul 5, 2018
