[Label Bot Continuous Training] Needs Training Needs to take into account whether there is a model currently being trained #178

jlewi · 2020-07-26T17:53:28Z

Our synchronous training pipeline is currently spawning multiple instances of training rather than the expected 1 model per hour.

The problem appears to be that the code that decides whether to train a model only checks whether a trained model exists.
I don't think we take into account whether a model is currently being trained.

code-intelligence/Label_Microservice/go/cmd/automl/pkg/automl/automl.go

Line 101 in faeb657

    
    func GetLatestTrained(projectID string, location string, modelName string) (*automlpb.Model, error) {

My conjecture is the following happens

1. We launch a Tekton job to train the model.
2. The notebook loads the data into AutoML, which is a blocking operation.
3. The notebook initiates an AutoML training job but doesn't block until training is complete.
   - This is intentional, since we want to upload the notebook output and not wait for the AutoML job to complete.

At this point

- A new model doesn't exist yet (it is still being trained).
- needsTrain will continue to return true.
- Since there is no Tekton job running, the controller will launch another job.

issue-label-bot · 2020-07-26T17:53:37Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
kind/bug	0.63

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

jlewi · 2020-07-26T17:58:32Z

It looks like we need to also look at the datasets and see if there is a model training in progress.
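
A rough sketch of what that check could look like, assuming we can obtain a list of pending operations for the project. The `operation` type, its `Kind` values, and `workInProgress` are hypothetical stand-ins for illustration and not the AutoML client API.

```go
package main

import "fmt"

// opKind is a hypothetical label for the type of long-running operation.
type opKind string

const (
	opImportData  opKind = "import-data"  // dataset import (assumed name)
	opCreateModel opKind = "create-model" // model training (assumed name)
	opDeployModel opKind = "deploy-model" // unrelated to training (assumed name)
)

// operation is a hypothetical record describing one long-running operation.
type operation struct {
	Kind opKind
	Done bool
}

// workInProgress reports whether any dataset import or model training
// operation has not finished yet. Other operation kinds are ignored.
func workInProgress(ops []operation) bool {
	for _, op := range ops {
		if op.Done {
			continue
		}
		if op.Kind == opImportData || op.Kind == opCreateModel {
			return true
		}
	}
	return false
}

func main() {
	ops := []operation{
		{Kind: opImportData, Done: true},
		{Kind: opCreateModel, Done: false},
	}
	// A controller would skip launching a new job while this is true.
	fmt.Println("work in progress:", workInProgress(ops))
}
```

The controller would then launch a new job only when no fresh model exists and `workInProgress` returns false.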

Temporarily disable continuous retraining until we can fix #178

* NeedsSync needs to check whether there is a model being trained or a dataset being imported. Otherwise we end up launching multiple overlapping jobs, because the model takes a long time to train and the Tekton job will have finished in the meantime.
* Related to kubeflow#178

* It is NeedsTraining not NeedsSync that needs to check whether there is a training job running. Related to kubeflow#178

jlewi · 2020-10-05T13:06:12Z

#182 auto PR created for a model trained by manually running the notebook.

Need to verify that a new model is trained automatically and then deployed.

jlewi · 2020-10-06T14:02:33Z

#184 opened a PR to update to the same model. It doesn't look like a new model got trained.

issue-label-bot bot added the kind/bug label Jul 26, 2020

jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Jul 26, 2020

Temporarily disable continuous retraining until we can fix kubeflow#178

94f27f3

jlewi closed this as completed in b893c14 Jul 28, 2020

jlewi added a commit that referenced this issue Jul 28, 2020

Merge pull request #179 from jlewi/labels_gitops_tekton

45f43ce

Temporarily disable continuous retraining until we can fix #178

jlewi reopened this Oct 4, 2020

jlewi mentioned this issue Oct 4, 2020

LabelBot NeedSync needs to check if model is being trained. #181

Merged

jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Oct 4, 2020

Fix bug checking if training is running.

780aa4c

* It is NeedsTraining not NeedsSync that needs to check whether there is a training job running. Related to kubeflow#178

jlewi mentioned this issue Oct 4, 2020

Fix bug checking if training is running. #182

Merged

k8s-ci-robot pushed a commit that referenced this issue Oct 4, 2020

Fix bug checking if training is running. (#182)

9940f13

* It is NeedsTraining not NeedsSync that needs to check whether there is a training job running. Related to #178
