This repository has been archived by the owner on Jan 31, 2022. It is now read-only.

[Label Bot Continuous Training] Needs Training Needs to take into account whether there is a model currently being trained #178

Open
jlewi opened this issue Jul 26, 2020 · 4 comments

@jlewi
Contributor

jlewi commented Jul 26, 2020

Our synchronous training pipeline is currently spawning multiple overlapping training runs rather than the expected one model per hour.

The problem appears to be that the code that decides whether to train a model only looks at whether a trained model exists.
So I don't think we take into account whether a model is currently being trained.

func GetLatestTrained(projectID string, location string, modelName string) (*automlpb.Model, error) {

My conjecture is that the following happens:

  • We launch a Tekton job to train the model
  • The notebook loads the data into AutoML, which is a blocking operation
  • The notebook initiates an AutoML training job but doesn't block until training is complete
    • This is intentional since we want to upload the notebook output and not wait for the AutoML job to complete.

At this point:

  • A new model doesn't exist yet (it is still being trained)
  • needsTrain will continue to return true
  • Since there is no Tekton job running the controller will launch another job
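
The conjectured sequence above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the `state` type, `needsTrainNaive`, and `simulate` are hypothetical names that do not come from the repository, and the model assumes the Tekton job finishes within one controller tick while AutoML training outlasts the whole simulated window.

```go
package main

import "fmt"

// state is a hypothetical, simplified view of what the controller observes.
type state struct {
	hasFreshModel    bool // a sufficiently recent trained model exists
	tektonJobRunning bool // the Tekton job that runs the notebook is active
	trainingPending  int  // AutoML training runs started but not yet finished
}

// needsTrainNaive mirrors the conjectured behavior: it only asks whether a
// trained model exists and ignores training that is still in progress.
func needsTrainNaive(s state) bool {
	return !s.hasFreshModel
}

// simulate runs the controller loop for the given number of ticks and returns
// how many Tekton jobs were launched. Assumption: each job starts an AutoML
// training run and exits within the same tick, and no training run completes
// during the simulation, so hasFreshModel stays false.
func simulate(ticks int, needsTrain func(state) bool) int {
	s := state{}
	launched := 0
	for i := 0; i < ticks; i++ {
		if !s.tektonJobRunning && needsTrain(s) {
			launched++
			s.tektonJobRunning = true
		}
		if s.tektonJobRunning {
			// The notebook kicks off training without blocking, then exits.
			s.trainingPending++
			s.tektonJobRunning = false
		}
	}
	return launched
}

func main() {
	// One job is launched per tick because the pending training is invisible
	// to the naive check.
	fmt.Println(simulate(3, needsTrainNaive)) // prints 3
}
```

Under these assumptions every tick launches another job, which matches the multiple overlapping training runs described above.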
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.63

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@jlewi
Contributor Author

jlewi commented Jul 26, 2020

It looks like we need to also look at the datasets and see if there is a model training in progress.

jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Jul 26, 2020
jlewi closed this as completed in b893c14 Jul 28, 2020
jlewi added a commit that referenced this issue Jul 28, 2020
Temporarily disable continuous retraining until we can fix #178
jlewi reopened this Oct 4, 2020
jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Oct 4, 2020
* NeedsSync needs to check whether there is a model being trained or
  if there is a dataset being imported. Otherwise we end up launching
  multiple overlapping jobs because it takes a long time for the model
  to train. During which time the Tekton job will have finished.

* Related to kubeflow#178
k8s-ci-robot pushed a commit that referenced this issue Oct 4, 2020
* NeedsSync needs to check whether there is a model being trained or
  if there is a dataset being imported. Otherwise we end up launching
  multiple overlapping jobs because it takes a long time for the model
  to train. During which time the Tekton job will have finished.

* Related to #178
jlewi pushed a commit to jlewi/code-intelligence that referenced this issue Oct 4, 2020
* It is NeedsTraining not NeedsSync that needs to check whether there is
  a training job running.

Related to kubeflow#178
k8s-ci-robot pushed a commit that referenced this issue Oct 4, 2020
* It is NeedsTraining not NeedsSync that needs to check whether there is
  a training job running.

Related to #178
@jlewi
Contributor Author

jlewi commented Oct 5, 2020

#182 is an automatically created PR for a model that was trained by manually running the notebook.

We still need to verify that a new model is trained automatically and then deployed.

@jlewi
Contributor Author

jlewi commented Oct 6, 2020

#184 was opened as a PR to update to the same model. It doesn't look like a new model got trained.
