Set the metadata only during first training run #3684
ludwig/api.py
"Previous metadata has been detected. Overriding `training_set_metadata` with metadata from previous "
"training run."
Not a huge fan of this warning. The concept of training set metadata is internal to LudwigModel. The user doesn't know about it and doesn't need to, so the warning is not very useful to them.
A better warning would sound like "This model has been trained before and its architecture has been defined based on the original training set properties (e.g. the number of output classes for a category output). The new data provided will be mapped into the previous architecture, and it is not possible to modify the architecture based on the new training data. If you want to achieve that, you should concatenate the new data with the previous data and train a new model from scratch." Or something along those lines.
Let's see if there's a test that we can write for this in test_api.py!
Hi @Infernaught, the changes LGTM. However, it looks like there was a bad rebase, since changes from previous commits appear in this PR. Could you reconcile these diffs?
This PR allows users to call `model.train` multiple times by setting the metadata only during the first training run. This assumes the first training run is on a dataset that contains all possible outputs, and every subsequent training run is on a dataset whose outputs are a subset of the first's.
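
To make the intended behavior concrete, here is a minimal, self-contained sketch of the "set metadata only on the first run" idea. It is not Ludwig's implementation: `ToyModel`, its `str2idx` metadata layout, and the error raised for unseen labels are all hypothetical, chosen only to illustrate how later runs get mapped into the architecture fixed by the first run.

```python
import warnings


class ToyModel:
    """Hypothetical stand-in for a model whose output size depends on dataset metadata."""

    def __init__(self):
        # Assumption for this sketch: metadata is None until the first train() call.
        self.training_set_metadata = None

    def train(self, labels):
        if self.training_set_metadata is None:
            # First run: derive the label vocabulary (and thus the output size) from the data.
            vocab = sorted(set(labels))
            self.training_set_metadata = {"str2idx": {s: i for i, s in enumerate(vocab)}}
        else:
            # Later runs: keep the original metadata so the architecture stays fixed.
            warnings.warn(
                "This model has been trained before; new data will be mapped "
                "into the architecture defined by the original training set."
            )
        str2idx = self.training_set_metadata["str2idx"]
        unknown = set(labels) - set(str2idx)
        if unknown:
            # Hypothetical choice: reject labels the first run never saw.
            raise ValueError(f"labels not seen in the first training run: {sorted(unknown)}")
        # Encode the labels with the vocabulary fixed by the first run.
        return [str2idx[label] for label in labels]


model = ToyModel()
first = model.train(["cat", "dog", "bird"])  # defines three output classes
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the retraining warning for this demo
    second = model.train(["dog", "cat"])  # subset of the original classes
```

The second call reuses the vocabulary built by the first, so the number of output classes stays at three even though only two classes appear in the new data.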