
Any way to benefit from retraining? #26

Closed
JoseAF opened this issue Jul 8, 2022 · 4 comments

Comments


JoseAF commented Jul 8, 2022

Most users who try to train a classifier have to carry out several attempts at training until they get acceptable results. This means that in each consecutive attempt they have to resend and retrain the classifier using almost the same training data, with only a few added samples. This seems wasteful. Is there any way to use Yggdrasil to benefit from this knowledge?


achoum commented Jul 12, 2022

Hi Jose,

There are different reasons for training multiple models on the same (or a similar) dataset, and different solutions. Can you give more details about your objective?

For example, if you are re-training multiple models to test hyper-parameters, or to create an ensemble of sub-models, there is no generic solution to avoid re-training the full models.

However, if you want to update an existing model with a small number of new training examples, there is existing work on this; take a look at the "online learning" literature. Currently, YDF does not offer a direct solution for that.

That said, while this is likely not as good as a dedicated online learning method, you can either resume training of a model on a new dataset, or ensemble a set of models, each trained on a different snapshot of the data.


JoseAF commented Jul 14, 2022

Hi Mathieu,

Thanks for the reply.

It's the second option that I'm interested in, updating an existing model several times with small amounts of new training examples. I'd be interested to look into the possibility you mention regarding resuming training of a model with a new dataset. I assume the new dataset would be only the small extra training examples. Can you point me to the code that does this?


achoum commented Jul 15, 2022

"I assume the new dataset would be only the small extra training examples."

This will depend on your problem, the size of the dataset, and the type of learning algorithm. It might be that you need to include some of the past data (not all of it, otherwise you get no benefit).

You need to experiment to figure out what works.
This is something I have little experience with, but here is what I would try:

  1. Look at the literature to see whether there are published meta-methods that work on top of a standard learning algorithm.

  2. With TF-DF, if you use GBT:

Set the temp directory argument (e.g. temp_directory="/tmp/training_cache") and enable resuming training (try_resume_training=True).
Train 200 trees on the first dataset (i.e. model.fit(first_dataset)) and then train an extra 200 trees on the second dataset:

model.learner_params["num_trees"] = 200 + 200  # Add an extra 200 trees
model.fit(second_dataset)

  3. With both GBT and RF:

Train the two models independently on the different datasets, and then combine them (e.g. average their predictions) using the Keras functional API. Look at the composition colab for an example of model composition; a minimal sketch of options 2 and 3 follows below.
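
For illustration, here is a minimal sketch of options 2 and 3 with TF-DF. It assumes first_dataset and second_dataset are tf.data.Datasets (e.g. built with tfdf.keras.pd_dataframe_to_tf_dataset), that model_a and model_b in the second part are already-trained TF-DF models, and that the feature names f1/f2 and the 200-tree counts are placeholders for your own setup:

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Option 2: resume training of a GBT model with extra trees.
# The temp directory keeps the training checkpoint so a later fit()
# can continue from the trees that were already trained.
model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=200,
    try_resume_training=True,
    temp_directory="/tmp/training_cache",
)
model.fit(first_dataset)  # trains the first 200 trees

model.learner_params["num_trees"] = 200 + 200  # add an extra 200 trees
model.fit(second_dataset)  # resumes from the cached checkpoint

# Option 3: ensemble two independently trained models (GBT or RF).
# model_a and model_b are assumed to be trained on different snapshots
# of the data; their predictions are averaged with the Keras functional
# API. The feature names are placeholders.
inputs = {
    "f1": tf.keras.Input(shape=(), name="f1", dtype=tf.float32),
    "f2": tf.keras.Input(shape=(), name="f2", dtype=tf.float32),
}
ensemble = tf.keras.Model(
    inputs=inputs,
    outputs=tf.keras.layers.Average()([model_a(inputs), model_b(inputs)]),
)

Whether resuming on only the new examples is enough, or whether you need to mix in some of the old data, is exactly the experiment mentioned above.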


JoseAF commented Jul 15, 2022

Hi Mathieu,

Thanks for the suggestions. I'll think through these...

achoum closed this as completed Sep 9, 2022