
Any way to benefit from retraining? #26

Closed
JoseAF opened this issue Jul 8, 2022 · 4 comments

Comments


JoseAF commented Jul 8, 2022

Most users who try to train a classifier have to carry out several attempts at training until they get acceptable results. This means that in each consecutive attempt they have to resend and retrain the classifier using almost the same training data, with only a few added samples. This seems wasteful. Is there any way to use Yggdrasil to benefit from this knowledge?


achoum commented Jul 12, 2022

Hi Jose,

There are different reasons for training multiple models on the same (or a similar) dataset, and different solutions. Can you give more details about your objective?

For example, if you are re-training multiple models to test hyper-parameters, or to create an ensemble of sub-models, there is no generic solution to avoid re-training the full models.

However, if you want to update an existing model with a small number of new training examples, there is existing work on this; take a look at the "online learning" literature. Currently, YDF does not offer a direct solution for that.

That said, while this is likely not as good as a dedicated online learning method, you can either resume training of a model on a new dataset, or ensemble a set of models, each trained on a different snapshot of the data.


JoseAF commented Jul 14, 2022

Hi Mathieu,

Thanks for the reply.

It's the second option that I'm interested in, updating an existing model several times with small amounts of new training examples. I'd be interested to look into the possibility you mention regarding resuming training of a model with a new dataset. I assume the new dataset would be only the small extra training examples. Can you point me to the code that does this?


achoum commented Jul 15, 2022

"I assume the new dataset would be only the small extra training examples."

This will depend on your problem, the size of the dataset, and the type of learning algorithm. It might be that you need to include some of the past data (not all of it, otherwise you get no benefit).

You need to experiment to figure out what works.
This is something I have little experience with, but here is what I would try:

  1. Look at the literature to see whether there are published meta-methods that work on top of a standard learning algorithm.

  2. With TF-DF, if you use GBT:

Set the temp directory argument (e.g. temp_directory="/tmp/training_cache") and enable resuming training (try_resume_training=True).
Train 200 trees on the first dataset (i.e. model.fit(first_dataset)) and then train an extra 200 trees on the second dataset:

model.learner_params["num_trees"] = 200 + 200  # Add an extra 200 trees
model.fit(second_dataset)

  3. With both GBT and RF:

Train the two models independently on the different datasets, and then combine them (e.g. average their predictions) using the Keras functional API. Look at the composition colab for an example of model composition; a minimal sketch of options 2 and 3 follows below.
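
For illustration, here is a minimal sketch of options 2 and 3 with TF-DF. It assumes first_dataset and second_dataset are tf.data.Datasets (e.g. built with tfdf.keras.pd_dataframe_to_tf_dataset), that model_a and model_b in the second part are already-trained TF-DF models, and that the feature names f1/f2 and the 200-tree counts are placeholders for your own setup:

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Option 2: resume training of a GBT model with extra trees.
# The temp directory keeps the training checkpoint so a later fit()
# can continue from the trees that were already trained.
model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=200,
    try_resume_training=True,
    temp_directory="/tmp/training_cache",
)
model.fit(first_dataset)  # trains the first 200 trees

model.learner_params["num_trees"] = 200 + 200  # add an extra 200 trees
model.fit(second_dataset)  # resumes from the cached checkpoint

# Option 3: ensemble two independently trained models (GBT or RF).
# model_a and model_b are assumed to be trained on different snapshots
# of the data; their predictions are averaged with the Keras functional
# API. The feature names are placeholders.
inputs = {
    "f1": tf.keras.Input(shape=(), name="f1", dtype=tf.float32),
    "f2": tf.keras.Input(shape=(), name="f2", dtype=tf.float32),
}
ensemble = tf.keras.Model(
    inputs=inputs,
    outputs=tf.keras.layers.Average()([model_a(inputs), model_b(inputs)]),
)

Whether resuming on only the new examples is enough, or whether you need to mix in some of the old data, is exactly the experiment mentioned above.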


JoseAF commented Jul 15, 2022

Hi Mathieu,

Thanks for the suggestions. I'll think through these...

achoum closed this as completed Sep 9, 2022