feat: lr scheduler #248
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #248      +/-   ##
==========================================
+ Coverage   87.88%   88.54%   +0.66%
==========================================
  Files          37       37
  Lines        1799     1834      +35
==========================================
+ Hits         1581     1624      +43
+ Misses        218      210       -8
@@ -28,8 +29,13 @@ def fit(
        epochs: int = 10,
        batch_size: int = 256,
        loss: Union[str, 'AnyDNN'] = 'SiameseLoss',
        optimizer: Optional['AnyOptimizer'] = None,
Why can't you pass an optimizer object?
In PyTorch we move the model to a GPU if needed. However, if this is done after the optimizer is initialized, it will break the optimizer. Here is a quote from the PyTorch documentation:
If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
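A minimal sketch of the ordering this implies (illustrative only, not code from this PR):

```python
import torch

model = torch.nn.Linear(16, 4)

# Move the model to the GPU *before* constructing the optimizer, so the
# optimizer holds references to the parameter objects that are actually trained.
if torch.cuda.is_available():
    model = model.cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```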
The problem is that you are moving `fit` arguments to attributes of the object. What a `Tuner` is should perhaps be determined by the model and the loss. Optimizers and learning rates should be call arguments, possibly different ones for the same object.
Well no, the problem persists, and the factory idea is good, but they should remain call arguments and not attributes
I have thought about this, and I have decided to go against it. What is an "attribute" of `Tuner` and what is not is just semantics really - I can easily argue that because this is a `Tuner` and not a `Model`, all things crucial to tuning need to go in `__init__`.
I have instead a very concrete argument - all the things whose state can change during training should go into `__init__`. This is due to checkpointing - if my training is interrupted in epoch 5, and I want to later start from that position, I need a way of restoring that state. There are two ways of doing this:
- Having the `fit` method already start with initialized states. In this case, I can, right after initializing the trainer, call `TrainingCheckpoint.load(tuner, path_to_checkpoint)`. This is my preferred option, as it decouples the checkpoint from the trainer (see the sketch after this list).
- Having the checkpoint and training tightly coupled, so that you do `tuner.fit(load_checkpoint=path_to_checkpoint)`. I don't really like this option.
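A rough sketch of how the preferred, decoupled option could look; the `Tuner`, `TrainingCheckpoint` and `train_data` names are taken from the discussion or assumed for illustration, not the final API:

```python
# Hypothetical usage of option 1: checkpoint loading is decoupled from the trainer.
tuner = Tuner(model, loss='SiameseLoss')            # optimizer/scheduler state is set up in __init__
TrainingCheckpoint.load(tuner, path_to_checkpoint)  # restore model, optimizer and scheduler state
tuner.fit(train_data, epochs=10)                    # training resumes from the restored state
```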
And the way this is done in different frameworks:
- In Keras you call `compile` on a model first (the equivalent of our trainer init), passing the optimizer with its learning rate, metrics, etc. In `fit` you only pass the data and related args (see the sketch below).
- In PyTorch Lightning you pass all the configuration (including things like number of epochs and device) in init, and pass the model with data (the optimizer is specified as part of the model definition) and the checkpoint path in `fit` - here checkpointing is tightly coupled with the training method.
- In PyTorch Ignite you have to create all objects before calling `Engine.run` - which takes only the data and number of epochs.
Based on the preferred way of implementing checkpoints, and the observation that this is indeed how it is also done in other frameworks, I would go ahead with my implementation.
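For reference, a minimal sketch of the Keras pattern mentioned in the list above (standard `tf.keras` usage, shown only to illustrate the compile-then-fit split):

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic data so the sketch is runnable.
x_train = np.random.rand(32, 8).astype('float32')
y_train = np.random.rand(32, 4).astype('float32')

model = tf.keras.Sequential([tf.keras.layers.Dense(4)])

# Optimizer (with learning rate) and loss are fixed when the model is compiled...
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')

# ...while fit only receives the data and run-related arguments.
model.fit(x_train, y_train, epochs=2, batch_size=8)
```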
@@ -69,26 +75,34 @@ def fit(
            - ``TripletLoss`` for Triplet network
        :param num_items_per_class: Number of items from a single class to include in
            the batch. Only relevant for class datasets
        :param configure_optimizer: A function that allows you to provide a custom
I am not sure this usage makes much sense. Why would I make the optimizer returned depend on the model? Why can't I pass the optimizer directly?
See above - basically, in PyTorch to construct the optimizer you need to pass the parameters of the model (and the same holds in PaddlePaddle).
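To illustrate, a small sketch of what such a `configure_optimizer` callback could look like with plain PyTorch, matching the signature in the diff below; how it is wired into the `Tuner` is an assumption here:

```python
import torch

def configure_optimizer(model: torch.nn.Module):
    # The callback receives the (possibly already device-placed) model, so the
    # optimizer is built from the correct parameter objects.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    return optimizer, scheduler  # returning only the optimizer would also satisfy the annotation
```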
        configure_optimizer: Optional[
            Callable[[AnyDNN], Union[AnyOptimizer, Tuple[AnyOptimizer, AnyScheduler]]]
        ] = None,
        learning_rate: float = 1e-3,
If we are talking about making `learning_rate` scheduling, it seems weird to link the identity of a `Tuner` to a `learning_rate` value; it seems more adequate for it to be a `fit` argument than an attribute.
So this is only the "default" learning rate, if the user does not provide an optimizer (in which case the learning rate stays constant). I think renaming this will suffice
The point is, `learning_rate` is an argument of `fit`, not something that the tuner should be adapted to. The same tuner identity should be able to work with different learning rates.
            Callable[[AnyDNN], Union[AnyOptimizer, Tuple[AnyOptimizer, AnyScheduler]]]
        ] = None,
        learning_rate: float = 1e-3,
        scheduler_step: str = 'batch',
same for this
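For context, a rough, self-contained sketch of what a `scheduler_step` value of `'batch'` versus `'epoch'` typically controls in a training loop; this is an assumption about the intended semantics, not code from this PR:

```python
import torch

# Tiny synthetic setup just to make the sketch runnable.
model = torch.nn.Linear(8, 1)
data_loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(5)]
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

scheduler_step = 'batch'  # or 'epoch'

for epoch in range(2):
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if scheduler_step == 'batch':
            scheduler.step()  # learning rate updated after every batch
    if scheduler_step == 'epoch':
        scheduler.step()      # learning rate updated once per epoch
```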
Co-authored-by: Joan Fontanals <joan.martinez@jina.ai>
I'm not sure about `configure_optimizer` - the usage looks a bit weird to me. At least, the naming sounds weird.
One question I have is about `learning_rate` and `default_learning_rate`: does it mean that if the user provides a `learning_rate`, the default will be overwritten? Why change the name here?
@bwanglzu The naming is taken from PyTorch Lightning, where they do the same thing. I see I forgot to rename
LGTM!
Implements a learning rate scheduler
TODO