Scheduling the learning rate has become a necessity, not an option, when training deep learning models.
By adjusting the learning rate properly, a model can learn faster and more efficiently than it otherwise would.
This repository provides a powerful custom learning rate scheduler with the latest techniques available for Keras optimizers.
Here is a simple usage example:
import tensorflow as tf

lr = 0.001
iterations = 100000
lr_scheduler = LRScheduler(iterations=iterations, lr=lr, policy='step')
for i in range(iterations):
    ...
    # set the optimizer's learning rate for the current iteration
    lr_scheduler.update(optimizer, i)
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = loss_function(y_true, y_pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    ...
You can also keep the learning rate fixed, which is the same as not using scheduling at all.
The 'step' policy dramatically lowers the learning rate later in training, letting the model refine the parts it could not learn with a constant learning rate.
lr_scheduler = LRScheduler(iterations=iterations, lr=lr, policy='step')
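For intuition, here is a minimal sketch of how a step-style schedule can compute the learning rate at a given iteration; the decay boundaries (80% and 90% of the iterations) and the 10x decay factor are illustrative assumptions, not necessarily what LRScheduler uses internally.

def step_decay_lr(base_lr, iteration, iterations):
    # Illustrative step decay: drop the learning rate by 10x at 80% and 90%
    # of total iterations (boundaries and factors are assumptions).
    progress = iteration / iterations
    if progress < 0.8:
        return base_lr
    elif progress < 0.9:
        return base_lr * 0.1
    return base_lr * 0.01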
I found that using warm-up with the 'step' policy makes training more stable and faster.
Warm-up slowly increases the learning rate from 0 until the configured learning rate is reached.
This method is used when training various deep learning models and works well regardless of which optimizer you use.
Here's how to use warm-up:
lr_scheduler = LRScheduler(iterations=iterations, lr=lr, policy='step', warm_up=0.1)
The default warm_up value is 0.1, meaning the learning rate warms up over the first 10% of the given iterations.
If the model does not train stably, a larger value such as 0.5 may help.
The learning rate of the 'step' policy with a warm_up value of 0.5 is scheduled as follows.
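As a rough sketch of that shape (an illustration under the same assumed decay points as the earlier sketch, not the repository's exact implementation), linear warm-up followed by step decay can be expressed as:

def warmup_step_lr(base_lr, iteration, iterations, warm_up=0.5):
    # Sketch: linear warm-up from 0 to base_lr over the first `warm_up`
    # fraction of iterations, then illustrative step decay.
    warmup_iters = int(iterations * warm_up)
    if iteration < warmup_iters:
        return base_lr * iteration / warmup_iters
    progress = iteration / iterations
    if progress < 0.8:
        return base_lr
    elif progress < 0.9:
        return base_lr * 0.1
    return base_lr * 0.01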
The 'cosine' policy decreases the learning rate following a cosine curve, then raises it rapidly and decreases it again.
It is known to train models quickly in a short time.
lr_scheduler = LRScheduler(iterations=iterations, lr=lr, policy='cosine')
At the end of each cosine cycle, the cycle length is doubled; this can be adjusted with the cycle_weight argument.
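This doubling behaves like cosine annealing with warm restarts; below is a minimal sketch under assumed parameter names (first_cycle, cycle_weight), where each restart jumps the learning rate back up to the base value and the next cycle is cycle_weight times longer.

import math

def cosine_restart_lr(base_lr, iteration, first_cycle=1000, cycle_weight=2):
    # Sketch of cosine annealing with warm restarts: within a cycle the lr
    # follows a half cosine from base_lr down to 0, and each new cycle is
    # cycle_weight times longer than the previous one (doubling by default).
    cycle_len = first_cycle
    start = 0
    while iteration >= start + cycle_len:
        start += cycle_len
        cycle_len *= cycle_weight
    progress = (iteration - start) / cycle_len
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))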
The 'onecycle' policy, also called super convergence, is similar to the cosine policy, but unlike cosine it has only one increase and one decrease.
According to the super convergence paper, training can be up to 10 times faster than with a constant learning rate.
lr_scheduler = LRScheduler(iterations=iterations, lr=lr, policy='onecycle')
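For reference, here is a minimal sketch of a one-cycle shape; the 0.3 ramp-up fraction and the min_lr_ratio floor are illustrative assumptions, not LRScheduler's internal defaults.

import math

def onecycle_lr(base_lr, iteration, iterations, warm_up=0.3, min_lr_ratio=0.01):
    # Sketch of a one-cycle schedule: a single rise from a low lr to base_lr,
    # then a single (cosine-shaped) fall back down to the low lr.
    min_lr = base_lr * min_lr_ratio
    warmup_iters = int(iterations * warm_up)
    if iteration < warmup_iters:
        return min_lr + (base_lr - min_lr) * iteration / warmup_iters
    progress = (iteration - warmup_iters) / (iterations - warmup_iters)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))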