This repository has been archived by the owner on May 1, 2023. It is now read-only.

Questions about regularization and pruning #54

Closed
hunterkun opened this issue Oct 2, 2018 · 2 comments

@hunterkun
Contributor

  1. I found that you treat regularization as another means of pruning, but the procedure differs between them: pruning takes effect at the beginning of a batch (on_minibatch_begin), while regularization takes effect at the end of a batch (on_minibatch_end). This means you zero out the weights below the threshold on every batch iteration during training.
    What is the reason for this? It seems more natural for this to happen at the end of an epoch, or at the end of the whole training run, once regularization has decreased the weights enough for pruning.
  2. Regularization and pruning both use the same zeros_mask_dict, which may cause confusion. For example, apply_mask in on_minibatch_end of class RegularizationPolicy would be called with the regularization mask, but also with the pruning mask if both a regularizer and a pruner are present.
  3. What is the purpose of keeping the regularization mask of the last epoch? I guess it may be used by some remover in thinning.py, right?
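To make question 2 concrete, here is a hypothetical sketch (the function names are invented for illustration and are not Distiller's actual API) of a pruner and a regularizer writing into one shared zeros_mask_dict, so that apply_mask uses whichever mask was computed most recently:

```python
# Hypothetical sketch of the shared-mask interaction in question 2.
# Names are invented; this is not Distiller's real code.
import torch

zeros_mask_dict = {}

def apply_mask(name, weight):
    return weight * zeros_mask_dict[name]

def pruner_on_minibatch_begin(name, weight, prune_thresh=0.2):
    zeros_mask_dict[name] = (weight.abs() > prune_thresh).float()
    return apply_mask(name, weight)

def regularizer_on_minibatch_end(name, weight, reg_thresh=0.05):
    # Overwrites the pruner's mask for the same parameter.
    zeros_mask_dict[name] = (weight.abs() > reg_thresh).float()
    return apply_mask(name, weight)

w = torch.tensor([0.30, 0.10, 0.01])
pruner_on_minibatch_begin("fc.weight", w)     # mask keeps only 0.30
regularizer_on_minibatch_end("fc.weight", w)  # mask now keeps 0.30 and 0.10
```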
@nzmora
Contributor

nzmora commented Oct 6, 2018

Hi,

I'll start with a long explanation :-), and then I'll take your questions.

Regularization can be a means to achieve sparsity - but there is an important distinction between sparsity and pruning which relates to the rest of my answer. Sparsity is a measure of how many exact zeros a tensor contains. Pruning algorithms are one approach to achieve sparsity. But the distinction is even deeper.

Consider what happens when we prune connections: we remove those connections entirely from the network which means that no information flows through these connections: neither forward data, nor backward gradients. Practically, we mask both weights during the forward pass, and gradients during the backward pass. But you know this 😉
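The paragraph above can be sketched in a few lines of PyTorch (a minimal illustration, not Distiller's code): the mask zeros the weights before the forward pass, and a gradient hook zeros the matching gradients on the backward pass, so no information flows through pruned connections in either direction.

```python
# Minimal sketch of pruning-style masking: mask weights (forward)
# and gradients (backward). Not Distiller's actual implementation.
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)
mask = (layer.weight.detach().abs() > 0.1).float()  # prune small weights

with torch.no_grad():
    layer.weight.mul_(mask)            # forward: no data flows through

layer.weight.register_hook(lambda g: g * mask)  # backward: no gradient flows

loss = layer(torch.randn(2, 4)).sum()
loss.backward()
# Gradients at pruned positions are exactly zero.
assert torch.all(layer.weight.grad[mask == 0] == 0)
```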

What happens when we regularize? At first glance, there is no relation between pruning and regularization, because in regularization we just use an added loss term to put “downward pressure” on the weights (individually; or in grouped structures) - We don't remove connections. So no masking should be involved, right?
Well, not quite: we use a “soft-thresholding operator” (i.e. thresholding + masking) to prevent the weights from oscillating around zero (I tried to show this in this notebook using L1 regularization on a toy example).
OK, so when we regularize, we also mask the weights. But what about the gradients? We leave the gradients alone, because we don't want to completely remove the regularized connections from the network: i.e. we want the regularized connections to continue passing information in the backward direction. Here is another way to look at this difference between pruning and regularization: pruned connections are removed forever, whereas regularized connections that are masked out (because they fell below some threshold) can sometimes grow back in size.
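A small sketch, under my own naming, of the thresholding-plus-masking step described above: weights whose magnitude falls below the threshold are zeroed, which keeps them from oscillating around zero, but because the gradients are left alone, a masked weight can still grow back after a later update.

```python
# Sketch of threshold + mask after regularization; not Distiller's code.
import torch

def threshold_mask(weight: torch.Tensor, threshold: float) -> torch.Tensor:
    """Return a {0,1} mask keeping only weights above `threshold` in magnitude."""
    return (weight.abs() > threshold).float()

w = torch.tensor([0.30, -0.02, 0.01, -0.50])
mask = threshold_mask(w, 0.05)        # [1, 0, 0, 1]
w = w * mask                          # zero the small weights

# A later gradient update (stand-in for an SGD step) may push a masked
# weight back above the threshold, at which point it "grows back":
w = w + torch.tensor([0.0, 0.08, 0.0, 0.0])
mask = threshold_mask(w, 0.05)        # position 1 is unmasked again
```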

  1. "It seems more natural for this to happen at the end of an epoch, or at the end of the whole training run, once regularization has decreased the weights enough for pruning."
    This is an interesting idea. If we implemented it, we wouldn't be able to easily see in the logs the sparsity of the weights during part of the training (because we wouldn't have absolute zeros, most likely). But this is not the reason I chose to threshold regularization at the end of each mini-batch. You see, pruning is iterative and therefore not "continuous": we prune, then we fine-tune for a "long" time, then we prune some more, fine-tune some more, and so on. Regularization is "continuous" by definition: every time we compute the data loss, we also compute the regularization loss. And as far as I understand, the "soft-thresholding operator" is part of every regularization calculation (on_minibatch_end).
    BTW, you can also configure the regularizer not to threshold. BTW 2: today we can only schedule pruning at epoch granularity, but in the future I want to allow scheduling pruning at mini-batch granularity.

  2. "Regularization and pruning both use the same zeros_mask_dict, which may cause confusion." This is a good comment, and it tells me that I didn't document the interaction between pruning and regularization. I think that when you choose to mix these two, you want to smoothly push the solution towards sparsity (using the regularization loss term), but also prune on a more "clumsy" pruning schedule. Now, the only reason to use a pruner when you're already using a regularizer is if the pruner is more aggressive than the regularizer (otherwise the pruner does nothing, because its mask is weaker than the regularization mask). To sum up: if you're both pruning and regularizing, don't enable the regularizer's mask.

  3. "What is the purpose of keeping the regularization mask of the last epoch? I guess it may be used by some remover in thinning.py, right?" - Correct: we keep the mask to capture the sparsity pattern, which we can then exploit to remove structures (thinning.py).
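Putting the scheduling from answer 1 together, here is a minimal training-loop sketch (hypothetical names, not Distiller's scheduler): the mask is re-applied at minibatch begin, the regularization loss is added to the data loss on every step, and the soft-threshold runs at minibatch end while gradients are left untouched so connections may grow back.

```python
# Sketch of per-minibatch L1 regularization with end-of-batch thresholding.
# Hypothetical structure for illustration; not Distiller's actual scheduler.
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
l1_strength, threshold = 1e-3, 1e-2
mask = torch.ones_like(model.weight)  # shared zeros-mask for this layer

for step in range(100):
    # on_minibatch_begin: re-apply the current mask to the weights.
    with torch.no_grad():
        model.weight.mul_(mask)

    x, y = torch.randn(16, 8), torch.randn(16, 2)
    data_loss = nn.functional.mse_loss(model(x), y)
    reg_loss = l1_strength * model.weight.abs().sum()  # computed every step
    opt.zero_grad()
    (data_loss + reg_loss).backward()
    opt.step()

    # on_minibatch_end: soft-threshold -- recompute the mask and zero small
    # weights, but leave gradients alone so masked weights can grow back.
    with torch.no_grad():
        mask = (model.weight.abs() > threshold).float()
        model.weight.mul_(mask)
```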

Thanks for the interesting comments,
Neta

@nzmora nzmora self-assigned this Oct 25, 2018
@nzmora
Contributor

nzmora commented Oct 25, 2018

@hunterkun I'm closing because this has been idle for 19 days. If you have questions remaining we can reopen, or use another issue.
Cheers,
Neta

@nzmora nzmora closed this as completed Oct 25, 2018