glorot_normal init should be glorot_uniform? #52
Comments
Maybe we need both. I've seen this scheme around in various forms (sometimes normal, sometimes uniform) and under various names (Caffe seems to call it xavier?). And it's been around well before Glorot 2010, tbh. @untom, thoughts on this?
Where was it used before that paper?
It hasn't necessarily been written down very often (the earliest might have been LeCun 98), but it has been part of the old toolbox of NN "tricks" for a long time.
The main assumption behind the Glorot initialization is that the variance of the gradients should be the same in each layer. In eq. 12 of the paper you can see that to achieve this, the variance of the weights should be 2 / (fan_in + fan_out). You can get this either by drawing directly from a normal distribution with that variance (i.e. standard deviation sqrt(2 / (fan_in + fan_out))), or from a uniform distribution on [-a, a] with a = sqrt(6 / (fan_in + fan_out)), which has exactly the same variance. So in terms of the Glorot paper the two are equivalent, and the only question remaining is whether one should use a normal or a uniform distribution.

This is a personal choice. People around Bengio's group have always preferred initializing from a uniform distribution (hence they used that in the paper), while e.g. Hinton always advocates using a normal distribution. I personally think that using a normal is the more natural thing, since at the end of training the weight distribution always looks approximately Gaussian anyway (unless you use e.g. L1 regularization), no matter what you started with. So my reasoning is that with a normal, you at least have the correct prior. Which is why my patch to keras used the normal as well. To be honest, I do not think it makes much of a difference. It certainly does not make a difference in terms of how Glorot derived his initialization.
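To make the equivalence concrete, here is a minimal NumPy sketch (illustrative only, not the actual Keras code; the function names and `fan_in`/`fan_out` arguments are just for this example):

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng=np.random):
    # Normal with the target variance directly: std = sqrt(2 / (fan_in + fan_out)).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng=np.random):
    # A uniform on [-a, a] has variance a**2 / 3, so matching the same target
    # variance requires a = sqrt(3 * 2 / (fan_in + fan_out)) = sqrt(6 / (fan_in + fan_out)),
    # which is where the sqrt(6) in the paper comes from.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(low=-limit, high=limit, size=(fan_in, fan_out))

# Both draws have (approximately) the same empirical variance:
print(glorot_normal(784, 256).var())   # ~0.00192
print(glorot_uniform(784, 256).var())  # ~0.00192
print(2.0 / (784 + 256))               # 0.00192...
```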
So the reasoning I have heard for initializing from uniform instead of normal is that a uniform avoids the occasional large outlier weights that the tails of a normal produce. The argument that it ends up being Gaussian doesn't seem strong to me... Can you give a reference for Hinton advocating normals?

Wrt the code, I just think it would be good to actually use the formulation from the paper.
That's an interesting reason. But thinking about it, are outliers in the weight sizes necessarily a bad thing? As long as overall the gradients don't explode/collapse (which is what the variance derivation of Glorot is about) you're still okay, aren't you? And as long as learning can proceed, your weights will be adjusted.
The whole point of a "good initialization" is having one that somehow aids learning. You could always argue "if you end up with the same solution in the end, what does it matter how I initialize?". Like I said, picking a prior that matches the posterior just makes sense to me, but I'll admit the argument isn't strong. However, having each unit initialized by a Gaussian also allows each unit to focus more strongly on a given combination of inputs (since fewer of the weights will be large, and likely the combination of large weights will be different for each unit). Of course you'll end up with units that have nonsensical combinations, and those will take longer to recover than if they'd been initialized uniformly.

Every paper from Hinton's group that mentions how they initialized weights uses Gaussians. Off the top of my head, a specific example from Hinton himself comes from his RBM training guide (https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf, section 8). But if you go through papers from his group, several mention Gaussian initialization, none mention uniform distributions.

In any case, like I said, I don't think it will make much difference in the end. So if you think uniform is more user-friendly, then maybe that's the best thing to do. (Personally, I found the normal easier to understand, because it doesn't hide the extra sqrt(3) scaling factor inside the uniform limit.)
After some research, uniform seems to be more common across other libraries.

I don't really buy the Gaussian argument. A random normal array is less likely to be a good approximation of another random normal array than a constant or random uniform (small-scale) array. Yes, the distribution of values will be the same in aggregate, but the mean absolute error per weight will be larger. The point of a good initialization is one where your weights 1) make learning possible (avoid pathological cases with no gradient or exploding gradients), and 2) are as close as possible to the final learned weights. Normal distributions seem less likely to satisfy 2) compared to uniform distributions.

I will add glorot_uniform and he_uniform, to match what other DL libraries are doing. I think we should also make glorot_uniform the default initialization for the layers in which uniform is currently used as the default.
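For illustration, a rough sketch of what the fan-in-based uniform variants amount to (illustrative NumPy only, not the actual Keras code; `lecun_uniform` is shown alongside for comparison):

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng=np.random):
    # He et al. (2015): target variance 2 / fan_in (derived for ReLU nets),
    # so the uniform limit is sqrt(3 * 2 / fan_in) = sqrt(6 / fan_in).
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def lecun_uniform(fan_in, fan_out, rng=np.random):
    # LeCun (1998): target variance 1 / fan_in, so the limit is sqrt(3 / fan_in).
    limit = np.sqrt(3.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```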
Benchmarking on MNIST gives me better results with glorot_uniform compared to glorot_normal. glorot_uniform also appears to perform about as well as lecun_uniform.

glorot_uniform:
glorot_normal:

Code is at https://www.kaggle.com/users/123235/fchollet/digit-recognizer/simple-deep-mlp-with-keras
Interesting, thanks for the tests! :)
I'm assuming this is meant to implement the novel initialization proposed at the bottom of page 253 of this paper: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf. But that is a uniform initialization, and the numerator is 6, not 2.
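For reference, the formula there (the paper's "normalized initialization", with $n_j$ and $n_{j+1}$ the fan-in and fan-out of layer $j$) is:

$$
W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]
$$

which has variance $6 / (3(n_j + n_{j+1})) = 2 / (n_j + n_{j+1})$, matching the discussion above.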