Should we normalize learning rate by # neurons? #643
I'm cautiously in favour of this, as it'd be really nice to be able to just increase the number of neurons without worrying about reducing the learning rate at the same time. That said, I haven't worked with learning stuff much...
I also (cautiously) think this makes sense. Are we talking about `PES` specifically?
I looked at the average decoder size at one point and, if I recall correctly, they are indeed inversely proportional to the number of neurons.
I'd like to see how well the scaling actually works. This effect is true in basically all neural learning rules, not just PES, but this is not a common solution. That makes me a bit worried that the fix isn't as simple as it seems. It's just kind of a feature of learning rates that they're fiddly, and need to be retuned whenever anything about the model changes. I also think that there's a lot of potential for confusion with these under-the-hood changes, where you take the value the user sets and then modify it in a way that may not be obvious to them. Unless that modification is totally transparent (i.e., the learning rate scaling works perfectly), it can introduce some really irritating bugs. So basically, if it is as easy as dividing by the number of neurons, and that totally makes the issue go away, then I'm all for it. But if it just kind of works, then it might do more harm than good.
Agreed, some sort of test showing that this actually addresses the problem would be good.
On Wed, Feb 4, 2015 at 5:32 PM, Daniel Rasmussen notifications@github.com wrote:
For the record, I've never noticed this myself either. But I do end up having to change the learning rate in every case that I've used it, and if this allows me to not change it in all cases, I'm all for it.
Doesn't that show things being worse with `scale=True`?
No, it shows that they're consistent. All the rows on the right learn at the same rate, whereas on the left the effective rate of learning differs even though the learning rate parameter is the same.
Basically, when we scale the learning rate by the number of neurons, the rate of the actual learning is consistent (right column), whereas without the scaling it changes across numbers of neurons (left column: 10 neurons shows little change, whereas 1000 neurons shows almost instant learning).
I think I have a mathematical argument for this statement. If that is of interest, I can write it down tomorrow or so.
But given more neurons, you would expect it to learn more quickly. It just looks like we've counteracted the increased representational power by (needlessly) slowing down the learning. But it's also true that in this case it's likely hard to distinguish accuracy of learning from speed of learning, since the function is so simple. So maybe the accuracy is increasing with neuron number, and it's just hard to see. At the least it demonstrates that the simple linear scaling seems pretty accurate, which is good.
Interesting... my instinct is that with more neurons, it should learn at the same rate, or perhaps more slowly (as the learning interferes with each other). I think the biggest problem is that what counts as "too high" a learning rate changes with `n_neurons` (because of this magnitude-of-the-decoders issue). For example, running @hunse's script with `rate=5e-5` gives pretty much the same behaviour for different `n_neurons` with scaling, but as `n_neurons` gets larger without scaling you get horrible results.
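The effect being discussed can be sketched without Nengo in a few lines of NumPy. This is a toy stand-in, not the script behind the plots: the "firing rates" `a`, the constants, and the helper `final_error` are all made up for illustration, but the update has the PES-style form `Δd = κ · E · a` with an optional `1/n` scaling.

```python
import numpy as np

def final_error(n_neurons, scale, steps=50, kappa=0.3, seed=0):
    """Toy PES-style decoder learning toward a constant 1-D target.

    Returns the absolute decoding error after `steps` updates.
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.5, 1.5, size=n_neurons)   # fixed toy "firing rates"
    target = 1.0
    d = np.zeros(n_neurons)                     # decoders start at zero
    lr = kappa / n_neurons if scale else kappa  # the proposed 1/n scaling
    for _ in range(steps):
        error = target - d @ a
        d += lr * error * a                     # PES-style delta
    return abs(target - d @ a)
```

With `scale=True` the error shrinks at roughly the same rate for 10 or 1000 neurons; with `scale=False` and 1000 neurons the same nominal rate is effectively far too large and the learning blows up, matching the "horrible results" described in the thread.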
Are we also scaling the learning rate by `dt`? That also seems like something that would have a similar argument for it... Also, would this scaling be standard across all learning rules? Or just PES? (Note: I do think that if this starts getting complex and people are having differing intuitions, then we should go with the standard approach of not scaling at all and letting people handle it themselves, with documentation that suggests things like dropping the learning rate as you get more neurons.)
I feel pretty strongly against any hidden scaling of the learning rate. How would this be done? Maybe a parameter you could set to true would be more appropriate than doing it automatically? Otherwise it might be trying to do too much for the user automatically... and yeah, different learning rules, like Terry pointed out! It gets complicated!
When I said "doing too much" I meant assuming we know what the user is trying to do.
No; we have unit tests that test that learning rules do not change behavior when `dt` changes.
I dunno, @hunse's plots are pretty convincing, in my opinion. Setting learning rates is honestly stupid BS tweaking that people shouldn't spend the majority of their time doing, and yet they do. It is in no way intuitive to me that if I have more neurons, I need to tweak my learning rate. I think I have the most PES experience out of all of us and this never once occurred to me (though it is obvious now that it's been brought up).
This I'm not sure of. Certainly encoder learning rules (which we don't have now) wouldn't be scaled; for unsupervised rules I'm not sure if it makes sense; I could see both sides. I would say that the fact that it's not intuitive for all learning rules to be scaled is the only argument that I see for not doing the scaling.
Actually, yes, we are scaling the learning rate by `dt`.
The plots are great, but I really think it shouldn't be done by default. It's hidden magic. Setting the learning rate is something that's tweaky anyways, and it's only going to get harder to do that if there's some additional multiplication going on that you're not expecting.
Using this logic, we shouldn't be scaling by `dt` either.
My intuition also goes in the direction of scaling the learning rate, but I have to add the disclaimer that I have no experience with learning at all so far. Also, the argument about hidden magic would also apply to the scaling of the evaluation points (which actually caused us some trouble).
Yeah, I would enjoy not scaling by `dt`. Also, I think the only time that this kind of scaling wouldn't be applicable to keep the rate of convergence constant is in a learning rule that is non-local, right?
@studywolf, the way I would interpret it is that you're not actually scaling the learning rate; you're normalizing the update to the decoders.

@jgosmann, yeah, I haven't written down the math, but Eric gave a sound argument. Basically, if you double the number of neurons, then to achieve the same decoded value (assuming no error), you need to halve each decoder. Then halving each component of a vector also halves its length.

To be more concrete, I think this topic should be changed to just talking about PES, and whether the issue justifies adding a scalar to PES that will be set to normalize the deltas according to decoder length (which is indeed as simple as dividing by `n_neurons`).
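The halving argument can be checked directly with least squares. The sketch below builds a toy 1-D population of rectified-linear tuning curves (a made-up stand-in for an NEF ensemble; the gains, biases, and encoders are invented for illustration), solves for decoders, then duplicates every neuron: the minimum-norm least-squares decoders split the load evenly, so each one is exactly halved, which is why average decoder size goes as `1/n_neurons`.

```python
import numpy as np

# Toy 1-D population of rectified-linear tuning curves (hypothetical
# stand-in for an NEF ensemble; all parameters are made up).
rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, 1000)                      # evaluation points
gains = rng.uniform(0.5, 2.0, n)
biases = rng.uniform(-0.9, 0.9, n)
encoders = rng.choice([-1.0, 1.0], n)
A = np.maximum(0.0, gains * encoders * x[:, None] + biases)  # activities

# Least-squares decoders for the identity function.
d, *_ = np.linalg.lstsq(A, x, rcond=None)

# Duplicate every neuron; lstsq returns the minimum-norm solution for the
# now rank-deficient system, which assigns each copy exactly half.
d2, *_ = np.linalg.lstsq(np.hstack([A, A]), x, rcond=None)
```

So doubling the population halves every decoder, consistent with the argument above.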
@studywolf, it seems to me that those learning rules you're talking about have an iterative formulation, whereas the Nengo math is continuous in time and we only discretize it to be able to simulate it. I think, for two different, but sufficiently small, values of `dt`, the behaviour should be the same.
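The continuous-time point can be illustrated with a toy Euler integration (the function name, constants, and constant activities are all invented for this sketch): discretizing `d' = κ·E·a` as `Δd = κ·E·a·dt` makes the trajectory approximately independent of `dt`, which is what the `dt` scaling buys.

```python
import numpy as np

def decoded_after(dt, T=0.1, kappa=5.0, n=10):
    """Euler-integrate the continuous rule d' = kappa * E * a.

    Because the update is multiplied by dt, running to the same end
    time T gives (to first order) the same decoded value for any
    sufficiently small dt.
    """
    a = np.ones(n)          # constant toy activities
    d = np.zeros(n)
    target = 1.0
    for _ in range(int(round(T / dt))):
        error = target - d @ a
        d += kappa * error * a * dt   # the dt factor is the time scaling
    return d @ a
```

Halving `dt` doubles the number of steps but leaves the end state nearly unchanged; without the `dt` factor, the result would depend strongly on the step size.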
+1 to @arvoelke's rephrasing. Currently PES is `Δd = κ * dot(E, a)`; this would introduce a new term, so it becomes `Δd = κ * (1/n) * dot(E, a)`. This makes sense because the error, `E`, is affected more when more decoders are changing, so adding in the `1/n` scaling factor compensates for this (and minimizing the error is what the learning rule is supposed to do, not change decoders). As long as we update Nengo's PES descriptions to reflect this (and preferably any subsequent publications that include the PES equation) then there's no "hidden magic". Instead, it decouples the learning rate from `n_neurons`.

I would argue that all learning rules that modify decoders should do this (in their mathematical formulation, as well as their Nengo implementation). Our unsupervised rules don't modify decoders (they're specified in terms of weights, so whether this modifies encoders or decoders is up for debate), so they don't need to be modified.
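The rephrased rule is small enough to write out. The sketch below is not Nengo's implementation, just the equation from the comment above as a hypothetical helper; its key property is that the *decoded* effect of one update is independent of the population size.

```python
import numpy as np

def pes_delta(kappa, error, activities):
    """Proposed PES update: delta_d = kappa * (1/n) * E * a.

    The 1/n factor makes the decoded change d @ a per step depend
    only on kappa and the error, not on how many neurons there are.
    """
    n = len(activities)
    return (kappa / n) * error * activities
```

For example, with uniform activities, doubling the neuron count halves each per-decoder delta, so the total decoded change is the same either way.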
I'm fine with changing the PES rule to include the scaling term. I think the intuition @studywolf and I share is that the learning rule should act as advertised: if the rule says it applies a given learning rate, that is the rate that should actually be applied.
This same problem exists with straight connection weights. E.g., if you increase the number of presynaptic neurons, your average synaptic weight is going to go down (under an NEF initialization, or under most random initialization schemes). So I'm not sure that we can make a strong distinction there. But something like BCM is an example of a well-established theory that I don't think we should be going and changing behind the scenes.
To clarify, "I'm fine with changing the PES rule" sounds like I meant that begrudgingly. I think it's definitely a good idea; it's a good theoretical change to the rule that makes it more robust.
Oh yes, I'm all for transparency. In fact, we should change PES to be `Δd = -(κ/n) * dot(E, a) * dt`. The differential accounts for the time scaling, and the negative isn't there right now, but it's what I'm proposing in #642.
+1 using better math!
👍 I like it. I guess the one downside is that it makes the rule seem less local, in that each synapse has to "know" how many neurons are in the presynaptic population. But we can just make the point in any publications that you can roll that into the learning rate, so it's no more or less plausible than any other way of setting learning rates.
👍 🌈 handwavey arguments about plausibility 🌈 👍
Fixed in #642.
Amir noticed with the two-link arm that if you crank up the number of neurons, unintuitively the performance degrades.

What is going on is that the magnitude of each decoder gets smaller the more neurons you have, but the learning deltas remain the same. Applying the relatively larger delta makes the learning oscillate. Instead, I think the delta should be scaled by the average size of each decoder, which Eric and I believe is just inversely proportional to the number of neurons.

So I propose we scale the learning rate by dividing by `n_neurons`. This will also make the number "friendlier", since it is currently on the order of `1e-6` for 500 neurons, and so the change would make it `5e-4`.
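The arithmetic behind the "friendlier" number is just the proposed division. `effective_rate` is a hypothetical helper, not Nengo API; it shows that a user-facing rate of `5e-4` with 500 neurons reproduces the `1e-6` currently being set by hand.

```python
def effective_rate(learning_rate, n_neurons):
    """Hypothetical helper: the per-decoder rate under the proposed
    convention, where the user-facing rate is divided by n_neurons."""
    return learning_rate / n_neurons
```

Under the proposal, `effective_rate(5e-4, 500)` gives the `1e-6` from the issue description.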