
Should we normalize learning rate by # neurons? #643

Closed
arvoelke opened this issue Feb 4, 2015 · 32 comments

Comments

@arvoelke
Contributor

arvoelke commented Feb 4, 2015

Amir noticed with the two-link arm that if you crank up the number of neurons, the performance unintuitively degrades.

What is going on is that the magnitude of each decoder gets smaller the more neurons you have, but the learning deltas remain the same. Applying the relatively larger delta makes the learning oscillate. Instead, I think the delta should be scaled by the average size of each decoder, which Eric and I believe is just inversely proportional to the number of neurons.

So I propose we scale the learning rate by dividing by n_neurons. This will also make the number "friendlier", since it is currently on the order of 1e-6 for 500 neurons, and so the change would make it 5e-4.
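
For concreteness, a minimal sketch of the proposed conversion (plain Python; effective_rate is a hypothetical helper, not Nengo API):

def effective_rate(user_rate, n_neurons):
    # Proposed: internally divide the user-visible rate by the number
    # of presynaptic neurons.
    return user_rate / n_neurons

# A rate of 1e-6 for 500 neurons today would be written as 5e-4 under
# the proposal, since 5e-4 / 500 == 1e-6.
assert abs(effective_rate(5e-4, 500) - 1e-6) < 1e-12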

@tcstewar
Contributor

tcstewar commented Feb 4, 2015

I'm cautiously in favour of this, as it'd be really nice to be able to just increase the number of neurons without worrying about reducing the learning rate at the same time. That said, I haven't worked with learning stuff much.....

@tbekolay
Member

tbekolay commented Feb 4, 2015

I also (cautiously) think this makes sense. Are we talking about pre or post's n_neurons? pre, I'm guessing? I'm also curious whether this holds for weight matrices too (for when we learn on transforms instead of decoders). It seems like it should, since the weights are just the decoders multiplied by the (unit-length) encoders, but I've never looked at the actual values for these things.

@jgosmann
Collaborator

jgosmann commented Feb 4, 2015

I looked at the average decoder size at one point and, if I recall correctly, they are indeed inversely proportional to the number of neurons.

@drasmuss
Member

drasmuss commented Feb 4, 2015

I'd like to see how well the scaling actually works. This effect is true in basically all neural learning rules, not just PES, but this is not a common solution. That makes me a bit worried that the fix isn't as simple as it seems. It's just kind of a feature of learning rates that they're fiddly, and need to be retuned whenever anything about the model changes.

I also think that there's a lot of potential for confusion with these under-the-hood changes, where you take the value the user sets and then modify it in a way that may not be obvious to them. Unless that modification is totally transparent (i.e., the learning rate scaling works perfectly), it can introduce some really irritating bugs.

So basically, if it is as easy as dividing by the number of neurons, and that totally makes the issue go away, then I'm all for it. But if it just kind of works, then it might do more harm than good.

@studywolf
Collaborator

Agreed, some sort of test showing that this actually addresses the problem properly would be great. I've never actually noticed this problem myself...


@tbekolay
Member

tbekolay commented Feb 4, 2015

For the record, I've never noticed this myself either. But I do end up having to change the learning rate in every case where I've used it, and if this means I no longer have to, I'm all for it.

@hunse
Collaborator

hunse commented Feb 4, 2015

A simple communication channel test shows that the scaling works fairly well:
[Figure: test_learningrate — subplots of input and learned output over 10 s for n_neurons = 10, 100, 500, 1000, with scale = False (left) and scale = True (right); produced by the script below.]

import numpy as np
import matplotlib.pyplot as plt

import nengo
from nengo.processes import WhiteNoise
nengo.log()


def run_model(n_neurons, scale):
    model = nengo.Network()

    rng = np.random.RandomState(9)

    rate = 1e-6  # baseline rate, tuned for 100 neurons
    if scale:
        # Scale inversely with n_neurons, relative to the 100-neuron baseline.
        rate *= 100. / n_neurons

    with model:
        dim = 1

        u = nengo.Node(output=WhiteNoise(60, high=5).f(d=dim, rng=rng))
        a = nengo.Ensemble(n_neurons, dim)
        b = nengo.Ensemble(100, dim)
        error = nengo.Ensemble(100, dim)

        nengo.Connection(u, a)
        nengo.Connection(a, error, transform=-1)  # error = b - a (decoded)
        nengo.Connection(b, error)

        # Learned connection: starts out decoding zero; PES uses the error
        # signal to learn the communication channel b ~= a.
        conn = nengo.Connection(a, b,
                                function=lambda x: np.zeros(dim),
                                learning_rule_type=nengo.PES(rate))
        nengo.Connection(error, conn.learning_rule)

        ap = nengo.Probe(a, synapse=0.01)
        bp = nengo.Probe(b, synapse=0.01)

    s = nengo.Simulator(model)
    s.run(10)

    t = s.trange()
    x = s.data[ap]
    y = s.data[bp]
    return t, x, y


plt.figure(1)
for i, n_neurons in enumerate([10, 100, 500, 1000]):
    for j, scale in enumerate([False, True]):
        plt.subplot(4, 2, 2*i + j + 1)
        t, x, y = run_model(n_neurons, scale)
        plt.plot(t, x)
        plt.plot(t, y)
        plt.title('n_neurons = %s, scale = %s' % (n_neurons, scale))

plt.show()

@drasmuss
Member

drasmuss commented Feb 4, 2015

Doesn't that show things being worse with scale=True?

@tbekolay
Member

tbekolay commented Feb 4, 2015

No, it shows that they're consistent. On the right, all the rows learn at the same rate; on the left, the effective rate of learning differs across rows even though the learning_rate parameter is the same.

@hunse
Collaborator

hunse commented Feb 4, 2015

Basically, when we scale the learning rate by the number of neurons, the rate of the actual learning is consistent (right column), whereas without the scaling it changes with the number of neurons (left column: 10 neurons shows little change, whereas 1000 neurons shows almost instant learning).

@jgosmann
Collaborator

jgosmann commented Feb 5, 2015

I looked at the average decoder size at one point and, if I recall correctly, they are indeed inversely proportional to the number of neurons.

I think I have a mathematical argument for this statement. If it's of interest, I can write it down tomorrow or so.

@drasmuss
Member

drasmuss commented Feb 5, 2015

Basically, when we scale the learning rate by the number of neurons, the rate of the actual learning is consistent (right column), whereas without the scaling it changes with the number of neurons (left column: 10 neurons shows little change, whereas 1000 neurons shows almost instant learning).

But given more neurons, you would expect it to learn more quickly. It just looks like we've counteracted the increased representational power by (needlessly) slowing down the learning. But it's also true that in this case it's likely hard to distinguish accuracy of learning from speed of learning, since the function is so simple. So maybe the accuracy is increasing with neuron number, and it's just hard to see. At the least it demonstrates that the simple linear scaling seems pretty accurate, which is good.

@tcstewar
Contributor

tcstewar commented Feb 5, 2015

But given more neurons, you would expect it to learn more quickly.

Interesting... my instinct is that with more neurons, it should learn at the same rate, or perhaps more slowly (as the learning interferes with each other).

I think the biggest problem is that what counts as "too high" a learning rate changes with n_neurons (because of this decoder-magnitude issue). For example, running @hunse's script with rate=5e-5 gives pretty much the same behaviour for different n_neurons with scaling, but as n_neurons gets larger without scaling you get horrible results.

@tcstewar
Contributor

tcstewar commented Feb 5, 2015

Are we also scaling the learning rate by dt? That also seems like something that would have a similar argument for it...

Also, would this scaling be standard across all learning rules? Or just PES?

(Note: I do think that if this starts getting complex and people are having differing intuitions, then we should go with the standard approach of not scaling at all and letting people handle it themselves, with documentation that suggests things like dropping the learning rate as you get more neurons.)

@studywolf
Collaborator

I feel pretty strongly against any hidden scaling of the learning rate. How would this be done? Maybe a parameter you could set to True would be more appropriate than doing it automatically? Otherwise it might be trying to do too much for the user automatically... and, yeah, different learning rules, like Terry pointed out! It gets complicated!

@studywolf
Collaborator

When I said "doing too much", I meant assuming we know what the user is trying to do.

@tbekolay
Member

tbekolay commented Feb 5, 2015

Are we also scaling the learning rate by dt?

No; we have unit tests that test that learning rules do not change behavior when dt is changed.

I feel pretty strongly against any hidden scaling of the learning rate

I dunno, @hunse's plots are pretty convincing, in my opinion. Setting learning rates is honestly stupid BS tweaking that people shouldn't spend the majority of their time doing, and yet they do. It is in no way intuitive to me that if I have more neurons, I need to tweak my learning rate -- I think I have the most PES experience out of all of us, and this never once occurred to me (though it is obvious now that it's been brought up).

would this scaling be standard across all learning rules

This I'm not sure of. Certainly encoder learning rules (which we don't have now) wouldn't be scaled; for unsupervised rules I'm not sure it makes sense -- I could see both sides. I would say that the fact that the scaling isn't intuitive for all learning rules is the only argument I see against doing it.

@hunse
Collaborator

hunse commented Feb 5, 2015

Are we also scaling the learning rate by dt?

No; we have unit tests that test that learning rules do not change behavior when dt is changed.

Actually, yes, we are scaling the learning rate by dt, which is why the behaviour is unchanged when dt is changed. Currently we're doing this scaling in all learning rules.
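
As a toy illustration of why the dt scaling makes behaviour dt-invariant (pure numpy sketch, not Nengo's implementation): halving dt halves each per-step delta but doubles the number of steps, so the total change over a fixed interval is unchanged for a constant error.

import numpy as np

def total_update(learning_rate, dt, t_total, error=1.0, activity=1.0):
    # The per-step delta is scaled by dt, as the learning rules do.
    n_steps = int(round(t_total / dt))
    return n_steps * learning_rate * dt * error * activity

# Same total change over 1 s regardless of the timestep.
assert np.isclose(total_update(1e-4, dt=0.001, t_total=1.0),
                  total_update(1e-4, dt=0.0005, t_total=1.0))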

@studywolf
Collaborator

The plots are great, but I really think it shouldn't be done by default. It's hidden magic. Setting the learning rate is something that's tweaky anyways, and it's only going to get harder to do that if there's some additional multiplication going on that you're not expecting.

@hunse
Collaborator

hunse commented Feb 5, 2015

The plots are great, but I really think it shouldn't be done by default. It's hidden magic. Setting the learning rate is something that's tweaky anyways, and it's only going to get harder to do that if there's some additional multiplication going on that you're not expecting.

Using this logic, we shouldn't be scaling by dt either. I think which way is "normal" depends a lot on what your expectations are. If you know a lot about learning, then you might expect that the learning rate is not scaled by dt or n_neurons, and it might surprise you if it is. On the other hand, if you don't really understand how learning works, then it might surprise you that changing the number of neurons changes the "rate of convergence" (the amount of time for the decoders to get close to their steady-state solution).

I would argue that this rate of convergence is really the learning rate, as in the rate that the system learns at. And I think this rate should stay the same as the number of neurons or dt changes. If we have to change the parameter that we call the learning rate with dt and n_neurons to do that, so be it, because that learning rate has no physical intuition, whereas the rate of convergence does. So I say keep the real learning rate constant!

@jgosmann
Collaborator

jgosmann commented Feb 5, 2015

My intuition also goes in the direction of scaling the learning rate, but I have to add the disclaimer that I have no experience with learning at all so far. Also, the hidden-magic argument would apply to the scaling of the evaluation points too (which actually caused us some trouble).

@studywolf
Collaborator

Yeah, I would enjoy not scaling by dt either. If I have a learning rule I understand and have implemented elsewhere, and then I see it in nengo with rates that are way different because it's being scaled behind the scenes by various parameters, that's confusing. So I guess it's the difference between a 'training wheels' learning rule that black-boxes some parameter setting to make it easier to use, and one where, if you know the math, you don't have to try to figure out why it's not doing what you think it should.

Also, I think the only case where this kind of scaling wouldn't keep the rate of convergence constant is a learning rule that is non-local, right?

@arvoelke
Contributor Author

arvoelke commented Feb 5, 2015

@studywolf, the way I would interpret it is that you're not actually scaling the learning_rate so much as modifying the PES rule to include a scalar that depends on the size of your decoders. Then you're setting that scalar to normalize your deltas, making the actual rate of change (as well as the stability) of PES mostly independent of the number of neurons, which I assume is a desired property. So it's only magic insofar as PES is magic. Everything is a leaky abstraction, after all.

@jgosmann, Yeah I haven't written down the math, but Eric gave a sound argument. Basically, if you double the number of neurons, then to achieve the same decoded value (assuming no error), you need to halve each decoder. Then halving each component of a vector also halves its length.
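
For anyone who wants to check this empirically without Nengo, a rough numpy sketch (random rectified-linear responses standing in for tuning curves, least-squares decoders; all names here are illustrative): the mean absolute decoder shrinks roughly as 1/n.

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, 200)

for n in [50, 100, 200, 400]:
    # Crude stand-in for tuning curves: rectified random linear responses.
    gains = rng.uniform(0.5, 2.0, size=n)
    biases = rng.uniform(-1.0, 1.0, size=n)
    encoders = rng.choice([-1.0, 1.0], size=n)
    A = np.maximum(0, gains * np.outer(x, encoders) + biases)  # (200, n)
    # Least-squares decoders for the identity function: x_hat = A.dot(d).
    d, _, _, _ = np.linalg.lstsq(A, x, rcond=None)
    print(n, np.abs(d).mean())  # falls roughly as 1/n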


To be more concrete, I think this topic should be narrowed to just PES, and to whether the issue justifies adding a scalar to PES that is set to normalize the deltas according to decoder length (which is indeed as simple as dividing by n, and in turn makes \delta\hat{x} independent of n). I wouldn't expect learning rates to have any reasonably consistent interpretation across different learning rules anyway.

@jgosmann
Collaborator

jgosmann commented Feb 5, 2015

@studywolf, it seems to me that the learning rules you're talking about have an iterative formulation, whereas the Nengo math is continuous in time and we only discretize it to be able to simulate it. I think that for two different, but sufficiently small, values of dt the behavior should be the same.

@tbekolay
Member

tbekolay commented Feb 5, 2015

+1 to @arvoelke's rephrasing. Currently PES is Dd = kappa * dot(E, a); this would introduce a new factor, so it becomes Dd = kappa * 1/n * dot(E, a). This makes sense because the error, E, is affected more when more decoders are changing, so the 1/n factor compensates for that (and minimizing the error is what the learning rule is supposed to do, not changing decoders).
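
In code form, the change is a single extra factor in the delta (a sketch of the rule's math with illustrative names, not Nengo's internals):

import numpy as np

def pes_delta(kappa, error, activities, dt):
    # Old: delta_d = kappa * dt * outer(error, activities).
    # New: divide by the number of presynaptic neurons, so the size of
    # the decoded correction is independent of n.
    n = activities.shape[0]
    return (kappa / n) * dt * np.outer(error, activities)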

As long as we update Nengo's PES descriptions to reflect this (and preferably any subsequent publications that put in the PES equation) then there's no "hidden magic". Instead, it decouples the learning rate from n_neurons, which have no obvious reason to be coupled.

I would argue that all learning rules that modify decoders should do this (in their mathematical formulation, as well as their Nengo implementation). Our unsupervised rules don't modify decoders (they're specified in terms of weights, so whether this modifies encoders or decoders is up for debate) so they don't need to be modified.

@drasmuss
Member

drasmuss commented Feb 5, 2015

I'm fine with changing the PES rule to include the scaling term. I think the intuition @studywolf and I share is that the learning rule should act as advertised: if the learning rule says it's kappa * dot(E, a), it should do that, and not modify kappa in some secret way. Modifying the theory of the learning rule itself to scale with the number of neurons is different, and makes a lot more sense to me.

I would argue that all learning rules that modify decoders should do this (in their mathematical formulation, as well as their Nengo implementation). Our unsupervised rules don't modify decoders (they're specified in terms of weights, so whether this modifies encoders or decoders is up for debate) so they don't need to be modified.

This same problem exists with straight connection weights. E.g., if you increase the number of presynaptic neurons, your average synaptic weight is going to go down (under an NEF initialization, or under most random initialization schemes). So I'm not sure that we can make a strong distinction there. But something like BCM is an example of a well-established theory that I don't think we should be changing behind the scenes.
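
For what it's worth, the same scaling falls out of the NEF weight decomposition (a sketch, assuming the 1/n decoder scaling argued above):

w_ij = alpha_i * dot(e_i, d_j)   # full weight: gain * encoder . decoder
# |e_i| = 1 and mean |d_j| ~ 1/n_pre, so mean |w_ij| ~ 1/n_pre.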

@drasmuss
Member

drasmuss commented Feb 5, 2015

To clarify, "I'm fine with changing the PES rule" sounds like I meant that begrudgingly. I think it's definitely a good idea, it's a good theoretical change to the rule that makes it more robust.

@hunse
Collaborator

hunse commented Feb 5, 2015

Oh yes, I'm all for transparency. In fact, we should change PES to be

d(d_i)/dt = -(kappa / n) * E * a_i

The differential accounts for the time scaling, and the negative isn't there right now, but it's what I'm proposing in #642.

@tbekolay
Member

tbekolay commented Feb 5, 2015

+1 for using better math!

@drasmuss
Member

drasmuss commented Feb 5, 2015

👍 I like it. I guess the one downside is that it makes the rule seem less local, in that each synapse has to "know" how many neurons are in the presynaptic population. But we can make the point in any publications that you can roll that into the learning rate, so it's no more or less plausible than any other way of setting learning rates.

@tbekolay
Member

tbekolay commented Feb 5, 2015

👍 🌈 handwavey arguments about plausibility 🌈 👍

@tbekolay tbekolay added this to the 2.1.0 release milestone Mar 3, 2015
@arvoelke
Contributor Author

Fixed in #642.
