
Should we normalize learning rate by # neurons? #643

Closed
arvoelke opened this issue Feb 4, 2015 · 32 comments

Comments

@arvoelke
Contributor

arvoelke commented Feb 4, 2015

Amir noticed with the two-link arm that if you crank up the number of neurons, the performance unintuitively degrades.

What is going on is that the magnitude of each decoder gets smaller the more neurons you have, but the learning deltas remain the same. Applying the relatively larger delta makes the learning oscillate. Instead, I think the delta should be scaled by the average size of each decoder, which Eric and I believe is just inversely proportional to the number of neurons.

So I propose we scale the learning rate by dividing by n_neurons. This will also make the number "friendlier", since it is currently on the order of 1e-6 for 500 neurons, and so the change would make it 5e-4.
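
For concreteness, a minimal sketch of the proposed conversion (plain Python; effective_rate is a hypothetical helper, not Nengo API):

def effective_rate(user_rate, n_neurons):
    # Proposed: internally divide the user-visible rate by the number
    # of presynaptic neurons.
    return user_rate / n_neurons

# A rate of 1e-6 for 500 neurons today would be written as 5e-4 under
# the proposal, since 5e-4 / 500 == 1e-6.
assert abs(effective_rate(5e-4, 500) - 1e-6) < 1e-12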

@tcstewar
Contributor

tcstewar commented Feb 4, 2015

I'm cautiously in favour of this, as it'd be really nice to be able to just increase the number of neurons without worrying about reducing the learning rate at the same time. That said, I haven't worked with learning stuff much.....

@tbekolay
Member

tbekolay commented Feb 4, 2015

I also (cautiously) think this makes sense. Are we talking about pre or post's n_neurons? pre, I'm guessing? I'm also curious whether this holds for weight matrices too (for when we learn on transforms instead of decoders). It seems like it should, since the weights are just the decoders multiplied by the (unit-length) encoders, but I've never looked at the actual values for these things.

@jgosmann
Collaborator

jgosmann commented Feb 4, 2015

I looked at the average decoder size at one point and, if I recall correctly, they are indeed inversely proportional to the number of neurons.

@drasmuss
Member

drasmuss commented Feb 4, 2015

I'd like to see how well the scaling actually works. This effect is true in basically all neural learning rules, not just PES, but this is not a common solution. That makes me a bit worried that the fix isn't as simple as it seems. It's just kind of a feature of learning rates that they're fiddly, and need to be retuned whenever anything about the model changes.

I also think that there's a lot of potential for confusion with these under-the-hood changes, where you take the value the user sets and then modify it in a way that may not be obvious to them. Unless that modification is totally transparent (i.e., the learning rate scaling works perfectly), it can introduce some really irritating bugs.

So basically, if it is as easy as dividing by the number of neurons, and that totally makes the issue go away, then I'm all for it. But if it just kind of works, then it might do more harm than good.

@studywolf
Collaborator

Agreed, some sort of test showing that this actually addresses the problem properly would be great. I've never actually noticed this problem myself...


@tbekolay
Member

tbekolay commented Feb 4, 2015

For the record, I've never noticed this myself either. But I do end up having to change the learning rate in every case where I've used it, and if this means I no longer have to, I'm all for it.

@hunse
Collaborator

hunse commented Feb 4, 2015

A simple communication channel test shows that the scaling works fairly well:
[Figure: test_learningrate — subplots of input and learned output over 10 s for n_neurons = 10, 100, 500, 1000, with scale = False (left) and scale = True (right); produced by the script below.]

import numpy as np
import matplotlib.pyplot as plt

import nengo
from nengo.processes import WhiteNoise
nengo.log()


def run_model(n_neurons, scale):
    model = nengo.Network()

    rng = np.random.RandomState(9)

    rate = 1e-6  # baseline rate, tuned for 100 neurons
    if scale:
        # Scale inversely with n_neurons, relative to the 100-neuron baseline.
        rate *= 100. / n_neurons

    with model:
        dim = 1

        u = nengo.Node(output=WhiteNoise(60, high=5).f(d=dim, rng=rng))
        a = nengo.Ensemble(n_neurons, dim)
        b = nengo.Ensemble(100, dim)
        error = nengo.Ensemble(100, dim)

        nengo.Connection(u, a)
        nengo.Connection(a, error, transform=-1)  # error = b - a (decoded)
        nengo.Connection(b, error)

        # Learned connection: starts out decoding zero; PES uses the error
        # signal to learn the communication channel b ~= a.
        conn = nengo.Connection(a, b,
                                function=lambda x: np.zeros(dim),
                                learning_rule_type=nengo.PES(rate))
        nengo.Connection(error, conn.learning_rule)

        ap = nengo.Probe(a, synapse=0.01)
        bp = nengo.Probe(b, synapse=0.01)

    s = nengo.Simulator(model)
    s.run(10)

    t = s.trange()
    x = s.data[ap]
    y = s.data[bp]
    return t, x, y


plt.figure(1)
for i, n_neurons in enumerate([10, 100, 500, 1000]):
    for j, scale in enumerate([False, True]):
        plt.subplot(4, 2, 2*i + j + 1)
        t, x, y = run_model(n_neurons, scale)
        plt.plot(t, x)
        plt.plot(t, y)
        plt.title('n_neurons = %s, scale = %s' % (n_neurons, scale))

plt.show()

@drasmuss
Member

drasmuss commented Feb 4, 2015

Doesn't that show things being worse with scale=True?

@tbekolay
Member

tbekolay commented Feb 4, 2015

No, it shows that they're consistent. On the right, all the rows learn at the same rate; on the left, the effective rate of learning differs across rows even though the learning_rate parameter is the same.

@hunse
Collaborator

hunse commented Feb 4, 2015

Basically, when we scale the learning rate by the number of neurons, the rate of the actual learning is consistent (right column), whereas without the scaling it changes with the number of neurons (left column: 10 neurons shows little change, whereas 1000 neurons shows almost instant learning).

@jgosmann
Collaborator

jgosmann commented Feb 5, 2015

I looked at the average decoder size at one point and, if I recall correctly, they are indeed inversely proportional to the number of neurons.

I think I have a mathematical argument for this statement. If it's of interest, I can write it down tomorrow or so.

@drasmuss
Member

drasmuss commented Feb 5, 2015

Basically, when we scale the learning rate by the number of neurons, the rate of the actual learning is consistent (right column), whereas without the scaling it changes with the number of neurons (left column: 10 neurons shows little change, whereas 1000 neurons shows almost instant learning).

But given more neurons, you would expect it to learn more quickly. It just looks like we've counteracted the increased representational power by (needlessly) slowing down the learning. But it's also true that in this case it's likely hard to distinguish accuracy of learning from speed of learning, since the function is so simple. So maybe the accuracy is increasing with neuron number, and it's just hard to see. At the least it demonstrates that the simple linear scaling seems pretty accurate, which is good.

@tcstewar
Contributor

tcstewar commented Feb 5, 2015

But given more neurons, you would expect it to learn more quickly.

Interesting... my instinct is that with more neurons, it should learn at the same rate, or perhaps more slowly (as the learning interferes with each other).

I think the biggest problem is that what counts as "too high" a learning rate changes with n_neurons (because of this decoder-magnitude issue). For example, running @hunse's script with rate=5e-5 gives pretty much the same behaviour for different n_neurons with scaling, but as n_neurons gets larger without scaling you get horrible results.

@tcstewar
Contributor

tcstewar commented Feb 5, 2015

Are we also scaling the learning rate by dt? That also seems like something that would have a similar argument for it...

Also, would this scaling be standard across all learning rules? Or just PES?

(Note: I do think that if this starts getting complex and people are having differing intuitions, then we should go with the standard approach of not scaling at all and letting people handle it themselves, with documentation that suggests things like dropping the learning rate as you get more neurons.)

@studywolf
Collaborator

I feel pretty strongly against any hidden scaling of the learning rate. How would this be done? Maybe a parameter you could set to True would be more appropriate than doing it automatically? Otherwise it might be trying to do too much for the user automatically... and, yeah, different learning rules, like Terry pointed out! It gets complicated!

@studywolf
Collaborator

When I said "doing too much", I meant assuming we know what the user is trying to do.

@tbekolay
Member

tbekolay commented Feb 5, 2015

Are we also scaling the learning rate by dt?

No; we have unit tests that test that learning rules do not change behavior when dt is changed.

I feel pretty strongly against any hidden scaling of the learning rate

I dunno, @hunse's plots are pretty convincing, in my opinion. Setting learning rates is honestly stupid BS tweaking that people shouldn't spend the majority of their time doing, and yet they do. It is in no way intuitive to me that if I have more neurons, I need to tweak my learning rate -- I think I have the most PES experience out of all of us, and this never once occurred to me (though it is obvious now that it's been brought up).

would this scaling be standard across all learning rules

This I'm not sure of. Certainly encoder learning rules (which we don't have now) wouldn't be scaled; for unsupervised rules I'm not sure it makes sense -- I could see both sides. I would say that the fact that the scaling isn't intuitive for all learning rules is the only argument I see against doing it.

@hunse
Collaborator

hunse commented Feb 5, 2015

Are we also scaling the learning rate by dt?

No; we have unit tests that test that learning rules do not change behavior when dt is changed.

Actually, yes, we are scaling the learning rate by dt, which is why the behaviour is unchanged when dt is changed. Currently we're doing this scaling in all learning rules.
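
As a toy illustration of why the dt scaling makes behaviour dt-invariant (pure numpy sketch, not Nengo's implementation): halving dt halves each per-step delta but doubles the number of steps, so the total change over a fixed interval is unchanged for a constant error.

import numpy as np

def total_update(learning_rate, dt, t_total, error=1.0, activity=1.0):
    # The per-step delta is scaled by dt, as the learning rules do.
    n_steps = int(round(t_total / dt))
    return n_steps * learning_rate * dt * error * activity

# Same total change over 1 s regardless of the timestep.
assert np.isclose(total_update(1e-4, dt=0.001, t_total=1.0),
                  total_update(1e-4, dt=0.0005, t_total=1.0))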

@studywolf
Collaborator

The plots are great, but I really think it shouldn't be done by default. It's hidden magic. Setting the learning rate is something that's tweaky anyways, and it's only going to get harder to do that if there's some additional multiplication going on that you're not expecting.

@hunse
Collaborator

hunse commented Feb 5, 2015

The plots are great, but I really think it shouldn't be done by default. It's hidden magic. Setting the learning rate is something that's tweaky anyways, and it's only going to get harder to do that if there's some additional multiplication going on that you're not expecting.

Using this logic, we shouldn't be scaling by dt either. I think which way is "normal" depends a lot on what your expectations are. If you know a lot about learning, then you might expect that the learning rate is not scaled by dt or n_neurons, and it might surprise you if it is. On the other hand, if you don't really understand how learning works, then it might surprise you that changing the number of neurons changes the "rate of convergence" (the amount of time for the decoders to get close to their steady-state solution).

I would argue that this rate of convergence is really the learning rate, as in the rate that the system learns at. And I think this rate should stay the same as the number of neurons or dt changes. If we have to change the parameter that we call the learning rate with dt and n_neurons to do that, so be it, because that learning rate has no physical intuition, whereas the rate of convergence does. So I say keep the real learning rate constant!

@jgosmann
Collaborator

jgosmann commented Feb 5, 2015

My intuition also goes in the direction of scaling the learning rate, but I have to add the disclaimer that I have no experience with learning at all so far. Also, the hidden-magic argument would apply to the scaling of the evaluation points too (which actually caused us some trouble).

@studywolf
Collaborator

Yeah, I would enjoy not scaling by dt either. If I have a learning rule I understand and have implemented elsewhere, and then I see it in nengo with rates that are way different because it's being scaled behind the scenes by various parameters, that's confusing. So I guess it's the difference between a 'training wheels' learning rule that black-boxes some parameter setting to make it easier to use, and one where, if you know the math, you don't have to try to figure out why it's not doing what you think it should.

Also, I think the only case where this kind of scaling wouldn't keep the rate of convergence constant is a learning rule that is non-local, right?

@arvoelke
Contributor Author

arvoelke commented Feb 5, 2015

@studywolf, the way I would interpret it is that you're not actually scaling the learning_rate so much as modifying the PES rule to include a scalar that depends on the size of your decoders. Then you're setting that scalar to normalize your deltas, making the actual rate of change (as well as the stability) of PES mostly independent of the number of neurons, which I assume is a desired property. So it's only magic insofar as PES is magic. Everything is a leaky abstraction, after all.

@jgosmann, Yeah I haven't written down the math, but Eric gave a sound argument. Basically, if you double the number of neurons, then to achieve the same decoded value (assuming no error), you need to halve each decoder. Then halving each component of a vector also halves its length.
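
For anyone who wants to check this empirically without Nengo, a rough numpy sketch (random rectified-linear responses standing in for tuning curves, least-squares decoders; all names here are illustrative): the mean absolute decoder shrinks roughly as 1/n.

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, 200)

for n in [50, 100, 200, 400]:
    # Crude stand-in for tuning curves: rectified random linear responses.
    gains = rng.uniform(0.5, 2.0, size=n)
    biases = rng.uniform(-1.0, 1.0, size=n)
    encoders = rng.choice([-1.0, 1.0], size=n)
    A = np.maximum(0, gains * np.outer(x, encoders) + biases)  # (200, n)
    # Least-squares decoders for the identity function: x_hat = A.dot(d).
    d, _, _, _ = np.linalg.lstsq(A, x, rcond=None)
    print(n, np.abs(d).mean())  # falls roughly as 1/n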


To be more concrete, I think this topic should be narrowed to just PES, and to whether the issue justifies adding a scalar to PES that is set to normalize the deltas according to decoder length (which is indeed as simple as dividing by n, and in turn makes \delta\hat{x} independent of n). I wouldn't expect learning rates to have any reasonably consistent interpretation across different learning rules anyway.

@jgosmann
Collaborator

jgosmann commented Feb 5, 2015

@studywolf, it seems to me that the learning rules you're talking about have an iterative formulation, whereas the Nengo math is continuous in time and we only discretize it to be able to simulate it. I think that for two different, but sufficiently small, values of dt the behavior should be the same.

@tbekolay
Member

tbekolay commented Feb 5, 2015

+1 to @arvoelke's rephrasing. Currently PES is Dd = kappa * dot(E, a); this would introduce a new factor, so it becomes Dd = kappa * 1/n * dot(E, a). This makes sense because the error, E, is affected more when more decoders are changing, so the 1/n factor compensates for that (and minimizing the error is what the learning rule is supposed to do, not changing decoders).
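
In code form, the change is a single extra factor in the delta (a sketch of the rule's math with illustrative names, not Nengo's internals):

import numpy as np

def pes_delta(kappa, error, activities, dt):
    # Old: delta_d = kappa * dt * outer(error, activities).
    # New: divide by the number of presynaptic neurons, so the size of
    # the decoded correction is independent of n.
    n = activities.shape[0]
    return (kappa / n) * dt * np.outer(error, activities)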

As long as we update Nengo's PES descriptions to reflect this (and preferably any subsequent publications that put in the PES equation) then there's no "hidden magic". Instead, it decouples the learning rate from n_neurons, which have no obvious reason to be coupled.

I would argue that all learning rules that modify decoders should do this (in their mathematical formulation, as well as their Nengo implementation). Our unsupervised rules don't modify decoders (they're specified in terms of weights, so whether this modifies encoders or decoders is up for debate) so they don't need to be modified.

@drasmuss
Member

drasmuss commented Feb 5, 2015

I'm fine with changing the PES rule to include the scaling term. I think the intuition @studywolf and I share is that the learning rule should act as advertised: if the learning rule says it's kappa * dot(E, a), it should do that, and not modify kappa in some secret way. Modifying the theory of the learning rule itself to scale with the number of neurons is different, and makes a lot more sense to me.

I would argue that all learning rules that modify decoders should do this (in their mathematical formulation, as well as their Nengo implementation). Our unsupervised rules don't modify decoders (they're specified in terms of weights, so whether this modifies encoders or decoders is up for debate) so they don't need to be modified.

This same problem exists with straight connection weights. E.g., if you increase the number of presynaptic neurons, your average synaptic weight is going to go down (under an NEF initialization, or under most random initialization schemes). So I'm not sure that we can make a strong distinction there. But something like BCM is an example of a well-established theory that I don't think we should be changing behind the scenes.
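
For what it's worth, the same scaling falls out of the NEF weight decomposition (a sketch, assuming the 1/n decoder scaling argued above):

w_ij = alpha_i * dot(e_i, d_j)   # full weight: gain * encoder . decoder
# |e_i| = 1 and mean |d_j| ~ 1/n_pre, so mean |w_ij| ~ 1/n_pre.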

@drasmuss
Member

drasmuss commented Feb 5, 2015

To clarify, "I'm fine with changing the PES rule" sounds like I meant that begrudgingly. I think it's definitely a good idea, it's a good theoretical change to the rule that makes it more robust.

@hunse
Collaborator

hunse commented Feb 5, 2015

Oh yes, I'm all for transparency. In fact, we should change PES to be

d(d_i)/dt = -(kappa / n) * E * a_i

The differential accounts for the time scaling, and the negative isn't there right now, but it's what I'm proposing in #642.

@tbekolay
Member

tbekolay commented Feb 5, 2015

+1 for using better math!

@drasmuss
Member

drasmuss commented Feb 5, 2015

👍 I like it. I guess the one downside is that it makes the rule seem less local, in that each synapse has to "know" how many neurons are in the presynaptic population. But we can make the point in any publications that you can roll that into the learning rate, so it's no more or less plausible than any other way of setting learning rates.

@tbekolay
Member

tbekolay commented Feb 5, 2015

👍 🌈 handwavey arguments about plausibility 🌈 👍

@tbekolay tbekolay added this to the 2.1.0 release milestone Mar 3, 2015
@arvoelke
Contributor Author

Fixed in #642.
