Gains grow without bound as the neuron count is increased #1534
Comments
Any idea whether these effects would be exaggerated for learning networks? I've often found that learning models (even learning a communication channel) have a hard time accurately learning extreme values like this, so I'm curious whether this is the root cause.
I'm definitely liking #3 as the way to go... especially since it's such a small change to the code.
In my thesis I also observed that there is an increased distortion error close to the boundary of the radius. I attributed it to the effect that uniformly distributed evaluation points do not fully cover the unit hypersphere (because they only generate a "hyper-polygon"). But the problem described here probably contributes too; I'd be curious what part of the error can be attributed to each of these explanations. Also, this taken together with the discussion in #1243 and #1248 leads me to a rather radical proposal of a fourth possible solution (or might this be equivalent to proposal 2? that one seems to be underspecified on how it should be done): one should not specify the distributions of intercepts and max rates, but rather specify the distribution of gains directly.
Of course there are certain disadvantages:
I'd be curious how specifying a gain distribution would actually affect the error.
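For what it's worth, `nengo.Ensemble` already accepts `gain` and `bias` directly (in place of `max_rates`/`intercepts`), so a crude form of this proposal can be tried today; the particular distributions below are arbitrary placeholders, not a recommendation:

```python
import nengo

# Specify the gain/bias distributions directly instead of max_rates/intercepts.
# These particular distributions are arbitrary illustrations.
with nengo.Network() as model:
    ens = nengo.Ensemble(
        100, 1,
        gain=nengo.dists.Uniform(1, 10),
        bias=nengo.dists.Uniform(-10, 10),
    )
```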
I hadn't considered this, but definitely possible. Might be a good idea to codify all of these different observations into a series of benchmarks that we can use to assess different options.
I think a neuron parameter also makes sense, as you did in the code you shared with me:

```python
import numpy as np
import nengo
from nengo.exceptions import ValidationError


class LIFRateSafe(nengo.LIFRate):
    x_max = nengo.params.NumberParam('x_max')

    def __init__(self, x_max=1.1, tau_rc=0.02, tau_ref=0.002, amplitude=1):
        super().__init__(tau_rc=tau_rc, tau_ref=tau_ref, amplitude=amplitude)
        self.x_max = x_max

    def gain_bias(self, max_rates, intercepts):
        """Analytically determine gain, bias."""
        max_rates = np.array(max_rates, dtype=float, copy=False, ndmin=1)
        intercepts = np.array(intercepts, dtype=float, copy=False, ndmin=1)
        inv_tau_ref = 1. / self.tau_ref if self.tau_ref > 0 else np.inf
        if np.any(max_rates > inv_tau_ref):
            raise ValidationError("Max rates must be below the inverse "
                                  "refractory period (%0.3f)" % inv_tau_ref,
                                  attr='max_rates', obj=self)
        x = 1.0 / (1 - np.exp(
            (self.tau_ref - (1.0 / max_rates)) / self.tau_rc))
        # Anchoring the max-rate point at x_max > 1 keeps the gain bounded
        # as intercepts -> 1.
        gain = (1 - x) / (intercepts - self.x_max)
        bias = 1 - gain * intercepts
        return gain, bias
```

One way of going about option 2 that is equally simple is to replace the intercept in the gain calculation with a fixed anchor point:

```python
class LIFRateUnbiased(nengo.LIFRate):
    anchor = nengo.params.NumberParam('anchor')

    def __init__(self, anchor=-1, tau_rc=0.02, tau_ref=0.002, amplitude=1):
        super().__init__(tau_rc=tau_rc, tau_ref=tau_ref, amplitude=amplitude)
        self.anchor = anchor

    def gain_bias(self, max_rates, intercepts):
        """Analytically determine gain, bias."""
        max_rates = np.array(max_rates, dtype=float, copy=False, ndmin=1)
        intercepts = np.array(intercepts, dtype=float, copy=False, ndmin=1)
        inv_tau_ref = 1. / self.tau_ref if self.tau_ref > 0 else np.inf
        if np.any(max_rates > inv_tau_ref):
            raise ValidationError("Max rates must be below the inverse "
                                  "refractory period (%0.3f)" % inv_tau_ref,
                                  attr='max_rates', obj=self)
        x = 1.0 / (1 - np.exp(
            (self.tau_ref - (1.0 / max_rates)) / self.tau_rc))
        # The gain depends only on max_rates (through x), not on the intercept.
        gain = (1 - x) / (self.anchor - 1)
        bias = 1 - gain * intercepts
        return gain, bias
```

This is what the distribution of gains looks like in each case. Note that
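A minimal sketch for regenerating this kind of gain-versus-intercept comparison (assuming the `LIFRateSafe` and `LIFRateUnbiased` classes above are in scope; ensemble size and seed are arbitrary):

```python
import matplotlib.pyplot as plt
import nengo
import numpy as np

# Build one ensemble per neuron type and scatter gain against intercept.
neuron_types = {
    "LIFRate": nengo.LIFRate(),
    "LIFRateSafe": LIFRateSafe(),
    "LIFRateUnbiased": LIFRateUnbiased(),
}

fig, axes = plt.subplots(1, len(neuron_types), figsize=(12, 4), sharey=True)
for ax, (name, neuron_type) in zip(axes, neuron_types.items()):
    with nengo.Network(seed=0) as model:
        ens = nengo.Ensemble(1000, 1, neuron_type=neuron_type)
    with nengo.Simulator(model, progress_bar=None) as sim:
        pass
    ax.scatter(sim.data[ens].intercepts, sim.data[ens].gain, s=2)
    ax.set_title(name)
    ax.set_xlabel("Intercept")
    ax.set_yscale("log")
axes[0].set_ylabel("Gain")
fig.show()
```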
Right on point. I've been finding that this interacts with
Surprisingly, I've been finding that you need to be entirely in the Variant column in order to see an improvement. Any combination of parameters that has one or more Default choices still suffers in some way. I'm also seeing that
I'm thinking that the
Yay! It's encouraging that we were both thinking along the same lines. I gave a concrete instance of proposal 2 with `LIFRateUnbiased` above. Does my proposal help with the disadvantages you mentioned? This redefines the meaning of `max_rates`.
For what it's worth, here's the above (figure omitted). Likewise, here's a zoom-in of the graph from the previous post (figure omitted).
Interesting, this is sort of specifying the distribution of gains (as in my proposal), but via the
The 0.1 value is for spiking LIFs, and I was already aware that 0.01 works better for LIFRate, presumably due to the missing spiking noise. So I'm not yet fully convinced that we were over-regularizing because of the gains, but that might be an additional effect. By the way: maybe you'll find the benchmarking and plotting code in this notebook useful. It can give you nice plots of the error (separated into noise and distortion, though the former isn't that relevant*) across the representational space with CIs.
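If I'm reading this right, the 0.1 / 0.01 values are the least-squares regularization, which is set per connection via the solver (a sketch; I'm assuming `LstsqL2` here, whose default `reg` is 0.1):

```python
import nengo

# Assuming the values above are LstsqL2's `reg` parameter:
# 0.1 (the default) for spiking LIF, 0.01 reportedly better for LIFRate.
with nengo.Network() as model:
    a = nengo.Ensemble(100, 1, neuron_type=nengo.LIFRate())
    b = nengo.Ensemble(100, 1, neuron_type=nengo.LIFRate())
    nengo.Connection(a, b, solver=nengo.solvers.LstsqL2(reg=0.01))
```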
I accidentally stumbled into this issue again in the context of solvers with `weights=True`. Since each postsynaptic neuron becomes a separate target, the ones with larger gains become more difficult to fit in proportion. Notably, the intercept doesn't even need to become very large for the error to become large. With only 10 separate simulations of 100 postsynaptic neurons, the worst-case RMSE can become approximately 200 times worse from one trial to the next (depending on how [un]lucky the intercept placement is). On one hand this makes sense, because the targets and errors are varying by the same factor as the gain. However, an important observation is that the variance in the corresponding postsynaptic weights can then vary by a factor of 40,000 (or 200^2):

```python
import nengo
import numpy as np
import matplotlib.pyplot as plt
from nengo.builder.ensemble import get_activities
from nengo.utils.numpy import rmse

def trial(seed, solver=nengo.solvers.LstsqL2(weights=True),
          n_pre=50, n_post=100, test_size=1000):
    with nengo.Network(seed=seed) as model:
        x = nengo.Ensemble(n_pre, 1)
        y = nengo.Ensemble(n_post, 1)
        conn = nengo.Connection(x, y, solver=solver)
    with nengo.Simulator(model, progress_bar=None) as sim:
        pass
    assert conn.solver.weights

    # Evaluate the full-weight solution against the ideal encoded targets.
    eval_points = nengo.dists.Uniform(-1, 1).sample(test_size, 1, rng=sim.rng)
    A = get_activities(sim.data[x], x, eval_points)
    Y = sim.data[y].scaled_encoders.dot(eval_points.T)
    rmses = rmse(sim.data[conn].weights.dot(A.T), Y, axis=1)
    j = np.argmax(rmses)  # postsynaptic neuron with the worst fit
    return (
        sim.data[y].gain[j],
        sim.data[y].intercepts[j],
        np.mean(sim.data[conn].weights[j, :] ** 2),
        rmses[j],
    )


gains = []
intercepts = []
weights = []
errors = []
for i in range(10):
    g, c, w, e = trial(i)
    gains.append(g)
    intercepts.append(c)
    weights.append(w)
    errors.append(e)

print(np.max(weights) / np.min(weights), np.max(errors) / np.min(errors))

fig, ax = plt.subplots(1, 3, figsize=(16, 4), sharey=True)
ax[0].scatter(gains, errors)
ax[1].scatter(intercepts, errors)
ax[2].scatter(weights, errors)
ax[0].set_ylabel("RMSE")
ax[0].set_xlabel("Gain")
ax[1].set_xlabel("Intercept")
ax[2].set_xlabel("Var(w)")
fig.show()
```

Note: I am computing the RMSEs manually because of #1539 for `weights=True` connections.
I'm not convinced that having gains that are independent of intercepts is ideal. I think we need some sort of metric to measure the "effectiveness" of a neuron. One such metric would be to measure how much the RMSE increases if that neuron is left out. Of course, we want to do it in such a way that we also account for the noise of the neuron (neurons with lower firing rates are noisier), which is related to the decoder weight for the neuron that we regularize, so maybe this comes out in the wash with regularization. To make it even more complicated, we want to consider the neuron's effect on RMSE across the space of representable functions, not just for a particular one. All in all, this seems like a fair bit of work to do rigorously.

Intuitively, though, I think there is sense behind neurons with intercepts closer to 1 having larger gains. That way, they can fire at about the same rate as other neurons in that region, and contribute equally. If they have lower rates (as they do when gains are independent of intercepts), then they're going to be a lot noisier.

One limitation of your experiments above, @arvoelke, is that they don't account for this spike noise at all. They're all using the ideal rate curves, which ignore the noise we get when we switch to spikes (though of course the solvers do try to deal with it via regularization).
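As a crude sketch of the leave-one-out idea (an illustration only: identity target function, rate neurons, no spike noise, and a hypothetical helper name):

```python
import nengo
import numpy as np
from nengo.builder.ensemble import get_activities
from nengo.utils.numpy import rmse


def leave_one_out_effectiveness(n_neurons=50, seed=0, reg=0.1):
    """How much does the decoding RMSE for f(x) = x grow when neuron j is dropped?"""
    with nengo.Network(seed=seed) as model:
        ens = nengo.Ensemble(n_neurons, 1, neuron_type=nengo.LIFRate())
    with nengo.Simulator(model, progress_bar=None) as sim:
        pass

    eval_points = np.linspace(-1, 1, 500).reshape(-1, 1)
    A = get_activities(sim.data[ens], ens, eval_points)
    solver = nengo.solvers.LstsqL2(reg=reg)

    d_full, _ = solver(A, eval_points)
    base = rmse(A.dot(d_full), eval_points)

    effectiveness = np.zeros(n_neurons)
    for j in range(n_neurons):
        keep = np.arange(n_neurons) != j
        d_j, _ = solver(A[:, keep], eval_points)
        effectiveness[j] = rmse(A[:, keep].dot(d_j), eval_points) - base
    return effectiveness, sim.data[ens].gain, sim.data[ens].intercepts
```

Averaging this over a family of target functions (and adding spike noise) would be the harder, more rigorous version described above.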
I did come up with a normalized variant. You can see it in action in the https://github.com/nengo/nengo/tree/gain-bounds-norm branch. Unfortunately, I ran into some problems getting the generic version working:

```python
import numpy as np
import nengo
from nengo.exceptions import ValidationError


class LIFRateNorm(nengo.LIFRate):
    max_x = nengo.params.NumberParam('max_x')

    def __init__(self, max_x=1.0, tau_rc=0.02, tau_ref=0.002, amplitude=1):
        super().__init__(tau_rc=tau_rc, tau_ref=tau_ref, amplitude=amplitude)
        self.max_x = max_x

    def gain_bias(self, max_rates, intercepts):
        """Analytically determine gain, bias."""
        max_rates = np.array(max_rates, dtype=float, copy=False, ndmin=1)
        intercepts = np.array(intercepts, dtype=float, copy=False, ndmin=1)
        inv_tau_ref = 1. / self.tau_ref if self.tau_ref > 0 else np.inf
        if np.any(max_rates > inv_tau_ref):
            raise ValidationError("Max rates must be below the inverse "
                                  "refractory period (%0.3f)" % inv_tau_ref,
                                  attr='max_rates', obj=self)
        x = -1 / np.expm1((self.tau_ref - 1 / max_rates) / self.tau_rc)
        # == 1 when max_x == 1, 0.5 * max_x when max_x >> 1
        normalizer = 0.5 * self.max_x + 0.5
        gain = (x - 1) * normalizer / (self.max_x - intercepts)
        bias = 1 - gain * intercepts
        return gain, bias
```
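As a quick check of how the normalizer tames the extreme gains (a sketch, assuming the `LIFRateNorm` class above is in scope; the intercept and max rate are arbitrary):

```python
import numpy as np
import nengo

# Compare the gain assigned to an intercept very close to 1.
max_rates = np.array([300.0])
intercepts = np.array([0.999])

gain_default, _ = nengo.LIFRate().gain_bias(max_rates, intercepts)
gain_norm, _ = LIFRateNorm(max_x=1.1).gain_bias(max_rates, intercepts)
print(gain_default, gain_norm)  # the default gain is roughly 100x larger here
```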
We merged #1561, which addresses the most common cases by changing the default intercept distribution. But I would say this issue is not quite finished; it might be good to have the functionality in
Related Issues
Background Analysis
As the intercept of a neuron, `c`, approaches `1` from the left-hand side, the gain on the response curve `r` grows without bound. To make this precise, the curve must transition from `r(c) = 0` to `r(1) = m`, where `m` is the `max_rate`, which is by default sampled from `U[200, 400)`. This results in a secant with a slope of `(m - 0) / (1 - c)` between these two points. As `c -> 1` this slope goes off to infinity.

Let `n` be the number of neurons. The probability, `p`, of generating at least one uniform random variable from `[-1, 1)` that falls within the interval `[c, 1)` is:

`p = 1 - (1 - (1 - c) / 2)^n`

Solving for `c` and plugging this into the secant equation gives a slope of:

`m / (1 - c) = m / (2 * (1 - (1 - p)^(1 / n)))`
Example Manifestation

For example, given `20000` neurons there is a 99% chance that at least one of the ReLU's gains is on the order of 10^6 (see this by plugging `m = 300`, `p = 0.99`, `n = 20000` into Mathematica). This is validated by the following figure+code, which also demonstrates that this happens regardless of neuron model (although the analysis for the precise gain differs by a constant factor for models other than ReLU):
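A rough sketch along those lines (not the original code; the neuron models, ensemble sizes, and seeds below are arbitrary choices):

```python
import matplotlib.pyplot as plt
import nengo
import numpy as np

# Sample ensembles of increasing size and record the largest gain per model.
neuron_types = [nengo.RectifiedLinear(), nengo.LIFRate(), nengo.LIF()]
n_neurons = np.logspace(1, 4, 7).astype(int)  # 10 ... 10000

plt.figure()
for neuron_type in neuron_types:
    max_gains = []
    for seed, n in enumerate(n_neurons):
        with nengo.Network(seed=seed) as model:
            ens = nengo.Ensemble(int(n), 1, neuron_type=neuron_type)
        with nengo.Simulator(model, progress_bar=None) as sim:
            pass
        max_gains.append(np.max(sim.data[ens].gain))
    plt.loglog(n_neurons, max_gains, label=type(neuron_type).__name__)
plt.xlabel("Number of neurons")
plt.ylabel("Maximum gain")
plt.legend()
plt.show()
```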
Impact on Network Accuracy

The consequence is that the error shoots up at the end-points of the representation (+/- 1). The error in the tuning curves of some `LIFRate` neurons is shown below.

To illustrate the severity of this issue, we can repeat this a number of times while scaling `n` and evaluating the curves at different input values approaching the positive edge (`0.9`, `0.99`, `0.999`). As `u` becomes closer to `1`, the RMSE increases (on a log-y scale). The quality of the representation is roughly 10x worse at `1000` neurons on average. Similarly, `100` neurons at `u = 0.9` does about as well as `1000` neurons at `u = 0.99`.

This same effect happens for the negative edge, and the relative magnitude of the effect can become more or less pronounced by altering the magnitude of the L2-regularization. This also applies to the spiking case and to other neuron models, although `RectifiedLinear` does much better at constant and linear functions due to the shape of its response curve.
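A rough sketch of that kind of benchmark (not the original code; the target function, trial counts, and solver defaults are guesses):

```python
import nengo
import numpy as np
from nengo.builder.ensemble import get_activities
from nengo.utils.numpy import rmse


def edge_rmse(n, u, seed):
    """Error of decoding f(x) = x with n LIFRate neurons, probed at x = u."""
    with nengo.Network(seed=seed) as model:
        ens = nengo.Ensemble(n, 1, neuron_type=nengo.LIFRate())
        out = nengo.Node(size_in=1)
        conn = nengo.Connection(ens, out)
    with nengo.Simulator(model, progress_bar=None) as sim:
        pass
    x = np.array([[u]])
    a = get_activities(sim.data[ens], ens, x)
    return rmse(a.dot(sim.data[conn].weights.T), x)


for u in (0.9, 0.99, 0.999):
    for n in (100, 1000):
        errors = [edge_rmse(n, u, seed) for seed in range(5)]
        print(u, n, np.mean(errors))
```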
Why Didn't I Notice this Before?

- For all models except for `RectifiedLinear`, saturation effects kick in to handle this somewhat gracefully by producing a response curve that is essentially a step-function. (As an aside: if pure step-functions are useful response curves, then why have them only at +/- 1, and not at other points in the vector space?)
- The effect is most apparent when scaling up the number of neurons and measuring the error. Most of our models and tests do not analyze this scaling systematically to find the conditions where it might be falling apart.
- The effect is only evident for input values close to the radius. We often set the radius to be slightly larger than what we need. Moreover, we often add neurons until the network's function falls within specification, rather than decomposing the error into its absolute quantities in order to understand their relative contributions with respect to the input statistics.
- The effect is not deterministic, in the sense that individual trials can be better or worse at many different points within the input space; the edges aren't always worse than every other point. But this is still an issue, because it introduces a systematic bias in performance that can require 10x as many neurons on average if certain operating points are critical.
- Decoder regularization helps compensate for exploding gains to some extent, by regularizing the decoding weights that correspond to the neurons with high gains. However, this is not a principled solution, as L2-regularization assumes all neurons should be treated equally; the first plot reveals that they are not. Moreover, the ability of the optimization problem to be sensitive to these extreme slopes relies on the density of evaluation points near +/- 1.
Possible Solutions

Switching the default intercept distribution to `CosineSimilarity(d+2)` might solve this for `d > 1` (nengo/enhancement-proposals#10). An interim solution is to always pick the radius to be larger than you need.
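Concretely, the first option is just a different `intercepts` distribution on the ensemble (a sketch; `d` below is an arbitrary example dimensionality):

```python
import nengo

d = 3  # example dimensionality
with nengo.Network() as model:
    ens = nengo.Ensemble(100, d, intercepts=nengo.dists.CosineSimilarity(d + 2))
```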
The following possibilities would require a change to our definition of `max_rates`. It also remains to be seen whether any of them resolve the impact on network accuracy shown above without requiring some other change, such as extending evaluation points outside the radius.

1. Clip the gains at some reasonable maximum (see the sketch at the end of this section).
2. Balance the response curves such that gains are distributed independently of intercept (i.e., such that you would get a flat band in the first plot).
3. @tcstewar suggested shifting the anchor point for `max_rates` to `1 + eps`, where `eps` is some fudge factor (e.g., `eps = 0.1`).

Preventing intercepts from approaching 1 is a sub-optimal solution, because it is still helpful to get a dynamic range for input values close to +/- 1 that is comparable to the dynamic range for all other values. It's just that we don't want this dynamic range to approach infinite slope, since that introduces extreme sensitivity to the magnitude of that neuron's corresponding decoder (with respect to small changes in the input) in a way that is not captured by the optimization problem.
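For reference, proposal 1 could be as simple as clipping inside `gain_bias` (a sketch, not an endorsed implementation; the cap of `10 * max_rates` is arbitrary):

```python
import numpy as np
import nengo


class LIFRateClipped(nengo.LIFRate):
    def gain_bias(self, max_rates, intercepts):
        gain, bias = super().gain_bias(max_rates, intercepts)
        # Clip the gains at an arbitrary maximum, then recompute the bias so
        # that the intercepts are preserved. Note that clipping lowers the
        # realized max_rate for the affected neurons.
        gain = np.minimum(gain, 10 * np.asarray(max_rates, dtype=float))
        bias = 1 - gain * np.asarray(intercepts, dtype=float)
        return gain, bias
```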