Implement auto early exaggeration #220

Merged: 7 commits, merged on Dec 2, 2022

Conversation

@dkobak (Contributor) commented Nov 9, 2022:

Implements #218.

First, early_exaggeration="auto" is now set to max(12, exaggeration).

Second, the learning rate. Various functions currently take learning_rate="auto" and set it to max(200, N/12). I did not change those, because they usually do not know what the early exaggeration was. I only changed the behaviour of the base class: there, learning_rate="auto" is now set to max(200, N/early_exaggeration).
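
For reference, here is a minimal sketch of the resolution rules described above (the helper name and signature are hypothetical, not openTSNE API):

def resolve_auto_params(n_samples, exaggeration=None,
                        early_exaggeration="auto", learning_rate="auto"):
    # "auto" early exaggeration: 12, but never below the final exaggeration
    if early_exaggeration == "auto":
        early_exaggeration = max(12, exaggeration) if exaggeration is not None else 12
    # "auto" learning rate in the base class: max(200, N / early_exaggeration)
    if learning_rate == "auto":
        learning_rate = max(200, n_samples / early_exaggeration)
    return early_exaggeration, learning_rate

# Matches the verbose output below, e.g. N=10000 gives lr = 10000 / 12 ≈ 833.33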

This works as intended:

import numpy as np
from openTSNE import TSNE

X = np.random.randn(10000, 10)

TSNE(verbose=True).fit(X)
# Prints
# TSNE(early_exaggeration=12, verbose=True)
# Uses lr=833.33

TSNE(verbose=True, exaggeration=5).fit(X)
# Prints
# TSNE(early_exaggeration=12, exaggeration=5, verbose=True)
# Uses lr=833.33

TSNE(verbose=True, exaggeration=20).fit(X)
# Prints
# TSNE(early_exaggeration=20, exaggeration=20, verbose=True)
# Uses lr=500.00

(Note that the learning rate is currently not printed by repr(self) because it's kept as "auto" at construction time and only set later. That's also how we had it before.)

Review comment on openTSNE/tsne.py (outdated):

  gradient_descent_params = {
      "dof": self.dof,
      "negative_gradient_method": self.negative_gradient_method,
-     "learning_rate": self.learning_rate,
+     "learning_rate": learning_rate_now,
@pavlin-policar (Owner) commented:

I'm not sure this is the correct place to put this. This would set the learning rate for the duration of the whole embedding, not just the early exaggeration phase, right? I think the best place to put this might actually be inside the gradient_descent call, so the correct rescaling will happen during any call to .optimize, not just the standard TSNE().fit call.

I wasn't aware that this was an issue during the early exaggeration phase at all. Is it actually necessary to rescale the learning rate if the exaggeration isn't 1? In that case, wouldn't it also make sense to rescale it given any exaggeration value, not just during the early exaggeration phase?

@dkobak (Contributor, Author) commented Nov 9, 2022:

The current default openTSNE behaviour (which is the same in FIt-SNE and about to become the default in the soon-to-be-released sklearn 1.2) is to use learning rate N/12. Here 12 corresponds to the early exaggeration, the idea being that if learning_rate * exaggeration > N, then the optimization may diverge (as shown by Linderman & Steinerberger). One COULD use learning rate N, instead of N/12, for the 2nd optimization phase, once early exaggeration is turned off. But we do not do that. We could decide to do it, and I think it may even be a good idea, but that would make openTSNE's default behaviour different from other implementations. In any case, I think this requires a different PR and a separate issue, maybe comparing the current and suggested schemes on a range of datasets.

In this PR I assumed that we want to keep the learning rate constant throughout optimization. Which is why I set it to N/early_exaggeration and keep the same learning rate for the second phase.
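
(As a purely illustrative aside, the stability heuristic mentioned above can be written out like this; the function is hypothetical, not part of openTSNE:)

def lr_may_diverge(learning_rate, exaggeration, n_samples):
    # Linderman & Steinerberger: the optimization may diverge once
    # learning_rate * exaggeration exceeds the number of points,
    # which is why learning_rate = N / exaggeration sits right at the boundary.
    return learning_rate * exaggeration > n_samples

# e.g. N = 10000, early exaggeration 12, lr = N / 12:
# lr_may_diverge(10000 / 12, 12, 10000)  ->  False (at the stability limit)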

@pavlin-policar (Owner) commented:

Yes, that makes sense, I'd forgotten about that. However, I'm a bit unsure what the best way to implement this would be. We currently handle learning_rate="auto" in the _handle_nice_params function. However, this function is called in every .optimize step, so that we can perform whatever optimization sequence we want.

With the way you've implemented it now, we'd be bypassing this function, because we'd change the learning_rate from "auto" to a number before we even got to it. This would also cause inconsistent behaviour between

TSNE().fit()

and

embedding.optimize(...)
embedding.optimize(...)

since the first one would then use the "correct" learning rate, while the second would use N/12 in both cases. I am very much opposed to this inconsistency, but I'm not sure what the best way to handle this would be.
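
(For context, the two call patterns being compared look roughly like the following; this is only a sketch of the advanced-API sequence, and the iteration counts and momentum values are illustrative rather than prescriptive:)

import numpy as np
from openTSNE import TSNE, TSNEEmbedding, affinity, initialization

X = np.random.randn(1000, 10)

# High-level call: early exaggeration phase + standard phase in one go
embedding_a = TSNE().fit(X)

# Roughly equivalent low-level sequence with two explicit .optimize() calls
aff = affinity.PerplexityBasedNN(X)
embedding_b = TSNEEmbedding(initialization.pca(X), aff)
embedding_b = embedding_b.optimize(n_iter=250, exaggeration=12, momentum=0.5)
embedding_b = embedding_b.optimize(n_iter=500, momentum=0.8)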

Perhaps the best course of action would be to investigate rescaling the learning rate based on exaggeration, so we'd end up with a different learning rate for the early exaggeration and the standard optimization phase. I think this would best fit into our framework, and it seems the most principled approach.

@dkobak (Contributor, Author) commented:

I see your point. I will start a separate issue.

@pavlin-policar (Owner) commented:

To build on top of #220 (comment), I've run a quick experiment to see convergence rates using lr = N/exaggeration, and the results indicate that N/exaggeration may actually lead to better and faster convergence than simply using N/early_exaggeration throughout:

[image: KL divergence convergence comparison]

Visually, all four embeddings look pretty similar, and I wouldn't say any of them are less converged than the others, but it seems like we can get to the same KL divergence faster this way. What do you think?

@dkobak (Contributor, Author) commented Nov 9, 2022:

Does exag in this plot mean early exaggeration? And late exaggeration is set to 1? Is this MNIST?

@pavlin-policar (Owner) commented Nov 9, 2022:

Does exag in this plot mean early exaggeration?

Yes, the "regular" phase is run with exaggeration=1

Is this MNIST?

No, this is my typical macosko example.

This is one example, so we'd definitely have to check it on several more, but it does indicate that, at the very least, it wouldn't hurt. And if we were to implement N/exaggeration, this would fit much more cleanly into the entire openTSNE architecture. I wouldn't mind being slightly inconsistent with FIt-SNE or scikit-learn in this respect, since the visualizations seem indistinguishable from one another.

@dkobak (Contributor, Author) commented Nov 9, 2022:

I agree. I ran the same test on MNIST and observed faster convergence using the suggested approach.

Incidentally, I first did it wrong because I did not specify the correct momentum terms for the two optimize() calls. This made me realize that I am not aware of any reason why the momentum should differ between the two phases. I tried using momentum=0.8 for both stages, and it seems to be better than the current 0.5 -> 0.8 scheme.

Note that your criticism that TSNE().fit() and calling embedding.optimize(...) twice are not identical also applies to the momentum differences, no?

[image: optimization learning-rate comparison on MNIST]

@pavlin-policar (Owner) commented Nov 9, 2022:

Note that your criticism that TSNE().fit() and calling embedding.optimize(...) twice are not identical also applies to the momentum differences, no?

Yes, this is the same issue, and this has bothered me from the very start. So I'm very happy to see that using a single momentum seems to lead to faster convergence, as this would justify defaulting to 0.8 everywhere.

I'm not aware of any justification for this choice either; it came from the original LVM and I never bothered to question it.

I see similar behaviour on the macosko dataset:
[image: same comparison on the macosko dataset]

@dkobak (Contributor, Author) commented Nov 9, 2022:

It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.

It would be good to check this on a couple more datasets, perhaps also on very small ones (Iris?), but overall I think it looks good.

@pavlin-policar (Owner) commented:

It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.

Yes, I agree.

It would be good to check this on a couple more datasets, perhaps also on very small ones (Iris?), but overall I think it looks good.

I'd also check a couple of big ones, the cao one, maybe the 10x mouse as well. It might also be interesting to see if we actually need lr=200 on iris. Maybe lr=N=150 would be better. The 200 now seems kind of arbitrary.

@dkobak (Contributor, Author) commented Nov 9, 2022:

Iris:

[image: optimization learning-rate comparison on Iris]

I'd also check a couple of big ones, the cao one, maybe the 10x mouse as well. It might also be interesting to see if we actually need lr=200 on iris. Maybe lr=N=150 would be better. The 200 now seems kind of arbitrary.

Very good point. The red line shows that turning off the learning rate "clipping" (I mean the clipping to 200) actually works very well.

@pavlin-policar (Owner) commented:

That's great! I think we should test it for even smaller data sets, but this indicates that we can get rid of the 200 altogether.

@dkobak (Contributor, Author) commented Nov 9, 2022:

Are you going to run it on something with a sample size over 1 million? It sounds like you have everything set up for these experiments. But if you want, I can run something as well.

@pavlin-policar (Owner) commented:

Yes, sure, I'll find a few more data sets and run them. If everything goes well, we'll change the defaults to momentum=0.8 and lr=N/exaggeration, and this will solve all the issues outlined above.

@dkobak (Contributor, Author) commented Nov 9, 2022:

This may be worth bumping the version to 0.7!

@dkobak (Contributor, Author) commented Nov 22, 2022:

Have you had a chance to run it on some other datasets? Otherwise I would give it a try on something large; I am curious :)

@pavlin-policar (Owner) commented:

Hey Dmitry, no, unfortunately, I haven't had time yet. It's a busy semester for teaching :) If you have any benchmarks you'd like to run, I'd be happy to see the results.

@dkobak (Contributor, Author) commented Nov 23, 2022:

I just ran it in an identical way for Iris, MNIST, and the n=1.3 million dataset from 10x. I used uniform affinities with k=10 in all cases, to speed things up.

Old: current default
New: learning rate N/12 during early exaggeration, followed by learning rate N, momentum always 0.8, and learning rates below 200 allowed (see the code sketch after the list below)

[image: convergence comparison on Iris, MNIST, and the 10x dataset]

I think everything is very consistent:

  • Learning rate N makes convergence in the 2nd phase faster
  • Momentum 0.8 makes convergence in the 1st phase faster
  • Allowing learning rates below 200 for small datasets prevents fluctuations in the loss
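
(A rough sketch of what the "New" schedule above amounts to with the advanced openTSNE API; the affinity choice and iteration counts here are illustrative, not the exact benchmark setup:)

from openTSNE import TSNEEmbedding, affinity, initialization

def run_new_schedule(X, early_exaggeration=12):
    n_samples = X.shape[0]
    aff = affinity.PerplexityBasedNN(X)
    emb = TSNEEmbedding(initialization.pca(X), aff)
    # Early exaggeration phase: lr = N / 12, momentum 0.8
    emb = emb.optimize(n_iter=250, exaggeration=early_exaggeration,
                       learning_rate=n_samples / early_exaggeration, momentum=0.8)
    # Standard phase: lr = N (no clipping to 200), momentum 0.8
    emb = emb.optimize(n_iter=500, learning_rate=n_samples, momentum=0.8)
    return emb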

@pavlin-policar (Owner) commented:

I've also tried this on two other data sets: shekhar (20k) and cao (2mln):

shekhar (20k):
[image: convergence on shekhar]

cao (2mln):
[image: convergence on cao]

In both cases, using momentum=0.8 and learning_rate=N/exaggeration works better. Along with the other examples you provided, I feel this is sufficient to change the defaults.

Allowing learning rates below 200 for small datasets prevents fluctuations in the loss

I don't understand this entirely. So, can we use learning_rate=N/exaggeration in all cases? Or do we keep it as it is right now and use learning_rate=max(N/exaggeration, 200)?

@dkobak (Contributor, Author) commented Nov 26, 2022:

I don't understand this entirely. So, can we use learning_rate=N/exaggeration in all cases? Or do we keep it as it is right now and use learning_rate=max(N/exaggeration, 200)?

I now think that we should use learning_rate=N/exaggeration in all cases. For iris (N=150), this would mean a learning rate of 150/12 = 12.5 during the early exaggeration phase. Currently we use learning rate 200 during that phase, which is too high. It does not lead to divergence (not sure why; maybe due to some gradient or step size clipping?) but does lead to an oscillating loss, clearly suggesting that something is not right with the gradient descent. A learning rate of 12.5 seems much more stable.

@pavlin-policar (Owner) commented:

Yes, I agree. I tried it myself and there are big oscillations with the current learning rate; e.g., subsampling iris to 50 data points also shows less oscillation with learning_rate=N/exaggeration.

I think the best course of action is to add a learning_rate="auto" option, make that the default, and then handle it in _handle_nice_params, as I've written in the original code review.

@dkobak (Contributor, Author) commented Nov 26, 2022:

Do you want me to edit this PR, or do you prefer to make a new one?

I think the best course of action is to add a learning_rate="auto" option, make that the default, and then handle it in _handle_nice_params, as I've written in the original code review.

Is it okay to make _handle_nice_params take the exaggeration value as input? Currently it does not.

@pavlin-policar (Owner) commented:

I think adding it to this PR is completely fine.

Is it okay to make _handle_nice_params take the exaggeration value as input? Currently it does not.

I think it does. The exaggeration factor should come in through .optimize, and should be captured in the **gradient_descent_params. Then, this should be passed through here.

@dkobak (Contributor, Author) commented Nov 26, 2022:

But the exaggeration factor does not really feel like a "gradient descent parameter"... So I was reluctant to add it to gradient_descent_params. What about passing it into _handle_nice_params() as an additional separate input parameter? Like this:

def _handle_nice_params(embedding: np.ndarray, exaggeration: float, optim_params: dict) -> None:

Edit: sorry, misread your comment. If it is already passed in as you said, then there is of course no need to change it.

Edit2: but actually, looking at how gradient_descent_params is created, it seems that exaggeration is not included:

gradient_descent_params = {

@pavlin-policar (Owner) commented:

My understanding is that it is already included in the parameters passed to _handle_nice_params. Indeed, if I run a simple example and print the optim_params in _handle_nice_params, I get

{'learning_rate': 'auto', 'momentum': 0.5, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': 12, 'n_iter': 25}

for the EE phase, and

{'learning_rate': 'auto', 'momentum': 0.8, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': None, 'n_iter': 50}

for the standard phase. Importantly, exaggeration and the learning_rate are already among them. So I would imagine something like this would be totally fine:

learning_rate = optim_params["learning_rate"]
if learning_rate == "auto":
    exaggeration = optim_params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    learning_rate = n_samples / exaggeration
optim_params["learning_rate"] = learning_rate
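
(To make the effect of that rule concrete, a purely illustrative calculation with a hypothetical dataset size:)

n_samples = 10_000  # hypothetical
for params in ({"learning_rate": "auto", "exaggeration": 12},      # EE phase
               {"learning_rate": "auto", "exaggeration": None}):   # standard phase
    exaggeration = params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    print(params["exaggeration"], n_samples / exaggeration)
# -> 12 833.3333333333334   (early exaggeration phase)
# -> None 10000.0           (standard phase)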

@dkobak (Contributor, Author) commented Nov 27, 2022:

I see. Still not quite sure where it gets added to the dictionary, but it does not matter now.

I made the changes and added a test that checks whether running optimize() twice (with exaggeration 12 and then without) produces the same result as running fit() with default params.
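
(The test might look roughly like this; the names, sizes, and parameters below are hypothetical rather than the actual test added in the PR, and the equivalence relies on both paths resolving to the same defaults:)

import numpy as np
from openTSNE import TSNE, TSNEEmbedding, affinity, initialization

def test_fit_matches_two_optimize_calls():
    np.random.seed(0)
    X = np.random.randn(200, 5)

    # One-shot fit with default parameters
    emb_fit = TSNE(random_state=42).fit(X)

    # Manual two-phase optimization with the same defaults
    aff = affinity.PerplexityBasedNN(X, random_state=42)
    emb = TSNEEmbedding(initialization.pca(X, random_state=42), aff, random_state=42)
    emb = emb.optimize(n_iter=250, exaggeration=12)
    emb = emb.optimize(n_iter=500)

    np.testing.assert_allclose(np.asarray(emb_fit), np.asarray(emb))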

@pavlin-policar (Owner) commented:

I think this is fine now. I'm glad we found a way to simplify the API and speed up convergence at the same time :)

@pavlin-policar merged commit 5d829fe into pavlin-policar:master on Dec 2, 2022