Implement auto early exaggeration #220

Merged: 7 commits, merged on Dec 2, 2022

Conversation

@dkobak (Contributor) commented Nov 9, 2022:

Implements #218.

First, early_exaggeration="auto" is now set to max(12, exaggeration).

Second, the learning rate. Various functions currently take learning_rate="auto" and set it to max(200, N/12). I did not change those, because they usually do not know what the early exaggeration was. I only changed the behaviour of the base class: there, learning_rate="auto" is now set to max(200, N/early_exaggeration).
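
For reference, here is a minimal sketch of the resolution rules described above (the helper name and signature are hypothetical, not openTSNE API):

def resolve_auto_params(n_samples, exaggeration=None,
                        early_exaggeration="auto", learning_rate="auto"):
    # "auto" early exaggeration: 12, but never below the final exaggeration
    if early_exaggeration == "auto":
        early_exaggeration = max(12, exaggeration) if exaggeration is not None else 12
    # "auto" learning rate in the base class: max(200, N / early_exaggeration)
    if learning_rate == "auto":
        learning_rate = max(200, n_samples / early_exaggeration)
    return early_exaggeration, learning_rate

# Matches the verbose output below, e.g. N=10000 gives lr = 10000 / 12 ≈ 833.33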

This works as intended:

import numpy as np
from openTSNE import TSNE

X = np.random.randn(10000, 10)

TSNE(verbose=True).fit(X)
# Prints
# TSNE(early_exaggeration=12, verbose=True)
# Uses lr=833.33

TSNE(verbose=True, exaggeration=5).fit(X)
# Prints
# TSNE(early_exaggeration=12, exaggeration=5, verbose=True)
# Uses lr=833.33

TSNE(verbose=True, exaggeration=20).fit(X)
# Prints
# TSNE(early_exaggeration=20, exaggeration=20, verbose=True)
# Uses lr=500.00

(Note that the learning rate is currently not printed by repr(self) because it's kept as "auto" at construction time and only set later. That's also how we had it before.)

Review comment on openTSNE/tsne.py (outdated):

  gradient_descent_params = {
      "dof": self.dof,
      "negative_gradient_method": self.negative_gradient_method,
-     "learning_rate": self.learning_rate,
+     "learning_rate": learning_rate_now,
@pavlin-policar (Owner) commented:

I'm not sure this is the correct place to put this. This would set the learning rate for the duration of the whole embedding, not just the early exaggeration phase, right? I think the best place to put this might actually be inside the gradient_descent call, so the correct rescaling will happen during any call to .optimize, not just the standard TSNE().fit call.

I wasn't aware that this was an issue during the early exaggeration phase at all. Is it actually necessary to rescale the learning rate if the exaggeration isn't 1? In that case, wouldn't it also make sense to rescale it given any exaggeration value, not just during the early exaggeration phase?

@dkobak (Contributor, Author) commented Nov 9, 2022:

The current default openTSNE behaviour (which is the same in FIt-SNE and about to become the default in the soon-to-be-released sklearn 1.2) is to use learning rate N/12. Here 12 corresponds to the early exaggeration, the idea being that if learning_rate * exaggeration > N, then the optimization may diverge (as shown by Linderman & Steinerberger). One COULD use learning rate N, instead of N/12, for the 2nd optimization phase, once early exaggeration is turned off. But we do not do that. We could decide to do it, and I think it may even be a good idea, but that would make openTSNE's default behaviour different from other implementations. In any case, I think this requires a different PR and a separate issue, maybe comparing the current and suggested schemes on a range of datasets.

In this PR I assumed that we want to keep the learning rate constant throughout optimization. Which is why I set it to N/early_exaggeration and keep the same learning rate for the second phase.
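
(As a purely illustrative aside, the stability heuristic mentioned above can be written out like this; the function is hypothetical, not part of openTSNE:)

def lr_may_diverge(learning_rate, exaggeration, n_samples):
    # Linderman & Steinerberger: the optimization may diverge once
    # learning_rate * exaggeration exceeds the number of points,
    # which is why learning_rate = N / exaggeration sits right at the boundary.
    return learning_rate * exaggeration > n_samples

# e.g. N = 10000, early exaggeration 12, lr = N / 12:
# lr_may_diverge(10000 / 12, 12, 10000)  ->  False (at the stability limit)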

@pavlin-policar (Owner) commented:

Yes, that makes sense, I'd forgotten about that. However, I'm a bit unsure what the best way to implement this would be. We currently handle learning_rate="auto" in the _handle_nice_params function. However, this function is called in every .optimize step, so that we can perform whatever optimization sequence we want.

With the way you've implemented it now, we'd be bypassing this function, because we'd change the learning_rate from "auto" to a number before we even got to it. This would also cause inconsistent behaviour between

TSNE().fit()

and

embedding.optimize(...)
embedding.optimize(...)

since the first one would then use the "correct" learning rate, while the second would use N/12 in both cases. I am very much opposed to this inconsistency, but I'm not sure what the best way to handle this would be.
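
(For context, the two call patterns being compared look roughly like the following; this is only a sketch of the advanced-API sequence, and the iteration counts and momentum values are illustrative rather than prescriptive:)

import numpy as np
from openTSNE import TSNE, TSNEEmbedding, affinity, initialization

X = np.random.randn(1000, 10)

# High-level call: early exaggeration phase + standard phase in one go
embedding_a = TSNE().fit(X)

# Roughly equivalent low-level sequence with two explicit .optimize() calls
aff = affinity.PerplexityBasedNN(X)
embedding_b = TSNEEmbedding(initialization.pca(X), aff)
embedding_b = embedding_b.optimize(n_iter=250, exaggeration=12, momentum=0.5)
embedding_b = embedding_b.optimize(n_iter=500, momentum=0.8)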

Perhaps the best course of action would be to investigate rescaling the learning rate based on exaggeration, so we'd end up with a different learning rate for the early exaggeration and the standard optimization phase. I think this would best fit into our framework, and it seems the most principled approach.

@dkobak (Contributor, Author) commented:

I see your point. I will start a separate issue.

@pavlin-policar (Owner) commented:

To build on top of #220 (comment), I've run a quick experiment to see convergence rates using lr = N/exaggeration, and the results indicate that N/exaggeration may actually lead to better and faster convergence than simply using N/early_exaggeration throughout:

[image: KL divergence convergence comparison]

Visually, all four embeddings look pretty similar, and I wouldn't say any of them are less converged than the others, but it seems like we can get to the same KL divergence faster this way. What do you think?

@dkobak (Contributor, Author) commented Nov 9, 2022:

Does exag in this plot mean early exaggeration? And late exaggeration is set to 1? Is this MNIST?

@pavlin-policar (Owner) commented Nov 9, 2022:

Does exag in this plot mean early exaggeration?

Yes, the "regular" phase is run with exaggeration=1

Is this MNIST?

No, this is my typical macosko example.

This is one example, so we'd definitely have to check it on several more, but it does indicate that, at the very least, it wouldn't hurt. And if we were to implement N/exaggeration, this would fit much more cleanly into the entire openTSNE architecture. I wouldn't mind being slightly inconsistent with FIt-SNE or scikit-learn in this respect, since the visualizations seem indistinguishable from one another.

@dkobak (Contributor, Author) commented Nov 9, 2022:

I agree. I ran the same test on MNIST and observed faster convergence using the suggested approach.

Incidentally, I first did it wrong because I did not specify the correct momentum terms for the two optimize() calls. This made me realize that I am not aware of any reason why the momentum should differ between the two phases. I tried using momentum=0.8 for both stages, and it seems to be better than the current 0.5 -> 0.8 scheme.

Note that your criticism that TSNE().fit() and calling embedding.optimize(...) twice are not identical also applies to the momentum differences, no?

[image: optimization learning-rate comparison on MNIST]

@pavlin-policar (Owner) commented Nov 9, 2022:

Note that your criticism that TSNE().fit() and calling embedding.optimize(...) twice are not identical also applies to the momentum differences, no?

Yes, this is the same issue, and this has bothered me from the very start. So I'm very happy to see that using a single momentum seems to lead to faster convergence, as this would justify defaulting to 0.8 everywhere.

I'm not aware of any justification for this choice either; it came from the original LVM and I never bothered to question it.

I see similar behaviour on the macosko dataset:
[image: same comparison on the macosko dataset]

@dkobak (Contributor, Author) commented Nov 9, 2022:

It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.

It would be good to check this on a couple more datasets, perhaps also on very small ones (Iris?), but overall I think it looks good.

@pavlin-policar (Owner) commented:

It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.

Yes, I agree.

It would be good to check this on a couple more datasets, perhaps also on very small ones (Iris?), but overall I think it looks good.

I'd also check a couple of big ones, the cao one, maybe the 10x mouse as well. It might also be interesting to see if we actually need lr=200 on iris. Maybe lr=N=150 would be better. The 200 now seems kind of arbitrary.

@dkobak (Contributor, Author) commented Nov 9, 2022:

Iris:

[image: optimization learning-rate comparison on Iris]

I'd also check a couple of big ones, the cao one, maybe the 10x mouse as well. It might also be interesting to see if we actually need lr=200 on iris. Maybe lr=N=150 would be better. The 200 now seems kind of arbitrary.

Very good point. The red line shows that turning off the learning rate "clipping" (I mean the clipping to 200) actually works very well.

@pavlin-policar (Owner) commented:

That's great! I think we should test it for even smaller data sets, but this indicates that we can get rid of the 200 altogether.

@dkobak (Contributor, Author) commented Nov 9, 2022:

Are you going to run it on something with a sample size over 1 million? It sounds like you have everything set up for these experiments. But if you want, I can run something as well.

@pavlin-policar (Owner) commented:

Yes, sure, I'll find a few more data sets and run them. If everything goes well, we'll change the defaults to momentum=0.8 and lr=N/exaggeration, and this will solve all the issues outlined above.

@dkobak (Contributor, Author) commented Nov 9, 2022:

This may be worth bumping the version to 0.7!

@dkobak (Contributor, Author) commented Nov 22, 2022:

Have you had a chance to run it on some other datasets? Otherwise I would give it a try on something large; I am curious :)

@pavlin-policar (Owner) commented:

Hey Dmitry, no, unfortunately, I haven't had time yet. It's a busy semester for teaching :) If you have any benchmarks you'd like to run, I'd be happy to see the results.

@dkobak (Contributor, Author) commented Nov 23, 2022:

I just ran it in an identical way for Iris, MNIST, and the n=1.3 million dataset from 10x. I used uniform affinities with k=10 in all cases, to speed things up.

Old: current default
New: learning rate N/12 during early exaggeration, followed by learning rate N, momentum always 0.8, and learning rates below 200 allowed (see the code sketch after the list below)

[image: convergence comparison on Iris, MNIST, and the 10x dataset]

I think everything is very consistent:

  • Learning rate N makes convergence in the 2nd phase faster
  • Momentum 0.8 makes convergence in the 1st phase faster
  • Allowing learning rates below 200 for small datasets prevents fluctuations in the loss
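
(A rough sketch of what the "New" schedule above amounts to with the advanced openTSNE API; the affinity choice and iteration counts here are illustrative, not the exact benchmark setup:)

from openTSNE import TSNEEmbedding, affinity, initialization

def run_new_schedule(X, early_exaggeration=12):
    n_samples = X.shape[0]
    aff = affinity.PerplexityBasedNN(X)
    emb = TSNEEmbedding(initialization.pca(X), aff)
    # Early exaggeration phase: lr = N / 12, momentum 0.8
    emb = emb.optimize(n_iter=250, exaggeration=early_exaggeration,
                       learning_rate=n_samples / early_exaggeration, momentum=0.8)
    # Standard phase: lr = N (no clipping to 200), momentum 0.8
    emb = emb.optimize(n_iter=500, learning_rate=n_samples, momentum=0.8)
    return emb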

@pavlin-policar (Owner) commented:

I've also tried this on two other data sets: shekhar (20k) and cao (2mln):

shekhar (20k):
[image: convergence on shekhar]

cao (2mln):
[image: convergence on cao]

In both cases, using momentum=0.8 and learning_rate=N/exaggeration works better. Along with the other examples you provided, I feel this is sufficient to change the defaults.

Allowing learning rates below 200 for small datasets prevents fluctuations in the loss

I don't understand this entirely. So, can we use learning_rate=N/exaggeration in all cases? Or do we keep it as it is right now and use learning_rate=max(N/exaggeration, 200)?

@dkobak (Contributor, Author) commented Nov 26, 2022:

I don't understand this entirely. So, can we use learning_rate=N/exaggeration in all cases? Or do we keep it as it is right now and use learning_rate=max(N/exaggeration, 200)?

I now think that we should use learning_rate=N/exaggeration in all cases. For iris (N=150), this would mean a learning rate of 150/12 = 12.5 during the early exaggeration phase. Currently we use learning rate 200 during that phase, which is too high. It does not lead to divergence (not sure why; maybe due to some gradient or step size clipping?) but does lead to an oscillating loss, clearly suggesting that something is not right with the gradient descent. A learning rate of 12.5 seems much more stable.

@pavlin-policar (Owner) commented:

Yes, I agree. I tried it myself and there are big oscillations with the current learning rate; e.g., subsampling iris to 50 data points also shows less oscillation with learning_rate=N/exaggeration.

I think the best course of action is to add a learning_rate="auto" option, make that the default, and then handle it in _handle_nice_params, as I've written in the original code review.

@dkobak (Contributor, Author) commented Nov 26, 2022:

Do you want me to edit this PR, or do you prefer to make a new one?

I think the best course of action is to add a learning_rate="auto" option, make that the default, and then handle it in _handle_nice_params, as I've written in the original code review.

Is it okay to make _handle_nice_params take the exaggeration value as input? Currently it does not.

@pavlin-policar (Owner) commented:

I think adding it to this PR is completely fine.

Is it okay to make _handle_nice_params take the exaggeration value as input? Currently it does not.

I think it does. The exaggeration factor should come in through .optimize, and should be captured in the **gradient_descent_params. Then, this should be passed through here.

@dkobak (Contributor, Author) commented Nov 26, 2022:

But the exaggeration factor does not really feel like a "gradient descent parameter"... So I was reluctant to add it to gradient_descent_params. What about passing it into _handle_nice_params() as an additional separate input parameter? Like this:

def _handle_nice_params(embedding: np.ndarray, exaggeration: float, optim_params: dict) -> None:

Edit: sorry, misread your comment. If it is already passed in as you said, then there is of course no need to change it.

Edit2: but actually, looking at how gradient_descent_params is created, it seems that exaggeration is not included:

gradient_descent_params = {

@pavlin-policar (Owner) commented:

My understanding is that it is already included in the parameters passed to _handle_nice_params. Indeed, if I run a simple example and print the optim_params in _handle_nice_params, I get

{'learning_rate': 'auto', 'momentum': 0.5, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': 12, 'n_iter': 25}

for the EE phase, and

{'learning_rate': 'auto', 'momentum': 0.8, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': None, 'n_iter': 50}

for the standard phase. Importantly, exaggeration and the learning_rate are already among them. So I would imagine something like this would be totally fine:

learning_rate = optim_params["learning_rate"]
if learning_rate == "auto":
    exaggeration = optim_params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    learning_rate = n_samples / exaggeration
optim_params["learning_rate"] = learning_rate
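
(To make the effect of that rule concrete, a purely illustrative calculation with a hypothetical dataset size:)

n_samples = 10_000  # hypothetical
for params in ({"learning_rate": "auto", "exaggeration": 12},      # EE phase
               {"learning_rate": "auto", "exaggeration": None}):   # standard phase
    exaggeration = params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    print(params["exaggeration"], n_samples / exaggeration)
# -> 12 833.3333333333334   (early exaggeration phase)
# -> None 10000.0           (standard phase)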

@dkobak (Contributor, Author) commented Nov 27, 2022:

I see. Still not quite sure where it gets added to the dictionary, but it does not matter now.

I made the changes and added a test that checks whether running optimize() twice (with exaggeration 12 and then without) produces the same result as running fit() with default params.
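
(The test might look roughly like this; the names, sizes, and parameters below are hypothetical rather than the actual test added in the PR, and the equivalence relies on both paths resolving to the same defaults:)

import numpy as np
from openTSNE import TSNE, TSNEEmbedding, affinity, initialization

def test_fit_matches_two_optimize_calls():
    np.random.seed(0)
    X = np.random.randn(200, 5)

    # One-shot fit with default parameters
    emb_fit = TSNE(random_state=42).fit(X)

    # Manual two-phase optimization with the same defaults
    aff = affinity.PerplexityBasedNN(X, random_state=42)
    emb = TSNEEmbedding(initialization.pca(X, random_state=42), aff, random_state=42)
    emb = emb.optimize(n_iter=250, exaggeration=12)
    emb = emb.optimize(n_iter=500)

    np.testing.assert_allclose(np.asarray(emb_fit), np.asarray(emb))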

@pavlin-policar (Owner) commented:

I think this is fine now. I'm glad we found a way to simplify the API and speed up convergence at the same time :)

@pavlin-policar merged commit 5d829fe into pavlin-policar:master on Dec 2, 2022