Implement auto early exaggeration #220

Merged: 7 commits, Dec 2, 2022
Changes from 2 commits
37 changes: 26 additions & 11 deletions openTSNE/tsne.py
@@ -995,17 +995,20 @@ class TSNE(BaseEstimator):

learning_rate: Union[str, float]
The learning rate for t-SNE optimization. When ``learning_rate="auto"``
the appropriate learning rate is selected according to max(200, N / 12),
as determined in Belkina et al. "Automated optimized parameters for
T-distributed stochastic neighbor embedding improve visualization and
analysis of large datasets", 2019.
the appropriate learning rate is selected according to ``max(200,
N/early_exaggeration),`` as determined in Belkina et al. "Automated
optimized parameters for T-distributed stochastic neighbor embedding
improve visualization and analysis of large datasets", 2019.

early_exaggeration_iter: int
The number of iterations to run in the *early exaggeration* phase.

early_exaggeration: float
early_exaggeration: Union[str, float]
The exaggeration factor to use during the *early exaggeration* phase.
Typical values range from 12 to 32.
Typical values range from 4 to 32. When ``early_exaggeration="auto"``
early exaggeration factor defaults to 12, unless desired subsequent
exaggeration is higher, i.e.: ``early_exaggeration = max(12,
exaggeration)``.

n_iter: int
The number of iterations to run in the normal optimization regime.
@@ -1122,7 +1125,7 @@ def __init__(
perplexity=30,
learning_rate="auto",
early_exaggeration_iter=250,
early_exaggeration=12,
early_exaggeration="auto",
n_iter=500,
exaggeration=None,
dof=1,
@@ -1148,7 +1151,13 @@ def __init__(
self.n_components = n_components
self.perplexity = perplexity
self.learning_rate = learning_rate
self.early_exaggeration = early_exaggeration
if early_exaggeration == "auto":
if exaggeration is None:
self.early_exaggeration = 12
else:
self.early_exaggeration = max(12, exaggeration)
else:
self.early_exaggeration = early_exaggeration
self.early_exaggeration_iter = early_exaggeration_iter
self.n_iter = n_iter
self.exaggeration = exaggeration
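
As a quick illustration of what the new default means for callers (a usage sketch with made-up values, not part of the diff):

```python
from openTSNE import TSNE

# early_exaggeration="auto" resolves to 12 by default, or to the requested
# exaggeration when that is larger, i.e. max(12, exaggeration).
tsne_default = TSNE()                # early_exaggeration -> 12
tsne_strong = TSNE(exaggeration=20)  # early_exaggeration -> max(12, 20) == 20
print(tsne_default.early_exaggeration, tsne_strong.early_exaggeration)
```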
@@ -1201,7 +1210,7 @@ def fit(self, X=None, affinities=None, initialization=None):

Parameters
----------
X: Optional[np.ndarray}
X: Optional[np.ndarray]
The data matrix to be embedded.

affinities: Optional[openTSNE.affinity.Affinities]
@@ -1271,7 +1280,7 @@ def prepare_initial(self, X=None, affinities=None, initialization=None):

Parameters
----------
X: Optional[np.ndarray}
X: Optional[np.ndarray]
The data matrix to be embedded.

affinities: Optional[openTSNE.affinity.Affinities]
@@ -1395,11 +1404,17 @@ def prepare_initial(self, X=None, affinities=None, initialization=None):
raise ValueError(
f"Unrecognized initialization scheme `{initialization}`."
)

# Set the auto learning rate depending on the value of early exaggeration
if self.learning_rate == "auto":
learning_rate_now = max(200, n_samples / self.early_exaggeration)
else:
learning_rate_now = self.learning_rate

gradient_descent_params = {
"dof": self.dof,
"negative_gradient_method": self.negative_gradient_method,
"learning_rate": self.learning_rate,
"learning_rate": learning_rate_now,
Owner commented:

I'm not sure this is the correct place to put this. This would set the learning rate for the duration of the whole embedding, not just the early exaggeration phase, right? I think the best place to put this might actually be inside the gradient_descent function, so the correct rescaling will happen during any call to .optimize, not just the standard TSNE().fit call.

I wasn't aware that this was an issue during the early exaggeration phase at all. Is it actually necessary to rescale the learning rate if the exaggeration isn't 1? In that case, wouldn't it also make sense to rescale it given any exaggeration value, not just during the early exaggeration phase?

@dkobak (Contributor Author) commented on Nov 9, 2022:

The current default openTSNE behaviour (which is the same in FIt-SNE and about to become the default in the soon-to-be-released sklearn 1.2) is to use learning rate N/12. Here 12 corresponds to the early exaggeration, the idea being that if learning_rate * exaggeration > N, then the optimization may diverge (as shown by Linderman & Steinerberger). One COULD use learning rate N, instead of N/12, for the 2nd optimization phase, once early exaggeration is turned off. But we do not do it. We could decide to do it, and I think it may even be a good idea, but it would make openTSNE's default behaviour different from other implementations. In any case, I think this requires a different PR and a separate issue, maybe comparing the current and suggested schemes on a range of datasets.

In this PR I assumed that we want to keep the learning rate constant throughout optimization, which is why I set it to N/early_exaggeration and keep the same learning rate for the second phase.
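
For concreteness, here is a minimal standalone sketch of the rule described above, assuming the max(200, N / early_exaggeration) formula; the function name is just illustrative, not openTSNE API:

```python
def auto_learning_rate(n_samples, early_exaggeration=12):
    # eta = max(200, N / early_exaggeration): for large N this keeps
    # eta * early_exaggeration at roughly N, the divergence threshold from
    # Linderman & Steinerberger; 200 acts as a floor for small datasets.
    return max(200, n_samples / early_exaggeration)

# Illustrative dataset sizes:
print(auto_learning_rate(1_000_000))  # 83333.33..., i.e. N / 12
print(auto_learning_rate(1_000))      # 200, the floor kicks in
```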

Owner commented:

Yes, that makes sense, I'd forgotten about that. However, I'm a bit unsure what the best way to implement this would be. We currently handle learning_rate="auto" here in the _handle_nice_params function. However, this function is called in every .optimize step, so that we can perform whatever optimization sequence we want.

With the way you've implemented it now, we'd be bypassing this function, because the learning_rate would already have been changed from "auto" to a number before we ever got there. This would also cause inconsistent behaviour between

TSNE().fit()

and

embedding.optimize(...)
embedding.optimize(...)

since the first one would then use the "correct" learning rate, while the second would use N/12 in both cases. I am very much opposed to this inconsistency, but I'm not sure what the best way to handle this would be.

Perhaps the best course of action would be to investigate rescaling the learning rate based on exaggeration, so we'd end up with a different learning rate for the early exaggeration and the standard optimization phase. I think this would best fit into our framework, and it seems the most principled approach.
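
To sketch what that could look like (a hypothetical helper under the assumptions above, not the actual _handle_nice_params code), the learning rate would be resolved per .optimize call from the exaggeration active in that phase:

```python
def resolve_learning_rate(learning_rate, n_samples, exaggeration=None):
    # Hypothetical per-phase resolution of learning_rate="auto": scale by the
    # exaggeration that is active in the current optimization phase instead of
    # fixing a single value up front.
    if learning_rate != "auto":
        return learning_rate
    if exaggeration is None or exaggeration <= 1:
        return max(200, n_samples)
    return max(200, n_samples / exaggeration)

# For N = 120_000: early exaggeration phase (exaggeration=12) -> 10_000,
# standard phase (exaggeration=None) -> 120_000.
```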

@dkobak (Contributor Author) commented:

I see your point. I will start a separate issue.

# By default, use the momentum used in unexaggerated phase
"momentum": self.final_momentum,
# Barnes-Hut params