Transforming new data points puts them on the outskirts of existing clusters ... #211

Open
phiweger opened this issue Mar 20, 2019 · 3 comments

@phiweger commented Mar 20, 2019

... but never "inside" of clusters.

The blue points in the figure are the "new" points, transformed into the existing embedding of the other (colored) points.

[Figure: UMAP embedding in which the new (blue) points sit on the periphery of the existing clusters rather than inside them]

Is there something I am doing wrong?

import umap

config = {
    'random_state': 42,
    'n_neighbors': 10,
    'n_components': 2,
    'metric': 'cosine',
    'spread': 2,
    'min_dist': 0.01,
}

trans = umap.UMAP(**config)
trans = trans.fit(a)                  # fit on some points
projection_a = trans.transform(a)     # project them
projection_b = trans.transform(b)     # project the "new" points

Thank you for your help!

@lmcinnes (Owner) commented Mar 21, 2019

I think this comes down to a combination of the cosine metric and the curse of dimensionality. There is nothing inherent in the UMAP transform that causes this (indeed, it often doesn't happen), but it can certainly happen when the curse of dimensionality pushes points into the corners of a hypercube, and hence almost always "on the outside" of the training data.
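
To make that concrete, here is a minimal sketch (not from the thread; synthetic Gaussian data standing in for `a` and `b`) showing how cosine distances concentrate as the dimension grows: a new point ends up roughly equally far from every training point, so there is no cluster "interior" for the transform to place it in.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    train = rng.normal(size=(500, d))  # stand-in for the fitted points `a`
    new = rng.normal(size=(50, d))     # stand-in for the "new" points `b`
    dists = cdist(new, train, metric='cosine')
    # As d grows, the mean approaches 1.0 and the spread shrinks:
    # every new point is nearly equidistant from all training points.
    print(f"d={d:4d}  mean={dists.mean():.3f}  std={dists.std():.3f}")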

@phiweger (Author) commented Mar 22, 2019

Why does the cosine metric influence this?

I could try normalizing the vectors and using Euclidean distance; what are your thoughts on this approach?

Thanks

@lmcinnes (Owner) commented Mar 22, 2019

I think cosine is only playing a small part here, but as an angular distance it makes some difference in curse-of-dimensionality-type problems. Normalizing won't change much, because you would just be using Euclidean distance as a proxy for angular distance, which won't resolve the issue. I think it is more accurate to say that, in some real sense, this is just how the data is.
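
For reference, a quick check (synthetic unit vectors, not the issue's data) of why normalizing doesn't change the picture: on unit-length vectors, squared Euclidean distance is just a rescaled cosine distance, ||u - v||^2 = 2 * (1 - cos(u, v)), so UMAP sees the same neighbor structure either way.

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 300))
u /= np.linalg.norm(u)   # normalize to unit length
v /= np.linalg.norm(v)

cosine_dist = 1.0 - np.dot(u, v)         # cosine distance between unit vectors
euclidean_sq = np.sum((u - v) ** 2)      # squared Euclidean distance
print(np.isclose(euclidean_sq, 2.0 * cosine_dist))  # True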
