umap_transform uses a different distance metric if loaded in #117

Closed
mdrnao opened this issue Jan 24, 2024 · 5 comments

@mdrnao

mdrnao commented Jan 24, 2024

Hi - firstly thanks for an excellent package!

I am currently using umap with correlation as the distance metric, then saving it for future use. However, when I use umap_transform with the saved umap model, and ret_extra = "nn", I find that it reports cosine as the distance metric. When I use the transform function on the umap model without saving and reloading in between, correlation is the reported NN metric.

For the "fresh" umap model:
model$metric 'correlation'
model$nn_index$metric 'correlation'

for the loaded in version:
model$metric 'correlation'
model$nn_index$metric 'cosine'

I noticed in the load_uwot function you've hard coded if (metric == "correlation") {annoy_metric <- "cosine"} and I was curious why?

Thanks,
Holly

@jlmelville
Owner

Thanks for the report, there might be a bug here, but I'll need to do some checking.

Just to follow up on your last question now: if you use the correlation distance, then the underlying Annoy calculation uses the cosine distance. This is because the correlation distance is equivalent to the cosine distance after mean-centering each row. So the annoy_metric bit is just an implementation detail.
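As a quick sanity check of that equivalence, here is a minimal standalone sketch in plain Python. It is not uwot or Annoy code: the helper names and the two example rows are made up for illustration, and it assumes the convention that a distance is 1 minus the corresponding similarity, which may not match how uwot or Annoy scale the values they report.

```python
import math


def mean_center(row):
    """Subtract the row mean from every element of the row."""
    m = sum(row) / len(row)
    return [x - m for x in row]


def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def correlation_distance(a, b):
    """1 - Pearson correlation, computed from the textbook formula."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


# Two hypothetical rows of data.
a = [1.0, 2.0, 4.0, 7.0]
b = [2.0, 1.0, 5.0, 6.0]

d_corr = correlation_distance(a, b)
d_cos_centered = cosine_distance(mean_center(a), mean_center(b))
d_cos_raw = cosine_distance(a, b)

print("correlation distance:        ", d_corr)
print("cosine distance (centered):  ", d_cos_centered)
print("cosine distance (uncentered):", d_cos_raw)
```

The first two values agree to floating-point precision, while the uncentered cosine distance is a different number. So the cosine metric only reproduces correlation when the row means are subtracted first.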

@mdrnao
Author

mdrnao commented Jan 24, 2024

Thanks for the swift reply and clarification! If there is anything I can do to assist, let me know.

I just double-checked the NN idx and dist output using the fresh and loaded-in versions of the same model in the transform function, and I do get the same indexed neighbours but the distances are different. The results should be the same and not a problem for the embeddings, but I was hoping to utilise the NN correlations.

the fresh model:

> umap.trans$nn$correlation$dist[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 0.2502186 0.2512920 0.2519980 0.2521188 0.2528236
[2,] 0.2529842 0.2531742 0.2532540 0.2532705 0.2550313
[3,] 0.2392112 0.2398928 0.2402835 0.2411647 0.2420704
[4,] 0.2447240 0.2447956 0.2453944 0.2458417 0.2471814
[5,] 0.2545708 0.2578911 0.2581325 0.2582460 0.2583417

and the loaded in version:

> umap.trans2$nn$cosine$dist[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 0.7458150 0.7461820 0.7464244 0.7464666 0.7467052
[2,] 0.7494166 0.7494699 0.7495013 0.7495075 0.7500989
[3,] 0.7464465 0.7466761 0.7468034 0.7470980 0.7474063
[4,] 0.7483876 0.7484159 0.7486170 0.7487704 0.7492193
[5,] 0.7448720 0.7460172 0.7460974 0.7461340 0.7461688

if we look at the first few lines of the verbose output from the transform function with the former:

15:49:04 Setting model random seed 42
15:49:04 Read 36 rows and found 16824 numeric columns
15:49:04 Processing block 1 of 1
15:49:04 Annoy search: subtracting row means for correlation
15:49:04 Writing NN index file to temp file /tmp/user/444605590/RtmpIsZEeM/file26ee2758fd73
15:49:05 Searching Annoy index using 36 threads, search_k = 7500
15:49:05 Commencing smooth kNN distance calibration using 36 threads with target n_neighbors = 75

but with the latter:

15:49:24 Setting model random seed 42
15:49:24 Read 36 rows and found 16824 numeric columns
15:49:24 Processing block 1 of 1
15:49:24 Writing NN index file to temp file /tmp/user/444605590/RtmpIsZEeM/file26ee310a685b
15:49:25 Searching Annoy index using 36 threads, search_k = 7500
15:49:26 Commencing smooth kNN distance calibration using 36 threads with target n_neighbors = 75

@jlmelville jlmelville self-assigned this Jan 24, 2024
@mdrnao
Author

mdrnao commented Jan 25, 2024

For what it's worth - if I change the loaded in model nn_index metric to 'correlation' I restore the behaviour from a fresh model, returning correlation values from the transform function.

@jlmelville
Owner

@mdrnao yes, this is definitely a bug and I just pushed a fix, so it will be fixed in the next release of uwot. Although I don't know if this is feasible for your workflow, doing what you did by changing the nn_index$metric back to correlation after a call to load_uwot would be a workaround until uwot is updated (there are some on-going dependency issues with the irlba package that may make submitting a new version potentially a bit painful until those get remedied).

Thank you for the assistance in tracking down what was happening and apologies for the oversight.

@jlmelville jlmelville added the bug Something isn't working label Jan 27, 2024
@mdrnao
Author

mdrnao commented Jan 31, 2024

Ah thank you so much! The workaround is absolutely fine for me.

Good luck with the new version, and thanks again for your support.

jlmelville pushed a commit that referenced this issue Feb 4, 2024