
Loading val_P/row_P/col_P from files: unused part of the code? #16

Closed · dkobak opened this issue Jun 7, 2018 · 10 comments

Comments

dkobak (Collaborator) commented Jun 7, 2018

In the Annoy function, there is a large block of code that loads the val_P/row_P/col_P vectors from files if they exist: https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L960. However, the code that saves these vectors to the files is commented out: https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L1161. So none of that code seems to be doing anything.

Is this a remnant of some debugging? If so, can it perhaps be deleted to simplify the code? Or is it a feature that is not yet fully implemented?

linqiaozhi (Member)

Good question--there are two uses for this code:

  1. I found that people often wanted to obtain the input similarity matrix that t-SNE generates and load it into, say, Matlab. Then they would use it for some other visualization or processing task.

  2. I had originally implemented the following feature: if the val_P/row_P/col_P files existed, then load them and skip the computation of input similarities. I found this very useful when my input data (and perplexity) were fixed and I was experimenting with other parameters and initializations. It is common to run t-SNE several times with different initializations, and since the input similarity matrix is exactly the same for each of these runs, it made sense not to recompute it every time. I ended up commenting it out because I was afraid people would forget they were using the old input similarities. If implemented properly (probably with a parameter passed in, as opposed to just checking for the files), I think this would be a useful feature; see the sketch below for what the load path could look like.
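A minimal sketch of that load path, assuming a single binary file holding the three CSR arrays back to back (the file name, layout, and function name here are illustrative assumptions, not FIt-SNE's actual convention):

```cpp
#include <cstdio>
#include <vector>

// Try to load a cached sparse similarity matrix (CSR triplet) from disk.
// Returns false if no cache exists, in which case the caller recomputes.
bool load_cached_P(int N,
                   std::vector<unsigned int>& row_P,
                   std::vector<unsigned int>& col_P,
                   std::vector<double>& val_P) {
    std::FILE* f = std::fopen("P.dat", "rb");
    if (!f) return false;                       // no cache: recompute as usual

    row_P.resize(N + 1);
    if (std::fread(row_P.data(), sizeof(unsigned int), N + 1, f)
            != static_cast<size_t>(N + 1)) {
        std::fclose(f);
        return false;
    }
    size_t nnz = row_P[N];                      // last row pointer = number of nonzeros
    col_P.resize(nnz);
    val_P.resize(nnz);
    bool ok = std::fread(col_P.data(), sizeof(unsigned int), nnz, f) == nnz
           && std::fread(val_P.data(), sizeof(double), nnz, f) == nnz;
    std::fclose(f);
    return ok;                                  // true: skip the similarity computation
}
```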

dkobak (Collaborator, Author) commented Jun 11, 2018

I see. I agree it could be a useful feature, but perhaps it should be implemented in TSNE::run() and not downstream. If the save flag is on, then row_P, col_P, val_P (or P) are saved to a file after they are computed. If the load flag is on, then they are read from the file and not recomputed. By default, they are neither loaded nor saved. Something like that.

Maybe it can be implemented as a single input flag: the default 0 means do nothing, 1 means save, 2 means load (see the sketch below). The Matlab/R/Python wrappers could then use more informative names for this flag, e.g. 'load'/'save'/None.
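A rough sketch of that dispatch under the proposed 0/1/2 convention; all names here (AffinityMode, get_P, and the stubbed-out helpers) are hypothetical placeholders, not FIt-SNE's actual API:

```cpp
#include <vector>

enum AffinityMode { AFFINITY_NONE = 0, AFFINITY_SAVE = 1, AFFINITY_LOAD = 2 };

struct SparseP {                        // CSR triplet used for the input similarities
    std::vector<unsigned int> row, col;
    std::vector<double> val;
};

// Stubs standing in for the real routines: the perplexity-based kNN
// similarity computation and the binary save/load of the three arrays.
SparseP compute_input_similarities(int N) {
    SparseP P;
    P.row.assign(static_cast<size_t>(N) + 1, 0u);
    return P;
}
bool save_P(const SparseP&) { return true; }
bool load_P(SparseP&, int)  { return false; }

// Proposed control flow in TSNE::run(): by default do nothing special;
// with SAVE, write P after computing it; with LOAD, read it from disk
// and skip the expensive recomputation entirely.
SparseP get_P(int N, AffinityMode mode) {
    SparseP P;
    if (mode == AFFINITY_LOAD && load_P(P, N)) return P;  // cache hit
    P = compute_input_similarities(N);
    if (mode == AFFINITY_SAVE) save_P(P);                 // cache for next run
    return P;
}
```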

linqiaozhi (Member)

Agreed--I think this would be a great way to implement it.

dkobak (Collaborator, Author) commented Jun 13, 2018

I might do it, but most likely not before the end of the month. If you don't mind, let's leave this open for now.

By the way, the code for computeGaussianPerplexity() using Annoy and using VP trees is quite similar (the same memory allocation, the same multithreading, the same progress messages, etc.), so if I make any extensions, such as adding K_random random points to the K nearest neighbours, I have to do it in both places. What do you think of combining these two functions into one, with knn_algo as an input parameter (see the sketch below)? I think this could simplify the code.
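As a rough sketch of what such a merge might look like (the KnnAlgo enum and the backend wrappers are hypothetical; only the neighbor search differs between the two current code paths, so everything else could be shared):

```cpp
#include <vector>

enum class KnnAlgo { Annoy, VPTree };   // hypothetical selector

// Placeholder backends; in FIt-SNE these would wrap Annoy and the VP-tree.
// Each fills `idx` with the K nearest neighbors of all N points.
static void knn_annoy(const double* X, int N, int D, int K, std::vector<int>& idx) {}
static void knn_vptree(const double* X, int N, int D, int K, std::vector<int>& idx) {}

void computeGaussianPerplexity(const double* X, int N, int D, int K,
                               double perplexity, KnnAlgo algo) {
    std::vector<int> idx(static_cast<size_t>(N) * K);

    // The only part that differs between the two current implementations:
    if (algo == KnnAlgo::Annoy) knn_annoy(X, N, D, K, idx);
    else                        knn_vptree(X, N, D, K, idx);

    (void)perplexity;  // used by the shared bandwidth search, omitted here

    // Everything below this point -- allocating row_P/col_P/val_P, the
    // multithreaded binary search for the bandwidths matching `perplexity`,
    // and the progress messages -- would then live in exactly one place.
}
```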

dkobak (Collaborator, Author) commented Jul 13, 2018

Hi George. I could send a pull request with the things I've been working on next week. Just wanted to double-check what you think about combining computeGaussianPerplexity() for Annoy/VP trees into one function (as per my comment above).

I could do the following PRs:

  1. Load/save the Ps to/from a file.
  2. Combine computeGaussianPerplexity() for Annoy/VP trees to simplify the code.
  3. Perplexity combination.
  4. Supply K_random in addition to K.

Let me know if you think any of these are not needed here. I am not quite sure how useful (4) will turn out to be, but I found it interesting to experiment with.

linqiaozhi (Member)

Hey @dkobak, so sorry for the late response! I thought I replied, but clearly I did not.

Refactoring the redundant code in computeGaussianPerplexity() out into one function would be great. I think that 1 and 2 are very useful.

As for 3, I think you had sent me a citation showing that this was useful. Did you find it useful in large datasets as well? Also, is it computationally feasible on large datasets? Large perplexities mean lots of nearest neighbors, and looping through a large number of neighbors at each iteration is the speed bottleneck when you have lots of points... so I just want to make sure it's useful on large datasets.

As for 4, I think it's really interesting. Do you have a citation for how this can improve the embedding? I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Thanks so much, and again, my apologies for the tardy response!

dkobak (Collaborator, Author) commented Jul 21, 2018

Re 3: yes, there is a paper advocating this procedure. You are of course right about large datasets: it is not really possible to use large perplexities there. However, for datasets with n ≈ 25k I found it very useful. In the paper I am preparing we are actually going to recommend this perplexity combination as one of the useful "tricks". For one particular n = 25k dataset that I'm using as my main example, I obtain the best results with a combination of perplexities [10, 100, 500] (see the sketch below for what the combination amounts to). So it's up to you, but if you don't want to merge this feature into your code, I will have to refer readers to my branch :)
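For concreteness, here is a dense-matrix sketch of the trick, assuming that "combination" means uniformly averaging the input similarity matrices computed at the individual perplexities (compute_P below is a stub, and FIt-SNE itself would of course work with sparse kNN-based P):

```cpp
#include <vector>

using Matrix = std::vector<double>;  // N*N, row-major

// Stub for the usual t-SNE input similarity matrix at one perplexity
// (symmetrized and normalized so that all entries sum to 1).
Matrix compute_P(const double* X, int N, int D, double perplexity) {
    (void)X; (void)D; (void)perplexity;
    return Matrix(static_cast<size_t>(N) * N,
                  1.0 / (static_cast<double>(N) * N));
}

// Multi-perplexity similarities: the uniform average of the P matrices.
// Each term sums to 1, so the average is again a valid P matrix.
Matrix combined_P(const double* X, int N, int D,
                  const std::vector<double>& perplexities) {
    Matrix P(static_cast<size_t>(N) * N, 0.0);
    for (double u : perplexities) {
        Matrix Pu = compute_P(X, N, D, u);
        for (size_t i = 0; i < P.size(); ++i)
            P[i] += Pu[i] / perplexities.size();
    }
    return P;
}
```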

As for 4, no, I don't have any citations. I did find it marginally useful in some situations, but honestly not useful enough. I will probably not be describing or recommending it for now, and I agree that it's not worth merging into master.

> I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Sure. Even though you do have some "experimental" stuff in the code, like setting a fixed sigma instead of a perplexity, etc. :)

linqiaozhi (Member)

Great, whenever you get the chance, please go ahead and also PR the perplexity combination (3). I look forward to trying it out--and reading the paper you are preparing!

Thanks so much for all your help!!

dkobak (Collaborator, Author) commented Jul 25, 2018

I will be away for the next two weeks, and by now it's clear that I won't be able to do this before I leave. So I will try to do these PRs some time in the middle of August. In case you are also on vacation then, it'll wait until September. Cheers.

dkobak (Collaborator, Author) commented Aug 17, 2018

Closing this issue now that #25 has been merged.

dkobak closed this as completed Aug 17, 2018