
Loading val_P/row_P/col_P from files: unused part of the code? #16

Closed · dkobak opened this issue Jun 7, 2018 · 10 comments

Comments

dkobak (Collaborator) commented Jun 7, 2018

In the Annoy function, there is a large block of code that loads the val_P/row_P/col_P vectors from files if they exist: https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L960. However, the code that saves these vectors to the files is commented out: https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L1161. So none of that code seems to be doing anything.

Is this a remnant of some debugging? If so, can it perhaps be deleted to simplify the code? Or is it a feature that is not yet fully implemented?

linqiaozhi (Member)

Good question--there are two uses for this code:

  1. I found that people often wanted to obtain the input similarity matrix that t-SNE generates and load it into, say, Matlab. Then they would use it for some other visualization or processing task.

  2. I had originally implemented the following feature: if the val_P/row_P/col_P files existed, then load them and skip the computation of input similarities. I found this very useful when my input data (and perplexity) were fixed and I was experimenting with other parameters and initializations. It is common to run t-SNE several times with different initializations, and since the input similarity matrix is exactly the same for each of these runs, it made sense not to recompute it every time. I ended up commenting it out because I was afraid people would forget they were using the old input similarities. If implemented properly (probably with a parameter passed in, as opposed to just checking for the files), I think this would be a useful feature; see the sketch below for what the load path could look like.
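A minimal sketch of that load path, assuming a single binary file holding the three CSR arrays back to back (the file name, layout, and function name here are illustrative assumptions, not FIt-SNE's actual convention):

```cpp
#include <cstdio>
#include <vector>

// Try to load a cached sparse similarity matrix (CSR triplet) from disk.
// Returns false if no cache exists, in which case the caller recomputes.
bool load_cached_P(int N,
                   std::vector<unsigned int>& row_P,
                   std::vector<unsigned int>& col_P,
                   std::vector<double>& val_P) {
    std::FILE* f = std::fopen("P.dat", "rb");
    if (!f) return false;                       // no cache: recompute as usual

    row_P.resize(N + 1);
    if (std::fread(row_P.data(), sizeof(unsigned int), N + 1, f)
            != static_cast<size_t>(N + 1)) {
        std::fclose(f);
        return false;
    }
    size_t nnz = row_P[N];                      // last row pointer = number of nonzeros
    col_P.resize(nnz);
    val_P.resize(nnz);
    bool ok = std::fread(col_P.data(), sizeof(unsigned int), nnz, f) == nnz
           && std::fread(val_P.data(), sizeof(double), nnz, f) == nnz;
    std::fclose(f);
    return ok;                                  // true: skip the similarity computation
}
```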

dkobak (Collaborator, Author) commented Jun 11, 2018

I see. I agree it could be a useful feature, but perhaps it should be implemented in TSNE::run() and not downstream. If the save flag is on, then row_P, col_P, val_P (or P) are saved to a file after they are computed. If the load flag is on, then they are read from the file and not recomputed. By default, they are neither loaded nor saved. Something like that.

Maybe it can be implemented as a single input flag: the default 0 means do nothing, 1 means save, 2 means load (see the sketch below). The Matlab/R/Python wrappers could then use more informative names for this flag, e.g. 'load'/'save'/None.
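A rough sketch of that dispatch under the proposed 0/1/2 convention; all names here (AffinityMode, get_P, and the stubbed-out helpers) are hypothetical placeholders, not FIt-SNE's actual API:

```cpp
#include <vector>

enum AffinityMode { AFFINITY_NONE = 0, AFFINITY_SAVE = 1, AFFINITY_LOAD = 2 };

struct SparseP {                        // CSR triplet used for the input similarities
    std::vector<unsigned int> row, col;
    std::vector<double> val;
};

// Stubs standing in for the real routines: the perplexity-based kNN
// similarity computation and the binary save/load of the three arrays.
SparseP compute_input_similarities(int N) {
    SparseP P;
    P.row.assign(static_cast<size_t>(N) + 1, 0u);
    return P;
}
bool save_P(const SparseP&) { return true; }
bool load_P(SparseP&, int)  { return false; }

// Proposed control flow in TSNE::run(): by default do nothing special;
// with SAVE, write P after computing it; with LOAD, read it from disk
// and skip the expensive recomputation entirely.
SparseP get_P(int N, AffinityMode mode) {
    SparseP P;
    if (mode == AFFINITY_LOAD && load_P(P, N)) return P;  // cache hit
    P = compute_input_similarities(N);
    if (mode == AFFINITY_SAVE) save_P(P);                 // cache for next run
    return P;
}
```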

linqiaozhi (Member)

Agreed--I think this would be a great way to implement it.

dkobak (Collaborator, Author) commented Jun 13, 2018

I might do it, but most likely not before the end of the month. If you don't mind, let's leave this open for now.

By the way, the code for computeGaussianPerplexity() using Annoy and using VP trees is quite similar (the same memory allocation, the same multithreading, the same progress messages, etc.), so if I make any extensions, such as adding K_random random points to the K nearest neighbours, I have to do it in both places. What do you think of combining these two functions into one, with knn_algo as an input parameter (see the sketch below)? I think this could simplify the code.
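As a rough sketch of what such a merge might look like (the KnnAlgo enum and the backend wrappers are hypothetical; only the neighbor search differs between the two current code paths, so everything else could be shared):

```cpp
#include <vector>

enum class KnnAlgo { Annoy, VPTree };   // hypothetical selector

// Placeholder backends; in FIt-SNE these would wrap Annoy and the VP-tree.
// Each fills `idx` with the K nearest neighbors of all N points.
static void knn_annoy(const double* X, int N, int D, int K, std::vector<int>& idx) {}
static void knn_vptree(const double* X, int N, int D, int K, std::vector<int>& idx) {}

void computeGaussianPerplexity(const double* X, int N, int D, int K,
                               double perplexity, KnnAlgo algo) {
    std::vector<int> idx(static_cast<size_t>(N) * K);

    // The only part that differs between the two current implementations:
    if (algo == KnnAlgo::Annoy) knn_annoy(X, N, D, K, idx);
    else                        knn_vptree(X, N, D, K, idx);

    (void)perplexity;  // used by the shared bandwidth search, omitted here

    // Everything below this point -- allocating row_P/col_P/val_P, the
    // multithreaded binary search for the bandwidths matching `perplexity`,
    // and the progress messages -- would then live in exactly one place.
}
```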

dkobak (Collaborator, Author) commented Jul 13, 2018

Hi George. I could send a pull request with the things I've been working on next week. Just wanted to double-check what you think about combining computeGaussianPerplexity() for Annoy/VP trees into one function (as per my comment above).

I could do the following PRs:

  1. Load/save the Ps to/from a file.
  2. Combine computeGaussianPerplexity() for Annoy/VP trees to simplify the code.
  3. Perplexity combination.
  4. Supply K_random in addition to K.

Let me know if you think any of these are not needed here. I am not quite sure how useful (4) will turn out to be, but I found it interesting to experiment with.

linqiaozhi (Member)

Hey @dkobak, so sorry for the late response! I thought I replied, but clearly I did not.

Refactoring the redundant code in computeGaussianPerplexity() out into one function would be great. I think that 1 and 2 are very useful.

As for 3, I think you had sent me a citation showing that this was useful. Did you find it useful in large datasets as well? Also, is it computationally feasible on large datasets? Large perplexities mean lots of nearest neighbors, and looping through a large number of neighbors at each iteration is the speed bottleneck when you have lots of points... so I just want to make sure it's useful on large datasets.

As for 4, I think it's really interesting. Do you have a citation for how this can improve the embedding? I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Thanks so much, and again, my apologies for the tardy response!

dkobak (Collaborator, Author) commented Jul 21, 2018

Re 3: yes, there is a paper advocating this procedure. You are of course right about large datasets: it is not really possible to use large perplexities there. However, for datasets with n ≈ 25k I found it very useful. In the paper I am preparing we are actually going to recommend this perplexity combination as one of the useful "tricks". For one particular n = 25k dataset that I'm using as my main example, I obtain the best results with a combination of perplexities [10, 100, 500] (see the sketch below for what the combination amounts to). So it's up to you, but if you don't want to merge this feature into your code, I will have to refer readers to my branch :)
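For concreteness, here is a dense-matrix sketch of the trick, assuming that "combination" means uniformly averaging the input similarity matrices computed at the individual perplexities (compute_P below is a stub, and FIt-SNE itself would of course work with sparse kNN-based P):

```cpp
#include <vector>

using Matrix = std::vector<double>;  // N*N, row-major

// Stub for the usual t-SNE input similarity matrix at one perplexity
// (symmetrized and normalized so that all entries sum to 1).
Matrix compute_P(const double* X, int N, int D, double perplexity) {
    (void)X; (void)D; (void)perplexity;
    return Matrix(static_cast<size_t>(N) * N,
                  1.0 / (static_cast<double>(N) * N));
}

// Multi-perplexity similarities: the uniform average of the P matrices.
// Each term sums to 1, so the average is again a valid P matrix.
Matrix combined_P(const double* X, int N, int D,
                  const std::vector<double>& perplexities) {
    Matrix P(static_cast<size_t>(N) * N, 0.0);
    for (double u : perplexities) {
        Matrix Pu = compute_P(X, N, D, u);
        for (size_t i = 0; i < P.size(); ++i)
            P[i] += Pu[i] / perplexities.size();
    }
    return P;
}
```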

As for 4, no, I don't have any citations. I did find it marginally useful in some situations, but honestly not useful enough. I will probably not be describing or recommending it for now, and I agree that it's not worth merging into master.

> I'd like to only include features that have been shown to be useful, so that it's not confusing to people.

Sure. Even though you do have some "experimental" stuff in the code, like setting a fixed sigma instead of a perplexity, etc. :)

linqiaozhi (Member)

Great, whenever you get the chance, please go ahead and also PR the perplexity combination (3). I look forward to trying it out--and reading the paper you are preparing!

Thanks so much for all your help!!

dkobak (Collaborator, Author) commented Jul 25, 2018

I will be away for the next two weeks, and by now it's clear that I won't be able to do this before I leave. So I will try to do these PRs some time in the middle of August. In case you are also on vacation then, it'll wait until September. Cheers.

dkobak (Collaborator, Author) commented Aug 17, 2018

Closing this issue now that #25 has been merged.

dkobak closed this as completed Aug 17, 2018