
Runtime and RAM usage compared to FIt-SNE #101

Closed
dkobak opened this issue Jan 5, 2020 · 25 comments
Labels: discussion (Needs more discussion)

@dkobak
Contributor

dkobak commented Jan 5, 2020

I understand that openTSNE is expected to be slower than FIt-SNE, but I'd like to understand how much slower it is in typical situations. As I reported earlier, when I run it on 70000x50 PCA-reduced MNIST data with default parameters and n_jobs=-1, I get ~60 seconds with FIt-SNE and ~120 seconds with openTSNE. Every 50 iterations take around 2s vs around 4s.

I did not check for this specific case, but I suspect that FFT takes only a small fraction of this time, and the computational bottleneck is formed by the attractive forces. Can one profile openTSNE and see how much time is taken by different steps, such as repulsive/attractive computations?

Apart from that, and possibly even more worryingly, I replicated the data 6x and added some noise, to get a 420000x50 data matrix. It takes FIt-SNE around 1 GB of RAM to allocate the space for the kNN matrix, so it works just fine on my laptop. However, openTSNE rapidly took >7 GB of RAM and crashed the kernel (I have 16 GB, but around half was taken by other processes). This happened in the first seconds, so I assume it happens during the kNN search. Does pynndescent eat up so much memory in this case?

@pavlin-policar
Owner

> I did not check for this specific case, but I suspect that FFT takes only a small fraction of this time, and the computational bottleneck is formed by the attractive forces. Can one profile openTSNE and see how much time is taken by different steps, such as repulsive/attractive computations?

Yes, it's possible to benchmark this, but it's a little involved. You can add the `# cython: profile=True` directive to the cython files, then compile everything from source by running `python setup.py develop`. Then you can get reports with cProfile, which is the default profiler if you're using PyCharm. It's been a while since I've done any benchmarking, but from what I remember, neither the attractive nor the repulsive forces "dominated" the runtime, and I spent quite a bit of time optimizing the computation of the attractive forces to use as few instructions as possible.
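For reference, a minimal profiling sketch along those lines (this assumes openTSNE has been rebuilt from source with the `# cython: profile=True` directive; the data shape is just a stand-in):

```python
import cProfile
import pstats

import numpy as np
from openTSNE import TSNE

X = np.random.randn(10_000, 50)  # stand-in for PCA-reduced data

profiler = cProfile.Profile()
profiler.enable()
TSNE(n_jobs=-1).fit(X)
profiler.disable()

# Sort by cumulative time to see where the optimization spends its time,
# e.g. in the attractive vs. repulsive (FFT) gradient computations
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```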

I fear there isn't all that much I can do to make it faster: cython generates a large amount of C code, and switching between compiled cython code and Python carries a lot of overhead compared to pure C++.

> Apart from that, and possibly even more worryingly, I replicated the data 6x and added some noise, to get a 420000x50 data matrix. It takes FIt-SNE around 1 GB of RAM to allocate the space for the kNN matrix, so it works just fine on my laptop. However, openTSNE rapidly took >7 GB of RAM and crashed the kernel (I have 16 GB, but around half was taken by other processes). This happened in the first seconds, so I assume it happens during the kNN search. Does pynndescent eat up so much memory in this case?

This has actually been my suspicion for a long time. I haven't really run into it myself because I typically work on large-RAM machines when doing any actual work, but I've been told this does happen. Unfortunately, for the time being, we still rely on pynndescent, which seems to be the source of almost all my problems with openTSNE. I would really like to replace it with something that doesn't use numba, but I haven't found any good alternative. Numba makes it so easy to use custom metrics, and is so easy to install, that there's really nothing like it. I've considered switching to annoy several times, since it also seems to be faster, but nothing has come of it yet; from what I remember, it was a pain to package.

If you do go into the code and find improvements, PRs are always welcome 😄

@dkobak
Contributor Author

dkobak commented Jan 6, 2020

I'm trying to understand the RAM usage of pynndescent a bit better here: lmcinnes/pynndescent#90. See also https://twitter.com/hippopedoid/status/1213935495350276096.

I think pynndescent is great in that it supports so many metrics and also sparse matrices. That said, it'd be great if openTSNE also supported annoy, which seems much more memory-efficient, and allowed choosing between pynndescent and annoy. What's the problem with using annoy? It seems to be on conda-forge: https://anaconda.org/conda-forge/python-annoy

@dkobak
Contributor Author

dkobak commented Jan 6, 2020

Oh, regarding building the index vs querying (the twitter conversation linked above), I just remembered that you mentioned that here: #28 (comment).

@pavlin-policar
Owner

Hmm, I should really follow twitter more often. I just saw your new preprint on UMAP initialization. Very interesting work.

To be quite honest, pynndescent is amazing. As you said: custom metrics, sparse matrices, and it's quite fast. IMO it's probably the best library for ANN search.

However, numba is a huge dependency (it bloats our Orange installer by 40MB or so) and seems unstable. For example, just last week one of my colleagues got segfaults when running some data through pynndescent; removing a couple of samples fixed it. Totally bizarre stuff, and very difficult to debug. On my new mac, I get some strange errors that I haven't had time to debug yet. Specifying the number of jobs is a mystery, and I can't ever seem to get it to use the right number of threads. 0.4 broke backward compatibility, which makes me wary of depending on them, as now I have to worry that every new release will break something (now that UMAP depends on them instead of using their own internal implementation, I hope they'll be better about this).

> What's the problem with using annoy?

As far as I remember, annoy has a very limited Python interface, and the only way to add samples to the index is in a for-loop; the same goes for retrieving the distances and nearest neighbors. For-loops are slow in Python, so I worry this silly overhead would take a lot of time. But it might not be that big a deal; we'd really just need to try it. It should be very simple to write a KNNIndex for annoy and benchmark that, as sketched below. It's nice that annoy supports cosine distance, but it's a lot more limited than pynndescent.
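For illustration, here's roughly what those per-sample loops look like with annoy's Python interface (the data and parameter values are made up):

```python
import numpy as np
from annoy import AnnoyIndex

X = np.random.randn(1000, 50).astype(np.float32)  # made-up data
k = 15

index = AnnoyIndex(X.shape[1], "euclidean")
for i in range(X.shape[0]):  # samples can only be added one at a time
    index.add_item(i, X[i])
index.build(50)  # number of trees

# Querying is also per-sample; ask for k+1 neighbors because each
# point is returned as its own nearest neighbor
neighbors = np.empty((X.shape[0], k), dtype=np.int64)
distances = np.empty((X.shape[0], k), dtype=np.float64)
for i in range(X.shape[0]):
    idx, dist = index.get_nns_by_item(i, k + 1, include_distances=True)
    neighbors[i], distances[i] = idx[1:], dist[1:]
```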

> Oh, regarding building the index vs querying (the twitter conversation linked above), ...

If they've made substantial changes to how the index is built, it might make sense to build the index with the proper number of neighbors now. But again, this is something that needs to be benchmarked.

@pavlin-policar
Owner

And I've just noticed that the tests are failing due to some numba-related problem from pynndescent. This is why I am so frustrated with pynndescent and would like to get rid of it as soon as possible.

@dkobak
Contributor Author

dkobak commented Jan 6, 2020

I can understand the frustration, but IMHO entirely removing pynndescent from openTSNE would be a big step back, because it would mean losing sparse matrix support. Currently openTSNE is the only existing t-SNE implementation that supports sparse inputs.

@dkobak
Contributor Author

dkobak commented Jan 6, 2020

Hopefully it will be fixed soon... lmcinnes/pynndescent#91 (comment)

@pavlin-policar
Owner

Yes, don't worry, I have no intention of removing it until I find something that includes all the features that make pynndescent so great. Sparse support is very much something I fully intend to keep.

I'm glad they're aware of the issue and I hope it gets resolved as soon as possible.

@lmcinnes

lmcinnes commented Jan 6, 2020

Sorry about all the breakage in the 0.4 release, but there were some significant low-hanging performance gains available and I felt it was worth it. Right now, not many projects that I know of explicitly depend on pynndescent, so I thought it was better to make the changes earlier rather than later. Hopefully things will stabilise from here. The multi-threading is still a complicated story, since there are multiple multi-threading mechanisms in play. One, written by Tom White, is very powerful and scales up very well; however, it doesn't scale down that well (for 4 cores it is, at best, break-even in performance, and worse for fewer) and is certainly memory intensive. The 0.4 version introduces a numba-based parallel implementation that does not scale up as well, but works well in the 1-8 core range and uses a lot less memory. Ideally I would like to move to that long term, but the numba parallel options are not quite there yet (they are coming).

As for breakage with numba -- yes, it seems like the 0.47 version has introduced some recent issues as well. It is frustrating, but on the other hand it is far easier to write and maintain Python using numba than whatever would be required to get similar performance any other way (Cython, or straight C++). Hopefully there will be fewer issues in the future.

Sorry again for all the issues; hopefully they can get resolved.

@pavlin-policar
Owner

Hi, thanks for the explanation. As for breaking changes in publicly available software, it's generally best practice to have a grace period where you raise DeprecationWarnings for one version, then remove the old behaviour in the next. This is what we do in Orange, and this way we give anyone depending on our software a couple of weeks to adapt their code to the new API.
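A minimal sketch of that grace-period pattern (the function and parameter names here are hypothetical, not actual Orange or openTSNE code):

```python
import warnings

def fit(data, n_iter=None, n_iterations=750):
    # The old parameter name keeps working for one release, but warns
    if n_iter is not None:
        warnings.warn(
            "`n_iter` is deprecated and will be removed in the next "
            "release; use `n_iterations` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        n_iterations = n_iter
    # ... run the optimization for `n_iterations` steps ...
    return data
```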

I appreciate that this must be equally frustrating to you since I imagine most of your problems are also caused by numba updates. As I said, I think pynndescent is a great piece of software, but it's quite unreliable because things seem to break quite often.

> It is frustrating, but on the other hand it is far easier to write and maintain Python using numba than whatever would be required to get similar performance any other way (Cython, or straight C++).

I'm not sure I agree. It seems to me that numba is still quite immature, and if I constantly have to return to openTSNE to fix bugs introduced by some downstream dependency, that does not scream "easy to maintain" to me. I would much rather have a stable piece of software written in C++, which might have taken a bit longer to develop, but would be faster and more stable. I haven't made changes to the core openTSNE code in probably about a year, yet I find myself constantly returning to it to fix some bug, implement some workaround for numba, or figure out which versions work and which don't.

I apologize if I sound a little frustrated, but I've just returned from holiday to find failing tests with cryptic numba error messages which aren't even part of my code.

So as not to completely derail this issue: it's my understanding that the low_memory parameter in pynndescent determines which of the two multithreading algorithms is used. Is that correct? If so, I suppose the easiest way to address Dmitry's original issue would be to default to the new, less memory-intensive version. I would guess most people run things on their PCs, where they rarely have more than 8 cores available, and for the people running things on CPU nodes, adding a parameter to switch to the memory-hungry algorithm should be totally fine.
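For reference, low_memory is an actual flag on pynndescent's NNDescent constructor; a minimal sketch of using it (the data and parameter values are made up):

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.randn(10_000, 50)  # made-up data

# Build the index using the less memory-hungry code path
index = NNDescent(data, n_neighbors=15, low_memory=True, n_jobs=1)

# Indices and distances of each point's approximate nearest neighbors
neighbors, distances = index.neighbor_graph
```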

@dkobak
Contributor Author

dkobak commented Jan 7, 2020

> So as not to completely derail this issue: it's my understanding that the low_memory parameter in pynndescent determines which of the two multithreading algorithms is used. Is that correct?

As far as I understood @lmcinnes -- no, this is not correct. He said "explicit" multithreading happens whenever n_jobs>1 and an "implicit" numba multithreading happens when n_jobs=1: lmcinnes/pynndescent#90 (comment).

I am not sure what low_memory does exactly, maybe Leland can clarify.

@lmcinnes

lmcinnes commented Jan 7, 2020

There was a caching trick used in NN-descent that could provide dramatic speedups; unfortunately, for some datasets this could cause a memory explosion (as too much got cached). The low_memory option turned off this caching. The current version uses a much less aggressive caching approach; the low_memory option still turns it off, but it is much less likely to have as dramatic an effect on memory usage as in previous versions.

@lmcinnes

lmcinnes commented Jan 7, 2020

@pavlin-policar: Sorry for the problems I have apparently caused. To be honest, I am somewhat new to maintaining projects of any significant visibility and am very much learning as I go. I will endeavour to provide a smoother path in the future.

Finally I agree that numba is immature; on the other hand so is pynndescent (hence the significant changes it underwent; there are probably more to come, ideally including a distributed computation version). This is much of the reason why UMAP continues to maintain its own internal implementation for now. I am very sorry that my haste caused so much trouble.

I could recommend annoy or nmslib as more stable ANN libraries written in C++.

@pavlin-policar
Owner

"explicit" multithreading happens whenever n_jobs>1 and an "implicit" numba multithreading happens when n_jobs=1

Ok, so that is very confusing for an end-user, and I would encourage changing it. So setting n_jobs=1 will use significantly less memory than setting n_jobs to anything higher. Of course, we could write some workaround in openTSNE and add our own use_less_memory parameter that would switch between the two and set the numba threads environment variable, but I really don't like this: it's both complicated and confusing, and I would hope it gets changed. So I would leave this as-is for the time being and hope it gets changed in pynndescent as soon as possible.

> I could recommend annoy or nmslib as more stable ANN libraries written in C++.

@lmcinnes I have actually looked at and considered both of them, but I found problems with both of them (albeit this was quite a while ago). Like I said, I haven't really found anything that has the nice features of pynndescent. For the time being, we are certainly going to be sticking with pynndescent.

@dkobak
Contributor Author

dkobak commented Jan 8, 2020

I am not completely sure, but I think the amount of RAM is directly proportional to n_jobs, so n_jobs=1 will use half as much RAM as n_jobs=2.

I am also not completely sure whether the "implicit" numba multithreading ONLY kicks in with n_jobs=1 or whether it's actually always on, independent of n_jobs. This should be clear enough from the code, but I am not familiar with numba multithreading, so I don't know what to look for.

I think as far as openTSNE is concerned, it might be a good idea to add a separate n_jobs_pynndescent parameter (default to 1). Currently you are passing n_jobs into pynndescent, but as it leads to much higher RAM usage, one might want to have n_jobs=-1 but keep n_jobs_pynndescent at the default 1.

@pavlin-policar
Owner

I don't like adding yet another parameter specific to pynndescent at all. I understand why you'd like the flexibility, but from a maintenance standpoint it's not great. Hopefully they change the behavior to something a little more reasonable in the near future, and when that happens, we'll have to release a new version to slowly remove this parameter. Like I was complaining above, changing the public API should be done with care, and any new parameter we introduce will have to be supported for the next couple of versions.

Additionally, I feel there are already far too many parameters for the normal user (currently 24!) in the TSNE public API. I'd rather add this to the API of the NNDescent KNNIndex, which can then be passed to the t-SNE constructor; that would be better IMO.

@dkobak
Contributor Author

dkobak commented Jan 8, 2020

I am not sure what pynndescent behaviour you would consider more reasonable. I was initially confused mainly because I did not expect increasing n_jobs to increase the required RAM by a lot. So there is a trade-off between memory and speed. But that's how it is, at least currently. What's unreasonable about it?

@pavlin-policar
Owner

That part makes perfect sense. The part that doesn't make sense is that when we set n_jobs=1, it will use a different algorithm that uses less memory but will still use multiple cores (as set by the numba environment variable). So unless the user actually knows to set the environment variable to 1, there really is no way to use a single core. And this requires knowledge about how the library works, which the average user does not have. I would call this unreasonable behaviour because it's complicated and very surprising -- one would expect that setting the number of jobs to 1 would actually result in a single core being used.
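To illustrate the point: NUMBA_NUM_THREADS is numba's actual environment variable, and it has to be set before numba is first imported, which is exactly the kind of detail an average user won't know.

```python
import os

# Must happen before numba (and hence pynndescent) is imported
os.environ["NUMBA_NUM_THREADS"] = "1"

import pynndescent  # noqa: E402 -- now n_jobs=1 really means one core
```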

@dkobak
Contributor Author

dkobak commented Feb 12, 2020

So, thinking this over again, I think the best course of action might be to add Annoy support and make it the default for dense inputs and all metrics that Annoy supports (assuming, of course, that it is not slower; this would need to be checked). For sparse inputs or any other metric, the code would switch to pynndescent, and one should of course be able to explicitly choose which method is used; a possible dispatch rule is sketched below. Annoy actually says right in its headline that it is "optimized for memory usage". What do you think?
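Something like this hypothetical helper (not openTSNE code; note that annoy calls cosine distance "angular"):

```python
import scipy.sparse as sp

# Metrics annoy supports, in annoy's own naming
ANNOY_METRICS = {"angular", "euclidean", "manhattan", "hamming", "dot"}

def choose_knn_method(data, metric):
    """Pick annoy for dense inputs with a supported metric, else pynndescent."""
    annoy_metric = "angular" if metric == "cosine" else metric
    if sp.issparse(data) or annoy_metric not in ANNOY_METRICS:
        return "pynndescent"
    return "annoy"
```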

Then everything about n_jobs can be left as it is, and one should just be prepared to use a lot of RAM when using pynndescent on many threads.

@dkobak
Contributor Author

dkobak commented Feb 23, 2020

I might be able/willing to try to implement KNNIndex for Annoy and do some experiments regarding the runtime. Or have you already been looking into it?

@pavlin-policar
Owner

@dkobak Sorry, I missed this, my inbox has been a bit of a mess. No, I haven't really had a chance to look at this any further. Have you managed to tinker around with this yet?

@dkobak
Contributor Author

dkobak commented Mar 2, 2020

No, I haven't yet. Should we test it together with the changes to the default parameters? It could make a nice combined update.

@dkobak
Contributor Author

dkobak commented Mar 10, 2020

I implemented Annoy, and it seems to work well! Should I do a PR?

[Screenshot from 2020-03-10 17-20-46]

(Showing the same cell twice because I noticed that, for some weird reason, re-running the openTSNE call with pynndescent sometimes makes it execute a lot faster; not sure why, maybe it's some kind of numba quirk? It looks almost as if something needs to be "initialized" the first time pynndescent is run after a kernel restart.)

UPDATE: Ran the same thing on my laptop:

[Screenshot from 2020-03-10 23-49-44]

@pavlin-policar
Owner

This is very nice. The slowdown you noticed is one of the annoying things about numba: when the code is first run, it has to go through JIT compilation, meaning the first call takes a bit longer.
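A tiny self-contained illustration of that warm-up cost (a made-up toy function, not pynndescent code):

```python
import time

import numpy as np
from numba import njit

@njit
def total(x):
    s = 0.0
    for v in x:
        s += v
    return s

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total(x)  # first call: includes JIT compilation
t1 = time.perf_counter()
total(x)  # second call: runs the compiled code only
t2 = time.perf_counter()

print(f"first call: {t1 - t0:.3f}s, second call: {t2 - t1:.5f}s")
```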

@pavlin-policar pavlin-policar added the discussion Needs more discussion label Mar 12, 2020
@dkobak
Contributor Author

dkobak commented Mar 13, 2020

I guess this issue can be closed now after #111.
