VRAM and cell numbers #14
This is a great question. I admit we have been mostly testing on 32 GB GPUs, but I can generate some random data with known sparsity to get a feel for how far we can push different GPUs. What is the ideal target size and sparsity for your GPU? Are there any public datasets with a similar size/sparsity?
Good examples would be here: http://mousebrain.org/downloads.html (either the aggregate loom https://storage.googleapis.com/linnarsson-lab-loom/l5_all.agg.loom, or you could see where you hit the limit by concatenating different subsets from http://mousebrain.org/loomfiles_level_L1.html). For example, I'm currently processing a dataset of 330,000 cells, including all of the above and a number of our own datasets combined with batch correction.
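For a rough sense of how cell count maps to memory, here is a back-of-the-envelope CSR footprint estimate. The gene count and density below are illustrative assumptions, not measured from the datasets above:

```python
def csr_bytes(n_cells, n_genes, density, value_bytes=4, index_bytes=4):
    """Approximate CSR footprint: nonzero values + column indices + row pointers."""
    nnz = round(n_cells * n_genes * density)
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes

# Hypothetical numbers: 330,000 cells x 27,000 genes at ~7% density.
print(f"{csr_bytes(330_000, 27_000, 0.07) / 1e9:.1f} GB")  # ~5.0 GB
```

Keep in mind the working set during preprocessing can be several times the raw matrix size, since intermediate copies and dense temporaries are created along the way.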
Wanted to provide a small update just to let you know that I have been looking into this. I think the most straightforward solution here might be to use the unified virtual memory allocator. I'll get an example together.
Ok, thanks! I'll close this.
I've made a modification to the notebook to enable the Unified Virtual Memory manager in RAPIDS & CuPy. Specifically, the change looks like this:

```python
import cupy as cp
import rmm

rmm.reinitialize(
    managed_memory=True,  # allows oversubscription of GPU memory
    devices=0,            # GPU device IDs to register; registers only GPU 0 by default
)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
```

This should allow you to oversubscribe your GPU memory to use all available host memory, and it will page memory onto the GPU as needed. While swapping pages can slow down the workflow, it does make it much easier to experiment and explore different datasets & sizes without having to think about available GPU memory. For example, I was able to load the 10x 1.3M neuron dataset pretty easily onto a 32 GB GPU and do all sorts of transformations to it without ever encountering an OOM. If you get a chance, you should try this feature out and let us know if it allows you to scale higher than before. Specifically, we're also very curious to know if you still find it extremely fast. I didn't notice much hit to performance at all, and I wasn't even paying attention to how much memory I was using (I'm sure it was well over 100 GB by the time I was done).
I will try the 1.3M neuron dataset myself and some internal datasets at some point today. Using this new modification looks excellent so far. Initial results are: 54 seconds (GPU) vs. 563 seconds (CPU) on a 32-core/64-thread Threadripper with a 1080 Ti and 128 GB RAM.
Just FYI, the 1.3M neuron dataset crashes the kernel on this line:
However, my other dataset of ~330K cells gets past this step. It doesn't seem to be a RAM/swap issue as I don't seem to be approaching the ceiling. Perhaps it's a VRAM issue? I'll try again later. |
I meant to respond to you earlier about this. Indeed, as I was playing around with this I noticed the crash as well. It's actually a known bug in cuSPARSE and we're waiting for them to fix it. I played around a little bit and managed to isolate the bug to entry […]. If you slice off the first […]
Just FYI, I'm consistently crashing the kernel here with your new notebook and rapids_scanpy_funcs.py file:
Any suggestions? |
What is the shape of […]?
(989838, 23781) right after this step (adding a line):
and there is no Python error except for the kernel crash and a system request to send a report. Let me try shaving off cells to see if it's a memory issue.
I believe we might have hit a similar issue today, where our Jupyter kernel crashed without giving any useful error information. I'm pretty sure it's because we were running on a system that didn't have enough main memory. While the benefit of the managed memory option is the ability to oversubscribe the GPU memory, it also increases the amount of main memory needed. Many of the CPU examples for the 1.3M-cell dataset indicate a requirement of at least 30 GB of main memory to do the processing end to end. I think you can get away with a smaller GPU and managed memory, but this comes at the expense of needing more main memory.
I can run it fine with 300K cells (266 seconds), but somewhere a little above that number it fails. At 500K it gets stuck and never finishes, and at 700K or above it crashes the kernel. As I mentioned, I have 128 GB of RAM and 628 GB set aside for swap, but it doesn't appear to get near that limit, especially with 500K. We are ordering workstations with 256 GB of RAM and, hopefully, I'll add more VRAM with the next generation of video cards. Update: it's a cell number somewhere between 350K and 400K that causes the kernel crash for me; 350K took 299 seconds to finish but 400K crashed the kernel.
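Manual trial and error over cell counts like this can be replaced with a simple bisection. A minimal sketch, where `runs_ok` is a hypothetical stand-in for actually running the pipeline on an `n`-cell subset:

```python
def largest_passing(lo, hi, runs_ok, tol=10_000):
    """Bisect for the largest cell count at which the pipeline still succeeds."""
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if runs_ok(mid):
            lo = mid  # mid cells worked; the limit is higher
        else:
            hi = mid  # mid cells failed; the limit is lower
    return lo

# Hypothetical predicate: pretend runs above 370K cells crash.
print(largest_passing(300_000, 500_000, lambda n: n <= 370_000))  # 368750
```

Each probe costs a full pipeline run, so a coarse `tol` (10K cells here) keeps the number of runs small: log2(200K / 10K) is about 5 trials.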
Just FYI, I've upgraded to 256 GB of RAM and completely reinstalled drivers and CUDA (from 10.1 to 10.2), and I still have problems with the code getting "stuck" perpetually in the regression or scaling steps (no kernel crashes lately, but a few CPU cores stay continually engaged by Python with no progression to completion). Have you run this code on a 2080 Ti or another non-Tesla card? This only happens above 350K cells. Is there any way to troubleshoot this?
Unfortunately, I don't have any 2080 Tis available to try to reproduce your problem on my end, and the 1M-cell notebook appears to be working with the T4 instances in AWS, which rules out the problem being exclusive to the Turing architecture. This behavior does sound very strange, though.
A lot of the time, when errors are printed they end up displaying in the command line that's running the Jupyter notebook and not in the notebook itself. Do you see any errors on the command line? You can set […]. If you have a command line available, can you also run […]?
Sorry for the delay. Coming back to this, it seems like it's now hanging at […]
I'm guessing this is related to issue #53?
Can you comment on the number of cells that can be processed as a function of VRAM? At 11 GB, it seems like I'm running into memory limits with bigger samples. (The tutorial ran fine.)
What is the max number of cells that you have tested?