VRAM and cell numbers #14

Closed
JBreunig opened this issue Jun 9, 2020 · 16 comments

JBreunig commented Jun 9, 2020

Can you comment on the number of cells that can be processed as a function of VRAM? At 11 GB, I seem to be running into memory limits with bigger samples. (The tutorial ran fine.)

What is the max number of cells that you have tested?


cjnolet commented Jun 9, 2020

@JBreunig,

This is a great question. I admit we have mostly been testing on 32 GB GPUs, but I can generate some random data with known sparsity to get a feel for how far we can push different GPUs.

What is the ideal target size and sparsity for your GPU? Are there any public datasets with a similar size/sparsity?
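A minimal sketch of that kind of synthetic test might look like the following; the sizes and density are placeholders, not values from this thread:

# Illustrative sketch only: synthesize a cells-by-genes matrix with a chosen
# density on the host, then copy it to the GPU, to probe how large an input
# a given card can handle. The sizes and density below are placeholders.
import scipy.sparse as sp
import cupyx.scipy.sparse as cusparse

n_cells, n_genes, density = 100_000, 25_000, 0.07   # placeholder values

cpu_counts = sp.random(n_cells, n_genes, density=density,
                       format="csr", dtype="float32")
gpu_counts = cusparse.csr_matrix(cpu_counts)          # host -> device copy
print(gpu_counts.shape, gpu_counts.nnz)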

JBreunig commented:

Good examples would be here: http://mousebrain.org/downloads.html. You could use the aggregate loom (https://storage.googleapis.com/linnarsson-lab-loom/l5_all.agg.loom), or see where you hit the limit by concatenating different subsets from http://mousebrain.org/loomfiles_level_L1.html.

For example, I'm currently processing a dataset of 330,000 cells that includes all of the above plus a number of our own datasets, combined with batch correction.


cjnolet commented Jun 14, 2020

@JBreunig,

Wanted to provide a small update just to let you know that I have been looking into this. I think the most straightforward solution here might be to use the unified virtual memory allocator. I’ll get an example together.

JBreunig commented:

Ok, thanks! I'll close this.


cjnolet commented Jun 16, 2020

@JBreunig,

I've made a modification to the notebook to enable the Unified Virtual Memory manager in RAPIDS & CuPy. Specifically, the change looks like this:

import rmm
import cupy as cp

rmm.reinitialize(
    managed_memory=True,  # allows oversubscription of GPU memory
    devices=0,            # GPU device IDs to register; by default only GPU 0
)

cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

This should allow you to oversubscribe your GPU memory, using all available host memory and paging data onto the GPU as needed. While swapping pages can slow down the workflow, it makes it much easier to experiment with different datasets and sizes without having to think about available GPU memory.

For example, I was able to load the 10x 1.3M neuron dataset onto a 32 GB GPU fairly easily and run all sorts of transformations on it without ever encountering an OOM.

If you get a chance, try this feature out and let us know whether it allows you to scale higher than before. We're also very curious whether you still find it extremely fast. I didn't notice much of a performance hit at all, and I wasn't even paying attention to how much memory I was using (I'm sure it was well over 100 GB by the time I was done).
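As an aside, one quick way to confirm that oversubscription is in effect (an illustrative check, not part of the notebook) is to allocate an array larger than the card's VRAM after the rmm.reinitialize call above:

import cupy as cp

# Illustrative check only: ~16 GB of float32 zeros, more than an 11 GB card holds.
# With managed_memory=True this should succeed and spill to host memory;
# without it, the same allocation raises an out-of-memory error.
big = cp.zeros(4_000_000_000, dtype=cp.float32)
big += 1.0                        # touch the pages so they are really materialized
print(big.nbytes / 1e9, "GB allocated")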

JBreunig commented:

I will try the 1.3M neuron dataset myself and some internal datasets at some point today.

Using this new modification looks excellent so far. Initial results: 54 seconds (GPU) vs. 563 seconds (CPU) on a 32-core/64-thread Threadripper with a 1080 Ti and 128 GB of RAM.

JBreunig commented:

Just FYI, the 1.3M neuron dataset crashes the kernel on this line:

filtered = rapids_scanpy_funcs.filter_cells(sparse_gpu_array, min_genes=min_genes_per_cell, max_genes=max_genes_per_cell)

However, my other dataset of ~330K cells gets past this step. It doesn't seem to be a RAM/swap issue, as I don't appear to be approaching that ceiling. Perhaps it's a VRAM issue? I'll try again later.


cjnolet commented Jun 17, 2020

@JBreunig,

I meant to respond to you earlier about this. Indeed, I noticed the crash as well while playing around with this. It's actually a known bug in cuSPARSE, and we're waiting for a fix. I managed to isolate the bug to entry 1057790 in the input data; the problem occurs in the conversion from the CPU array to the GPU.

If you slice off the first 1M records, or vstack everything up to row 1057789 with everything above 1057791, the filter will work. I was able to run the 1M cells fairly easily all the way up to regress_out. We're also very close to merging a PR in cuML that will enable sparse inputs for PCA (and doesn't require conversion to dense for the mean centering).
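A rough sketch of that workaround on the host side; the function name and the idea of dropping just the offending row are for illustration, not code from the notebook:

import scipy.sparse as sp

def drop_bad_row(counts: sp.csr_matrix, bad_row: int = 1057790) -> sp.csr_matrix:
    """Hypothetical workaround: remove the single row that trips the cuSPARSE
    bug before handing the matrix to the GPU."""
    return sp.vstack([counts[:bad_row], counts[bad_row + 1:]], format="csr")

# Alternatively, just work with the first 1M cells until the fix lands upstream:
# subset = counts[:1_000_000]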

JBreunig commented:

Just FYI, I'm consistently crashing the kernel here with your new notebook and rapids_scanpy_funcs.py file:

%%time
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")
adata = adata[:, adata.var.highly_variable]

Any suggestions?


cjnolet commented Jun 23, 2020

@JBreunig,

What is the shape of the adata passed into highly_variable_genes? Is it giving you any error at all before the kernel crashes?

JBreunig commented:

(989838, 23781), right after this step (with a line added to print the shape):

%%time
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = genes.to_pandas()
adata.shape

and there is no Python error, just the kernel crash and a system request to send a report.

Let me try shaving off cells to see if it's a memory issue.


cjnolet commented Jun 25, 2020

@JBreunig,

I believe we hit a similar issue today, where our Jupyter kernel crashed without giving any useful error information. I'm pretty sure it's because we were running on a system that didn't have enough main memory.

While the benefit of the managed memory option is the ability to oversubscribe GPU memory, it does increase the amount of main memory required.

Many of the CPU examples of the 1.3M cells dataset indicate a requirement of at least 30 GB of main memory to do the processing end to end. I think you can get away with a smaller GPU and managed memory, but this comes at the expense of needing more main memory.
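As a rough sanity check, you can compare available host memory against what the managed allocator would need to spill into; psutil here is just one convenient way to do that, not something the notebook itself uses:

import psutil

# Report host memory before enabling managed memory: oversubscribing the GPU
# only helps if the spilled pages have somewhere to go in main memory.
vm = psutil.virtual_memory()
print(f"Host RAM: {vm.total / 1e9:.0f} GB total, {vm.available / 1e9:.0f} GB available")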


JBreunig commented Jun 25, 2020

I can run it fine with 300K cells (266 seconds), but somewhere a little above that it fails. At 500K it gets stuck and never finishes, and at 700K or above it seems to crash the kernel. As I mentioned, I have 128 GB of RAM and 628 GB set aside for swap, but it doesn't appear to get near that limit, especially with 500K.

We are ordering workstations with 256 GB of RAM, and hopefully I'll add more VRAM with the next generation of video cards.

Update: a cell number somewhere between 350K and 400K causes the kernel crash for me. 350K took 299 seconds to finish, but 400K crashed the kernel.

JBreunig commented:

Just FYI, I've upgraded to 256 GB of RAM and completely reinstalled the drivers and CUDA (from 10.1 to 10.2), and I still have problems with the code getting perpetually "stuck" in the regression or scaling steps (no kernel crashes lately, but a few CPU cores stay continually engaged by Python with no progression to completion). Have you run this code on a 2080 Ti or another non-Tesla card?

This only happens above 350K cells.

Is there any way to troubleshoot this?


cjnolet commented Sep 16, 2020

@JBreunig,

> Have you run this code on a 2080 Ti or other non-TESLA card?

Unfortunately, I don't have any 2080 Tis available to reproduce your problem on my end, and the 1M cell notebook appears to be working with the T4 instances in AWS, which rules out the problem being exclusive to the Turing architecture. This behavior does sound very strange, though.

> Is there any way to troubleshoot this?

A lot of times when errors are printed, they end up in the command line running the Jupyter notebook rather than in the notebook itself. Do you see any errors on the command line?

You can set verbose=True in the call to rapids_scanpy_funcs.regress_out, which will print a message after every 500 cells processed. If that's not enough, you can add more prints to the regress_out and scale functions in rapids_scanpy_funcs.

If you have a command line available, can you also run nvidia-smi? That should at least help us determine whether the GPU is being actively utilized when the code gets stuck.
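Roughly, those two checks together might look like the following; only verbose=True is the actual suggestion here, and the other argument names are placeholders for whatever the notebook already passes:

# In the notebook: ask regress_out to report progress as it works.
# Only verbose=True is the suggestion; the other arguments are placeholders
# for the values the notebook already uses.
corrected = rapids_scanpy_funcs.regress_out(
    normalized_gpu_array,   # placeholder name for the matrix being regressed
    n_counts,
    percent_mito,
    verbose=True,           # prints progress after every 500 cells
)

# In a separate terminal, watch GPU utilization while the cell runs:
#   watch -n 1 nvidia-smi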


JBreunig commented Dec 9, 2020

Sorry for the delay. Coming back to this, it now seems to hang at

%%time
sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, genes, min_cells=1)

I'm guessing this is related to issue #53?
