VRAM and cell numbers #14

Closed
JBreunig opened this issue Jun 9, 2020 · 16 comments

JBreunig commented Jun 9, 2020

Can you comment on the number of cells that can be processed as a function of VRAM? At 11 GB, I seem to be running into memory limits with bigger samples. (The tutorial ran fine.)

What is the max number of cells that you have tested?


cjnolet commented Jun 9, 2020

@JBreunig,

This is a great question. I admit we have mostly been testing on 32 GB GPUs, but I can generate some random data with known sparsity to get a feel for how far we can push different GPUs.

What is the ideal target size and sparsity for your GPU? Are there any public datasets with a similar size/sparsity?
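A minimal sketch of that kind of synthetic test might look like the following; the sizes and density are placeholders, not values from this thread:

# Illustrative sketch only: synthesize a cells-by-genes matrix with a chosen
# density on the host, then copy it to the GPU, to probe how large an input
# a given card can handle. The sizes and density below are placeholders.
import scipy.sparse as sp
import cupyx.scipy.sparse as cusparse

n_cells, n_genes, density = 100_000, 25_000, 0.07   # placeholder values

cpu_counts = sp.random(n_cells, n_genes, density=density,
                       format="csr", dtype="float32")
gpu_counts = cusparse.csr_matrix(cpu_counts)          # host -> device copy
print(gpu_counts.shape, gpu_counts.nnz)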

JBreunig commented:

Good examples would be here: http://mousebrain.org/downloads.html. You could use the aggregate loom (https://storage.googleapis.com/linnarsson-lab-loom/l5_all.agg.loom), or see where you hit the limit by concatenating different subsets from http://mousebrain.org/loomfiles_level_L1.html.

For example, I'm currently processing a dataset of 330,000 cells that includes all of the above plus a number of our own datasets, combined with batch correction.


cjnolet commented Jun 14, 2020

@JBreunig,

Wanted to provide a small update just to let you know that I have been looking into this. I think the most straightforward solution here might be to use the unified virtual memory allocator. I’ll get an example together.

JBreunig commented:

Ok, thanks! I'll close this.


cjnolet commented Jun 16, 2020

@JBreunig,

I've made a modification to the notebook to enable the Unified Virtual Memory manager in RAPIDS & CuPy. Specifically, the change looks like this:

import rmm
import cupy as cp

rmm.reinitialize(
    managed_memory=True,  # allows oversubscription of GPU memory
    devices=0,            # GPU device IDs to register; by default only GPU 0
)

cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

This should allow you to oversubscribe your GPU memory, using all available host memory and paging data onto the GPU as needed. While swapping pages can slow down the workflow, it makes it much easier to experiment with different datasets and sizes without having to think about available GPU memory.

For example, I was able to load the 10x 1.3M neuron dataset onto a 32 GB GPU fairly easily and run all sorts of transformations on it without ever encountering an OOM.

If you get a chance, try this feature out and let us know whether it allows you to scale higher than before. We're also very curious whether you still find it extremely fast. I didn't notice much of a performance hit at all, and I wasn't even paying attention to how much memory I was using (I'm sure it was well over 100 GB by the time I was done).
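As an aside, one quick way to confirm that oversubscription is in effect (an illustrative check, not part of the notebook) is to allocate an array larger than the card's VRAM after the rmm.reinitialize call above:

import cupy as cp

# Illustrative check only: ~16 GB of float32 zeros, more than an 11 GB card holds.
# With managed_memory=True this should succeed and spill to host memory;
# without it, the same allocation raises an out-of-memory error.
big = cp.zeros(4_000_000_000, dtype=cp.float32)
big += 1.0                        # touch the pages so they are really materialized
print(big.nbytes / 1e9, "GB allocated")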

JBreunig commented:

I will try the 1.3M neuron dataset myself and some internal datasets at some point today.

Using this new modification looks excellent so far. Initial results: 54 seconds (GPU) vs. 563 seconds (CPU) on a 32-core/64-thread Threadripper with a 1080 Ti and 128 GB of RAM.

JBreunig commented:

Just FYI, the 1.3M neuron dataset crashes the kernel on this line:

filtered = rapids_scanpy_funcs.filter_cells(sparse_gpu_array, min_genes=min_genes_per_cell, max_genes=max_genes_per_cell)

However, my other dataset of ~330K cells gets past this step. It doesn't seem to be a RAM/swap issue, as I don't appear to be approaching that ceiling. Perhaps it's a VRAM issue? I'll try again later.


cjnolet commented Jun 17, 2020

@JBreunig,

I meant to respond to you earlier about this. Indeed, I noticed the crash as well while playing around with this. It's actually a known bug in cuSPARSE, and we're waiting for a fix. I managed to isolate the bug to entry 1057790 in the input data; the problem occurs in the conversion from the CPU array to the GPU.

If you slice off the first 1M records, or vstack everything up to row 1057789 with everything above 1057791, the filter will work. I was able to run the 1M cells fairly easily all the way up to regress_out. We're also very close to merging a PR in cuML that will enable sparse inputs for PCA (and doesn't require conversion to dense for the mean centering).
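A rough sketch of that workaround on the host side; the function name and the idea of dropping just the offending row are for illustration, not code from the notebook:

import scipy.sparse as sp

def drop_bad_row(counts: sp.csr_matrix, bad_row: int = 1057790) -> sp.csr_matrix:
    """Hypothetical workaround: remove the single row that trips the cuSPARSE
    bug before handing the matrix to the GPU."""
    return sp.vstack([counts[:bad_row], counts[bad_row + 1:]], format="csr")

# Alternatively, just work with the first 1M cells until the fix lands upstream:
# subset = counts[:1_000_000]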

JBreunig commented:

Just FYI, I'm consistently crashing the kernel here with your new notebook and rapids_scanpy_funcs.py file:

%%time
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")
adata = adata[:, adata.var.highly_variable]

Any suggestions?


cjnolet commented Jun 23, 2020

@JBreunig,

What is the shape of the adata passed into highly_variable_genes? Is it giving you any error at all before the kernel crashes?

JBreunig commented:

(989838, 23781), right after this step (with a line added to print the shape):

%%time
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = genes.to_pandas()
adata.shape

and there is no Python error, just the kernel crash and a system request to send a report.

Let me try shaving off cells to see if it's a memory issue.


cjnolet commented Jun 25, 2020

@JBreunig,

I believe we hit a similar issue today, where our Jupyter kernel crashed without giving any useful error information. I'm pretty sure it's because we were running on a system that didn't have enough main memory.

While the benefit of the managed memory option is the ability to oversubscribe GPU memory, it does increase the amount of main memory required.

Many of the CPU examples of the 1.3M cells dataset indicate a requirement of at least 30 GB of main memory to do the processing end to end. I think you can get away with a smaller GPU and managed memory, but this comes at the expense of needing more main memory.
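As a rough sanity check, you can compare available host memory against what the managed allocator would need to spill into; psutil here is just one convenient way to do that, not something the notebook itself uses:

import psutil

# Report host memory before enabling managed memory: oversubscribing the GPU
# only helps if the spilled pages have somewhere to go in main memory.
vm = psutil.virtual_memory()
print(f"Host RAM: {vm.total / 1e9:.0f} GB total, {vm.available / 1e9:.0f} GB available")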


JBreunig commented Jun 25, 2020

I can run it fine with 300K cells (266 seconds), but somewhere a little above that it fails. At 500K it gets stuck and never finishes, and at 700K or above it seems to crash the kernel. As I mentioned, I have 128 GB of RAM and 628 GB set aside for swap, but it doesn't appear to get near that limit, especially with 500K.

We are ordering workstations with 256 GB of RAM, and hopefully I'll add more VRAM with the next generation of video cards.

Update: a cell number somewhere between 350K and 400K causes the kernel crash for me. 350K took 299 seconds to finish, but 400K crashed the kernel.

JBreunig commented:

Just FYI, I've upgraded to 256 GB of RAM and completely reinstalled the drivers and CUDA (from 10.1 to 10.2), and I still have problems with the code getting perpetually "stuck" in the regression or scaling steps (no kernel crashes lately, but a few CPU cores stay continually engaged by Python with no progression to completion). Have you run this code on a 2080 Ti or another non-Tesla card?

This only happens above 350K cells.

Is there any way to troubleshoot this?


cjnolet commented Sep 16, 2020

@JBreunig,

> Have you run this code on a 2080 Ti or other non-TESLA card?

Unfortunately, I don't have any 2080 Tis available to reproduce your problem on my end, and the 1M cell notebook appears to be working with the T4 instances in AWS, which rules out the problem being exclusive to the Turing architecture. This behavior does sound very strange, though.

> Is there any way to troubleshoot this?

A lot of times when errors are printed, they end up in the command line running the Jupyter notebook rather than in the notebook itself. Do you see any errors on the command line?

You can set verbose=True in the call to rapids_scanpy_funcs.regress_out, which will print a message after every 500 cells processed. If that's not enough, you can add more prints to the regress_out and scale functions in rapids_scanpy_funcs.

If you have a command line available, can you also run nvidia-smi? That should at least help us determine whether the GPU is being actively utilized when the code gets stuck.
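Roughly, those two checks together might look like the following; only verbose=True is the actual suggestion here, and the other argument names are placeholders for whatever the notebook already passes:

# In the notebook: ask regress_out to report progress as it works.
# Only verbose=True is the suggestion; the other arguments are placeholders
# for the values the notebook already uses.
corrected = rapids_scanpy_funcs.regress_out(
    normalized_gpu_array,   # placeholder name for the matrix being regressed
    n_counts,
    percent_mito,
    verbose=True,           # prints progress after every 500 cells
)

# In a separate terminal, watch GPU utilization while the cell runs:
#   watch -n 1 nvidia-smi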


JBreunig commented Dec 9, 2020

Sorry for the delay. Coming back to this, it now seems to hang at

%%time
sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, genes, min_cells=1)

I'm guessing this is related to issue #53?
