feature request: merging files for memory issues? #10

Closed
chooliu opened this issue Jul 5, 2022 · 9 comments
Labels: enhancement (New feature or request)

Comments

chooliu commented Jul 5, 2022

Hi there--I'm an analyst in the Luo lab (associated with one of the single-cell methylomes datasets used in the preprint!). Thanks for developing this software-- I love the user-friendliness and the VMR concept in scbs, and running scbs on CpG methylation is quite smooth.

However, I'm running into some issues as I frequently need to work with non-CpG methylation (CH methylation), which can be associated with ~20-fold more loci. Using 64GB memory and ~2,000 cells, larger chromosomes will fail at the end of the scbs prepare step in what looks like the .coo to .npz conversion.

While I'm currently attempting to re-run with more memory, this is a relatively low cell count dataset for us. It'd be great to somehow merge sets of cells (so maybe I could run on 1,000 cells of the dataset at a time?) or somehow read/write each chromosome in blocks to use less memory, though I'm not sure either is possible with the details of the sparse format. Are there any ways to "recover" a run to convert the 1.coo file (which is successfully created) without re-running prepare, or any other recommendations?

Cheers!

Populating 57334996 x 2165 matrix for chromosome 19...
Converting from COO to CSR...
Writing to scbs_out_CH/19.npz ...
Populating 260521582 x 2165 matrix for chromosome 1...
Traceback (most recent call last):
  File "/u/home/c/lib/python3.8/site-packages/scbs/prepare.py", line 156, in _load_csr_from_coo
    coo = pd.read_csv(coo_path, delimiter=",", header=None).values
  File "/u/home/c/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/u/home/c/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/u/home/c/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/u/home/c/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1269, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/frame.py", line 636, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 156, in arrays_to_mgr
    return create_block_manager_from_column_arrays(
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1954, in create_block_manager_from_column_arrays
    blocks = _form_blocks(arrays, consolidate)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2028, in _form_blocks
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2067, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 51.7 GiB for an array with shape (3, 2313401180) and data type int64

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/u/home/c/bin/scbs", line 8, in <module>
    sys.exit(cli())
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/u/home/c/lib/python3.8/site-packages/scbs/cli.py", line 157, in prepare_cli
    prepare(**kwargs)
  File "/u/home/c/lib/python3.8/site-packages/scbs/prepare.py", line 44, in prepare
    mat = _load_csr_from_coo(coo_path, chrom_size, n_cells)
  File "/u/home/c/lib/python3.8/site-packages/scbs/prepare.py", line 165, in _load_csr_from_coo
    raise type(exc)(f"{exc} (problematic file: {coo_path})").with_traceback(
TypeError: __init__() missing 1 required positional argument: 'dtype'

LKremer (Owner) commented Jul 6, 2022

Hi @chooliu ,
thanks for the nice feedback! Sounds like you got quite an impressive data set.

In the long run, we want to re-write the whole prepare script to make it faster and more memory efficient. Currently, we first read the methylation files and write them to a sparse matrix in COO format, and then convert the COO matrix to CSR format (stored in the form of .npz files). I'm sure there must be a way to either skip COO entirely and write straight to CSR, or to convert COO to CSR without reading the whole COO matrix into memory. I'll think about it and see if I can find a better way.
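
Just to sketch that second idea: this is not the actual scbs code, and it assumes each line of the .coo file is "position,cell_index,value" for a chrom_size x n_cells matrix (the dtype is a guess too), but something along these lines might work:

import numpy as np
import pandas as pd
import scipy.sparse as sp

def coo_to_csr_chunked(coo_path, chrom_size, n_cells, chunk_rows=50_000_000):
    # Start from an empty CSR matrix and add one chunk of the .coo file at a time,
    # so the full ~2.3 billion rows never sit in memory at once.
    mat = sp.csr_matrix((chrom_size, n_cells), dtype=np.float32)  # dtype is an assumption
    reader = pd.read_csv(coo_path, header=None, names=["pos", "cell", "value"],
                         chunksize=chunk_rows)
    for chunk in reader:
        part = sp.coo_matrix(
            (chunk["value"], (chunk["pos"], chunk["cell"])),
            shape=(chrom_size, n_cells), dtype=np.float32,
        )
        # Summing is safe assuming each (position, cell) pair occurs only once per chromosome.
        mat += part.tocsr()
    return mat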

Your suggestion to process the data in chunks would also work, of course. I'm not sure yet what the best solution to this problem is.

We didn't implement a function to resume the prepare script starting from COO files, so the easiest way would be to re-run scbs prepare. Manually recovering the COO file is possible in theory, but I think it's a little tricky. I can't really recommend it. If you still want to give it a shot, you can use Python to read the COO file, convert it to CSR format, and store it as an .npz file. Have a look at _load_csr_from_coo() to see how to load a COO file. You can then save it with scipy.sparse.save_npz(). But this seems pretty tedious and error-prone, and you still need to get all of the .npz files of the other chromosomes somehow. Another problem is that scbs prepare produces a bunch of metadata (a file listing the cell names, quality metrics, etc.) and you wouldn't get these files if you do everything manually. Or, if you re-run without chromosome 1, the quality metrics would miss some values, of course. So yeah, I think re-running is the best way, even if it takes a while.
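
For completeness, the manual route would boil down to something like this rough, untested sketch. The .coo column order and the file paths are assumptions here (compare with _load_csr_from_coo() before trusting the output), and note that it still reads the whole file at once, so it only helps on a machine with more RAM:

import pandas as pd
import scipy.sparse as sp

# Assuming the .coo columns are (genomic position, cell index, methylation value);
# check _load_csr_from_coo() in scbs/prepare.py for the real layout.
# The matrix shape is taken from your log: 260521582 x 2165 for chromosome 1.
coo = pd.read_csv("scbs_out_CH/1.coo", delimiter=",", header=None).values
mat = sp.coo_matrix(
    (coo[:, 2], (coo[:, 0], coo[:, 1])),
    shape=(260521582, 2165),
).tocsr()
sp.save_npz("scbs_out_CH/1.npz", mat)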

Thanks for using scbs and reporting this issue.
I will let you know once I've found a way to decrease the memory requirements.

LKremer (Owner) commented Jul 20, 2022

Hi @chooliu,

I rewrote the memory-inefficient part of scbs prepare that caused your crash. The COO file is now read in chunks instead of reading the whole chromosome. Before I release this version, could you please try this version and tell me if it fixed your problem? I tested it on our own data and it seems to work.

scbs-0.4.0.tar.gz

After downloading the .tar.gz file, you can install it like this:

python3 -m pip install --upgrade scbs-0.4.0.tar.gz

Then check if you have the correct version (0.4.0) by just typing scbs.

After updating to 0.4.0 you can just use scbs prepare like you did before, but it should use less memory.

If you're still running out of memory, you can also lower the size of the chunks now. By default, each chromosome is now read in chunks of 10 megabases, so e.g. mouse chr1 consists of 20 chunks. If you want to lower the memory requirements even further, you can set e.g. --chunksize 1000000 and then it would be 200 chunks of 1 Mb each. It might be a little slower, but it will save more RAM.
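
For example, something like this (the input and output arguments below are just placeholders for whatever you passed to scbs prepare before):

scbs prepare --chunksize 1000000 <your methylation files> <output directory>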

Please let me know if it solved your issue :)

LKremer (Owner) commented Jul 22, 2022

I did some testing on a large data set with 2568 cells. I quantified GpC sites which are ~10x more frequent than CpG sites. So this data set is almost as big as the one you tried.

I measured the peak memory usage of different scbs versions and these are the results:

scbs 0.3.3:   43.72 gigabytes
scbs 0.4.0:   14.75 gigabytes

Surprisingly, lowering --chunksize didn't decrease memory usage further, so I think these ~15 GB are used by another part of the code that I didn't change. In any case, 15 GB seems manageable. For your larger data sets it may be bigger than that. But you can definitely fit many more cells into your 64 gigs of RAM now!

LKremer (Owner) commented Feb 3, 2023

closing for now, since this was addressed in release 0.4.0

LKremer closed this as completed Feb 3, 2023

chooliu (Author) commented Feb 3, 2023

Sorry Lucas, I could have sworn that I'd responded to your message from way back when; in any case, I got re-notified when the issue was closed.

Thanks so much for looking into memory requirements! I think non-CpG methylation is somewhat niche, but also vitally important to a lot of folks working in brain, development, etc. (where a lot of single-cell methylation work is being done).

I shared the scbs preprint with my group last year & more folks besides me are now playing with it--will let you know how it goes and thanks again :)

LKremer (Owner) commented Feb 6, 2023

No problem, and thanks for making me aware of this issue! I agree, CH methylation is interesting and we also had a look at it with scbs. We didn't notice the memory issues though, because we had fewer cells and were using a machine with 126 GB of RAM. So thanks again, and also thanks for sharing scbs with your peers :)

chooliu (Author) commented Feb 6, 2023

Oops, my apologies on that: my recollection was that the preprint exclusively discussed CpGs. Very excited to see how our field moves forward as larger cell count datasets emerge. :)

LKremer (Owner) commented Feb 6, 2023

You're right, we didn't discuss it in the preprint. But of course you can also input other data types such as CH methylation data. Good point actually, maybe we should discuss it in the next version of the paper.

chooliu (Author) commented Feb 6, 2023

I'm obviously biased here, but I think that inclusion would be interesting!

In particular, our group typically uses both CG & CH together for clustering, as CH is very useful for most brain datasets (https://lhqing.github.io/ALLCools/intro.html from collaborators in San Diego; it joins separate CH-PCs and CG-PCs as input features). One direction I've been exploring in my methods development work is whether CH-DMR calling requires distinct considerations.

Cheers,
Choo
