feature request: merging files for memory issues? #10
Hi @chooliu,

In the long run, we want to rewrite the whole prepare script to make it faster and more memory-efficient. Currently, we first read the methylation files and write them to a sparse matrix in COO format, and then convert the COO matrix to CSR format (stored as `.npz` files). I'm sure there must be a way to either skip COO entirely and write straight to CSR, or to convert COO to CSR without reading the whole COO matrix into memory. I'll think about it and see if I can find a better way. Your suggestion to process the data in chunks would also work, of course. I'm not sure yet what the best solution to this problem is.

We didn't implement a function to resume the prepare script starting from COO files, so the easiest way would be to re-run `scbs prepare`. Manually recovering the COO file is possible in theory, but I think it's a little tricky, so I can't really recommend it. If you still want to give it a shot, you can use Python to read the COO file, convert it to CSR format, and store it as an `.npz` file. Have a look at `_load_csr_from_coo()` to see how to load a COO file. You can then save it with

Thanks for using scbs and reporting this issue.
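A minimal sketch of the manual recovery described above, assuming the `.coo` file stores one `row,col,value` triplet per line (check `_load_csr_from_coo()` in the scbs source for the actual on-disk format before using this). Note it still reads the entire file into memory, so it only helps if you can run the conversion on a bigger machine than the one that crashed:

```python
# Hypothetical recovery helper, NOT part of scbs itself.
# ASSUMPTION: the .coo file holds one "row,col,value" triplet per line;
# see scbs's _load_csr_from_coo() for the real format.
import scipy.sparse


def coo_file_to_npz(coo_path, npz_path):
    rows, cols, vals = [], [], []
    with open(coo_path) as f:
        for line in f:
            r, c, v = line.strip().split(",")
            rows.append(int(r))
            cols.append(int(c))
            vals.append(float(v))
    # build the COO matrix, compress to CSR, and store it as .npz
    mat = scipy.sparse.coo_matrix((vals, (rows, cols))).tocsr()
    scipy.sparse.save_npz(npz_path, mat)
```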
Hi @chooliu, I rewrote the memory-inefficient part of `scbs prepare` that caused your crash. The COO file is now read in chunks instead of reading the whole chromosome at once. Before I release this version, could you please try it and tell me if it fixed your problem? I tested it on our own data and it seems to work. After downloading the .tar.gz file, you can install it like this:
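The exact install command did not survive here; installing a downloaded source tarball with pip generally looks like this (the filename below is hypothetical, substitute the file you actually downloaded):

```shell
# filename is hypothetical -- use the path of the .tar.gz you downloaded
python3 -m pip install ./scbs-0.4.0.tar.gz
```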
Then check if you have the correct version (0.4.0) by just typing

After updating to 0.4.0, you can use `scbs prepare` like you did before, but it should use less memory. If you're still running out of memory, you can also lower the chunk size now. By default, each chromosome is now read in 10-megabase chunks, so that e.g. mouse chr1 consists of 20 chunks. If you want to lower the memory requirements even further, you can set e.g.

Please let me know if it solved your issue :)
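The chunking idea can be illustrated like this. This is not scbs's actual implementation, just a sketch under assumptions: triplets are `(position, cell, value)` sorted by position, and `chunk_size` stands in for the new option (whose real flag name isn't shown here). The memory win is that each finished window is compressed to CSR immediately, which is far smaller than keeping raw triplet lists for a whole chromosome:

```python
# Illustrative sketch of chunked COO-to-CSR conversion (not scbs's code).
from itertools import groupby

import scipy.sparse


def chunked_coo_to_csr(triplets, n_cells, chrom_len, chunk_size=10_000_000):
    """triplets: iterable of (position, cell, value), sorted by position."""
    n_chunks = -(-chrom_len // chunk_size)  # ceiling division
    blocks = []
    for chunk_i, group in groupby(triplets, key=lambda t: t[0] // chunk_size):
        # insert empty blocks for windows that contained no sites
        while len(blocks) < chunk_i:
            blocks.append(scipy.sparse.csr_matrix((chunk_size, n_cells)))
        rows, cols, vals = [], [], []
        for pos, cell, val in group:
            rows.append(pos - chunk_i * chunk_size)  # position within window
            cols.append(cell)
            vals.append(val)
        blocks.append(
            scipy.sparse.coo_matrix(
                (vals, (rows, cols)), shape=(chunk_size, n_cells)
            ).tocsr()
        )
    while len(blocks) < n_chunks:
        blocks.append(scipy.sparse.csr_matrix((chunk_size, n_cells)))
    # stack the per-window blocks and trim the padding past chrom_len
    return scipy.sparse.vstack(blocks, format="csr")[:chrom_len]
```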
I did some testing on a large data set with 2568 cells. I quantified GpC sites, which are ~10x more frequent than CpG sites, so this data set is almost as big as the one you tried. I measured the peak memory usage of different scbs versions, and these are the results:
Surprisingly, lowering
Closing for now, since this was addressed in release 0.4.0.
Sorry Lucas, I could have sworn that I responded to your message from way back when; I got re-notified when the issue was closed. Thanks so much for looking into the memory requirements! I think non-CpG methylation is somewhat niche, but also vitally important to a lot of folks working in brain, development, etc. (where a lot of single-cell methylation work is being done). I shared the scbs preprint with my group last year & more folks besides me are now playing with it -- will let you know how it goes, and thanks again :)
No problem! I agree, CH methylation is interesting and we also had a look at it with scbs. We didn't notice the memory issues though, because we had fewer cells and we were using a 126 GB RAM machine. So thanks for making me aware of this issue, and also thanks for sharing scbs with your peers :)
Oops, my apologies on that: my recollection was the preprint exclusively discussed CpGs. Very excited to see how our field moves forward as larger cell count datasets emerge. :) |
You're right, we didn't discuss it in the preprint. But of course you can also input other data types such as CH methylation data. Good point actually, maybe we should discuss it in the next version of the paper. |
I'm obviously biased here, but I think that inclusion would be interesting! In particular, our group typically uses both CG & CH together for clustering, as CH is very useful for most brain datasets (https://lhqing.github.io/ALLCools/intro.html from collaborators in San Diego joins separate CH-PCs and CG-PCs as input features), and one direction I've been exploring in my methods development work is whether CH-DMR calling requires distinct considerations. Cheers,
Hi there -- I'm an analyst in the Luo lab (associated with one of the single-cell methylome datasets used in the preprint!). Thanks for developing this software -- I love the user-friendliness and the VMR concept in scbs, and running scbs on CpG methylation is quite smooth.
However, I'm running into some issues, as I frequently need to work with non-CpG methylation (CH methylation), which can be associated with ~20-fold more loci. Using 64 GB memory and ~2,000 cells, larger chromosomes will fail at the end of the `scbs prepare` step, in what looks like the .coo to .npz conversion. While I'm currently attempting to re-run with more memory, this is a relatively low cell count dataset for us. It'd be great to somehow merge sets of cells (so maybe I could run on 1,000 cells of the dataset at a time?) or somehow read/write each chromosome in blocks to use less memory, though I'm not sure either is possible with the details of the sparse format. Are there any ways to "recover" a run to convert the 1.coo file (which is successfully created) without re-running prepare, or any other recommendations?
Cheers!
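The batch-merging idea above could in principle be done after the fact with scipy. A minimal sketch, under the ASSUMPTION that cells are stored as columns of each per-chromosome CSR matrix (if scbs stores them as rows, `scipy.sparse.vstack` would be the analogue); `merge_cell_batches` is a hypothetical helper, not part of scbs:

```python
# Hypothetical helper for merging per-batch .npz matrices, NOT scbs API.
# ASSUMPTION: cells are columns; use scipy.sparse.vstack if they are rows.
import scipy.sparse


def merge_cell_batches(npz_paths, out_path):
    mats = [scipy.sparse.load_npz(p) for p in npz_paths]
    # concatenate cell columns; column order follows the batch order given
    merged = scipy.sparse.hstack(mats, format="csr")
    scipy.sparse.save_npz(out_path, merged)
    return merged
```

Note that the merged column order is simply the concatenation of the batches, so any downstream cell metadata would have to be concatenated in the same order.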