
problem with loading canFam3ToHg38.over.chain #9

Open
rkimoakbioinformatics opened this issue May 16, 2019 · 6 comments

Comments

@rkimoakbioinformatics

I have a Windows system with 32 GB of RAM, and LiftOver("canFam3ToHg38.over.chain") consumed all of the memory and did not finish. Running cProfile.run on the same command with a portion of the chain file showed that most of the time was spent in the add_interval function and another related function. Is there a way to work around this?
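For reference, profiling a single call like this with cProfile might look as follows; load_chain_subset here is a hypothetical stand-in for the real loading call, which you would profile the same way:

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for loading a portion of the chain file;
# replace the call below with the real LiftOver(...) invocation.
def load_chain_subset():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
load_chain_subset()
profiler.disable()

# Print the five most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Sorting by cumulative time is what surfaces hot spots like add_interval near the top of the report.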

@konstantint
Owner

Hmm. Although canFam3ToHg38 is a large file with 38M blocks, I don't see how those wouldn't fit in 32 GB (that said, I'm trying to load it on a machine with even more memory and it also seems to be stuck).
This might be related to the overly naive implementation of the interval-tree data structure being used; however, it's pretty hard to debug, especially given that other chain files have worked fine so far.

I'll need to try a more robust interval-tree implementation (e.g. one with rebalancing), but I'm not sure when I'll get the time to look into it.

@konstantint
Owner

So, apparently, the problem is not in the algorithm but indeed in the memory consumption (the 64 GB machine could handle the file).

I added a couple of changes that should considerably reduce the memory consumption. Loading still takes forever (several hours), though. Try the newer version and see whether it helps. In principle, once you load the file (if it succeeds), you could try pickling the resulting LiftOver object; I suspect unpickling it would be faster than re-reading and re-indexing everything.

There's also a new flag for the LiftOver constructor that might help a bit: show_progress=True displays a progress bar. You need to pip install tqdm for this to work, though.

@rkimoakbioinformatics
Author

Thanks for your answer. I have been thinking about the possibility of speeding this up with something like Cython or Numba. Have you tried either already, or do you have any thoughts on this?

@rkimoakbioinformatics
Author

rkimoakbioinformatics commented May 26, 2019

I tried using ncls (https://github.com/hunt-genes/ncls) and numpy, and it seems to work. For canFam3ToHg38.over.chain, loading the chain file takes about 2 minutes and 7.3 GB of memory. If you are interested, the modified file is at https://drive.google.com/open?id=1OA65_8EUrk9zZi-iQqMKYQASEAN27mA1.
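Much of the gain from ncls comes from keeping interval coordinates in packed numpy arrays and querying them in C, rather than building one Python object per block. A minimal pure-Python sketch of the underlying sorted-interval lookup (using bisect, with hypothetical sample blocks, and assuming non-overlapping intervals, which holds within a single chain) might look like:

```python
import bisect

# Hypothetical sample chain blocks: (source_start, source_end, target_offset).
# In ncls these three columns would live in parallel numpy int64 arrays.
blocks = sorted([(0, 10, 1000), (10, 25, 2000), (40, 60, 3000)])
starts = [b[0] for b in blocks]

def find_block(pos):
    """Return the block covering pos, or None if pos falls in a gap."""
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and blocks[i][0] <= pos < blocks[i][1]:
        return blocks[i]
    return None
```

ncls generalizes this to overlapping intervals via nested containment lists, and storing the columns as numpy arrays instead of millions of Python tuples is presumably where most of the memory savings come from.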

@konstantint
Owner

No, I haven't tried either Numba or Cython, because the initial aim was to have a super-simple pure-Python tool (also, Numba did not even exist when this was first written, I believe). And it does work fine for the common use cases (hg18-to-hg19, etc.); the canFam3 mapping is more of an exception in terms of the amount of re-mapping involved.

Things like Numba/Cython/Psyco/PyPy are in general reasonable directions to look for speed-ups, but I suspect getting a considerable effect might not be straightforward. I'd also prefer to keep the tool free of compilation steps and heavy dependencies.

I tried running the current version on a Linux server with 32G memory, and the results are as follows:

  • Loading & indexing data: 13:35 (min:sec)
  • Pickling: 13:14
  • Unpickling: 4:38
    Pickled object size: 5.2GB.

Thus, if you just need to get things done, for now I'd suggest trying a Linux machine (for some reason it runs faster than on Windows; also note that Python 3.7 seems about 1.5-2x the speed of 2.7) and, if you need to load the file multiple times, pre-pickling it.
Alternatively, just try the standard liftOver binary.

@konstantint
Owner

(I somehow missed your last comment before posting mine; I've now seen the email notification, though.)

Nice, I'll check the ncls option and will probably replace my data structure with it if it's faster.
