
problem with loading canFam3ToHg38.over.chain #9

Open
rkimoakbioinformatics opened this issue May 16, 2019 · 6 comments

Comments

@rkimoakbioinformatics

I have a Windows system with 32 GB of RAM, and LiftOver("canFam3ToHg38.over.chain") consumed all of the memory and did not finish. Running cProfile.run on the same command with a portion of the chain file showed that most of the time was spent in the add_interval function and another related function. Is there a way to work around this?
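For reference, profiling a single call like this with cProfile might look as follows; load_chain_subset here is a hypothetical stand-in for the real loading call, which you would profile the same way:

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for loading a portion of the chain file;
# replace the call below with the real LiftOver(...) invocation.
def load_chain_subset():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
load_chain_subset()
profiler.disable()

# Print the five most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Sorting by cumulative time is what surfaces hot spots like add_interval near the top of the report.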

@konstantint
Owner

Hmm. Although canFam3ToHg38 is a large file with 38M blocks, I don't see how those wouldn't fit in 32 GB (that said, I'm trying to load it on a machine with even more memory and it also seems to be stuck).
This might be related to the overly naive implementation of the interval-tree data structure being used; however, it's pretty hard to debug, especially given that other chain files have worked fine so far.

I'll need to try a more robust interval-tree implementation (e.g. one with rebalancing), but I'm not sure when I'll get the time to look into it.

@konstantint
Owner

So, apparently, the problem is not in the algorithm but indeed in the memory consumption (the 64 GB machine could handle the file).

I added a couple of changes that should considerably reduce the memory consumption. Loading still takes forever (several hours), though. Try the newer version and see whether it helps. In principle, once you load the file (if it succeeds), you could try pickling the resulting LiftOver object; I suspect unpickling it would be faster than re-reading and re-indexing everything.

There's also a new flag for the LiftOver constructor that might help a bit: show_progress=True displays a progress bar. You need to pip install tqdm for this to work, though.

@rkimoakbioinformatics
Author

Thanks for your answer. I have been thinking about the possibility of speeding this up with something like Cython or Numba. Have you tried either already, or do you have any thoughts on this?

@rkimoakbioinformatics
Author

rkimoakbioinformatics commented May 26, 2019

I tried using ncls (https://github.com/hunt-genes/ncls) and numpy, and it seems to work. For canFam3ToHg38.over.chain, loading the chain file takes about 2 minutes and 7.3 GB of memory. If you are interested, the modified file is at https://drive.google.com/open?id=1OA65_8EUrk9zZi-iQqMKYQASEAN27mA1.
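Much of the gain from ncls comes from keeping interval coordinates in packed numpy arrays and querying them in C, rather than building one Python object per block. A minimal pure-Python sketch of the underlying sorted-interval lookup (using bisect, with hypothetical sample blocks, and assuming non-overlapping intervals, which holds within a single chain) might look like:

```python
import bisect

# Hypothetical sample chain blocks: (source_start, source_end, target_offset).
# In ncls these three columns would live in parallel numpy int64 arrays.
blocks = sorted([(0, 10, 1000), (10, 25, 2000), (40, 60, 3000)])
starts = [b[0] for b in blocks]

def find_block(pos):
    """Return the block covering pos, or None if pos falls in a gap."""
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and blocks[i][0] <= pos < blocks[i][1]:
        return blocks[i]
    return None
```

ncls generalizes this to overlapping intervals via nested containment lists, and storing the columns as numpy arrays instead of millions of Python tuples is presumably where most of the memory savings come from.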

@konstantint
Owner

No, I haven't tried either Numba or Cython, because the initial aim was to have a super-simple pure-Python tool (also, Numba did not even exist when this was first written, I believe). And it does work fine for the common use cases (hg18-to-hg19, etc.); the canFam3 mapping is more of an exception in terms of the amount of re-mapping involved.

Things like Numba/Cython/Psyco/PyPy are in general reasonable directions to look for speed-ups, but I suspect getting a considerable effect might not be straightforward. I'd also prefer to keep the tool free of compilation steps and heavy dependencies.

I tried running the current version on a Linux server with 32G memory, and the results are as follows:

  • Loading & indexing data: 13:35 (min:sec)
  • Pickling: 13:14
  • Unpickling: 4:38
    Pickled object size: 5.2GB.

Thus, if you just need to get things done, for now I'd suggest trying a Linux machine (for some reason it runs faster than on Windows; also note that Python 3.7 seems about 1.5-2x the speed of 2.7) and, if you need to load the file multiple times, pre-pickling it.
Alternatively, just try the standard liftOver binary.

@konstantint
Owner

(I somehow missed your last comment before posting mine; I've now seen the email notification, though.)

Nice, I'll check the ncls option and will probably replace my data structure with it if it's faster.
