OverflowError: value too large to convert to int #8

Open
ethering opened this issue Feb 11, 2019 · 2 comments

@ethering

Hi,
I've successfully installed HyDe and run it on both the bundled test data and some test data of my own (5 kb sequences taken from the whole-genome data described below).

Now I'm trying to input whole-genome sequences of 2.4 Gb each for 24 species over 5 taxa, but I'm getting the following error:

Command line:
run_hyde_mp.py -i hyde_samples.phy -m map.txt -o OUT -j 16 -n 24 -t 5 -s 2410758013

Error:

Traceback (most recent call last):
  File "run_hyde_mp.py", line 141, in <module>
    data = hd.HydeData(infile, mapfile, outgroup, nind, ntaxa, nsites, quiet)
  File "phyde/core/data.pyx", line 108, in phyde.core.data.HydeData.__init__
OverflowError: value too large to convert to int

I'm providing 2 TB of RAM, so presumably it's not a resource problem. Is there a maximum sequence length that HyDe can handle, and if so, do you know what it is? I could then cut my genome sequences into chunks and re-run HyDe on each chunk.
Many thanks,
Graham

@pblischak
Owner

Hi Graham,

The biggest sequences that we have run through HyDe were about 250 Mb, so maybe that could be a good place to start for splitting things up. If you have chromosome-level information you could also potentially run things chromosome-by-chromosome.
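
One way to do the splitting is to cut every individual's sequence at the same column boundaries and write each block of columns to its own file. The sketch below is only an illustration: it assumes the input has one individual per line in the form `name<whitespace>sequence` with no header line and no spaces inside the sequence, and `chunk_bounds` and `split_alignment` are hypothetical helper names, not part of HyDe.

```python
def chunk_bounds(nsites, chunk_size):
    """Return (start, end) column ranges covering 0..nsites in chunk_size steps."""
    return [(start, min(start + chunk_size, nsites))
            for start in range(0, nsites, chunk_size)]


def split_alignment(in_path, out_prefix, nsites, chunk_size):
    """Write one file per column chunk; return a list of (path, chunk_length)."""
    bounds = chunk_bounds(nsites, chunk_size)
    paths = [f"{out_prefix}.chunk{i}.phy" for i in range(len(bounds))]
    outs = [open(p, "w") for p in paths]
    try:
        with open(in_path) as handle:
            for line in handle:  # assumption: one individual per line
                if not line.strip():
                    continue
                name, seq = line.split(None, 1)
                seq = seq.strip()
                if len(seq) != nsites:
                    raise ValueError(
                        f"{name}: expected {nsites} sites, got {len(seq)}")
                # Same column range goes to the same file for every individual.
                for out, (start, end) in zip(outs, bounds):
                    out.write(f"{name}\t{seq[start:end]}\n")
    finally:
        for out in outs:
            out.close()
    return [(p, end - start) for p, (start, end) in zip(paths, bounds)]
```

For example, `chunk_bounds(2_410_758_013, 250_000_000)` gives ten ranges, the last one shorter than the rest. Each chunk would presumably need its own `-s` value matching the chunk length returned by `split_alignment`.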

One other thing: in my experience, using too many threads with the run_hyde_mp.py script when you have a big data set can actually cause analyses to run slower. I'm pretty sure the reason for this is that the built-in multiprocessing library for Python creates a copy of the data for each thread. If you're giving the script a lot of threads, then each one needs its own copy, which makes things really slow. I don't remember the exact numbers off the top of my head, but our analysis of the Heliconius data from our paper was only faster when we used 2 threads, and was slower when we used 4.

@ethering
Author

ethering commented Mar 4, 2019

Hi,
Just to give some feedback: I tried a few different split sequence sizes (started with 10x 250 Mb and then worked down). In the end, I found that I only needed to split my genome into two halves to get it to work with HyDe. The first sequence was 1 Gbase long and the second 1.4 Gbases long. They ran fine. I'm wondering if HyDe is limited by a 32-bit signed integer (maximum 2,147,483,647). I also found this bug in bedtools when I was trying to use bed files to split my genomes up. If I have time, I'll create sequences that are 2,147,483,647 and 2,147,483,648 in length and see if they both run.
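
The 32-bit hypothesis is at least consistent with the numbers in this thread. A quick check, as a sketch only: it compares the site counts against the signed 32-bit range and says nothing about which integer type HyDe actually uses internally. The two half lengths are the approximate values quoted above, and `fits_int32` is a hypothetical helper name.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit integer


def fits_int32(nsites):
    """Return True if nsites can be stored in a signed 32-bit integer."""
    return -2**31 <= nsites <= INT32_MAX


# Site counts from this thread (the two halves are approximate).
lengths = {
    "full genome (-s value)": 2_410_758_013,
    "first half": 1_000_000_000,
    "second half": 1_400_000_000,
}

for label, n in lengths.items():
    print(f"{label}: {n:,} fits in int32? {fits_int32(n)}")
```

The full-length `-s` value exceeds the limit while both halves are below it, which matches the failing and working runs.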
