Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OverflowError: cannot convert float infinity to integer #2

Closed
ericopolo opened this issue Apr 17, 2021 · 5 comments
Closed

OverflowError: cannot convert float infinity to integer #2

ericopolo opened this issue Apr 17, 2021 · 5 comments

Comments

@ericopolo
Copy link

Hi! I'm running F4 with my data and I'm getting the error below. It already happened twice, and after a similar (although different) number of burn-in steps (this one was at step 28556 and the first time it happened at 28674). The input I'm using was successfully tested on Treemix, and I know my copy of F4.py works because I could run it with your "example.txt" without problems. My dataset has just a little more SNPs than the example (797), but maybe the issue has something to do with estimation of effective population sizes, because the burn-in never seems to end and that variable that caused the overflow becomes ridiculously big. I'm attaching my input file, so the problem can be reproduced.

Running simulations (burnin 28556)...Traceback (most recent call last):
  File "/home/ericolegal/bin/f4.py", line 881, in <module>
    mean_effective_population_size = int(mean_effective_population_size*mean_effective_population_size_scaler)
OverflowError: cannot convert float infinity to integer

input.txt

Cheers,

Érico.

@mmatschiner
Copy link
Owner

Hi Érico,
this sounds to me like the pairs of species that you specify are not actually pairs. Could this be the case? Does Treemix group either Pop1 and Pop2 as sisters or Pop3 and Pop4 as sisters?
Cheers,
Michael

@ericopolo
Copy link
Author

Hi, Michael, thanks for the quick response! First of all, I'm sorry, I actually didn't realize the order of populations in the input file would matter, so their position was completely random; I assumed simulations would alternate between the 3 possible topologies, but now I've noticed the information of the assumed topology in the output.

Indeed, most likely they are not actually pairs. Without allowing migration events, Treemix (as well as other phylogenetic methods) returns (1,3),(2,4). When allowing for migration (any amount of events) the unrooted topology becomes (1,4),(2,3) - what, by the way, agrees with mtDNA data. So the input file I gave you assumes the least likely topology, what would explain an overestimated theta.

However, changing the topology didn't have any effect. I've just tested the other two possible ones and the behavior is the same. Burn-in goes on until 28,000 or so and finally stops with that same error.

@mmatschiner
Copy link
Owner

Hi Érico,

This seems to be a tricky one. During the burnin phase in the simulations, F4 is increasing the effective population size in situations when the number of simulated SNPs variable in more than one population and the number of simulated SNPs variable on both sides of the root is smaller than the empirical number for these two measures. But even with an increased effective population size going to infinity, the empirical numbers are not reached in your case. Could it be that you filtered the dataset based on a minor allele frequency threshold or similar? It appears that the proportions of the SNPs that are variable in more than two populations and on both sides of the root of the assumed tree is unusually large.

@ericopolo
Copy link
Author

Hello again, Michael!

You're right. I had filtered out SNPs with a MAF < 0.1. Without this filtering (i.e., removing only constant sites and sites with missing data) things worked just fine, at least with the two more likely topologies. With the third one (the one I sent you), however, I got another error, that occurred just after the simulations:

Traceback (most recent call last):
  File "/home/ericolegal/bin/f4.py", line 1235, in <module>
    outlier_lines[x] = outlier_lines[x].replace("\n","") + " | " + "{0:.2f}".format(line_weight) + "\n"
ValueError: Unknown format code 'f' for object of type 'str'

The output file ends at the "interpretation" part, without saying anything about the SNPs driving the f4 statistic to be different than zero, like it does with the other two topologies. I'm sending the input again, this time without the MAF filtering. input.txt

Back to the original issue, although my MAF filtering seems to improve Treemix results (they make much more sense from the biogeographic point of view), I can see the problem it causes when simulating data. To fix that I think it would be necessary to ask users for that information (MAF filtering) and include an equivalent MAF filter at the "masking" stage, along with the inclusion of missing data, and before the burn-in. I know, of course, the filtering in itself is not necessary for the f4 test, and maybe not even useful in any sense, especially with you simulation method, and that such improvement would only make things more convenient to some users, so maybe a warning in README would suffice?

Now, maybe this is not the right place to talk about this, but I'm very puzzled about the result I got. What made me search for alternatives to the "fourpop" algorithm implemented in Treemix is that it was the only test that told me there's no introgression with a particular topology - (1,3),(2,4). As I mentioned before, Treemix finds this topology when not accounting for migration, but one migration event is enough not only to improve significantly the likelihood and lower the values in the residual covariance matrix, but also with a very strong statistical support for a high weight value. The "threepop" test tells me that pop2 is admixed whenever pop1 is included among the other two (what makes sense geographically). D3 (a statistic based on ABBA-BABA) also tells me there is introgression. Fourpop, in the other hand, is telling me that the topology I get from phylogenetic methods that don't take into account any kind of horizontal transfer (e.g. SNAPP, ASTRAL) needs no other process than genetic drift to explain the data. I wondered whether that was being caused for issues with the jackknife blocks (my data being from ddRADseq). So I came across your method, and was confident that the fastsimcoal simulations would tell me that life is good. But it is not. Your test also told me ILS is enough to explain the topology. And now I'm troubled trying to conciliate those results. Have you seen something like this? Any thoughts would be much appreciated.

Anyway, thank you so much for your help with this "overflow" issue, it was a fun one to solve, thanks to your attention and support.

Cheers,

Érico.

@mmatschiner
Copy link
Owner

Hi Érico,

I'm glad the one issue is sorted out. Regarding the second (the ValueError): I also noticed that when I tested your dataset, and I figured out that it was due to whitespace at the end of each line. I made a small change to the script so that this should now parse correctly. If you re-download F4, you should not see this issue anymore.

Regarding your suggestion to allow the user to account for MAF filtering in the simulations, I agree, this would be possible and it seems to be a good idea. I'll see if I find time to implement it, but have too much on my plate at the moment.

Finally, I agree that the set of results that you describe does not sound very consistent. I assume you did that already, but if not, what I would first look at would be the D statistic. @millanek's Dsuite program allows the quick calculation of different versions of the D-statistic as well as the F-branch statistic which might be helpful in your case.

Cheers,
Michael

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants