Error with ReLERNN_SIMULATE #7

josieparis · 2019-12-04T18:02:49Z

Hi!

Really excited about using ReLERNN to estimate recombination in some natural data with a low-ish sample size (n = 22), and also to have a go on some poolseq data too.

Just tried to run it on my natural data, and I get the following error message when it reads the HDF5 files:

Reading HDF5: "ReLERNN/splitVCFs/paria_marianne_1027798.final_chr1:0-34343053.hdf5"...
Process Process-2:
Error: chromosomes have different numbers of samples
Traceback (most recent call last):
  File "/gpfs/ts0/home/jrp228/.local/bin/ReLERNN_SIMULATE", line 4, in <module>
    __import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 750, in run_script
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 1527, in run_script
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 219, in <module>
Traceback (most recent call last):
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 199, in worker_countSites
    if md_mask.any():
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/allel/abc.py", line 43, in __getattr__
    return getattr(self.values, item)
AttributeError: 'Dataset' object has no attribute 'any'
    main()
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 108, in main
    wins, nSamps, maxS, maxLen = vcf_manager.countSites(nProc=nProc)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 170, in countSites
    return sorted_wins, nSamps[0], maxS, maxLen
IndexError: list index out of range

My vcf file is pretty standard, although there is some missing data, and I'm running ReLERNN like this:
ReLERNN_SIMULATE -v paria_marianne_1027798.final.vcf -g STAR.extents.bed -m STAR.chromosomes.release.repeats.bed -d ReLERNN/ -u 4.8e-8 --unphased

I checked the vcf files generated in the first step of the script, and they all have the same number of samples:
for i in *vcf; do bcftools query -l $i | wc -l; done | sort | uniq
22

jradrion · 2019-12-04T19:04:06Z

Hi there!

Sorry that you found a potential bug!

Can you check to see if a file named /networks/windowSizes.txt was created? The second column should list the number of haploid samples that ReLERNN thinks are present for each chromosome. If /networks/windowSizes.txt was not created could you send me say the first 1000 lines of your VCF? If it was created can you send me the first thousand or so lines from the VCF corresponding to one of the chromosomes that doesn't have 22 in the second column?
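The check described above can be scripted. The following is a minimal sketch, not part of ReLERNN: it assumes windowSizes.txt is whitespace-delimited with the chromosome name in the first column (an assumption; the thread only states that the second column holds the haploid sample count), and the function name is hypothetical.

```python
from collections import Counter


def find_inconsistent_sample_counts(lines):
    """Return (most_common_count, {chrom: count}) for rows that deviate from it.

    Assumes each row looks like "<chrom> <n_haploid_samples> ...".
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated rows
        try:
            counts[fields[0]] = int(fields[1])
        except ValueError:
            continue  # skip rows whose second column is not an integer
    if not counts:
        # An empty file (as reported later in this thread) ends up here.
        return None, {}
    expected, _ = Counter(counts.values()).most_common(1)[0]
    outliers = {chrom: n for chrom, n in counts.items() if n != expected}
    return expected, outliers
```

It could be called as `find_inconsistent_sample_counts(open("ReLERNN/networks/windowSizes.txt"))`, where the path is a guess based on the `-d ReLERNN/` project directory used in the command above.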

josieparis · 2019-12-05T11:59:09Z

Hi Jeffrey,

Thanks for getting back to me so quick! windowSizes.txt is generated, but it's empty.

Here's the first 1000 lines of my VCF file
vcf_1000.vcf.gz

jradrion · 2019-12-05T18:24:37Z

Hmm, the vcf you sent me was missing the header. Is this the first ~1000 lines from the file you tried to run ReLERNN on? Higher up in your original error message do you see RuntimeError: VCF file is missing mandatory header line ("#CHROM...")? If so I think that might be the problem.
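A quick way to confirm whether the mandatory "#CHROM" header line is present before running ReLERNN is sketched below. It uses only the standard library and does not call scikit-allel; the helper names are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def find_chrom_header(lines):
    """Return the sample names from the #CHROM line, or None if it is missing.

    The search stops at the first data record, since the header must come
    before any records.
    """
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 0-8 are the fixed VCF fields plus FORMAT; samples follow.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            return None  # reached a record without seeing the header
    return None
```

For example, `find_chrom_header(open_vcf("vcf_1000.vcf.gz"))` would return `None` for a file whose header was stripped, and the list of sample names otherwise.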

josieparis · 2019-12-09T11:01:28Z

Ah my bad, sorry I removed the header. Here's the first 1000 lines with the header!
vcf_1000_header.vcf.gz

jradrion · 2019-12-09T20:40:28Z

No problem! Thanks for sending this, I'll post a resolution ASAP.

jradrion · 2019-12-10T01:08:17Z

OK, I think I fixed the problem. The vcf you sent me is now running without issue on my end. Can you try pulling the changes/reinstalling and then give it another go? Please let me know if this doesn't resolve your issues.

josieparis · 2019-12-10T16:46:11Z

Hi Jeffrey,

Great! I have a windowSizes.txt file now!

Although some of my chromosomes now have a sample size of 44, and two chromosomes are 22 ...

Traceback (most recent call last):
  File "/gpfs/ts0/home/jrp228/.local/bin/ReLERNN_SIMULATE", line 4, in <module>
    __import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 750, in run_script
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 1527, in run_script
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 219, in <module>
    main()
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 128, in main
    md_mask = np.concatenate(md_mask)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 44 and the array at index 8 has size 22

Thanks again for your help with this! Let me know if you would like any more info/files etc

jradrion · 2019-12-10T18:02:37Z

Hmm... Well n=44 is the correct number of haploid chromosomes (22 diploid samples), at least in the file you sent me. The two contigs showing n=22 is what I will need to look at. I'm assuming these are hemizygous sex chromosomes? Can you check to see how these samples are encoded? Looking at the VCFv4.2 specification, I'm not seeing any standard for encoding hemizygous chromosomes. I believe ReLERNN should be able to handle them as if they were autosomes with missing data if they are encoded (1/.), but I will need to double check this. Currently, the best thing to do might be to remove them from the original VCF file and run them separately.

Could you send me some lines from at least one of the chromosomes that is reporting n=22 in windowSizes.txt? Thanks again, and sorry for the hassle!
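One way to check how genotypes are encoded on each chromosome is to tally the ploidy of every GT call. The sketch below is an illustration only: it parses the VCF text directly rather than using scikit-allel, the function name is hypothetical, and it makes no claim about how ReLERNN itself derives the sample count.

```python
import re
from collections import Counter, defaultdict


def ploidy_by_chromosome(lines):
    """Return {chrom: {ploidy: number_of_genotype_calls}} from VCF lines."""
    tallies = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        chrom = fields[0]
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            # "0/1" and "0|1" have two alleles; "1" has one.
            # Note: a bare "." (fully missing call) is counted as ploidy 1.
            tallies[chrom][len(re.split(r"[/|]", gt))] += 1
    return {chrom: dict(counter) for chrom, counter in tallies.items()}
```

A chromosome whose calls are all tallied under ploidy 1 while the others fall under ploidy 2 would be a candidate for the n=22 contigs discussed above.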

jradrion · 2019-12-10T18:35:12Z

Just got your files! I'll post a resolution as soon as I have one. Thanks!

jradrion · 2019-12-12T19:47:50Z

OK, I was able to run ReLERNN successfully on my end with the updated files I sent you. Please let me know if you are still getting errors. Thanks for your patience!

josieparis · 2019-12-23T21:32:18Z

I can confirm that this issue's now fixed! Thanks @jradrion

gavinmonahan · 2020-07-06T07:53:55Z

Hi @jradrion
I read through the above and I believe I'm having the same problem. I created my vcf files from csv files using a custom script and I could have made an error in doing so, but after a few weeks of troubleshooting I haven't got anywhere. I was able to run the test example with no issues.
Please see the attached .vcf and stdout. The windowSizes.txt file is empty.
I'm hoping I can just apply the same fix as above.

I appreciate your help and many thanks in advance
BJchB2.zip

jradrion · 2020-07-06T17:39:14Z

Hi @gavinmonahan, thank you for bringing this to my attention. I'll take a look and get back to you ASAP.

jradrion · 2020-07-06T21:25:46Z

@gavinmonahan, I just took a quick look at your VCF and it looks like there are a total of five scaffolds (B01-B05) and a combined total of 618 polymorphic sites between them? If this is correct I don't think ReLERNN is going to be of too much help. Are you sure the file you sent is correct? For example, it looks like scaffold B04 has a total of 62 variants with the max coordinate suggesting a scaffold length of at least 41Mb?

gavinmonahan · 2020-07-07T02:16:27Z

@jradrion thanks for having a look. Yes that's correct. Is that too few polymorphic sites for ReLERNN? There were originally 1354 sites, however I removed sites where there were over 50% no calls. Would any of that explain the error I got when trying to run ReLERNN?

jradrion · 2020-07-07T18:12:40Z

@gavinmonahan It's not strictly too few polymorphic sites, but we suggest tempering any conclusions based on predictions from genomic windows with fewer than 200 sites, which in this case would be the entirety of your genome.

I think the error you were getting is based on a real bug, though I suspect the bug is directly related to trying to run ReLERNN on so few sites. I'm trying to pin this down now.
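To see how a dataset compares with the 200-site guideline mentioned above, the sites on each chromosome can be counted directly. This is a rough sketch with hypothetical function names: ReLERNN chooses its own window sizes, so per-chromosome totals are only an upper bound on the sites in any single window, but a chromosome below the threshold cannot contain a window above it.

```python
from collections import Counter

MIN_SITES = 200  # guideline quoted in this thread, applied per chromosome here


def sites_per_chromosome(lines):
    """Count VCF records per chromosome (first tab-separated column)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def sparse_chromosomes(counts, min_sites=MIN_SITES):
    """Return the sorted names of chromosomes with fewer than min_sites records."""
    return sorted(chrom for chrom, n in counts.items() if n < min_sites)
```

For the file discussed here, scaffold B04 with its 62 variants would be reported by `sparse_chromosomes`.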

jradrion · 2020-07-08T17:16:12Z

@gavinmonahan, The errors are, at least in part, due to how you created your VCF. I've noticed two problems, although there may be more, that are causing errors when scikit-allel attempts to parse your file. We use scikit-allel for all parsing of the VCF so you'll have to make sure your file conforms to their requirements.

In the QUAL field you use "MISSING" instead of a "." to indicate a missing value
The values for your coordinates are not monotonically increasing

I'm going to close this issue for now, but I'm happy to reopen it if you are still getting errors using a VCF that can be parsed with scikit-allel.
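The two problems named above can be screened for before handing a VCF to ReLERNN. The sketch below is a plain-text check with a hypothetical function name; it covers only those two problems (a non-numeric QUAL other than "." and positions that decrease within a chromosome) and is not a substitute for parsing the file with scikit-allel.

```python
def check_vcf_records(lines):
    """Return a list of (line_number, message) for the two problems above."""
    problems = []
    last_pos = {}  # most recent POS seen on each chromosome
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            problems.append((lineno, "record has fewer than 6 columns"))
            continue
        chrom, pos_text, qual = fields[0], fields[1], fields[5]
        # QUAL must be a number, or "." when the value is missing.
        if qual != ".":
            try:
                float(qual)
            except ValueError:
                problems.append((lineno, "QUAL is %r; expected a number or '.'" % qual))
        try:
            pos = int(pos_text)
        except ValueError:
            problems.append((lineno, "POS is %r; expected an integer" % pos_text))
            continue
        # Positions must not decrease within a chromosome.
        if chrom in last_pos and pos < last_pos[chrom]:
            problems.append(
                (lineno, "POS %d on %s is smaller than the previous %d" % (pos, chrom, last_pos[chrom]))
            )
        last_pos[chrom] = pos
    return problems
```

An empty result means neither problem was found; each reported tuple gives the 1-based line number in the file so the offending record can be located.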

gavinmonahan · 2020-07-09T08:14:48Z

@jradrion I'm glad it's just a problem with my VCF! Thanks again for looking into this and finding the errors. I will update my VCFs and hopefully that will solve the issue.

jradrion · 2020-07-09T20:32:48Z

No problem, @gavinmonahan. Hopefully this will let the program run without error. Unfortunately, it won't change the fact that predictions on genomic windows with such extremely low SNP density are going to be unreliable.

dylandebaun · 2023-04-27T22:22:32Z

Hello, I think I may be having a similar issue. I’ve removed all the hemizygous/haploid chromosomes from my VCF, and my windowSizes file only has chromosomes with a sample size of 6.

However, I am getting the following error:
Reading HDF5 mask: /home/ddebaun/mendel-nas1/redo_recombination/splitVCFs/Leioheterodon_madagascarensis_B_biallelic_7204_RagTag:0-11000_md_mask.hdf5...
Traceback (most recent call last):
  File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 245, in <module>
    main()
  File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 152, in main
    md_mask = np.concatenate(md_mask)
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 6 and the array at index 6 has size 3

I’m including the information I used for this run (first 50k lines of vcf). Is this also an issue with the vcf for scikit-allel? Any help would be appreciated!
filestorun.zip

josieparis closed this as completed Dec 23, 2019

jradrion reopened this Jul 6, 2020

jradrion closed this as completed Jul 8, 2020

dylandebaun mentioned this issue May 5, 2023

RELERNN SIMULATE issue with vcf? #37

Closed
