Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with ReLERNN_SIMULATE #7

Closed
josieparis opened this issue Dec 4, 2019 · 20 comments
Closed

Error with ReLERNN_SIMULATE #7

josieparis opened this issue Dec 4, 2019 · 20 comments

Comments

@josieparis
Copy link

josieparis commented Dec 4, 2019

Hi!

Really excited about using ReLERNN to estimate recombination in some natural data with a low-ish sample size (n22) and also to have a go on some poolseq data too

Just tried to run on my natural data, and I get the following error message when reading the hd5f files

Reading HDF5: "ReLERNN/splitVCFs/paria_marianne_1027798.final_chr1:0-34343053.hdf5"...
Process Process-2:
Error: chromosomes have different numbers of samples
Traceback (most recent call last):
  File "/gpfs/ts0/home/jrp228/.local/bin/ReLERNN_SIMULATE", line 4, in <module>
    __import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 750, in run_script
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 1527, in run_script
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 219, in <module>
Traceback (most recent call last):
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 199, in worker_countSites
    if md_mask.any():
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/allel/abc.py", line 43, in __getattr__
    return getattr(self.values, item)
AttributeError: 'Dataset' object has no attribute 'any'
    main()
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 108, in main
    wins, nSamps, maxS, maxLen = vcf_manager.countSites(nProc=nProc)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 170, in countSites
    return sorted_wins, nSamps[0], maxS, maxLen
IndexError: list index out of range


My vcf file is pretty standard, although there is some missing data, and I'm running ReLERNN like this:
ReLERNN_SIMULATE -v paria_marianne_1027798.final.vcf -g STAR.extents.bed -m STAR.chromosomes.release.repeats.bed -d ReLERNN/ -u 4.8e-8 --unphased

I checked the vcf files generated in the first step of the script, and they all have the same number of samples:
for i in *vcf; do bcftools query -l $i | wc -l; done | sort | uniq
22

@jradrion
Copy link
Contributor

jradrion commented Dec 4, 2019

Hi there!

Sorry that you found a potential bug!

Can you check to see if a file named /networks/windowSizes.txt was created? The second column should list the number of haploid samples that ReLERNN thinks are present for each chromosome. If /networks/windowSizes.txt was not created could you send me say the first 1000 lines of your VCF? If it was created can you send me the first thousand or so lines from the VCF corresponding to one of the chromosomes that doesn't have 22 in the second column?

@josieparis
Copy link
Author

Hi Jeffrey,

Thanks for getting back to me so quick! windowSizes.txt is generated, but it's empty.

Here's the first 1000 lines of my VCF file
vcf_1000.vcf.gz

@jradrion
Copy link
Contributor

jradrion commented Dec 5, 2019

Hmm, the vcf you sent me was missing the header. Is this the first ~1000 lines from the file you tried to run ReLERNN on? Higher up in your original error message do you see RuntimeError: VCF file is missing mandatory header line ("#CHROM...")? If so I think that might be the problem.

@josieparis
Copy link
Author

Ah my bad, sorry I removed the header. Here's the first 1000 lines with the header!
vcf_1000_header.vcf.gz

@jradrion
Copy link
Contributor

jradrion commented Dec 9, 2019

No problem! Thanks for sending this, I'll post a resolution ASAP.

@jradrion
Copy link
Contributor

OK, I think I fixed the problem. The vcf you sent me is now running without issue on my end. Can you try pulling the changes/reinstalling and then give it another go? Please let me know if this doesn't resolve your issues.

@josieparis
Copy link
Author

Hi Jeffrey,

Great! I have a windows.sizes file now!

Although some of my chromosomes now have a sample size of 44, and two chromosomes are 22 ...

Traceback (most recent call last): File "/gpfs/ts0/home/jrp228/.local/bin/ReLERNN_SIMULATE", line 4, in <module> __import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE') File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 750, in run_script File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 1527, in run_script File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 219, in <module> main() File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 128, in main md_mask = np.concatenate(md_mask) File "<__array_function__ internals>", line 6, in concatenate ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 44 and the array at index 8 has size 22

Thanks again for your help with this! Let me know if you would like any more info/files etc

@jradrion
Copy link
Contributor

jradrion commented Dec 10, 2019

Hmm... Well n=44 is the correct number of haploid chromosomes (22 diploid samples), at least in the file you sent me. The two contigs showing n=22 is what I will need to look at. I'm assuming these are hemizygous sex chromosomes? Can you check to see how these samples are encoded? Looking at the VCFv4.2 specification, I'm not seeing any standard for encoding hemizygous chromosomes. I believe ReLERNN should be able to handle them as if they were autosomes with missing data if they are encoded (1/.), but I will need to double check this. Currently, the best thing to do might be to remove them from the original VCF file and run them separately.

Could you send me some lines from at least one of the chromosomes that is reporting n=22 in windowSizes.txt? Thanks again, and sorry for the hassle!

@jradrion
Copy link
Contributor

Just got your files! I'll post a resolution as soon as I have one. Thanks!

@jradrion
Copy link
Contributor

OK, I was able to run ReLERNN successfully on my end with the updated files I sent you. Please let me know if you are still getting errors. Thanks for your patience!

@josieparis
Copy link
Author

I can confirm that this issue's now fixed! Thanks @jradrion

@gavinmonahan
Copy link

Hi @jradrion
I read through the above and I believe I'm having the same problem. I created my vcf files from csv files using a custom script and I could have made an error in doing so, but after a few weeks of troubleshooting I haven't got anywhere. I was able to run the test example with no issues.
Please see the attached .vcf and stdout. The windowSizes.txt file is empty.
I'm hoping I can just apply the same fix as above.

I appreciate your help and many thanks in advance
BJchB2.zip

@jradrion
Copy link
Contributor

jradrion commented Jul 6, 2020

Hi @gavinmonahan, thank you for bringing this to my attention. I'll take a look and get back to you ASAP.

@jradrion jradrion reopened this Jul 6, 2020
@jradrion
Copy link
Contributor

jradrion commented Jul 6, 2020

@gavinmonahan, I just took a quick look at your VCF and it looks like there are a total of five scaffolds (B01-B05) and a combined total of 618 polymorphic sites between them? If this is correct I don't think ReLERNN is going to be of too much help. Are you sure the file you sent is correct? For example, it looks like scaffold B04 has a total of 62 variants with the max coordinate suggesting a scaffold length of at least 41Mb?

@gavinmonahan
Copy link

@jradrion thanks for having a look. Yes that's correct. Is that too few polymorphic sites for ReLERNN? There were originally 1354 sites, however I removed sites where there were over 50% no calls. Would any of that explain the error I got when trying to run ReLERNN?

@jradrion
Copy link
Contributor

jradrion commented Jul 7, 2020

@gavinmonahan It's not strictly too few polymorphic sites, but we suggest tempering any conclusions based on predictions from genomic windows with fewer than 200 sites, which in this case would be the entirety of your genome.

I think the error you were getting is based on a real bug, though I suspect the bug is directly related to trying to run ReLERNN on so few sites. I'm trying to pin this down now.

@jradrion
Copy link
Contributor

jradrion commented Jul 8, 2020

@gavinmonahan, The errors are, at least in part, due to how you created your VCF. I've noticed two problems, although there may be more, that are causing errors when scikit-allel attempts to parse your file. We use scikit-allel for all parsing of the VCF so you'll have to make sure your file conforms to their requirements.

  1. In the QUAL field you use "MISSING" instead of a "." to indicate a missing value
  2. The values for your coordinates are not monotonically increasing

I'm going to close this issue for now, but I'm happy to reopen it if you are still getting errors using a VCF that can be parsed with scikit-allel.

@jradrion jradrion closed this as completed Jul 8, 2020
@gavinmonahan
Copy link

@jradrion I'm glad it's just a problem with my VCF! Thanks again for looking into this and finding the errors. I will update my VCFs and hopefully that will solve the issue.

@jradrion
Copy link
Contributor

jradrion commented Jul 9, 2020

No problem, @gavinmonahan. Hopefully this will resolve the issue with the program running without error. Unfortunately it won't change that fact that predictions on genomic windows with such extremely low SNP density are going to be unreliable.

@dylandebaun
Copy link

Hello I think I may be having a similar issue. I’ve removed all the hemizygous/haploid chromosomes from my vcf and my windowSizes file only has chromosomes with sample size 6

However, I am getting the following error:
Reading HDF5 mask: /home/ddebaun/mendel-nas1/redo_recombination/splitVCFs/Leioheterodon_madagascarensis_B_biallelic_7204_RagTag:0-11000_md_mask.hdf5...
Traceback (most recent call last):
File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 245, in
main()
File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 152, in main
md_mask = np.concatenate(md_mask)
File "<array_function internals>", line 180, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 6 and the array at index 6 has size 3

I’m including the information I used for this run (first 50k lines of vcf). Is this also an issue with the vcf for scikit-allel? Any help would be appreciated!
filestorun.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants