wholegenome.interval_list not recognized by CreateIntervalsBed process #56

Closed
klmr opened this issue Oct 29, 2019 · 9 comments
Labels: bug, enhancement

Comments

@klmr

klmr commented Oct 29, 2019

I may be overlooking something, but Sarek does not seem to document the formats or purposes of the genome input files. For most files the purpose is obvious, but for some it isn’t (at least to me), and for others it isn’t clear which file format is expected.

For instance, I had assumed that the wholegenome.interval_list Picard-formatted file from the GATK resource bundle would be valid as a genomes.intervals file, but the result is a cryptic error message, as well as a work directory full of weird BED files:

java.lang.NumberFormatException: For input string: "VN" […]

It turns out that the relevant Sarek process only supports two out of the three formats described by the GATK documentation, and notably does not support the Picard-format file, which is included in the official GATK bundle.
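For readers unfamiliar with the formats in question, here is a minimal Python sketch of how the three could be told apart by content. It assumes the usual conventions: a Picard-style file begins with SAM-like header lines starting with `@`, a GATK-style list has one `contig:start-end` entry per line, and BED is tab-separated with numeric start and end columns. The function name `guess_interval_format` and the returned labels are hypothetical and not part of Sarek or GATK, and the check is deliberately naive (for example, contig names containing a colon would be misclassified).

```python
import re

# GATK-style entries look like "chr1", "chr1:100" or "chr1:100-200".
GATK_STYLE = re.compile(r"^[^\s:]+(:\d+(-\d+)?)?$")


def guess_interval_format(lines):
    """Guess the interval format from the first non-empty line (naive sketch)."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("@"):
            # SAM-like header, as found at the top of Picard interval lists
            return "picard_interval_list"
        fields = line.split("\t")
        if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
            return "bed"
        if GATK_STYLE.match(line):
            return "gatk_list"
        return "unknown"
    return "unknown"
```

A check like this, run before any coordinate parsing, would let a pipeline report an unsupported format explicitly instead of failing later with a number-parsing error.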

@maxulysse
Member

Hi,
Thanks a lot for opening this issue.
We had not noticed this particular problem before, and it should of course be fixed.
There is some documentation about the intervals file:
https://github.com/nf-core/sarek/blob/master/docs/reference.md#intervals
But I understand if you feel that it's not enough.
I'll make sure to explain more about the expected format.
That said, this wholegenome.interval_list should have worked.
I'll look into it further.
Thanks again,
Maxime

@klmr
Author

klmr commented Oct 29, 2019

Hi @maxulysse, thanks for the prompt reply. I had indeed missed that documentation file, my bad. To clarify, the .interval_list file format is distinct from the .list format. See my GATK documentation link for a complete description; for example, the file contains entries such as the following:

@SQ     SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd     AS:20   UR:/seq/references/kendrix/v0/kendrix.fasta     SP:Homo sapiens
…
chr1    1       248956422       +       .
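To make the structure of such a file concrete, the following is a minimal Python sketch that separates the `@SQ` header lines from the interval records. It assumes tab-separated fields, `TAG:value` pairs in the header, and five columns per record (contig, start, end, strand, name). The names `Interval` and `parse_interval_list` are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str
    name: str


def parse_interval_list(lines):
    """Return (contig_lengths, intervals) from Picard-style interval list lines."""
    contig_lengths = {}
    intervals = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            fields = line.split("\t")
            if fields[0] == "@SQ":
                # Header fields are TAG:value pairs; split on the first colon only,
                # because values such as UR:/path/... may contain further colons.
                tags = dict(field.split(":", 1) for field in fields[1:])
                contig_lengths[tags["SN"]] = int(tags["LN"])
            continue  # other header lines (@HD, ...) are ignored in this sketch
        contig, start, end, strand, name = line.split("\t")
        intervals.append(Interval(contig, int(start), int(end), strand, name))
    return contig_lengths, intervals
```

Note that the header lines are what a parser expecting plain coordinates will stumble over, since their fields are not numbers.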

@maxulysse
Member

Thanks a lot for the link @klmr, I'll look more into this format to make sure that it works with Sarek.

@maxulysse maxulysse self-assigned this Oct 29, 2019
@maxulysse maxulysse added the bug and enhancement labels Oct 29, 2019
@maxulysse
Member

By the way, which genome are you working with?

@klmr
Author

klmr commented Oct 29, 2019

I’m using GRCh38, and the file in question can be found in the GATK resource bundle on GCP (requires login) or as a download via HTTP.

@maxulysse
Member

I think that's the file we used to create the intervals file that we host on AWS iGenomes.

@maxulysse
Member

I think I have a solution.
Any chance you can try out https://github.com/MaxUlysse/sarek/tree/Intervals?
It adds this small snippet to the CreateIntervalsBed process:

    else if (hasExtension(intervals, "interval_list"))
        """
        cat ${intervals} | grep -v "@" > intervals.temp
        awk -vFS="\t" '{
          name = sprintf("%s_%d-%d", \$1, \$2, \$3);
          printf("%s\\t%d\\t%d\\n", \$1, \$2-1, \$3) > name ".bed"
        }' intervals.temp
        """

@klmr
Author

klmr commented Oct 30, 2019

Yes, that works, thanks! I’ve added a review comment on your changeset.

@maxulysse maxulysse changed the title from "Undocumented interval list format" to "wholegenome.interval_list not recognized by CreateIntervalsBed process" Oct 30, 2019
@maxulysse
Member

PR has been created ;-)
Thanks for your help
