Allow tabs in fasta header when creating decoys for salmon index #878

Closed
paoloAngelino opened this issue Sep 30, 2022 · 4 comments · Fixed by nf-core/modules#3043
Labels
bug Something isn't working

@paoloAngelino

Description of the bug

A fasta header can contain comments together with the name of the contig. Example:
>HLA-DRB1*16:02:01 HLA00878

The corresponding line in the decoys.txt file to be passed to salmon index would be:
HLA-DRB1*16:02:01 HLA00878

The problem is that salmon interprets the comment as an extra decoy while creating the index and stops with an error. In my case:

[2022-09-05 14:58:36.028] [puff::index::jointLog] [critical] The decoy file contained the names of 3892 decoy sequences, but 3367 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2022-09-05 14:58:36.424] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1
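
To see which decoy entries fail to match, the decoy list can be compared against the sequence names in the FASTA. This is only a sketch: `unmatched_decoys` is a hypothetical helper, and it assumes that the sequence name is the header text up to the first whitespace character (space or tab).

```shell
# Hypothetical helper: print decoy entries that match no sequence name.
# Assumption: a sequence name is the header text up to the first whitespace.
unmatched_decoys() {
    # $1: FASTA file, $2: decoy list (one entry per line)
    awk '
        NR == FNR {
            # First file: collect the name from every header line.
            if (/^>/) { sub(/^>/, ""); names[$1] = 1 }
            next
        }
        # Second file: print whole lines that are not a known name.
        !($0 in names)
    ' "$1" "$2"
}
```

Running `unmatched_decoys hg38.fa decoys.txt` would then list the entries that still carry a comment.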

An additional cleaning step in rnaseq/modules/nf-core/modules/salmon/index/main.nf would fix the issue. What I propose is to replace line 31:
sed -i.bak -e 's/>//g' decoys.txt
with

mv decoys.txt decoys.txt.bak
awk '{print $1}' decoys.txt.bak > decoys.txt
sed -i -e 's/>//g' decoys.txt
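
The same cleanup could also be done in a single awk pass. A minimal sketch, assuming the input holds the raw header lines; `clean_decoy_names` is a hypothetical name. awk's default field splitting treats both spaces and tabs as separators, so `$1` is the contig name in either case.

```shell
# Hypothetical helper: read header lines on stdin, write bare contig names.
clean_decoy_names() {
    # Strip the leading '>' and keep only the first whitespace-delimited field.
    awk '{ sub(/^>/, ""); print $1 }'
}
```

Usage would be `clean_decoy_names < decoys.txt.bak > decoys.txt`. Unlike `sed 's/>//g'`, this removes only the leading `>` rather than every `>` on the line.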

Command used and terminal output

No response

Relevant files

No response

System information

No response

@paoloAngelino paoloAngelino added the bug Something isn't working label Sep 30, 2022
@paoloAngelino paoloAngelino changed the title from "salmon fails to create index if referece fasta file contains comments in the header" to "salmon fails to create index if reference fasta file contains comments in the header" Sep 30, 2022
@drpatelh drpatelh added this to the 3.10 milestone Dec 12, 2022
@drpatelh
Member

drpatelh commented Dec 19, 2022

Hi @paoloAngelino ! Thanks for reporting.

I am unable to reproduce this if I run:

$ echo ">HLA-DRB1*16:02:01      HLA00878" | grep '^>' | cut -d ' ' -f 1
>HLA-DRB1*16:02:01

This is essentially what the module does: it splits each header on spaces and keeps only the first field. As you can see above, this prints the contig name without the space-delimited comment.

@drpatelh
Member

Closing this for now but feel free to re-open if the issue persists or if you are able to identify another reason for this failure.

@paoloAngelino
Author

@drpatelh Thanks for your answer.
I've actually found that in the reference that I'm using, the comments in the contig header are separated by a TAB.

grep '^>' hg38.fa  | cat -T
>HLA-DRB1*15:03:01:01^IHLA00870 11567 bp
>HLA-DRB1*15:03:01:02^IHLA03454 11569 bp
>HLA-DRB1*16:02:01^IHLA00878 11005 bp

This is the reason why, in my case, cut -d ' ' -f 1 doesn't split the header.
I don't know whether a TAB separator is a special case or whether it occurs in other fasta references, but treating TAB as a separator as well might be more general. This would work for me:
grep '^>' hg38.fa | cut -d ' ' -f 1 | cut -d $'\t' -f 1
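
A slightly more portable variant of the same idea, as a sketch: `$'\t'` is not understood by every shell, but `cut` already uses TAB as its default delimiter, so the second `cut` needs no `-d` at all. `header_names` is a hypothetical helper name.

```shell
# Hypothetical helper: print the name part of every FASTA header in a file.
header_names() {
    # Split on spaces first, then on tabs (cut's default delimiter).
    grep '^>' "$1" | cut -d ' ' -f 1 | cut -f 1
}
```

For example, `header_names hg38.fa` should print `>HLA-DRB1*16:02:01` for the third header shown above.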

@drpatelh
Member

Cool. Will re-open and fix in the next release. In the meantime, you will have to clean the fasta before passing it to the pipeline.
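
One possible way to do that cleaning, as a sketch rather than an official workaround: rewrite each header so that only the first whitespace-delimited token remains, and leave sequence lines untouched. `clean_fasta_headers` and the output file name below are hypothetical.

```shell
# Hypothetical helper: drop everything after the contig name in each header.
# Assumption: there is no whitespace between '>' and the contig name.
clean_fasta_headers() {
    awk '
        /^>/ { print $1; next }   # header: keep ">name" only
        { print }                 # sequence line: pass through unchanged
    ' "$1"
}
```

Usage would be `clean_fasta_headers hg38.fa > hg38.clean.fa`. Note that this discards the header comments, so keep the original file if they are needed elsewhere.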

@drpatelh drpatelh reopened this Dec 22, 2022
@drpatelh drpatelh modified the milestones: 3.10, 3.11 Dec 22, 2022
@drpatelh drpatelh modified the milestones: 3.12, 3.11 Mar 5, 2023
@drpatelh drpatelh changed the title from "salmon fails to create index if reference fasta file contains comments in the header" to "Allow tabs in fasta header when creating decoys for salmon index" Mar 16, 2023
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Mar 16, 2023
drpatelh added a commit that referenced this issue Mar 16, 2023