Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-process transcripts fasta when using --gencode #864

Closed
andreas-wilm opened this issue Sep 1, 2022 · 6 comments
Closed

Pre-process transcripts fasta when using --gencode #864

andreas-wilm opened this issue Sep 1, 2022 · 6 comments
Labels
bug Something isn't working
Milestone

Comments

@andreas-wilm
Copy link

Description of the bug

Hi all,

I've downloaded references from Gencode (i.e. genome, transcript, annotation) and used nf-core/rnaseq with --aligner star_salmon --pseudo_aligner salmon --gencode. Unfortunately the STAR quantification with salmon fails, because of the GENCODE typical pipe separation of transcript names. See below for the error. The --gencode flag is and can only be used for the salmon index step, if I understand correctly, but it doesn't help with salmon quant (see also here) on the STAR aligned BAM file.

Command used and terminal output

$ nextflow run nf-core/rnaseq -revision 3.8.1 -profile docker  --aligner star_salmon --pseudo_aligner salmon  --gencode --save_reference --gtf /data/references/gencode.v41.annotation.gtf.gz  --fasta /data/references/GRCh38.primary_assembly.genome.fa.gz --transcript_fasta /data/references/gencode.v41.transcripts.fa.gz --input $(pwd)/samplesheet.csv  --outdir /data/processed/ -w /data/scratch -bg

Salmon quant fails with


2022-08-31 13:24:02.095] [jointLog] [critical] Transcript ENST00000555544.1 appeared in the BAM header, but was not in the provided FASTA file
...
[2022-08-31 13:24:02.095] [jointLog] [critical] Please provide a reference FASTA file that includes all targets present in the BAM header

This transcript and all other are part of the transcripts.fa, but in the Gencode typical format:

$ zgrep ENST00000554281.1 /data/references/gencode.v41.transcripts.fa.gz
>ENST00000554281.1|ENSG00000185100.11|OTTHUMG00000170818.3|OTTHUMT00000410534.1|ADSS1-205|ADSS1|567|retained_intron|


### Relevant files

_No response_

### System information

- version 22.04.5 build 5708
-  cloud instance
- executor: local
- container engine: docker
- Amazon Linux 2
- nf-core/rnaseq revision 3.8.1
@andreas-wilm andreas-wilm added the bug Something isn't working label Sep 1, 2022
@drpatelh
Copy link
Member

drpatelh commented Sep 2, 2022

Hi @andreas-wilm ! Thanks for reporting. I have personally never used a GENCODE reference. Do you have an idea as to what we should change in the pipeline to deal with this error?

We should only be using the --gencode flag in the indexing step and not in the quantifcation step at the moment:

withName: 'SALMON_INDEX' {
ext.args = params.gencode ? '--gencode' : ''

@tamuanand
Copy link

Hi @andreas-wilm

Did you try posting this on salmon GitHub - https://github.com/COMBINE-lab/salmon/issues/

Probably @rob-p might have an answer?

@tamuanand
Copy link

@drpatelh

Probably what @rob-p mentions in the issue linked by @andreas-wilm could be tried?

Do you have an idea as to what we should change in the pipeline to deal with this error?

However, if your BAM file already
contains the stripped transcript names 
(i.e. if the BAM file header has the names
 without everything past the initial |),
 then I believe you can use the 
following command to have 
salmon do the same to the fasta file
on the fly, so that the names match.

salmon-1.5.1_linux_x86_64/bin/salmon \
quant --ont -p 4 -t <(awk '{ if ($0 ~ "^>") { split($0,a,"|"); print a[1] } else { print $0 } }'  \
Genome_files/gencode.vM24.transcripts.fa) -l U -a Documents/Day2_03_DRS_pass.bam \
-o Documents/counts/Day2_03_DRS_pass

@andreas-wilm
Copy link
Author

Hi @drpatelh and @tamuanand, thanks for looking into this. Removing everything after the pipe from the gencode.v41.transcripts.fa file (with a simple cut -f 1 -d '|') worked. The awk command above does something similar on the fly using a fifo. I guess this might work as a drop-in replacement when --gencode was provided.

Please note: I did this after the run failed, on the staged files and restarted with -resume, i.e. I haven't tested yet, whether doing this from the start fixes the problem, but it should.

@Pancreas-Pratik
Copy link

Hey I can confirm the above done by @andreas-wilm works when doing the cut -f 1 -d '|' to the gencode.transcripts.fa from the start of the pipeline, rather than after -resume. The pipeline began and completed successfully, (without any errors or warnings). I used the mouse M30 from Gencode here.

As a side note, which was important for me, I was also able to import the salmon outputs using tximeta (https://github.com/mikelove/tximeta) successfully simply using se <- tximeta(coldata). I didn't have to use makeLinkedTxome(), at all. tximeta was able to match/identify the "index_seq_hash": in meta_info.json from the salmon output of nf-core/rnaseq to the ones in it's pre-made hastable. This was a success here.

The issue I am having now which is probably for https://github.com/Bioconductor/GenomicFeatures and/or https://github.com/mikelove/tximeta is after inputting se <- tximeta(coldata), which is pretty much this warning here: https://support.bioconductor.org/p/127869/ I will try to figure it out. I think I just have to specify somehow what columns to read from in the Gencode GTF (through I think, maybe, GenomicFeatures::makeTxDbFromGFF()) that, I think, tximeta uses.

Regarding the issue of the OP, though, below is what I did which worked (Now, I guess, what to change, and where, in nf-core/rnaseq is the question?):

Hi @drpatelh and @tamuanand, thanks for looking into this. Removing everything after the pipe from the gencode.v41.transcripts.fa file (with a simple cut -f 1 -d '|') worked. The awk command above does something similar on the fly using a fifo. I guess this might work as a drop-in replacement when --gencode was provided.

Please note: I did this after the run failed, on the staged files and restarted with -resume, i.e. I haven't tested yet, whether doing this from the start fixes the problem, but it should.

The following worked (confirming @andreas-wilm above #864 (comment)):

Print working directory:

bash-4.2$ pwd
/data/references

List files in current working directory:

bash-4.2$ ls
gencode.vM30.annotation.gtf.gz	gencode.vM30.transcripts.fa.gz	GRCm39.primary_assembly.genome.fa.gz

What do the transcript names look like in gencode.vM30.transcripts.fa.gz:

bash-4.2$ zcat gencode.vM30.transcripts.fa.gz | grep ENS | head
>ENSMUST00000193812.2|ENSMUSG00000102693.2|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|
>ENSMUST00000082908.3|ENSMUSG00000064842.3|-|-|Gm26206-201|Gm26206|110|snRNA|
>ENSMUST00000162897.2|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000086625.1|Xkr4-203|Xkr4|4153|processed_transcript|
>ENSMUST00000159265.2|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000086624.1|Xkr4-202|Xkr4|2989|processed_transcript|
>ENSMUST00000070533.5|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-201|Xkr4|3634|protein_coding|
>ENSMUST00000192857.2|ENSMUSG00000102851.2|OTTMUSG00000049958.1|OTTMUST00000127143.1|Gm18956-201|Gm18956|480|processed_pseudogene|
>ENSMUST00000195335.2|ENSMUSG00000103377.2|OTTMUSG00000049960.1|OTTMUST00000127145.1|Gm37180-201|Gm37180|2819|TEC|
>ENSMUST00000192336.2|ENSMUSG00000104017.2|OTTMUSG00000049961.1|OTTMUST00000127146.1|Gm37363-201|Gm37363|2233|TEC|
>ENSMUST00000194099.2|ENSMUSG00000103025.2|OTTMUSG00000049930.1|OTTMUST00000127101.1|Gm37686-201|Gm37686|2309|TEC|
>ENSMUST00000161581.2|ENSMUSG00000089699.2|OTTMUSG00000026352.1|OTTMUST00000065165.1|Gm1992-201|Gm1992|250|lncRNA|

What would they look like without everything after the first |

bash-4.2$ zcat gencode.vM30.transcripts.fa.gz | cut -d "|" -f1 | grep ENS | head
>ENSMUST00000193812.2
>ENSMUST00000082908.3
>ENSMUST00000162897.2
>ENSMUST00000159265.2
>ENSMUST00000070533.5
>ENSMUST00000192857.2
>ENSMUST00000195335.2
>ENSMUST00000192336.2
>ENSMUST00000194099.2
>ENSMUST00000161581.2

Above looks okay.

Let's do it and save the output to a file:

bash-4.2$ zcat gencode.vM30.transcripts.fa.gz | cut -d "|" -f1 > gencode.vM30.transcripts_fixed.fa

Let's double check to see if what we did, was what we wanted:

bash-4.2$ cat gencode.vM30.transcripts_fixed.fa | grep ENS | head
>ENSMUST00000193812.2
>ENSMUST00000082908.3
>ENSMUST00000162897.2
>ENSMUST00000159265.2
>ENSMUST00000070533.5
>ENSMUST00000192857.2
>ENSMUST00000195335.2
>ENSMUST00000192336.2
>ENSMUST00000194099.2
>ENSMUST00000161581.2

Above looks good.

(This might not be necessary, but...) Let's gzip the "fixed" transcripts.

bash-4.2$ gzip gencode.vM30.transcripts_fixed.fa

And list the directory again:

bash-4.2$ ls
gencode.vM30.annotation.gtf.gz	gencode.vM30.transcripts.fa.gz	gencode.vM30.transcripts_fixed.fa.gz  GRCm39.primary_assembly.genome.fa.gz

Print working directory:

bash-4.2$ pwd
/data/references

And use the full directory paths in nf-core/rnaseq (making sure to use the one that was just "fixed": gencode.vM30.transcripts_fixed.fa.gz)

bash-4.2$ find "$(pwd)"
/data/references
/data/references/GRCm39.primary_assembly.genome.fa.gz
/data/references/gencode.vM30.transcripts_fixed.fa.gz
/data/references/gencode.vM30.annotation.gtf.gz
/data/references/gencode.vM30.transcripts.fa.gz

So the flag directories in the nf-core/rnaseq should look like (or using these within the nf-params.json)

--gtf /data/references/gencode.vM30.annotation.gtf.gz
--fasta /data/references/GRCm39.primary_assembly.genome.fa.gz
--transcript_fasta /data/references/gencode.vM30.transcripts_fixed.fa.gz

This worked! I stopped getting the same error that @andreas-wilm was getting in the OP here.

@drpatelh drpatelh added this to the 3.9 milestone Sep 25, 2022
@drpatelh drpatelh changed the title Salmon quantification faile Pre-process transcripts fasta when using --gencode Sep 30, 2022
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Sep 30, 2022
drpatelh added a commit that referenced this issue Sep 30, 2022
@drpatelh
Copy link
Member

Fixed in #875

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants