Fix transcriptome staging issues on DNAnexus for rsem/prepareference #727

Closed
2 tasks done
drejom opened this issue Nov 20, 2021 · 16 comments · Fixed by nf-core/modules#1163
Labels
bug Something isn't working
Milestone

Comments

@drejom
Contributor

drejom commented Nov 20, 2021

Check Documentation

I have checked the following places for your error:

Description of the bug

Steps to reproduce

Steps to reproduce the behaviour:

When using the app v1.0.0-beta.6, the test profile runs successfully (-profile test,docker -r tar --skip_bbsplit), but I haven't managed a run otherwise. It fails pretty quietly somewhere around Star_Align.

The log from that step shows some output files missing:

dfda3e01f2b6: Verifying Checksum
dfda3e01f2b6: Download complete
7ff999a2256f: Download complete
3aaade50789a: Pull complete
00cf8b9f3d2a: Pull complete
7ff999a2256f: Pull complete
10c3bb32200b: Verifying Checksum
10c3bb32200b: Download complete
1721f154786d: Verifying Checksum
1721f154786d: Download complete
d2ba336f2e44: Pull complete
dfda3e01f2b6: Pull complete
10c3bb32200b: Pull complete
1721f154786d: Pull complete
Digest: sha256:e33a844c7244068c6bf252f4b94e34500be4a62719eeb59dcab260a9da1fcd1d
Status: Downloaded newer image for quay.io/biocontainers/star:2.6.1d--0
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Nov 15 22:38:28 ..... started STAR run
Nov 15 22:38:28 ..... loading genome
Nov 15 22:38:41 ..... processing annotations GTF
Nov 15 22:38:48 ..... inserting junctions into the genome indices
Nov 15 22:40:13 ..... started 1st pass mapping
Nov 15 22:41:02 ..... finished 1st pass mapping
Nov 15 22:41:03 ..... inserting junctions into the genome indices
Nov 15 22:42:35 ..... started mapping
CPU: 16% (16 cores) * Memory: 36836/112707MB * Storage: 35/515GB * Net: 55↓/1↑MBps
Nov 15 22:46:46 ..... finished successfully
file-G69K6f09V6kb7Z246qzPYbZx
file-G69K6g09V6kYfPBJ7p18zyZp
file-G69K6j89V6kXV1xB5QjK3kpY
ls: cannot access '*sortedByCoord.out.bam': No such file or directory
ls: cannot access '*Aligned.unsort.out.bam': No such file or directory
ls: cannot access '*fastq.gz': No such file or directory
file-G69K6k89V6kxQxG757xpj0QJ
file-G69K6k89V6kzX6pj7p4qj3p6
file-G69K6pQ9V6kYK0qZ5KpyPqP1
file-G69K6pQ9V6kbjj0f77VfGBj9
file-G69K6k89V6kyG6F562BpxP9g
file-G69K6k89V6kx8f596ZgKZGJ4
file-G69K6qQ9V6kV6f6f6Z3jjZk1
file-G69K6vQ9V6kZp15b6qxGZK5Y
file-G69K6yQ9V6ky0B6J8zq10BbV

However, the log from the run shows an issue with SALMON_QUANT

Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (AGO1_SCR_rep1)'
Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT` input file name collision -- There are multiple input files for each of the following file names: genome.transcripts.fa
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Execution cancelled -- Finishing pending tasks before exit
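
For readers unfamiliar with this error: task inputs are staged side by side in one work directory, so two distinct sources that share a base name cannot both be staged. The sketch below shows that kind of check in Python. It is only an illustration, not Nextflow's actual implementation; the function name `find_name_collisions` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(input_paths):
    """Return base names that are claimed by more than one distinct input path."""
    sources_by_name = defaultdict(set)
    for path in input_paths:
        sources_by_name[PurePosixPath(path).name].add(path)
    return sorted(name for name, sources in sources_by_name.items() if len(sources) > 1)


# Hypothetical inputs: the same file name arriving from two different sources.
staged = [
    "work/aa/rsem/genome.transcripts.fa",
    "work/bb/genome.transcripts.fa",
    "work/cc/sample.bam",
]
collisions = find_name_collisions(staged)
print(collisions)
```

Any name reported by such a check would have to be renamed or dropped upstream before the task could be staged.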

The nextflow.log is stuck in an 'open' state, so I can't read, download, or attach it.

System

  • Hardware: DNAnexus
  • Version 1.0.0-beta.6
@drejom drejom added the bug Something isn't working label Nov 20, 2021
@drpatelh
Member

I am trying to reproduce this on AWS Batch at the moment but I suspect it will work there because our full-sized AWS tests work. Is this something you can help us to debug too please @GHAStVHenry? I'm sure @drejom would be happy to provide you with any info you need.

@drpatelh
Member

@drejom would you mind dumping the contents of .command.sh and .command.run here please? (redacting whatever is required) if you have access to them?

@pditommaso
Contributor

One way to debug this is to copy the .command.sh and .command.run files and run them locally (provided you have the dx tool installed).

@drejom
Contributor Author

drejom commented Nov 24, 2021

I'm a bit stumped because the process that causes the error (eg SALMON_QUANT in the attached logs) only appears in the error message; there's no record of the job being submitted, so I can't retrieve the run folder or its contents. Not sure how to proceed?

@GHAStVHenry

GHAStVHenry commented Nov 29, 2021

I found the problem...
There are 2 genome.transcripts.fa files saved in the rsem folder of the work folder.

genome.chrlist 
genome.fa
genome.grp
genome.idx.fa
genome.n2g.idx.fa
genome.seq
genome.ti
genome.transcripts.fa : file-G6X08jj0BGgQy5q2BqvyFk2X
genome.transcripts.fa : file-G6X08Y00BGgV5q1gKgf5zqXP

I'm not familiar with rsem-prepare-reference, but I'm guessing that STAR --runMode genomeGenerate and rsem-prepare-reference both write the file. The container that holds the work folders is blob storage, so the second write creates a second file with its own unique file ID; on a normal filesystem the second write would overwrite the first and no one would be the wiser. They have the same md5sum, so they are identical, just created/modified at different times:

Created               Sun Nov 28 16:35:56 2021
Last modified         Sun Nov 28 16:36:01 2021

and

Created               Sun Nov 28 16:36:19 2021
Last modified         Sun Nov 28 16:36:22 2021
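
The difference described here, between a filesystem that overwrites by name and blob storage that keys objects by ID, can be sketched with a toy model. This is not the DNAnexus API: both classes, their methods, and the ID format are made up purely for illustration.

```python
import itertools


class OverwritingDir:
    """Filesystem-like folder: one entry per name, later writes replace earlier ones."""

    def __init__(self):
        self.entries = {}

    def write(self, name, data):
        self.entries[name] = data

    def names(self):
        return sorted(self.entries)


class BlobFolder:
    """Blob-like folder: every write creates a new object with a generated ID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.objects = {}

    def write(self, name, data):
        object_id = f"file-{next(self._counter):04d}"  # made-up ID format
        self.objects[object_id] = (name, data)
        return object_id

    def names(self):
        return sorted(name for name, _ in self.objects.values())


posix_dir, blob_dir = OverwritingDir(), BlobFolder()
for folder in (posix_dir, blob_dir):
    # Two writes of the same name with identical content, as in the listing above.
    folder.write("genome.transcripts.fa", b"identical bytes")
    folder.write("genome.transcripts.fa", b"identical bytes")

print(posix_dir.names())
print(blob_dir.names())
```

In the first model the name appears once; in the second it appears twice, which matches the two file IDs seen for genome.transcripts.fa.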

Is there a way to add a cleanup step to the --star branch of the RSEM_PREPAREREFERENCE process that clears out the unnecessary copy after STAR --runMode genomeGenerate?

  • The reason the test profile works is that it takes the fasta as a user input and doesn't use the RSEM_PREPAREREFERENCE process to create it.
  • The reason I didn't come across this issue when I successfully ran rnaseq 3 months ago on DNAnexus with real data is that my aligner of choice is HISAT2, and when you don't choose STAR an alternative version of the process runs without STAR --runMode genomeGenerate and therefore produces no duplicate file.

@drpatelh can you add that rm of genome.transcripts.fa after the STAR --runMode genomeGenerate command?

@drpatelh
Member

Hi @GHAStVHenry ! Thanks for troubleshooting this!

In principle, this sounds like a plausible explanation but I am a little confused as to how it is happening with the default parameters used by the pipeline:

  1. STAR and rsem-prepare-reference should only be run sequentially if you use --aligner star_rsem but the default is --aligner star_salmon:

    STAR \\
    --runMode genomeGenerate \\
    --genomeDir rsem/ \\
    --genomeFastaFiles $fasta \\
    --sjdbGTFfile $gtf \\
    --runThreadN $task.cpus \\
    $memory \\
    $options.args2
    rsem-prepare-reference \\
    --gtf $gtf \\
    --num-threads $task.cpus \\
    ${args.join(' ')} \\
    $fasta \\
    rsem/genome
    cat <<-END_VERSIONS > versions.yml
    ${getProcessName(task.process)}:
    ${getSoftwareName(task.process)}: \$(rsem-calculate-expression --version | sed -e "s/Current version: RSEM v//g")
    star: \$(STAR --version | sed -e "s/STAR_//g")
    END_VERSIONS

  2. I tried running with --aligner star_rsem and changed this line to rsem/genome.test:

    The file listing is below and you will see that we now don't have any genome.transcripts.fa at all indicating that STAR isn't creating one beforehand.

    Genome
    Log.out
    SA
    SAindex
    chrLength.txt
    chrName.txt
    chrNameLength.txt
    chrStart.txt
    exonGeTrInfo.tab
    exonInfo.tab
    geneInfo.tab
    genome.fa
    genome.test.chrlist
    genome.test.grp
    genome.test.idx.fa
    genome.test.n2g.idx.fa
    genome.test.seq
    genome.test.ti
    genome.test.transcripts.fa
    genomeParameters.txt
    sjdbInfo.txt
    sjdbList.fromGTF.out.tab
    sjdbList.out.tab
    transcriptInfo.tab
    

So if I had to narrow it down based on your observation, I suspect that when rsem-prepare-reference is used in isolation to create the transcriptome here:

rsem-prepare-reference \\
--gtf $gtf \\
--num-threads $task.cpus \\
$options.args \\
$fasta \\
rsem/genome

it is somehow writing a file with the same name internally, but I still need to confirm this.

@drpatelh
Member

drpatelh commented Nov 29, 2021

@GHAStVHenry, are you able to upload those files here, or check whether they differ at all, along with any timestamps?

genome.transcripts.fa : file-G6X08jj0BGgQy5q2BqvyFk2X
genome.transcripts.fa : file-G6X08Y00BGgV5q1gKgf5zqXP

@GHAStVHenry

The files have the same md5sums, so they'll be the same, let me know if you want me to upload them and there is a sequential write time difference. The actual times are above.

Hmmm, you are right re- default --aligner star_salmon... the .command.sh shows only the rsem-prepare-reference not the sequential STAR.

Actually my first post firmly put the blame in the rsem-prepare-reference and then I read the conditional for the sequential STAR and changed my mind, forgetting that that isn't even what happened with my test. I'm not familiar with rsem-prepare-reference but it seems weird that it would write the same file twice... but that seems to be what's happening.
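
The md5sum comparison mentioned at the start of this comment can be reproduced for any two downloaded copies. A minimal Python sketch follows, using only the standard library; the helper name `md5_of`, the temporary file names, and the placeholder FASTA content are assumptions for illustration, not the real files.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "copy_1.fa"
    second = Path(tmp) / "copy_2.fa"
    payload = b">tx1\nACGT\n"  # placeholder content, not the real transcriptome
    first.write_bytes(payload)
    second.write_bytes(payload)
    same_content = md5_of(first) == md5_of(second)

print(same_content)
```

Equal digests indicate identical content regardless of when each copy was created or modified.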

@drpatelh
Member

drpatelh commented Nov 29, 2021

Ok. Thanks! If they are the same then that may be less problematic otherwise we would have no way of picking which one to take (or telling NF to anyway).

I have pushed a quick fix to the tar branch for this in d98e7a2

This just takes a hard-named file called genome.transcripts.fa instead of a glob which was causing the original staging issue. Hopefully, this means any one of the above files will be used downstream in the pipeline and should solve our issue!

Would you mind giving it a go with -r tar in the command?

@GHAStVHenry

GHAStVHenry commented Nov 29, 2021

Alright, tried it... it kinda sorta worked... the
Process NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT input file name collision -- There are multiple input files for each of the following file names: genome.transcripts.fa
is resolved... but now
SALMON_QUANT started but had this error...

============
Exception : [The provided transcript file: "genome.transcripts.fa" does not exist!
]
============

SALMON_QUANT's work folder didn't have the fasta in there... actually no inputs are there, but looking at other process work folders, it doesn't appear that inputs are saved/retained, so I can't confirm this.

@GHAStVHenry

GHAStVHenry commented Nov 29, 2021

Alright, I think I understand the problem... it isn't rsem-prepare-reference

output:
path "rsem" , emit: index
path "rsem/*transcripts.fa", emit: transcript_fasta
path "versions.yml" , emit: versions

This outputs the transcripts fasta twice: once inside the rsem folder emitted as index, and once explicitly as transcript_fasta. Does index need the transcripts fasta in there?

I'm not sure which, but DNAnexus/Nextflow is capable of handling the multiple files of the same name as is evidenced by my test from above:
WARN: Multiple files matching path: 'dx://container-G6XQbfQ02QXpYQxB5211v0Vz:/scratch/d0/06fbf7008a445d76956859f0e94be7/rsem' -- picking: file-G6XQx8j02QXkZy4z59fF1XQK
Each glob based output sees multiple files and chooses one, but then because there are 2 inputs containing the same file, 2 copies get written anyway, which causes the conflict.
...but for some reason with your modification no file is getting sent now... I can't confirm that it really isn't getting there, inputs aren't uploaded/retained in the work folders so I don't know if it was there.
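
The overlap between the two output declarations can be illustrated with a small Python sketch. It treats the `rsem` directory output as "everything under rsem/" and the `rsem/*transcripts.fa` output as a glob, which is a simplification of how Nextflow resolves outputs; the file names come from the rsem folder listing earlier in this thread, and the variable names are hypothetical.

```python
from fnmatch import fnmatch

# File names from the rsem/ listing earlier in this thread.
rsem_files = [
    "rsem/genome.chrlist",
    "rsem/genome.fa",
    "rsem/genome.grp",
    "rsem/genome.idx.fa",
    "rsem/genome.n2g.idx.fa",
    "rsem/genome.seq",
    "rsem/genome.ti",
    "rsem/genome.transcripts.fa",
]

# Simplified stand-ins for the two output declarations quoted above.
index_files = [f for f in rsem_files if f.startswith("rsem/")]
transcript_fasta = [f for f in rsem_files if fnmatch(f, "rsem/*transcripts.fa")]

# Files captured by both outputs reach a downstream process twice
# if that process takes both channels as input.
emitted_twice = sorted(set(index_files) & set(transcript_fasta))
print(emitted_twice)
```

Only the transcripts fasta falls in both sets, which is consistent with it being the single file named in the collision error.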

I set up some tests, but I accidentally pushed my commits to the wrong remote, I sent it to NFCore repo (tar branch) instead of my fork, but if it doesn't work, or if you have a better way of fixing it, you can revert it. Will update with result of test...

@GHAStVHenry

GHAStVHenry commented Nov 29, 2021

Ok, after fighting with the glob output of index to try and exclude the transcripts fasta finally got it working... it FINALLY got past SALMON_QUANT... now waiting to see if it gets through the end to see if index needs transcripts fasta in it

EDIT: WORKED!!!

@drpatelh
Member

Awesome! Great work! 🥳

Ok I pushed a last commit based on what you found. This is just so we don't mess with the default files generated by RSEM and pass them all along in the index.

Would you mind running a last test using the -r tar branch?

cbae500

@GHAStVHenry

GHAStVHenry commented Nov 30, 2021

The first test I tried was similar to that and didn't work... yours did though, at least it got past SALMON_QUANT, will update once it gets to the end!

EDIT: WORKED!!!

@drejom
Contributor Author

drejom commented Nov 30, 2021

Worked for me too! Rippa!! Thanks @GHAStVHenry @drpatelh

@drpatelh
Member

Rippa <- 🤣

Ok. Will leave this open until we properly push the fixes into the main pipeline. Thanks guys.

@drpatelh drpatelh added this to the 3.5 milestone Dec 13, 2021
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 13, 2021
@drpatelh drpatelh changed the title from "NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN fails on dnanexus" to "Fix transcriptome staging issues on DNAnexus for rsem/prepareference" Dec 13, 2021
4 participants