NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:TX2GENE error on iGenomes TAIR10 #1132

Closed
holmrenser opened this issue Nov 24, 2023 · 4 comments · Fixed by #1150

holmrenser commented Nov 24, 2023

Description of the bug

I tried running nf-core/rnaseq 3.13.2 on the Arabidopsis thaliana TAIR10 genome from iGenomes and hit an error in the tx2gene step, which processes the annotation GTF. I have used previous pipeline versions on the same genome without this issue.

Command used and terminal output

nextflow run nf-core/rnaseq --input samplesheet_full.csv --outdir mapping_full --genome TAIR10 --max_cpus 8 -profile docker -r 3.13.2

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:TX2GENE (genome.filtered.gtf)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:TX2GENE (genome.filtered.gtf)` terminated with an error exit status (1)

Command executed:

  tx2gene.py \
      --quant_type salmon \
      --gtf genome.filtered.gtf \
      --quants quants \
      --id gene_id \
      --extra gene_name \
      -o tx2gene.tsv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:TX2GENE":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File "/home/<omitted>/.nextflow/assets/nf-core/rnaseq/bin/tx2gene.py", line 162, in <module>
      if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
    File "/home/<omitted>/.nextflow/assets/nf-core/rnaseq/bin/tx2gene.py", line 122, in map_transcripts_to_gene
      transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)
    File "/home/<omitted>/.nextflow/assets/nf-core/rnaseq/bin/tx2gene.py", line 59, in discover_transcript_attribute
      attributes = dict(item.strip().split(" ", 1) for item in cols[8].split(";") if item.strip())
  ValueError: dictionary update sequence element #4 has length 1; 2 is required

Work dir:
  /lustre/<omitted>/full_experiment/work/57/7fc5bc40a9829787f3723fa27f46a9

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

Nextflow version: 23.10.0.5889
Hardware: 96CPU server, lustre filesystem
Executor: local
Container engine: Docker
OS: Ubuntu 20.04.6
Version of nf-core/rnaseq: 3.13.2

@holmrenser holmrenser added the bug Something isn't working label Nov 24, 2023
@Guy2Horev

I got the same error today after the workflow was updated.

@drpatelh drpatelh added this to the 3.13.3 milestone Jan 3, 2024
@pinin4fjords
Member

This is genuinely a bad GTF file rather than a pipeline issue: there's a semicolon in one of the gene names, specifically at line 33090:

1   ensembl CDS 4810488 4811109 .   +   0   exon_number "1"; gene_biotype "protein_coding"; gene_id "AT1G14040"; gene_name "PHO1;H3"; gene_source "ensembl"; gene_version "1"; p_id "P3587"; protein_id "AT1G14040.1"; protein_version "1"; transcript_biotype "protein_coding"; transcript_id "AT1G14040.1"; transcript_name "PHO1;H3"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS29975";

"PHO1;H3" is not a good value and it's upsetting the parsing of the semicolon-delimited attributes field.
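The failure can be reproduced in isolation. A minimal sketch, using a shortened copy of the attribute string quoted above and the same split-based expression that appears in the traceback; `naive_parse` is a name chosen here for illustration and is not taken from the pipeline:

```python
# Attribute column (field 9) of the offending line, shortened for readability.
attrs = (
    'exon_number "1"; gene_biotype "protein_coding"; gene_id "AT1G14040"; '
    'gene_name "PHO1;H3"; gene_source "ensembl"; transcript_id "AT1G14040.1";'
)


def naive_parse(field):
    # Same expression as in the traceback: split on every ";",
    # then split each piece on its first space.
    return dict(item.strip().split(" ", 1) for item in field.split(";") if item.strip())


# Splitting on ";" cuts the quoted gene name in two.
pieces = [item.strip() for item in attrs.split(";") if item.strip()]
print(pieces[3])  # gene_name "PHO1
print(pieces[4])  # H3"

# The second half has no space, so it yields a one-element list
# and dict() rejects it with a ValueError.
try:
    naive_parse(attrs)
    error = None
except ValueError as exc:
    error = str(exc)
    print(error)
```

The fragment `H3"` is the fifth piece (index 4), which matches the "element #4" reported in the traceback.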

It wasn't an issue previously because we didn't sample enough lines (which @MatthiasZepper fixed).

I'll try to add something to skip a limited number of bad lines (we don't need them all for this part of the code). In the meantime I recommend you review our guidelines on reference file usage: you really are better off using more recent files from Ensembl (and you can complain to Ensembl about invalid formatting like this).
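The idea of skipping a limited number of bad lines could look roughly like the sketch below. This is an illustration only, not the code in `tx2gene.py`: the function names and the `max_bad` threshold are hypothetical, and the two input lines are made-up tab-separated records.

```python
def parse_attributes(field):
    """Parse a GTF attribute column by splitting on ';' (not quote-aware)."""
    attrs = {}
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        # Raises ValueError on a fragment without a space, such as 'H3"'.
        key, value = item.split(" ", 1)
        attrs[key] = value.strip('"')
    return attrs


def sample_attributes(lines, max_bad=10):
    """Yield attribute dicts, tolerating up to max_bad unparseable lines."""
    bad = 0
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        try:
            yield parse_attributes(cols[8])
        except ValueError:
            bad += 1
            if bad > max_bad:
                raise ValueError(f"more than {max_bad} malformed attribute fields")


# Two made-up records: the first has a semicolon inside a quoted value.
lines = [
    '1\tensembl\tCDS\t1\t2\t.\t+\t0\tgene_id "A"; gene_name "PHO1;H3";',
    '1\tensembl\tCDS\t1\t2\t.\t+\t0\tgene_id "B"; gene_name "X";',
]
records = list(sample_attributes(lines))
print(records)  # [{'gene_id': 'B', 'gene_name': 'X'}]
```

Capping the number of tolerated failures keeps a thoroughly broken file from passing silently while still letting a few odd records through.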

@pinin4fjords
Member

Actually, I think I can do better and allow those semicolons. PR incoming. I still maintain they're a silly idea though...
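One way to tolerate semicolons inside quoted values is to match each `key "value"` pair with a regular expression instead of splitting on `;`. The sketch below is an assumption about how this could be done, not necessarily what the linked PR implements; `ATTRIBUTE` and `parse_attributes_quoted` are hypothetical names.

```python
import re

# A key (no whitespace, ';' or '"'), then whitespace, then a double-quoted
# value. The value may contain ';' because only '"' ends it.
ATTRIBUTE = re.compile(r'([^\s;"]+)\s+"([^"]*)"')


def parse_attributes_quoted(field):
    """Return a dict of the quoted key/value pairs in a GTF attribute column."""
    return {key: value for key, value in ATTRIBUTE.findall(field)}


attrs = (
    'exon_number "1"; gene_biotype "protein_coding"; gene_id "AT1G14040"; '
    'gene_name "PHO1;H3"; gene_source "ensembl"; transcript_id "AT1G14040.1";'
)
parsed = parse_attributes_quoted(attrs)
print(parsed["gene_name"])      # PHO1;H3
print(parsed["transcript_id"])  # AT1G14040.1
```

This pattern only captures quoted values; unquoted values would need an extra alternative in the expression.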

@pinin4fjords pinin4fjords linked a pull request Jan 3, 2024 that will close this issue
@drpatelh
Member

drpatelh commented Jan 3, 2024

Fixed in #1150

@drpatelh drpatelh closed this as completed Jan 3, 2024