Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incompatible Igenomes NCBI GRCh38 genes.gtf - gene_biotype annotation missing #460

Closed
tweep opened this issue Aug 30, 2020 · 9 comments
Closed
Milestone

Comments

@tweep
Copy link

tweep commented Aug 30, 2020

Hi,
I'm running into issues with the format of the genes.gtf which is referenced in v1.4.2 /conf/igenomes.config
The NCBI gtf does not include the gene_biotype field, so the pipeline fails in the feature_counts setting when using the default igenomes.config with the error message below. Can you point us to a valid gtf ?

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
  The specified gene identifier attribute is 'gene_biotype' 
  An example of attributes included in your GTF annotation is 'gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672"; 

S3 cmd to retrieve genes.gtf
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/ /gnet/is5/p04/data/bioinfo/ehive/reference-data/igenomes/references/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/ --exclude "*" --include "genes.gtf"

@leiendeckerlu
Copy link

Indeed super annoying error, ran into the same problem, can be circumvented by supplying the --skipBiotypeQC parameter.

@drpatelh
Copy link
Member

drpatelh commented Sep 2, 2020

Yes, apologies for this. It has indeed caused alot of confusion since the NCBI version of that genome was added. The fix that @leiendeckerlu is the best one for that version. I have been thinking about dropping that genome from igenomes.config for the next release to avoid this confusion but that would be annoying for users that have been using that version of the genome for other analysis up until now. Another workaround is to auto-set --skipBiotypeQC if that version of the genome is used (not ideal if you want that info) or maybe even to print an explicit warning to the terminal. Will try and make things a bit more explicit in the next release but I think that's the best we can do unless there is a better suggestion?

@drpatelh
Copy link
Member

drpatelh commented Sep 2, 2020

Note: this has also been discussed at some length in #159

@drpatelh drpatelh added this to the 1.5 milestone Sep 2, 2020
@drpatelh
Copy link
Member

This has been added as warning here 1e752cb and will also auto-set --skip_biotype_qc if using NCBI GRCh38 from igenomes.config. Since we may be moving to RefGenie to manage reference genomes I think this will suffice for now.

@drpatelh
Copy link
Member

This will now be auto-detected as of #603 such that featureCounts won't be run if the biotype field provided to the pipeline doesn't exist in the GTF file. This will now work for any GTF and not just those that were manually specified to generate a warning as mentioned in the above comment.

@LucasMS
Copy link

LucasMS commented Oct 22, 2021

Hi all,
I am finding a similar problem which can not be solved by --skip_biotype_qc. I am using assembly and gtf from NCBI (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT/) which I downloaded manually and specified with the flags --fasta and --gtf.

nextflow run nf-core/rnaseq -r 3.4 --input drosophila.melanogaster.rna.sample.sheet.csv \
--fasta GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna \
--gtf GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.gtf \
 -profile ccga_med

 #############

Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:RSEM_PREPAREREFERENCE_TRANSCRIPTS (rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:RSEM_PREPAREREFERENCE_TRANSCRIPTS (rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna)` terminated with an error exit status (255)

Command executed:

  rsem-prepare-reference \
      --gtf GCF_000001215.4_Release_6_plus_ISO1_MT_genomic_genes.gtf \
      --num-threads 12 \
       \
      rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna \
      rsem/genome
  
  cat <<-END_VERSIONS > versions.yml
  RSEM_PREPAREREFERENCE_TRANSCRIPTS:
      rsem: $(rsem-calculate-expression --version | sed -e "s/Current version: RSEM v//g")
      star: $(STAR --version | sed -e "s/STAR_//g")
  END_VERSIONS

Command exit status:
  255

Command output:
  rsem-extract-reference-transcripts rsem/genome 0 GCF_000001215.4_Release_6_plus_ISO1_MT_genomic_genes.gtf None 0 rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna
  "rsem-extract-reference-transcripts rsem/genome 0 GCF_000001215.4_Release_6_plus_ISO1_MT_genomic_genes.gtf None 0 rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna" failed! Plase check if you provide correct parameters/options for the pipeline!

Command error:
  �[33mWARNING:�[0m Skipping mount /sw_ifs/sw/spack/spack-0.13.4/usr/opt/spack/linux-centos7-x86_64/gcc-8.3.0/singularity-3.5.2-5fbksv64ebpysiyeojqy4um7mvl5g7e7/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  The reference contains no transcripts!

I also try to work this around by running it with the gff from NCBI and a gff converted by gffread to gtf. There, normally the error is that the similar command returns something like

 Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:RSEM_PREPAREREFERENCE_TRANSCRIPTS (rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:RSEM_PREPAREREFERENCE_TRANSCRIPTS (rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna)` terminated with an error exit status (255)

Command executed:

  rsem-prepare-reference \
      --gtf GCF_000001215.4_Release_6_plus_ISO1_MT_genomic_genes.gtf \
      --num-threads 12 \
       \
      rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna \
      rsem/genome
  
  cat <<-END_VERSIONS > versions.yml
  RSEM_PREPAREREFERENCE_TRANSCRIPTS:
      rsem: $(rsem-calculate-expression --version | sed -e "s/Current version: RSEM v//g")
      star: $(STAR --version | sed -e "s/STAR_//g")
  END_VERSIONS

Command exit status:
  255

Command output:
  rsem-extract-reference-transcripts rsem/genome 0 GCF_000001215.4_Release_6_plus_ISO1_MT_genomic_genes.gtf None 0 rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna
  Parsed 200000 lines
  "rsem-extract-reference-transcripts rsem/genome 0 GCF_000001215.4_Release_6_plus_ISO1_MT_genomic_genes.gtf None 0 rsem/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna" failed! Plase check if you provide correct parameters/options for the pipeline!

Command error:
  �[33mWARNING:�[0m Skipping mount /sw_ifs/sw/spack/spack-0.13.4/usr/opt/spack/linux-centos7-x86_64/gcc-8.3.0/singularity-3.5.2-5fbksv64ebpysiyeojqy4um7mvl5g7e7/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  The GTF file might be corrupted!
  Stop at line : NT_033777.3	RefSeq	exon	21375060	21375912	.	-	.	transcript_id "rna-NM_001260299.1"; gene_name "mod(mdg4)"; Dbxref "FLYBASE:FBtr0307759,GeneID:49228,Genbank:NM_001260299.1,FLYBASE:FBgn0002781"; Note "mod(mdg4)-RAD%3B CG32491-RAD%3B Dmel\mod(mdg4)-RAD%3B Dmel\CG32491-RAD"; exception "trans-splicing"; gbkey "mRNA"; gene "mod(mdg4)"; locus_tag "Dmel_CG32491"; orig_protein_id "gnl|FlyBase|CG32491-PAD|gb|AFH06546"; orig_transcript_id "gnl|FlyBase|CG32491-RAD"; partial "true"; product "modifier of mdg4%2C transcript variant AD"; transcriptID "NM_001260299.1";
  Error Message: Cannot find gene_id!

I think this is related to issue #92 from rsem.

This is my current setup: revision 3.4, singularity, nextflow 21.05.0.edge.

Side note, is only me or the link to get an invite to the slack channel does not work?

I thank you very much for the help and the nice pipeline!
Cheers,
Lucas

robomics added a commit to paulsengroup/2022-mcf10a-cancer-progression that referenced this issue Feb 22, 2023
Using iGenomes reference assembly and GTF turned out to be problematic.
See here: https://nf-co.re/docs/usage/reference_genomes#illumina-aws-igenomes
and here: nf-core/rnaseq#460

So instead we are now using the assembly and GTFs published by Ensembl
robomics added a commit to paulsengroup/2022-mcf10a-cancer-progression that referenced this issue Feb 22, 2023
Using assemblies and GTFs from iGenomes turned out to be problematic.
See here: https://nf-co.re/docs/usage/reference_genomes#illumina-aws-igenomes
and here: nf-core/rnaseq#460

So instead we are now using the assembly and GTFs published by Ensembl.
@avramiaharonoff
Copy link

Hello @LucasMS, did you manage to solve this problem? I running into the same one. I converted from gff3 to gtf, and receive the second of your two errors.

@berk1835
Copy link

berk1835 commented Dec 28, 2023

@aaharonoff @LucasMS
I had this issue, in the end I manually added gene_id "text" to the gtf file in the same column and row as transcript_id "text". In my case, the transcript_id and gene_id were the same. Probably not the most efficient way to do it, especially if you shall be using lots of genomes rather than the same one multiple times, but it did the trick.

@LucasMS
Copy link

LucasMS commented Jan 7, 2024

Hi @ aaharonoff , sorry! I do not remember how I solved or worked around this problem. I hope @berk1835 solution works for you. Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants