Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTF_2_GFF3 fails with latest Gencode GTF for human #72

Closed
amizeranschi opened this issue Jun 29, 2023 · 21 comments
Closed

GTF_2_GFF3 fails with latest Gencode GTF for human #72

amizeranschi opened this issue Jun 29, 2023 · 21 comments
Assignees
Labels
bug Something isn't working

Comments

@amizeranschi
Copy link

Description of the bug

The pipeline fails when running with --rmats and the latest GTF file for human for Gencode: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz, even though --gencode is specified in the command.

GTF file source: https://www.gencodegenes.org/human/

I'm including a command below that reproduces the error. The sample sheet and contrast sheet are based on the pipeline's test profile, but I've customized the GTF and reference genome.

Command used and terminal output

nextflow run nf-core/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --genome GRCh38 \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz --rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon


ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:GTF_2_GFF3'

Caused by:
  Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:GTF_2_GFF3` terminated with an error exit status (139)

Command executed:

  gffread gencode.v43.annotation.gtf -L | awk -F'	' -vOFS='	' '{ gsub("transcript", "mRNA", $3); print}' > gencode.v43.annotation_genes.gff3
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:GTF_2_GFF3":
      gffread: $(gffread --version 2>&1)
  END_VERSIONS

Command exit status:
  139

Command output:
  (empty)

Command error:
  .command.sh: line 2:    28 Segmentation fault      (core dumped) gffread gencode.v43.annotation.gtf -L
          30 Done                    | awk -F'	' -vOFS='	' '{ gsub("transcript", "mRNA", $3); print}' > gencode.v43.annotation_genes.gff3

Relevant files

No response

System information

No response

@amizeranschi amizeranschi added the bug Something isn't working label Jun 29, 2023
@valentinoruggieri
Copy link
Collaborator

Hi @amizeranschi.
Thank you for reporting the issue. We will let you know as soon as possible

dkoppstein added a commit to dkoppstein/rnasplice that referenced this issue Jul 16, 2023
@amizeranschi
Copy link
Author

For the record, the referenced commits by @dkoppstein have solved my issue. Would be great to see this fix merged into the base pipeline.

Unfortunately, my test run still ended up failing, due to a different reason. This is the updated command that I ran:

nextflow run dkoppstein/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --genome GRCh38 \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz --rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon

And below is the new issue. @valentinoruggieri any help with this new error would be much appreciated.

ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (ERR188428)'

Caused by:
  Process `NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (ERR188428)` terminated with an error exit status (105)

Command executed:

  STAR \
      --genomeDir STARIndex \
      --readFilesIn input1/ERR188428_1_val_1.fq.gz input2/ERR188428_2_val_2.fq.gz \
      --runThreadN 12 \
      --outFileNamePrefix ERR188428. \
       \
      --sjdbGTFfile gencode.v43.annotation.gtf \
      --outSAMattrRGline 'ID:ERR188428'  'SM:ERR188428'  \
      --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
  
  
  
  if [ -f ERR188428.Unmapped.out.mate1 ]; then
      mv ERR188428.Unmapped.out.mate1 ERR188428.unmapped_1.fastq
      gzip ERR188428.unmapped_1.fastq
  fi
  if [ -f ERR188428.Unmapped.out.mate2 ]; then
      mv ERR188428.Unmapped.out.mate2 ERR188428.unmapped_2.fastq
      gzip ERR188428.unmapped_2.fastq
  fi
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN":
      star: $(STAR --version | sed -e "s/STAR_//g")
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
      gawk: $(echo $(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*$//')
  END_VERSIONS

Command exit status:
  105

Command output:
  	STAR --genomeDir STARIndex --readFilesIn input1/ERR188428_1_val_1.fq.gz input2/ERR188428_2_val_2.fq.gz --runThreadN 12 --outFileNamePrefix ERR188428. --sjdbGTFfile gencode.v43.annotation.gtf --outSAMattrRGline ID:ERR188428 SM:ERR188428 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
  	STAR version: 2.7.9a   compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
  Jul 18 09:03:48 ..... started STAR run
  Jul 18 09:03:48 ..... loading genome

Command error:
  	STAR --genomeDir STARIndex --readFilesIn input1/ERR188428_1_val_1.fq.gz input2/ERR188428_2_val_2.fq.gz --runThreadN 12 --outFileNamePrefix ERR188428. --sjdbGTFfile gencode.v43.annotation.gtf --outSAMattrRGline ID:ERR188428 SM:ERR188428 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
  	STAR version: 2.7.9a   compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
  Jul 18 09:03:48 ..... started STAR run
  Jul 18 09:03:48 ..... loading genome
  
  EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.9a
  SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.4a
  
  Jul 18 09:03:48 ...... FATAL ERROR, exiting

@dkoppstein
Copy link
Contributor

@amizeranschi thanks for the confirmation. I don't have access to a linux box with Docker to run the formal tests, so am waiting on opening a PR, if you or one of the devs can test run -profile test,docker on my branch it would be great.

However, I am also encountering another issue downstream at the index_gff step with a Gencode GTF... will try tinkering around a bit, then may need to submit another bug report.

For your STAR issue in the short term, it seems like the AWS iGenomes are out of date -- I would suggest using the latest ENSEMBL or GENCODE fasta and gtf files, and specifying them on the command line.

@amizeranschi
Copy link
Author

amizeranschi commented Jul 18, 2023

OK, I will give this a try. Either way, I guess the STAR index will have to be rebuilt, correct? I will try using the latest FASTA and GTF from Gencode, I just saw that they recently made a new release, v. 44.

@dkoppstein
Copy link
Contributor

dkoppstein commented Jul 18, 2023 via email

@amizeranschi
Copy link
Author

OK, then I'll add --save_reference to my command so that I can later reuse the new STAR index. Thanks a lot for your help.

@valentinoruggieri
Copy link
Collaborator

valentinoruggieri commented Jul 18, 2023

@amizeranschi @dkoppstein Thank you both for reviewing and testing the code.
As you mentioned above, updating gffread from 0.12.1 to 0.12.7 solves the issue in our test too. In addition, we are evaluating to add the option "--keep-gene" in the gffread script (e.g. gffread $gtf -L --keep-genes ) since it seems that "genes" are not automatically added in the gff3 file, and this might cause problems in some downstream processes.

Regarding the STAR issue, as you pointed out it depends on the STAR version (2.5.1b 2016) used to generate the index in the AWS iGenome and the STAR version used for the mapping (version 2.7.9a). We are looking for a way to take into consideration this discrepancy.
However, passing the genome file with the option '--fasta' skips the issue since the index is rebuilt using the right STAR version.

@amizeranschi
Copy link
Author

OK, changing the command to the one below got things moving forward for me:


nextflow run dkoppstein/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --igenomes_ignore --genome null --save_reference \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--fasta https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
--rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon

@dkoppstein I am now getting some GFF-related warnings and errors, but I'm not sure if this is what you were referring to earlier. This is what I get:

ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)'

Caused by:
  Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)` terminated with an error exit status (1)

Command executed:

  miso --run index ERR188428_sorted.bam --output-dir miso_data/ERR188428 --read-len 75
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN":
      misopy: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('misopy').version)")
  END_VERSIONS

Command exit status:
  1

Command output:
  
  Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
  Computing Psi values...
    - GFF index: index
    - BAM: ERR188428_sorted.bam
    - Read length: 75
    - Output directory: miso_data/ERR188428
  Checking your GFF annotation and BAM for mismatches...
  Checking if BAM has mixed read lengths...
  07/18/2023 11:25:14 AM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188428_sorted.bam
  MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
  Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
  
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
  07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
  	http://genes.mit.edu/burgelab/miso/docs/
   for more information.
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
  BAM references: 
  ('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were: 
  07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
  Mapping genes to their indexed GFF representation, using index
  Searching for index/genes_to_filenames.shelve..
    - File not found.
  Skipping: /home/ubuntu/work/dc/eeb28063f47f2t support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
  Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
  
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
  07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
  	http://genes.mit.edu/burgelab/miso/docs/
   for more information.
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
  BAM references: 
  ('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were: 
  07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
  Mapping genes to their indexed GFF representation, using index
  Searching for index/genes_to_filenames.shelve..
    - File not found.
  Skipping: index/genes.gff
  Skipping: index/compressed_ids_to_genes.shelve.bak
  Skipping: index/genes_to_filenames.shelve.bak
  Skipping: index/genes_to_filenames.shelve.dat
  Skipping: index/compressed_ids_to_genes.shelve.dir
  Skipping: index/genes_to_filenames.shelve.dir
  Skipping: index/compressed_ids_to_genes.shelve.dat
  07/18/2023 11:25:34 AM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?

Command error:
  
  Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
  Computing Psi values...
    - GFF index: index
    - BAM: ERR188428_sorted.bam
    - Read length: 75
    - Output directory: miso_data/ERR188428
  Checking your GFF annotation and BAM for mismatches...
  Checking if BAM has mixed read lengths...
  07/18/2023 11:25:14 AM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188428_sorted.bam
  MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
  Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
  
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
  07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
  	http://genes.mit.edu/burgelab/miso/docs/
   for more information.
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
  BAM references: 
  ('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were: 
  07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
  Mapping genes to their indexed GFF representation, using index
  Searching for index/genes_to_filenames.shelve..
    - File not found.
  Skipping: /home/ubuntu/work/dc/eeb28063f47f2t support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
  Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
  
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
  07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
  	http://genes.mit.edu/burgelab/miso/docs/
   for more information.
  07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
  BAM references: 
  ('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
  07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were: 
  07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
  Mapping genes to their indexed GFF representation, using index
  Searching for index/genes_to_filenames.shelve..
    - File not found.
  Skipping: index/genes.gff
  Skipping: index/compressed_ids_to_genes.shelve.bak
  Skipping: index/genes_to_filenames.shelve.bak
  Skipping: index/genes_to_filenames.shelve.dat
  Skipping: index/compressed_ids_to_genes.shelve.dir
  Skipping: index/genes_to_filenames.shelve.dir
  Skipping: index/compressed_ids_to_genes.shelve.dat
  07/18/2023 11:25:34 AM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?

@valentinoruggieri
Copy link
Collaborator

valentinoruggieri commented Jul 18, 2023

adding the option "--keep-gene" in the gffread script of the GTF_2_GFF3 module ( gffread $gtf -L --keep-genes | awk -F'\t' -vOFS='\t' '{ gsub("transcript", "mRNA", \$3); print}' > ${gtf.baseName}_genes.gff3 ) should solve the issue

@amizeranschi
Copy link
Author

@dkoppstein I also ran a built-in test earlier, as you suggested:

nextflow run dkoppstein/rnasplice -r dev -latest -profile test,docker

and the outcome was similar to my test above. Would you be able to implement the fix suggested by @valentinoruggieri?

ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)'

Caused by:
  Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)` terminated with an error exit status (1)

Command executed:

  miso --run index ERR188383_sorted.bam --output-dir miso_data/ERR188383 --read-len 75
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN":
      misopy: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('misopy').version)")
  END_VERSIONS

Command exit status:
  1

Command output:
  MISO (Mixture of Isoforms model)
  Probabilistic analysis of RNA-Seq data for detecting differential isoforms
  Use --help argument to view options.
  
  Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
  Computing Psi values...
    - GFF index: index
    - BAM: ERR188383_sorted.bam
    - Read length: 75
    - Output directory: miso_data/ERR188383
  Checking your GFF annotation and BAM for mismatches...
  Checking if BAM has mixed read lengths...
  07/18/2023 12:11:06 PM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188383_sorted.bam
  MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
  Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
  
  Mapping genes to their indexed GFF representation, using index
  Searching for index/genes_to_filenames.shelve..
    - File not found.
  Skipping: index/genes.gff
  Skipping: index/compressed_ids_to_genes.shelve.bak
  Skipping: index/genes_to_filenames.shelve.bak
  Skipping: index/genes_to_filenames.shelve.dat
  Skipping: index/compressed_ids_to_genes.shelve.dir
  Skipping: index/genes_to_filenames.shelve.dir
  Skipping: index/compressed_ids_to_genes.shelve.dat
  07/18/2023 12:11:11 PM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?

Command error:
  MISO (Mixture of Isoforms model)
  Probabilistic analysis of RNA-Seq data for detecting differential isoforms
  Use --help argument to view options.
  
  Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
  Computing Psi values...
    - GFF index: index
    - BAM: ERR188383_sorted.bam
    - Read length: 75
    - Output directory: miso_data/ERR188383
  Checking your GFF annotation and BAM for mismatches...
  Checking if BAM has mixed read lengths...
  07/18/2023 12:11:06 PM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188383_sorted.bam
  MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
  Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
  
  Mapping genes to their indexed GFF representation, using index
  Searching for index/genes_to_filenames.shelve..
    - File not found.
  Skipping: index/genes.gff
  Skipping: index/compressed_ids_to_genes.shelve.bak
  Skipping: index/genes_to_filenames.shelve.bak
  Skipping: index/genes_to_filenames.shelve.dat
  Skipping: index/compressed_ids_to_genes.shelve.dir
  Skipping: index/genes_to_filenames.shelve.dir
  Skipping: index/compressed_ids_to_genes.shelve.dat
  07/18/2023 12:11:11 PM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?

@dkoppstein
Copy link
Contributor

@amizeranschi I just pushed the suggested fix to my fork.

@amizeranschi
Copy link
Author

Thanks, it seems like that did the trick. The built-in test ran fine now. I'll also run my own test above and see how that turns out.

@amizeranschi
Copy link
Author

Unfortunately, my test still ran into an error. Reminder, this is the command I'm running:

nextflow run dkoppstein/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --igenomes_ignore --genome null --save_reference \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--fasta https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
--rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon

And this was the outcome, with the latest version of @dkoppstein's fork. Any idea about this error, @valentinoruggieri?

Caused by:
  Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_SASHIMI (1)` terminated with an error exit status (1)

Command executed:

  sashimi_plot --plot-event ENSG00000004961 index miso_settings.txt --output-dir sashimi
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_SASHIMI":
      python: $(python --version | sed "s/Python //g")
      misopy: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('misopy').version)")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  /usr/local/lib/python2.7/site-packages/matplotlib/cbook/deprecation.py:107: MatplotlibDeprecationWarning: The mpl_toolkits.axes_grid module was deprecated in version 2.1. Use mpl_toolkits.axes_grid1 and mpl_toolkits.axisartist provies the same functionality instead.
    warnings.warn(message, mplDeprecation, stacklevel=1)
  Traceback (most recent call last):
    File "/usr/local/bin/sashimi_plot", line 11, in <module>
      sys.exit(main())
    File "/usr/local/lib/python2.7/site-packages/misopy/sashimi_plot/sashimi_plot.py", line 276, in main
      plot_label=plot_label)
    File "/usr/local/lib/python2.7/site-packages/misopy/sashimi_plot/sashimi_plot.py", line 142, in plot_event
      %(event_name, pickle_dir)
  Exception: Event ENSG00000004961 not found in pickled directory index. Are you sure this is the right directory for the event?

@valentinoruggieri
Copy link
Collaborator

@amizeranschi it seems the error arises because the "miso_genes" IDs in nextflow.config are slightly different from those of the genome version you used (e.g. ENSG00000004961 vs ENSG00000004961.15) . You should change them accordingly.

@amizeranschi
Copy link
Author

Hi @valentinoruggieri

I'm afraid I don't fully follow you. Could you please be a bit more specific? Where and what should I change in regards to those "miso_genes" IDs?

@valentinoruggieri
Copy link
Collaborator

rnasplice relies on "miso_genes" params to specify the gene IDs you want to plot via miso/sashimi_plot function. By default, it includes 3 genes ('ENSG00000004961, ENSG00000005302, ENSG00000147403') (check the nextflow.config file).
Users need to specify the gene IDs they want to plot accordingly to the genome annotation they are using.
For example, in the gencode version you are using (version 44), genes ID are coded with an extra suffix (.*) like "ENSG00000004961.15". So to correctly instruct miso, you need to use a list of gene IDs that corresponds to that version.
You can do that:

  1. by modifying the gene IDs in the nextflow.config file
    (for example replacing the default values with 'ENSG00000004961.15, ENSG00000005302.19, ENSG00000147403.18');
  2. by adding an argument to your command line --miso_genes 'ENSG00000004961.15';
    [https://nf-co.re/rnasplice/dev/parameters] for more details.

@amizeranschi
Copy link
Author

I see, thanks a lot for the added details. I have added the following to my command line:

--miso_genes "ENSG00000004961.15, ENSG00000005302.19, ENSG00000147403.18"

and everything worked perfectly.

It might also be helpful to update the documentation to reflect the actual default value for --miso_genes from nextflow.config. The website currently lists only two genes: ENSG00000004961, ENSG00000005302:

https://nf-co.re/rnasplice/dev/parameters#miso_genes

@dkoppstein
Copy link
Contributor

I would also suggest changing these values automatically if --gencode is specified.

@amizeranschi
Copy link
Author

Good point @dkoppstein. I opened a separate issue report here: #78

@amizeranschi
Copy link
Author

@dkoppstein could you please create a PR with your fixes?

@amizeranschi
Copy link
Author

Thanks a lot @jma1991 for fixing this issue, can confirm that it works. Unfortunately, now I'm running into another issue, described here: #71

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants