[nf-core/circrna] error: Your FASTQ files do not have the appropriate extension #42

Closed
BirongZhang opened this issue Oct 27, 2021 · 19 comments
Labels
bug Something isn't working

Comments

@BirongZhang

BirongZhang commented Oct 27, 2021

Hi all,

Thanks so much for generating this useful pipeline!
I wanted to find circRNAs in a different way and came across your work. But when I use it, I run into the following problem:

Here is my code:

nohup nextflow run nf-core/circrna \
-r 892b3136e7432221bd81f8c7cc0400ebe541b08e \
-profile singularity \
--genome 'GRCh37' \
--input "/scratch/c.c2050857/circrna/raw_data_gz/*.fastq.gz" \
--input_type 'fastq' \
--module 'circrna_discovery' \
--tool 'ciriquant, dcc, find_circ, circexplorer2' \
--outdir Results 

My fastq.gz data:
Here I also have a question, is this pipeline only for fastq.gz data? Can I use fastq data?
Screenshot 2021-10-27 at 13 45 23

My error:
Screenshot 2021-10-27 at 13 45 43

Could you please take a look at this? Any advice would be appreciated. Thanks!

Kind regards,
Birong

@BirongZhang added the bug label Oct 27, 2021
@BarryDigby
Collaborator

Hey there, unfortunately the workflow has been designed to work only with paired-end FASTQ files.

@BirongZhang
Author

Hi Barry, thanks so much for letting me know that.

@BarryDigby
Collaborator

BarryDigby commented Oct 28, 2021

I'm looking at the experiment metadata for SRR6343608

It states the data is PAIRED. Maybe you did not download the data correctly.

Might I suggest using nf-core/fetchngs, which is an excellent workflow for downloading public datasets. Simply provide the workflow with an SRA / ENA / GEO ID, so for you it would be PRJNA420975 for the first sample. You will have to go digging for the rest.

Good luck!

@BirongZhang
Author

Hey Barry,

Thanks for your kind help!
Let me check my data. I have only downloaded a few samples of this dataset and would like to try them out first.
However, I still have a lot of single-end FASTQ data...

Thanks again!

Kind regards,
Birong

@BirongZhang
Author

BirongZhang commented Nov 1, 2021

> I'm looking at the experiment metadata for SRR6343608
>
> It states the data is PAIRED. Maybe you did not download the data correctly.
>
> Might I suggest using nf-core/fetchngs, which is an excellent workflow for downloading public datasets. Simply provide the workflow with an SRA / ENA / GEO ID, so for you it would be PRJNA420975 for the first sample. You will have to go digging for the rest.
>
> Good luck!

Hi Barry,

Thanks for letting me know about the useful pipeline nf-core/fetchngs! It works well!

You are right, it is a paired-end dataset (SRR6343628). I don't know why my previous method didn't work:

nohup parallel -j 1 fastq-dump --skip-technical -F ::: $(cat SraAccList.txt)

Last week I tried the same data set: PRJNA420975 with nf-core/fetchngs pipeline, but I was a little confused. Perhaps this data (SRX3441728) is large and the pipeline splits the data into two parts. So when I want to merge them, should I merge them as shown below?
Screenshot 2021-11-01 at 12 53 39

Thanks again for your time and work!

Kind regards,
Birong

@BarryDigby
Collaborator

Hey Birong,

Yep, that didn't work because you need to include the --split-3 option in your fastq-dump command. This will split the mate pairs into *_1.fastq and *_2.fastq files for you. But I see you got nf-core/fetchngs working :)

Your merge strategy looks correct to me. Judging by the file sizes, they might have split SRX3441728 over two lanes to increase sequencing depth, in which case merging makes sense.

However, just to be safe, run FastQC on the SRX3441728 samples to make sure one of the lanes wasn't a bad batch.

Also, if you merge the files, check to make sure that the *_1.fastq and *_2.fastq mate files have the same number of reads. (I am pretty sure I have come across this error with the aligners in this workflow).

Best,
Barry
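
The merge and the read-count check suggested above can be scripted. This is a minimal sketch, assuming gzipped FASTQ files with four lines per read; the function names and the file names in the usage comment are hypothetical and not part of the pipeline:

```shell
# Concatenate gzipped FASTQ files into one; gzip streams can be joined with cat.
# Usage: merge_gz OUT IN1 IN2 ...
merge_gz() {
  out=$1
  shift
  cat "$@" > "$out"
}

# Count reads in a gzipped FASTQ file (4 lines per record, integer division).
count_reads() {
  lines=$(gzip -cd "$1" | wc -l | tr -d ' ')
  echo $((lines / 4))
}

# Succeed only when both mate files contain the same number of reads.
check_pair() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  echo "$1: $n1 reads, $2: $n2 reads"
  [ "$n1" = "$n2" ]
}

# Example (hypothetical file names, not run here):
#   merge_gz merged_1.fastq.gz laneA_1.fastq.gz laneB_1.fastq.gz
#   merge_gz merged_2.fastq.gz laneA_2.fastq.gz laneB_2.fastq.gz
#   check_pair merged_1.fastq.gz merged_2.fastq.gz || echo "mate files differ"
```

The lanes should be listed in the same order for both mates, otherwise read N of the merged *_1 file no longer pairs with read N of the merged *_2 file.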

@BirongZhang
Author

Hi Barry,

Sorry, it is me, again!

Thanks for your reply! They are all working now. I've successfully downloaded several datasets!

But I have a new problem with circrna pipeline.

I am using the supercomputer Hawk, and paired data.
Here is my script:

module load nextflow/21.04.0
module load singularity

nextflow run nf-core/circrna \
-r 892b3136e7432221bd81f8c7cc0400ebe541b08e \
-profile singularity \
--genome 'GRCh37' \
--input "raw_data/SRR6343628_{1,2}.fastq.gz" \
--input_type 'fastq' \
--module 'circrna_discovery' \
--tool 'circexplorer2' 

Here is my error:
Screenshot 2021-11-09 at 19 06 18

Error executing process > 'STAR_1PASS (SRR6343628)'
Caused by: Process requirement exceed available CPUs -- req: 16; avail: 8

What does this mean? Does it mean the supercomputer doesn't meet the pipeline's requirements? When I run the test data, I successfully get the result (test_outdir). What should I do?

Let me know if you need any further information. Thanks so much for your time and patience!

Best regards,
Birong

@BarryDigby
Collaborator

Hi Birong,

Don't worry about it - happy to help.

So this means that the process STAR_1PASS requested 16 CPUs, but you only have 8 CPUs available on the queue you sent the job to on Hawk. You will need to change the configuration file settings. Try the following:

  1. Make a fork of the repository.
  2. Clone the forked repository to your computer
  3. Make changes to the conf/base.config file
  4. git add . -> git commit -m "config change for hawk" -> git push
  5. Your saved changes to the config file now exist on your forked repo. When running Nextflow, be sure to pull your fork and not my original circrna repo, i.e. nextflow pull BirongZhang/circrna, then nextflow run BirongZhang/circrna -r dev [...]

Here is what I mean by point 3:

/*
 * -------------------------------------------------
 *  nf-core/circrna Nextflow base config file
 * -------------------------------------------------
 * A 'blank slate' config file, appropriate for general
 * use on most high performance compute environments.
 * Assumes that all software is installed and available
 * on the PATH. Runs in `local` mode - all jobs will be
 * run on the logged in environment.
 */

process {

  cpus = { check_max( 1 * task.attempt, 'cpus' ) }
  memory = { check_max( 7.GB * task.attempt, 'memory' ) }
  time = { check_max( 12.h * task.attempt, 'time' ) }

  errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
  maxRetries = 1
  maxErrors = '-1'

  // Process-specific resource requirements
  // NOTE - Only one of the labels below are used in the fastqc process in the main script.
  //        If possible, it would be nice to keep the same label naming convention when
  //        adding in your processes.
  // TODO nf-core: Customise requirements for specific processes.
  // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors

  withLabel:process_low {
    cpus = { check_max( 2 * task.attempt, 'cpus' ) }     // This line controls CPU usage for process_low labels
    memory = { check_max( 14.GB * task.attempt, 'memory' ) }
    time = { check_max( 6.h * task.attempt, 'time' ) }
  }
  withLabel:process_medium {
    cpus = { check_max( 8 * task.attempt, 'cpus' ) }    // If your CPU max is 8, set this to 4?
    memory = { check_max( 42.GB * task.attempt, 'memory' ) }
    time = { check_max( 8.h * task.attempt, 'time' ) }
  }
  withLabel:process_high {
    cpus = { check_max( 16 * task.attempt, 'cpus' ) }   // Alignment steps use process_high and request 16 CPUs. Change this to 8 CPUs
    memory = { check_max( 84.GB * task.attempt, 'memory' ) }
    time = { check_max( 16.h * task.attempt, 'time' ) }
  }
  withLabel:process_long {
    time = { check_max( 24.h * task.attempt, 'time' ) }
  }
  withName:get_software_versions {
    cache = false
  }
  withLabel:py3{
    container = 'barryd237/py3:dev'
  }
}

You could ask your system administrator about the maximum CPU and memory capacity of Hawk so you can configure this file in such a way that it never asks for more resources than are available.
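
As an alternative to forking, Nextflow can layer an extra config file over the pipeline defaults with the -c option. The following is a minimal sketch, assuming the alignment steps carry the process_high label as described above and that the node offers 8 CPUs; the file name hawk.config is hypothetical:

```shell
# Write a small override file (hypothetical name) next to the launch script.
cat > hawk.config <<'EOF'
// Cap the label used by the alignment steps at the 8 CPUs the node offers.
process {
  withLabel:process_high {
    cpus = 8
  }
}
EOF

# Then add "-c hawk.config" to the original launch command (not run here):
#   nextflow run nf-core/circrna \
#     -r 892b3136e7432221bd81f8c7cc0400ebe541b08e \
#     -profile singularity \
#     -c hawk.config \
#     --genome 'GRCh37' \
#     --input "raw_data/SRR6343628_{1,2}.fastq.gz" \
#     --input_type 'fastq' \
#     --module 'circrna_discovery' \
#     --tool 'circexplorer2'
```

Only the cpus value is overridden; memory and time for that label still come from conf/base.config. If this revision exposes the usual nf-core --max_cpus parameter (the check_max(..., 'cpus') calls suggest it does, but that is an assumption), passing --max_cpus 8 on the command line would cap every label, including retried tasks.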

@BarryDigby
Collaborator

@BirongZhang Going to re-open this issue because it has a lot of good troubleshooting questions in it - if that's ok?

@BarryDigby reopened this Nov 10, 2021
@BirongZhang
Author

Hi Barry,

Thanks so much for your kind help!

I will try what you suggested and ask the Hawk team about the maximum CPU count.

I will let you know what happens. Thanks again.

Best,
Birong

@BirongZhang
Author

Hi Barry,

I am back.
I can now use another highmem partition, so the previous problem may be solved. But this time a new problem emerged before the STAR step:

N E X T F L O W  ~  version 21.04.0
Launching `nf-core/circrna` [prickly_gautier] - revision: 892b3136e7432221bd81f8c7cc0400ebe541b08e
WARNING: Could not load nf-core/config profiles: https://raw.githubusercontent.com/nf-core/configs/master/nfcore_custom.config
WARN: There's no process matching config selector: get_software_versions


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/circrna v1.0.0
------------------------------------------------------


Input data log info:
No input sample CSV file provided, attempting to read from path instead.
Reading input data from path: raw_data/SRR6343609_{1,2}.fastq.gz


Core Nextflow options
  runName               : prickly_gautier
  containerEngine       : singularity
  container             : barryd237/circrna:dev
  launchDir             : /scratch/c.c2050857/NAFLD/GSE107650
  workDir               : /scratch/c.c2050857/NAFLD/GSE107650/work
  projectDir            : /home/c.c2050857/.nextflow/assets/nf-core/circrna
  userName              : c.c2050857
  profile               : singularity
  configFiles           : /home/c.c2050857/.nextflow/assets/nf-core/circrna/nextflow.config

Input/output options
  input                 : raw_data/SRR6343609_{1,2}.fastq.gz
  input_type            : fastq
  outdir                : GSE107650_Results

Reference genome files
  genome                : GRCh37

STAR
  chimScoreSeparation   : 10

Generic options
  max_multiqc_email_size: 25 MB

Max job request options
  max_memory            : 128 GB
  max_time              : 10d

------------------------------------------------------
 Only displaying parameters that differ from defaults.
------------------------------------------------------
WARN: Access to undefined parameter `name` -- Initialise it to a default value eg. `params.name = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
WARN: Access to undefined parameter `gtf` -- Initialise it to a default value eg. `params.gtf = some_value`
WARN: Access to undefined parameter `bowtie` -- Initialise it to a default value eg. `params.bowtie = some_value`
WARN: Access to undefined parameter `bowtie2` -- Initialise it to a default value eg. `params.bowtie2 = some_value`
WARN: Access to undefined parameter `bwa` -- Initialise it to a default value eg. `params.bwa = some_value`
WARN: Access to undefined parameter `fasta_fai` -- Initialise it to a default value eg. `params.fasta_fai = some_value`
WARN: Access to undefined parameter `hisat` -- Initialise it to a default value eg. `params.hisat = some_value`
WARN: Access to undefined parameter `star` -- Initialise it to a default value eg. `params.star = some_value`
WARN: Access to undefined parameter `segemehl` -- Initialise it to a default value eg. `params.segemehl = some_value`
WARN: Access to undefined parameter `circexplorer2_annotation` -- Initialise it to a default value eg. `params.circexplorer2_annotation = some_value`

executor >  local (2)
[f6/4d3ce9] process > SOFTWARE_VERSIONS       [100%] 1 of 1 ✔
[-        ] process > BWA_INDEX               -
[-        ] process > SAMTOOLS_INDEX          -
[-        ] process > HISAT2_INDEX            -
[-        ] process > STAR_INDEX              -
[-        ] process > BOWTIE_INDEX            -
[-        ] process > BOWTIE2_INDEX           -
[-        ] process > SEGEMEHL_INDEX          -
[-        ] process > FILTER_GTF              -
[-        ] process > CIRIQUANT_YML           -
[-        ] process > GENE_ANNOTATION         -
[-        ] process > BAM_TO_FASTQ            -
[e3/10de03] process > FASTQC_RAW (SRR6343609) [  0%] 0 of 1
[-        ] process > BBDUK                   -
[-        ] process > FASTQC_BBDUK            -
[-        ] process > CIRIQUANT               -
[-        ] process > STAR_1PASS              -
[-        ] process > SJDB_FILE               -
[-        ] process > STAR_2PASS              -
[-        ] process > CIRCEXPLORER2           -
[-        ] process > CIRCRNA_FINDER          -
[-        ] process > DCC_MATE1               -
[-        ] process > DCC_MATE2               -
[-        ] process > DCC                     -
[-        ] process > FIND_ANCHORS            -
[-        ] process > FIND_CIRC               -
[-        ] process > MAPSPLICE_ALIGN         -
[-        ] process > MAPSPLICE_PARSE         -
[-        ] process > SEGEMEHL_ALIGN          -
[-        ] process > ANNOTATION              -
[-        ] process > FASTA                   -
[-        ] process > COUNT_MATRIX_SINGLE     -
[-        ] process > TARGETSCAN_DATABASE     -
[-        ] process > MIRNA_PREDICTION        -
[-        ] process > MIRNA_TARGETS           -
[-        ] process > HISAT_ALIGN             -
[-        ] process > STRINGTIE               -
[-        ] process > DEA                     -
[-        ] process > MULTIQC                 -
Error executing process > 'STAR_1PASS (null)'

Caused by:
  Connect to ngi-igenomes.s3.amazonaws.com:443 [ngi-igenomes.s3.amazonaws.com/52.218.112.154] failed: Network is unreachable (connect failed)

-[nf-core/circrna] Pipeline completed with errors-
WARN: Killing pending tasks (1)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.

Have you ever met this before? Let me know if you need more details, thanks.

Best,
Birong

@BarryDigby
Collaborator

Hey Birong,

It looks like you do not have an internet connection on the cluster. Try pinging Google from the cluster; the result should look like this:

barry@YT-1300:/data$ ping www.google.com
PING www.google.com(di-in-f106.1e100.net (2a00:1450:400b:c01::6a)) 56 data bytes
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=1 ttl=110 time=55.6 ms
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=2 ttl=110 time=132 ms
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=3 ttl=110 time=43.7 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 43.667/77.093/132.042/39.157 ms
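
Since ping uses ICMP while the failing step connects over HTTPS on port 443, it may also be worth testing the exact host from the error message. This is a minimal sketch, assuming curl is installed on the cluster; the function name can_reach is hypothetical:

```shell
# Succeed if an HTTPS request to the given URL gets any HTTP response.
# Any status code counts as reachable; network or TLS failures return non-zero.
# Usage: can_reach URL [TIMEOUT_SECONDS]
can_reach() {
  curl --silent --head --output /dev/null --max-time "${2:-10}" "$1"
}

# Example (not run here):
#   can_reach https://ngi-igenomes.s3.amazonaws.com && echo reachable || echo blocked
```

Running this on the same partition that executes the job matters, because login and compute nodes can sit behind different firewall rules.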

@BarryDigby
Collaborator

If you can locate the reference genome files you need (GRCh37 FASTA, GTF files [previous runs on your laptop maybe?]) and upload them to the cluster manually, you will not need to connect to the AWS iGenomes bucket to automatically pull reference files.

Then I can look into running the pipeline 'offline' for you - I've never done it but can try to learn
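
A small pre-flight check can confirm that the manually uploaded reference files are in place before launching. This is a sketch only: the function name and paths are placeholders, and --fasta / --gtf are assumed to be the relevant parameter names because the run log refers to params.fasta and params.gtf:

```shell
# Return non-zero if any listed reference file is missing or empty.
# Usage: require_files FILE1 FILE2 ...
require_files() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example (placeholder paths, assumed parameter names, not run here):
#   require_files /path/to/genome.fa /path/to/genes.gtf && \
#     nextflow run nf-core/circrna [...] \
#       --fasta /path/to/genome.fa --gtf /path/to/genes.gtf
```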

@BirongZhang
Author

Hi Barry,

I am back again!
So sorry for the delay. I had a break.

Yes, you are right. The supercomputer team also told me that sometimes I was not allowed to download some external data because of the firewall. This also reminds me that sometimes I cannot even use wget in some supercomputer partitions.

I really appreciate your offer of "offline" help, but I don't think I should take up any more of your time and energy with my particular case. You have done enough for me, and I really learned a lot from our conversation.

No worries: while trying your pipeline I generated some STAR junction files, so next I will try the circular RNA tools one by one.

Nice to meet you online! Thanks so much for your kind help all this time!

Best,
Birong

@BirongZhang
Author

BirongZhang commented Dec 1, 2021

Hi Barry,

I am back again!
I saw you also used DCC. I got an error when using it; could you take a quick look?

Here are my scripts:

# http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1221177343_h3fzKaDey7mY9G3uYurJpQBBXJ1S&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C560%2C138-15%2C602%2C945&hgta_outputType=primaryTable&hgta_outFileName=UCSC
sed -i  '' 's/^chr//g' GRCh38_repeat_file.gtf
head -3 GRCh38_repeat_file.gtf

# Preparation of input files for circRNA detection step
# step one: obtain reference genome: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/

# step two: repeat masker file for the genome build: GRCh38_repeatmasker.gtf.gz
# http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1221177343_h3fzKaDey7mY9G3uYurJpQBBXJ1S&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C560%2C138-15%2C602%2C945&hgta_outputType=primaryTable&hgta_outFileName=UCSC
# https://www.biostars.org/p/227979/

DCC samplesheet \
      -mt1 meta1 \
      -mt2 meta2 \
      -D \
      -R GRCh38_repeat_file.gtf \
      -an GRCh38_repeatmasker.gtf \
      -Pi \
      -F \
      -M \
      -Nr 5 6 \
      -fg \
      -G \
      -O DCC \
      -A /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Here is my STAR output:

find -L Results/data/sample -name "*_Chimeric.out.junction" > samplesheet
find -L Results/data/sample_1 -name "*_1_Chimeric.out.junction" > meta1
find -L Results/data/sample_2 -name "*_2_Chimeric.out.junction" > meta2

$ head -3 samplesheet
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036347/SRR9036347_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036347/SRR9036347_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036334/SRR9036334_2_Chimeric.out.junction

$ head -3 meta1
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036347/SRR9036347_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036334/SRR9036334_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036315/SRR9036315_1_Chimeric.out.junction

$ head -3 meta2
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036347/SRR9036347_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036334/SRR9036334_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036315/SRR9036315_2_Chimeric.out.junction

Screenshot 2021-12-01 at 00 54 12

Here is my error:

DCC 0.5.0 started
44 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
WARNING: File meta2, line 1 does not contain all features.
WARNING: meta2 is probably corrupt.
WARNING: Offending line: /scratch/c.c2050857/NAFLD/GSE130970/Results/R2/SRR9036381_2_Chimeric.out.junction
Traceback (most recent call last):
  File "/nfshome/store03/users/c.c2050857/.venv-circtools-detect/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.5.0', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 254, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 535, in fixall
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 94, in fixchimerics
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 65, in fixmate2
IndexError: list index out of range

Is there anything wrong with my scripts or the input? Thanks!

Best regards,
Birong

@BarryDigby
Collaborator

Hey Birong,

A couple of things that might help (though it is hard to tell from the output):

  • Use the same GTF file that was used to build the STAR index and run the STAR mapping. You have passed the repeat masker file to -an, which should instead receive the GTF used for STAR, i.e. the full GRCh38 GTF file.
  • There is supposed to be an @ symbol in front of samplesheet, meta1 and meta2.
  • Try those first, and if it still complains about your *Chimeric.out.junction file then it might really be corrupted.

Here is an example of one I have on my computer:

chr1	1767875	-	chr1	9242065	+	2	1	4	simulate:21663	1767876	76S24M	9242066	24S76M1595p73M18828N27M
chr1	9243810	+	chr1	11026892	+	0	0	0	simulate:21971	9242109	100M1519p82M18S	11026893	82S16M2S
chr1	15450870	+	chr1	15447467	+	2	1	2	simulate:33533	15450824	46M54S	15447468	46S54M3206p100M
chr1	15450870	+	chr1	15447467	+	2	1	2	simulate:33535	15447538	18M3168N82M48p16M84S	15447468	16S82M2S
chr1	15450870	+	chr1	15447467	+	2	1	2	simulate:33540	15450832	38M62S	15447468	38S62M10p16M3168N84M
chr1	15450870	+	chr1	15447467	+	2	1	2	simulate:33547	15447485	71M3168N27M2S69p50M50S	15447468	50S50M
chr1	15447467	-	chr1	15450870	-	1	2	1	simulate:33549	15447468	59S39M2S	15450749	100M-38p59M41S
chr1	15450855	+	chr1	15447532	+	-1	0	0	simulate:33558	15450755	100M	15447533	23M3168N77M
chr1	15450870	+	chr1	15447467	+	2	1	2	simulate:33559	15450838	32M68S	15447468	32S68M-14p34M3168N66M
chr1	15447475	-	chr1	15450754	-	-1	0	0	simulate:33560	15447476	80M3168N20M	15447486	70M3168N30M

(14 columns).

Good luck,

Barry
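
A quick way to see whether a file looks like the 14-column example above is to count the lines with a different number of tab-separated fields. This is a minimal sketch; the function name is hypothetical, and the default of 14 columns is taken from the example in this comment (other STAR versions or settings may write a different layout):

```shell
# Print how many lines do not have exactly N tab-separated fields (default 14).
# Usage: count_bad_junction_lines FILE [N]
count_bad_junction_lines() {
  awk -F '\t' -v n="${2:-14}" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Example (not run here):
#   count_bad_junction_lines SRR9036347_2_Chimeric.out.junction
#   count_bad_junction_lines meta2   # a list of paths: every line is "bad"
```

Run on a list file such as meta2, every line is reported, which is consistent with the "does not contain all features" warning when the list is passed without the @ prefix.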

@BirongZhang
Author

BirongZhang commented Dec 1, 2021

Hi Barry,

Thanks so much for your kind reply! It helps a lot!🥳

Do you mean this? How about -R GRCh38_repeat_file.gtf and -B bam_file.txt ? Do you have any suggestions about them?

DCC @samplesheet \
      -mt1 @meta1 \
      -mt2 @meta2 \
      -D \
      -R GRCh38_repeat_file.gtf \
      -an /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.103.gtf \
      -Pi \
      -F \
      -M \
      -Nr 5 6 \
      -fg \
      -G \
      -O DCC \
      -A /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Before that, I put all the STAR output into one big directory. Today, I tried putting each SRR* STAR output file into its own SRR* directory. So now samplesheet has 158 lines, and meta1 and meta2 have 78 lines each. Is that okay? I am really confused about how to prepare these files.😣

sample => samplesheet (_1 and _2, 158 lines)
Screenshot 2021-12-01 at 18 03 17

sample_1 => meta1 (only _1, 78 lines); sample_2 => meta2 (only _2, 78 lines)
Screenshot 2021-12-01 at 18 04 57

Let me try it first, thanks again!🤗

Kind regards,
Birong

@BarryDigby
Collaborator

The way I designed DCC in my workflow is to use the outputs from STAR using the 2nd pass mode.

  1. Map both reads to genome using STAR (1st pass).
  2. Collect all sj.out.tab files for every sample mapped in 1st pass. (these are novel junction sites)
  3. Perform STAR 2nd pass mapping, where I include the sj.out.tab files to help STAR align to novel splice sites. This is done for A: paired end reads, and B: each read individually
  4. Collect the Chimeric.out.junction files. Using SRR9036307 as an example, DCC expects SRR9036307_Chimeric.out.junction, SRR9036307_1_Chimeric.out.junction and SRR9036307_2_Chimeric.out.junction as inputs.

In the workflow, for sample SRR9036307, there are 3 inputs:

SRR9036307/SRR9036307_Chimeric.out.junction
mate1/SRR9036307_1_Chimeric.out.junction
mate2/SRR9036307_2_Chimeric.out.junction

The printf command is simply placing these $PATHS in samplesheet, mate1 and mate2 files for DCC - nothing special.

There is no -B flag ;) check their documentation here: https://github.com/dieterich-lab/DCC#runnning-dcc

Barry
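
The three list files can be generated in one loop so that line N of each file refers to the same sample. This is a minimal sketch, assuming the directory layout shown above (one directory per sample plus mate1/ and mate2/); the function name write_dcc_lists is hypothetical and is not the printf step used in the workflow:

```shell
# Write samplesheet, mate1 and mate2 into OUTDIR, one path per line.
# Samples missing any of the three junction files are skipped with a message.
# Usage: write_dcc_lists STAR_ROOT OUTDIR
write_dcc_lists() {
  root=$1
  outdir=$2
  : > "$outdir/samplesheet"
  : > "$outdir/mate1"
  : > "$outdir/mate2"
  for dir in "$root"/*/; do
    sample=$(basename "$dir")
    case "$sample" in
      mate1|mate2) continue ;;
    esac
    paired="$root/$sample/${sample}_Chimeric.out.junction"
    m1="$root/mate1/${sample}_1_Chimeric.out.junction"
    m2="$root/mate2/${sample}_2_Chimeric.out.junction"
    if [ -s "$paired" ] && [ -s "$m1" ] && [ -s "$m2" ]; then
      printf '%s\n' "$paired" >> "$outdir/samplesheet"
      printf '%s\n' "$m1" >> "$outdir/mate1"
      printf '%s\n' "$m2" >> "$outdir/mate2"
    else
      echo "skipping $sample: expected junction file missing or empty" >&2
    fi
  done
}
```

With this layout all three files end up with the same number of lines, one per sample, and they would be passed to DCC with the @ prefix discussed earlier in the thread.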

@BirongZhang
Author

Hi Barry,

Thanks for your time!

It is so clear, I will try it and let you know what happens.
Thanks again!

Best,
Birong

@nictru closed this as completed Sep 25, 2023