Improve BWA index workflow #596

jfy133 · 2020-10-27T12:07:11Z

Currently we by default unzip the FASTA file

Lines 280 to 300 in f2a326d

    
           if("${params.fasta}".endsWith(".gz")){ 
        
               process unzip_reference{ 
        
                   tag "${zipped_fasta}" 
        
                   input: 
        
                   path zipped_fasta from file(params.fasta) // path doesn't like it if a string of an object is not prefaced with a root dir (/), so use file() to resolve string before parsing to `path`  
        
                   output: 
        
                   path "$unzip" into ch_fasta into ch_fasta_for_bwaindex,ch_fasta_for_bt2index,ch_fasta_for_faidx,ch_fasta_for_seqdict,ch_fasta_for_circulargenerator,ch_fasta_for_circularmapper,ch_fasta_for_damageprofiler,ch_fasta_for_qualimap,ch_fasta_for_pmdtools,ch_fasta_for_genotyping_ug,ch_fasta_for_genotyping_hc,ch_fasta_for_genotyping_freebayes,ch_fasta_for_genotyping_pileupcaller,ch_fasta_for_vcf2genome,ch_fasta_for_multivcfanalyzer,ch_fasta_for_genotyping_angsd 
        
                   script: 
        
                   unzip = zipped_fasta.toString() - '.gz' 
        
                   """ 
        
                   pigz -f -d -p ${task.cpus} $zipped_fasta 
        
                   """ 
        
                   } 
        
               } else { 
        
               fasta_for_indexing = Channel 
        
               .fromPath("${params.fasta}", checkIfExists: true) 
        
               .into{ ch_fasta_for_bwaindex; ch_fasta_for_bt2index; ch_fasta_for_faidx; ch_fasta_for_seqdict; ch_fasta_for_circulargenerator; ch_fasta_for_circularmapper; ch_fasta_for_damageprofiler; ch_fasta_for_qualimap; ch_fasta_for_pmdtools; ch_fasta_for_genotyping_ug; ch_fasta__for_genotyping_hc; ch_fasta_for_genotyping_hc; ch_fasta_for_genotyping_freebayes; ch_fasta_for_genotyping_pileupcaller; ch_fasta_for_vcf2genome; ch_fasta_for_multivcfanalyzer;ch_fasta_for_genotyping_angsd } 
        
           }

however, both bwa index and picard CreateSequenceDictinary does allow indexing of gzipped files. This can be confusing for people who have already BWA indexed their genomes while gzipped, and this then gets out of sync with the nf-core/eager bwa mapping command where it has unzipped the FASTA (so now doesn't have .gz) but then the actual supplied indicies do have the .gz suffix.

It would make more intuitive sense that we skip the whole unzipping thing unless we are running samtools faidx. However we need to check if this may affect downstream analysis.

In the meantime we should document this properly.

Originally reported by @marcel-keller

The text was updated successfully, but these errors were encountered:

jfy133 added the enhancement New feature or request label Oct 27, 2020

jfy133 added a commit that referenced this issue Oct 27, 2020

Address #596

aa643e3

jfy133 added pending Addressed on branch waiting for related PR partially-fixed labels Oct 27, 2020

jfy133 added this to the 2.2.1 milestone Oct 27, 2020

jfy133 mentioned this issue Oct 27, 2020

Patch 2.2.1 #594

Merged

8 tasks

jfy133 closed this as completed in #594 Oct 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve BWA index workflow #596

Improve BWA index workflow #596

jfy133 commented Oct 27, 2020

Improve BWA index workflow #596

Improve BWA index workflow #596

Comments

jfy133 commented Oct 27, 2020