docs: Document params with test data files (#64)

Several documentation fixes:

* In the case 1 to 3 example commands, use the test data input files for step 1 and step 2
* Add instructions to unzip the PacBio read dataset before running the example commands
* Fix misspelled params illiumina, mitocondrial, and falcon-unzip
* arrow01 is now falcon_unzip
* Add warning note about hyphens or dashes may not copy from browser correctly
* Fix the markdown table for missing the final pipe

Update conda install instructions and yaml file

* bbtools breaks Nextflow java and the script we needed from it is contained in other dependencies.
* update conda install documentation to use the other_dependencies yaml for conda installs

---------

Co-authored-by: Andrew Severin <severin@iastate.edu>
isugifNF · Feb 2, 2023 · 88624da
1 parent 5a54bf9
Showing 5 changed files with 69 additions and 43 deletions.
diff --git a/docs/Case_01.md b/docs/Case_01.md
@@ -24,25 +24,33 @@ We have listed some example files to test the pipeline based on Chromosome 30 Hz
 Case 1 will take primary assembly from the `FALCON/2-asm-falcon` folder.
 
 | Param | Files | Download link|
-|:--|:--|:--
-| --primary_assembly | "p_ctg.fasta" | [p_ctg.fasta](https://data.nal.usda.gov/system/files/p_ctg.fasta)|
-| --mitochondrial_assembly | "GCF_022581195.2_ilHelZeax1.1_mito.fa" | [GenBank download fasta](https://www.ncbi.nlm.nih.gov/nuccore/NC_061507.1?report=fasta)|
-| --illumina_reads |"testpolish_{R1,R2}.fq" | [testpolish_R1.fastq](https://data.nal.usda.gov/system/files/testpolish_R1.fastq), [testpolish_R2.fastq](https://data.nal.usda.gov/system/files/testpolish_R2.fastq) |
-| --pacbio_reads | "test.1.filtered.bam" | [test.1.filtered.bam_.gz](https://data.nal.usda.gov/system/files/test.1.filtered.bam_.gz)|
+|:--|:--|:--|
+| `--primary_assembly` | "p_ctg.fasta" | [p_ctg.fasta](https://data.nal.usda.gov/system/files/p_ctg.fasta)|
+| `--mitochondrial_assembly` | "GCF_022581195.2_ilHelZeax1.1_mito.fa" | [GenBank download fasta](https://www.ncbi.nlm.nih.gov/nuccore/NC_061507.1?report=fasta)|
+| `--illumina_reads` |"testpolish_{R1,R2}.fastq" | [testpolish_R1.fastq](https://data.nal.usda.gov/system/files/testpolish_R1.fastq), [testpolish_R2.fastq](https://data.nal.usda.gov/system/files/testpolish_R2.fastq) |
+| `--pacbio_reads` | "test.1.filtered.bam" | [test.1.filtered.bam_.gz](https://data.nal.usda.gov/system/files/test.1.filtered.bam_.gz)|
+
+**Note:** The PacBio Reads (`test.1.filtered.bam_.gz`) must be decompressed before running the pipeline. 
+
+```
+gunzip -dc test.1.filtered.bam_.gz > test.1.filtered.bam
+```
 
 ### Recommended parameters
 
 ```
 nextflow run isugifNF/polishCLR -r main  \
-  --primary_assembly "data/primary.fasta" \
-  --mitocondrial_assembly "data/mitochondrial.fasta" \
-  --illiumina_reads "data/illumina/*_{R1,R2}.fasta.bz" \
-  --pacbio_reads "data/pacbio/pacbio.subreads.bam" \
+  --primary_assembly "p_ctg.fasta" \
+  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
+  --illumina_reads "*_{R1,R2}.fastq" \
+  --pacbio_reads "test.1.filtered.bam" \
   --step 1 \
-  --falcon-unzip false \
+  --falcon_unzip false \
   -profile slurm
 ```
 
+**Note:** On some browsers, the dashes (-) and underscores (_) can be copied incorrectly.  So if you run into an error that says `not valid in the pipeline` try manually retyping those parameters.
+
 Step 2 runs another round of Arrow polishing with the PacBio reads, then polishes with short-reads with two rounds of FreeBayes. We broke these two steps into seperate phases to allow for manual scaffolding.
 
 Provide the purged primary `primary_purged.fa` and alternate contigs `haps_purged.fa` from purge_dups, and mitochondrial genome `mitochondrial.fasta` as input to step 2. 
@@ -56,8 +64,8 @@ Regardless don't forget to include parameter flags `--step 2` and `resume` to th
   --primary_assembly "primary_purged.fa" \
   --alternate_assembly "haps_purged.fa" \
   --mitochondrial_assembly "data/mitochondrial.fasta" \
-  --illumina_reads "data/illumina/*_{R1,R2}.fasta.bz" \
-  --pacbio_reads "../RawSequelData/m*.subreads.bam" \
+  --illumina_reads "*_{R1,R2}.fastq" \
+  --pacbio_reads "test.1.filtered.bam" \
   --step 2 \
   -profile slurm \
   -resume

diff --git a/docs/Case_02.md b/docs/Case_02.md
@@ -20,28 +20,36 @@ We have listed some example files to test the pipeline based on Chromosome 30 Hz
 Case 2 will take primary assembly from the `FALCON/3-unzip` folder.
 
 | Param | Files | Download link|
-|:--|:--|:--
-| --primary_assembly | "all_p_ctg.fasta" | [all_p_ctg.fasta](https://data.nal.usda.gov/system/files/all_p_ctg.fasta)|
-| --alternate_assembly | "all_h_ctg.fasta" |[all_h_ctg.fasta](https://data.nal.usda.gov/system/files/all_h_ctg.fasta)|
-| --mitochondrial_assembly | "GCF_022581195.2_ilHelZeax1.1_mito.fa" | [GenBank download fasta](https://www.ncbi.nlm.nih.gov/nuccore/NC_061507.1?report=fasta)|
-| --illumina_reads |"testpolish_{R1,R2}.fq" | [testpolish_R1.fastq](https://data.nal.usda.gov/system/files/testpolish_R1.fastq), [testpolish_R2.fastq](https://data.nal.usda.gov/system/files/testpolish_R2.fastq) |
-| --pacbio_reads | "test.1.filtered.bam" | [test.1.filtered.bam_.gz](https://data.nal.usda.gov/system/files/test.1.filtered.bam_.gz)|
+|:--|:--|:--|
+| `--primary_assembly` | "all_p_ctg.fasta" | [all_p_ctg.fasta](https://data.nal.usda.gov/system/files/all_p_ctg.fasta)|
+| `--alternate_assembly` | "all_h_ctg.fasta" |[all_h_ctg.fasta](https://data.nal.usda.gov/system/files/all_h_ctg.fasta)|
+| `--mitochondrial_assembly` | "GCF_022581195.2_ilHelZeax1.1_mito.fa" | [GenBank download fasta](https://www.ncbi.nlm.nih.gov/nuccore/NC_061507.1?report=fasta)|
+| `--illumina_reads` |"testpolish_{R1,R2}.fastq" | [testpolish_R1.fastq](https://data.nal.usda.gov/system/files/testpolish_R1.fastq), [testpolish_R2.fastq](https://data.nal.usda.gov/system/files/testpolish_R2.fastq) |
+| `--pacbio_reads` | "test.1.filtered.bam" | [test.1.filtered.bam_.gz](https://data.nal.usda.gov/system/files/test.1.filtered.bam_.gz)|
+
+**Note:** The PacBio Reads (`test.1.filtered.bam_.gz`) must be decompressed before running the pipeline. 
+
+```
+gunzip -dc test.1.filtered.bam_.gz > test.1.filtered.bam
+```
 
 ### Recommended parameters
 
 ```
 nextflow run isugifNF/polishCLR -r main \
-  --primary_assembly "data/primary.fasta" \
-  --alternate_assembly "data/alternate.fasta" \
-  --mitocondrial_assembly "data/mitochondrial.fasta" \
-  --illumina_reads "data/illumina/*_{R1,R2}.fasta.bz" \
-  --pacbio_reads "data/pacbio/pacbio.subreads.bam" \
+  --primary_assembly "all_p_ctg.fasta" \
+  --alternate_assembly "all_h_ctg.fasta" \
+  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
+  --illumina_reads "*_{R1,R2}.fastq" \
+  --pacbio_reads "test.1.filtered.bam" \
   --step 1 \
-  --arrow01 \
+  --falcon_unzip true \
   -profile slurm \
   -resume
 ```
 
+**Note:** On some browsers, the dashes (-) and underscores (_) can be copied incorrectly.  So if you run into an error that says `not valid in the pipeline` try manually retyping those parameters.
+
 Step 2 runs another round of Arrow polishing with the PacBio reads, then polishes with short-reads with two rounds of FreeBayes. We broke these two steps into seperate phases to allow for manual scaffolding.
 
 Provide the purged primary `primary_purged.fa` and alternate contigs `haps_purged.fa` from purge_dups, and mitochondrial genome `mitochondrial.fasta` as input to step 2. 
@@ -55,8 +63,8 @@ Regardless don't forget to include parameter flags `--step 2` and `resume` to th
   --primary_assembly "primary_purged.fa" \
   --alternate_assembly "haps_purged.fa" \
   --mitochondrial_assembly "data/mitochondrial.fasta" \
-  --illumina_reads "data/illumina/*_{R1,R2}.fasta.bz" \
-  --pacbio_reads "../RawSequelData/m*.subreads.bam" \
+  --illumina_reads "*_{R1,R2}.fastq" \
+  --pacbio_reads "test.1.filtered.bam" \
   --step 2 \
   -profile slurm \
   -resume

diff --git a/docs/Case_03.md b/docs/Case_03.md
@@ -20,27 +20,35 @@ We have listed some example files to test the pipeline based on Chromosome 30 Hz
 Case 3 will take primary assembly from the `FALCON/4-polish` folder.
 
 | Param | Files | Download link|
-|:--|:--|:--
-| --primary_assembly | "cns_p_ctg.fasta" | [cns_p_ctg.fasta](https://data.nal.usda.gov/system/files/cns_p_ctg.fasta) |
-| --alternate_assembly | "cns_h_ctg.fasta" | [cns_h_ctg.fasta](https://data.nal.usda.gov/system/files/cns_h_ctg.fasta)|
-| --mitochondrial_assembly | "GCF_022581195.2_ilHelZeax1.1_mito.fa" | [GenBank download fasta](https://www.ncbi.nlm.nih.gov/nuccore/NC_061507.1?report=fasta)|
-| --illumina_reads |"testpolish_{R1,R2}.fq" | [testpolish_R1.fastq](https://data.nal.usda.gov/system/files/testpolish_R1.fastq), [testpolish_R2.fastq](https://data.nal.usda.gov/system/files/testpolish_R2.fastq) |
-| --pacbio_reads | "test.1.filtered.bam" | [test.1.filtered.bam_.gz](https://data.nal.usda.gov/system/files/test.1.filtered.bam_.gz)|
+|:--|:--|:--|
+| `--primary_assembly` | "cns_p_ctg.fasta" | [cns_p_ctg.fasta](https://data.nal.usda.gov/system/files/cns_p_ctg.fasta) |
+| `--alternate_assembly` | "cns_h_ctg.fasta" | [cns_h_ctg.fasta](https://data.nal.usda.gov/system/files/cns_h_ctg.fasta)|
+| `--mitochondrial_assembly` | "GCF_022581195.2_ilHelZeax1.1_mito.fa" | [GenBank download fasta](https://www.ncbi.nlm.nih.gov/nuccore/NC_061507.1?report=fasta)|
+| `--illumina_reads` |"testpolish_{R1,R2}.fastq" | [testpolish_R1.fastq](https://data.nal.usda.gov/system/files/testpolish_R1.fastq), [testpolish_R2.fastq](https://data.nal.usda.gov/system/files/testpolish_R2.fastq) |
+| `--pacbio_reads` | "test.1.filtered.bam" | [test.1.filtered.bam_.gz](https://data.nal.usda.gov/system/files/test.1.filtered.bam_.gz)|
+
+**Note:** The PacBio Reads (`test.1.filtered.bam_.gz`) must be decompressed before running the pipeline. 
+
+```
+gunzip -dc test.1.filtered.bam_.gz > test.1.filtered.bam
+```
 
 ### Recommended parameters
 
 ```
 nextflow run isugifNF/polishCLR -r main \
-  --primary_assembly "data/primary.fasta" \
-  --alternate_assembly "data/alternate.fasta" \
-  --mitocondrial_assembly "data/mitochondrial.fasta" \
-  --illumina_reads "data/illumina/*_{R1,R2}.fasta.bz" \
-  --pacbio_reads "data/pacbio/pacbio.subreads.bam" \
+  --primary_assembly "cns_p_ctg.fasta" \
+  --alternate_assembly "cns_h_ctg.fasta" \
+  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
+  --illumina_reads "*_{R1,R2}.fastq" \
+  --pacbio_reads "test.1.filtered.bam" \
   --step 1 \
   -profile slurm \
   -resume
 ```
 
+  **Note:** On some browsers, the dashes (-) and underscores (_) can be copied incorrectly.  So if you run into an error that says `not valid in the pipeline` try manually retyping those parameters.
+
 Step 2 runs another round of Arrow polishing with the PacBio reads, then polishes with short-reads with two rounds of FreeBayes. We broke these two steps into seperate phases to allow for manual scaffolding.
 
 Provide the purged primary `primary_purged.fa` and alternate contigs `haps_purged.fa` from purge_dups, and mitochondrial genome `mitochondrial.fasta` as input to step 2. 
@@ -54,9 +62,11 @@ Regardless don't forget to include parameter flags `--step 2` and `resume` to th
   --primary_assembly "primary_purged.fa" \
   --alternate_assembly "haps_purged.fa" \
   --mitochondrial_assembly "data/mitochondrial.fasta" \
-  --illumina_reads "data/illumina/*_{R1,R2}.fasta.bz" \
-  --pacbio_reads "../RawSequelData/m*.subreads.bam" \
+  --illumina_reads "*_{R1,R2}.fastq" \
+  --pacbio_reads "test.1.filtered.bam" \
   --step 2 \
   -profile slurm \
   -resume
   ```
+
+
diff --git a/docs/Install.md b/docs/Install.md
@@ -22,10 +22,10 @@ nextflow pull isugifNF/polishCLR -r main
 Install dependencies in a [miniconda](https://docs.conda.io/en/latest/miniconda.html) environment.
 
 ```
-wget https://raw.githubusercontent.com/isugifNF/polishCLR/main/environment.yml
+wget https://raw.githubusercontent.com/isugifNF/polishCLR/main/other_dependencies.yml
 
 [[ -d env ]] || mkdir env
-conda env create -f environment.yml -p ${PWD}/env/polishCLR_env
+conda env create -f other_dependencies.yml -p ${PWD}/env/polishCLR_env
 
 conda activate env/polishCLR_env
 

diff --git a/other_dependencies.yml b/other_dependencies.yml
@@ -16,11 +16,11 @@ dependencies:
   - samtools
   - bwa-mem2
   - pigz
-  - bbtools
+  #- bbtools
   - purge_dups
   - bamtools
   - merfin
   #- openjdk=11.0.15
-  #- nextflow
+  - nextflow
   - busco=5.4.2
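
The note this commit adds to the Case 1 to 3 docs warns that dashes and underscores can be altered when a command is copied from a browser. A minimal sketch of how a pasted command could be checked before running it follows. The function name `check_pasted_command` is hypothetical and not part of polishCLR; the approach (flagging any byte outside printable ASCII and whitespace) is an assumption about what a mangled dash looks like, not something the pipeline does itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of polishCLR): report whether a pasted
# command contains any byte outside printable ASCII and whitespace, which is
# how a typographic dash such as U+2013 appears after a bad copy.
check_pasted_command() {
  # LC_ALL=C makes grep compare byte by byte, so every byte of a multi-byte
  # UTF-8 character falls outside both [:print:] and [:space:].
  if printf '%s\n' "$1" | LC_ALL=C grep -q '[^[:print:][:space:]]'; then
    echo "non-ASCII character found; retype the dashes and underscores"
    return 1
  fi
  echo "ok"
}

# A correctly typed parameter passes the check.
check_pasted_command '--falcon_unzip false'

# Simulate a bad paste: the first dash is an en dash (UTF-8 bytes E2 80 93).
check_pasted_command "$(printf '\342\200\223-falcon_unzip false')" || true
```

This only catches characters that left the ASCII range; a parameter that is misspelled in plain ASCII still has to be caught by the pipeline's own error message.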