# PhytoPipe quick test 

We can use the [VIROMOCK challenge](https://gitlab.com/ilvo/VIROMOCKchallenge) datasets from the Plant Health Bioinformatics Network (PHBN) to test PhytoPipe.
For a quick test we choose the small [Dataset_8](https://gitlab.com/ilvo/VIROMOCKchallenge/-/blob/master/Datasets/Dataset8.md). The data and results are in the PhytoPipe test/data folder. The full challenge datasets can be downloaded from [DRYAD](https://datadryad.org/stash/dataset/doi:10.5061/dryad.0zpc866z8).

This walkthrough assumes that the software installation and database setup are already finished. Now let's start the test.

### Step 1. set up config.yaml
Our data files are named Dataset_8_R1.fastq.gz and Dataset_8_R2.fastq.gz, so the configuration file contains

    seq_type: 'pe'
    strand1: 'R1'  
    strand2: 'R2'
    input_format: 'fastq.gz'
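
Before running anything, it can help to confirm that the read files match these settings. The sketch below is not part of PhytoPipe: it assumes the naming scheme `<sample>_<strand>.<input_format>` suggested by the file names above, and the helper names `expected_read_files` and `missing_read_files` are hypothetical.

```python
from pathlib import Path


def expected_read_files(sample, seq_type="pe", strand1="R1", strand2="R2",
                        input_format="fastq.gz"):
    """Build the read file names for one sample.

    ASSUMPTION: files are named <sample>_<strand>.<input_format>.
    """
    if seq_type == "pe":
        return [f"{sample}_{strand1}.{input_format}",
                f"{sample}_{strand2}.{input_format}"]
    # Single-end: an empty strand1 means the name has no strand suffix.
    suffix = f"_{strand1}" if strand1 else ""
    return [f"{sample}{suffix}.{input_format}"]


def missing_read_files(fastq_dir, sample, **settings):
    """Return the expected file names that are not present in fastq_dir."""
    folder = Path(fastq_dir)
    return [name for name in expected_read_files(sample, **settings)
            if not (folder / name).is_file()]


print(expected_read_files("Dataset_8"))
```

Calling `missing_read_files("/my/software/PhytoPipe/test/data", "Dataset_8")` should return an empty list when both read files are in place.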

We search for viruses only

    blastDbType: virus 

We use host-filtered reads for mapping

    mapReadType: 'clean' 

We choose the Spades assembler and the BWA mapping tool

    assembler: 'Spades'
    mappingTool: bwa

Here is my config.yaml

    cat /my/software/PhytoPipe/config.yaml

```
#For paired-end file
seq_type: 'pe'  #'se' for single-end, 'pe' for paired-end
strand1: 'R1'  # for paired-end file names with R1 and R2
strand2: 'R2' 

number_of_threads: 16
input_format: 'fastq.gz'

#software
trimmomatic: /my/software/Trimmomatic-0.39/trimmomatic-0.39.jar
adapters: /my/software/Trimmomatic-0.39/adapters/TruSeq3-SE.fa

#databases
control: /my/databases/ncbi/phi-X174.fasta
krakenDb: /my/databases/kraken_db
kaijuDb: /my/databases/kaiju_db/kaiju_db_nr_euk.fmi
blastnDb: /my/databases/ncbi_nt/nt
blastxDb: /my/databases/ncbi/nr.dmnd
blastnViralDb: /my/databases/ncbi/refseq_viral_genomic.fa
blastnViralTaxonDb: /my/databases/ncbi/taxonomy/refseq_viral.gb_taxon.txt
blastxViralDb: /my/databases/ncbi/rvdb.dmnd 
blastxViralTaxonDb: /my/databases/ncbi/taxonomy/rvdb.gb_taxon.txt 
euk_rRNA: /my/databases/rRNA/silva-euk_combined_rRNA.fasta 
taxDb: /my/databases/ncbi/taxonomy
acronymDb: /my/databases/ncbi/ICTV_virus_acronym2019.txt
microbialTaxon: /my/databases/ncbi/microbial.tids
monitorPathogen: /my/databases/ncbi/monitorPathogen.txt
filterKeys: /my/databases/ncbi/filterKeyWords.txt

blastDbType: virus  # 'virus' for viruses only, or 'all' for all pathogens; 'all' could take a long time to finish blastn

#reads used for mapping to a reference
#'trimmed' is for all reads including host reads
#'clean' is for all possible pathogen reads. host reads are cleaned
mapReadType: 'clean' #'trimmed'

#tools 
assembler: Spades # or 'Trinity'
mappingTool: bwa # or 'bowtie2'

#tool parameters
clumpify_param: "dedupe=t subs=0 passes=2 " #dupedist=40 optical=t # These parameters identify reads as duplicated only if they are an exact match (i.e., no substitution allowed).
trimmomatic_param: " LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36 "
spades_param: " --only-assembler --phred-offset 33 -k 21,51,71,91 "
bwa_param: " -B 4 "  #using default bwa setting #" -k 12 -A 1 -B 3 -O 1 -E 1 " #bwa loose mapping; 
bowtie2_param: " --very-sensitive-local -k 100 --score-min L,20,1.0 " #bowtie2 sensitive mapping; " --mp 20  --score-min L,-0.1,-0.1 " #bowtie2 very strict mapping

#global variables, please do not change
fastqDir: ""
flowCellDir: ""
workDir: ""
samples: ""
sequencing_key: ""

run_info:
  raw: raw                                #raw fastq file directory
  log: logs                               #log files directory
  qc: qc                                  #fastqc and multiqc directory
  trim: trimmed                           #trimmed directory
  clean: cleaned                          #clean reads after removal of host, control (e.g. phi-X174) and duplicates
  assemble: assembly                      #assemble directory
  classify: classification                #read classification directory
  annotate: annotation                    #annotation directory 
  map: mapping                            #mapping reads directory   
  report: report                          #report directory 
  novel: novelVirus                       #novelVirus directory 
  
```
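
Most failed runs come from a path in config.yaml that does not exist on the current machine. The following is a minimal sketch for checking this, using only the Python standard library. It reads un-indented `key: value` lines and is not a full YAML parser; the function names and the sample path under `/hypothetical` are assumptions made for illustration. A trailing wildcard is used so that database prefixes such as the `nt` BLAST database are accepted when any file with that prefix exists.

```python
import glob
import re
import shlex

KEY_VALUE = re.compile(r"^(\w+):(.*)$")


def read_top_level_settings(text):
    """Parse un-indented 'key: value' lines. NOT a full YAML parser."""
    settings = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            # shlex drops '#' comments and removes surrounding quotes.
            tokens = shlex.split(match.group(2), comments=True)
            settings[match.group(1)] = " ".join(tokens)
    return settings


def missing_paths(settings):
    """Return keys whose absolute-path value matches nothing on disk."""
    return sorted(key for key, value in settings.items()
                  if value.startswith("/")
                  and not glob.glob(glob.escape(value) + "*"))


# Small sample in the style of the config above; the path is hypothetical.
sample = """\
seq_type: 'pe'  #'se' for single-end, 'pe' for paired-end
blastDbType: virus  # or 'all'
bwa_param: " -B 4 "  #using default bwa setting
control: /hypothetical/missing/phi-X174.fasta
"""
settings = read_top_level_settings(sample)
print(settings["seq_type"], settings["blastDbType"], missing_paths(settings))
```

To check a real file, pass the text of config.yaml to `read_top_level_settings` and inspect the list returned by `missing_paths`.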

### Step 2. run PhytoPipe 

Suppose the two Dataset_8 read files are in the PhytoPipe test folder /my/software/PhytoPipe/test/data, the working folder is /my/phytopipe_test, PhytoPipe is installed in /my/software/PhytoPipe, and 32 CPU cores are used. Both workDir and fastqDir are passed as key=value pairs after `--config`. We first do a dry run (`-n`):

    conda activate phytopipe
    snakemake --configfile /my/software/PhytoPipe/config.yaml -s /my/software/PhytoPipe/Snakefile --config workDir=/my/phytopipe_test fastqDir=/my/software/PhytoPipe/test/data --cores 32 -n

If the dry run succeeds, we can run PhytoPipe in the background:

    nohup snakemake --configfile /my/software/PhytoPipe/config.yaml -s /my/software/PhytoPipe/Snakefile --config workDir=/my/phytopipe_test fastqDir=/my/software/PhytoPipe/test/data --cores 32 &

Take a break. It will take 1-2 hours to finish. 
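
The dry run and the real run differ only in the `-n` flag, so the command can be assembled once and reused. This is a small sketch with a hypothetical helper, `snakemake_command`; it passes workDir and fastqDir as key=value pairs after `--config`, which is the form Snakemake's `--config` option accepts. It only builds and prints the command and does not start the pipeline.

```python
import shlex


def snakemake_command(pipeline_dir, work_dir, fastq_dir, cores, dry_run=False):
    """Assemble the snakemake argument list for a PhytoPipe run."""
    command = [
        "snakemake",
        "--configfile", f"{pipeline_dir}/config.yaml",
        "-s", f"{pipeline_dir}/Snakefile",
        # Both settings are key=value pairs belonging to --config.
        "--config", f"workDir={work_dir}", f"fastqDir={fastq_dir}",
        "--cores", str(cores),
    ]
    if dry_run:
        command.append("-n")  # show what would run without executing it
    return command


dry = snakemake_command("/my/software/PhytoPipe", "/my/phytopipe_test",
                        "/my/software/PhytoPipe/test/data", 32, dry_run=True)
print(shlex.join(dry))
```

The returned list can be handed to `subprocess.run(dry, check=True)` from a session in which the phytopipe conda environment is already active.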

### Step 3. check results 

- The table report is in the file /my/phytopipe_test/report/report.txt.
- The comprehensive report is in the file /my/phytopipe_test/report/report.html.
- The read quality and read numbers at each QC step are in the file /my/phytopipe_test/report/qcReadNumber.txt.
- The contig blast results are in the folder /my/phytopipe_test/report/blastnx.
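
After the run, a quick check that the report items exist tells you whether the pipeline reached its final step. The sketch below assumes only the names mentioned in this section (report.txt, qcReadNumber.txt and the blastnx folder); `report_status` is a hypothetical helper, not a PhytoPipe function.

```python
from pathlib import Path

# Items named in this section, relative to <workDir>/report.
REPORT_ITEMS = ["report.txt", "qcReadNumber.txt", "blastnx"]


def report_status(work_dir):
    """Map each expected report item to True if it exists, else False."""
    report_dir = Path(work_dir) / "report"
    return {name: (report_dir / name).exists() for name in REPORT_ITEMS}


for name, found in report_status("/my/phytopipe_test").items():
    print(f"{name}: {'found' if found else 'missing'}")
```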

**Pelargonium flower break virus (PFBV) and Chenopodium quinoa mitovirus 1 (CqMV1) are detected.**

Here is the structure of the working directory (/my/phytopipe_test)

    cd /my/phytopipe_test
    tree -d

```
.
├── annotation
├── assembly
│   └── Dataset_8
├── classification
├── cleaned
├── logs
│   ├── annotate
│   ├── assemble
│   ├── checkPoint
│   ├── kaiju
│   ├── kraken
│   ├── mapping
│   ├── multiqc
│   ├── quast
│   ├── raw_fastqc
│   ├── removeControl
│   ├── removeDuplicate
│   ├── rRNA_qc
│   ├── trimmed_fastqc
│   └── trimmomatic
├── mapping
│   ├── map2Ref
│   │   └── Dataset_8
│   └── ref
│       └── Dataset_8
├── novelVirus
│   ├── finalAssembly
│   │   └── Dataset_8
│   ├── map2Contig
│   │   └── Dataset_8
│   └── pseudoContig
│       └── Dataset_8
├── qc
│   ├── multiqc
│   │   ├── quast_multiqc_data
│   │   ├── raw_multiqc_data
│   │   └── trimmed_multiqc_data
│   ├── quast
│   │   └── Dataset_8.quast
│   │       ├── basic_stats
│   │       └── icarus_viewers
│   ├── raw_fastqc
│   └── trimmed_fastqc
├── raw
├── ref
│   ├── genome
│   │   └── 1
│   └── index
│       └── 1
├── report
│   ├── blastnx
│   ├── html
│   ├── image
│   └── ncbiBlast
└── trimmed
```

Here is an explanation of the folders:
1. Read file folders:
    - "raw" folder contains all raw fastq files
    - "trimmed" folder contains fastq files after removing rRNAs, duplicates and low-quality reads
    - "cleaned" folder contains fastq files after removing host reads
2. Read classification folder:
    - "classification" folder contains kraken2 and kaiju read classification results: Dataset_8.kraken2.report.txt and Dataset_8.kaiju.table.txt
3. Read assembly folder:
    - "assembly" folder contains assembled contigs: Dataset_8/contigs.fasta
4. Contig annotation folder:
    - "annotation" folder contains contig annotations from blastx and blastn. Merged results are in Dataset_8.blast.nx.txt
5. Mapping reads to reference folder:
    - "mapping" folder contains references under mapping/ref/Dataset_8, and the mapped bam file and consensus sequence under mapping/map2Ref/Dataset_8
6. Report folder:
    - "report" folder contains main reports from different tools
7. Log folder:
    - "logs" folder contains log files from different tools
8. QC folder:
    - "qc" folder contains fastqc, multiqc and quast reports
9. Novel virus folder:
    - "novelVirus" folder contains mapping information for possible novel virus contigs
10. Ref folder:
    - "ref" folder is created by bbsplit.sh
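
The per-sample files named in the list above can be located programmatically. This is a sketch under the assumption that the files sit directly at the paths described (for example, that Dataset_8.blast.nx.txt is in the annotation folder itself); `sample_outputs` and `count_fasta_records` are hypothetical helpers. Counting FASTA header lines gives the number of assembled contigs.

```python
from pathlib import Path


def sample_outputs(work_dir, sample):
    """Return the assumed locations of the main per-sample result files."""
    root = Path(work_dir)
    return {
        "kraken2 report": root / "classification" / f"{sample}.kraken2.report.txt",
        "kaiju table": root / "classification" / f"{sample}.kaiju.table.txt",
        "contigs": root / "assembly" / sample / "contigs.fasta",
        "merged blast": root / "annotation" / f"{sample}.blast.nx.txt",
    }


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


outputs = sample_outputs("/my/phytopipe_test", "Dataset_8")
for label, path in outputs.items():
    print(label, path, "found" if path.is_file() else "missing")
```

Once the run has finished, `count_fasta_records(outputs["contigs"])` reports how many contigs the assembler produced.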
    

## Configuration for other VIROMOCK challenge datasets

Dataset_5, 6 and 10 are single-read files. Comment out the paired-end settings and use the following lines in config.yaml

    seq_type: 'se'  #'se' for single-end, 'pe' for paired-end
    strand1: ''  #'' for a single-end file name without R1, 'R1' for a single-end file name ending with R1 or R1_001
    strand2: 'R2'  # keep it even if there is no paired R2 file

To match the "Observed closest NCBI accessions" of the datasets, it is better to use the plantvirus database in config.yaml

    blastnViralDb: /path/to/database/plantvirus.fa
    blastnViralTaxonDb: /path/to/database/taxonomy/plantvirus.gb_taxon.txt

You can put all paired-end read files in one folder and all single-read files in another folder, then run PhytoPipe on each folder with the matching configuration to cover all datasets.
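
Sorting the downloaded files into the two folders can be scripted. The sketch below groups file names by whether both an `_R1` and an `_R2` mate are present; it assumes names of the form `<sample>_R1.fastq.gz` or `<sample>.fastq.gz`, and the single-read file name in the example is hypothetical. It only classifies names and does not move any files.

```python
import re
from collections import defaultdict

# ASSUMED naming: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz or <sample>.fastq.gz
READ_FILE = re.compile(r"^(?P<sample>.+?)(?:_(?P<strand>R[12]))?\.fastq\.gz$")


def split_by_layout(file_names):
    """Split read file names into (paired_end, single_read) lists."""
    strands = defaultdict(set)
    names = defaultdict(list)
    for name in file_names:
        match = READ_FILE.match(name)
        if match is None:
            continue  # not a fastq.gz read file
        strands[match["sample"]].add(match["strand"])
        names[match["sample"]].append(name)
    paired, single = [], []
    for sample in sorted(names):
        # A sample is paired-end only if both mates are present.
        target = paired if {"R1", "R2"} <= strands[sample] else single
        target.extend(sorted(names[sample]))
    return paired, single


paired, single = split_by_layout(
    ["Dataset_8_R2.fastq.gz", "Dataset_5.fastq.gz", "Dataset_8_R1.fastq.gz"])
print("paired-end:", paired)
print("single-read:", single)
```

Each list can then be moved into its own folder, for example with `shutil.move`, and that folder passed to PhytoPipe as `fastqDir`.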