tprK pipeline

This pipeline was designed to take Illumina and PacBio files straight off the sequencer to a final comparison table of all the different variable regions with their relative frequencies, as well as various pretty plots along the way.

Warning

Fastqs much be gzipped for both Illumina and PacBio runs

Fastqs for Illumina should be paired and trimmed before being run through the pipeline

Installation

Install nextflow.
- Make sure you move nextflow to a directory in your PATH variable.
Install docker.
If running on the cloud setup setup nextflow tower

Warning

This pipeline was written with DSL1 and must be run with an older version of nextflow before your nexflow run command use:

NXF_VER=22.10.4

Metadata

A metadatafile is passed in to give sample names and locations.

Note

For Sample name column remove ".fastq.gz" from the end of the sample name and "Ill_" or "PB_" from the start of the filename.

For Illumina Runs add "Ill_" before the sample name for the fastq for the file and in the metadata

SampleName	Illumina	PacBio	BioSample	Replicate
Example_1	Ill_Example_1.fastq.gz
Example_2	Ill_Example_2.fastq.gz
Example_3	Ill_Example_3.fastq.gz
Example_4	Ill_Example_4.fastq.gz
Example_5	Ill_Example_5.fastq.gz

For Pacbio Runs add "PB_" before the sample name for the fastq for the file and in the metadata

Warning

For Pacbio runs a newline must be added to the end of the metadata file or you will receive an incomplete final line error for the createPacBioTree step.

SampleName	Illumina	PacBio	BioSample	Replicate
Example_1		PB_Example_1.fastq.gz
Example_2		PB_Example_2.fastq.gz
Example_3		PB_Example_3.fastq.gz
Example_4		PB_Example_4.fastq.gz
Example_5		PB_Example_5.fastq.gz

Important

If running the fastqs in a specific directory the directory must be specified in the --INPUT option and in the metadata file. It is recommended to run the pipeline in the directory with your fastqs and pass "./" for --INPUT.

SampleName	Illumina	PacBio	BioSample	Replicate
Example_1	Example_Directory/Ill_Example_1.fastq.gz
Example_2	Example_Directory/Ill_Example_2.fastq.gz
Example_3	Example_Directory/Ill_Example_3.fastq.gz
Example_4	Example_Directory/Ill_Example_4.fastq.gz
Example_5	Example_Directory/Ill_Example_5.fastq.gz

For running with technical replicates name any replicates with the same biosample name in the biosample column but differentiate with A B for what replicate the sample is

SampleName	Illumina	BioSample	Replicate
R501B1_R1_S5_merged	Ill_R501B1_R1_S5_merged.fastq.gz	KO_Time1_R501	A
R501B1_R2_S13_merged	Ill_R501B1_R2_S13_merged.fastq.gz	KO_Time1_R501	B
R501B2_R1_S21_merged	Ill_R501B2_R1_S21_merged.fastq.gz	KO_Time2_R501	A
R501B2_R2_S29_merged	Ill_R501B2_R2_S29_merged.fastq.gz	KO_Time2_R501	B

Options

Command	Description
`--INPUT`	Input folder where fastqs are located. For current directory, `./` can be used.
`--OUTDIR`	Output folder where .bams and consensus fastas will be piped into.
`--METADATA`	Path to metadata file with specific format.
`--PACBIO`	Specify that there are only PacBio files to be read.
`--ILLUMINA`	Specify that there are only Illumina files to be read.
`--LARGE`	Splitting of visualizations will be done, heatmap and variable site comparison will produce a separate file for each variable region rather than combining them. Recommended for all runs as output is cleaner.
`--REFERENCE`	Specify Illumina sample name (not file), to compare others to for dot-line plots. Can be used in tandem with --LARGE.
`--TECHNICAL_REPLICATE`	Generates additional statistics if using technical replicates.
`--RF_FILTER`	Specify relative frequency filter. Default is 0.00001.
`--COUNT_FILTER`	Specify count filter. Default is 0.
`--ILLUMINA_FILTER`	Specify whether PacBio reads should be filtered to only include files supported by Illumina reads that reach the cutoff.
`-resume`	nextflow will pick up where it left off if the previous command was interrupted for some reason.
`-with-docker ubuntu:18.04`	Runs command with Ubuntu docker.
`-with-trace`	Outputs a trace.txt that shows which processes end up in which work/ folders.
`-profile`	`standard`: For less computationally intensive systems run locally, not reccommended
	`standardMORE`: For running on the more computationally strong local systems, used for lab i9 imacs
	`laptop`: For running on very low power computational systems, very slow
	`Cloud`: For running on the cloud adds more computational power for memory intensive steps, recommended
`-c`	Add you nextflow config file to access cloud
`-with-tower`	Monitor your run with nextflow tower

This pipeline can be run with both Illumina and PacBio samples at the same time skip --ILLUMINA and --PACBIO option and fill out the metadata table with both PacBio and Illumina samples, each sample must have a PacBio and an Illumina sample

Commands

Illumina run in cloud:

NXF_VER=22.10.4 nextflow run greninger-lab/TPRK-Pipeline -r main \
    --METADATA Metadata_Example_Illumina.csv \
    --ILLUMINA \
    --INPUT ./ \
    --OUTDIR Example_Illumina/ \
    -with-docker ubuntu:18.04 \
    -profile Cloud \
    -c your_nextflow_aws.config \
    -with-tower

Pacbio run in cloud:

NXF_VER=22.10.4  nextflow run greninger-lab/TPRK-Pipeline -r main \
    --METADATA Metadata_Example_PACBIO.csv \
    --PACBIO \
    --INPUT Example_PacBio_fastq/ \
    --OUTDIR Example_PACBIO_Out/ \
    -with-docker ubuntu:18.04 \
    -profile Cloud \
    -c your_nextflow_aws.config \
    -with-tower

Illumina run local standard cpu usage:

NXF_VER=22.10.4 nextflow run greninger-lab/TPRK-Pipeline -r main \
    --METADATA Metadata_Example_Illumina.csv \
    --ILLUMINA \
    --INPUT ./ \
    --OUTDIR Example_Illumina/ \
    -with-docker ubuntu:18.04 \
    -profile standard

Pacbio run local higher cpu usage:

NXF_VER=22.10.4  nextflow run greninger-lab/TPRK-Pipeline -r main \
    --METADATA Metadata_Example_PACBIO.csv \
    --PACBIO \
    --INPUT Example_PacBio_fastq/ \
    --OUTDIR Example_PACBIO_Out/ \
    -with-docker ubuntu:18.04 \
    -profile standardMORE

WorkFlow

Default workflow for Illumina and Pacbio runs

flowchart TD;
    subgraph  
    visualizeAllData;
    end
    subgraph Illumina;
    createAllAssignments-->createFrequencyTables_Illumina;
    createFrequencyTables_Illumina-->summaryStats_Illumina;
    summaryStats_Illumina-->filterReads;
    classDef bar stroke:#0f0
    filterReads-->createFrequencyPlots_Illumina;
    createFrequencyPlots_Illumina-->createVariableRegionComparisons;
    createVariableRegionComparisons-->subsetReads;
    subsetReads-->visualizeAllData:::bar;
    end;
    subgraph PacBio;
    denoisePacBioFiles-.->createFrequencyTables_PacBio;
    createFrequencyTables_PacBio-.->summaryStats_PacBio;
    summaryStats_PacBio-.->createFrequencyPlots_PacBio;
    createFrequencyPlots_PacBio-.->createPacBioTree;
    createPacBioTree-.->visualizeAllData;
    end

Example

Install sratoolkit

On Mac you can use brew:

brew install sratoolkit

Illumina

Make an Example_Illumina_fastq folder

mkdir Example_Illumina_fastq

Enter Example folder

cd Example_Illumina_fastq

Download samples from SRA and place them in the Example_Illumina_fastq folder.

fasterq-dump --split-files SRR24317964
fasterq-dump --split-files SRR24317966
fasterq-dump --split-files SRR24317967	
fasterq-dump --split-files SRR24317968	
fasterq-dump --split-files SRR24317969

Cat R1 and R2 samples together

for i in *_1.fastq
do
base=$(basename $i _1.fastq)
cat ${base}_1.fastq ${base}_2.fastq > ${base}.fastq
done

Remove unpaired reads

rm *_1.fastq
rm *_2.fastq

gzip reads

gzip *.fastq

Add "Ill_: to the front of all read names

for file in *; do mv "$file" "Ill_$file"; done

Exit folder

cd ..

Run example workflow or use one from above

NXF_VER=22.10.4 nextflow run greninger-lab/TPRK-Pipeline -r main --METADATA Metadata_Example_Illumina.csv --ILLUMINA --INPUT Example_Illumina_fastq/ --OUTDIR Example_Illumina/ -with-docker ubuntu:18.04 -profile standard

Pacbio

Make an Example_PacBio_fastq folder

mkdir Example_PacBio_fastq

Enter Example folder

cd Example_PacBio_fastq

Download samples from SRA and place them in the Example_PacBio_fastq folder.

fasterq-dump SRR10294254		
fasterq-dump SRR10294238

gzip reads

gzip *.fastq

Add "PB_: to the front of all read names

for file in *; do mv "$file" "PB_$file"; done

Exit folder

cd ..

Run example workflow or use one from above

NXF_VER=22.10.4  nextflow run greninger-lab/TPRK-Pipeline -r main --METADATA Metadata_Example_PACBIO.csv --PACBIO --INPUT Example_PacBio_fastq/ --OUTDIR Example_PACBIO_Out/ -with-docker ubuntu:18.04 -profile standard

Output

Illumina Output

Example_Illumina_Out
├── Figures
│   ├── Relative_Frequency_Plots
│   │   ├── Ill_SRR24317964_RelativeFreqPlot.html                               # Bar chart comparing variability between variable sites for each sample 
│   │   ├── Ill_SRR24317964_filtered_RelativeFreqPlot.html                      # Bar chart comparing variability between variable sites for each sample with filtering
│   │   ├── Ill_SRR24317966_RelativeFreqPlot.html
│   │   ├── Ill_SRR24317966_filtered_RelativeFreqPlot.html
│   ├── Variable_Region_Comparisons
│   │   ├── SRR24317964_vs_SRR24317966_VariableRegions_DotLine_Filtered.pdf     # Plot comparing variability of reads between samples 
│   │   └── Variable_region_compare.RData                                       # R data to be loaded into R of variable region comparisons
│   ├── all_heatmap.html                                                        # Heatmap of all variable region variability for all samples, will be separated by --large option
│   └── all_variable_regions.html                                               # Barchart of all variable region variability for all samples, will be separated by --large option
├── Tables
│   ├── Ill_SRR24317964_final_data.csv                                          # separated table of counts as nucleotide for each sample
│   ├── Ill_SRR24317964_overcount_final_AA_data.csv                             # separated table of counts as amino acid for each sample
│   ├── Ill_SRR24317964_summary_statistics.csv                                  # Total read counts per variable region per sample
│   ├── Ill_SRR24317966_final_data.csv
│   ├── Ill_SRR24317966_overcount_final_AA_data.csv
│   ├── Ill_SRR24317966_summary_statistics.csv
│   ├── all_assignments.csv                                                     # dataframe of reads uncounted per variable site for all samples
│   ├── all_summary_stats.csv                                                   # counts for all samples for all variable regions
│   ├── allreads.csv
│   ├── allreads_abs.csv                                                        # all reads csv with only counts
│   ├── allreads_filt_abs.csv                                                   # all reads csv with only counts filtered
│   ├── allreads_filt_rel.csv                                                   # all reads csv with only frequency filtered
│   ├── allreads_rel.csv                                                        # all reads csv with only frequency
│   └── compare_illumina_df.csv
├── allreads.csv                                                                # Counts and Frequencies for each variable site 
├── allreads_filtered.csv                                                       # Filters dataframe if RF or Count filter option is applied
└── allreads_filtered_heatmap.csv                                               # CSV used to generate heatmap with less filtering (with default filtering values allreads csvs will be identical)

Pacbio Output

Example_PACBIO_OUT
├── Figures
│   ├── Relative_Frequency_Plots
│   │   ├── PB_SRR10294238_RelativeFreqPlot.html                                # Bar chart comparing variability between variable sites
│   │   ├── PB_SRR10294238_filtered_RelativeFreqPlot.html                       # Bar chart comparing variability between variable sites for each sample
│   │   ├── PB_SRR10294254_RelativeFreqPlot.html
│   │   └── PB_SRR10294254_filtered_RelativeFreqPlot.html
│   └── Tree
│       ├── Isolates_aa_filt_fullORFs.aln.aln.tree.nwk                          # PacBio newick tree file
│       ├── Isolates_aa_filt_fullORFs.fasta                                     # Fasta converted to AA filtered on RF cuttoff, default is identical to non filtered                                    
│       ├── Isolates_aa_fullORFs.fasta                                          # Fasta converted to AA filtered on RF cuttoff
│       ├── PacBio_Tree_Filtered.RData                                          # Pacbio tree R data
│       ├── PacBio_Tree_Filtered.pdf                                            # Pacbio tree
│       └── Table_allAAfilt_fullORFs.tsv                                        # Table of counts per sequence
├── Tables
│   └── Frequency_Tables
│       ├── PB_SRR10294238.noprimers.filtered.RAD.nolines.fix_final_data.csv                # Counts and relative frequency of sequences for sample as nucleic acids
│       ├── PB_SRR10294238.noprimers.filtered.RAD.nolines.fix_overcount_final_AA_data.csv   # Counts and relative frequency of sequences for sample as AA
│       ├── PB_SRR10294238.noprimers.filtered.RAD.nolines.fix_summary_statistics.csv        # Counts per variable region
│       ├── PB_SRR10294254.noprimers.filtered.RAD.nolines.fix_final_data.csv
│       ├── PB_SRR10294254.noprimers.filtered.RAD.nolines.fix_overcount_final_AA_data.csv
│       ├── PB_SRR10294254.noprimers.filtered.RAD.nolines.fix_summary_statistics.csv
│       ├── all_assignments.csv # All amino acids and nucleic acid sequences for samples
│       └── compare_pacbio_df.csv # all reads csv with counts and frequency
└── denoised_fastas
    ├── PB_SRR10294238.noprimers.filtered.RAD.nolines.fix.fasta                             # Denoised fasta for each sample
    └── PB_SRR10294254.noprimers.filtered.RAD.nolines.fix.fasta

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
Example_Figures		Example_Figures
bin		bin
.gitignore		.gitignore
LICENSE		LICENSE
Metadata_Example_Illumina.csv		Metadata_Example_Illumina.csv
Metadata_Example_PACBIO.csv		Metadata_Example_PACBIO.csv
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tprK pipeline

Table of Contents

Installation

Metadata

Options

Commands

WorkFlow

Example

Illumina

Pacbio

Output

About

Releases

Packages

Languages

License

greninger-lab/TPRK-Pipeline

Folders and files

Latest commit

History

Repository files navigation

tprK pipeline

Table of Contents

Installation

Metadata

Options

Commands

WorkFlow

Example

Illumina

Pacbio

Output

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages