This is a Snakemake pipeline that downloads reads from SRA, assembles them using Unicycler, and outputs various quality metric files and plots. The steps in the pipeline are:
- Download unassembled genomes from SRA using SRA toolkit
- Quality control with fastp
- PhiX spike-in removal with BBDuk
- Assemble genomes with Unicycler
- Get genome assembly metrics with CheckM2 and QUAST
- Edit the config file in the indicated places
- Install `snakemake`. A bare conda/mamba environment is recommended (i.e., created with `mamba create -c conda-forge -c bioconda -n snakemake snakemake`).
- Edit `config/config.yml`.
  - `sra_list` should be the path to a newline-separated file of SRA accessions.
  - Enter the path to the checkm2 database on your system. If you don't have it installed, you can download it directly from here (source) and enter the path into the config file.
  - By default, the pipeline will put everything into the `output` folder. Change the path if you'd like it to be put somewhere else.
- Edit `slurm/config.yaml`.
  - In particular, you'll need to edit the `default-resources` entry with the default partition you'd like to use to submit Slurm jobs to.
- Run the pipeline with `snakemake --use-conda -c`
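
Before launching a run, it can help to sanity-check the file that `sra_list` points to. The following is a minimal sketch and not part of the pipeline: the function name `read_accessions` is hypothetical, and it assumes the list contains run accessions (prefixes `SRR`, `ERR`, or `DRR` followed by digits), one per line.

```python
import re
from pathlib import Path

# Assumption: the list holds run accessions, which start with
# SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def read_accessions(path):
    """Return the accessions listed one per line in `path`.

    Blank lines are skipped and surrounding whitespace is ignored.
    A ValueError is raised for any line that does not look like a
    run accession.
    """
    accessions = []
    lines = Path(path).read_text().splitlines()
    for line_number, line in enumerate(lines, start=1):
        accession = line.strip()
        if not accession:
            continue  # tolerate empty lines, e.g. a trailing newline
        if not RUN_ACCESSION.fullmatch(accession):
            raise ValueError(
                f"line {line_number}: {accession!r} does not look like a run accession"
            )
        accessions.append(accession)
    return accessions
```

Calling `read_accessions` on the path you put in `config/config.yml` either returns the list of accessions or points at the first malformed line.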
- The fastq files dumped from SRA are paired-end (i.e., after dumping, they'll be named something like `SRRXXXXX_pass_1.fastq.gz` and `SRRXXXXX_pass_2.fastq.gz`)
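
The naming pattern above can be used to check that both mates of a pair were dumped. This is a rough sketch under that naming assumption only: the helper names `expected_fastq_pair` and `missing_mates` are hypothetical, and the directory holding the dumped files depends on your configured output path.

```python
from pathlib import Path


def expected_fastq_pair(accession, directory="."):
    """Return the two paths a paired-end dump is expected to produce.

    Assumes the `<accession>_pass_1.fastq.gz` / `<accession>_pass_2.fastq.gz`
    naming described above.
    """
    directory = Path(directory)
    return (
        directory / f"{accession}_pass_1.fastq.gz",
        directory / f"{accession}_pass_2.fastq.gz",
    )


def missing_mates(accession, directory="."):
    """Return the expected fastq paths that do not exist on disk."""
    return [
        path
        for path in expected_fastq_pair(accession, directory)
        if not path.exists()
    ]
```

An empty list from `missing_mates` means both files are present. A single entry suggests the accession may not be paired-end, or that the dump did not finish.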
- Reports
- SPAdes runtime reports
- Metric plots
- See here for parameterizing R scripts with Snakemake: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#r-and-r-markdown
- Clean up names in checkm2 report
- Modify running time for CheckM2, overall workflow based on number of inputs
- Add a rule to download the checkm2 database if it doesn't exist
- Allow user to supply paired-end reads