GitHub - pmenzel/ont-assembly-snake-testdata

Test dataset for ont-assembly-snake and score-assemblies workflows

This repository contains sequencing data from ONT and Illumina for testing the ont-assembly-snake and score-assemblies Snakemake workflows, described in the preprint Snakemake Workflows for Long-read Bacterial Genome Assembly and Evaluation.

The datasets are subsampled read sets from the ONT and Illumina sequencing data of NCBI BioSample SAMN30015177 of Pandoraea commovens, from the publication Outbreak of Pandoraea commovens Infections among Non–Cystic Fibrosis Intensive Care Patients, Germany, 2019–2021.

Setup

Important

A working conda installation is required to run the workflows; see the installation instructions for Miniconda.

To download the dataset, clone this repository with:

git clone https://github.com/pmenzel/ont-assembly-snake-testdata.git
cd ont-assembly-snake-testdata

The repository contains two folders, fastq-ont and fastq-illumina, with the following files:

├── fastq-illumina
│   ├── example_R1.fastq.gz
│   └── example_R2.fastq.gz
└── fastq-ont
    └── example.fastq.gz
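
As a quick sanity check after cloning, the number of reads in each file can be derived from its line count. This is a minimal sketch that assumes standard four-line FASTQ records (no line-wrapped sequences); the helper name count_reads is hypothetical and not part of either workflow.

```shell
# count_reads is a hypothetical helper, not part of the workflows.
# Assumes four lines per FASTQ record, so reads = lines / 4.
count_reads() {
  echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# Report the read count for each file listed in the tree above,
# or "missing" if the file is not present in the current directory.
for f in fastq-ont/example.fastq.gz \
         fastq-illumina/example_R1.fastq.gz \
         fastq-illumina/example_R2.fastq.gz; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_reads "$f")"
  else
    printf '%s\tmissing\n' "$f"
  fi
done
```

For paired-end Illumina data, the counts for example_R1.fastq.gz and example_R2.fastq.gz would be expected to match.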

Next, clone the ont-assembly-snake and score-assemblies repositories and create the respective conda environments:

conda config --add channels bioconda

git clone https://github.com/pmenzel/ont-assembly-snake.git
conda env create -n ont-assembly-snake --file ont-assembly-snake/env/conda-main.yaml

git clone https://github.com/pmenzel/score-assemblies.git
conda env create -n score-assemblies --file score-assemblies/env/environment.yaml
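
Before running the workflows, it may help to confirm that the clones contain the files that the commands in this README refer to. The following is a minimal sketch; check_files is a hypothetical helper, and the list contains only paths mentioned in this README, assuming both repositories were cloned inside the ont-assembly-snake-testdata folder.

```shell
# check_files is a hypothetical helper: it prints each missing path
# and returns non-zero if at least one path does not exist.
check_files() {
  rc=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p"
      rc=1
    fi
  done
  return "$rc"
}

# Paths taken from the commands in this README.
check_files \
  ont-assembly-snake/Snakefile \
  ont-assembly-snake/env/conda-main.yaml \
  score-assemblies/Snakefile \
  score-assemblies/env/environment.yaml \
  samples.yaml \
  && echo "all workflow files found"
```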

Run pipelines

The commands below first download a reference genome assembly for P. commovens and then run the ont-assembly-snake and score-assemblies workflows. The assemblies to be generated are defined in the file samples.yaml.

# download assembly GCF_902459615.1 used as a reference for polishing and comparison
# using the NCBI datasets tool
mkdir -p references references-protein
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
chmod +x datasets
./datasets download genome accession GCF_902459615.1 --include genome,protein
unzip -p -j ncbi_dataset.zip 'ncbi_dataset/data/GCF_902459615.1/GCF_902459615.1_LMG_31010_genomic.fna' > references/GCF_902459615.1.fa
unzip -p -j ncbi_dataset.zip 'ncbi_dataset/data/GCF_902459615.1/protein.faa' > references-protein/GCF_902459615.1.faa
rm datasets ncbi_dataset.zip

# run ont-assembly-snake
conda activate ont-assembly-snake
snakemake -s ont-assembly-snake/Snakefile --use-conda --cores 10 --configfile samples.yaml --config genome_size=5.9 medaka_model=r941_min_sup_g507

# run score-assemblies
conda activate score-assemblies
snakemake -s score-assemblies/Snakefile --use-conda --cores 20

The output files of score-assemblies are in the folder score-assemblies-data and a summary HTML report is in score-assemblies-report.html.
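
To check that the reference download in the first step produced a usable FASTA file, the number of sequences and the total number of bases can be counted with awk. This is a sketch only: fasta_stats is a hypothetical helper, and comparing the total against genome_size=5.9 assumes that value is given in megabases, which this README does not state.

```shell
# fasta_stats is a hypothetical helper: prints "<sequences><TAB><total bases>".
fasta_stats() {
  awk '/^>/ { n++; next }
       { gsub(/[ \t\r]/, ""); bases += length($0) }
       END { printf "%d\t%d\n", n, bases }' "$1"
}

ref=references/GCF_902459615.1.fa
if [ -s "$ref" ]; then
  fasta_stats "$ref"
else
  echo "reference not found or empty: $ref"
fi

# The summary report named above should exist and be non-empty after the second workflow.
if [ -s score-assemblies-report.html ]; then
  echo "report found: score-assemblies-report.html"
else
  echo "report not found: score-assemblies-report.html"
fi
```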
