pmenzel/ont-assembly-snake-testdata

Test dataset for ont-assembly-snake and score-assemblies workflows

This repository contains sequencing data from ONT and Illumina for testing the ont-assembly-snake and score-assemblies Snakemake workflows, described in the preprint Snakemake Workflows for Long-read Bacterial Genome Assembly and Evaluation.

The datasets are subsampled read sets from the ONT and Illumina sequencing data of NCBI BioSample SAMN30015177 of Pandoraea commovens, from the publication Outbreak of Pandoraea commovens Infections among Non–Cystic Fibrosis Intensive Care Patients, Germany, 2019–2021.

Setup

Important

A working conda installation is required to run the workflows; see the installation instructions for Miniconda.

To download the dataset, clone this repository with:

git clone https://github.com/pmenzel/ont-assembly-snake-testdata.git
cd ont-assembly-snake-testdata

The repository contains two folders, fastq-ont and fastq-illumina, with the following files:

├── fastq-illumina
│   ├── example_R1.fastq.gz
│   └── example_R2.fastq.gz
└── fastq-ont
    └── example.fastq.gz
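
A quick integrity check after cloning can catch an incomplete download before any workflow is started. The following is a minimal sketch, assuming it is run with the repository root as its argument; `check_testdata` is a hypothetical helper name and not part of this repository.

```shell
# Hypothetical helper (not part of the repository): verify that the three
# expected read files exist, are non-empty, and are intact gzip archives.
check_testdata() {
    dir="${1:-.}"
    failed=0
    for f in fastq-illumina/example_R1.fastq.gz \
             fastq-illumina/example_R2.fastq.gz \
             fastq-ont/example.fastq.gz; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing: $f"
            failed=1
        elif ! gzip -t "$dir/$f" 2>/dev/null; then
            echo "corrupt: $f"
            failed=1
        else
            echo "ok: $f"
        fi
    done
    return "$failed"
}
```

Calling `check_testdata .` from inside ont-assembly-snake-testdata prints one line per file and returns a non-zero exit status if any file is missing or damaged.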

Next, clone the ont-assembly-snake and score-assemblies repositories and create the respective conda environments:

conda config --add channels bioconda

git clone https://github.com/pmenzel/ont-assembly-snake.git
conda env create -n ont-assembly-snake --file ont-assembly-snake/env/conda-main.yaml

git clone https://github.com/pmenzel/score-assemblies.git
conda env create -n score-assemblies --file score-assemblies/env/environment.yaml
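
To confirm that both environments were created, the output of `conda env list` can be inspected. The sketch below assumes the environment names used in the commands above; `missing_envs` is a hypothetical helper name.

```shell
# Hypothetical helper: print the name of each workflow environment that
# does not appear in the output of `conda env list`.
missing_envs() {
    # Skip comment lines and blank lines; the first column is the env name.
    envs=$(conda env list | awk '!/^#/ && NF { print $1 }')
    for name in ont-assembly-snake score-assemblies; do
        if ! printf '%s\n' "$envs" | grep -qxF "$name"; then
            echo "$name"
        fi
    done
}
```

Running `missing_envs` prints nothing when both environments are present.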

Run pipelines

The commands below first download a reference genome assembly for P. commovens and then run the ont-assembly-snake and score-assemblies workflows. The assemblies to generate are defined in the file samples.yaml.

# download assembly GCF_902459615.1 used as a reference for polishing and comparison
# using the NCBI datasets tool
mkdir -p references references-protein
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
chmod +x datasets
./datasets download genome accession GCF_902459615.1 --include genome,protein
unzip -p -j ncbi_dataset.zip 'ncbi_dataset/data/GCF_902459615.1/GCF_902459615.1_LMG_31010_genomic.fna' > references/GCF_902459615.1.fa
unzip -p -j ncbi_dataset.zip 'ncbi_dataset/data/GCF_902459615.1/protein.faa' > references-protein/GCF_902459615.1.faa
rm datasets ncbi_dataset.zip

# run ont-assembly-snake
conda activate ont-assembly-snake
snakemake -s ont-assembly-snake/Snakefile --use-conda --cores 10 --configfile samples.yaml --config genome_size=5.9 medaka_model=r941_min_sup_g507

# run score-assemblies
conda activate score-assemblies
snakemake -s score-assemblies/Snakefile --use-conda --cores 20
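
Before starting the full assembly run, a dry run can show which jobs Snakemake plans to execute without running any of them. This is a minimal sketch, assuming the ont-assembly-snake environment is active and the command is issued from the test data folder; `preview_assembly` is a hypothetical wrapper name, and `-n` is Snakemake's dry-run flag.

```shell
# Hypothetical wrapper: list the planned ont-assembly-snake jobs without
# executing them. Same configuration as the full run above, plus -n.
preview_assembly() {
    snakemake -s ont-assembly-snake/Snakefile -n --cores 1 \
        --configfile samples.yaml \
        --config genome_size=5.9 medaka_model=r941_min_sup_g507
}
```

The same flag can presumably be added to the score-assemblies command once the assemblies exist.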

The output files of score-assemblies are in the folder score-assemblies-data and a summary HTML report is in score-assemblies-report.html.
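
Whether the evaluation produced its results can be checked with a short test for the two outputs named above. This is a sketch only; `check_results` is a hypothetical helper name and takes the working directory as an optional argument.

```shell
# Hypothetical helper: confirm that the score-assemblies output folder and
# the HTML summary report exist in the given directory (default: current).
check_results() {
    dir="${1:-.}"
    if [ ! -d "$dir/score-assemblies-data" ]; then
        echo "missing folder: score-assemblies-data"
        return 1
    fi
    if [ ! -s "$dir/score-assemblies-report.html" ]; then
        echo "missing or empty: score-assemblies-report.html"
        return 1
    fi
    echo "results present"
}
```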
