An Inside Look into a Genome Assembler

The Small Long-read Assembler for Educational Purposes (SLAEP) assembles small data sets into contigs and can visualize the reads in an Oxford Nanopore MinION data set. It is written in Python 3.8 and documented, which makes it well suited to teaching students about genome assembly and the inner workings of an assembler.

Repository Overview

Scripts

Assembly: Contains the scripts for genome assembly. The main.py script calls upon these for the assembly process.
Visualization: A script for visualizing the FASTQ data set to help users understand the data.

Separate Attempts

de_bruijn_graph.py: An attempt at implementing a De Bruijn graph for genome assembly.
OLC.py: An attempt at the Overlap-Layout-Consensus (OLC) approach.
directed_graph.py: An attempt at building and analyzing directed graphs for assembly.
repeat_graph.py: An attempt to handle repeat graphs during assembly.
Not functional examples: A directory with other attempted code that is not functional at the moment.
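
For readers new to the idea behind de_bruijn_graph.py, the following is a minimal sketch of how a De Bruijn graph can be built from reads. It is not the repository's code: the function name build_de_bruijn and the example reads are assumptions made for illustration. Each k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge: prefix node -> suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads, k = 3.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Edges that occur in more than one read appear more than once in the successor list, which is one simple way to keep track of how often a k-mer was seen.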

Test_data

artificial_data_generator.py: A script for generating test data for validation.
foo-reads.fq: A sample FASTQ file containing read data.
foo.paf: A sample PAF file containing read alignments.
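
To make the two file formats concrete, here is a small sketch that reads FASTQ records (four lines per read) and the twelve mandatory tab-separated columns of a PAF line. The names read_fastq, parse_paf_line and PafRecord, as well as the example data, are assumptions for illustration and do not come from the SLAEP scripts.

```python
import io
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapping_quality: int


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line) ends the sketch's loop
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().strip()
        # Drop the leading '@' and any description after the read name.
        yield header[1:].split()[0], sequence, quality


def parse_paf_line(line):
    """Convert one PAF line into a PafRecord; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                     int(f[10]), int(f[11]))


# Made-up example data, not taken from foo-reads.fq or foo.paf.
reads = list(read_fastq(io.StringIO("@read1 example\nACGT\n+\nIIII\n")))
overlap = parse_paf_line("read1\t4\t0\t4\t+\tread2\t6\t2\t6\t4\t4\t60")
print(reads)
print(overlap.query_name, "overlaps", overlap.target_name)
```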

Introduction

Genome assembly is a fundamental process in bioinformatics: the reconstruction of complete genomes from fragmented DNA sequences. The complexity of this task increases significantly when dealing with long-read data generated by technologies such as the Oxford Nanopore MinION. The SLAEP assembler aims to simplify this process and make it accessible to students and learners interested in understanding the algorithms behind genome assembly.

Key Features

Educational Purpose: SLAEP is primarily designed for educational use, allowing students and newcomers to gain hands-on experience with genome assembly concepts.
Small Data Set Assembly: While SLAEP may not be optimized for large-scale assembly projects, it handles small data sets well, which makes it a suitable tool for educational exercises.
Visualization Support: The included visualization script gives users insight into the FASTQ data set, which helps in understanding the assembly process and its results.
Python 3.8 Implementation: SLAEP is implemented entirely in Python 3.8, making it easy to understand and modify for educational purposes.

Installation

To set up the SLAEP assembler on your system, follow these steps:

Clone the SLAEP repository:

git clone https://github.com/jlalisan/slaep.git
cd slaep

Install the requirements:

pip install -r requirements.txt

Run the assembler:

python main.py file.fastq -p file.paf -o output.fasta -v

Replace file.fastq with the path to your FASTQ file and file.paf with the desired PAF file.

Usage

To assemble your genome data using SLAEP, use the following command:

python main.py file.fastq -p file.paf -o output.fasta -v

The FASTQ file is required. The other arguments are optional, provided Minimap2 is installed on the machine the script runs on: without a PAF file, Minimap2 is used to create one from the FASTQ file. The visualization option (-v) is optional as well.
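
The command line above can be mirrored with a short argparse sketch. Only the positional FASTQ argument and the -p, -o and -v flags come from this README; the function name build_parser, the destination names and the help texts are assumptions, and the actual main.py may be organised differently.

```python
import argparse


def build_parser():
    """Parser mirroring the documented call: main.py file.fastq -p ... -o ... -v."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Assemble a small long-read data set into contigs.",
    )
    parser.add_argument("fastq", help="input FASTQ file (required)")
    parser.add_argument("-p", dest="paf", default=None,
                        help="premade PAF file with read overlaps (optional)")
    parser.add_argument("-o", dest="output", default=None,
                        help="output FASTA file (optional)")
    parser.add_argument("-v", dest="visualize", action="store_true",
                        help="also visualize the FASTQ data (optional)")
    return parser


# Parse an example argument list instead of sys.argv.
args = build_parser().parse_args(["file.fastq", "-p", "file.paf", "-v"])
print(args)
```

Leaving the optional values at None lets the calling code decide on fallbacks, such as creating the PAF file with Minimap2 or deriving the output name from the input.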

Example usage

Example 1: Usage with only a FASTQ file. This requires Minimap2, which creates a PAF file from the FASTQ file.

python main.py file.fastq
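
When no PAF file is supplied, one way such a file could be produced is an all-versus-all Minimap2 run on the reads. The sketch below is an assumption about how that step might look, not the code SLAEP runs: the helper names and the choice of the ava-ont preset (Minimap2's overlap preset for Oxford Nanopore reads; ava-pb is the PacBio counterpart) are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def minimap2_command(fastq, preset="ava-ont"):
    """Build an all-vs-all overlap command: the reads are both target and query."""
    return ["minimap2", "-x", preset, str(fastq), str(fastq)]


def create_paf(fastq, paf=None):
    """Run Minimap2 and write its PAF output next to the FASTQ file by default."""
    fastq = Path(fastq)
    paf = Path(paf) if paf else fastq.with_suffix(".paf")
    if shutil.which("minimap2") is None:
        raise RuntimeError("minimap2 was not found on PATH")
    # Minimap2 writes PAF to standard output, so redirect it into the file.
    with open(paf, "w") as out:
        subprocess.run(minimap2_command(fastq), stdout=out, check=True)
    return paf


# Only the command is built here; create_paf() needs Minimap2 installed.
command = minimap2_command("file.fastq")
print(" ".join(command))
```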

Example 2: Usage with a FASTQ file and a premade PAF file. No visualization is produced. The output file name is derived from the input file name.

python main.py file.fastq -p file.paf

Example 3: Usage with a FASTQ file and visualization. Minimap2 is required.

python main.py file.fastq -v

Contact

If you have any questions, feedback, or suggestions regarding the SLAEP assembler, feel free to contact me:

j.l.a.eisinga@st.hanze.nl
