The Small Long-read Assembler for Educational Purposes (SLAEP) is an assembler that can assemble small data sets into contigs and visualize the data in a long-read data set, such as one produced by PacBio or Oxford Nanopore MinION sequencing. It is written in Python 3.8 and documented, which makes it well suited to teaching students about genome assembly and the inner workings of an assembler.
- Assembly: Contains the scripts for genome assembly. The main.py script calls upon these for the assembly process.
- Visualization: A script for visualizing the FastQ data set, aiding in the understanding of the data.
- de_bruijn_graph.py: An attempt at implementing a De Bruijn graph for genome assembly.
- OLC.py: An attempt at the Overlap-Layout-Consensus (OLC) approach.
- directed_graph.py: An attempt at building and analyzing directed graphs for assembly.
- repeat_graph.py: An attempt to handle repeat graphs during assembly.
- Not functional examples: A directory with other attempted code that is not functional at the moment.
- artificial_data_generator.py: A script for generating test data for validation.
- foo-reads.fq: A sample FASTQ file containing read data.
- foo.paf: A sample PAF file containing read alignments.
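For readers new to the De Bruijn graph idea mentioned above, the following is a minimal sketch of how such a graph can be built from reads. It is not the code in de_bruijn_graph.py; the function name build_de_bruijn and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Hypothetical helper for illustration, not part of SLAEP.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two toy reads that overlap on "GTAC".
graph = build_de_bruijn(["ACGTAC", "GTACG"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Repeated edges are kept on purpose: an edge seen in several reads shows up several times, which gives a rough notion of coverage.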
Genome assembly is a fundamental process in bioinformatics: reconstructing complete genomes from fragmented DNA sequences. The task becomes considerably more complex with long-read data generated by technologies such as PacBio and Oxford Nanopore MinION. The SLAEP assembler aims to simplify this process and make it accessible to students and learners interested in understanding the algorithms behind genome assembly.
- Educational Purpose: SLAEP is primarily designed for educational use, allowing students and newcomers to gain hands-on experience with genome assembly concepts.
- Small Data Set Assembly: SLAEP is not optimized for large-scale assembly projects, but it handles small data sets well, which makes it suitable for educational exercises.
- Visualization Support: The included visualization script enables users to gain insights into the FastQ data set, aiding in understanding the assembly process and its results.
- Python 3.8 Implementation: SLAEP is implemented entirely in Python 3.8, making it easy to understand and modify for educational purposes.
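As a rough idea of what inspecting a FastQ data set involves, here is a minimal sketch that reads records and summarizes read lengths using only the standard library. It is not SLAEP's visualization script; read_fastq and length_summary are hypothetical names, and the sketch assumes four-line records without wrapped sequences and at least one read.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            # End of file (or a blank line) ends the iteration.
            break
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def length_summary(records):
    """Summarize read lengths; assumes at least one record."""
    lengths = [len(sequence) for _, sequence, _ in records]
    return {
        "reads": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# In-memory stand-in for a FASTQ file with two reads.
sample = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n")
summary = length_summary(read_fastq(sample))
print(summary)
```

The same list of lengths could be handed to a plotting library to draw a read-length histogram.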
To set up the SLAEP assembler on your system, follow these steps:
- Clone the SLAEP repository:
git clone https://github.com/jlalisan/slaep.git
cd slaep
- Install the requirements:
pip install -r requirements.txt
- Run the assembler:
python main.py file.fastq -p file.paf -o output.fasta -v
Replace file.fastq with the path to your FASTQ file and file.paf with the path to your PAF file.
To assemble your genome data using SLAEP, use the following command:
python main.py file.fastq -p file.paf -o output.fasta -v
The FastQ file is required. The other files are optional as long as Minimap2 is installed on the machine the script runs on. Visualization is optional as well.
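The rules above can be sketched with argparse. This is an assumption about how such a command line could be wired up, not the parser in main.py: only the flags -p, -o and -v come from the commands shown here, while the function names and help texts are made up for illustration.

```python
import argparse
import shutil


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("fastq", help="input FASTQ file (required)")
    parser.add_argument("-p", dest="paf", help="premade PAF file (optional)")
    parser.add_argument("-o", dest="output", help="output FASTA file (optional)")
    parser.add_argument("-v", dest="visualize", action="store_true",
                        help="enable visualization (optional)")
    return parser


def needs_minimap2(args):
    """Without a premade PAF file, one has to be generated with Minimap2."""
    return args.paf is None


def check_inputs(args, which=shutil.which):
    """Exit early when a PAF file must be generated but minimap2 is missing."""
    if needs_minimap2(args) and which("minimap2") is None:
        raise SystemExit("No PAF file given and minimap2 was not found on PATH.")


args = build_parser().parse_args(["file.fastq", "-v"])
print(args.fastq, args.paf, args.visualize, needs_minimap2(args))
```

Calling check_inputs(args) before any work starts gives the user a clear message instead of a failure halfway through the run.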
Example 1: Usage with only a FastQ file. Minimap2 is required for this, because it creates the PAF file from the FastQ file.
python main.py file.fastq
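To illustrate how a PAF file can be produced from a FastQ file, the sketch below builds a Minimap2 all-vs-all overlap command. Which preset SLAEP actually passes to Minimap2 is not stated here, so the ava-ont default is an assumption (ava-pb is the PacBio counterpart), and the function names are hypothetical.

```python
import shlex
import subprocess


def minimap2_overlap_command(fastq, preset="ava-ont"):
    """Return the argument list for an all-vs-all read overlap run."""
    fastq = str(fastq)
    # The reads are passed twice: once as target, once as query.
    return ["minimap2", "-x", preset, fastq, fastq]


def run_minimap2(fastq, paf_path, preset="ava-ont"):
    """Run minimap2 and write its standard output (PAF) to paf_path."""
    with open(paf_path, "w") as handle:
        subprocess.run(minimap2_overlap_command(fastq, preset),
                       stdout=handle, check=True)


command = minimap2_overlap_command("file.fastq")
print(shlex.join(command))
```

run_minimap2("file.fastq", "file.paf") would then write the overlaps to file.paf, provided minimap2 is on the PATH.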
Example 2: Usage with a FastQ file and a premade PAF file. Visualization is not run here. The output file name is derived from the input file name.
python main.py file.fastq -p file.paf
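For orientation on what a premade PAF file contains, here is a minimal sketch that parses the twelve mandatory tab-separated PAF columns. It is not SLAEP's own reader; PafRecord, parse_paf_line and the sample line are assumptions for illustration.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapping_quality: int


def parse_paf_line(line):
    """Parse one PAF line, ignoring any optional tags after column 12."""
    fields = line.rstrip("\n").split("\t")[:12]
    # Columns 0, 4 and 5 (names and strand) are text; the rest are integers.
    converted = [
        value if index in (0, 4, 5) else int(value)
        for index, value in enumerate(fields)
    ]
    return PafRecord(*converted)


# Made-up overlap: the end of read1 matches the start of read2.
line = "read1\t1000\t600\t1000\t+\tread2\t1200\t0\t400\t380\t400\t60\ttp:A:S"
record = parse_paf_line(line)
print(record.query_name, "overlaps", record.target_name,
      "over", record.block_length, "bases")
```

Start and end coordinates in PAF are zero-based, with the end position exclusive, so query_end - query_start gives the aligned length on the query.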
Example 3: Usage with a FastQ file and visualization. Minimap2 is required.
python main.py file.fastq -v
If you have any questions, feedback, or suggestions regarding the SLAEP assembler, feel free to contact me: