## EXISTING FILE FORMATS FOR DNA
- Source: http://genome.ucsc.edu/FAQ/FAQformat.htm
- MAF
    - The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read. This format stores multiple alignments at the DNA level between entire genomes.
- 2bit
    - A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.
- nib
    - A .nib file describes a single DNA sequence by packing two bases into each byte.
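- FASTA
    - nucleotide (base pair) or amino acid sequences, each preceded by a `>` header line
- FASTQ
    - nucleotide (base pair) sequences stored together with a per-base quality score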
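
The two-bases-per-byte packing described for nib above can be sketched as follows. This is a simplified illustration, not a .nib reader or writer: it follows the base codes given in the UCSC description (T=0, C=1, A=2, G=3, N=4, first base in the high-order nibble), but it omits the file header and the masking bit, and the function names are hypothetical.

```python
# Simplified sketch of nib-style packing: two bases per byte, 4 bits each.
# Base codes follow the UCSC description; the real .nib header and the
# masking bit are deliberately left out. Function names are illustrative.
NIB_CODES = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
NIB_BASES = {code: base for base, code in NIB_CODES.items()}


def pack_nibbles(sequence: str) -> bytes:
    """Pack a DNA string into bytes, first base in the high nibble."""
    codes = [NIB_CODES[base] for base in sequence.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the last low nibble; the true length is kept separately
    return bytes((high << 4) | low for high, low in zip(codes[::2], codes[1::2]))


def unpack_nibbles(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from nibble-packed bytes."""
    bases = []
    for byte in packed:
        bases.append(NIB_BASES[byte >> 4])    # high nibble = earlier base
        bases.append(NIB_BASES[byte & 0x0F])  # low nibble = later base
    return "".join(bases[:length])


packed = pack_nibbles("ACGTN")
restored = unpack_nibbles(packed, 5)
```

Because the padding nibble is indistinguishable from a real `T`, the sequence length has to be stored alongside the packed bytes, which is why `unpack_nibbles` takes it as an argument.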

## EXISTING PYTHON LIBRARIES
- Source: https://wiki.python.org/moin/PythonMed
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04455-3
- Biopython (https://biopython.org/)
    - can parse bioinformatics files into Python data structures, with support for the following formats:
        - BLAST output – both from standalone and WWW BLAST
        - Clustalw
        - FASTA
        - GenBank
        - PubMed and Medline
        - ExPASy files like Enzyme and Prosite
        - SCOP, including ‘dom’ and ‘lin’ files
        - UniGene
        - SwissProt
- pysam (http://pysam.readthedocs.org/)
    - Python wrapper package around Samtools, a suite of programs for reading and manipulating high-throughput sequencing data.
- kPAL (http://kpal.readthedocs.org/)
    - k-mer profile analysis library
- DendroPy
    - package for phylogenetic computing. It supports a wide range of phylogenetic tree formats and can be used both as a phylogenetic library and for scripting.
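
To make the term "k-mer profile" from the kPAL entry concrete: a profile is a count of every overlapping length-k substring in a sequence. The sketch below uses only the standard library and is not kPAL's API; `kmer_profile` is a hypothetical helper name.

```python
from collections import Counter


def kmer_profile(sequence: str, k: int) -> Counter:
    """Count every overlapping length-k substring of the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


profile = kmer_profile("ACGTACGT", 3)
```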



NOTE: many other fast* file formats apparently exist: https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

# DNA File Compression Analysis
Ian Switzer, Joshua Devine, Ronith Ranjan

May 2, 2022

## Overview

This project was undertaken as part of CS 4501: Computational Biology, taught at the University of Virginia by Professor Dave Evans. Our goal was to examine existing DNA/RNA file formats in the field of computational biology and find more efficient ways to store genetic information. Next-generation sequencing (NGS) has allowed massive amounts of genomic data to be produced, which has shifted the bottleneck from sequencing to computation: specifically, the long-term storage, management, and analysis of large amounts of data. Compression tools can reduce the amount of storage required.

Our project subgoals were as follows: 
1. Develop our own compressed DNA file format as a learning tool
2. Compare our new file format with existing file formats
3. Identify the file format with the best performance metrics
4. Build a standalone Python library for that file format
5. Add support for that file format in [Biopython](https://biopython.org)

## File Format Development

## Comparison Testing

We identified four state-of-the-art FASTQ compression schemes: 
- [Leon](), DESCRIPTIONS HERE
- [Lfqc]()
- [Scalce]() 
- [Slimfastq]()

These schemes were tested for their compression ratios across a range of multi-read FASTQ files. The files ranged in size from 1,062 bytes to 5,875,259 bytes and varied in read complexity. The tests also included our newly created file formats and the standard [gzip]() compression format.
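
As a rough illustration of how a compression ratio can be measured for the gzip baseline, here is a minimal sketch using Python's standard `gzip` module. It is not the project's benchmarking code: the synthetic FASTQ data and the helper names are hypothetical, and the ratio is assumed to be defined as original size divided by compressed size.

```python
import gzip


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Original size divided by compressed size (assumed definition)."""
    if compressed_size <= 0:
        raise ValueError("compressed size must be positive")
    return original_size / compressed_size


def make_sample_fastq(num_reads: int) -> bytes:
    """Build a small synthetic FASTQ payload (illustrative data only)."""
    records = []
    for i in range(num_reads):
        # Four lines per record: header, bases, separator, quality scores.
        records.append(f"@read{i}\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n")
    return "".join(records).encode("ascii")


sample = make_sample_fastq(200)
compressed = gzip.compress(sample)
ratio = compression_ratio(len(sample), len(compressed))
print(f"{len(sample)} bytes -> {len(compressed)} bytes (ratio {ratio:.2f})")
```

Synthetic reads this repetitive compress far better than real sequencing data would, so the printed ratio only demonstrates the calculation and says nothing about the schemes compared here.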

code to generate compressed files here

### Results

insert table from excel here

## Standalone Library

With Leon outperforming the other formats, we focused on making the Leon library (which is written in C++) available to Python users. This led to the creation of [PyLeon](https://github.com/jdvne/pyleon), a Python module that wraps Leon. It allows users to compress and decompress .leon files from within a Python environment for use in data processing.

> https://github.com/jdvne/pyleon

## Pull Request