## EXISTING FILE FORMATS FOR DNA
- Source: http://genome.ucsc.edu/FAQ/FAQformat.htm
- MAF
    - The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read. This format stores multiple alignments at the DNA level between entire genomes.
- 2bit
    - A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.
- nib
    - A .nib file describes a DNA sequence by packing two bases into each byte.
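- FASTA
    - Plain-text format for nucleotide (base pair) or amino acid sequences.
- FASTQ
    - Plain-text format for nucleotide sequences together with per-base quality scores.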
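
To make the "compact" idea behind the binary formats above concrete, here is a minimal sketch of packing four bases into one byte at 2 bits per base. It is not a .2bit reader or writer: the file header, index, and masking blocks are ignored, and the function names `pack_2bit` and `unpack_2bit` are hypothetical. The code assignment (T=00, C=01, A=10, G=11) is an assumption for this sketch.

```python
# Assumed 2-bit codes for this sketch only: T=00, C=01, A=10, G=11.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack_2bit(seq: str) -> bytes:
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so the padding sits in the low bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack_2bit("ATTAAAGG")
restored = unpack_2bit(packed, 8)
```

Because padding is indistinguishable from real bases, the original length has to be stored alongside the packed bytes.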

## EXISTING PYTHON LIBRARIES
- Source: https://wiki.python.org/moin/PythonMed
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04455-3
- Biopython (https://biopython.org/)
    - Parses bioinformatics files into data structures usable from Python, with support for the following formats:
        - Blast output – both from standalone and WWW Blast
        - Clustalw
        - FASTA
        - GenBank
        - PubMed and Medline
        - ExPASy files like Enzyme and Prosite
        - SCOP, including ‘dom’ and ‘lin’ files
        - UniGene
        - SwissProt
- pysam (http://pysam.readthedocs.org/)
    - Python wrapper package around Samtools, a suite of programs for reading and manipulating high-throughput sequencing data.
- kPAL (http://kpal.readthedocs.org/)
    - k-mer profile analysis library
- DendroPy
    - package for phylogenetic computing. It supports a wide range of phylogenetic tree formats and can be used both as a phylogenetic library and for scripting.
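
For context on what a "k-mer profile" is, a minimal standard-library sketch follows. It does not use kPAL or reproduce its API; the function name `kmer_profile` is hypothetical and the input is assumed to be a plain sequence string.

```python
from collections import Counter


def kmer_profile(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# 2-mer profile of a short example sequence.
profile = kmer_profile("ATTAAAGG", 2)
```

A sequence of length n yields n - k + 1 overlapping k-mers, so the counts in `profile` sum to 7 here.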



NOTE: apparently a great many fast* file types exist: https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

## Header:

?   -> 000 <br>
A   -> 001 <br>
T/U -> 010 <br>
G   -> 011 <br>
C   -> 100 <br>
XX  -> 101 <br>
YY  -> 110 <br>
ZZ  -> 111 <br>
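
The table above can be sketched as a 3-bit packer. This is only one possible reading of the notes: the bit order (most significant first), the zero padding of the last byte, and the names `encode_3bit` and `decode_3bit` are assumptions. The codes 101, 110 and 111 (XX, YY, ZZ) are left unassigned, and any symbol outside the table is mapped to `?`.

```python
# Codes copied from the table; 101, 110 and 111 stay unassigned.
CODE = {"?": 0b000, "A": 0b001, "T": 0b010, "U": 0b010, "G": 0b011, "C": 0b100}
SYMBOL = {0b000: "?", 0b001: "A", 0b010: "T", 0b011: "G", 0b100: "C"}


def encode_3bit(seq: str) -> bytes:
    """Pack a sequence at 3 bits per symbol, most significant bits first."""
    symbols = seq.upper()
    value = 0
    for ch in symbols:
        # Assumption: anything not in the table is stored as '?'.
        value = (value << 3) | CODE.get(ch, CODE["?"])
    nbits = 3 * len(symbols)
    pad = (-nbits) % 8  # zero bits appended to fill the last byte
    return (value << pad).to_bytes((nbits + pad) // 8, "big")


def decode_3bit(data: bytes, length: int, rna: bool = False) -> str:
    """Unpack `length` symbols; raises KeyError on an unassigned code."""
    value = int.from_bytes(data, "big") >> ((-3 * length) % 8)
    out = []
    for i in range(length):
        shift = 3 * (length - 1 - i)
        sym = SYMBOL[(value >> shift) & 0b111]
        out.append("U" if rna and sym == "T" else sym)
    return "".join(out)


encoded = encode_3bit("ATGC")
decoded = decode_3bit(encoded, 4)
```

Since T and U share the code 010, the decoder needs an external flag (here `rna`) to know which letter to emit, and the symbol count must be stored separately from the packed bytes.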


In [1]:
from Bio import SeqIO

# for seq_record in SeqIO.parse("SARS-Cov2-genome.fastq", "fastq"):
#     print(seq_record.id)
#     print(repr(seq_record.seq))
#     print(len(seq_record))

for seq_record in SeqIO.parse("covid.fasta", "fasta"):
    print(repr(seq_record.seq))

Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA')


In [2]:
from collections import Counter
inbits = bytearray()

class dna():
    bits: bool
    rna: bool
    pairs: Counter
    data: bytes
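
The class above only declares fields, so here is a hedged sketch of how such a record might be filled in from a text sequence. Every field meaning below is a guess, not something the notes state: `bits` is assumed to flag bit-packed data, `rna` to flag U instead of T, `pairs` to hold per-base counts, and `data` to hold the stored bytes. `DnaRecord` and `from_text` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DnaRecord:
    bits: bool      # assumed: True when `data` is bit-packed rather than ASCII
    rna: bool       # assumed: True when the sequence uses U instead of T
    pairs: Counter  # assumed: count of each base in the sequence
    data: bytes     # assumed: the stored sequence itself


def from_text(seq: str) -> DnaRecord:
    """Build a record from a plain sequence string, kept as ASCII bytes."""
    seq = seq.upper()
    return DnaRecord(
        bits=False,
        rna="U" in seq,
        pairs=Counter(seq),
        data=seq.encode("ascii"),
    )


record = from_text("attaaagg")
```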