Codec for converting binary strings into DNA-like sequences over the alphabet {A, C, G, T} while guaranteeing that no symbol repeats more than k times consecutively (length k homopolymer run). The repository ships a fast C++ encoder/decoder that supports maximum homopolymer run lengths k ∈ {1, 2, 3, 4, 5}. See the writeup for how the encoding and decoding algorithms work. This code is written as part of the DNA Storage Project at Microsoft.
- About 50Mbps for encoding and 80 Mbps for decoding on a single core. Can be easily parallelized across multiple cores.
- Max run length 1 is about 3 times faster, because it is basically base 2 to base 3 conversion.
- Uses GMP multi precision library for maintaining large integers.
DNA synthesis and sequencing are sensitive to long runs of identical bases (e.g., AAA or CCCC). By bounding the length of these homopolymer runs, stored strands become more robust to physical errors. The encoder in this repository ensures that every generated strand respects the chosen bound k while still allowing you to recover the original binary payload exactly.
- Deterministic encoding of binary strings into base-4 strings subject to a homopolymer constraint.
- Efficient decoder that returns the original binary message.
- Supports five constraint strengths (
k = 1…5). - Includes benchmarking harnesses to measure encoding/decoding throughput with GMP-backed big-integer arithmetic.
The C++ sources depend on:
- A C++17-compatible compiler (tested with
g++). - GNU Multiple Precision Arithmetic Library (GMP) with C++ bindings (
libgmpxx).
# Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y libgmp3-dev
# macOS using Homebrew
brew install gmpFor Windows, download the prebuilt binaries or build from source following the instructions at gmplib.org or the NYU GMP guide.
Clone this repository and compile the main driver:
git clone https://github.com/microsoft/DNABoundedHomopolymerEncoding.git
cd DNABoundedHomopolymerEncoding
g++ -O3 BoundedHomopolymerEncoding.cpp -lgmpxx -lgmp -o bounded_homopolymerThis produces an executable named bounded_homopolymer that links against GMP.
If you also need the rate benchmarking tool, build BHE_rates.cpp in the same way:
g++ -O3 BHE_rates.cpp -lgmpxx -lgmp -o bhe_ratesThe compiled bounded_homopolymer binary encodes random binary messages and checks that the decoder inverts the process.
./bounded_homopolymer <max_homopolymer_run_length> <encoding_length> <input_data_length> <number_trials>max_homopolymer_run_length(k): Choose from1to5. This bounds the length of any homopolymer run in the encoded strand. For example,k = 3guarantees that substrings such asAAAAorGGGGnever appear; the maximum run isAAA,CCC,GGG, orTTT.encoding_length: Desired length (in bases) of the encoded strand.input_data_length: Length (in bits) of the binary message to encode.number_trials: Number of random messages to encode/decode for benchmarking.
Example run:
./bounded_homopolymer 3 25 40 100The program prints the maximum encodable message size for the provided parameters and reports encoding/decoding timings. If any trial fails to round-trip, an error message is emitted.
The companion bhe_rates tool leverages the same encoder to estimate the smallest strand length that can host a target number of data bits while honoring the homopolymer constraint. It evaluates each k ∈ {1,…,5} and reports the achievable rate (bits per base).
./bhe_rates <encoding_length>encoding_length: Strand length (in bases) you are willing to allocate.- Output: For each
k ∈ {1,…,5}, the tool reports the maximum payload size (in bits) that fits withinencoding_lengthbases while respecting the homopolymer constraint, plus the corresponding rate.
Example:
./bhe_rates 96Sample output:
k max_input_bits rate(bits/base)
------------------------------------------
1 152 1.583333
2 184 1.916667
3 190 1.979167
4 191 1.989583
5 191 1.989583
If a given constraint cannot store any payload at that length, the tool reports 0 bits for that row.
Both BoundedHomopolymerEncoder and BoundedHomopolymerDecoder are self-contained in BoundedHomopolymerEncoding.cpp. You can include the file or extract the classes into your project.
#include "BoundedHomopolymerEncoding.cpp"
int main() {
const int k = 3; // Maximum homopolymer run length
const int encoded_len = 25; // Length of the output DNA strand
const int data_len = 12; // Length of the binary input
BoundedHomopolymerEncoder encoder(k, encoded_len, data_len);
std::string input = "101101001011"; // Provide your own binary string
std::string encoded = encoder.encode(input);
// Map digits {0,1,2,3} to {A,C,G,T} (or your preferred ordering) to obtain a DNA strand.
BoundedHomopolymerDecoder decoder(k, encoded_len, data_len);
std::string recovered = decoder.decode(encoded);
}Max data bits that can be encodedwarning: Reduceinput_data_lengthor increaseencoding_lengthso the message fits the selected constraint.- Linker errors about GMP: Confirm
libgmpandlibgmpxxare installed and search paths are visible to your compiler.
The author would like to thank Jon Schneider for his insightful comments which motivated the encoding and decoding algorithms.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.