Skip to content

microsoft/DNABoundedHomopolymerEncoding

Repository files navigation

Bounded Homopolymer Encoding for DNA Storage

Codec for converting binary strings into DNA-like sequences over the alphabet {A, C, G, T} while guaranteeing that no symbol repeats more than k times consecutively (length k homopolymer run). The repository ships a fast C++ encoder/decoder that supports maximum homopolymer run lengths k ∈ {1, 2, 3, 4, 5}. See the writeup for how the encoding and decoding algorithms work. This code is written as part of the DNA Storage Project at Microsoft.

Performance

  • About 50Mbps for encoding and 80 Mbps for decoding on a single core. Can be easily parallelized across multiple cores.
  • Max run length 1 is about 3 times faster, because it is basically base 2 to base 3 conversion.
  • Uses GMP multi precision library for maintaining large integers.

Why bounded homopolymers?

DNA synthesis and sequencing are sensitive to long runs of identical bases (e.g., AAA or CCCC). By bounding the length of these homopolymer runs, stored strands become more robust to physical errors. The encoder in this repository ensures that every generated strand respects the chosen bound k while still allowing you to recover the original binary payload exactly.

Features

  • Deterministic encoding of binary strings into base-4 strings subject to a homopolymer constraint.
  • Efficient decoder that returns the original binary message.
  • Supports five constraint strengths (k = 1…5).
  • Includes benchmarking harnesses to measure encoding/decoding throughput with GMP-backed big-integer arithmetic.

Prerequisites

The C++ sources depend on:

Installing GMP

# Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y libgmp3-dev

# macOS using Homebrew
brew install gmp

For Windows, download the prebuilt binaries or build from source following the instructions at gmplib.org or the NYU GMP guide.

Building the encoder/decoder

Clone this repository and compile the main driver:

git clone https://github.com/microsoft/DNABoundedHomopolymerEncoding.git
cd DNABoundedHomopolymerEncoding
g++ -O3 BoundedHomopolymerEncoding.cpp -lgmpxx -lgmp -o bounded_homopolymer

This produces an executable named bounded_homopolymer that links against GMP.

If you also need the rate benchmarking tool, build BHE_rates.cpp in the same way:

g++ -O3 BHE_rates.cpp -lgmpxx -lgmp -o bhe_rates

Usage

Command line driver

The compiled bounded_homopolymer binary encodes random binary messages and checks that the decoder inverts the process.

./bounded_homopolymer <max_homopolymer_run_length> <encoding_length> <input_data_length> <number_trials>
  • max_homopolymer_run_length (k): Choose from 1 to 5. This bounds the length of any homopolymer run in the encoded strand. For example, k = 3 guarantees that substrings such as AAAA or GGGG never appear; the maximum run is AAA, CCC, GGG, or TTT.
  • encoding_length: Desired length (in bases) of the encoded strand.
  • input_data_length: Length (in bits) of the binary message to encode.
  • number_trials: Number of random messages to encode/decode for benchmarking.

Example run:

./bounded_homopolymer 3 25 40 100

The program prints the maximum encodable message size for the provided parameters and reports encoding/decoding timings. If any trial fails to round-trip, an error message is emitted.

Estimating minimal encoding lengths

The companion bhe_rates tool leverages the same encoder to estimate the smallest strand length that can host a target number of data bits while honoring the homopolymer constraint. It evaluates each k ∈ {1,…,5} and reports the achievable rate (bits per base).

./bhe_rates <encoding_length>
  • encoding_length: Strand length (in bases) you are willing to allocate.
  • Output: For each k ∈ {1,…,5}, the tool reports the maximum payload size (in bits) that fits within encoding_length bases while respecting the homopolymer constraint, plus the corresponding rate.

Example:

./bhe_rates 96

Sample output:

k     max_input_bits    rate(bits/base)
------------------------------------------
1     152              1.583333
2     184              1.916667
3     190              1.979167
4     191              1.989583
5     191              1.989583

If a given constraint cannot store any payload at that length, the tool reports 0 bits for that row.

Using the classes in your own code

Both BoundedHomopolymerEncoder and BoundedHomopolymerDecoder are self-contained in BoundedHomopolymerEncoding.cpp. You can include the file or extract the classes into your project.

#include "BoundedHomopolymerEncoding.cpp"

int main() {
	const int k = 3;              // Maximum homopolymer run length
	const int encoded_len = 25;   // Length of the output DNA strand
	const int data_len = 12;      // Length of the binary input

	BoundedHomopolymerEncoder encoder(k, encoded_len, data_len);
	std::string input = "101101001011"; // Provide your own binary string
	std::string encoded = encoder.encode(input);
	// Map digits {0,1,2,3} to {A,C,G,T} (or your preferred ordering) to obtain a DNA strand.

	BoundedHomopolymerDecoder decoder(k, encoded_len, data_len);
	std::string recovered = decoder.decode(encoded);
}

Troubleshooting

  • Max data bits that can be encoded warning: Reduce input_data_length or increase encoding_length so the message fits the selected constraint.
  • Linker errors about GMP: Confirm libgmp and libgmpxx are installed and search paths are visible to your compiler.

Acknowledgements

The author would like to thank Jon Schneider for his insightful comments which motivated the encoding and decoding algorithms.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

About

A codec for encoding and decoding binary strings into DNA strands with bounded homopolymer runs

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages