Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time
March 31, 2017 01:06
October 10, 2017 10:47
April 1, 2017 19:03
January 26, 2020 10:31
April 16, 2017 23:16
March 7, 2017 01:05
April 17, 2017 00:00
April 17, 2017 10:30
March 31, 2017 01:06
October 10, 2017 10:47
January 26, 2020 22:22


Build Status

Dna2vec is an open-source library to train distributed representations of variable-length k-mers.

For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers


Note that this implementation has only been tested on Python 3.5.3, but we welcome any contributions or bug reporting to make it more accessible.

  1. Clone the dna2vec repository: git clone
  2. Install Python dependencies: pip3 install -r requirements.txt
  3. Test the installation: python3 ./scripts/ -c configs/small_example.yml

Training dna2vec embeddings

  1. Download hg38 from This will take a while as it's 938MB.
  2. Untar with tar -zxvf hg38.chromFa.tar.gz. You should see FASTA files for chromosome 1 to 22: chr1.fa, chr2.fa, ..., chr22.fa.
  3. Move the 22 FASTA files to folder inputs/hg38/
  4. Start the training with: python3 ./scripts/ -c configs/hg38-20161219-0153.yml
  5. Wait for a couple of days ...
  6. Once the training is done, there should be a dna2vec-<ID>.w2v and a corresponding dna2vec-<ID>.txt file in your results/ directory.

Reading pretrained dna2vec

You can read pretrained dna2vec vectors pretrained/dna2vec-*.w2v using the class MultiKModel in dna2vec/ For example:

from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

You can fetch the vector representation of AAA with:

>>> mk_model.vector('AAA')
array([ 0.023137  ,  0.156295 , ...

Compute the cosine distance between two k-mers via dna2vec:

>>> mk_model.cosine_distance('AAA', 'GCT')
>>> mk_model.cosine_distance('AAA', 'AAAA')


Does the pre-trained dna2vec data (w2v file) cover all k-mers?

The pre-trained data should cover all k-mers for 3 ≤ k ≤ 8

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]


I would love for you to fork and send me pull request for this project. Please contribute.


This software is licensed under the MIT license