Kmer Counting with Spark

This repository provides a distributed implementation of an exact k-mer counter based on Spark.

It is written mainly in Scala, but it also includes and supports libraries written in Java (see, for example, the distance functions under the java/multiseq package).
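To make "exact k-mer counting" concrete, here is a minimal single-machine sketch in plain Java. It is not code from this repository and does not use Spark; the class name ExactKmerCountSketch and the method countKmers are hypothetical. It only illustrates the quantity being computed: every length-k substring is counted exactly, with no sampling or approximation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not a class from this repository. */
final class ExactKmerCountSketch {

    /** Counts every length-k substring of seq exactly. */
    static Map<String, Long> countKmers(String seq, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Long> counts = new HashMap<>();
        // Slide a window of width k over the sequence, one position at a time.
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the counts of the 3-mers of a toy sequence.
        System.out.println(countKmers("ACGTACGT", 3));
    }
}
```

A distributed counter has to produce the same totals while splitting the input and the counting work across many workers.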

Dependencies (pom.xml):

  • scala-sdk-2.11.7
  • spark-core_2.11
  • spark-sql_2.11
  • FASTdoop-1.0.jar (to be installed in local mvn repository, see next section)
  • fastutil-7.2.1
  • spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar (optional, for use on a MS Azure cluster equipped with Azure HDInsight, see next section)

Installing FASTdoop in the local Maven repository

FASTdoop is not yet available in public Maven repositories. To import the jar into your local Maven repository, run the following command:

mvn install:install-file -Dfile=/path/to/FASTdoop-1.0.jar -DgroupId=org.fastdoop -DartifactId=fastdoop -Dversion=1.0 -Dpackaging=jar
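Once installed, the jar is resolved through the coordinates given in the command. A pom.xml dependency entry matching those coordinates would look like the fragment below. This is a sketch derived from the command only; the project's pom.xml may already declare it, possibly with additional settings.

```xml
<!-- Coordinates copied from the mvn install:install-file command -->
<dependency>
  <groupId>org.fastdoop</groupId>
  <artifactId>fastdoop</artifactId>
  <version>1.0</version>
</dependency>
```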

IntelliJ Azure plugin

This project is ready to be run on top of a Microsoft Azure HDInsight cluster. Optionally, the cluster job can be submitted, run, and debugged directly from your machine using IntelliJ IDEA or Eclipse.

Instructions on how to configure the IntelliJ Azure plugin to interoperate with HDInsight can be found at:

Make sure to use the HDInsight spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar, which is not available on official Maven or third-party repositories. It can also be downloaded directly from an HDInsight node.

Project source overview

  • java | java libraries
    • multiseq | multisequence distance function specifications (experimental)
  • scala
    • skc | main project libraries
      • multisequence | multisequence support (experimental)
      • test | runnable classes: TestKmerCounter and LocalTestKmerCounter
      • package.scala
        • util | e.g., Kmer, RIndex classes and functions

How to


Compile

mvn clean
mvn compile

Package (only for cluster mode)

mvn package

Running locally using spark local mode

java -cp <CLASSPATH, including scala-library.jar> skc.test.LocalTestKmerCounter k m x useHT B sequenceType inputPath outputPath prefix write enableKryo useCustomPartitioner numPartitionTasks

Parameter description (both for local and cluster mode):

Name | Meaning
k | k-mer length
m | signature length
x | (k,x)-mer compression factor
useHT | 1 for the HT-based implementation, or 0
B | number of bins
sequenceType | 0 for short sequences, 1 for long sequences
inputPath | dataset input path (HDFS or local)
outputPath | counts output path (HDFS or local)
prefix | custom output directory prefix
write | whether to write the output
enableKryo | 1 to enable Kryo serialization, or 0
useCustomPartitioner | 1 for partition balancing, or 0
numPartitionTasks | number of partitions, used when partitioning is enabled
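Because all thirteen arguments are positional, it is easy to pass them in the wrong order. The sketch below shows one way to map them onto named fields, following the order given in the local-mode command line. The class SkcArgsSketch is hypothetical and is not how the project itself parses arguments. It assumes that every 0/1 flag, including write, treats 1 as enabled. The sample values in main are made up.

```java
/** Hypothetical holder for the 13 positional arguments; not a class from this repository. */
final class SkcArgsSketch {
    final int k, m, x, bins, sequenceType, numPartitionTasks;
    final boolean useHT, write, enableKryo, useCustomPartitioner;
    final String inputPath, outputPath, prefix;

    SkcArgsSketch(String[] a) {
        if (a.length != 13) {
            throw new IllegalArgumentException("expected 13 arguments, got " + a.length);
        }
        k = Integer.parseInt(a[0]);
        m = Integer.parseInt(a[1]);
        x = Integer.parseInt(a[2]);
        useHT = flag(a[3]);
        bins = Integer.parseInt(a[4]);            // B in the parameter list
        sequenceType = Integer.parseInt(a[5]);    // 0 = short, 1 = long
        inputPath = a[6];
        outputPath = a[7];
        prefix = a[8];
        write = flag(a[9]);                       // assumption: 1 means "write output"
        enableKryo = flag(a[10]);
        useCustomPartitioner = flag(a[11]);
        numPartitionTasks = Integer.parseInt(a[12]);
    }

    /** Assumption: flags are encoded as 1 (enabled) or any other integer (disabled). */
    private static boolean flag(String s) {
        return Integer.parseInt(s) == 1;
    }

    public static void main(String[] args) {
        // Made-up sample values, in the documented order.
        SkcArgsSketch p = new SkcArgsSketch(new String[] {
            "28", "10", "3", "1", "2048", "0", "in.fasta", "out/", "run", "1", "0", "0", "0"
        });
        System.out.println("k=" + p.k + " m=" + p.m + " bins=" + p.bins + " useHT=" + p.useHT);
    }
}
```

Failing fast on a wrong argument count catches the most common mistake before a cluster job is submitted.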

Running in cluster mode (YARN) using spark-submit (example)

Example for dataset ggallus.fasta located on HDFS at hdfs://mycluster/tests/input/ggallus.fasta

spark-submit --master yarn --deploy-mode cluster --num-executors <executors> --executor-cores <cores> --driver-memory 1g --executor-memory <Xg> --jars /path/to/fastutil-7.2.1.jar,/path/to/FASTdoop-1.0.jar --class skc.test.TestKmerCounter SKC-1.0-SNAPSHOT.jar 28 10 3 2048 0 0 hdfs://mycluster/tests/input/ggallus.fasta /mycluster/tests/output/ gallus 1 0 0 0

Remember to place all external jars on the node where spark-submit is invoked (/path/to/<jar>.jar), so that Spark can deploy them to all worker nodes.

Handling multiple sequences (experimental, under development)

The multisequence package contains a prototype implementation of multiple-sequence distance computation based on exact k-mers. The main runnable class is TestMultisequenceKmerCounter. It expects inputPath to be a file containing reads from the set of sequences to be compared. Each read is assumed to be pre-tagged with a read descriptor that identifies its sequence, as in the following example:


SRR197985.1 HWUSI-EAS687_61DAJ:3:1:1046:16470 length=200
SRR956987.1 HWI-ST571:185:D111MACXX:2:1101:1213:2216 length=101
SRR956987.2 HWI-ST571:185:D111MACXX:2:1101:1213:2216 length=101
SRR197985.2 HWUSI-EAS687_61DAJ:3:1:1046:1308 length=200
SRR956987.3 HWI-ST571:185:D111MACXX:2:1101:1019:2221 length=101
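One plausible way to recover the source sequence from such a descriptor is to take the part of the first token that precedes the dot, for example SRR197985 from SRR197985.1. The sketch below does this. It is an assumption about the tagging convention, not the parsing logic used by TestMultisequenceKmerCounter, and ReadTagSketch and sequenceTag are hypothetical names.

```java
/** Hypothetical illustration, not a class from this repository. */
final class ReadTagSketch {

    /**
     * Returns the text before the first '.' in the first whitespace-delimited
     * token of the descriptor. A leading '>' or '@' marker, if present, is dropped.
     */
    static String sequenceTag(String descriptor) {
        String s = descriptor.trim();
        if (!s.isEmpty() && (s.charAt(0) == '>' || s.charAt(0) == '@')) {
            s = s.substring(1);
        }
        String first = s.split("\\s+", 2)[0];
        int dot = first.indexOf('.');
        // Without a dot, the whole first token is treated as the tag.
        return dot < 0 ? first : first.substring(0, dot);
    }

    public static void main(String[] args) {
        System.out.println(
            sequenceTag("SRR956987.2 HWI-ST571:185:D111MACXX:2:1101:1213:2216 length=101"));
    }
}
```

Reads sharing a tag would then be counted together, giving one k-mer profile per sequence.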

The java/multiseq package contains several distance implementations. By default, this prototype uses the squared Euclidean distance, which can be computed independently and incrementally for each bin.

Currently, partial distances are calculated inside each bin, but they are not yet aggregated or saved to disk.
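The squared Euclidean distance is a sum of per-k-mer terms, so if every k-mer belongs to exactly one bin, the total distance is the sum of the per-bin partial distances. The sketch below illustrates this under that assumption. BinDistanceSketch, partial and total are hypothetical names, and this is not the code in java/multiseq.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration, not a class from this repository. */
final class BinDistanceSketch {

    /** Squared Euclidean distance over the k-mers of one bin; a missing k-mer counts as 0. */
    static long partial(Map<String, Long> a, Map<String, Long> b) {
        Set<String> kmers = new HashSet<>(a.keySet());
        kmers.addAll(b.keySet());
        long sum = 0;
        for (String kmer : kmers) {
            long d = a.getOrDefault(kmer, 0L) - b.getOrDefault(kmer, 0L);
            sum += d * d;
        }
        return sum;
    }

    /** Adds up the partial distances; assumes bin i of each sequence covers the same k-mers. */
    static long total(List<Map<String, Long>> binsA, List<Map<String, Long>> binsB) {
        if (binsA.size() != binsB.size()) {
            throw new IllegalArgumentException("both sequences must use the same number of bins");
        }
        long total = 0;
        for (int i = 0; i < binsA.size(); i++) {
            total += partial(binsA.get(i), binsB.get(i));
        }
        return total;
    }
}
```

Since the partial sums are independent of each other, the final aggregation step reduces to a single sum over bins for each pair of sequences.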

