Skip to content

mbhall88/skc

Repository files navigation

skc

skc is a simple tool for finding shared k-mer content between two genomes.

Installation

Prebuilt binary

curl -sSL skc.mbh.sh | sh
# or with wget
wget -nv -O - skc.mbh.sh | sh

You can also pass options to the script like so

$ curl -sSL skc.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of skc, if skc is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases]

        -h, --help
                Display this help message

Cargo

cargo install skc

Conda

conda install skc

Local

cargo build --release
./target/release/skc --help

Usage

Check for shared 16-mers between the HIV-1 genome and the Mycobacterium tuberculosis genome.

$ skc -k 16 NC_001802.1.fa NC_000962.3.fa
[2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target
[2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query
>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106
TGCAGAACATCCAGGG
>4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482
CCAGCAGCAGATAGGG

So we can see there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout - use the -o option to write them to file.

Fasta description

Example: >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106

The ID (4233642782) is the 64-bit integer representation of the k-mer's value in bit-space ( see Daniel Liu's brilliant cute-nucleotides repository for more information). tcount and qcount are the number of times the k-mer is present in the target and query genomes, respectively. tpos and qpos are the (1-based) k-mer starting position(s) within the target and query contigs - these will be comma-seperated if the k-mer occurs multiple times.

Usage help

$ skc --help
Shared k-mer content between two genomes

Usage: skc [OPTIONS] <TARGET> <QUERY>

Arguments:
  <TARGET>
          Target sequence

          Can be compressed with gzip, bzip2, xz, or zstd

  <QUERY>
          Query sequence

          Can be compressed with gzip, bzip2, xz, or zstd

Options:
  -k, --kmer <KMER>
          Size of k-mers (max. 32)

          [default: 21]

  -o, --output <OUTPUT>
          Output filepath(s); stdout if not present

  -O, --output-type <u|b|g|l|z>
          u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd

          Output compression format is automatically guessed from the filename extension. This option is used to override that

          [default: u]

  -l, --compress-level <INT>
          Compression level to use if compressing output

          [default: 6]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Caveats

  • Make the first genome passed (<TARGET>) the smallest genome. This is to reduce memory usage as all unique k-mers ( well their u64 value) for this genome will be held in memory.
  • We do not use canonical k-mers
  • 32 is the largest k-mer size that can be used. This is basically a (lazy) implementation decision, but also helps to keep the memory footprint as low as possible. If you want larger k-mer values, I would suggest checking out some of the similar tools.

Alternate tools

skc does not claim to be the fastest or most memory-efficient tool to find shared k-mer content. I basically wrote it as I either struggled to install some alternate tools, they were clunky/verbose, or it was laborious to get shared k-mers out of the results (e.g. can only search one k-mer at a time or have to run many different subcommands). Here is a (non-exhaustive) list of other tools that can be used to get shared k-mer content

  • unikmer - this was brought to my attention after I wrote skc. Had I known about it beforehand, I probably wouldn't have written skc. So I would recommend unikmer for almost all use cases - Wei Shen writes awesome tools
  • Jellyfish
  • REINDEER
  • kmer-db
  • GGCAT
  • KAT

Acknowledgements

Daniel Liu's brilliant cute-nucleotides repository is used to (rapidly) convert k-mers into 64-bit integers.