
A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing







kb-python is a Python package for processing single-cell RNA-sequencing data. It wraps the kallisto | bustools single-cell RNA-seq command line tools in order to unify multiple processing workflows.

kb-python was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi while in Lior Pachter's lab at Caltech. If you use kb-python in a publication please cite*:

Melsted, P., Booeshaghi, A.S., et al. 
Modular, efficient and constant-memory single-cell RNA-seq preprocessing. 
Nat Biotechnol  39, 813–818 (2021).


The latest release can be installed with

pip install kb-python

The development version can be installed with

pip install git+

There are no prerequisite packages to install. The kallisto and bustools binaries are included with the package.


kb consists of four subcommands:

$ kb
usage: kb [-h] [--list] <CMD> ...
positional arguments:
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

kb ref: generate a pseudoalignment index

The kb ref command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before kb count. Internally, kb ref extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with kallisto index.

kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
  • <GENOME> refers to a genome file (FASTA).
    • For example, the zebrafish genome is hosted by Ensembl and can be downloaded here
  • <GENOME_ANNOTATION> refers to a genome annotation file (GTF)
    • For example, the zebrafish genome annotation file is hosted by Ensembl and can be downloaded here
  • Note: The latest genome annotation and genome file for every species on Ensembl can be found with the gget command-line tool.

Prebuilt indices are available at


# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
# An example for downloading a prebuilt reference for mouse
$ kb ref -d mouse -i index.idx -g t2g.txt
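To make the transcript-to-gene mapping more concrete, here is a minimal sketch that derives a two-column mapping from a toy GTF using awk. It is not how kb ref is implemented, and the real t2g.txt may contain additional columns. The file contents and the identifiers (GENE_A, TX_A1, and so on) are made up for illustration.

```shell
# Toy GTF with made-up identifiers (hypothetical data, for illustration only).
gtf=$(mktemp)
printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A";\n' > "$gtf"
printf 'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n' >> "$gtf"
printf 'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n' >> "$gtf"
printf 'chr1\ttoy\ttranscript\t100\t700\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2";\n' >> "$gtf"
printf 'chr2\ttoy\ttranscript\t50\t400\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B1";\n' >> "$gtf"

# Keep only "transcript" records and pull transcript_id and gene_id
# out of the attribute column (column 9).
t2g=$(awk -F'\t' '$3 == "transcript" {
  gene = ""; tx = ""
  n = split($9, attrs, ";")
  for (i = 1; i <= n; i++) {
    a = attrs[i]
    sub(/^ +/, "", a)                      # drop leading spaces
    if (a ~ /^gene_id /)       { gene = a; sub(/^gene_id /, "", gene);     gsub(/"/, "", gene) }
    if (a ~ /^transcript_id /) { tx = a;   sub(/^transcript_id /, "", tx); gsub(/"/, "", tx) }
  }
  if (tx != "" && gene != "") print tx "\t" gene
}' "$gtf")

printf '%s\n' "$t2g"
rm -f "$gtf"
```

Each output line pairs one transcript with its gene, which is the information needed later to collapse transcript-level pseudoalignments into gene-level counts.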

kb count: pseudoalign and count reads

The kb count command takes in the pseudoalignment index (built with kb ref) and sequencing reads generated by a sequencing machine to generate a count matrix. Internally, kb count runs numerous kallisto and bustools commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.

kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
  • <TECHNOLOGY> refers to the assay that generated the sequencing reads.
    • For a list of supported assays run kb --list
  • <FASTQ FILE[s]> refers to the list of FASTQ files generated by the sequencer
    • Different assays will have a different number of FASTQ files
    • Different assays will place the different features in different FASTQ files
      • For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files.
      • The R1.fastq.gz file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
      • The R2.fastq.gz file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.
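The read 1 layout described above can be illustrated with a short sketch that splits a single 28-base sequence into its 16-base cell barcode and 12-base UMI. The sequence is made up for illustration; kb count performs this extraction internally based on the technology passed with -x, so this is only meant to show the layout.

```shell
# Hypothetical read 1 sequence (made up): 16 bases of cell barcode
# followed by 12 bases of UMI, as in a 10xv3 library.
read1_seq="AAACCCAAGAAACACTGGTTACGTACGT"

# Positions 1-16 hold the cell barcode, positions 17-28 hold the UMI.
barcode=$(printf '%s\n' "$read1_seq" | cut -c1-16)
umi=$(printf '%s\n' "$read1_seq" | cut -c17-28)

printf 'barcode=%s umi=%s\n' "$barcode" "$umi"
```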


# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz

kb info: display package and citation information

The kb info command prints out package information including the version of kb-python, kallisto, and bustools along with their installation location.

$ kb info
kb_python 0.28.0 ...
kallisto: 0.50.1 ...
bustools: 0.43.1 ...

kb compile: compile kallisto and bustools binaries from source

The kb compile command grabs the latest kallisto and bustools source and compiles the binaries. Note: this is not required to run kb-python.

Use cases

kb-python facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.

$ pip install kb-python gget ffq

# Goal: quantify publicly available scRNAseq data
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder

# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder

Submitted by @sbooeshaghi.
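As a small aid for the feature barcode use case, the following sketch writes a tab-delimited feature_barcodes.txt-style file and checks its shape before it is handed to kb ref. The barcode sequences and feature names are hypothetical, and the check that sequences contain only A, C, G, and T is an assumption of this sketch rather than a documented requirement.

```shell
# Hypothetical barcode sequences and names, made up for illustration.
features=$(mktemp)
printf 'ACGTACGTACGTACG\tfeature_A\n' > "$features"
printf 'TTGCATTGCATTGCA\tfeature_B\n' >> "$features"

# Count lines that do not have exactly two tab-separated fields,
# or whose first field is not a plain A/C/G/T sequence (assumed check).
bad_lines=$(awk -F'\t' 'NF != 2 || $1 !~ /^[ACGT]+$/ || $2 == "" { bad++ } END { print bad + 0 }' "$features")
n_features=$(awk 'END { print NR }' "$features")

echo "features: $n_features, malformed lines: $bad_lines"
rm -f "$features"
```

A non-zero count of malformed lines usually points to spaces used in place of tabs or to a stray header row.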

Do you have a cool use case for kb-python? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.


For a list of tutorials that use kb-python please see


Developer documentation is hosted on Read the Docs.


Thank you for wanting to improve kb-python! If you believe you've found a bug, please submit an issue.

If you have a new feature you'd like to add to kb-python please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.


If you use kb-python in a publication, please cite the following papers:

kb-python & kallisto and/or bustools

Sullivan, D.K., Min, K.H., Hjörleifsson, K.E., Luebbert, L., Holley, G., Moses, L., Gustafsson, J., Bray, N.L., Pimentel, H., Booeshaghi, A.S., et al.
kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq.
Cold Spring Harbor Laboratory.

Melsted, P., Booeshaghi, A.S., Liu, L., Gao, F., Lu, L., Min, K.H.J., da Veiga Beltrame, E., Hjörleifsson, K.E., Gehring, J., Pachter, L.
Modular, efficient and constant-memory single-cell RNA-seq preprocessing.
Nat Biotechnol 39, 813–818 (2021).

Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.
Near-optimal probabilistic RNA-seq quantification.
Nature Biotechnology. Nature Publishing Group.

Booeshaghi, A.S., Min, K.H., Gehring, J., Pachter, L.
Quantifying orthogonal barcodes for sequence census assays.
Bioinformatics Advances. Oxford University Press.

BUS format

Melsted, P., Ntranos, V., Pachter, L.
The barcode, UMI, set format and BUStools.
Oxford University Press.

kb-python was inspired by Sten Linnarsson’s loompy fromfq command.