# Fast and efficient preprocessing of scRNA-seq with kallisto | bustools | kb-python


This notebook provides a complete workflow to quantify single-cell RNA-seq data using using **kb-python** (kallisto|bustools).


[Md. Jubayer Hossain](https://mdjubayerhossain.com/)

Founder & CEO, [DeepBio Ltd.](https://deepbioltd.com/)

## Step 1: Install Required Packages

Install kb-python and ffq (for downloading SRA data).

In [5]:
# Install kb-python and ffq
%%time
!pip install kb-python ffq -q

CPU times: user 1.3 s, sys: 167 ms, total: 1.47 s
Wall time: 5.08 s


In [6]:
# Mount Google Drive (only mounts if not mounted already)
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# Set up main working directory
import os
working_dir = "/content/drive/MyDrive/01-Projects/Papers/AD-Sex-Differences"
os.makedirs(working_dir, exist_ok=True)
os.chdir(working_dir)

In [8]:
# Define folder structure
folders = [
    "reference",
    "raw_data",
    "processed_data",
    "results",
    "results/figures",
    "results/tables"
]

# Create folders
for folder in folders:
    path = os.path.join(working_dir, folder)
    os.makedirs(path, exist_ok=True)

In [9]:
# Set Save paths for easy use later
reference_dir = os.path.join(working_dir, "reference")
raw_data_dir = os.path.join(working_dir, "raw_data")
processed_data_dir = os.path.join(working_dir, "processed_data")
results_dir = os.path.join(working_dir, "results")
figures_dir = os.path.join(working_dir, "results/figures")
tables_dir = os.path.join(working_dir, "results/tables")

## Step 2: Build Reference Index

We'll use kb-python's built-in reference download for mouse. This will download the mouse transcriptome and create an index.

Before we can quantify gene expression from raw FASTQ files, we need to create a reference index.
This index is essential because tools like kb-python (Kallisto | Bustools) must know:

- Which transcripts exist in the organism?
- Where each transcript starts and ends?
- Which transcripts map to which genes?

This allows the reads to be pseudoaligned quickly and accurately.

In [10]:
# Check directory path
!pwd

/content/drive/MyDrive/01-Projects/Papers/AD-Sex-Differences


In [11]:
# List directories
!ls

notebooks  processed_data  raw_data  reference	results


In [12]:
# Check results
!ls results

figures  tables


In [13]:
# Download and build mouse reference (this may take 10-15 minutes)
# For human: -d human
# For Mouse: -d mouse
%%time
!kb ref -d human -i reference/index.idx -g reference/t2g.txt

[2026-01-20 21:02:53,798]    INFO [download] Downloading files for human (standard workflow) from https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/v1/human_index_standard.tar.xz to tmp/human_index_standard.tar.xz
100% 138M/138M [00:01<00:00, 73.7MB/s]
[2026-01-20 21:02:55,769]    INFO [download] Extracting files from tmp/human_index_standard.tar.xz
CPU times: user 98.1 ms, sys: 18.4 ms, total: 116 ms
Wall time: 44.5 s


This command downloads and builds the human transcriptome reference needed for `pseudoalignment` and count matrix generation in single-cell RNA-seq.

- `kb ref` prepares everything needed for kb count, including:
    - downloading the transcriptome
    - generating the kallisto index
    - building the transcript-to-gene (t2g) mapping file

- `-d human`
    - Tells `kb-python` to automatically download a predefined HUMAN reference dataset.
    - No need to manually supply FASTA or GTF files.

- `-i reference/index.idx`
    - Specifies where to save the kallisto index, which is used for fast pseudoalignment.
- `-g reference/t2g.txt`
    - Creates a transcript-to-gene mapping file.
    - Bustools uses this file to convert transcript counts into gene counts.

### Alternative: Manual Reference Building

If you prefer to build the reference manually from specific genome and GTF files:

In [14]:
# OPTIONAL: Manual reference building (uncomment to use)
# Download genome and GTF from Ensembl
# !wget ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
# !wget ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz

# Build reference
# !kb ref \
#   -i reference/index.idx \
#   -g reference/t2g.txt \
#   -f1 reference/transcriptome.fa \
#   Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \
#   Mus_musculus.GRCm39.109.gtf.gz

In [15]:
# Get sample information
import os
import json
import subprocess

# Sample IDs from the SRA project
samples = [
    "SRR10278808", "SRR10278809", "SRR10278810", "SRR10278811", "SRR10278812",
    "SRR10278813", "SRR10278814", "SRR10278815", "SRR10278816", "SRR10278817",
    "SRR10278818", "SRR10278819", "SRR10278820", "SRR10278821", "SRR10278822",
    "SRR10278823", "SRR10278824", "SRR10278825", "SRR10278826", "SRR10278827",
    "SRR10278828", "SRR10278829", "SRR10278830", "SRR10278831", "SRR10278832",
    "SRR10278833", "SRR10278834", "SRR10278835", "SRR10278836", "SRR10278838",
    "SRR10278839"
]

# Total samples
print(f"Total samples: {len(samples)}")

print("Samples to download:")
for sample in samples:
    print(f"  - {sample}")

Total samples: 31
Samples to download:
  - SRR10278808
  - SRR10278809
  - SRR10278810
  - SRR10278811
  - SRR10278812
  - SRR10278813
  - SRR10278814
  - SRR10278815
  - SRR10278816
  - SRR10278817
  - SRR10278818
  - SRR10278819
  - SRR10278820
  - SRR10278821
  - SRR10278822
  - SRR10278823
  - SRR10278824
  - SRR10278825
  - SRR10278826
  - SRR10278827
  - SRR10278828
  - SRR10278829
  - SRR10278830
  - SRR10278831
  - SRR10278832
  - SRR10278833
  - SRR10278834
  - SRR10278835
  - SRR10278836
  - SRR10278838
  - SRR10278839


In [16]:
# Download FASTQ files using ffq
# Note: This will download large files (20-40 GB total)
# Make sure you have sufficient disk space and time
%%time
for sample in samples:
    print(f"\nDownloading {sample}...")

    # Get FTP URLs using ffq
    result = subprocess.run(
        ["ffq", "--ftp", sample],
        capture_output=True,
        text=True
    )

    # Parse the JSON output
    data = json.loads(result.stdout)

    # Download FASTQ files
    for entry in data:
        url = entry['url']
        filename = os.path.basename(url)
        output_path = f"raw_data/{filename}"

        if not os.path.exists(output_path):
            print(f"Downloading: {filename}")
            !wget -q --show-progress -O {output_path} {url}
        else:
            print(f"File already exists: {filename}")


Downloading SRR10278808...
Downloading: SRR10278808.fastq.gz

Downloading SRR10278809...
Downloading: SRR10278809.fastq.gz

Downloading SRR10278810...
Downloading: SRR10278810.fastq.gz

Downloading SRR10278811...
Downloading: SRR10278811.fastq.gz

Downloading SRR10278812...
Downloading: SRR10278812.fastq.gz

Downloading SRR10278813...
Downloading: SRR10278813.fastq.gz

Downloading SRR10278814...
Downloading: SRR10278814.fastq.gz

Downloading SRR10278815...
Downloading: SRR10278815.fastq.gz

Downloading SRR10278816...
Downloading: SRR10278816.fastq.gz

Downloading SRR10278817...


JSONDecodeError: Expecting value: line 1 column 1 (char 0)

### Alternative: Use SRA Toolkit (if ffq has issues)

In [17]:
# ALTERNATIVE: Download using SRA toolkit
# !pip install sra-tools -q

# for sample in samples:
#     print(f"Downloading {sample}...")
#     !fastq-dump --split-files --gzip --outdir raw_data {sample}

## Step 4: Quantify with kb-python (Pseudoalignment and UMI counting)

Now we'll use kb count to process the FASTQ files. Since this is 10x Chromium data, we'll use the `10xv2/v3` technology specification.

In [18]:
# Process each sample
# This step runs `kb` to pseudoalign the reads, and then generate the cells x gene matrix in h5ad format.
%%time
for sample in samples:
    print(f"\n{'='*60}")
    print(f"Processing sample: {sample}")
    print(f"{'='*60}\n")

    # Define input and output paths
    r1 = f"raw_data/{sample}_1.fastq.gz"
    r2 = f"raw_data/{sample}_2.fastq.gz"
    output_dir = f"output/{sample}"

    # Run kb count
    # Note: Adjust -x parameter if needed (try 10xv3 if 10xv2 doesn't work well)
    !kb count \
        -i reference/index.idx \
        -g reference/t2g.txt \
        -x 10xv3 \
        -o {output_dir} \
        --h5ad \
        -t 4 \
        {r1} {r2}

    print(f"\nCompleted processing: {sample}")
    print(f"Output directory: {output_dir}")


Processing sample: SRR10278808

[2026-01-20 21:11:26,392]    INFO [count] Using index reference/index.idx to generate BUS file to output/SRR10278808 from
[2026-01-20 21:11:26,392]    INFO [count]         raw_data/SRR10278808_1.fastq.gz
[2026-01-20 21:11:26,392]    INFO [count]         raw_data/SRR10278808_2.fastq.gz
[2026-01-20 21:11:27,495]   ERROR [count] 
Error: file not found raw_data/SRR10278808_1.fastq.gz
Error: file not found raw_data/SRR10278808_2.fastq.gz
kallisto 0.51.1
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-x, --technology=STRING       Single-cell technology used
-l, --list                    List all single-cell technologies supported
-B, --batch=FILE              Process files listed in FILE
-t, --threads=INT             Number 

**Understanding kb count parameters:**

- `-i`: Index file created in Step 2
- `-g`: Transcript-to-gene mapping file
- `-x 10xv3`: Technology specification (10x Chromium v3)
- `-o`: Output directory
- `--h5ad`: Generate AnnData h5ad file for easy loading in Python
- `-t 4`: Use 4 threads (adjust based on available resources)
- Last arguments: Read 1 and Read 2 FASTQ files