<a href="https://colab.research.google.com/github/pachterlab/seqspec/blob/devel/docs/UNIFORM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
title: Using seqspec
date: 2024-09-07
authors:
  - name: A. Sina Booeshaghi
---

`seqspec` enables uniform preprocessing of sequencing reads.

# Single-cell preprocessing
Single-cell data preprocessing is the procedure in which

1. Sequencing reads are aligned to a reference
2. Barcode errors are corrected
3. UMIs/reads are counted

The goal is to produce a count matrix, where rows are cells or samples and columns are biological features such as genes, proteins, or genomic regions.
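To make the three steps concrete, here is a minimal sketch in Python. It is not how any of the tools in this tutorial implement them. It assumes that reads have already been assigned to a gene (step 1), corrects barcodes that are one mismatch away from a hypothetical onlist (step 2), and counts distinct UMIs per cell and gene (step 3). All names and data below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy inputs (not taken from the dogmaseq-dig dataset).
ONLIST = {"AAAA", "CCCC", "GGGG"}
# Each record is (barcode, UMI, gene), i.e. the result of step 1 (alignment).
RECORDS = [
    ("AAAA", "TT", "geneA"),
    ("AAAT", "TT", "geneA"),  # one mismatch from AAAA, same UMI: a duplicate
    ("AAAA", "GG", "geneA"),
    ("CCCC", "TT", "geneB"),
    ("TTTT", "AC", "geneB"),  # not within one mismatch of any onlist barcode
]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(barcode, onlist):
    """Return the unique onlist barcode within Hamming distance 1, else None."""
    if barcode in onlist:
        return barcode
    hits = [bc for bc in onlist
            if len(bc) == len(barcode) and hamming(bc, barcode) == 1]
    return hits[0] if len(hits) == 1 else None


def count_umis(records, onlist):
    """Build {cell: {gene: number of distinct UMIs}} from aligned records."""
    umis = defaultdict(set)
    for barcode, umi, gene in records:
        cell = correct_barcode(barcode, onlist)
        if cell is not None:  # drop reads whose barcode cannot be corrected
            umis[(cell, gene)].add(umi)
    counts = defaultdict(dict)
    for (cell, gene), seen in umis.items():
        counts[cell][gene] = len(seen)
    return dict(counts)


counts = count_umis(RECORDS, ONLIST)
print(counts)
```

The nested dictionary plays the role of a sparse count matrix: outer keys are cells (rows) and inner keys are genes (columns).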

Many tools perform single-cell RNA-sequencing preprocessing. In this tutorial we use `kb-python`, `STARsolo`, and `simpleaf` together with `seqspec` to perform alignment and quantification. `kb-python` uses `kallisto` to perform read alignment and `bustools` to perform barcode error correction and UMI counting. `STARsolo` performs whole-genome alignment, barcode error correction, and counting. Like `kb-python`, `simpleaf` uses two separate tools under the hood: `salmon` to perform read alignment and `alevin-fry` to perform barcode error correction and UMI counting.

Throughout this tutorial we use the `dogmaseq-dig` dataset, which comes from a multimodal assay (RNA/ATAC/PROTEIN/TAG). The `seqspec` for this dataset is downloaded in the "Download `seqspec` for the `dogmaseq-dig` data" section below.


## Install tools

To understand how each tool works, review its code and manuscript:

| Tool | Code link | Manuscript link | Purpose |
|------|-----------|-----------------|---------|
| seqspec | [GitHub](https://github.com/pachterlab/seqspec) | [doi](https://doi.org/10.1093/bioinformatics/btae168) | Identify and extract elements in reads |
| kb-python | [GitHub](https://github.com/pachterlab/kb_python) | [doi](https://doi.org/10.1101/2023.11.21.568164) | Perform read alignment, error correction, and counting |
| gget | [GitHub](https://github.com/pachterlab/gget) | [doi](https://doi.org/10.1093/bioinformatics/btac836) | Fetch species-specific references |
| kallisto | [GitHub](https://github.com/pachterlab/kallisto) | [doi](https://doi.org/10.1038/nbt.3519) | Perform read alignment (used in kb-python) |
| bustools | [GitHub](https://github.com/BUStools/bustools) | [doi](https://doi.org/10.1038/s41587-021-00870-2) | Perform barcode error correction and UMI counting (used in kb-python) |
| BUS file | [GitHub](https://github.com/BUStools/BUS-format) | [doi](https://doi.org/10.1093/bioinformatics/btz279) | Store barcodes, UMIs, and read alignments (used in kb-python) |
| STARsolo | [GitHub](https://github.com/alexdobin/STAR/) | [doi](https://doi.org/10.1101/2021.05.05.442755) | Perform read alignment, error correction, and counting |
| simpleaf | [GitHub](https://github.com/COMBINE-lab/simpleaf) | [doi](https://doi.org/10.1093/bioinformatics/btad614) | Perform read alignment, error correction, and counting |

In [1]:
# Install kb-python, seqspec, gget
! pip install --quiet kb-python gget > /dev/null 2>&1  # installing kb-python autoinstalls kallisto and bustools
! pip install --quiet git+https://github.com/pachterlab/seqspec@devel > /dev/null 2>&1

# Verify installations
! seqspec --version
! kb --version
! gget --version

# Install STARsolo and verify installation
! wget --quiet --show-progress https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
! tar -xzf 2.7.11b.tar.gz > /dev/null 2>&1
! mv /content/STAR-2.7.11b/bin/Linux_x86_64/STAR /usr/bin
! STAR --version

# Install alevin-fry, simpleaf and verify installation
! curl -S --proto '=https' --tlsv1.2 -LsSf https://github.com/COMBINE-lab/alevin-fry/releases/download/v0.10.0/alevin-fry-installer.sh | sh > /dev/null 2>&1
! $HOME/.cargo/bin/alevin-fry --version

! curl -S --proto '=https' --tlsv1.2 -LsSf https://github.com/COMBINE-lab/simpleaf/releases/download/v0.17.2/simpleaf-installer.sh | sh > /dev/null 2>&1
%env ALEVIN_FRY_HOME="$HOME/.cargo/bin/alevin-fry"
! $HOME/.cargo/bin/simpleaf --version

seqspec 0.2.0
usage: kb [-h] [--list] <CMD> ...

kb_python 0.28.2

positional arguments:
  <CMD>
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

options:
  -h, --help  Show this help message and exit
  --list      Display list of supported single-cell technologies
gget version: 0.28.6
2.7.11b.tar.gz          [        <=>         ]  11.89M  7.43MB/s    in 1.6s    
2.7.11b
alevin-fry 0.10.0
env: ALEVIN_FRY_HOME="$HOME/.cargo/bin/alevin-fry"
simpleaf 0.17.2


## Download `seqspec` for the `dogmaseq-dig` data

In [2]:
! wget --quiet --show-progress https://raw.githubusercontent.com/pachterlab/seqspec/devel/examples/specs/dogmaseq-dig/spec.yaml



In [3]:
!seqspec print spec.yaml

                                                                  ┌─'ghost_protein_truseq_read1:0'
                                                                  ├─'protein_truseq_read1:33'
                                                                  ├─'protein_cell_bc:16'
                                 ┌─protein────────────────────────┤
                                 │                                ├─'protein_umi:12'
                                 │                                ├─'protein_seq:15'
                                 │                                └─'protein_truseq_read2:34'
                                 │                                ┌─'tag_truseq_read1:33'
                                 │                                ├─'tag_cell_bc:16'
                                 ├─tag────────────────────────────┼─'tag_umi:12'
                                 │                                ├─'tag_seq:15'
─────────────────────────────────┤               

## Single-cell/nuclei RNAseq quantification

### `kb-python (kallisto bustools)`

In [13]:
!seqspec file

usage: seqspec file [-h] [-o OUT] [-i IDs] -m MODALITY [-s SELECTOR] [-f FORMAT] [-k KEY] yaml

List files present in seqspec file.

Examples:
seqspec file -m rna spec.yaml                          # List paired read files
seqspec file -m rna -f interleaved spec.yaml           # List interleaved read files
seqspec file -m rna -f list -k url spec.yaml           # List urls of all read files
seqspec file -m rna -f list -s onlist -k all spec.yaml # List onlist files
---

positional arguments:
  yaml         Sequencing specification yaml file

options:
  -h, --help   show this help message and exit
  -o OUT       Path to output file
  -i IDs       Ids to list
  -s SELECTOR  Selector for ID, [read, region, file, onlist] (default: read)
  -f FORMAT    Format, [paired, interleaved, index, list], default: paired
  -k KEY       Key, [file_id, filename, filetype, filesize, url, urltype, md5, all], default: file_id

required arguments:
  -m MODALITY  Modality


In [18]:
!seqspec file -m tag -f list -s onlist -k all spec.yaml

tag_cell_bc	RNA-737K-arc-v1.txt	RNA-737K-arc-v1.txt	txt	2142553	https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz	https	a88cd21e801ae6f9a7d9a48b67ccf693
tag_seq	tag_0419_feature_barcodes.txt	tag_0419_feature_barcodes.txt	txt	0	https://raw.githubusercontent.com/pachterlab/seqspec/devel/examples/specs/dogmaseq-dig/tag_0419_feature_barcodes.txt	https	de44ad6d5c4b9f381a352283a6831112
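The tab-separated listing above is easier to work with once each row is turned into a record. The sketch below parses the two rows shown above into dictionaries. The column names are an assumption: the region ID followed by the keys that the `-k` option lists in the `seqspec file` help text. `parse_file_listing` is a hypothetical helper, not part of `seqspec`.

```python
# Assumed column order for `-k all`: region ID, then the keys from the help text.
COLUMNS = ["region_id", "file_id", "filename", "filetype",
           "filesize", "url", "urltype", "md5"]


def parse_file_listing(text):
    """Turn tab-separated `seqspec file -k all` output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(COLUMNS, line.split("\t"))))
    return rows


# The two output rows shown above, pasted in as a string.
listing = (
    "tag_cell_bc\tRNA-737K-arc-v1.txt\tRNA-737K-arc-v1.txt\ttxt\t2142553\t"
    "https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz\t"
    "https\ta88cd21e801ae6f9a7d9a48b67ccf693\n"
    "tag_seq\ttag_0419_feature_barcodes.txt\ttag_0419_feature_barcodes.txt\ttxt\t0\t"
    "https://raw.githubusercontent.com/pachterlab/seqspec/devel/examples/specs/"
    "dogmaseq-dig/tag_0419_feature_barcodes.txt\thttps\t"
    "de44ad6d5c4b9f381a352283a6831112\n"
)

rows = parse_file_listing(listing)
for row in rows:
    print(row["region_id"], row["url"])
```

In a notebook, the same string could be obtained by capturing the command output instead of pasting it.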


In [16]:
!seqspec file -m rna -f list -s onlist -k all spec.yaml | cut -f 6 | xargs curl -sL | zcat > onlist_rna.txt

https://github.com/pachterlab/qcbc/raw/main/tests/10xMOME/RNA-737K-arc-v1.txt.gz


In [6]:
! # seqspec commands to get onlist, technology string, and files
! w=$(seqspec onlist -m rna -o onlist.txt -s region-type -i barcode spec.yaml) && echo "Onlist: " $w

! x=$(seqspec index -m rna -t kb -s file spec.yaml) && echo "Technology string: " $x

! f=$(seqspec file -m rna -s read -f paired -k url spec.yaml  | tr "\t\n" "  ") && echo "Files: " $f

Onlist:  /content/RNA-737K-arc-v1.txt
Technology string:  0,0,16:0,16,28:1,0,102
Files:  https://github.com/pachterlab/seqspec/raw/devel/examples/specs/dogmaseq-dig/fastqs/rna_R1_SRR18677638.fastq.gz https://github.com/pachterlab/seqspec/raw/devel/examples/specs/dogmaseq-dig/fastqs/rna_R2_SRR18677638.fastq.gz
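The technology string printed above can be read as three colon-separated parts, barcode, UMI, and biological sequence, each given as `file index,start,stop` triplets. This reading is an assumption based on the custom technology format accepted by `kallisto bus -x`, where positions are 0-based and the stop is exclusive. The sketch below decodes the string under that assumption; `Segment` and `parse_technology` are hypothetical names, not part of `seqspec` or `kb-python`.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    file_index: int  # which FASTQ file (0 = first file, 1 = second file)
    start: int       # 0-based start position in the read
    stop: int        # stop position (exclusive)


def parse_technology(tech):
    """Split 'bc:umi:seq' into named lists of (file_index, start, stop) segments."""
    parts = tech.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated parts")
    layout = {}
    for name, part in zip(("barcode", "umi", "sequence"), parts):
        values = [int(v) for v in part.split(",")]
        if len(values) % 3 != 0:
            raise ValueError(f"{name}: expected groups of three integers")
        # A part may hold several triplets, so read the values in groups of three.
        layout[name] = [Segment(*values[i:i + 3]) for i in range(0, len(values), 3)]
    return layout


layout = parse_technology("0,0,16:0,16,28:1,0,102")
for name, segments in layout.items():
    for seg in segments:
        print(name, "file", seg.file_index, "length", seg.stop - seg.start)
```

Under this reading, the barcode occupies the first 16 bases of the first file, the UMI the next 12 bases, and the biological sequence the first 102 bases of the second file.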


In [None]:
# standard reference
! kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)

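# seqspec commands to get onlist, technology string, and files.
# Each `!` line runs in its own subshell, so shell variables do not persist
# between lines; capture the outputs as Python variables instead.
w = !seqspec onlist -m rna -o onlist.txt -s region-type -i barcode spec.yaml
x = !seqspec index -m rna -t kb -s file spec.yaml
f = !seqspec file -m rna -s read -f paired -k url spec.yaml
w, x, f = w[0], x[0], " ".join(" ".join(f).split())

print("Onlist:", w)
print("Technology string:", x)
print("Files:", f)

# standard quantification (the file list is left unquoted so that each FASTQ
# URL is passed to kb count as a separate argument)
! kb count --h5ad -t 16 -m 32G -i index.idx -g t2g.txt -o kb_out -x "$x" -w "$w" $f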