In [None]:
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

# Building Count Matrices with `cellatlas`
A major challenge in uniformly preprocessing large amounts of single-cell genomics data from a variety of assays is identifying and handling sequenced elements in a coherent and consistent fashion. For example, cell barcodes in RNA-seq reads from 10x Multiome must be extracted and error corrected in the same manner as cell barcodes in ATAC-seq reads from 10x Multiome so that barcode-barcode registration can occur. Uniform processing of this kind minimizes computational variability and enables cross-assay comparisons.

In this notebook we demonstrate how single-cell genomics data can be preprocessed to generate a cell by feature count matrix. This requires:

1. FASTQ files
2. `seqspec` specification for the FASTQ files
3. Genome Sequence FASTA
4. Genome Annotation GTF
5. (optional) Feature barcode list
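Before running any of the workflows, it can help to confirm that every required input is actually on disk. The sketch below is only an illustration: `missing_inputs` is a hypothetical helper (not part of `cellatlas` or `seqspec`), and all file paths are placeholders to be replaced with your own.

```r
# Hypothetical helper: return the entries of `paths` that do not exist on disk.
missing_inputs <- function(paths) {
  paths <- unlist(paths, use.names = TRUE)
  paths[!file.exists(paths)]
}

# Placeholder paths (assumptions); replace them with your own files.
inputs <- list(
  fastq_r1 = "fastqs/R1.fastq.gz",
  fastq_r2 = "fastqs/R2.fastq.gz",
  seqspec  = "spec.yaml",
  fasta    = "genome.fa.gz",
  gtf      = "genes.gtf.gz"
)

missing <- missing_inputs(inputs)
if (length(missing) > 0) {
  message("Missing inputs: ", paste(names(missing), collapse = ", "))
}
```

The optional feature barcode list can be added to `inputs` in the same way when the assay needs it.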

# Install Packages
The vignette makes use of two non-standard command line tools, [`jq`](https://jqlang.github.io/jq/) and [`tree`](https://mama.indstate.edu/users/ice/tree/). The code cell below installs these tools on a Linux operating system and must be adapted for macOS and Windows users.

In [None]:
# Install `tree` to view files
# (this only downloads the archive; it still has to be unpacked and built)
# system("wget --quiet --show-progress ftp://mama.indstate.edu/linux/tree/tree-2.1.0.tgz")

# Install `jq`, a command-line tool for extracting key-value pairs from JSON files
# system("wget --quiet --show-progress https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64")
# system("chmod +x jq-linux64 && mv jq-linux64 /usr/local/bin/jq")

Next, we install `cellatlas` and the remaining dependencies.

In [None]:
# Clone the cellatlas repo and install the package
system("git clone https://github.com/cellatlas/cellatlas.git")
system("cd cellatlas && pip install --quiet .")

# Install dependencies
system("pip uninstall --quiet --yes seqspec")
system("pip install --quiet git+https://github.com/IGVF/seqspec.git")
system("pip install --quiet gget kb-python")

# Preprocessing Workflows
Below, we provide technology-specific preprocessing workflows that use `seqspec` to describe the sequence and structure of each genomic library and `cellatlas` to generate pipelines that process raw data into a count matrix. Currently, example workflows exist for data generated using [SPLiT-seq](#rna_10xsplitseq), [ClickTags](#clicktag), [Chromium V3](#rna_10xv3), [Chromium Single Cell ATAC-seq](#atac_10x), [Chromium Single Cell ATAC Multiome ATAC + Gene Expression](#atac_10xmulti), [Chromium Single Cell CRISPR Screening](#crispr_10x), and [Chromium Nuclei Isolation](#nuclei).

Currently, [Visium](#rna_10xvisium) is the only spatial technology with an available `seqspec`.  

More information on how to generate a `seqspec` for technologies not listed here is available on [GitHub](https://github.com/IGVF/seqspec).

## <a id="rna_10xvisium"></a>Preprocessing for Visium

### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.
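
As a minimal sketch of this step, the cell below assembles the `seqspec print` call and runs it only when both the tool and the file are available. The path `spec.yaml` is a placeholder, not a file shipped with this notebook.

```r
spec <- "spec.yaml"  # placeholder path to the seqspec file (assumption)

# Quote the path so that spaces or special characters survive the shell.
print_cmd <- paste("seqspec print", shQuote(spec))

# Only run the command when the tool and the file are both available;
# otherwise just report what would have been executed.
if (nzchar(Sys.which("seqspec")) && file.exists(spec)) {
  system(print_cmd)
} else {
  message("Would run: ", print_cmd)
}
```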

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 
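
One way to fetch the references is with `gget ref`, which was installed above. The sketch below is illustrative only: the species `homo_sapiens` is an assumption and should be replaced with the species of your sample, and the `jq` filters assume that the JSON written by `gget ref` nests the download links under `genome_dna` and `annotation_gtf`.

```r
species  <- "homo_sapiens"  # assumption: replace with the species of your sample
ref_json <- "ref.json"      # placeholder output file

# Ask gget for the genome FASTA (dna) and annotation (gtf) links.
ref_cmd <- paste("gget ref -o", shQuote(ref_json), "-w dna,gtf", species)

# jq filters for the download links; the key names are assumed, so check
# them against the contents of ref.json before relying on them.
fa_filter  <- sprintf(".%s.genome_dna.ftp", species)
gtf_filter <- sprintf(".%s.annotation_gtf.ftp", species)

if (nzchar(Sys.which("gget")) && nzchar(Sys.which("jq"))) {
  system(ref_cmd)
  fa_url  <- system(paste("jq -r", shQuote(fa_filter), shQuote(ref_json)), intern = TRUE)
  gtf_url <- system(paste("jq -r", shQuote(gtf_filter), shQuote(ref_json)), intern = TRUE)
} else {
  message("Would run: ", ref_cmd)
}
```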

### Build the pipeline
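The inputs listed at the top of this notebook come together in a single `cellatlas build` call. The sketch below only assembles and prints the command; `build_cmd` is a hypothetical helper, the modality `rna` is an assumption for this assay, and every path is a placeholder.

```r
# Hypothetical helper: assemble a `cellatlas build` command line.
build_cmd <- function(out, modality, spec, fasta, gtf, fastqs,
                      feature_barcodes = NULL) {
  args <- c(
    "cellatlas build",
    "-o", shQuote(out),       # output directory
    "-m", modality,           # modality, e.g. "rna"
    "-s", shQuote(spec),      # seqspec file
    "-fa", shQuote(fasta),    # genome FASTA
    "-g", shQuote(gtf)        # genome annotation GTF
  )
  # The feature barcode list is optional.
  if (!is.null(feature_barcodes)) {
    args <- c(args, "-fb", shQuote(feature_barcodes))
  }
  # FASTQ files are passed last as positional arguments.
  paste(c(args, shQuote(fastqs)), collapse = " ")
}

cmd <- build_cmd(
  out = "out", modality = "rna", spec = "spec.yaml",
  fasta = "genome.fa.gz", gtf = "genes.gtf.gz",
  fastqs = c("fastqs/R1.fastq.gz", "fastqs/R2.fastq.gz")
)
message(cmd)
```

Once the printed command looks right, it can be passed to `system(cmd)`.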

## <a id="rna_10xsplitseq"></a>Preprocessing for SPLiT-seq
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline

## <a id="clicktag"></a>Preprocessing for ClickTags
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline

## <a id="rna_10xv3"></a>Preprocessing for Chromium V3 Chemistry
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline

## <a id="atac_10x"></a>Preprocessing for Chromium Single Cell ATAC-seq
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline

## <a id="atac_10xmulti"></a>Preprocessing for Chromium Single Cell ATAC Multiome ATAC
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline

## <a id="crispr_10x"></a>Preprocessing for Chromium Single Cell CRISPR Screening
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline

## <a id="nuclei"></a>Preprocessing for Chromium Nuclei Isolation
### Examine the spec
We first use `seqspec print` to check that the read structure matches what we expect. This command prints out an ordered tree representation of the sequenced elements contained in the FASTQ files. Note that the names of the nodes in the `seqspec` must match the names of the FASTQ files.

### Fetch the references
This step is only necessary if the modality that we are processing uses a transcriptome reference-based alignment. 

### Build the pipeline