In [2]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
    library(Matrix)
    library(Seurat)
    library(R.utils)
})

“package ‘tidyr’ was built under R version 4.1.2”
“package ‘readr’ was built under R version 4.1.2”
“package ‘Matrix’ was built under R version 4.1.2”


This notebook provides example code for running *CAT on the counts matrix of an example Seurat object. A similar approach works for other R object types, such as SingleCellExperiment. In general, the counts matrix should be exported to a .mtx or .h5ad file for use with starCAT.py; the MTX export is shown below.

The default reference is TCAT.V1, a reference of programs curated from multiple T cell datasets.
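
For other object types, the export steps are the same once a genes-by-cells sparse matrix is in hand. The sketch below is a minimal illustration rather than part of starCAT: `get_counts_matrix` is a hypothetical helper, and its SingleCellExperiment branch assumes that package's `counts()` accessor is available. The demo runs on a toy dense matrix with made-up values.

```r
library(Matrix)

# Hypothetical helper (not part of starCAT): return a genes x cells sparse
# counts matrix from a SingleCellExperiment or from a plain matrix.
get_counts_matrix <- function(obj) {
  if (inherits(obj, "SingleCellExperiment")) {
    # Assumes the SingleCellExperiment package is installed
    obj <- SingleCellExperiment::counts(obj)
  }
  as(obj, "CsparseMatrix")
}

# Toy input with made-up values: 3 genes x 2 cells
toy <- matrix(c(0, 2, 0, 1, 0, 3), nrow = 3,
              dimnames = list(c("GeneA", "GeneB", "GeneC"),
                              c("cell1", "cell2")))
toy_counts <- get_counts_matrix(toy)
dim(toy_counts)
```

The resulting matrix can then be written out with `writeMM` in the same way as the Seurat counts in the export section.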

## Download example data

Download the Seurat object for a [small example dataset](https://zenodo.org/records/13368041) to a local directory:

In [9]:
data_dir = './Example_Data/'
dir.create(data_dir, recursive = TRUE)

In [8]:
library(curl)
curl_download('https://zenodo.org/records/13368041/files/COMBAT-CITESeq-DATA.Raw.T.ADTfixed20230831FiltForcNMF.Downsampled.rds?download=1',
              file.path(data_dir, 'example_data.rds'), 
              handle = new_handle(timeout = 300))

Using libcurl 7.82.0 with OpenSSL/3.0.3


Attaching package: ‘curl’


The following object is masked from ‘package:readr’:

    parse_date
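
When the notebook is re-run, the download can be skipped if the file is already present. The following is a minimal sketch under stated assumptions: `download_if_missing` and `stub_fetch` are hypothetical names, the default fetcher uses base R's `download.file` instead of `curl_download`, and the demo uses a placeholder URL with a stub fetcher so it runs without network access.

```r
# Hypothetical helper: skip the download when a non-empty file already exists.
# `fetch` is any function(url, dest) that writes the file; the default uses
# base R's download.file in binary mode.
download_if_missing <- function(url, dest,
                                fetch = function(url, dest) {
                                  utils::download.file(url, dest, mode = "wb")
                                }) {
  if (file.exists(dest) && file.size(dest) > 0) {
    return(invisible(FALSE))  # nothing downloaded
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  fetch(url, dest)
  invisible(TRUE)             # file was fetched
}

# Offline demo with a stub fetcher and a placeholder URL (no network access)
stub_fetch <- function(url, dest) writeLines("placeholder", dest)
demo_dest <- file.path(tempfile("starcat_demo_"), "example_data.rds")
first  <- download_if_missing("https://example.org/placeholder", demo_dest, stub_fetch)
second <- download_if_missing("https://example.org/placeholder", demo_dest, stub_fetch)
c(first = first, second = second)
```

In the notebook itself, the same helper could be called with the Zenodo URL and `file.path(data_dir, 'example_data.rds')`, leaving `fetch` at its default.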




## Output example data to MTX format

In [12]:
seu_path = paste0(data_dir, 'example_data.rds')
seu_path

In [13]:
seu_object = readRDS(seu_path)

In [14]:
seu_object

An object of class Seurat 
20957 features across 13800 samples within 1 assay 
Active assay: RNA (20957 features, 0 variable features)

In [15]:
seu_object@meta.data %>% colnames

In [16]:
# Extract the raw counts matrix (features x cells) from the RNA assay
counts = seu_object@assays$RNA@counts

In [17]:
counts[1:5, 1:5] 

5 x 5 sparse Matrix of class "dgCMatrix"
           L1_AAACCCACATGGATCT L1_AAACGAAAGATAACAC L1_AAACGCTTCTTGGTCC
AL627309.1                   .                   .                   .
AL669831.5                   .                   .                   .
LINC00115                    .                   .                   .
FAM41C                       .                   .                   .
NOC2L                        .                   .                   .
           L1_AAAGAACCAAGGAGTC L1_AAAGAACCACCTCTAC
AL627309.1                   .                   .
AL669831.5                   .                   .
LINC00115                    .                   .
FAM41C                       .                   .
NOC2L                        .                   .

In [18]:
# Output counts matrix
writeMM(counts, paste0(data_dir, 'matrix.mtx'))
gzip(paste0(data_dir, 'matrix.mtx'))

# Output cell barcodes
barcodes <- colnames(counts)
write_delim(as.data.frame(barcodes), paste0(data_dir, 'barcodes.tsv'),
           col_names = FALSE)
gzip(paste0(data_dir, 'barcodes.tsv'))

# Output feature names (gene ID, gene name, feature type), tab-delimited
gene_names <- rownames(counts)
features <- data.frame(gene_id = gene_names, gene_name = gene_names, type = "Gene Expression")
write_delim(features, paste0(data_dir, 'features.tsv'),
            delim = "\t", col_names = FALSE)
gzip(paste0(data_dir, 'features.tsv'))

NULL
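
Before handing the files to starCAT, it can be worth confirming that the export round-trips. The sketch below is an illustration on a toy matrix with made-up values, not the example dataset: it writes an MTX file and a barcodes file to a temporary directory, reads them back, and compares them with the in-memory matrix. It uses only the Matrix package and base R, and it skips the gzip step for simplicity.

```r
library(Matrix)

# Toy genes x cells counts with made-up values
toy_counts <- sparseMatrix(
  i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 1, 2),
  dims = c(3, 2),
  dimnames = list(c("GeneA", "GeneB", "GeneC"), c("cell1", "cell2"))
)

# Write the matrix and barcodes to a temporary directory
out_dir <- tempfile("mtx_check_")
dir.create(out_dir)
mtx_path <- file.path(out_dir, "matrix.mtx")
writeMM(toy_counts, mtx_path)
writeLines(colnames(toy_counts), file.path(out_dir, "barcodes.tsv"))

# Read back and compare against the in-memory matrix
# (the MTX format does not store row or column names)
reloaded <- readMM(mtx_path)
barcodes_back <- readLines(file.path(out_dir, "barcodes.tsv"))

same_dims   <- identical(dim(reloaded), dim(toy_counts))
same_values <- isTRUE(all.equal(as.matrix(reloaded),
                                unname(as.matrix(toy_counts))))
same_cells  <- identical(barcodes_back, colnames(toy_counts))
c(same_dims, same_values, same_cells)
```

Because the MTX file carries no names, the barcode and feature files are what tie rows and columns back to genes and cells, so their order must match the matrix.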

## Submit starCAT.py

The bash command should have the following form:
```starcat --reference "TCAT.V1" --counts "counts_fn" --output-dir "output_dir" --name "output_name"```

In [19]:
output_name = 'example_data'
counts_fn = paste0(data_dir, 'matrix.mtx.gz')

In [20]:
cmd = paste0('starcat', 
             ' --reference ', '"TCAT.V1"',
             ' --counts ', '"', counts_fn, '"', 
             ' --output-dir ', '"', data_dir, '"', 
             ' --name ', '"', output_name, '"' 
           )
cmd

In [5]:
# Submit starCAT command
system(cmd)
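
As an alternative to pasting quotes by hand, the same command can be assembled with base R's `shQuote` and run with `system2`, which returns the exit status. This is a sketch under stated assumptions, not the notebook's method: it redefines the variables from the cells above so that it is self-contained, and it only runs starCAT if the `starcat` executable is on the PATH and the counts file exists.

```r
output_name <- "example_data"
data_dir    <- "./Example_Data/"
counts_fn   <- paste0(data_dir, "matrix.mtx.gz")

# shQuote protects paths containing spaces or shell metacharacters
args <- c("--reference",  shQuote("TCAT.V1"),
          "--counts",     shQuote(counts_fn),
          "--output-dir", shQuote(data_dir),
          "--name",       shQuote(output_name))

if (nzchar(Sys.which("starcat")) && file.exists(counts_fn)) {
  # system2 returns the exit status when output is not captured
  status <- system2("starcat", args)
  if (status != 0) stop("starcat exited with status ", status)
} else {
  message("starcat or the counts file was not found; command would be: ",
          paste("starcat", paste(args, collapse = " ")))
}
```

Checking the exit status makes a failed run stop the notebook instead of surfacing later as a missing results file.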

## Load results into R

In [71]:
usage = read.table(paste0(data_dir, output_name, '.rf_usage_normalized.txt'))
scores = read.table(paste0(data_dir, output_name, '.scores.txt'))
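
By default, `read.table` infers the header and rewrites column names into syntactically valid R names. If the program names in the output files contain characters such as hyphens (an assumption here, although dotted names such as `CellCycle.G2M` in the printed output are consistent with it), passing the format explicitly with `check.names = FALSE` keeps them unchanged. The sketch below demonstrates this on a toy tab-delimited file with made-up values, not on real starCAT output.

```r
# Toy tab-delimited file with made-up values; the first header field is
# empty and the first column holds the row names
toy_path <- tempfile(fileext = ".txt")
writeLines(c("\tCellCycle-G2M\tTranslation",
             "cell1\t0.1\t0.9",
             "cell2\t0.4\t0.6"), toy_path)

# Explicit separator, header, and row names; keep column names as written
toy_usage <- read.table(toy_path, sep = "\t", header = TRUE,
                        row.names = 1, check.names = FALSE)
colnames(toy_usage)
```

Columns with non-syntactic names are then accessed with backticks or string indexing, for example `toy_usage[["CellCycle-G2M"]]`.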

In [72]:
usage %>% head(2)
scores %>% head()

| | CellCycle.G2M | Translation | HLA | ISG | Mito | Doublet.RBC | gdT | CellCycle.S | Cytotoxic | Doublet.Platelet | ⋯ | Tfh.2 | OX40.EBI3 | CD172a.MERTK | IEG3 | Doublet.Fibroblast | SOX4.TOX2 | CD40LG.TXNIP | Tph | Exhaustion | Tfh.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| L1_AAACCCACATGGATCT | 0.0004812625 | 0.1984751 | 0.05281601 | 0.001352083 | 0.02677662 | 0.0007320393 | 0.005636345 | 0.001266398 | 0.005563221 | 0.001743415 | ⋯ | 0.009453272 | 0.009752536 | 0.12944813 | 0.05215479 | 0.007152538 | 0.008011147 | 0.01265178 | 0.00567412 | 0.005282844 | 0.009074134 |
| L1_AAACGAAAGATAACAC | 0.0007382836 | 0.1261008 | 0.08261252 | 0.002197595 | 0.03655813 | 0.0006230267 | 0.003504475 | 0.003304409 | 0.007702728 | 0.001602188 | ⋯ | 0.001686609 | 0.0008899445 | 0.05706992 | 0.01008088 | 0.005119129 | 0.03362395 | 0.02996031 | 0.0002871153 | 0.000148319 | 0.024658555 |


| | ASA | Proliferation | ASA_binary | Proliferation_binary | Multinomial_Label |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` |
| L1_AAACCCACATGGATCT | 0.02440342 | 0.002129125 | False | False | CD4_EM |
| L1_AAACGAAAGATAACAC | 0.01139059 | 0.005833321 | False | False | CD8_Naive |
| L1_AAACGCTTCTTGGTCC | 0.01524094 | 0.011939801 | False | False | CD4_Naive |
| L1_AAAGAACCAAGGAGTC | 0.01869528 | 0.015882069 | False | False | CD4_Naive |
| L1_AAAGAACCACCTCTAC | 0.03353451 | 0.010161335 | False | False | Treg |
| L1_AAAGGATAGTTGTCAC | 0.05383545 | 0.009643 | False | False | CD4_EM |
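
One simple way to summarize a cells-by-programs usage table is to record the highest-usage program for each cell. The sketch below is an illustrative post-processing step, not something starCAT prescribes; it uses a toy table with made-up values and base R's `max.col`.

```r
# Toy usage table with made-up values (cells x programs)
toy_usage <- data.frame(
  Translation = c(0.60, 0.10, 0.30),
  Cytotoxic   = c(0.30, 0.70, 0.30),
  ISG         = c(0.10, 0.20, 0.40),
  row.names   = c("cell1", "cell2", "cell3")
)

# Index of the highest-usage program in each row; ties go to the first column
top_idx <- max.col(as.matrix(toy_usage), ties.method = "first")

# Named character vector: cell barcode -> top program
top_program <- setNames(colnames(toy_usage)[top_idx], rownames(toy_usage))
top_program
```

Because the result is named by cell barcode, it could be attached to the Seurat object as per-cell metadata, for example with `AddMetaData`.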


## End

In [3]:
sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

Matrix products: default
BLAS/LAPACK: /PHShome/mc1070/anaconda3/envs/R4.1.1Py3.9.7/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] R.utils_2.11.0     R.oo_1.24.0        R.methodsS3_1.8.1  SeuratObject_4.0.4
 [5] Seurat_4.1.0       Matrix_1.4-1       data.table_1.14.2  forcats_0.5.1     
 [9] stringr_1.4.0      dplyr_1.1.0        purrr_0.3.4        readr_2.1.2       
[13] tidyr_1.2.0