## Motivation {.unnumbered}
Introduction to CellRanger

In [2]:
## setup environment
suppressMessages({
library(tidyverse)
library(knitr)})

## function to print contents
cat_content <- function(dir, n = -1L){
    content <- readLines(dir, n = n, warn = FALSE)
    cat(content, sep = "\n")}

## Retrieve Raw Sequencing Files

In this section, we download the sequencing samples CV10 and CV12 highlighted in [@sec-Introduction], which correspond to the BioProject IDs 
[PRJNA1040901]() and [PRJNA1040899](), respectively. Since the raw sequencing files are deposited in the Sequence Read Archive (SRA), we can use the SRA Toolkit to transfer the data directly from the cloud server to a local or remote host. 

### Sequencing Run Identifiers

Before we can proceed with the download, the SRA Toolkit needs the sequencing run identifier of each library, which typically starts with the prefix "SRR". With multi-omic data, one sequencing run identifier is generated for each sequencing library (GEX, ADT, BCR, TCR, etc.). To retrieve the relevant identifiers, we downloaded the [SraRunTable.txt](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA1040888&o=acc_s%3Aa) file for this study from the NCBI SRA Run Selector.

In [11]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/", recurse = 0)

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/
├── cellranger
├── logs
└── raw_fastq


In [17]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input
├── GSE247910_ADT_HTO_details_README.txt
├── GSE247912_ADT_HTO_details_README.txt
├── SRR_Acc_List.txt
└── SraRunTable.txt


As the structure of SraRunTable.txt below shows, sequencing run identifiers and BioProject IDs are stored in the \<Run\> and \<BioProject\> columns. Next, we extract the run identifiers of our two samples from this table and store them in a .txt file, one identifier per line.

In [7]:
## see structure of SraRunTable.txt
sra <- read.csv("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/SraRunTable.txt", sep = ",")
str(sra)

'data.frame':	96 obs. of  34 variables:
 $ Run                           : chr  "SRR26852594" "SRR26852595" "SRR26852596" "SRR26852597" ...
 $ Assay.Type                    : chr  "OTHER" "OTHER" "OTHER" "OTHER" ...
 $ AvgSpotLen                    : int  160 160 152 160 160 152 161 161 128 160 ...
 $ Bases                         : num  3.98e+08 6.37e+08 4.72e+08 3.78e+09 5.25e+08 ...
 $ BioProject                    : chr  "PRJNA1040889" "PRJNA1040889" "PRJNA1040889" "PRJNA1040889" ...
 $ BioSample                     : chr  "SAMN38270931" "SAMN38270932" "SAMN38270933" "SAMN38270934" ...
 $ Bytes                         : num  1.62e+08 2.57e+08 1.94e+08 1.54e+09 2.13e+08 ...
 $ cell_type                     : chr  "PBMCs" "PBMCs" "PBMCs" "PBMCs" ...
 $ Center.Name                   : chr  "KORALOV, PATHOLOGY, NYU LANGONE" "KORALOV, PATHOLOGY, NYU LANGONE" "KORALOV, PATHOLOGY, NYU LANGONE" "KORALOV, PATHOLOGY, NYU LANGONE" ...
 $ Collection_Date               : chr  "missing" "missing

In [8]:
# select CV10, CV12 by BioProject ID
# (the vector is not named `select`, to avoid confusion with dplyr::select)
projects <- c("PRJNA1040901", "PRJNA1040899")
sra <- sra %>% filter(BioProject %in% projects) %>% select(BioProject, Run, library_type, treatment)
sra

| BioProject | Run | library_type | treatment |
|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| PRJNA1040901 | SRR26844707 | TCRgd | COVID-19 |
| PRJNA1040901 | SRR26844708 | TCRab | COVID-19 |
| PRJNA1040901 | SRR26844709 | GEX | COVID-19 |
| PRJNA1040901 | SRR26844710 | BCR | COVID-19 |
| PRJNA1040901 | SRR26844711 | ADT | COVID-19 |
| PRJNA1040899 | SRR26844884 | TCRgd | SARS-CoV-2 vaccine |
| PRJNA1040899 | SRR26844885 | TCRab | SARS-CoV-2 vaccine |
| PRJNA1040899 | SRR26844886 | GEX | SARS-CoV-2 vaccine |
| PRJNA1040899 | SRR26844887 | BCR | SARS-CoV-2 vaccine |
| PRJNA1040899 | SRR26844888 | ADT | SARS-CoV-2 vaccine |


In [15]:
## store identifiers per line in a txt file
write.table(sra$Run, "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/SRR_Acc_List.txt", col.names = F, row.names = F, quote = F)
cat_content("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/SRR_Acc_List.txt")

SRR26844707
SRR26844708
SRR26844709
SRR26844710
SRR26844711
SRR26844884
SRR26844885
SRR26844886
SRR26844887
SRR26844888
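
Before handing the list to the SRA Toolkit, it can be worth confirming that every line looks like a run accession. The following is a minimal bash sketch under the assumption that all runs use the "SRR" prefix followed by digits (runs submitted through other archives may use different prefixes); the function name `check_accessions` is hypothetical and not part of the SRA Toolkit.

```bash
# check_accessions FILE: hypothetical helper, not an SRA Toolkit command.
# Prints every line that is not "SRR" followed by digits and returns
# non-zero if such a line exists or if the file is empty or missing.
check_accessions() {
    local list="$1"
    [ -s "$list" ] || { echo "empty or missing: $list" >&2; return 1; }
    # -E: extended regex, -x: match whole lines, -v: print non-matching lines
    if grep -Evx 'SRR[0-9]+' "$list" >&2; then
        return 1    # grep found at least one malformed line
    fi
    return 0
}
```

It could be called as `check_accessions SRR_Acc_List.txt && echo "list looks fine"` before submitting the download job.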


### Download with SRA Toolkit
To download each sequencing library by its run identifier, we used the fasterq-dump command from the SRA Toolkit with multiple threads (-e) to speed up the download. Adding the --split-files argument is essential for most 10X libraries, as the downstream pipeline requires each read of a spot (Read 1, Read 2 and, where present, the index reads) to be kept in a separate fastq file. The --include-technical argument keeps reads flagged as technical, which for 10X data typically include the read carrying the cell barcode and UMI. Below is a bash script to perform the download.

In [16]:
## show contents of sra toolkit script
cat_content("/camp/home/hungm/nemo-pipelines/datarepo/sratools/sratools.sh")

#!/bin/bash
#SBATCH --job-name=sratools
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:0
#SBATCH --mem=200G
#SBATCH --partition=ncpu
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=matthew.hung@crick.ac.uk

############## edit the following ##################
export accession=matthew/MH_GSE247917
export PRJ=/camp/home/hungm/scratch/hungm/${accession}
#####################################################

source /camp/home/hungm/nemo-pipelines/piplog.sh

which fasterq-dump
mkdir -p ${PRJ}/raw_fastq/
cd ${PRJ}/raw_fastq/
for i in $(cat ${PRJ}/logs/input/SRR_Acc_List.txt); do
	# download one run, writing each read of a spot to its own fastq file
	~/.conda/envs/sratools/bin/fasterq-dump $i -e 32 --include-technical --split-files
	# compress the fastq files of this run
	gzip ${i}*.fastq
done
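
The same download loop can be wrapped in a function that stops at the first failure and can be previewed without network access. This is only a sketch: `download_runs`, `FASTERQ_DUMP` and `DRY_RUN` are hypothetical names introduced here, while the fasterq-dump options (`-e`, `--include-technical`, `--split-files`, `-O`) are the ones used in the script plus the output-directory option.

```bash
# download_runs LIST OUTDIR: sketch of a per-run download-and-compress loop.
# FASTERQ_DUMP (path to the binary) and DRY_RUN (set to 1 to only print the
# commands) are hypothetical environment variables used for illustration.
download_runs() {
    local list="$1" outdir="$2" run
    local tool="${FASTERQ_DUMP:-fasterq-dump}"
    mkdir -p "$outdir" || return 1
    while IFS= read -r run; do
        [ -n "$run" ] || continue    # skip blank lines
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$tool $run -e 32 --include-technical --split-files -O $outdir"
            continue
        fi
        # stop at the first failed download instead of compressing partial output
        "$tool" "$run" -e 32 --include-technical --split-files -O "$outdir" || return 1
        gzip "$outdir/${run}"*.fastq || return 1
    done < "$list"
}
```

Running `DRY_RUN=1 download_runs SRR_Acc_List.txt raw_fastq` prints one command per accession, which is a cheap check before a multi-day SLURM job is submitted.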


In [39]:
## submit bash script to download raw fastq files
system("cd /camp/home/hungm/nemo-pipelines/datarepo/sratools/; sbatch sratools.sh")

The fastq files are now downloaded to the following directory.

In [4]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq
├── SRR26844707_1.fastq.gz
├── SRR26844707_2.fastq.gz
├── SRR26844707_3.fastq.gz
├── SRR26844708_1.fastq.gz
├── SRR26844708_2.fastq.gz
├── SRR26844708_3.fastq.gz
├── SRR26844709_1.fastq.gz
├── SRR26844709_2.fastq.gz
├── SRR26844709_3.fastq.gz
├── SRR26844710_1.fastq.gz
├── SRR26844710_2.fastq.gz
├── SRR26844710_3.fastq.gz
├── SRR26844711_1.fastq.gz
├── SRR26844711_2.fastq.gz
├── SRR26844884_1.fastq.gz
├── SRR26844884_2.fastq.gz
├── SRR26844884_3.fastq.gz
├── SRR26844885_1.fastq.gz
├── SRR26844885_2.fastq.gz
├── SRR26844885_3.fastq.gz
├── SRR26844886_1.fastq.gz
├── SRR26844886_2.fastq.gz
├── SRR26844886_3.fastq.gz
├── 
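
In the listing, SRR26844711 has two split files while the other runs shown have three, which matters later when the files are renamed. A small bash sketch for summarising this per run is given below; `count_reads_per_run` is a hypothetical helper and assumes the `<run>_<n>.fastq.gz` naming produced by --split-files.

```bash
# count_reads_per_run DIR: prints "<run> <n_files>" for each run prefix in DIR,
# based on files named <run>_<n>.fastq.gz. Hypothetical helper for illustration.
count_reads_per_run() {
    local dir="$1" f base
    for f in "$dir"/*_[0-9].fastq.gz; do
        [ -e "$f" ] || continue           # the glob matched nothing
        base=$(basename "$f")
        echo "${base%_[0-9].fastq.gz}"    # strip the _<n>.fastq.gz suffix
    done | sort | uniq -c | awk '{print $2, $1}'
}
```

Runs reported with two files and runs reported with three files may need different read-type mappings when they are renamed for cellranger.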

## Multi-Pipeline Configurations
We have set up a reference pipeline/system [see [GitHub]()] to run cellranger-multi (V7.0.1). Note that the multi.sh script shown in Step 7 loads CellRanger/8.0.0. The purpose of each file is listed below:
  
> * batch.sh - a bash script to set up config.csv and multi.sh for each sequencing sample
> * batch_id.txt - a txt file containing the names of each sequencing sample
> * config.csv - configuration file for cellranger-multi [see [cellranger-multi]()]
> * fastqformat.sh - renames fastq files for cellranger [see [cellranger fastq names]()]
> * feature_reference.csv - a reference configuration file for feature barcoding [see [feature barcoding]()]
> * library.csv - a reference configuration file for the config.csv [libraries] section
> * multi.sh - a bash script to run the cellranger-multi command
> * tcrgd_primers.txt - default TCR-GD primer (5P v1.1) library [see [TCR-GD primers](https://kb.10xgenomics.com/hc/en-us/articles/360015793931-Can-I-detect-T-cells-with-gamma-delta-chains-in-my-V-D-J-data)]

In [20]:
fs::dir_tree(path = "/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/")

/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/
├── batch.sh
├── batch_id.txt
├── config.csv
├── fastqformat.sh
├── feature_reference.csv
├── library.csv
├── multi.sh
├── readme.txt
└── tcrgd_primers.txt


Outlined below are the steps to run cellranger-multi for multiple sequencing samples with this system:

> 1. Rename fastq files to appropriate names with "fastqformat.sh"
> 2. Specify sequencing sample names in the "batch_id.txt" file
> 3. Set up "config.csv" and specify paths for reference genomes
> 4. Set the correct oligo tag/sequence in "feature_reference.csv" if necessary
> 5. Set the correct library paths in "library.csv" if necessary
> 6. Modify the script and run "source batch.sh" to set up scripts per sample
> 7. Do a final check and run "sbatch */multi.sh" to submit jobs

### <font color='grey'>Step1 -</font> fastqformat.sh {.unnumbered}
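For cellranger to recognise the fastq files, their names must follow the pattern it expects, here "\<sample\>_S1_\<read\>_001.fastq.gz", where \<read\> is R1, R2, I1 or I2. The fastqformat.sh script performs this renaming automatically for 10X fastqs downloaded from SRA by mapping the "_1" to "_4" suffixes produced by --split-files to R1, R2, I1 and I2. Because the order of the reads in the split files can differ between libraries, we inspect the read lengths first and adjust the mapping where needed.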

In [21]:
## show fastqformat.sh
cat_content('/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/fastqformat.sh')

for file in *_1.fastq.gz; do
  newname=$(echo "$file" | sed 's/_1.fastq.gz/_S1_R1_001.fastq.gz/')
  mv "$file" "$newname"
done

for file in *_2.fastq.gz; do
  newname=$(echo "$file" | sed 's/_2.fastq.gz/_S1_R2_001.fastq.gz/')
  mv "$file" "$newname"
done

for file in *_3.fastq.gz; do
  newname=$(echo "$file" | sed 's/_3.fastq.gz/_S1_I1_001.fastq.gz/')
  mv "$file" "$newname"
done

for file in *_4.fastq.gz; do
  newname=$(echo "$file" | sed 's/_4.fastq.gz/_S1_I2_001.fastq.gz/')
  mv "$file" "$newname"
done
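
Because the suffix-to-read mapping is not the same for every library, one option is to pass the mapping explicitly instead of editing the script. The sketch below is a hypothetical alternative (`rename_split_fastq` is not part of the pipeline); it uses bash parameter expansion rather than sed and silently skips suffixes for which no file exists.

```bash
# rename_split_fastq DIR N=READ...: rename <run>_<N>.fastq.gz files in DIR to
# <run>_S1_<READ>_001.fastq.gz. Hypothetical helper; the mapping is given on
# the command line, e.g. "1=R1 2=R2" or "1=I1 2=R1 3=R2".
rename_split_fastq() {
    local dir="$1" pair n rtype f stem
    shift
    for pair in "$@"; do
        n="${pair%%=*}"        # text before "=": split-file index
        rtype="${pair#*=}"     # text after "=": read type expected by cellranger
        for f in "$dir"/*_"$n".fastq.gz; do
            [ -e "$f" ] || continue          # nothing to rename for this index
            stem="${f%_${n}.fastq.gz}"       # path without the _<N>.fastq.gz suffix
            mv -- "$f" "${stem}_S1_${rtype}_001.fastq.gz"
        done
    done
}
```

For example, `rename_split_fastq raw_fastq/ADT 1=R1 2=R2` would reproduce the default mapping for a library with two files. Renamed files end in "_001.fastq.gz" and are therefore not matched again by later patterns.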


In [22]:
## separate ADT libraries
system("
    cd /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ \n 
    mkdir ADT \n 
    mv SRR26844888* ADT \n
    mv SRR26844711* ADT \n")

In [24]:
## View ADT fastq R1
system("
zcat /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ADT/SRR26844711_1.fastq.gz | head -n 3;
zcat /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ADT/SRR26844711_2.fastq.gz | head -n 3")

@SRR26844711.1 A01581:168:HCWFHDRX2:1:2101:1108:1000 length=28  
GNCCTCAAGCTTTGGTTCGTTAGCGTCT  
+SRR26844711.1 A01581:168:HCWFHDRX2:1:2101:1108:1000 length=28  
F#FFFFFFFFF:FFFFFFFFFFFFFF,,  
@SRR26844711.2 A01581:168:HCWFHDRX2:1:2101:1127:1000 length=28  

@SRR26844711.1 A01581:168:HCWFHDRX2:1:2101:1108:1000 length=90  
NAGCTCCGTCCTCCGAATCATGTTGGTAAACACGCCCATATAAGAAAACGCTAACGAACCAAAGCTTGAGGACAGATCGGAAGAGAGTCG  
+SRR26844711.1 A01581:168:HCWFHDRX2:1:2101:1108:1000 length=90  
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF,FFFFFFFFFF,  
@SRR26844711.2 A01581:168:HCWFHDRX2:1:2101:1127:1000 length=90  

In [24]:
## execute script
system("
    cd /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ADT \n 
    source /camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/fastqformat.sh \n 
    mv * .. \n
    cd /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ \n 
    rm -r ADT
    ")

In [None]:
zcat /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/SRR26844710_1.fastq.gz | head -n 5
zcat /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/SRR26844710_2.fastq.gz | head -n 5
zcat /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/SRR26844710_3.fastq.gz | head -n 5

@SRR26844710.1 A01581:167:HCG2TDMXY:1:1101:1262:1016 length=10  
NCAGTAACTA  
+SRR26844710.1 A01581:167:HCG2TDMXY:1:1101:1262:1016 length=10  
#F:FFFFFFF  
@SRR26844710.2 A01581:167:HCG2TDMXY:1:1101:1606:1016 length=10  
  
@SRR26844710.1 A01581:167:HCG2TDMXY:1:1101:1262:1016 length=28  
ANGGCCATCGTTACAGAGCCTCAATCTT  
+SRR26844710.1 A01581:167:HCG2TDMXY:1:1101:1262:1016 length=28  
F#FFFFFFFFFFFFFFFFFFFFFFFFFF  
@SRR26844710.2 A01581:167:HCG2TDMXY:1:1101:1606:1016 length=28  
  
@SRR26844710.1 A01581:167:HCG2TDMXY:1:1101:1262:1016 length=90  
NTTTGAACACTCTAATTTTTTCAAAGTAAACGCTTCGGGCCCCGCGGGACACTCAGCTAAGAGCATCGAGGGGGCGCCGAGAGGCAAGGG  
+SRR26844710.1 A01581:167:HCG2TDMXY:1:1101:1262:1016 length=90  
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF  
@SRR26844710.2 A01581:167:HCG2TDMXY:1:1101:1606:1016 length=90  
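
The read lengths above (10, 28 and 90 bases for files _1, _2 and _3) suggest which file is the index read and which are Read 1 and Read 2. A minimal bash sketch for automating this inspection is shown below. The function names are hypothetical, and the length cut-offs are assumptions taken from this dataset only, not a general rule for every 10X chemistry.

```bash
# first_read_length FILE: length of the first sequence in a gzipped fastq.
first_read_length() {
    # line 2 of a fastq record is the sequence; awk stops after printing it
    gzip -dc "$1" | awk 'NR == 2 { print length($0); exit }'
}

# guess_read_type FILE: map the lengths observed in this dataset to a read type.
# Assumption: <= 10 bases is an index read, <= 28 bases is Read 1, longer is Read 2.
guess_read_type() {
    local len
    len=$(first_read_length "$1")
    [ -n "$len" ] || return 1    # empty or unreadable file
    if   [ "$len" -le 10 ]; then echo "I1"
    elif [ "$len" -le 28 ]; then echo "R1"
    else                         echo "R2"
    fi
}
```

Looping `guess_read_type` over the split files of a run gives a quick table of suffix versus read type before any file is renamed.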

In [None]:
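## modified fastqformat.sh mapping for GEX/VDJ runs, where _1 is the 10-base index read (I1), _2 the 28-base read (R1) and _3 the 90-base read (R2)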
for file in *_2.fastq.gz; do
  newname=$(echo "$file" | sed 's/_2.fastq.gz/_S1_R1_001.fastq.gz/')
  mv "$file" "$newname"
done

for file in *_3.fastq.gz; do
  newname=$(echo "$file" | sed 's/_3.fastq.gz/_S1_R2_001.fastq.gz/')
  mv "$file" "$newname"
done

for file in *_1.fastq.gz; do
  newname=$(echo "$file" | sed 's/_1.fastq.gz/_S1_I1_001.fastq.gz/')
  mv "$file" "$newname"
done

In [7]:
## execute script
system("cd /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ \n source /camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/fastqformat.sh")

In [25]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq
├── SRR26844707_S1_I1_001.fastq.gz
├── SRR26844707_S1_R1_001.fastq.gz
├── SRR26844707_S1_R2_001.fastq.gz
├── SRR26844708_S1_I1_001.fastq.gz
├── SRR26844708_S1_R1_001.fastq.gz
├── SRR26844708_S1_R2_001.fastq.gz
├── SRR26844709_S1_I1_001.fastq.gz
├── SRR26844709_S1_R1_001.fastq.gz
├── SRR26844709_S1_R2_001.fastq.gz
├── SRR26844710_S1_I1_001.fastq.gz
├── SRR26844710_S1_R1_001.fastq.gz
├── SRR26844710_S1_R2_001.fastq.gz
├── SRR26844711_S1_R1_001.fastq.gz
├── SRR26844711_S1_R2_001.fastq.gz
├── SRR26844884_S1_I1_001.fastq.gz
├── SRR26844884_S1_R1_001.fastq.gz
├── SRR26844884_S1_R2_001.fastq.gz
├── SRR26844885_S1_I1_001.fastq.gz
├── SRR26844885_S1_R1_001.fastq.gz
├── 

### <font color='grey'>Step2 -</font> batch_id.txt {.unnumbered}
Specify sequencing sample names (CV10, CV12) in "batch_id.txt" file.

In [14]:
## write and show batch_id.txt
write.table(c("CV10", "CV12"), "/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/batch_id.txt", col.names = F, row.names = F, quote = F)
cat_content("/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/batch_id.txt")

CV10
CV12


### <font color='grey'>Step3 -</font> config.csv {.unnumbered}
Set the paths to the human reference genomes for the GEX and VDJ libraries in the "config.csv" file. This "config.csv" template was retrieved from [CellRanger config.csv]() and has been modified to run with our pipeline. The reference genome files are required by the cellranger-multi pipeline and can be pre-installed by following the steps in the [build 10X reference]() link.

:::{.callout-warning}
Please do not change the number of lines in this file, as this is critical for batch.sh (Step 6), which adds the "library.csv" and "feature_reference.csv" contents to it, to run properly.
:::

In [3]:
cat_content("/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/config.csv")

# This template shows the possible cellranger multi config CSV options for analyzing Single Cell Gene Expression with Feature Barcode Technology (Antibody Capture, CRISPR Guide Capture, Cell Multiplexing, Antigen Capture), Fixed RNA Profiling, or Single Cell Immune Profiling data. 
# These options cannot be used all together - see section descriptions for detail.
# Use 'cellranger multi-template --parameters' to see descriptions of all parameters.

[gene-expression]
reference,/camp/svc/reference/Genomics/10x/10x_transcriptomes/refdata-gex-GRCh38-2020-A
create-bam,true
# probe-set,/path/to/probe/set, # Required, Fixed RNA Profiling only. 
# filter-probes,<true|false>, # Optional, Fixed RNA Profiling only. 
# r1-length,<int>
# r2-length,<int>
chemistry,SC5P-R2
# expect-cells,<auto>
# force-cells,<auto>
# no-secondary,<true|false>
# no-bam,<true|false>
# check-library-compatibility,<true|false>
# target-panel,/path/to/target/panel, # Required, Targeted GEX only.
# no-target-umi-filter,<tr

We leave the [feature] and [libraries] sections of "config.csv" empty for now, as their contents will be defined by "feature_reference.csv" and "library.csv" below.

### <font color='grey'>Step4 -</font> feature_reference.csv {.unnumbered}
Next we will set the correct oligo tag/sequence for "feature_reference.csv" for cellranger-multi to process and quantify the oligo reads properly. Below is an example of feature_reference.csv.

In [4]:
## show example feature_reference.csv
cat_content("/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/feature_reference.csv")

id,name,read,pattern,sequence,feature_type
C0301,C0301,R2,5PNNNNNNNNNN(BC),ACCCACCAGTAAGAC,Antibody Capture
C0302,C0302,R2,5PNNNNNNNNNN(BC),GGTCGAGAGCATTCA,Antibody Capture
C0303,C0303,R2,5PNNNNNNNNNN(BC),CTTGCCGCATGTCAT,Antibody Capture
C0304,C0304,R2,5PNNNNNNNNNN(BC),AAAGCATTCTTCACG,Antibody Capture
C0305,C0305,R2,5PNNNNNNNNNN(BC),CTTTGTCTTTGTGAG,Antibody Capture
C0306,C0306,R2,5PNNNNNNNNNN(BC),TATGCTGCCACGGTA,Antibody Capture
C0307,C0307,R2,5PNNNNNNNNNN(BC),GAGTCTGCCAGTATC,Antibody Capture
C0308,C0308,R2,5PNNNNNNNNNN(BC),TATAGAACGCCAGGC,Antibody Capture
C0309,C0309,R2,5PNNNNNNNNNN(BC),TGCCTATGAAACAAG,Antibody Capture
C0310,C0310,R2,5PNNNNNNNNNN(BC),CCGATTGTAACAGAC,Antibody Capture
C0311,C0311,R2,5PNNNNNNNNNN(BC),GCTTACCGAATTAAC,Antibody Capture
C0312,C0312,R2,5PNNNNNNNNNN(BC),CTGCAAATATAACGG,Antibody Capture


Since CITE-seq and cell hashing libraries were prepared for these sequencing samples, information on the antibodies used and their oligo sequences was retrieved from the accession records of this study (the GSE247910 and GSE247912 README files listed below). We make one feature_reference.csv file per sequencing sample to avoid mix-ups, as the same hashtag oligos were used for different donors in different sequencing runs.

In [44]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input
├── GSE247910_ADT_HTO_details_README.txt
├── GSE247912_ADT_HTO_details_README.txt
├── SRR_Acc_List.txt
└── SraRunTable.txt


In [5]:
## read the downloaded feature reference files
cv10_features <- read.csv("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/GSE247910_ADT_HTO_details_README.txt", sep = "\t")
cv12_features <- read.csv("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/GSE247912_ADT_HTO_details_README.txt", sep = "\t")
colnames(cv12_features)[4] <- "HTO" 
head(cv10_features)
head(cv12_features)

| | ADT | barcode.sequence | X | HTO | barcode.sequence.1 | hashtagged.sample |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<lgl>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | TotalSeq?-C0145 anti-human CD103 | GACCTCATTGTGAAT | | anti-human Hashtag 1 | GTCAACTCTTTAGCG | CV-001 d0 booster |
| 2 | TotalSeq?-C0155 anti-human CD107a | CAGCCCACTGCAATA | | anti-human Hashtag 2 | TGATGGCCTATTGGG | CV-001 d7 booster |
| 3 | TotalSeq?-C0061 anti-human CD117 | AGACTAATAGCTGAC | | anti-human Hashtag 3 | TTCCGCCTCTCTTTG | CV-001 d28 booster |
| 4 | TotalSeq?-C0161 anti-human CD11b | GACAAGTGATCTGCA | | anti-human Hashtag 4 | AGTAAGTTCAGCGTA | CV-011 d120 booster |
| 5 | TotalSeq?-C0053 anti-human CD11c | TACGCCTATAACTTG | | | | |
| 6 | TotalSeq?-C0064 anti-human CD123 | CTTCACTCTGTCAGG | | | | |


| | ADT | barcode.sequence | X | HTO | barcode.sequence.1 | hashtagged.sample |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<lgl>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | TotalSeq?-C0145 anti-human CD103 | GACCTCATTGTGAAT | | TotalSeq?-C0251 anti-human Hashtag 1 | GTCAACTCTTTAGCG | CV-053 d0 vax |
| 2 | TotalSeq?-C0155 anti-human CD107a | CAGCCCACTGCAATA | | TotalSeq?-C0253 anti-human Hashtag 3 | TTCCGCCTCTCTTTG | CV-053 d7 vax |
| 3 | TotalSeq?-C0061 anti-human CD117 | AGACTAATAGCTGAC | | TotalSeq?-C0254 anti-human Hashtag 4 | AGTAAGTTCAGCGTA | CV-053 d21 vax |
| 4 | TotalSeq?-C0161 anti-human CD11b | GACAAGTGATCTGCA | | TotalSeq?-C0255 anti-human Hashtag 5 | AAGTATCGTTTCGCA | CV-053 d28 vax |
| 5 | TotalSeq?-C0053 anti-human CD11c | TACGCCTATAACTTG | | | | |
| 6 | TotalSeq?-C0064 anti-human CD123 | CTTCACTCTGTCAGG | | | | |


In [6]:
## modify each feature reference
for(x in c("cv10_features", "cv12_features")){
    
    features <- get(x)
    colnames(features) <- NULL
    colnames(features) <- rep(c("id", "sequence", "name"),2)
    features <- bind_rows(features[c(1:3)], features[c(4:6)]) %>%
        filter(id != "") %>%
        arrange(name)

    features <- features %>%
        mutate(
            id = gsub(" $", "", id), # remove random " " at the end of id
            name = gsub(" ", "_", name),
            name = ifelse(is.na(name), id, name)) %>% # add citeseq id to name
        mutate(
            name = gsub(".* ", "", name), 
            id = gsub(".*anti-human Hashtag ", "Hashtag", id),
            id = gsub("TotalSeq\\?\\-C", "C", id),
            id = gsub(" .*", "", id)) %>%
        mutate(
            name = gsub("^isoIg", "Ig", name),
            read = "R2",
            pattern = "5PNNNNNNNNNN(BC)",
            feature_type = "Antibody Capture") %>%
        filter(id != "") %>%
        filter(sequence != "") %>%
        select(id, name, read, pattern, sequence, feature_type) %>%
        arrange(desc(id)) %>%
        distinct(.)
    
    print(any(is.na(features)))
    assign(x, features)}

[1] FALSE
[1] FALSE


In [7]:
head(cv10_features)
head(cv12_features)

| | id | name | read | pattern | sequence | feature_type |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Hashtag4 | CV-011_d120_booster | R2 | 5PNNNNNNNNNN(BC) | AGTAAGTTCAGCGTA | Antibody Capture |
| 2 | Hashtag3 | CV-001_d28_booster | R2 | 5PNNNNNNNNNN(BC) | TTCCGCCTCTCTTTG | Antibody Capture |
| 3 | Hashtag2 | CV-001_d7_booster | R2 | 5PNNNNNNNNNN(BC) | TGATGGCCTATTGGG | Antibody Capture |
| 4 | Hashtag1 | CV-001_d0_booster | R2 | 5PNNNNNNNNNN(BC) | GTCAACTCTTTAGCG | Antibody Capture |
| 5 | C0831 | CD138 | R2 | 5PNNNNNNNNNN(BC) | GTATAGACCAAAGCC | Antibody Capture |
| 6 | C0804 | CD186 | R2 | 5PNNNNNNNNNN(BC) | GACAGTCGATGCAAC | Antibody Capture |


| | id | name | read | pattern | sequence | feature_type |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Hashtag5 | CV-053_d28_vax | R2 | 5PNNNNNNNNNN(BC) | AAGTATCGTTTCGCA | Antibody Capture |
| 2 | Hashtag4 | CV-053_d21_vax | R2 | 5PNNNNNNNNNN(BC) | AGTAAGTTCAGCGTA | Antibody Capture |
| 3 | Hashtag3 | CV-053_d7_vax | R2 | 5PNNNNNNNNNN(BC) | TTCCGCCTCTCTTTG | Antibody Capture |
| 4 | Hashtag1 | CV-053_d0_vax | R2 | 5PNNNNNNNNNN(BC) | GTCAACTCTTTAGCG | Antibody Capture |
| 5 | C0831 | CD138 | R2 | 5PNNNNNNNNNN(BC) | GTATAGACCAAAGCC | Antibody Capture |
| 6 | C0804 | CD186 | R2 | 5PNNNNNNNNNN(BC) | GACAGTCGATGCAAC | Antibody Capture |


In [5]:
## now output the dataframes as feature_reference.csv
for(x in c("cv10_features", "cv12_features")){
    sample <- gsub("_.*", "", x)
    write.csv(get(x), paste0("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/", toupper(sample), "_feature_reference.csv"), row.names = F, quote = F)}

In [6]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input
├── CV10_feature_reference.csv
├── CV12_feature_reference.csv
├── GSE247910_ADT_HTO_details_README.txt
├── GSE247912_ADT_HTO_details_README.txt
├── SRR_Acc_List.txt
└── SraRunTable.txt


### <font color='grey'>Step5 -</font> library.csv {.unnumbered}
Now we will create library.csv files to fill in the [libraries] section of the "config.csv" file for each sequencing sample. The columns required for "library.csv" are:

> [libraries]  
> fastq_id,fastqs,lanes,feature_types

In [None]:
# example of library.csv
cat_content('/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/library.csv')

SRR22473100,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Gene Expression
SRR22473101,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Gene Expression
SRR22473102,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Gene Expression
SRR22473103,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Gene Expression
SRR22473104,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Antibody Capture
SRR22473105,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Antibody Capture
SRR22473106,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Antibody Capture
SRR22473107,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,Antibody Capture
SRR22473108,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,VDJ-B
SRR22473109,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,VDJ-B
SRR22473110,/camp/home/hungm/scratch/hungm/anqi/AX_GSE219098/raw_fastq/,any,VDJ-B
SRR22473111,/c

We now return to the SRA metadata from [@sec-sra] and reshape the data frame into the library.csv format.

In [None]:
# view SRA metadata again
head(sra)

| | BioProject | Run | library_type | treatment |
|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | PRJNA1040901 | SRR26844707 | TCRgd | COVID-19 |
| 2 | PRJNA1040901 | SRR26844708 | TCRab | COVID-19 |
| 3 | PRJNA1040901 | SRR26844709 | GEX | COVID-19 |
| 4 | PRJNA1040901 | SRR26844710 | BCR | COVID-19 |
| 5 | PRJNA1040901 | SRR26844711 | ADT | COVID-19 |
| 6 | PRJNA1040899 | SRR26844884 | TCRgd | SARS-CoV-2 vaccine |


In [9]:
## modify sra metadata
sra <- sra %>%
    mutate(feature_types = case_when(
        library_type == "GEX" ~ "Gene Expression", 
        library_type == "ADT" ~ "Antibody Capture", 
        library_type == "BCR" ~ "VDJ-B", 
        library_type == "TCRab" ~ "VDJ-T", 
        library_type == "TCRgd" ~ "VDJ-T-GD")) %>%
    mutate(
        id = ifelse(BioProject == "PRJNA1040901", "CV10", "CV12"),
        fastq_id = Run,
        fastqs = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/",
        lanes = "any") %>%
    select(id, fastq_id, fastqs, lanes, feature_types)
sra

| id | fastq_id | fastqs | lanes | feature_types |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| CV10 | SRR26844707 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | VDJ-T-GD |
| CV10 | SRR26844708 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | VDJ-T |
| CV10 | SRR26844709 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | Gene Expression |
| CV10 | SRR26844710 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | VDJ-B |
| CV10 | SRR26844711 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | Antibody Capture |
| CV12 | SRR26844884 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | VDJ-T-GD |
| CV12 | SRR26844885 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | VDJ-T |
| CV12 | SRR26844886 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | Gene Expression |
| CV12 | SRR26844887 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | VDJ-B |
| CV12 | SRR26844888 | /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/raw_fastq/ | any | Antibody Capture |


In [10]:
## save dataframe as library.csv
outputdir <- "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/"
for(x in unique(sra$id)){
    sra %>%
        filter(id == x) %>%
        select(!id) %>%
        write.table(., paste0(outputdir, x, "_library.csv"), col.names = F, row.names = F, quote = F, sep = ",")}

In [11]:
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/input/
├── CV10_feature_reference.csv
├── CV10_library.csv
├── CV12_feature_reference.csv
├── CV12_library.csv
├── GSE247910_ADT_HTO_details_README.txt
├── GSE247912_ADT_HTO_details_README.txt
├── SRR_Acc_List.txt
└── SraRunTable.txt
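
With the library files written, it may help to confirm that every fastq_id points at files that actually exist under the renamed scheme. The following bash sketch assumes a header-less library.csv with the four columns fastq_id, fastqs, lanes and feature_types, and the "\<fastq_id\>_S1_R1_001.fastq.gz" naming used earlier; `check_library_fastqs` is a hypothetical helper.

```bash
# check_library_fastqs FILE: for each row of a header-less library.csv
# (fastq_id,fastqs,lanes,feature_types), confirm that the R1 and R2 files
# <fastq_id>_S1_R{1,2}_001.fastq.gz exist in the fastqs directory.
# Hypothetical helper; returns non-zero if any file is missing.
check_library_fastqs() {
    local fastq_id fastqs lanes feature_types rtype status=0
    while IFS=',' read -r fastq_id fastqs lanes feature_types; do
        [ -n "$fastq_id" ] || continue    # skip blank lines
        for rtype in R1 R2; do
            if [ ! -e "${fastqs%/}/${fastq_id}_S1_${rtype}_001.fastq.gz" ]; then
                echo "missing ${rtype} for ${fastq_id} (${feature_types})" >&2
                status=1
            fi
        done
    done < "$1"
    return "$status"
}
```

For instance, `check_library_fastqs CV10_library.csv` would list any run whose renamed R1 or R2 file is absent from the raw_fastq directory.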


### <font color='grey'>Step6 -</font> batch.sh {.unnumbered}
Next, we need to create a multi.sh and a config.csv file for each sequencing sample, which can be done by running the "batch.sh" script. The script performs the following:

> 1. makes a subdirectory for each sequencing sample in a pre-defined cellranger log directory
> 2. copies the "multi.sh" script into each sequencing sample subdirectory and sets the cellranger output directory used when multi.sh is run
> 3. copies the "config.csv" file into each sequencing sample subdirectory and adds the "library.csv" and "feature_reference.csv" contents to "config.csv"

In [16]:
cat_content('/camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/batch.sh')

export accession=/matthew/MH_GSE247917/

# setup all scripts in this cellranger log directory
mkdir -p /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger

# for each sample name in batch_id.txt
while IFS= read -r id;
do mkdir -p /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/${id}

   # copy essential files
   cp -r /camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/multi.sh /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/${id}/
   cp -r /camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/config.csv /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/${id}/
   cp -r /camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/batch* /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/

   # set up cellranger output directories
   sed -i "11s|.*|export accession=$accession|" /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/${id}/multi.sh
   sed -i "12s|.*|export sample=$id|" /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/

In [25]:
## execute batch.sh script
system("source /camp/home/hungm/nemo-pipelines/scrnaseq/cellranger/batch.sh")
fs::dir_tree(path = "/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger")

/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger
├── CV10
│   ├── config.csv
│   └── multi.sh
├── CV12
│   ├── config.csv
│   └── multi.sh
├── batch.sh
└── batch_id.txt
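
The batch.sh listing above is cut off before the part that merges the per-sample files into config.csv, so the sketch below is not the author's implementation. It only illustrates one way to combine a template with a sample's feature reference and library file, assuming the template does not already contain [feature] or [libraries] sections; `append_sample_sections` is a hypothetical name.

```bash
# append_sample_sections TEMPLATE FEATURE_REF LIBRARY OUT
# Sketch only: copies TEMPLATE to OUT, then appends a [feature] section that
# points at FEATURE_REF and a [libraries] section built from a header-less
# LIBRARY file. Assumes TEMPLATE does not define these sections itself.
append_sample_sections() {
    local template="$1" feature_ref="$2" library="$3" out="$4"
    {
        cat "$template"
        printf '\n[feature]\nreference,%s\n' "$feature_ref"
        printf '\n[libraries]\nfastq_id,fastqs,lanes,feature_types\n'
        cat "$library"
    } > "$out"
}
```

Appending whole sections avoids depending on fixed line numbers, at the cost of requiring a template without those sections.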


In [18]:
cat_content("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger/CV12/config.csv")

# This template shows the possible cellranger multi config CSV options for analyzing Single Cell Gene Expression with Feature Barcode Technology (Antibody Capture, CRISPR Guide Capture, Cell Multiplexing, Antigen Capture), Fixed RNA Profiling, or Single Cell Immune Profiling data. 
# These options cannot be used all together - see section descriptions for detail.
# Use 'cellranger multi-template --parameters' to see descriptions of all parameters.

[gene-expression]
reference,/camp/svc/reference/Genomics/10x/10x_transcriptomes/refdata-gex-GRCh38-2020-A
create-bam,true
# probe-set,/path/to/probe/set, # Required, Fixed RNA Profiling only. 
# filter-probes,<true|false>, # Optional, Fixed RNA Profiling only. 
# r1-length,<int>
# r2-length,<int>
# chemistry,SC5P-R2
# expect-cells,<auto>
# force-cells,<auto>
# no-secondary,<true|false>
# no-bam,<true|false>
# check-library-compatibility,<true|false>
# target-panel,/path/to/target/panel, # Required, Targeted GEX only.
# no-target-umi-filter,<tr

### <font color='grey'>Step7 -</font> multi.sh {.unnumbered}
After confirming that all the files are properly set up, we can run the multi.sh scripts for each sequencing sample.

In [19]:
cat_content("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger/CV12/multi.sh")

#!/bin/bash
#SBATCH --job-name=cellranger-multi
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:0
#SBATCH --mem=250G
#SBATCH --partition=ncpu
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=matthew.hung@crick.ac.uk

export accession=/matthew/MH_GSE247917/
export sample=CV12
mkdir -p /camp/home/hungm/scratch/hungm/${accession}/cellranger
exec > /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/${sample}/multi.log 2>&1
rm -r /camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/slurm*

cd /camp/home/hungm/scratch/hungm/${accession}/cellranger
#module load CellRanger/7.1.0
module load CellRanger/8.0.0
cellranger multi --id=${sample} \
		 --csv=/camp/home/hungm/scratch/hungm/${accession}/logs/cellranger/${sample}/config.csv \
                 --localmem=200 \
                 --localcores=30



In [27]:
system("cd /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger/; sbatch CV10/multi.sh")
system("cd /camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger/; sbatch CV12/multi.sh")
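
When there are more than a couple of samples, the submissions can be driven by batch_id.txt instead of one call per sample. This is a minimal sketch under the assumption that batch_id.txt and the per-sample subdirectories sit in the same log directory, as in the tree shown earlier; `submit_all` and the `SUBMIT_CMD` override are hypothetical names.

```bash
# submit_all LOGDIR: submit <LOGDIR>/<id>/multi.sh for every id listed in
# <LOGDIR>/batch_id.txt. SUBMIT_CMD is a hypothetical override so the loop can
# be previewed (for example SUBMIT_CMD=echo) before a real submission.
submit_all() {
    local logdir="$1" id
    local submit="${SUBMIT_CMD:-sbatch}"
    while IFS= read -r id; do
        [ -n "$id" ] || continue    # skip blank lines
        if [ ! -f "$logdir/$id/multi.sh" ]; then
            echo "no multi.sh for $id, skipping" >&2
            continue
        fi
        # run from LOGDIR so relative paths match the manual sbatch calls;
        # stdin is detached so the submit command cannot consume batch_id.txt
        ( cd "$logdir" && "$submit" "$id/multi.sh" < /dev/null )
    done < "$logdir/batch_id.txt"
}
```

Calling it with `SUBMIT_CMD=echo` first prints the scripts that would be submitted, which doubles as the final check mentioned in the step list.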

In [35]:
cat_content("/camp/home/hungm/scratch/hungm/matthew/MH_GSE247917/logs/cellranger/CV12/multi.log", n=20)

Martian Runtime - v4.0.10
Serving UI at http://cn013:38301?auth=iqpNquUSp17wjSrHixYSvDF7N_9DdLUSl-8oqgz5S04

Running preflight checks (please wait)...
2024-09-02 20:15:48 [runtime] (ready)           ID.CV12.SC_MULTI_CS.PARSE_MULTI_CONFIG
2024-09-02 20:15:48 [runtime] (run:local)       ID.CV12.SC_MULTI_CS.PARSE_MULTI_CONFIG.fork0.chnk0.main
2024-09-02 20:16:05 [runtime] (chunks_complete) ID.CV12.SC_MULTI_CS.PARSE_MULTI_CONFIG
2024-09-02 20:16:05 [runtime] (ready)           ID.CV12.SC_MULTI_CS.FULL_COUNT_INPUTS.WRITE_GENE_INDEX
2024-09-02 20:16:05 [runtime] (run:local)       ID.CV12.SC_MULTI_CS.FULL_COUNT_INPUTS.WRITE_GENE_INDEX.fork0.chnk0.main
2024-09-02 20:16:05 [runtime] (ready)           ID.CV12.SC_MULTI_CS.SC_MULTI_CORE.MULTI_CHEMISTRY_DETECTOR._GEM_WELL_CHEMISTRY_DETECTOR.DETECT_COUNT_CHEMISTRY
2024-09-02 20:16:05 [runtime] (run:local)       ID.CV12.SC_MULTI_CS.SC_MULTI_CORE.MULTI_CHEMISTRY_DETECTOR._GEM_WELL_CHEMISTRY_DETECTOR.DETECT_COUNT_CHEMISTRY.fork0.chnk0.main
2024-09-02 20
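
Since these jobs can run for a long time, a small status check on multi.log saves opening each file by hand. The sketch below rests on two assumptions about the log text rather than on a documented interface: that a finished run contains the phrase "Pipestance completed successfully" and that failures are reported on lines containing "[error]". `run_status` is a hypothetical helper.

```bash
# run_status LOG: report whether a cellranger log looks finished.
# Assumptions (not a documented interface): success is marked by the phrase
# "Pipestance completed successfully", failure by a line containing "[error]".
run_status() {
    local log="$1"
    if grep -q 'Pipestance completed successfully' "$log"; then
        echo "completed"
    elif grep -qi '\[error\]' "$log"; then
        echo "failed"
    else
        echo "running or incomplete"
    fi
}
```

It could be looped over `*/multi.log` in the cellranger log directory to summarise all samples at once.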

## Session Info {.unnumbered}

In [252]:
sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Rocky Linux 8.7 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /nemo/lab/caladod/working/Matthew/.conda/envs/seurat5/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0 knitr_1.45     

