# Doublet detection

This notebook describes how to set up the doublet detection pipeline and collect its results before deciding how to handle doublet removal.

## Prerequisites for the doublet detection pipeline

The doublet detection steps rely heavily on the Nextflow pipeline developed at https://github.com/ParkerLab/Multiome-Doublet-Detection-NextFlow. To start using the pipeline, clone the repository with `git clone https://github.com/ParkerLab/Multiome-Doublet-Detection-NextFlow`.

At this stage, we assume that all preprocessing steps have been run and that the lists of barcodes passing the preprocessing screens are available.

## Doublet detection (round 1)

### Step 1: create library.config file (to run the Nextflow pipeline)

The pipeline requires samples with different chemistries to be run separately. The following code is an example of how to set up the config file (written here as `library_info.tsv`) for V2 chemistry samples.

```
library(stringr)

# Barcode lists that passed preprocessing QC (one file per library)
files <- list.files("/nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39/emptyDrops/results/pctMTusingBelowEndCliff_pctMtless30_FDR0.005/cellbender_default/", "_passQC_barcodes.csv")

# Keep only the libraries named in the first column of sampleList.txt
s <- read.table("/nfs/turbo/umms-scjp-pank/1_HPAP/scripts/doubletfinder_v2/sampleList.txt", header = FALSE)
files <- files[grep(paste0(s$V1, collapse = "|"), files)]

# One row per library; the remaining columns are filled in below
df <- data.frame("library" = gsub("_passQC_barcodes.csv", "", files),
                 "rna_bam" = NA, "rna_pass_qc_barcodes" = NA,
                 "rna_cellbender" = NA, "doubletfinder_pcs" = NA,
                 "doubletfinder_resolution" = NA, "doubletfinder_sctransform" = NA)

# Input files for each library
df$rna_bam <- paste0("/nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39/prune/", df$library, "-hg38.before-dedup.bam")
df$rna_pass_qc_barcodes <- paste0("/nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39/emptyDrops/results/pctMTusingBelowEndCliff_pctMtless30_FDR0.005/cellbender_default/", files)
df$rna_cellbender <- paste0("/nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39_private/cellbender/cellbender_optimized/", df$library, "-hg38.cellbender_FPR_0.05_filtered.h5")

# DoubletFinder settings, applied to every library
df$doubletfinder_pcs <- 25
df$doubletfinder_resolution <- 0.2
df$doubletfinder_sctransform <- "false"

# Write the tab-separated config file passed to the pipeline via --library_info
write.table(df, "/nfs/turbo/umms-scjp-pank/1_HPAP/scripts/doubletfinder_v2/library_info.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
```
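
Before submitting the pipeline, it can help to confirm that every path written to the config file actually exists. The following is a minimal sketch of such a check, not part of the pipeline itself: `check_library_paths` is a hypothetical helper, and the default column names are taken from the config table built above. The toy data at the end uses made-up library names and temporary files.

```r
# Hypothetical helper (not part of the pipeline): list config entries whose
# file is missing on disk. Returns one row per missing path.
check_library_paths <- function(df, path_cols = c("rna_bam", "rna_pass_qc_barcodes", "rna_cellbender")) {
    missing <- lapply(path_cols, function(col) {
        absent <- !file.exists(df[[col]])
        data.frame(library = df$library[absent],
                   column = rep(col, sum(absent)),
                   path = df[[col]][absent],
                   stringsAsFactors = FALSE)
    })
    do.call(rbind, missing)
}

# Toy example: libA points to an existing file, libB to a missing one.
existing <- tempfile(fileext = ".bam")
file.create(existing)
toy <- data.frame(library = c("libA", "libB"),
                  rna_bam = c(existing, file.path(tempdir(), "does_not_exist.bam")),
                  stringsAsFactors = FALSE)

missing_paths <- check_library_paths(toy, path_cols = "rna_bam")
print(missing_paths)
```

On the real config, the same function could be called as `check_library_paths(df)` just before `write.table`; an empty result means all three input files were found for every library.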

### Step 2: run the pipeline

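Example command (submitted from the directory containing `library_info.tsv`, since that path is given relative to the working directory):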
```
sbatch --job-name=dblfinderV2 --mem=500M --time=72:00:00 --account=scjp99 --mail-user=vthihong@umich.edu --mail-type=END,FAIL --signal=B:TERM@60 --wrap="exec ~/tools/nextflow run -resume --library_info library_info.tsv --rna_barcodes /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/snRNAseq-NextFlow_v2/737K-august-2016.txt --results /nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39/doubletfinder_v2/doubletfinder_round1/ -entry rna /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/doubletfinder_v2/main.nf"
```

## Doublet detection (round 2)

### Step 1: create library.config file (to run the Nextflow pipeline)

Before running this step, merge all barcodes marked as doublets in DoubletFinder round 1 into a single file called `indivDblts.txt`, so that these doublets can be excluded before DoubletFinder is run a second time. Each line of the file has the form `<sample>-<barcode>`, which we split into separate `sample` and `barcode` columns:

```
dblt <- read.table("/nfs/turbo/umms-scjp-pank/4_integration/results/202503_freeze/doubletfinder_round1/nonDup_proteinCoding/indivDblts.txt", header = F)
head(dblt)
dblt$sample <- sub("(.*)-[^-]*$", "\\1", dblt$V1)
dblt$barcode <- sub(".*-(.*)$", "\\1", dblt$V1)
head(dblt)
```

Output of the first `head(dblt)`:

| | V1 (`<chr>`) |
|---|---|
| 1 | SRR12831418-ACATCAGTCTACTCAT |
| 2 | SRR12831418-AAACCTGTCATCATTC |
| 3 | SRR12831418-CAGAGAGTCCATGAAC |
| 4 | SRR12831418-ACTATCTCAAGGTTCT |
| 5 | SRR12831418-CCACCTAAGAGTGAGA |
| 6 | SRR12831418-TACCTATAGCACGCCT |

Output of the second `head(dblt)`, after splitting:

| | V1 (`<chr>`) | sample (`<chr>`) | barcode (`<chr>`) |
|---|---|---|---|
| 1 | SRR12831418-ACATCAGTCTACTCAT | SRR12831418 | ACATCAGTCTACTCAT |
| 2 | SRR12831418-AAACCTGTCATCATTC | SRR12831418 | AAACCTGTCATCATTC |
| 3 | SRR12831418-CAGAGAGTCCATGAAC | SRR12831418 | CAGAGAGTCCATGAAC |
| 4 | SRR12831418-ACTATCTCAAGGTTCT | SRR12831418 | ACTATCTCAAGGTTCT |
| 5 | SRR12831418-CCACCTAAGAGTGAGA | SRR12831418 | CCACCTAAGAGTGAGA |
| 6 | SRR12831418-TACCTATAGCACGCCT | SRR12831418 | TACCTATAGCACGCCT |

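Both `sub()` patterns split on the last hyphen in the identifier. The sketch below applies them to a few made-up identifiers to show what that implies. The second and third identifiers are hypothetical and only illustrate edge cases: a sample name that itself contains hyphens is handled correctly, whereas a barcode carrying a trailing `-1` suffix would be split in the wrong place.

```r
# Hypothetical identifiers, used only to illustrate the two patterns.
ids <- c("SRR12831418-ACATCAGTCTACTCAT",    # format shown in the output above
         "HPAP-001-ACATCAGTCTACTCAT",       # sample name containing a hyphen
         "SRR12831418-ACATCAGTCTACTCAT-1")  # barcode with a trailing "-1" suffix

split_ids <- data.frame(
    id = ids,
    sample = sub("(.*)-[^-]*$", "\\1", ids),  # everything before the last hyphen
    barcode = sub(".*-(.*)$", "\\1", ids),    # everything after the last hyphen
    stringsAsFactors = FALSE
)
print(split_ids)
```

If the barcodes in `indivDblts.txt` ever carry such a suffix, the patterns would need to be adjusted before the doublets are matched against the pass-QC barcode lists.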

We then create, for each sample, a file of the remaining barcodes:

```
for (s in unique(dblt$sample)) {
    # Pass-QC barcode list for this sample; samples without one are skipped
    in_file <- paste0("/nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39/emptyDrops/results/pctMTusingBelowEndCliff_pctMtless30_FDR0.005/cellbender_default/", s, "_passQC_barcodes.csv")
    if (file.exists(in_file)) {
        bc <- read.table(in_file, header = FALSE)
        # Drop the barcodes flagged as doublets in round 1
        to_exclude <- dblt[dblt$sample == s, "barcode"]
        bc <- bc[!(bc$V1 %in% to_exclude), , drop = FALSE]
        # One remaining barcode per line, no header
        write.table(bc, paste0("/nfs/turbo/umms-scjp-pank/4_integration/results/202503_freeze/doubletfinder_round1/nonDup_proteinCoding/selected-cells/", s, "_selectedBC.txt"), col.names = FALSE, quote = FALSE, row.names = FALSE)
    }
}
```
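
Note that the loop above only visits samples that appear in `dblt`. A library with no round 1 doublets would therefore get no `_selectedBC.txt` file, although the round 2 config created in the next step expects one per library. If that case can occur in your data, one option is to iterate over all libraries instead. The following is a minimal sketch under that assumption: `write_selected_barcodes` is a hypothetical helper, and the toy barcodes, library names, and temporary directories are made up for illustration.

```r
# Hypothetical helper: write the barcodes of one library that were not
# flagged as doublets. Returns the number of barcodes kept (NA if no input).
write_selected_barcodes <- function(lib, dblt, in_dir, out_dir) {
    in_file <- file.path(in_dir, paste0(lib, "_passQC_barcodes.csv"))
    if (!file.exists(in_file)) return(invisible(NA_integer_))
    bc <- read.table(in_file, header = FALSE, stringsAsFactors = FALSE)
    to_exclude <- dblt$barcode[dblt$sample == lib]  # empty if no doublets
    kept <- bc[!(bc$V1 %in% to_exclude), , drop = FALSE]
    write.table(kept, file.path(out_dir, paste0(lib, "_selectedBC.txt")),
                col.names = FALSE, quote = FALSE, row.names = FALSE)
    invisible(nrow(kept))
}

# Toy data in temporary directories (all names are made up).
in_dir <- file.path(tempdir(), "passQC")
out_dir <- file.path(tempdir(), "selected")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines(c("AAAC", "CCCG", "GGGT"), file.path(in_dir, "libA_passQC_barcodes.csv"))
writeLines(c("TTTA", "ACGT"), file.path(in_dir, "libB_passQC_barcodes.csv"))

# Only libA has a round 1 doublet; libB has none.
dblt_toy <- data.frame(sample = "libA", barcode = "CCCG", stringsAsFactors = FALSE)

# Iterate over every library, not only those listed in the doublet table.
n_kept <- sapply(c("libA", "libB"), write_selected_barcodes,
                 dblt = dblt_toy, in_dir = in_dir, out_dir = out_dir)
print(n_kept)
```

In the real workflow, the vector of libraries could come from the `library` column of `library_info.tsv`, so that every library referenced in round 2 has a matching barcode file.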

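Next, we create a new library config file, `library_info_round2.tsv`, which is identical to the round 1 config except that `rna_pass_qc_barcodes` points to the filtered barcode lists: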

```
lib_config <- read.table("/nfs/turbo/umms-scjp-pank/1_HPAP/scripts/doubletfinder_v2/library_info.tsv", header = T)
lib_config$rna_pass_qc_barcodes <- paste0("/nfs/turbo/umms-scjp-pank/4_integration/results/202503_freeze/doubletfinder_round1/nonDup_proteinCoding/selected-cells/", lib_config$library, "_selectedBC.txt")
write.table(lib_config, "/nfs/turbo/umms-scjp-pank/1_HPAP/scripts/doubletfinder_v2/library_info_round2.tsv", sep = "\t", quote = F, row.names = F)
```

### Step 2: run the pipeline

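Example command (submitted from the directory containing `library_info_round2.tsv`, since that path is given relative to the working directory):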
```
sbatch --job-name=dblfinderV2 --mem=500M --time=72:00:00 --account=scjp99 --mail-user=vthihong@umich.edu --mail-type=END,FAIL --signal=B:TERM@60 --wrap="exec ~/tools/nextflow run -resume --library_info library_info_round2.tsv --rna_barcodes /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/snRNAseq-NextFlow_v2/737K-august-2016.txt --results /nfs/turbo/umms-scjp-pank/1_HPAP/results/rna/gencode_v39/doubletfinder_v2/doubletfinder_round2/ -entry rna /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/doubletfinder_v2/main.nf"
```