# Make Arrow Files

Create ArchR Arrow files from the raw fragment files, then compute doublet scores.

In [1]:
library(ArchR)

Loading required package: ggplot2

Loading required package: SummarizedExperiment

Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    unio

In [2]:
set.seed(1)
addArchRThreads(threads = 48) 

Setting default number of Parallel threads to 48.



In [3]:
addArchRGenome("hg38")

Setting default genome to Hg38.



In [4]:
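FRAG_BASE = "/srv/scratch/surag/scMultiome-reprog/chromap/outputs/"
# pattern is a regular expression, not a glob: match file names ending in ".gz"
frag.files = list.files(FRAG_BASE, pattern = "\\.gz$")
# Sample name = file name up to the first "." (sapply returns a character vector)
sample.names = sapply(strsplit(frag.files, "\\."), "[[", 1)
frag.files = paste0(FRAG_BASE, frag.files)
names(frag.files) = sample.names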

frag.files
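
The output of this cell is not shown. As a rough illustration of what the name derivation yields, here is a minimal base-R sketch. The file-name suffix and the directory are hypothetical stand-ins, not the actual contents of `FRAG_BASE`; only the sample names `D1M` and `D2M` come from later cells.

```r
# Hypothetical fragment file names and directory (assumed for illustration only)
example.files <- c("D1M.fragments.tsv.gz", "D2M.fragments.tsv.gz")
example.base <- "/path/to/outputs/"

# Sample name = everything before the first "."
example.names <- vapply(
  strsplit(example.files, ".", fixed = TRUE), `[[`, character(1), 1
)

# Named character vector: names are samples, values are full paths
example.paths <- setNames(paste0(example.base, example.files), example.names)
print(example.paths)
```

A named character vector of this shape is what `createArrowFiles` receives as `inputFiles`, with the names reused as `sampleNames`.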

In [5]:
ArrowFiles <- createArrowFiles(
  inputFiles = frag.files,
  sampleNames = names(frag.files),
  filterTSS = 4, # Renamed minTSS in this ArchR version. Don't set it too high: the cutoff can always be raised later.
  filterFrags = 1000, # Renamed minFrags in this ArchR version.
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)

filterFrags is no longer a valid input. Please use minFrags! Setting filterFrags value to minFrags!

filterTSS is no longer a valid input. Please use minTSS! Setting filterTSS value to minTSS!

Using GeneAnnotation set by addArchRGenome(Hg38)!

ArchR logging to : ArchRLogs/ArchR-createArrows-612e72b7148e2-Date-2022-06-04_Time-00-52-28.log
If there is an issue, please report to github with logFile!

Cleaning Temporary Files

2022-06-04 00:52:29 : Batch Execution w/ safelapply!, 0 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-createArrows-612e72b7148e2-Date-2022-06-04_Time-00-52-28.log
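
To make the effect of the two cutoffs concrete, here is a minimal base-R sketch of the kind of per-barcode filter they describe. It does not call ArchR. The toy table, its column names (`cellNames`, `nFrags`, `TSSEnrichment`), and the inclusive `>=` comparison are assumptions made for illustration.

```r
# Toy per-barcode QC table; column names and values are assumptions
qc <- data.frame(
  cellNames     = c("bc1", "bc2", "bc3", "bc4"),
  nFrags        = c(500, 1500, 20000, 1000),
  TSSEnrichment = c(8.0, 3.2, 12.5, 4.0),
  stringsAsFactors = FALSE
)

min.tss   <- 4     # same value as filterTSS above
min.frags <- 1000  # same value as filterFrags above

# Assumed rule: keep barcodes that meet both thresholds (inclusive)
qc$Keep <- qc$TSSEnrichment >= min.tss & qc$nFrags >= min.frags
print(qc[qc$Keep, ])
```

In this toy table `bc1` fails on fragment count and `bc2` fails on TSS enrichment, so only `bc3` and `bc4` are kept. Starting with permissive cutoffs keeps the option of filtering more strictly later without recreating the Arrow files.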



In [6]:
ArrowFiles

In [7]:
# Convert each sample's pre-filter QC metadata from RDS to TSV
for (x in c("D1M", "D2M")) {
    r = readRDS(sprintf("./QualityControl/%s/%s-Pre-Filter-Metadata.rds", x, x))
    write.table(r, sprintf("./QualityControl/%s/%s-Pre-Filter-Metadata.tsv", x, x),
                sep = "\t", row.names = FALSE, quote = FALSE)
}
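
A quick way to check that this kind of conversion preserves the table is to round-trip a small object through temporary files. The sketch below uses a hypothetical two-row data frame in place of the real Pre-Filter-Metadata object, whose exact class and columns are not shown in this notebook.

```r
# Hypothetical metadata table standing in for a Pre-Filter-Metadata object
meta <- data.frame(
  cellNames = c("bc1", "bc2"),
  nFrags = c(1500, 20000),
  stringsAsFactors = FALSE
)

rds.path <- tempfile(fileext = ".rds")
tsv.path <- tempfile(fileext = ".tsv")
saveRDS(meta, rds.path)

# Same conversion as the loop above, applied to the temporary file
r <- readRDS(rds.path)
write.table(r, tsv.path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the TSV back to confirm that the columns and rows survive
back <- read.delim(tsv.path, stringsAsFactors = FALSE)
print(back)
```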

In [8]:
doubScores <- addDoubletScores(
  input = ArrowFiles,
  k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
  knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search.
  LSIMethod = 1
)

ArchR logging to : ArchRLogs/ArchR-addDoubletScores-612e73c235ab3-Date-2022-06-04_Time-01-04-41.log
If there is an issue, please report to github with logFile!

2022-06-04 01:04:42 : Batch Execution w/ safelapply!, 0 mins elapsed.

2022-06-04 01:04:42 : D2M (1 of 2) :  Computing Doublet Statistics, 0.001 mins elapsed.

D2M (1 of 2) : UMAP Projection R^2 = 0.91704

2022-06-04 01:09:29 : D1M (2 of 2) :  Computing Doublet Statistics, 4.789 mins elapsed.

Filtering 1 dims correlated > 0.75 to log10(depth + 1)

D1M (2 of 2) : UMAP Projection R^2 = 0.98878

ArchR logging successful to : ArchRLogs/ArchR-addDoubletScores-612e73c235ab3-Date-2022-06-04_Time-01-04-41.log



In [9]:
# moved files manually (krishna)
paste("/srv/scratch/surag/scMultiome/arrow/", ArrowFiles, sep='')

---

In [10]:
sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /users/surag/anaconda3/envs/r36_cran/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3                     nabor_0.5.0                      
 [3] Seurat_3.1.5                      BSgenome.Hsapiens.UCSC.hg38_1.4.1
 [5] BSgenome_1.54.0                   rtracklayer_1.46.0               
 [7] Biostrings_2.54.0                 XVect