<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFig5C-H_S24-S25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Generates Figures 5C-H, S24, and S25**

This notebook generates figures showing how amplification (copies per UMI) within genes differs across cell clusters.

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Generate the figures

The data for this figure is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVALPBMC.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC.ipynb




**1. Download the code and processed data**

In [1]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 339, done.
remote: Counting objects: 100% (339/339), done.
remote: Compressing objects: 100% (279/279), done.
remote: Total 2057 (delta 249), reused 87 (delta 60), pack-reused 1718
Receiving objects: 100% (2057/2057), 10.90 MiB | 14.43 MiB/s, done.
Resolving deltas: 100% (1434/1434), done.


In [2]:
#download processed data from Zenodo
![ -d "figureData" ] && rm -r figureData
!mkdir figureData

!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'


--2021-04-04 19:25:26--  https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 467646609 (446M) [application/octet-stream]
Saving to: ‘EVALPBMC.zip?download=1’


2021-04-04 19:26:17 (8.95 MB/s) - ‘EVALPBMC.zip?download=1’ saved [467646609/467646609]

Archive:  EVALPBMC.zip?download=1
   creating: EVALPBMC/
  inflating: EVALPBMC/Bug_10.RData   
  inflating: EVALPBMC/Bug_100.RData  
  inflating: EVALPBMC/Bug_20.RData   
  inflating: EVALPBMC/Bug_40.RData   
  inflating: EVALPBMC/Bug_5.RData    
  inflating: EVALPBMC/Bug_60.RData   
  inflating: EVALPBMC/Bug_80.RData   
  inflating: EVALPBMC/ds_summary.txt  
  inflating: EVALPBMC/pooledHist.RData  
  inflating: EVALPBMC/pooledHistDS.RData  
  inflating: EVALPBMC/PredEvalData.RDS  
  inflating: EVALPBMC/Stats.RData    


In [3]:
#Check that download worked
!cd figureData && ls -l && cd EVALPBMC && ls -l

total 4
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC
total 486724
-rw-r--r-- 1 root root 87322865 Jun 30  2020 Bug_100.RData
-rw-r--r-- 1 root root 53475778 Jun 30  2020 Bug_10.RData
-rw-r--r-- 1 root root 65711410 Jun 30  2020 Bug_20.RData
-rw-r--r-- 1 root root 75161084 Jun 30  2020 Bug_40.RData
-rw-r--r-- 1 root root 37818341 Jun 30  2020 Bug_5.RData
-rw-r--r-- 1 root root 80649419 Jun 30  2020 Bug_60.RData
-rw-r--r-- 1 root root 84316810 Jun 30  2020 Bug_80.RData
-rw-r--r-- 1 root root      992 Jul  1  2020 ds_summary.txt
-rw-r--r-- 1 root root   316188 Jul  1  2020 pooledHistDS.RData
-rw-r--r-- 1 root root   720120 Jul  1  2020 pooledHist.RData
-rw-r--r-- 1 root root 11259902 Jul  1  2020 PredEvalData.RDS
-rw-r--r-- 1 root root  1633732 Jun 30  2020 Stats.RData


**2. Prepare the R environment**

In [4]:
#load the rpy2 extension so that %%R cells can run R code
%reload_ext rpy2.ipython


In [5]:
%%R
#install the R packages
install.packages("dplyr")
install.packages("textTinyR")
install.packages("DescTools")
install.packages("qdapTools")
install.packages("ggplot2")
install.packages("ggpubr")
install.packages("matrixStats")
install.packages("Matrix.utils")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/dplyr_1.0.5.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 949019 bytes (926 KB)


In [6]:
%%R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BUSpaRse", update=FALSE)

R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/BiocManager_1.30.12.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 261321 bytes (255 KB)



In [None]:
%%R
install.packages("Seurat")

R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependencies ‘gtools’, ‘caTools’, ‘sass’, ‘jquerylib’, ‘sitmo’, ‘globals’, ‘listenv’, ‘parallelly’, ‘plyr’, ‘zoo’, ‘gplots’, ‘reshape2’, ‘httpuv’, ‘xtable’, ‘sourcetools’, ‘bslib’, ‘spatstat.data’, ‘spatstat.utils’, ‘spatstat.sparse’, ‘tensor’, ‘goftest’, ‘deldir’, ‘polyclip’, ‘FNN’, ‘RSpectra’, ‘dqrng’, ‘fitdistrplus’, ‘future’, ‘future.apply’, ‘ggridges’, ‘ica’, ‘igraph’, ‘irlba’, ‘leiden’, ‘lmtest’, ‘miniUI’, ‘patchwork’, ‘pbapply’, ‘plotly’, ‘png’, ‘RANN’, ‘RcppAnnoy’, ‘reticulate’, ‘ROCR’, ‘Rtsne’, ‘scattermore’, ‘sctransform’, ‘SeuratObject’, ‘shiny’, ‘spatstat.core’, ‘spatstat.geom’, ‘uwot’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/gtools_3.8.2.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 197529 bytes (192 KB)


In [None]:
%%R
install.packages("preseqR")

**3. Generate the figures**


In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Import helpers (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"CCCHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))
source(paste0(sourcePath,"ggplotHelpers.R"))





In [None]:
#create figure directory
![ -d "figures" ] && rm -r figures
!mkdir figures

In [None]:
%%R
#Load libraries
library(ggplot2)
library("ggpubr")
library(tidyverse)
library(BUSpaRse)
library(textTinyR)
library(qdapTools)
library(Seurat)
library("matrixStats")
library(Matrix.utils)

In [None]:
%%R
########################
# Load data and run through Seurat
########################

gn10x = read_count_output(dir=paste0(dataPath, "EVALPBMC/bus_output/genecounts"), "output", tcc=FALSE)
countsPerCell = sparse_Sums(gn10x)
gn10x = gn10x[,countsPerCell > 200]
gn10x = gn10x[sparse_Sums(gn10x, rowSums = TRUE) != 0,]
#convert the genes to gene symbols
tr2g = read.table(paste0(dataPath, "EVALPBMC/bus_output/transcripts_to_genes.txt"), stringsAsFactors = F)
lookupTable = tr2g[,2:3]
lookupTable= unique(lookupTable)

outGenes = lookup(row.names(gn10x), lookupTable)

row.names(gn10x) = outGenes
gn10x = aggregate.Matrix(gn10x, outGenes, fun='sum')



set.seed(1)


#run through Seurat
d = CreateSeuratObject(counts = gn10x, project = "10x", min.cells = 0, min.features = 0)

#create variable for mitochondrial count fraction
d[["percent.mt"]] <- PercentageFeatureSet(d, pattern = "^MT-")
VlnPlot(d, features = c("percent.mt"))

cellFilter = d$percent.mt < 10

#easiest to recreate the Seurat object from the filtered count matrix

gn10x = gn10x[,cellFilter]
d = CreateSeuratObject(counts = gn10x, project = "10x", min.cells = 0, min.features = 0)


#library size normalization + change to log scale
d <- NormalizeData(d, normalization.method = "LogNormalize", scale.factor = 10000)

#Finds the most variable genes
d <- FindVariableFeatures(d, selection.method = "vst", nfeatures = 2000)

#Make mean and variance the same for all genes:
d <- ScaleData(d, features = rownames(d))

#Principal component analysis
d <- RunPCA(d, features = VariableFeatures(object = d))

#do clustering
d <- FindNeighbors(d, dims = 1:10)
d <- FindClusters(d, resolution = 0.5)

#Generate UMAP map
d <- RunUMAP(d, dims = 1:10)

clust = d$seurat_clusters


d = RenameIdents(object = d, `0` = "T1", `1` = "T2", `2` = "T3", `3` = "M1", `4` = "B", `5` = "M2", `6` = "T4", `7` = "T5", `8` = "U1", `9` = "U2", `10` = "U3",  `11` = "U4")

#Test 1 - make sure we identified the clusters correctly
d2 = subset(d, seurat_clusters == 0 | seurat_clusters == 1 | seurat_clusters == 2 | seurat_clusters == 6  | seurat_clusters == 7 ) #T/NK cells
DimPlot(d2, reduction = "umap")
d2 = subset(d, seurat_clusters == 3 | seurat_clusters == 5 ) #Monocytes
DimPlot(d2, reduction = "umap")
d2 = subset(d, seurat_clusters == 4) #B cells
DimPlot(d2, reduction = "umap")
#Test 2 - Identify the cell types in the clusters
FeaturePlot(d, c("CD3D", "CD19", "LYZ"))
            

#show clustering
figS24 = DimPlot(d, reduction = "umap")
print(figS24)


ggsave(
  paste0(figure_path, "FigS24.png"),
  plot = figS24, device = "png",
  width = 4, height = 4, dpi = 300)


In [None]:
%%R
######################
# Compare CU across clusters
######################

cuPerCluster = matrix(, nrow = nrow(gn10x), ncol = 12) #number of clusters is 12
rownames(cuPerCluster) = rownames(gn10x)
UMIsPerCluster = matrix(, nrow = nrow(gn10x), ncol = 12) #number of clusters is 12
rownames(UMIsPerCluster) = rownames(gn10x)
cellsPerCluster = rep(NA,12)
clusterNames = c("T1","T2","T3","M1","B","M2","T4","T5","U1","U2","U3","U4") #compared manually with id setting above
colnames(cuPerCluster) = clusterNames
colnames(UMIsPerCluster) = clusterNames

#load the bug
loadBug("EVALPBMC")
bug = getBug("EVALPBMC")

for (i in 1:12) {
  sel = d$seurat_clusters == i-1
  cellsPerCluster[i] = sum(sel)
  UMIsPerCluster[,i] = sparse_Sums(gn10x[,sel], rowSums = TRUE)

  #to figure out cu per cluster, we need to use the bug
  barcodes = colnames(gn10x)[sel]
  subBug = bug[bug$barcode %in% barcodes,]
  
  res = subBug %>% group_by(gene) %>% summarize(cu = mean(count))
  colnames(res) = c("x","y")
  cuPerCluster[,i] = lookup(rownames(cuPerCluster), res)
}

cuPerClusterFilt = cuPerCluster[,cellsPerCluster > 300]
UMIsPerClusterFilt = UMIsPerCluster[,cellsPerCluster > 300]
cuPerClusterFilt[UMIsPerClusterFilt < 20] = NA

#only measure the variance for somewhat highly expressed genes
cuPerClusterFilt2 = cuPerClusterFilt[(rowMeans(UMIsPerClusterFilt) > 60) & (rowSums(!is.na(cuPerClusterFilt)) > 2),]
dim(cuPerClusterFilt2)
cuPerClusterFilt2
variances = rowVars(cuPerClusterFilt2, na.rm=TRUE)

srt = sort(variances, index.return=TRUE, decreasing = TRUE)
sel = seq_len(length(srt$x)) %in% srt$ix[1:50] #get the 50 genes with most variance

#sel = variances > 10
sum(sel)

TPMs = UMIsPerCluster
for (i in 1:12) {
  TPMs[,i] = TPMs[,i]*10^6/sum(TPMs[,i])
}
colSums(TPMs)#ok


cuForVariableGenes = cuPerClusterFilt2[sel,]
UMIsForVariableGenes = UMIsPerCluster[rownames(UMIsPerCluster) %in% rownames(cuForVariableGenes),]
TPMsForVariableGenes = TPMs[rownames(TPMs) %in% rownames(cuForVariableGenes),cellsPerCluster > 300]
logtrans = log2(TPMsForVariableGenes + 1)

max(variances, na.rm=TRUE)

geneSel = rownames(cuForVariableGenes) == "ALDH2" #ALDH2
cu = cuForVariableGenes[geneSel,]
logExpr = logtrans[geneSel,]
texts = names(cu)

dsPlot = tibble(x=cu, y=logExpr, col=c(colors[1],colors[1],colors[1],colors[2],"#000000",colors[2],colors[1],colors[1]))
dsText = tibble(x=cu + 0.2, y=logExpr)
fit = lm(logExpr~cu)
summary(fit) #p = 0.0040, F-Test
#fix the texts a bit so they are visible and don't overlap
dsText$x[6] = dsText$x[6] - 1.5
dsText$y[8] = dsText$y[8] + 0.05
dsText$y[3] = dsText$y[3] - 0.05

pE = ggplot2::ggplot(dsPlot,ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_point(ggplot2::aes(x=x,y=y), color=dsPlot$col, shape=1, size=1) +
  geom_abline(slope = fit$coefficients[2], intercept = fit$coefficients[1], colour="#008800", size=1.3) +
  geom_text(data=dsText, label=texts, size=4, hjust = 0, color=dsPlot$col, parse=FALSE) +
  ggplot2::labs(y=expression(Log[2]*"(CPM + 1)"), x="Copies per UMI", title="ALDH2 acr. clusters, uncorr.") +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position= "bottom", legend.direction = "horizontal",#, legend.title = element_blank())
        strip.text.x = element_text(size = 12, face = "bold"),
        #legend.position= "none",
        plot.title = element_text(face = "bold"),
        strip.background = element_blank())
print(pE)
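
The copies-per-UMI (CU) value used in the cell above is the mean read count over all molecules of a gene, restricted to the barcodes of one cluster. The following base-R sketch shows the same idea on a tiny made-up table. The table `toyBug`, its values, and the barcode selection are assumptions for illustration; the notebook uses `dplyr` on the real BUG data instead of `tapply`.

```r
# Toy BUG-like table: one row per molecule (barcode + UMI + gene);
# 'count' is the number of reads (copies) observed for that molecule.
# All values are invented for illustration.
toyBug <- data.frame(
  barcode = c("c1", "c1", "c1", "c2", "c2", "c3"),
  gene    = c("A",  "A",  "B",  "A",  "B",  "B"),
  count   = c(1,    3,    2,    5,    6,    10),
  stringsAsFactors = FALSE
)

# Hypothetical cluster membership: cells c1 and c2 form the cluster of interest
clusterBarcodes <- c("c1", "c2")
subBug <- toyBug[toyBug$barcode %in% clusterBarcodes, ]

# CU per gene = mean number of copies over the gene's molecules in the cluster
cuPerGene <- tapply(subBug$count, subBug$gene, mean)
print(cuPerGene)
```

Here gene A has molecules with 1, 3, and 5 copies (CU of 3) and gene B has molecules with 2 and 6 copies (CU of 4); the molecule from cell c3 is excluded because it is outside the cluster.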


In [None]:
%%R
###########################
#look at the histograms for T cells and Monocytes
###########################

#first extract the barcodes
sel = d$seurat_clusters == 0 | d$seurat_clusters == 1 | d$seurat_clusters == 2 | d$seurat_clusters == 6 | d$seurat_clusters == 7
tcellBarcodes = colnames(gn10x)[sel]
sel = d$seurat_clusters == 3 | d$seurat_clusters == 5
monoBarcodes = colnames(gn10x)[sel]

#load the bug
tCounts = bug[(bug$barcode %in% tcellBarcodes) & (bug$gene == "ALDH2"),]$count
mCounts = bug[(bug$barcode %in% monoBarcodes) & (bug$gene == "ALDH2"),]$count
mCountsFilt = mCounts[mCounts < 30]
hend = max(c(mCountsFilt,tCounts))
numTTotMolecules = nrow(bug[bug$barcode %in% tcellBarcodes,])
numMTotMolecules = nrow(bug[bug$barcode %in% monoBarcodes,])
numTTotCounts = sum(bug[bug$barcode %in% tcellBarcodes,]$count)
numMTotCounts = sum(bug[bug$barcode %in% monoBarcodes,]$count)


ht = hist(tCounts, breaks=seq(0.5, hend+0.5, by=1), plot = FALSE)$counts
hm = hist(mCountsFilt, breaks=seq(0.5, hend+0.5, by=1), plot = FALSE)$counts

htScaled = ht/numTTotMolecules * 10^6
hmScaled = hm/numMTotMolecules * 10^6


dsPlot = data.frame(x = 1:hend, y = htScaled)
pC = ggplot(dsPlot,aes(x=x,y=y)) +
  geom_bar(stat="identity", fill = colors[1]) +
  labs(y="CPM", x="Counts per UMI", title="ALDH2, T cells") +
  ylim(0,50) +
  theme(panel.background = element_rect("white", "white", 0, 
                                        0, "white"),
        plot.title = element_text(face = "bold")
  )
print(pC)

dsPlot = data.frame(x = 1:hend, y = hmScaled)
pD = ggplot(dsPlot,aes(x=x,y=y)) +
  geom_bar(stat="identity", fill = colors[2]) +
  labs(y="CPM", x="Counts per UMI", title="ALDH2, Monocytes") +
  ylim(0,50) +
  theme(panel.background = element_rect("white", "white", 0, 
                                        0, "white"),
        plot.title = element_text(face = "bold")
  )
print(pD)
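
The histograms above place the breaks at half-integers so that each bin is centered on an integer copy number, and then scale the bin counts by the total number of molecules in the cell group. A minimal sketch of that binning and scaling, using made-up values (both `copies` and `totalMolecules` are assumptions for illustration):

```r
# Invented copies-per-UMI values for one gene in one cell group
copies <- c(1, 1, 1, 2, 2, 3, 5)
# Hypothetical total number of molecules (all genes) in the same cell group
totalMolecules <- 2000

# Breaks at 0.5, 1.5, ... so each bin covers exactly one integer copy number
hend <- max(copies)
h <- hist(copies, breaks = seq(0.5, hend + 0.5, by = 1), plot = FALSE)

# Molecules per million molecules for each copy number 1..hend
cpm <- h$counts / totalMolecules * 10^6
names(cpm) <- seq_len(hend)
print(cpm)
```

Dividing by the total molecule count of the group, rather than by the gene's own molecule count, is what makes the bar heights comparable between T cells and monocytes.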




In [None]:
%%R
##############################
#Plot all variable genes
##############################

#first scale all genes to the same scale (sum of 1000)
scaled = TPMsForVariableGenes*1000/rowSums(TPMsForVariableGenes)
rowSums(scaled)
logScaled = log2(scaled+1)
cu = NULL
scLogExpr = NULL

for (i in 1:nrow(cuForVariableGenes)) {
  sel = !is.na(cuForVariableGenes[i,])
  cu = c(cu, as.numeric(cuForVariableGenes[i,sel]))  
  scLogExpr = as.numeric(c(scLogExpr, logScaled[i,sel]))
}

fit = lm(scLogExpr~cu)
summary(fit)# p-value: < 2.2e-16, F-Test

dsPlot = tibble(x=cu, y=scLogExpr)
dsText = tibble(x=1, y=10)


pG = ggplot2::ggplot(dsPlot,ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_point(ggplot2::aes(x=x,y=y), color="black", shape=1, size=1) +
  geom_abline(slope = fit$coefficients[2], intercept = fit$coefficients[1], colour="#008800", size=1.3) +
  geom_text(data=dsText, label=paste0("R = ",format(cor(cu,scLogExpr), digits=2)), size=5, hjust = 0, parse=FALSE) +
  ggplot2::labs(y=expression(Log[2]*"(Norm. Expr + pc)"), x="Copies per UMI", title="Ampl. acr. clusters, uncorr.") +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position= "bottom", legend.direction = "horizontal",#, legend.title = element_blank())
        strip.text.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(face = "bold"),
        strip.background = element_blank())
print(pG)
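
Before pooling genes into one scatter plot, the cell above rescales every gene so that its expression sums to 1000 across clusters, which puts highly and lowly expressed genes on a common scale. A small sketch of that row-wise scaling on an invented matrix (gene and cluster names are assumptions for illustration):

```r
# Toy gene-by-cluster expression matrix (values are invented)
expr <- matrix(c(10, 30, 60,
                  1,  1,  2),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), c("cl1", "cl2", "cl3")))

# Dividing a matrix by a vector of length nrow recycles down the columns,
# so every entry in row i is divided by that row's sum
scaled <- expr * 1000 / rowSums(expr)
print(rowSums(scaled))   # each gene now sums to 1000

# Log-transform with a pseudocount of 1, as in the notebook
logScaled <- log2(scaled + 1)
print(round(logScaled, 2))
```

After scaling, only the relative distribution of a gene across clusters remains, so the pooled regression against copies per UMI is not dominated by a few highly expressed genes.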


In [None]:
%%R
######################
# Now predict and see if the correlation goes down
######################

dsid = "EVALPBMC"
bug = getBug(dsid)

#first filter on the genes in the plot to make the bug smaller
genes = rownames(cuForVariableGenes)
bugFilt1 = bug[bug$gene %in% genes,]

predDS = UMIsForVariableGenes #just allocate the right size
predDS[,] = NA
predZTNB = UMIsForVariableGenes #just allocate the right size
predZTNB[,] = NA


for (i in 1:12) {
  print(i)
  sel = clust == i-1
  cellIds = names(clust)[sel]
  clustBug = bugFilt1[bugFilt1$barcode %in% cellIds,] #filter a bit more
  if (cellsPerCluster[i] > 300) {
    for (g in 1:length(genes)) {
      if (UMIsForVariableGenes[g,i] >= 20) {
        #Get the histogram
        counts = clustBug[clustBug$gene == genes[g],]$count
        h = hist(counts, breaks=seq(0.5, max(counts)+0.5, by=1), plot = FALSE)
        freq = h$mids
        counts = h$counts
        added = 0
        #preseq cannot handle if we have only ones, so modify the histogram slightly
        if ((length(freq)==1) && (freq[1] == 1)) {
          added = 2
          freq = c(1,2)
          counts = c(counts[1]+1,1)#room for improvement here
        }
        dd = as.matrix(data.frame(freq,counts));
        rSACZTNB = mod.ztnb.rSAC(dd, incTol = 1e-5, iterIncTol = 200);
        rSACDS = ds.rSAC(dd,mt=2)
        newCountsZTNB = rSACZTNB(10^20)
        newCountsDS = rSACDS(10^20)
        newCountsZTNB[newCountsZTNB < 0] = 0
        newCountsDS[newCountsDS < 0] = 0
        predZTNB[g,i] = newCountsZTNB - added;
        predDS[g,i] = newCountsDS - added;
      }
    }
  }
}


#remove all columns with just NA
predFiltZTNB = predZTNB[,colSums(!is.na(predZTNB)) != 0]
predLogZTNB = predFiltZTNB;
predFiltDS = predDS[,colSums(!is.na(predDS)) != 0]
predLogDS = predFiltDS;
#scale to the same number of average counts per cluster as before
colSumsData = TPMsForVariableGenes
colSumsData[is.na(predFiltZTNB)] = NA
scaleSizeZTNB = mean(colSums(colSumsData,na.rm=TRUE))
colSumsData = TPMsForVariableGenes
colSumsData[is.na(predFiltDS)] = NA
scaleSizeDS = mean(colSums(colSumsData,na.rm=TRUE))


for (i in 1:ncol(predFiltZTNB)) {
  TPMish = predFiltZTNB[,i]*scaleSizeZTNB/sum(predFiltZTNB, na.rm = TRUE)
  predLogZTNB[,i] = log2(TPMish + 1)
  TPMish = predFiltDS[,i]*scaleSizeDS/sum(predFiltDS, na.rm = TRUE)
  predLogDS[,i] = log2(TPMish + 1)
}

#use DS in main figure

#Get the average TPM per cluster before prediction
sumTPMBefore = sum(TPMsForVariableGenes[rownames(TPMsForVariableGenes) == "ALDH2",])

geneSel = rownames(predFiltDS) == "ALDH2" #ALDH2
cu = cuForVariableGenes[geneSel,]
pseudoTPM = predFiltDS[geneSel,] * sumTPMBefore/sum(predFiltDS[geneSel,])
logExpr = log2(pseudoTPM + 1)
fit = lm(logExpr~cu)
summary(fit) # p value: 0.5355, F-Test

dsPlot = tibble(x=cu, y=logExpr, col=c(colors[1],colors[1],colors[1],colors[2],"#000000",colors[2],colors[1],colors[1]))
dsText = tibble(x=cu + 0.2, y=logExpr)
#fix the texts a bit so they are visible and don't overlap
dsText$x[6] = dsText$x[6] - 1.5
dsText$y[1] = dsText$y[1] - 0.1

pF = ggplot2::ggplot(dsPlot,ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_point(ggplot2::aes(x=x,y=y), color=dsPlot$col, shape=1, size=1) +
  geom_abline(slope = fit$coefficients[2], intercept = fit$coefficients[1], colour="#008800", size=1.3) +
  geom_text(data=dsText, label=texts, size=4, hjust = 0, color=dsPlot$col, parse=FALSE) +
  ggplot2::labs(y=expression(Log[2]*"(pseudoTPM + pc)"), x="Copies per UMI", title="ALDH2 acr. clusters, corr.") +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position= "bottom", legend.direction = "horizontal",#, legend.title = element_blank())
        strip.text.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(face = "bold"),
        strip.background = element_blank())
print(pF)




##############################
#Plot all variable genes, now corrected
##############################

#first scale all genes to the same scale (sum of 1000)
scaled = predFiltDS*1000/rowSums(predFiltDS, na.rm=TRUE)
rowSums(scaled, na.rm=TRUE)
logScaled = log2(scaled+1)
cu = NULL
scLogExpr = NULL

for (i in 1:nrow(cuForVariableGenes)) {
  sel = !is.na(cuForVariableGenes[i,])
  cu = c(cu, as.numeric(cuForVariableGenes[i,sel]))  
  scLogExpr = as.numeric(c(scLogExpr, logScaled[i,sel]))
}

fit = lm(scLogExpr~cu)
summary(fit)#p = 5.111e-09, F-Test
dsPlot = tibble(x=cu, y=scLogExpr)
dsText = tibble(x=1, y=10)


pH = ggplot2::ggplot(dsPlot,ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_point(ggplot2::aes(x=x,y=y), color="black", shape=1, size=1) +
  geom_abline(slope = fit$coefficients[2], intercept = fit$coefficients[1], colour="#008800", size=1.3) +
  geom_text(data=dsText, label=paste0("R = ",format(cor(cu,scLogExpr), digits=2)), size=5, hjust = 0, parse=FALSE) +
  ggplot2::labs(y=expression(Log[2]*"(Norm. Expr + pc)"), x="Copies per UMI", title="Ampl. acr. clusters, corr.") +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position= "bottom", legend.direction = "horizontal",#, legend.title = element_blank())
        strip.text.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(face = "bold"),
        #legend.position= "none",
        strip.background = element_blank())
print(pH)
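
The prediction loop above builds a copies-per-UMI histogram for each gene and cluster and, when every molecule has exactly one copy, pads the histogram slightly because preseq cannot handle that case; the padding is then subtracted from the prediction. The sketch below isolates only that histogram preparation step. The helper name `prepareHistogram` is hypothetical (it is not part of GRNP_2020 or preseqR), the input values are invented, and no preseqR function is called here.

```r
# Hypothetical helper: build the two-column (freq, counts) histogram matrix
# and apply the same "only ones" workaround as the notebook.
prepareHistogram <- function(copies) {
  h <- hist(copies, breaks = seq(0.5, max(copies) + 0.5, by = 1), plot = FALSE)
  freq <- h$mids      # copy numbers 1, 2, ...
  counts <- h$counts  # number of molecules with that copy number
  added <- 0
  # If all molecules have exactly one copy, add one molecule with one copy
  # and one molecule with two copies, and remember that 2 molecules were added
  if (length(freq) == 1 && freq[1] == 1) {
    added <- 2
    freq <- c(1, 2)
    counts <- c(counts[1] + 1, 1)
  }
  list(hist = cbind(freq = freq, counts = counts), added = added)
}

# Case 1: only single-copy molecules, so the workaround is triggered
onlyOnes <- prepareHistogram(c(1, 1, 1, 1))
print(onlyOnes)

# Case 2: a regular histogram is left unchanged
regular <- prepareHistogram(c(1, 2, 2, 4))
print(regular)
```

In the notebook the resulting matrix is what gets passed to the preseqR estimators, and `added` is subtracted from the predicted number of molecules afterwards.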

In [None]:
%%R
fig5CH = ggarrange(pC, pD, pE, pF, pG, pH, nrow=2, ncol=3, labels=c("C","D","E","F","G","H"))
print(fig5CH)
ggsave(
  paste0(figure_path, "Fig5C-H.png"),
  plot = fig5CH, device = "png",
  width = 9, height = 6, dpi = 300)

In [None]:
%%R
#Create supplementary figure with ZTNB
################

scaled = predFiltZTNB*1000/rowSums(predFiltZTNB, na.rm=TRUE)
rowSums(scaled, na.rm=TRUE)
logScaled = log2(scaled+1)
cu = NULL
scLogExpr = NULL

for (i in 1:nrow(cuForVariableGenes)) {
  sel = !is.na(cuForVariableGenes[i,])
  cu = c(cu, as.numeric(cuForVariableGenes[i,sel]))  
  scLogExpr = as.numeric(c(scLogExpr, logScaled[i,sel]))
}

fit = lm(scLogExpr~cu)
summary(fit)#p=0.4397, F-Test
dsPlot = tibble(x=cu, y=scLogExpr)
dsText = tibble(x=1, y=10)


figS25 = ggplot2::ggplot(dsPlot,ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_point(ggplot2::aes(x=x,y=y), color="black", shape=1, size=1) +
  geom_abline(slope = fit$coefficients[2], intercept = fit$coefficients[1], colour="#008800", size=1.3) +
  geom_text(data=dsText, label=paste0("R = ",format(cor(cu,scLogExpr), digits=2)), size=5, hjust = 0, parse=FALSE) +
  ggplot2::labs(y=expression(Log[2]*"(Norm. Expr + pc)"), x="Copies per UMI") +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position= "bottom", legend.direction = "horizontal",#, legend.title = element_blank())
        strip.text.x = element_text(size = 12, face = "bold"),
        #legend.position= "none",
        strip.background = element_blank())
print(figS25)

ggsave(
  paste0(figure_path, "FigS25.png"),
  plot = figS25, device = "png",
  width = 3, height = 3, dpi = 300)

