<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFigS7-S21Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for the supplementary figures S7-S21**

This notebook precalculates the data for supplementary figures S7-S21, because generating the figures involves some heavy calculation steps. The most demanding task is predicting the number of unseen molecules for each gene with the ZTNB method. The notebook may take several hours to run.

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Define a general function to precalculate figure data for a dataset and save it to disk
4. Call the precalculation function for all datasets

The data for these figures is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVAL.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVALPBMC.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVALPBMC_DS.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVALPBMC_SW.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessLC.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessMRET.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessMRET2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_NG.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_NG_2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_3.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessMARSSEQ.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVAL.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC_DS.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC_SW.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_LC.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_MRET.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_MRET2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_NG.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_NG_2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_2.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_3.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_MARSSEQ.ipynb


**1. Download the code and processed data**

In [1]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 179, done.
remote: Counting objects: 100% (179/179), done.
remote: Compressing objects: 100% (141/141), done.
remote: Total 1897 (delta 121), reused 59 (delta 38), pack-reused 1718
Receiving objects: 100% (1897/1897), 9.82 MiB | 18.48 MiB/s, done.
Resolving deltas: 100% (1306/1306), done.


In [2]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/4661263/files/EVAL.zip?download=1 && unzip 'EVAL.zip?download=1' && rm 'EVAL.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC_DS.zip?download=1 && unzip 'EVALPBMC_DS.zip?download=1' && rm 'EVALPBMC_DS.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC_SW.zip?download=1 && unzip 'EVALPBMC_SW.zip?download=1' && rm 'EVALPBMC_SW.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/MRET.zip?download=1 && unzip 'MRET.zip?download=1' && rm 'MRET.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/MRET2.zip?download=1 && unzip 'MRET2.zip?download=1' && rm 'MRET2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/LC.zip?download=1 && unzip 'LC.zip?download=1' && rm 'LC.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V2.zip?download=1 && unzip 'PBMC_V2.zip?download=1' && rm 'PBMC_V2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3.zip?download=1 && unzip 'PBMC_V3.zip?download=1' && rm 'PBMC_V3.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3_2.zip?download=1 && unzip 'PBMC_V3_2.zip?download=1' && rm 'PBMC_V3_2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3_3.zip?download=1 && unzip 'PBMC_V3_3.zip?download=1' && rm 'PBMC_V3_3.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_NG.zip?download=1 && unzip 'PBMC_NG.zip?download=1' && rm 'PBMC_NG.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_NG_2.zip?download=1 && unzip 'PBMC_NG_2.zip?download=1' && rm 'PBMC_NG_2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/4661263/files/MARSSEQ.zip?download=1 && unzip 'MARSSEQ.zip?download=1' && rm 'MARSSEQ.zip?download=1'


--2021-04-03 21:34:43--  https://zenodo.org/record/4661263/files/EVAL.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206479312 (197M) [application/octet-stream]
Saving to: ‘EVAL.zip?download=1’


2021-04-03 21:35:04 (10.2 MB/s) - ‘EVAL.zip?download=1’ saved [206479312/206479312]

Archive:  EVAL.zip?download=1
   creating: EVAL/
  inflating: EVAL/Bug_10.RData       
  inflating: EVAL/Bug_100.RData      
  inflating: EVAL/Bug_20.RData       
  inflating: EVAL/Bug_25.RData       
  inflating: EVAL/Bug_40.RData       
  inflating: EVAL/Bug_5.RData        
  inflating: EVAL/Bug_60.RData       
  inflating: EVAL/Bug_80.RData       
  inflating: EVAL/ds_summary.txt     
  inflating: EVAL/PredEvalData.RDS   
  inflating: EVAL/Stats.RData        
--2021-04-03 21:35:06--  https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1
Resolving zenodo.o
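
The fourteen download commands above all follow one pattern, so they can be generated from a list of dataset names. The following is a minimal Python sketch of that pattern, not part of the original notebook: the helper names `zenodo_url` and `download_command` are hypothetical, and the code only builds the command strings without touching the network.

```python
# Dataset names as they appear in the download cell above.
DATASETS = [
    "EVAL", "EVALPBMC", "EVALPBMC_DS", "EVALPBMC_SW", "MRET", "MRET2", "LC",
    "PBMC_V2", "PBMC_V3", "PBMC_V3_2", "PBMC_V3_3", "PBMC_NG", "PBMC_NG_2",
    "MARSSEQ",
]

# Base URL taken from the wget commands above.
ZENODO_RECORD = "https://zenodo.org/record/4661263/files"


def zenodo_url(dataset):
    """Return the download URL used above for one dataset archive."""
    return f"{ZENODO_RECORD}/{dataset}.zip?download=1"


def download_command(dataset, target_dir="figureData"):
    """Build the shell command string that fetches and unpacks one archive."""
    archive = f"{dataset}.zip?download=1"
    return (
        f"cd {target_dir} && wget {zenodo_url(dataset)} "
        f"&& unzip '{archive}' && rm '{archive}'"
    )


# Print one command per dataset; nothing is downloaded here.
for name in DATASETS:
    print(download_command(name))
```

Each printed line matches one of the `!cd figureData && wget ...` lines in the cell, so a new dataset only needs a new entry in `DATASETS`.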

In [3]:
#Check that download worked
!cd figureData && ls -l && cd EVAL && ls -l

total 56
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVAL
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC_DS
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC_SW
drwxr-xr-x 2 root root 4096 Jul  1  2020 LC
drwxr-xr-x 2 root root 4096 Feb  4 19:07 MARSSEQ
drwxr-xr-x 2 root root 4096 Jul  1  2020 MRET
drwxr-xr-x 2 root root 4096 Jul  1  2020 MRET2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_NG
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_NG_2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V3
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V3_2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V3_3
total 212788
-rw-r--r-- 1 root root 37523336 Jun 30  2020 Bug_100.RData
-rw-r--r-- 1 root root 17301493 Jun 30  2020 Bug_10.RData
-rw-r--r-- 1 root root 23443334 Jun 30  2020 Bug_20.RData
-rw-r--r-- 1 root root 25288320 Jun 30  2020 Bug_25.RData
-rw-r--r-- 1 root root 29057075 Jun 30  2020 Bug_40.RData

**2. Prepare the R environment**

In [4]:
#switch to R mode
%reload_ext rpy2.ipython


In [5]:
#install the R packages
%%R
install.packages("dplyr")
install.packages("preseqR")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/dplyr_1.0.5.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 949019 bytes (926 KB)


**3. Define a general function to precalculate data for a dataset**

This function calculates all prediction data needed for the figures. For the A figures, the UMIs from all genes are joined into a single pool, and prediction is done on that pool. For the other figures (B-F), the predictions are done per gene. We predict using four methods:

1. Preseq DS (rational function approximation), truncating CU histograms at 2
2. Preseq DS, truncating CU histograms at 20
3. Zero-truncated negative binomial (ZTNB)
4. "Best practice", which selects Preseq DS (here with histograms truncated at 2) if the CV of the number of copies per molecule is > 1, otherwise ZTNB.
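
The selection rule in method 4 can be illustrated with a small Python sketch. It only shows the decision, not the predictors themselves, which live in `preseqHelpers.R`. The function names `copies_cv` and `choose_method` are hypothetical, and the use of the sample standard deviation for the CV is an assumption; the R helper may compute it differently.

```python
import statistics


def copies_cv(copies_per_molecule):
    """Coefficient of variation (sd / mean) of the copies-per-molecule counts.

    Assumption: the sample standard deviation is used; the helper in
    preseqHelpers.R may define the CV differently.
    """
    if len(copies_per_molecule) < 2:
        # The sample sd is undefined for one molecule; treat it as low dispersion.
        return 0.0
    mean = statistics.mean(copies_per_molecule)
    return statistics.stdev(copies_per_molecule) / mean


def choose_method(copies_per_molecule, cv_threshold=1.0):
    """Return the name of the predictor that the rule described above selects."""
    if copies_cv(copies_per_molecule) > cv_threshold:
        return "Preseq DS (mt=2)"
    return "ZTNB"


# Evenly amplified gene: CV well below 1, so ZTNB is selected.
print(choose_method([1, 2, 1, 2, 1, 2]))
# One heavily amplified molecule: CV above 1, so Preseq DS is selected.
print(choose_method([1, 1, 1, 1, 1, 40]))
```

The two toy genes have the same number of molecules but very different amplification spread, which is the property the rule uses to choose between the two predictors.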

In [6]:
#First set some path variables
%%R
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [7]:
#Import the code for prediction (available in other notebooks)
%%R
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))





R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [8]:
#Define a function that precalculates the figure data for a dataset
%%R
GenAlgEvaluationData <- function(dsid) {
  print(paste0("Processing ",dsid, ":"))
  loadStats(dsid)
  stats = getStats(dsid)
  
  
  ##################################
  #first, prediction of all UMIs
  ##################################
  
  #Get UMI counts at all different ds stages
  ###########################################
  
  dss = c("5","10","20","40","60","80","100")
  
  #generate the columns to extract from stats
  colList = dss #allocation of right size
  for (i in 1:length(dss)) {
    colList[i] = paste0("UMIs_", dsid, "_d_", dss[i])
  }
  extrFromStats = stats[,which(colnames(stats) %in% colList)]
  dsCounts = as.numeric(colSums(extrFromStats))

  
  #get histogram from 0.05:
  loadBug(dsid, 0.05)
  h = totalCPUHistogram(getBug(dsid, 0.05))
  rmBug(dsid, 0.05)

  x = c(0.05,0.1,0.2,0.4,0.6,0.8,1)
  t = c(1,2,4,8,12,16,20)


  #predict with Good-Toulmin
  predGT005 = rep(0,length(t))
  predGT005[1] = dsCounts[1];
  for (i in 2:length(t)) {
    predGT005[i] = goodToulmin(h,t[i])
  }
  
  #predict with Preseq DS, mt=20
  predPSDS005_20 = rep(0,length(t))
  predPSDS005_20[1] = dsCounts[1];
  for (i in 2:length(t)) {
    predPSDS005_20[i] = predPreSeqDS(h,t[i],20)
  }
  
  #predict with Preseq DS, mt=2
  predPSDS005_2 = rep(0,length(t))
  predPSDS005_2[1] = dsCounts[1];
  for (i in 2:length(t)) {
    predPSDS005_2[i] = predPreSeqDS(h,t[i],2)
  }
  
  #predict with Preseq ZTNB
  predPSZTNB005 = rep(0,length(t))
  predPSZTNB005[1] = dsCounts[1];
  for (i in 2:length(t)) {
    predPSZTNB005[i] = predPreSeqZTNB(h,t[i]) #just ignore the warnings, deprecated...
  }
  
  ##################################
  #now, prediction per gene for different methods
  ##################################
  
  loadBug(dsid, 0.1)
  ds10Bug = getBug(dsid, 0.1)
  rmBug(dsid, 0.1)

  collapsed10 = aggregate(count~gene, ds10Bug, FUN=c) #if you get an error here, you probably defined a variable called "c"...
  totUMIs10 = sapply(collapsed10$count, FUN=length)
  rm(ds10Bug)
  
  loadBug(dsid, 1)
  ds100Bug = getBug(dsid, 1)
  rmBug(dsid, 1)
  
  collapsedFull = aggregate(count~gene, ds100Bug, FUN=c) #if you get an error here, you probably defined a variable called "c"...
  rm(ds100Bug)
  merged2 = inner_join(collapsed10, collapsedFull, by="gene")
  rm(collapsed10, collapsedFull)
  colnames(merged2) = c("gene", "DS10", "Full")
  
  totUMIsFull = sapply(merged2$Full, FUN=length)
  #sort the genes on number of UMIs
  srt = sort(totUMIs10, index.return=T)
  umis = srt$x
  merged2srt = merged2[srt$ix,]
  
  #now predict using both ztnb and ds:
  numgenes = dim(merged2srt)[1]
  predds_20 = rep(0,numgenes)
  predds_2 = rep(0,numgenes)
  predztnb = rep(0,numgenes)
  predbp = rep(0,numgenes)
  predscaled = rep(0,numgenes)
  fracOnes = rep(0,numgenes)
  fullUMIs = rep(0,numgenes)
  
  globScale = sum(totUMIsFull)/sum(totUMIs10)

  #All quick ones.
  print(paste0("DS etc.:", numgenes))
  for (i in 1:numgenes) {
    if (i %% 1000 == 0) {
      print(i)
    }
    h = hist(merged2srt$DS10[[i]], breaks=seq(0.5, max(merged2srt$DS10[[i]])+0.5, by=1), plot = F)
    predds_20[[i]] = predPreSeqDS(h, 10, 20)
    predds_2[[i]] = predPreSeqDS(h, 10, 2)
    predscaled[[i]] = length(merged2srt$DS10[[i]])*globScale #this resembles CPM
    
    fullUMIs[[i]] = length(merged2srt$Full[[i]])
    
    fracOnes[[i]] = h$density[1]
  }
  
  #ztnb
  print(paste0("ZTNB:", numgenes))
  for (i in 1:numgenes) {
    if (i %% 1000 == 0) {
      print(i)
    }
    h = hist(merged2srt$DS10[[i]], breaks=seq(0.5, max(merged2srt$DS10[[i]])+0.5, by=1), plot = F)
    predztnb[[i]] = predPreSeqZTNB(h, 10)
  }
  
  #best practice
  print(paste0("Best practice:", numgenes))
  for (i in 1:numgenes) {
    if (i %% 1000 == 0) {
      print(i)
    }
    h = hist(merged2srt$DS10[[i]], breaks=seq(0.5, max(merged2srt$DS10[[i]])+0.5, by=1), plot = F)
    predbp[[i]] = predPreSeq(h, 10, mt=2)
  }
  
  
  toSave = list(x, dsCounts, predGT005, predPSDS005_2, predPSDS005_20, predPSZTNB005, 
                predds_20, predds_2, predztnb, predbp, predscaled, fracOnes, fullUMIs, umis,
                merged2srt)
  
  
  filename = paste0(figure_data_path, dsid, "/PredEvalData.RDS")
  saveRDS(toSave, filename)
  
  
}
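
Three of the per-gene quantities computed in the function are simple enough to restate outside R: the copies-per-UMI histogram passed to the predictors, `fracOnes`, and the linearly scaled baseline `predscaled`. The Python sketch below mirrors those steps on toy data. The function names and the dictionary layout (gene name mapped to a list of copies per UMI) are assumptions made for illustration, and the toy data uses the same genes at both depths.

```python
from collections import Counter


def copies_histogram(copies):
    """Count how many molecules were seen exactly 1, 2, ..., max(copies) times."""
    tally = Counter(copies)
    return [tally.get(k, 0) for k in range(1, max(copies) + 1)]


def frac_ones(copies):
    """Fraction of molecules observed exactly once (the fracOnes value)."""
    return copies_histogram(copies)[0] / len(copies)


def scaled_predictions(ds10, full):
    """Linear-scaling baseline: UMIs per gene at 10% times one global factor."""
    umis10 = {gene: len(c) for gene, c in ds10.items()}
    umis_full = {gene: len(c) for gene, c in full.items()}
    # One factor for all genes, as with globScale in the R code.
    glob_scale = sum(umis_full.values()) / sum(umis10.values())
    return {gene: n * glob_scale for gene, n in umis10.items()}


# Toy data: copies per UMI for two genes at 10% and at full depth.
ds10 = {"geneA": [1, 1, 2], "geneB": [1]}
full = {"geneA": [3, 5, 9, 1, 1, 2], "geneB": [4, 2]}

print(copies_histogram(ds10["geneA"]))
print(frac_ones(ds10["geneA"]))
print(scaled_predictions(ds10, full))
```

Because the baseline applies the same factor to every gene, it cannot capture gene-specific saturation, which is what the Preseq DS and ZTNB predictions are compared against.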


**4. Call the precalculation function for all datasets**


In [9]:
%%R
GenAlgEvaluationData("EVAL")
GenAlgEvaluationData("EVALPBMC")
GenAlgEvaluationData("EVALPBMC_DS")
GenAlgEvaluationData("EVALPBMC_SW")
GenAlgEvaluationData("PBMC_V3")
GenAlgEvaluationData("PBMC_V3_2")
GenAlgEvaluationData("PBMC_V3_3")
GenAlgEvaluationData("PBMC_NG")
GenAlgEvaluationData("PBMC_NG_2")
GenAlgEvaluationData("PBMC_V2")
GenAlgEvaluationData("LC")
GenAlgEvaluationData("MRET")
GenAlgEvaluationData("MRET2")
GenAlgEvaluationData("MARSSEQ")


[1] "Processing EVAL:"
[1] "DS etc.:17546"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] "ZTNB:17546"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] "Best practice:17546"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] "Processing EVALPBMC:"
[1] "DS etc.:17297"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] "ZTNB:17297"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] "Best practice:17297"
[1] 1000
[1] 2000
[

In [10]:
!cd figureData && ls -l && cd EVAL && ls -l

total 56
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVAL
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC_DS
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVALPBMC_SW
drwxr-xr-x 2 root root 4096 Jul  1  2020 LC
drwxr-xr-x 2 root root 4096 Feb  4 19:07 MARSSEQ
drwxr-xr-x 2 root root 4096 Jul  1  2020 MRET
drwxr-xr-x 2 root root 4096 Jul  1  2020 MRET2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_NG
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_NG_2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V3
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V3_2
drwxr-xr-x 2 root root 4096 Jul  1  2020 PBMC_V3_3
total 212788
-rw-r--r-- 1 root root 37523336 Jun 30  2020 Bug_100.RData
-rw-r--r-- 1 root root 17301493 Jun 30  2020 Bug_10.RData
-rw-r--r-- 1 root root 23443334 Jun 30  2020 Bug_20.RData
-rw-r--r-- 1 root root 25288320 Jun 30  2020 Bug_25.RData
-rw-r--r-- 1 root root 29057075 Jun 30  2020 Bug_40.RData
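
Each `PredEvalData.RDS` file written by `GenAlgEvaluationData` holds an unnamed list, so code that reads it has to address the elements by position. The Python sketch below records the order taken from the `toSave` list in the function above; the names `SAVED_FIELDS` and `r_index` are hypothetical and exist only to document the layout.

```python
# Order of the elements in toSave, copied from GenAlgEvaluationData.
SAVED_FIELDS = [
    "x", "dsCounts", "predGT005", "predPSDS005_2", "predPSDS005_20",
    "predPSZTNB005", "predds_20", "predds_2", "predztnb", "predbp",
    "predscaled", "fracOnes", "fullUMIs", "umis", "merged2srt",
]


def r_index(field):
    """Return the 1-based position of a field, as used with [[ ]] in R."""
    return SAVED_FIELDS.index(field) + 1


# Print the full layout as position/name pairs.
for position, field in enumerate(SAVED_FIELDS, start=1):
    print(position, field)
```

Note that the pooled predictions store the `mt=2` result before the `mt=20` result, while the per-gene predictions store `predds_20` before `predds_2`.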