<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFig4AC_S23Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for figure 4A-C and supplementary figure 23**

This notebook precalculates the data for figure 4A-C and supplementary figure 23, because generating these figures involves some heavy computation. The most demanding task is predicting the number of unseen molecules for each gene with the zero-truncated negative binomial (ZTNB) method. The notebook may take 15-30 minutes to run.
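
To make the ZTNB step more concrete, the following is a minimal Python sketch of the general idea, assuming ZTNB here means a zero-truncated negative binomial fitted to the copies-per-molecule histogram of a gene. It is not the notebook's implementation (that lives in the R helpers sourced further down); the function names, the crude grid search, and the toy histogram are all illustrative assumptions.

```python
import math

def nb_log_pmf(k, mu, size):
    """Log-probability of k under a negative binomial with mean mu and dispersion size."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))

def nb_p0(mu, size):
    """Probability that a molecule gets zero copies, i.e. is never observed."""
    return (size / (size + mu)) ** size

def ztnb_log_lik(hist, mu, size):
    """Log-likelihood of a histogram {copies: number_of_molecules}, zero class excluded."""
    log_norm = math.log(1.0 - nb_p0(mu, size))  # renormalise for the zero truncation
    return sum(n * (nb_log_pmf(k, mu, size) - log_norm) for k, n in hist.items())

def fit_ztnb(hist):
    """Crude grid-search fit (assumption: a real fit would use a numerical optimizer)."""
    grid = [0.05 * 1.2 ** i for i in range(40)]
    return max(((mu, size) for mu in grid for size in grid),
               key=lambda p: ztnb_log_lik(hist, *p))

def predict_molecules(hist, t):
    """Expected number of distinct molecules at t times the current sequencing depth."""
    mu, size = fit_ztnb(hist)
    observed = sum(hist.values())
    total = observed / (1.0 - nb_p0(mu, size))  # observed plus unseen molecules
    return total * (1.0 - nb_p0(mu * t, size))

# Hypothetical histogram: 700 molecules seen once, 200 twice, 70 three times, 30 four times.
toy_hist = {1: 700, 2: 200, 3: 70, 4: 30}
same_depth = predict_molecules(toy_hist, 1)   # reproduces the observed molecule count
deeper = predict_molecules(toy_hist, 10)      # extrapolation to 10x the depth
```

At `t = 1` the prediction equals the number of observed molecules by construction; larger `t` adds the share of unseen molecules the fitted model expects to appear at that depth.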

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Generate and save the data

The data used in these calculations is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_3.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_3.ipynb



**1. Download the code and processed data**

In [1]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 269, done.
remote: Counting objects: 100% (269/269), done.
remote: Compressing objects: 100% (212/212), done.
remote: Total 1987 (delta 194), reused 84 (delta 57), pack-reused 1718
Receiving objects: 100% (1987/1987), 10.80 MiB | 21.74 MiB/s, done.
Resolving deltas: 100% (1379/1379), done.


In [2]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'

--2021-04-04 16:59:58--  https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 467646609 (446M) [application/octet-stream]
Saving to: ‘EVALPBMC.zip?download=1’


2021-04-04 17:01:02 (7.15 MB/s) - ‘EVALPBMC.zip?download=1’ saved [467646609/467646609]

Archive:  EVALPBMC.zip?download=1
   creating: EVALPBMC/
  inflating: EVALPBMC/Bug_10.RData   
  inflating: EVALPBMC/Bug_100.RData  
  inflating: EVALPBMC/Bug_20.RData   
  inflating: EVALPBMC/Bug_40.RData   
  inflating: EVALPBMC/Bug_5.RData    
  inflating: EVALPBMC/Bug_60.RData   
  inflating: EVALPBMC/Bug_80.RData   
  inflating: EVALPBMC/ds_summary.txt  
  inflating: EVALPBMC/pooledHist.RData  
  inflating: EVALPBMC/pooledHistDS.RData  
  inflating: EVALPBMC/PredEvalData.RDS  
  inflating: EVALPBMC/Stats.RData    


In [3]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V2.zip?download=1 && unzip 'PBMC_V2.zip?download=1' && rm 'PBMC_V2.zip?download=1'

--2021-04-04 17:01:06--  https://zenodo.org/record/4661263/files/PBMC_V2.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 816179326 (778M) [application/octet-stream]
Saving to: ‘PBMC_V2.zip?download=1’


2021-04-04 17:02:25 (10.0 MB/s) - ‘PBMC_V2.zip?download=1’ saved [816179326/816179326]

Archive:  PBMC_V2.zip?download=1
   creating: PBMC_V2/
  inflating: PBMC_V2/Bug_10.RData    
  inflating: PBMC_V2/Bug_100.RData   
  inflating: PBMC_V2/Bug_20.RData    
  inflating: PBMC_V2/Bug_40.RData    
  inflating: PBMC_V2/Bug_5.RData     
  inflating: PBMC_V2/Bug_60.RData    
  inflating: PBMC_V2/Bug_80.RData    
  inflating: PBMC_V2/ds_summary.txt  
  inflating: PBMC_V2/pooledHist.RData  
  inflating: PBMC_V2/pooledHistDS.RData  
  inflating: PBMC_V2/PredEvalData.RDS  
  inflating: PBMC_V2/Stats.RData     


In [4]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3.zip?download=1 && unzip 'PBMC_V3.zip?download=1' && rm 'PBMC_V3.zip?download=1'

--2021-04-04 17:02:33--  https://zenodo.org/record/4661263/files/PBMC_V3.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1127576758 (1.0G) [application/octet-stream]
Saving to: ‘PBMC_V3.zip?download=1’


2021-04-04 17:04:27 (9.59 MB/s) - ‘PBMC_V3.zip?download=1’ saved [1127576758/1127576758]

Archive:  PBMC_V3.zip?download=1
   creating: PBMC_V3/
  inflating: PBMC_V3/Bug_10.RData    
  inflating: PBMC_V3/Bug_100.RData   
  inflating: PBMC_V3/Bug_20.RData    
  inflating: PBMC_V3/Bug_40.RData    
  inflating: PBMC_V3/Bug_5.RData     
  inflating: PBMC_V3/Bug_60.RData    
  inflating: PBMC_V3/Bug_80.RData    
  inflating: PBMC_V3/ds_summary.txt  
  inflating: PBMC_V3/pooledHist.RData  
  inflating: PBMC_V3/pooledHistDS.RData  
  inflating: PBMC_V3/PredEvalData.RDS  
  inflating: PBMC_V3/Stats.RData     


In [5]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3_2.zip?download=1 && unzip 'PBMC_V3_2.zip?download=1' && rm 'PBMC_V3_2.zip?download=1'

--2021-04-04 17:04:38--  https://zenodo.org/record/4661263/files/PBMC_V3_2.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1558568885 (1.5G) [application/octet-stream]
Saving to: ‘PBMC_V3_2.zip?download=1’


2021-04-04 17:07:01 (10.4 MB/s) - ‘PBMC_V3_2.zip?download=1’ saved [1558568885/1558568885]

Archive:  PBMC_V3_2.zip?download=1
   creating: PBMC_V3_2/
  inflating: PBMC_V3_2/Bug_10.RData  
  inflating: PBMC_V3_2/Bug_100.RData  
  inflating: PBMC_V3_2/Bug_20.RData  
  inflating: PBMC_V3_2/Bug_40.RData  
  inflating: PBMC_V3_2/Bug_5.RData   
  inflating: PBMC_V3_2/Bug_60.RData  
  inflating: PBMC_V3_2/Bug_80.RData  
  inflating: PBMC_V3_2/ds_summary.txt  
  inflating: PBMC_V3_2/pooledHist.RData  
  inflating: PBMC_V3_2/pooledHistDS.RData  
  inflating: PBMC_V3_2/PredEvalData.RDS  
  inflating: PBMC_V3_2/Stats.RData   


In [6]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3_3.zip?download=1 && unzip 'PBMC_V3_3.zip?download=1' && rm 'PBMC_V3_3.zip?download=1'

--2021-04-04 17:07:17--  https://zenodo.org/record/4661263/files/PBMC_V3_3.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1819306744 (1.7G) [application/octet-stream]
Saving to: ‘PBMC_V3_3.zip?download=1’


2021-04-04 17:10:13 (9.92 MB/s) - ‘PBMC_V3_3.zip?download=1’ saved [1819306744/1819306744]

Archive:  PBMC_V3_3.zip?download=1
   creating: PBMC_V3_3/
  inflating: PBMC_V3_3/Bug_10.RData  
  inflating: PBMC_V3_3/Bug_100.RData  
  inflating: PBMC_V3_3/Bug_20.RData  
  inflating: PBMC_V3_3/Bug_40.RData  
  inflating: PBMC_V3_3/Bug_5.RData   
  inflating: PBMC_V3_3/Bug_60.RData  
  inflating: PBMC_V3_3/Bug_80.RData  
  inflating: PBMC_V3_3/ds_summary.txt  
  inflating: PBMC_V3_3/pooledHist.RData  
  inflating: PBMC_V3_3/pooledHistDS.RData  
  inflating: PBMC_V3_3/PredEvalData.RDS  
  inflating: PBMC_V3_3/Stats.RData   


In [7]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_NG.zip?download=1 && unzip 'PBMC_NG.zip?download=1' && rm 'PBMC_NG.zip?download=1'

--2021-04-04 17:10:33--  https://zenodo.org/record/4661263/files/PBMC_NG.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1183970256 (1.1G) [application/octet-stream]
Saving to: ‘PBMC_NG.zip?download=1’


2021-04-04 17:12:29 (9.89 MB/s) - ‘PBMC_NG.zip?download=1’ saved [1183970256/1183970256]

Archive:  PBMC_NG.zip?download=1
   creating: PBMC_NG/
  inflating: PBMC_NG/Bug_10.RData    
  inflating: PBMC_NG/Bug_100.RData   
  inflating: PBMC_NG/Bug_20.RData    
  inflating: PBMC_NG/Bug_40.RData    
  inflating: PBMC_NG/Bug_5.RData     
  inflating: PBMC_NG/Bug_60.RData    
  inflating: PBMC_NG/Bug_80.RData    
  inflating: PBMC_NG/ds_summary.txt  
  inflating: PBMC_NG/pooledHist.RData  
  inflating: PBMC_NG/pooledHistDS.RData  
  inflating: PBMC_NG/PredEvalData.RDS  
  inflating: PBMC_NG/Stats.RData     


In [8]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_NG_2.zip?download=1 && unzip 'PBMC_NG_2.zip?download=1' && rm 'PBMC_NG_2.zip?download=1'

--2021-04-04 17:12:42--  https://zenodo.org/record/4661263/files/PBMC_NG_2.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1753743936 (1.6G) [application/octet-stream]
Saving to: ‘PBMC_NG_2.zip?download=1’


2021-04-04 17:15:27 (10.2 MB/s) - ‘PBMC_NG_2.zip?download=1’ saved [1753743936/1753743936]

Archive:  PBMC_NG_2.zip?download=1
   creating: PBMC_NG_2/
  inflating: PBMC_NG_2/Bug_10.RData  
  inflating: PBMC_NG_2/Bug_100.RData  
  inflating: PBMC_NG_2/Bug_20.RData  
  inflating: PBMC_NG_2/Bug_40.RData  
  inflating: PBMC_NG_2/Bug_5.RData   
  inflating: PBMC_NG_2/Bug_60.RData  
  inflating: PBMC_NG_2/Bug_80.RData  
  inflating: PBMC_NG_2/ds_summary.txt  
  inflating: PBMC_NG_2/pooledHist.RData  
  inflating: PBMC_NG_2/pooledHistDS.RData  
  inflating: PBMC_NG_2/PredEvalData.RDS  
  inflating: PBMC_NG_2/Stats.RData   


In [9]:
!cd figureData && wget https://zenodo.org/record/4661263/files/FigureData.zip?download=1 && unzip 'FigureData.zip?download=1' && rm 'FigureData.zip?download=1'

--2021-04-04 17:15:46--  https://zenodo.org/record/4661263/files/FigureData.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7490400 (7.1M) [application/octet-stream]
Saving to: ‘FigureData.zip?download=1’


2021-04-04 17:15:49 (7.13 MB/s) - ‘FigureData.zip?download=1’ saved [7490400/7490400]

Archive:  FigureData.zip?download=1
 extracting: Fig3C_r1.RDS            
 extracting: Fig3C_r2.RDS            
 extracting: Fig3_h1.RDS             
 extracting: Fig3_h2.RDS             
  inflating: Fig4AC_ldata.RDS        
  inflating: Fig4AC_ldata2.RDS       
  inflating: Fig4_DE.RDS             
  inflating: gc.RDS                  
  inflating: simDepthData.RDS        
  inflating: simFcData.RDS           
  inflating: simGexData.RDS          
 extracting: simMuData.RDS           
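
The eight download cells above all follow the same pattern: fetch `<name>.zip?download=1` from the Zenodo record, unzip it into `figureData`, and delete the archive. As a hedged alternative, that pattern could be written once in Python. The helper names below are hypothetical, and unlike cell 2 this sketch does not remove an existing `figureData` directory first.

```python
import urllib.request
import zipfile
from pathlib import Path

# Record URL and archive names are taken from the wget cells above.
ZENODO_RECORD = "https://zenodo.org/record/4661263/files"
ARCHIVES = ["EVALPBMC", "PBMC_V2", "PBMC_V3", "PBMC_V3_2",
            "PBMC_V3_3", "PBMC_NG", "PBMC_NG_2", "FigureData"]

def archive_url(name):
    """Build the download URL for one archive."""
    return f"{ZENODO_RECORD}/{name}.zip?download=1"

def fetch_archive(name, dest="figureData"):
    """Download one archive, extract it into dest, then delete the zip file."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    zip_path = dest_dir / f"{name}.zip"
    urllib.request.urlretrieve(archive_url(name), str(zip_path))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    zip_path.unlink()

def fetch_all(dest="figureData"):
    """Fetch every archive in turn (several gigabytes in total)."""
    for name in ARCHIVES:
        fetch_archive(name, dest)
```

Nothing is downloaded until `fetch_all()` is called, so the definitions can be run safely before committing to the full transfer.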


In [10]:
#Check that download worked
!cd figureData && ls -l && cd EVALPBMC && ls -l

total 7968
drwxr-xr-x 2 root root    4096 Jul  1  2020 EVALPBMC
-rw-r--r-- 1 root root     239 Jun 30  2020 Fig3C_r1.RDS
-rw-r--r-- 1 root root     233 Jun 30  2020 Fig3C_r2.RDS
-rw-r--r-- 1 root root     683 Jun 30  2020 Fig3_h1.RDS
-rw-r--r-- 1 root root     830 Jun 30  2020 Fig3_h2.RDS
-rw-r--r-- 1 root root  372141 Apr  3 16:59 Fig4AC_ldata2.RDS
-rw-r--r-- 1 root root  480126 Apr  3 16:59 Fig4AC_ldata.RDS
-rw-r--r-- 1 root root 3949960 Feb 25 18:38 Fig4_DE.RDS
-rw-r--r-- 1 root root  304548 Feb 26 13:30 gc.RDS
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_NG
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_NG_2
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V2
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V3
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V3_2
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V3_3
-rw-r--r-- 1 root root 1128027 Feb 15 11:34 simDepthData.RDS
-rw-r--r-- 1 root root  621646 Feb 15 13:22 simFcData.RDS
-rw-r--r-- 1 root root 1237109 Feb 15 15:46
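
A directory listing only shows what is present. A slightly stricter check, sketched here in Python, compares each dataset folder against the file names that appear in the unzip logs above. The function name `missing_files` is hypothetical, and the expected file list is an assumption read off those logs rather than an official manifest.

```python
from pathlib import Path

# Dataset folders and per-dataset files as listed in the unzip output above.
DATASETS = ["EVALPBMC", "PBMC_V2", "PBMC_V3", "PBMC_V3_2",
            "PBMC_V3_3", "PBMC_NG", "PBMC_NG_2"]
EXPECTED_FILES = (
    [f"Bug_{pct}.RData" for pct in (5, 10, 20, 40, 60, 80, 100)]
    + ["ds_summary.txt", "pooledHist.RData", "pooledHistDS.RData",
       "PredEvalData.RDS", "Stats.RData"]
)

def missing_files(root="figureData"):
    """Return relative paths of expected files that are absent under root."""
    root_dir = Path(root)
    return [f"{ds}/{name}"
            for ds in DATASETS
            for name in EXPECTED_FILES
            if not (root_dir / ds / name).is_file()]
```

An empty list from `missing_files()` suggests that every dataset archive was downloaded and extracted completely.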

**2. Prepare the R environment**

In [11]:
#load the rpy2 extension, which enables the %%R cell magic
%reload_ext rpy2.ipython


In [17]:
%%R
#install the R packages
install.packages("qdapTools")
install.packages("dplyr")
install.packages("preseqR")
#install.packages("stringdist")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependencies ‘bitops’, ‘chron’, ‘data.table’, ‘RCurl’, ‘XML’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/bitops_1.0-6.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 8734 bytes


**3. Generate and save the data**

Three calculations are needed:

1. Prediction of the full dataset from data downsampled to 10%, using ZTNB
2. The same prediction with pooling of histograms across datasets
3. Calculation of sampling noise, defined as the difference between a downsampled dataset and the average of the same dataset downsampled 20 times.
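
The sampling-noise definition in the list above can be illustrated with a small Python sketch. It assumes binomial downsampling, in which each counted unit is kept independently with a fixed probability, and it assumes the single draw is separate from the 20 averaged draws. The notebook's actual routine is in `BinomialDownsampling.R` and may differ in both respects; the function names and toy counts here are hypothetical.

```python
import random

def binomial_downsample(counts, fraction, rng):
    """Keep each counted unit independently with probability `fraction`."""
    return [sum(rng.random() < fraction for _ in range(n)) for n in counts]

def sampling_noise(counts, fraction=0.1, repeats=20, seed=0):
    """Per-gene difference between one downsampled draw and the mean of `repeats` draws."""
    rng = random.Random(seed)
    single = binomial_downsample(counts, fraction, rng)
    draws = [binomial_downsample(counts, fraction, rng) for _ in range(repeats)]
    mean = [sum(gene_draws) / repeats for gene_draws in zip(*draws)]
    return [s - m for s, m in zip(single, mean)]

# Hypothetical per-gene counts for three genes.
toy_counts = [1000, 50, 0]
noise = sampling_noise(toy_counts)
```

Averaging over many draws suppresses the randomness of any single draw, so the difference isolates the noise that downsampling itself introduces.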


In [18]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [19]:
%%R
#Import the code for prediction (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))
source(paste0(sourcePath,"BinomialDownsampling.R"))




R[write to console]: 
Attaching package: ‘qdapTools’


R[write to console]: The following object is masked from ‘package:dplyr’:

    id




In [20]:
%%R
#Generate data and save it
dsid = "PBMC_V3_3"
otherIds = c("PBMC_V3", "PBMC_V3_2", "PBMC_NG", "PBMC_NG_2", "PBMC_V2", "EVALPBMC")
loadBug(dsid, 0.1)

dsBug = getBug(dsid, 0.1)


loadPooledHistogramDS("PBMC_V3_3")
loadPooledHistogramDS("PBMC_V3_2")
loadPooledHistogramDS("PBMC_V3")
loadPooledHistogramDS("PBMC_NG")
loadPooledHistogramDS("PBMC_NG_2")
loadPooledHistogramDS("PBMC_V2")
loadPooledHistogramDS("EVALPBMC")



#Collect the data

poolHistList = poolHistograms(dsid, dsBug, otherIds)

loadStats(dsid)

#create data for supplementary plot
loadBug(dsid)
bug = getBug(dsid)
binDs = binomialDownsampling(bug, 0.1)




#no prediction
fromStats = tibble(gene = statsPBMC_V3_3$gene, 
                   trueval = statsPBMC_V3_3$CPM_PBMC_V3_3_d_100,
                   x = statsPBMC_V3_3$CPM_PBMC_V3_3_d_10, #the 10% downsampled data; same values as nopred
                   nopred = statsPBMC_V3_3$CPM_PBMC_V3_3_d_10)

#prediction
pred100From10 = upSampleAndGetMeanExprPreSeqZTNB(dsBug, t=10)

colnames(pred100From10) = c("gene", "pred")

#prediction with pooling
predPool = poolPrediction(dsBug, 10, poolHistList, 500000)
colnames(predPool) = c("gene", "poolpred")



#sampling noise
colnames(binDs) = c("gene", "sampling")



#merge all
m1 = inner_join(fromStats, pred100From10, by="gene")
m2 = inner_join(m1, predPool, by="gene")

#move sampling noise to a supporting figure

ldata = m2

m3 = inner_join(fromStats, binDs, by="gene")

ldata2 = m3


#rescale each numeric column to sum to 10^6 (CPM), then apply log2(x + 1)
for (i in 2:6) {
  ldata[, i] = log2(ldata[, i]*10^6/sum(ldata[, i]) + 1)
}

for (i in 2:5) {
  ldata2[, i] = log2(ldata2[, i]*10^6/sum(ldata2[, i]) + 1)
}


saveRDS(ldata, paste0(figure_data_path, "Fig4AC_ldata.RDS"))
saveRDS(ldata2, paste0(figure_data_path, "Fig4AC_ldata2.RDS"))




[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] 18000
[1] 19000
[1] 20000
[1] 21000
[1] 22000
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] 18000
[1] 19000
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] 18000
[1] 19000
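
The final transform in the cell above replaces each numeric column x with log2(x * 10^6 / sum(x) + 1). A minimal Python equivalent for a single column, using a hypothetical helper name and toy values, looks like this:

```python
import math

def log2_cpm(values):
    """Scale a column to counts per million, then apply log2(x + 1)."""
    total = sum(values)
    return [math.log2(v * 1e6 / total + 1) for v in values]

# Hypothetical column of expression values for three genes.
toy_column = [1.0, 3.0, 0.0]
transformed = log2_cpm(toy_column)
```

The added 1 keeps genes with zero expression at exactly 0 after the log transform instead of producing minus infinity.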


In [16]:
!cd figureData && ls -l

total 7968
drwxr-xr-x 2 root root    4096 Jul  1  2020 EVALPBMC
-rw-r--r-- 1 root root     239 Jun 30  2020 Fig3C_r1.RDS
-rw-r--r-- 1 root root     233 Jun 30  2020 Fig3C_r2.RDS
-rw-r--r-- 1 root root     683 Jun 30  2020 Fig3_h1.RDS
-rw-r--r-- 1 root root     830 Jun 30  2020 Fig3_h2.RDS
-rw-r--r-- 1 root root  372141 Apr  3 16:59 Fig4AC_ldata2.RDS
-rw-r--r-- 1 root root  480126 Apr  3 16:59 Fig4AC_ldata.RDS
-rw-r--r-- 1 root root 3949960 Feb 25 18:38 Fig4_DE.RDS
-rw-r--r-- 1 root root  304548 Feb 26 13:30 gc.RDS
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_NG
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_NG_2
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V2
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V3
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V3_2
drwxr-xr-x 2 root root    4096 Jul  1  2020 PBMC_V3_3
-rw-r--r-- 1 root root 1128027 Feb 15 11:34 simDepthData.RDS
-rw-r--r-- 1 root root  621646 Feb 15 13:22 simFcData.RDS
-rw-r--r-- 1 root root 1237109 Feb 15 15:46

In [21]:
!ls

figureData  GRNP_2020  sample_data
