<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFigS6Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for figure 4 D-E**

This notebook precalculates the data for Figure 4 D-E, because generating the figure involves some computationally heavy steps. The most demanding step is predicting the number of unseen molecules for each gene with the zero-truncated negative binomial (ZTNB) method. The notebook may take 15-30 minutes to run.

Steps:
1. Download the code and processed data
2. Set up the R environment
3. Define a general function to precalculate figure data for a dataset and save it to disk
4. Call the precalculation function for all datasets

The data used in these calculations is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVALPBMC.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVALPBMC_DS.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC.ipynb
https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC_DS.ipynb



**1. Download the code and processed data**

In [None]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 122, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 1005 (delta 78), reused 64 (delta 34), pack-reused 883
Receiving objects: 100% (1005/1005), 7.38 MiB | 15.31 MiB/s, done.
Resolving deltas: 100% (632/632), done.


In [None]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'



--2020-07-02 18:59:46--  https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 467646609 (446M) [application/octet-stream]
Saving to: ‘EVALPBMC.zip?download=1’


2020-07-02 19:00:12 (17.7 MB/s) - ‘EVALPBMC.zip?download=1’ saved [467646609/467646609]

Archive:  EVALPBMC.zip?download=1
   creating: EVALPBMC/
  inflating: EVALPBMC/Bug_10.RData   
  inflating: EVALPBMC/Bug_100.RData  
  inflating: EVALPBMC/Bug_20.RData   
  inflating: EVALPBMC/Bug_40.RData   
  inflating: EVALPBMC/Bug_5.RData    
  inflating: EVALPBMC/Bug_60.RData   
  inflating: EVALPBMC/Bug_80.RData   
  inflating: EVALPBMC/ds_summary.txt  
  inflating: EVALPBMC/pooledHist.RData  
  inflating: EVALPBMC/pooledHistDS.RData  
  inflating: EVALPBMC/PredEvalData.RDS  
  inflating: EVALPBMC/Stats.RData    


In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC_DS.zip?download=1 && unzip 'EVALPBMC_DS.zip?download=1' && rm 'EVALPBMC_DS.zip?download=1'

--2020-07-02 19:00:17--  https://zenodo.org/record/4661263/files/EVALPBMC_DS.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 207886298 (198M) [application/octet-stream]
Saving to: ‘EVALPBMC_DS.zip?download=1’


2020-07-02 19:00:27 (23.2 MB/s) - ‘EVALPBMC_DS.zip?download=1’ saved [207886298/207886298]

Archive:  EVALPBMC_DS.zip?download=1
   creating: EVALPBMC_DS/
  inflating: EVALPBMC_DS/Bug_10.RData  
  inflating: EVALPBMC_DS/Bug_100.RData  
  inflating: EVALPBMC_DS/Bug_20.RData  
  inflating: EVALPBMC_DS/Bug_40.RData  
  inflating: EVALPBMC_DS/Bug_5.RData  
  inflating: EVALPBMC_DS/Bug_60.RData  
  inflating: EVALPBMC_DS/Bug_80.RData  
  inflating: EVALPBMC_DS/ds_summary.txt  
  inflating: EVALPBMC_DS/PredEvalData.RDS  
  inflating: EVALPBMC_DS/Stats.RData  


In [None]:
#Check that download worked
!cd figureData && ls -l && cd EVALPBMC && ls -l

total 8
drwxr-xr-x 2 root root 4096 Jul  1 20:29 EVALPBMC
drwxr-xr-x 2 root root 4096 Jul  1 21:25 EVALPBMC_DS
total 486728
-rw-r--r-- 1 root root 87322865 Jun 30 12:01 Bug_100.RData
-rw-r--r-- 1 root root 53475778 Jun 30 11:52 Bug_10.RData
-rw-r--r-- 1 root root 65711410 Jun 30 11:53 Bug_20.RData
-rw-r--r-- 1 root root 75161084 Jun 30 11:56 Bug_40.RData
-rw-r--r-- 1 root root 37818341 Jun 30 11:52 Bug_5.RData
-rw-r--r-- 1 root root 80649419 Jun 30 11:58 Bug_60.RData
-rw-r--r-- 1 root root 84316810 Jun 30 12:00 Bug_80.RData
-rw-r--r-- 1 root root      992 Jul  1 02:30 ds_summary.txt
-rw-r--r-- 1 root root   316188 Jul  1 15:25 pooledHistDS.RData
-rw-r--r-- 1 root root   720120 Jul  1 15:25 pooledHist.RData
-rw-r--r-- 1 root root 11259902 Jul  1 20:29 PredEvalData.RDS
-rw-r--r-- 1 root root  1633732 Jun 30 12:01 Stats.RData
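
The same check can be done programmatically instead of reading the `ls -l` listing by eye. The following is a minimal sketch in Python; the helper name `missing_files` and the constant `EXPECTED_FILES` are hypothetical (not part of the GRNP_2020 code), and the file names are the ones that appear in both archive listings above.

```python
from pathlib import Path

# File names taken from the unzip listings above (present in both archives).
# EXPECTED_FILES and missing_files are illustrative names, not part of GRNP_2020.
EXPECTED_FILES = [
    "Bug_5.RData", "Bug_10.RData", "Bug_20.RData", "Bug_40.RData",
    "Bug_60.RData", "Bug_80.RData", "Bug_100.RData",
    "ds_summary.txt", "PredEvalData.RDS", "Stats.RData",
]


def missing_files(directory, expected):
    """Return the names in `expected` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]
```

In this notebook it could be called as `missing_files("figureData/EVALPBMC", EXPECTED_FILES)` and again for `figureData/EVALPBMC_DS`; an empty list means every expected file was extracted.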


**2. Prepare the R environment**

In [None]:
#switch to R mode
%reload_ext rpy2.ipython




In [None]:
%%R
#install the R packages
#install.packages("qdapTools")
install.packages("dplyr")
install.packages("preseqR")
install.packages("DescTools")
#install.packages("stringdist")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/dplyr_1.0.0.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 836651 bytes (817 KB)


**3. Define a general function to precalculate data for a pair of datasets**

The plot shows the agreement, measured as Lin's concordance correlation coefficient (CCC), between two datasets at different numbers of reads.

The calculation is wrapped in a function so that it can be reused to compare additional pairs of datasets.
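
For readers unfamiliar with Lin's CCC, the following is a minimal sketch of the statistic in plain Python. It is not the code in `CCCHelpers.R`; the function name `lins_ccc` is hypothetical, and the sketch assumes population (1/n) moments. Unlike Pearson correlation, the CCC also penalizes differences in mean and scale, so it equals 1 only when the two vectors are identical.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient of two equal-length sequences.

    Illustrative helper (hypothetical name), using population (1/n) moments:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mx, my = fmean(x), fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = vx + vy + (mx - my) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom
```

For example, `lins_ccc([1, 2, 3], [2, 3, 4])` gives 4/7 even though the Pearson correlation of those vectors is 1, because the constant offset between them counts as disagreement.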


In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Import the code for prediction (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))
source(paste0(sourcePath,"CCCHelpers.R"))





R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [None]:
%%R
#Run the calculations and save the results to disk

#prediction is slow, so do it once here and save the result
genFig4Data = function(dsid1, dsid2, predVals) {
  #fetch the two datasets (they must already be loaded with loadBug)
  bug1 = getBug(dsid1)
  bug2 = getBug(dsid2)
  #predict the mean expression for each dataset at the values in predVals using ZTNB
  pred1 = upSampleAndGetMeanExprPreSeqZTNB(bug1, t=predVals)
  pred2 = upSampleAndGetMeanExprPreSeqZTNB(bug2, t=predVals)
  #alternative predictor, not used here
  #pred1 = upSampleAndGetMeanExprPreSeq(bug1, t=predVals, mt=2)
  #pred2 = upSampleAndGetMeanExprPreSeq(bug2, t=predVals, mt=2)
  return(list(pred1,pred2))
}


predVals_1 = c(1, 1.5, 2, 3, 4, 6, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096)

loadBug("EVALPBMC_DS")
loadBug("EVALPBMC")

d1 = genFig4Data("EVALPBMC_DS", "EVALPBMC", predVals_1)


saveRDS(d1, paste0(figure_data_path, "Fig4_DE.RDS"))



[1] "Genes: 18289"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] 18000
[1] "Genes: 19468"
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
[1] 11000
[1] 12000
[1] 13000
[1] 14000
[1] 15000
[1] 16000
[1] 17000
[1] 18000
[1] 19000
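
The internals of `upSampleAndGetMeanExprPreSeqZTNB` are not shown in this notebook. As a rough illustration of the general idea behind zero-truncated negative binomial extrapolation, the sketch below assumes that the number of copies observed per molecule follows a negative binomial distribution whose parameters `mu` (mean) and `size` (dispersion) have already been fitted to the observed, zero-truncated counts; the fitting step is omitted. The function names and parameters are hypothetical and this is not the code used by the helper.

```python
def nb_zero_prob(mu, size):
    """P(X = 0) for a negative binomial with mean `mu` and dispersion `size`."""
    return (1.0 + mu / size) ** (-size)


def predict_distinct(observed, mu, size, t):
    """Expected number of distinct molecules at `t` times the current depth.

    Illustrative sketch: `observed` molecules were seen at least once, so the
    estimated total (seen + unseen) is observed / (1 - P(0)). Scaling the
    depth by `t` scales the mean to t * mu, which shrinks P(0).
    """
    total = observed / (1.0 - nb_zero_prob(mu, size))
    return total * (1.0 - nb_zero_prob(t * mu, size))


# 100 molecules observed with mu = 1 and size = 1: half of all molecules are
# unseen, so the estimated total is 200; at 3x the depth 150 are expected.
print(predict_distinct(100, mu=1.0, size=1.0, t=3))
```

With `t = 1` the function returns the observed count, and as `t` grows the prediction approaches the estimated total number of molecules.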


In [None]:
!cd figureData && ls -l

total 3868
drwxr-xr-x 2 root root    4096 Jul  1 20:29 EVALPBMC
drwxr-xr-x 2 root root    4096 Jul  1 21:25 EVALPBMC_DS
-rw-r--r-- 1 root root 3949984 Jul  2 19:34 Fig4_d1.RDS
