<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFig1_3Data%20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for figure 1**

This notebook precalculates the data for figure 1, because generating the figure involves some heavy calculation steps. The most demanding task is predicting the number of unseen molecules for each gene with the ZTNB method (figure 1B, III). The notebook may take 30-60 minutes to run.
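
To make the idea behind the prediction step more concrete, the following is a minimal Python sketch of how a zero-truncated negative binomial (ZTNB) model can extrapolate the number of distinct molecules to a higher sequencing depth. It is not the code used by this notebook, which relies on the preseqR package through the repository's R helpers. The function names and parameters (`size`, `mu`, `t`) are hypothetical, and the sketch assumes the negative binomial parameters have already been fitted.

```python
def nb_zero_prob(size, mu):
    """P(X = 0) for a negative binomial with dispersion `size` and mean `mu`."""
    return (size / (size + mu)) ** size


def predict_molecules(observed, size, mu, t):
    """Expected number of distinct molecules seen at t-fold sequencing depth.

    observed: distinct molecules seen at the current depth
    size, mu: negative binomial parameters for copies per molecule
              (assumed to be fitted already; fitting is not shown here)
    t:        fold change in depth (t = 1 reproduces `observed`)
    """
    # Molecules with zero copies are unobserved, so the total (seen + unseen)
    # is the observed number divided by the probability of being seen.
    total = observed / (1.0 - nb_zero_prob(size, mu))
    # At t-fold depth the mean number of copies per molecule scales to t * mu.
    return total * (1.0 - nb_zero_prob(size, t * mu))


# Made-up numbers for illustration only
print(predict_molecules(observed=100, size=1.0, mu=0.5, t=4))
```

With these illustrative parameters, the model implies that two thirds of the molecules are unseen at the current depth, so the prediction grows with `t` but never exceeds the estimated total.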

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Generate the data

The data used in these calculations is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVAL.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVAL.ipynb


**1. Download the code and processed data**

In [None]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 226, done.
remote: Counting objects: 100% (226/226), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 1109 (delta 153), reused 78 (delta 44), pack-reused 883
Receiving objects: 100% (1109/1109), 7.42 MiB | 1.31 MiB/s, done.
Resolving deltas: 100% (707/707), done.


In [None]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/3909758/files/EVAL.zip?download=1 && unzip 'EVAL.zip?download=1' && rm 'EVAL.zip?download=1'



--2020-07-02 21:03:42--  https://zenodo.org/record/3909758/files/EVAL.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206479312 (197M) [application/octet-stream]
Saving to: ‘EVAL.zip?download=1’


2020-07-02 21:03:45 (97.3 MB/s) - ‘EVAL.zip?download=1’ saved [206479312/206479312]

Archive:  EVAL.zip?download=1
   creating: EVAL/
  inflating: EVAL/Bug_10.RData       
  inflating: EVAL/Bug_100.RData      
  inflating: EVAL/Bug_20.RData       
  inflating: EVAL/Bug_25.RData       
  inflating: EVAL/Bug_40.RData       
  inflating: EVAL/Bug_5.RData        
  inflating: EVAL/Bug_60.RData       
  inflating: EVAL/Bug_80.RData       
  inflating: EVAL/ds_summary.txt     
  inflating: EVAL/PredEvalData.RDS   
  inflating: EVAL/Stats.RData        


In [None]:
#Check that download worked
!cd figureData && ls -l && cd EVAL && ls -l

total 4
drwxr-xr-x 2 root root 4096 Jul  1 19:49 EVAL
total 212788
-rw-r--r-- 1 root root 37523336 Jun 30 13:45 Bug_100.RData
-rw-r--r-- 1 root root 17301493 Jun 30 13:42 Bug_10.RData
-rw-r--r-- 1 root root 23443334 Jun 30 13:42 Bug_20.RData
-rw-r--r-- 1 root root 25288320 Jun 30 13:42 Bug_25.RData
-rw-r--r-- 1 root root 29057075 Jun 30 13:43 Bug_40.RData
-rw-r--r-- 1 root root 11226736 Jun 30 13:41 Bug_5.RData
-rw-r--r-- 1 root root 32629892 Jun 30 13:44 Bug_60.RData
-rw-r--r-- 1 root root 35477251 Jun 30 13:44 Bug_80.RData
-rw-r--r-- 1 root root     1025 Jul  1 01:33 ds_summary.txt
-rw-r--r-- 1 root root  4167784 Jul  1 20:11 PredEvalData.RDS
-rw-r--r-- 1 root root  1761192 Jun 30 13:45 Stats.RData
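
As an alternative to reading the listing by eye, the check can be automated. The sketch below is a hypothetical helper (`missing_files` is not part of the repository) that compares the directory contents against the file names shown in the unzip log above.

```python
from pathlib import Path

# File names taken from the unzip log above.
EXPECTED_FILES = [
    "Bug_5.RData", "Bug_10.RData", "Bug_20.RData", "Bug_25.RData",
    "Bug_40.RData", "Bug_60.RData", "Bug_80.RData", "Bug_100.RData",
    "ds_summary.txt", "PredEvalData.RDS", "Stats.RData",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


missing = missing_files("figureData/EVAL")
print("Missing files:", missing if missing else "none")
```

An empty result suggests that the download and extraction completed; it does not verify the file contents.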


**2. Prepare the R environment**

In [None]:
#switch to R mode
%reload_ext rpy2.ipython




In [None]:
%%R
# Install the required R packages
install.packages("dplyr")
install.packages("preseqR")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/dplyr_1.0.0.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 836651 bytes (817 KB)


**3. Generate the data**

The most demanding step here is to predict, from each downsampled point in fig 1B III, the expression at the full number of reads. Although we only look at two genes, we still need to run the prediction for all genes to be able to CPM-normalize the expression.
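
The need to predict all genes follows from how CPM (counts per million) normalization works: each gene's value is divided by the total over all genes. A minimal Python sketch with made-up numbers (the helper `cpm_normalize` and the gene `OtherGene` are hypothetical and only serve as illustration):

```python
def cpm_normalize(counts):
    """Scale per-gene counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot CPM-normalize: total count is zero")
    return {gene: value * 1e6 / total for gene, value in counts.items()}


# Made-up predicted counts for illustration only
predicted = {"Ubb": 900.0, "Vmn1r13": 10.0, "OtherGene": 90.0}

all_genes = cpm_normalize(predicted)
two_genes = cpm_normalize({g: predicted[g] for g in ("Ubb", "Vmn1r13")})

# The CPM of a gene depends on the total over all genes, so leaving out
# genes changes the result for the genes that remain.
print(all_genes["Ubb"], two_genes["Ubb"])
```

The two printed values differ, which is why the prediction cannot be restricted to the two genes shown in the figure.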


In [None]:
%%R
# First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
# Import the code for prediction (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))






R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [None]:
%%R
# Run the calculations and save the results to disk

loadStats("EVAL")
# Use the histograms from the data downsampled to 0.25, which somewhat matches the A figure
loadBug("EVAL", 0.25)

# Fig 3A - histograms per gene

# Collect the copies-per-molecule counts of each gene into a list column
collapsedNonFilt = bug_EVAL_25 %>% group_by(gene) %>% do(countslist=c(.$count))

# One bin per integer count from 1 to 100
h1 = hist(collapsedNonFilt$countslist[collapsedNonFilt$gene == "Vmn1r13"][[1]], breaks=seq(0.5, 100.5, by=1), plot=FALSE)
h2 = hist(collapsedNonFilt$countslist[collapsedNonFilt$gene == "Ubb"][[1]], breaks=seq(0.5, 100.5, by=1), plot=FALSE)

saveRDS(h1, paste0(figure_data_path, "Fig3_h1.RDS"))
saveRDS(h2, paste0(figure_data_path, "Fig3_h2.RDS"))


# Now, fig 3C

# Create the prediction data

xes = c(1,2,4,5,8,12,16,20)
predVals = 20/xes  # fold increase needed to reach the full number of reads
downSamp = c(0.05, 0.1, 0.2, 0.25, 0.4, 0.6, 0.8, 1)

# Build it backwards, starting from the CPM of the full (non-downsampled) data
cpms = tibble(gene=statsEVAL$gene, n=statsEVAL$CPM_EVAL_d_100)
for (i in (length(xes)-1):1) {
  loadBug("EVAL", downSamp[i])
  pred = upSampleAndGetMeanExprPreSeqZTNB(getBug("EVAL", downSamp[i]), t=predVals[[i]])
  rmBug("EVAL", downSamp[i])
  cpm = pred
  # CPM-normalize the predicted expression over all genes
  cpm[[2]] = cpm[[2]]*10^6/sum(cpm[[2]])
  cpms = inner_join(cpm, cpms, by="gene")
}

# Keep only the two genes shown in the figure
r1 = data.frame(cpms)[cpms$gene == "Vmn1r13",]
r2 = data.frame(cpms)[cpms$gene == "Ubb",]

saveRDS(r1, paste0(figure_data_path, "Fig3C_r1.RDS"))
saveRDS(r2, paste0(figure_data_path, "Fig3C_r2.RDS"))


[1] "Genes: 19023"
[1] 1000
...
[1] 19000
[1] "Genes: 18864"
[1] 1000
...
[1] 18000
[1] "Genes: 18626"
[1] 1000
...
[1] 18000
[1] "Genes: 18311"
[1] 1000
...
[1] 18000
[1] "Genes: 18151"
[1] 1000
...
[1] 18000
[1] "Genes: 17546"
[1] 1000
[1] 2000
...
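
The relation between the downsampling fractions and the prediction factors in the cell above can be spelled out with a small Python sketch. The values are copied from the R code; the function `prediction_schedule` is a hypothetical name used only for this illustration.

```python
# Values copied from the R cell above.
xes = [1, 2, 4, 5, 8, 12, 16, 20]
down_samp = [0.05, 0.1, 0.2, 0.25, 0.4, 0.6, 0.8, 1]
pred_vals = [20 / x for x in xes]  # fold increase in depth for each fraction


def prediction_schedule():
    """Return (fraction, fold) pairs in the order the loop visits them."""
    # R's (length(xes)-1):1 is 1-based; the 0-based equivalent runs from the
    # second-to-last index down to 0, skipping the full dataset (fraction 1).
    return [(down_samp[i], pred_vals[i]) for i in range(len(xes) - 2, -1, -1)]


for fraction, fold in prediction_schedule():
    # fraction * fold is 1 (up to rounding), so each prediction targets full depth
    print(f"fraction {fraction:<4} -> predict at t = {fold:g}")
```

Each downsampled dataset is thus extrapolated by the factor that brings it back to the full number of reads. Because each iteration joins the new prediction onto the left of the accumulated table, the columns of `cpms` should end up ordered from the smallest fraction to the full dataset.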

In [None]:
!cd figureData && ls -l

total 20
drwxr-xr-x 2 root root 4096 Jul  1 19:49 EVAL
-rw-r--r-- 1 root root  682 Jul  2 21:05 Fig1_h1.RDS
-rw-r--r-- 1 root root  828 Jul  2 21:05 Fig1_h2.RDS
-rw-r--r-- 1 root root  237 Jul  2 22:11 Fig1_r1_III.RDS
-rw-r--r-- 1 root root  231 Jul  2 22:11 Fig1_r2_III.RDS
