<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFigS3_S20Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for figure 3 and supplementary figure 20**

This notebook precalculates the data for the figures, because generating them involves some computationally heavy steps. The most demanding one is predicting the number of unseen molecules for each gene with the zero-truncated negative binomial (ZTNB) method. The notebook may take 15-30 minutes to run.
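
To give a feel for what the ZTNB prediction involves, here is a minimal Python sketch of the extrapolation step only. It assumes the negative binomial parameters (`size` and `mu`, hypothetical names) have already been fitted to the zero-truncated counts of copies per molecule for one gene; the fitting, which is the expensive part, is handled by the R helper code and preseqR in this notebook and is not reproduced here. Under this model, sequencing `t` times deeper multiplies the mean by `t` while `size` stays fixed.

```python
def nb_prob_zero(size: float, mu: float) -> float:
    """Probability that a molecule is observed zero times under NB(size, mu)."""
    return (size / (size + mu)) ** size


def predict_observed_molecules(n_observed: float, size: float, mu: float, t: float) -> float:
    """Expected number of distinct molecules seen at t times the current depth."""
    # Estimated total number of molecules, including those unseen so far.
    total = n_observed / (1.0 - nb_prob_zero(size, mu))
    # Deeper sequencing scales the mean, so fewer molecules remain unseen.
    return total * (1.0 - nb_prob_zero(size, t * mu))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(predict_observed_molecules(n_observed=100, size=0.5, mu=0.3, t=10))
```

With `t = 1` the function returns the observed count unchanged, and as `t` grows the prediction approaches the estimated total. A value of `t = 10` would correspond to predicting a full dataset from a 10% downsampling.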

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Generate and save the data
4. Call the precalculation function for all datasets

**1. Download the code and processed data**

In [1]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 153, done.
remote: Counting objects: 100% (153/153), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 1036 (delta 98), reused 64 (delta 35), pack-reused 883
Receiving objects: 100% (1036/1036), 7.40 MiB | 5.04 MiB/s, done.
Resolving deltas: 100% (652/652), done.


In [2]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/3909758/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'

--2020-07-02 20:01:02--  https://zenodo.org/record/3909758/files/EVALPBMC.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 467646609 (446M) [application/octet-stream]
Saving to: ‘EVALPBMC.zip?download=1’


2020-07-02 20:02:28 (6.25 MB/s) - ‘EVALPBMC.zip?download=1’ saved [467646609/467646609]

Archive:  EVALPBMC.zip?download=1
   creating: EVALPBMC/
  inflating: EVALPBMC/Bug_10.RData   
  inflating: EVALPBMC/Bug_100.RData  
  inflating: EVALPBMC/Bug_20.RData   
  inflating: EVALPBMC/Bug_40.RData   
  inflating: EVALPBMC/Bug_5.RData    
  inflating: EVALPBMC/Bug_60.RData   
  inflating: EVALPBMC/Bug_80.RData   
  inflating: EVALPBMC/ds_summary.txt  
  inflating: EVALPBMC/pooledHist.RData  
  inflating: EVALPBMC/pooledHistDS.RData  
  inflating: EVALPBMC/PredEvalData.RDS  
  inflating: EVALPBMC/Stats.RData    


In [None]:
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V2.zip?download=1 && unzip 'PBMC_V2.zip?download=1' && rm 'PBMC_V2.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3.zip?download=1 && unzip 'PBMC_V3.zip?download=1' && rm 'PBMC_V3.zip?download=1'

--2020-07-02 20:02:35--  https://zenodo.org/record/3909758/files/PBMC_V3.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1127576758 (1.0G) [application/octet-stream]
Saving to: ‘PBMC_V3.zip?download=1’


In [None]:
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3_2.zip?download=1 && unzip 'PBMC_V3_2.zip?download=1' && rm 'PBMC_V3_2.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3_3.zip?download=1 && unzip 'PBMC_V3_3.zip?download=1' && rm 'PBMC_V3_3.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_NG.zip?download=1 && unzip 'PBMC_NG.zip?download=1' && rm 'PBMC_NG.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_NG_2.zip?download=1 && unzip 'PBMC_NG_2.zip?download=1' && rm 'PBMC_NG_2.zip?download=1'

In [None]:
#Check that download worked
!cd figureData && ls -l && cd EVALPBMC && ls -l
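
The same check can be done programmatically. The Python sketch below reports which of the dataset folders downloaded above are missing from `figureData`. The function name `missing_datasets` is made up for illustration, and the sketch assumes that each archive unpacks into a folder named after the dataset, as the EVALPBMC output shows.

```python
from pathlib import Path

# Dataset archives downloaded in the cells above.
DATASETS = ["EVALPBMC", "PBMC_V2", "PBMC_V3", "PBMC_V3_2",
            "PBMC_V3_3", "PBMC_NG", "PBMC_NG_2"]


def missing_datasets(root, names=DATASETS):
    """Return the dataset names that have no non-empty folder under root."""
    root = Path(root)
    missing = []
    for name in names:
        folder = root / name
        # Assumption: a successful unzip leaves a non-empty folder per dataset.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_datasets("figureData"))
```

An empty list means every expected folder is present; any name in the list points to a download cell worth rerunning.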

**2. Prepare the R environment**

In [None]:
#switch to R mode
%reload_ext rpy2.ipython


In [None]:
%%R
#install the R packages (the cell magic must be the first line of the cell)
#install.packages("qdapTools")
install.packages("dplyr")
install.packages("preseqR")
#install.packages("stringdist")


**3. Generate and save the data**

Three calculations are needed:

1. Prediction of downsampled data using ZTNB
2. Pooled prediction
3. Calculation of sampling noise, defined as the difference between a downsampled dataset and the average of the same dataset downsampled 20 times.
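
Item 3 can be illustrated with a small Python sketch that uses only the standard library. It draws binomial downsamplings of a toy vector of per-gene counts and takes the difference between one downsampling and the mean of 20. The toy counts, the function names, and the choice of binomial thinning are assumptions for illustration; the notebook itself loads a precomputed file (`PBMC_V3_3_ds10_20Times.RData`) rather than resampling.

```python
import random


def downsample(counts, fraction, rng):
    """Keep each molecule independently with probability `fraction`."""
    return [sum(rng.random() < fraction for _ in range(c)) for c in counts]


def sampling_noise(counts, fraction=0.1, repeats=20, seed=0):
    """Per-gene difference between one downsampling and the mean of `repeats`."""
    rng = random.Random(seed)
    single = downsample(counts, fraction, rng)
    runs = [downsample(counts, fraction, rng) for _ in range(repeats)]
    mean = [sum(run[g] for run in runs) / repeats for g in range(len(counts))]
    return [s - m for s, m in zip(single, mean)]


if __name__ == "__main__":
    toy_counts = [1000, 200, 50, 5]  # hypothetical molecule counts per gene
    print(sampling_noise(toy_counts))
```

Averaging over many downsamplings smooths out the random component, so what remains in the difference is the noise contributed by a single draw.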


In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Import the code for prediction (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))





In [None]:
%%R
#Generate data and save it
dsid = "PBMC_V3_3"
otherIds = c("PBMC_V3", "PBMC_V3_2", "PBMC_NG", "PBMC_NG_2", "PBMC_V2", "EVALPBMC")
loadBug(dsid, 0.1)

dsBug = getBug(dsid, 0.1)


loadPooledHistogramDS("PBMC_V3_3")
loadPooledHistogramDS("PBMC_V3_2")
loadPooledHistogramDS("PBMC_V3")
loadPooledHistogramDS("PBMC_NG")
loadPooledHistogramDS("PBMC_NG_2")
loadPooledHistogramDS("PBMC_V2")
loadPooledHistogramDS("EVALPBMC")



#Collect the data

poolHistList = poolHistograms(dsid, dsBug, otherIds)

loadStats(dsid)
load(file=paste0(figure_data_path, "PBMC_V3_3_ds10_20Times.RData")) #gets the data for comparison with sampling noise

#no prediction
fromStats = tibble(gene = statsPBMC_V3_3$gene, 
                   trueval = statsPBMC_V3_3$CPM_PBMC_V3_3_d_100,
                   x = statsPBMC_V3_3$CPM_PBMC_V3_3_d_10, #x is the 10% downsampled CPM, the same column as nopred
                   nopred = statsPBMC_V3_3$CPM_PBMC_V3_3_d_10)

#prediction
#pred100From10 = upSampleAndGetMeanExprPreSeq(dsBug, t=10, mt=2)
pred100From10 = upSampleAndGetMeanExprPreSeqZTNB(dsBug, t=10)

colnames(pred100From10) = c("gene", "pred")

#prediction with pooling
predPool = poolPrediction(dsBug, 10, poolHistList, 500000)
colnames(predPool) = c("gene", "poolpred")



#sampling noise
colnames(PBMC_V3_3_ds10_20Times) = c("gene", "sampling")



#merge all
m1 = inner_join(fromStats, pred100From10, by="gene")
m2 = inner_join(m1, predPool, by="gene")

#move sampling noise to a supporting figure
#m3 = inner_join(m2, PBMC_V3_3_ds10_20Times, by="gene")

ldata = m2

m3 = inner_join(fromStats, PBMC_V3_3_ds10_20Times, by="gene")

ldata2 = m3


#cpm and log transform
#for (i in 2:7) {
for (i in 2:6) {
  ldata[, i] = log2(ldata[, i]*10^6/sum(ldata[, i]) + 1)
}

for (i in 2:5) {
  ldata2[, i] = log2(ldata2[, i]*10^6/sum(ldata2[, i]) + 1)
}


saveRDS(ldata, paste0(figure_data_path, "Fig3_ldata.RDS"))
saveRDS(ldata2, paste0(figure_data_path, "Fig3_ldata2.RDS"))
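
The transformation applied in the two loops above is counts per million followed by log2(x + 1). As a language-neutral illustration, here is the same arithmetic in plain Python on a toy column; `log_cpm` is a hypothetical name and the input values are made up.

```python
import math


def log_cpm(values):
    """Scale values so they sum to one million, then apply log2(x + 1)."""
    total = sum(values)
    return [math.log2(v * 1e6 / total + 1) for v in values]


if __name__ == "__main__":
    print(log_cpm([0, 250_000, 750_000]))
```

Adding 1 before taking the logarithm keeps genes with zero counts at exactly 0 instead of negative infinity.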



**4. Call the precalculation function for all datasets**


In [None]:
%%R
GenAlgEvaluationData("EVAL")
GenAlgEvaluationData("EVALPBMC")
GenAlgEvaluationData("EVALPBMC_DS")
GenAlgEvaluationData("EVALPBMC_SW")
GenAlgEvaluationData("PBMC_V3")
GenAlgEvaluationData("PBMC_V3_2")
GenAlgEvaluationData("PBMC_V3_3")
GenAlgEvaluationData("PBMC_NG")
GenAlgEvaluationData("PBMC_NG_2")
GenAlgEvaluationData("PBMC_V2")
GenAlgEvaluationData("LC")
GenAlgEvaluationData("MRET")
GenAlgEvaluationData("MRET2")


In [None]:
!cd figureData && ls -l && cd EVAL && ls -l