<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFig4AC_S23Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for figure 4A-C and supplementary figure 23**

This notebook precalculates the data for the figures, because generating them involves several computationally heavy steps. The most demanding step is predicting the number of unseen molecules for each gene with the ZTNB (zero-truncated negative binomial) method. The notebook may take 15-30 minutes to run.
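
As a rough illustration of the idea behind this prediction step, the sketch below assumes that the number of copies per molecule follows a negative binomial distribution whose zero class (molecules that were never sequenced) is unobserved. The function name and the parameters `size` and `mu` are hypothetical and are taken as already fitted. The notebook itself relies on preseqR and the repository helpers, whose implementation may differ.

```python
def ztnb_predict_unique(observed_unique, size, mu, t):
    """Expected number of distinct molecules at t-fold sequencing depth.

    Assumption: copies per molecule are negative binomial with dispersion
    `size` and mean `mu`; molecules with zero copies are unseen.
    """
    # Probability that a molecule receives zero copies at the current depth.
    p0_now = (size / (size + mu)) ** size
    # Estimated total number of molecules, seen and unseen.
    total = observed_unique / (1.0 - p0_now)
    # Scaling the depth by t scales the mean number of copies to t * mu.
    p0_future = (size / (size + t * mu)) ** size
    return total * (1.0 - p0_future)


if __name__ == "__main__":
    # Toy numbers, not taken from any dataset in this notebook.
    print(ztnb_predict_unique(observed_unique=100, size=1.0, mu=1.0, t=10))
```

With `t=10`, this kind of extrapolation corresponds to predicting a full dataset from a 10% downsample.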

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Generate and save the data

The data used in these calculations is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_3.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_3.ipynb



**1. Download the code and processed data**

In [None]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


In [None]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/4661263/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V2.zip?download=1 && unzip 'PBMC_V2.zip?download=1' && rm 'PBMC_V2.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3.zip?download=1 && unzip 'PBMC_V3.zip?download=1' && rm 'PBMC_V3.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3_2.zip?download=1 && unzip 'PBMC_V3_2.zip?download=1' && rm 'PBMC_V3_2.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_V3_3.zip?download=1 && unzip 'PBMC_V3_3.zip?download=1' && rm 'PBMC_V3_3.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_NG.zip?download=1 && unzip 'PBMC_NG.zip?download=1' && rm 'PBMC_NG.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/PBMC_NG_2.zip?download=1 && unzip 'PBMC_NG_2.zip?download=1' && rm 'PBMC_NG_2.zip?download=1'

In [None]:
!cd figureData && wget https://zenodo.org/record/4661263/files/FigureData.zip?download=1 && unzip 'FigureData.zip?download=1' && rm 'FigureData.zip?download=1'

In [None]:
#Check that download worked
!cd figureData && ls -l && cd EVALPBMC && ls -l
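
The download cells above all follow the same pattern, so they could also be generated from a list of archive names. The sketch below only builds the shell command strings and does not run them; the helper names `archive_url` and `download_commands` are hypothetical, while the record number and archive names are the ones used above.

```python
# Zenodo record and archive names as they appear in the download cells.
ZENODO_RECORD = "4661263"
ARCHIVES = [
    "EVALPBMC", "PBMC_V2", "PBMC_V3", "PBMC_V3_2",
    "PBMC_V3_3", "PBMC_NG", "PBMC_NG_2", "FigureData",
]


def archive_url(name, record=ZENODO_RECORD):
    """Build the download URL for one archive."""
    return f"https://zenodo.org/record/{record}/files/{name}.zip?download=1"


def download_commands(names=ARCHIVES, target_dir="figureData"):
    """Return one download, unzip and clean-up command per archive."""
    commands = []
    for name in names:
        # As in the cells above, the saved file name keeps the query string.
        saved_as = f"{name}.zip?download=1"
        commands.append(
            f"cd {target_dir} && wget '{archive_url(name)}' "
            f"&& unzip '{saved_as}' && rm '{saved_as}'"
        )
    return commands


if __name__ == "__main__":
    for command in download_commands():
        print(command)
```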

**2. Prepare the R environment**

In [None]:
#load the rpy2 extension, which enables the %%R cell magic
%reload_ext rpy2.ipython


In [None]:
%%R
#install the required R packages
#install.packages("qdapTools")
install.packages("dplyr")
install.packages("preseqR")
#install.packages("stringdist")


**3. Generate and save the data**

Three calculations are performed:

1. Prediction of the full dataset from downsampled data using ZTNB
2. Pooled prediction
3. Calculation of sampling noise, defined as the difference between a downsampled dataset and the average of the same dataset downsampled 20 times.
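
The sampling-noise definition in item 3 can be illustrated with a small, self-contained sketch. It works on toy per-gene molecule counts and uses only the Python standard library; the function names are hypothetical, and the `binomialDownsampling` helper in the repository may be implemented differently.

```python
import random


def binomial_downsample(counts, fraction, rng):
    """Keep each molecule independently with probability `fraction`."""
    return [sum(rng.random() < fraction for _ in range(n)) for n in counts]


def sampling_noise(counts, fraction, repeats=20, seed=0):
    """Compare one downsampled dataset with the mean of `repeats` others."""
    rng = random.Random(seed)
    # A single downsampled dataset.
    single = binomial_downsample(counts, fraction, rng)
    # The same dataset downsampled `repeats` more times, averaged per gene.
    runs = [binomial_downsample(counts, fraction, rng) for _ in range(repeats)]
    mean = [sum(run[g] for run in runs) / repeats for g in range(len(counts))]
    # Sampling noise: per-gene difference between the single run and the mean.
    noise = [s - m for s, m in zip(single, mean)]
    return single, mean, noise


if __name__ == "__main__":
    # Toy counts, not taken from any dataset in this notebook.
    single, mean, noise = sampling_noise([0, 5, 200, 1000], fraction=0.1)
    print(single, mean, noise)
```

Genes with few molecules fluctuate strongly relative to their mean between runs, which is why a single downsampled dataset deviates from the averaged one.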


In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Import the code for prediction (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
source(paste0(sourcePath,"preseqHelpers.R"))
source(paste0(sourcePath,"BinomialDownsampling.R"))




In [None]:
%%R
#Generate data and save it
dsid = "PBMC_V3_3"
otherIds = c("PBMC_V3", "PBMC_V3_2", "PBMC_NG", "PBMC_NG_2", "PBMC_V2", "EVALPBMC")
loadBug(dsid, 0.1)

dsBug = getBug(dsid, 0.1)


loadPooledHistogramDS("PBMC_V3_3")
loadPooledHistogramDS("PBMC_V3_2")
loadPooledHistogramDS("PBMC_V3")
loadPooledHistogramDS("PBMC_NG")
loadPooledHistogramDS("PBMC_NG_2")
loadPooledHistogramDS("PBMC_V2")
loadPooledHistogramDS("EVALPBMC")



#Collect the data

poolHistList = poolHistograms(dsid, dsBug, otherIds)

loadStats(dsid)

#create data for supplementary plot
loadBug(dsid)
bug = getBug(dsid)
binDs = binomialDownsampling(bug, 0.1)




#no prediction
fromStats = tibble(gene = statsPBMC_V3_3$gene, 
                   trueval = statsPBMC_V3_3$CPM_PBMC_V3_3_d_100,
                   x = statsPBMC_V3_3$CPM_PBMC_V3_3_d_10, #same column as nopred
                   nopred = statsPBMC_V3_3$CPM_PBMC_V3_3_d_10)

#prediction
pred100From10 = upSampleAndGetMeanExprPreSeqZTNB(dsBug, t=10)

colnames(pred100From10) = c("gene", "pred")

#prediction with pooling
predPool = poolPrediction(dsBug, 10, poolHistList, 500000)
colnames(predPool) = c("gene", "poolpred")



#sampling noise
colnames(binDs) = c("gene", "sampling")



#merge all
m1 = inner_join(fromStats, pred100From10, by="gene")
m2 = inner_join(m1, predPool, by="gene")

#move sampling noise to a supporting figure

ldata = m2

m3 = inner_join(fromStats, binDs, by="gene")

ldata2 = m3


#CPM-normalize each numeric column and apply log2(x + 1); column 1 holds the gene names
for (i in 2:6) {
  ldata[, i] = log2(ldata[, i]*10^6/sum(ldata[, i]) + 1)
}

for (i in 2:5) {
  ldata2[, i] = log2(ldata2[, i]*10^6/sum(ldata2[, i]) + 1)
}


saveRDS(ldata, paste0(figure_data_path, "Fig4AC_ldata.RDS"))
saveRDS(ldata2, paste0(figure_data_path, "Fig4AC_ldata2.RDS"))




In [None]:
!cd figureData && ls -l