<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_Pooled.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Creates pooled data from several datasets to be used for prediction**

This notebook assembles CU histograms from multiple datasets for use in pooled prediction, a method in which lowly expressed genes get a more stable CU histogram by "borrowing" information from similar datasets. We create two variants here: pooled data from the full datasets and pooled data from downsampled datasets. The downsampled variant is intended for downsampling experiments, such as figure 3. We generate pooled data from 6 PBMC 10X datasets. For the downsampled variant, we downsample each dataset 10 times and pool a total of 60 datasets, since so much information is lost in the downsampling.
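
As a rough illustration of the pooling idea, the sketch below sums per-gene histograms across datasets. It is not the code in `ButterflyHelpers.R`: the function name `pool_cu_histograms`, the dictionary layout (gene to a map from copy number to number of UMIs), and the assumption that pooling is an element-wise sum of histogram counts are all chosen here for illustration.

```python
from collections import Counter

def pool_cu_histograms(datasets):
    """Sum per-gene histograms across datasets.

    `datasets` is a list of dicts mapping gene -> {copies: n_umis}.
    (Hypothetical layout, chosen for illustration only.)
    """
    pooled = {}
    for hists in datasets:
        for gene, hist in hists.items():
            # Counter.update adds counts rather than replacing them
            pooled.setdefault(gene, Counter()).update(hist)
    return {gene: dict(counts) for gene, counts in pooled.items()}

# Toy data: GENE_B has only 2 UMIs in each dataset taken alone
ds1 = {"GENE_A": {1: 50, 2: 20}, "GENE_B": {1: 2}}
ds2 = {"GENE_A": {1: 40, 3: 5}, "GENE_B": {1: 1, 2: 1}}
pooled = pool_cu_histograms([ds1, ds2])
print(pooled["GENE_B"])
```

In this toy case the sparse gene ends up with a histogram built from more UMIs than either dataset provides alone, which is the "borrowing" effect described above.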

We also downsample a dataset so that prediction can be compared to sampling noise (fig S20).

Steps:
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process and save the data


**1. Clone the code repo and download data to process**

In [1]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 89, done.

In [None]:
#download BUG data from Zenodo
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3_2.zip?download=1 && unzip 'PBMC_V3_2.zip?download=1' && rm 'PBMC_V3_2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3.zip?download=1 && unzip 'PBMC_V3.zip?download=1' && rm 'PBMC_V3.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_NG.zip?download=1 && unzip 'PBMC_NG.zip?download=1' && rm 'PBMC_NG.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_NG_2.zip?download=1 && unzip 'PBMC_NG_2.zip?download=1' && rm 'PBMC_NG_2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V2.zip?download=1 && unzip 'PBMC_V2.zip?download=1' && rm 'PBMC_V2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'


--2020-07-02 17:45:38--  https://zenodo.org/record/3909758/files/PBMC_V3_2.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1558568885 (1.5G) [application/octet-stream]
Saving to: ‘PBMC_V3_2.zip?download=1’


In [None]:
#Check that download worked
!cd figureData && ls -l && cd PBMC_V3_2 && ls -l
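
The six download commands above differ only in the dataset name. A small helper along the following lines makes that pattern explicit. This is a sketch: `zenodo_url`, `ZENODO_RECORD`, and `DATASETS` are illustrative names that are not part of the repository, and the code only builds the URLs without downloading anything.

```python
ZENODO_RECORD = "3909758"  # record id taken from the download cell above
DATASETS = ["PBMC_V3_2", "PBMC_V3", "PBMC_NG", "PBMC_NG_2", "PBMC_V2", "EVALPBMC"]

def zenodo_url(dataset, record=ZENODO_RECORD):
    """Build the download URL for one dataset archive."""
    return f"https://zenodo.org/record/{record}/files/{dataset}.zip?download=1"

urls = [zenodo_url(name) for name in DATASETS]
for url in urls:
    print(url)
```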

**2. Prepare the R environment**

In [None]:
#switch to R mode
%reload_ext rpy2.ipython


In [None]:
#install the R packages
%%R
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")


**3. Process and save the data**

Here we generate the pooled CU histograms, both from the full datasets and from the downsampled ones. In addition, we calculate the average expression across 20 downsamplings, which removes most of the sampling noise. We then estimate the sampling noise as the difference between this mean and data that is downsampled only once.
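
The noise estimate described above can be sketched as follows. This is a toy stand-in, not `downSampleBUGNTimes`: it assumes that downsampling keeps each read independently with probability 0.1, and the names `downsample_counts` and `read_counts`, the toy counts, and the seed are hypothetical.

```python
import random

def downsample_counts(read_counts, fraction, rng):
    """Keep each read independently with probability `fraction` (assumed scheme)."""
    return {
        gene: sum(1 for _ in range(n) if rng.random() < fraction)
        for gene, n in read_counts.items()
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
read_counts = {"GENE_A": 5000, "GENE_B": 40}  # toy per-gene read counts

# Mean over 20 downsamplings: most of the sampling noise averages out
runs = [downsample_counts(read_counts, 0.1, rng) for _ in range(20)]
mean_expr = {g: sum(r[g] for r in runs) / len(runs) for g in read_counts}

# A single downsampling still carries the full sampling noise
single = downsample_counts(read_counts, 0.1, rng)
noise = {g: single[g] - mean_expr[g] for g in read_counts}
print(mean_expr, noise)
```

Under this assumed scheme the averaged value for a highly expressed gene sits close to 10% of its original count, while a single draw scatters around it; that scatter is what the difference is meant to capture.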

In [None]:
#First set some path variables
%%R
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
#Process and filter the BUG file
%%R
source(paste0(sourcePath, "ButterflyHelpers.R"))

#generate pooled histograms (for fig 3C)

loadBug("PBMC_V3")
generatePooledHistogramDS("PBMC_V3")
generatePooledHistogram("PBMC_V3") #not really used, but generated anyway since it takes little extra effort
rmBug("PBMC_V3")

loadBug("PBMC_V3_2")
generatePooledHistogramDS("PBMC_V3_2")
generatePooledHistogram("PBMC_V3_2")
rmBug("PBMC_V3_2")

loadBug("PBMC_V3_3")
generatePooledHistogramDS("PBMC_V3_3")
generatePooledHistogram("PBMC_V3_3")
#also downsample PBMC_V3_3 20 times to enable comparison with sampling noise (fig S20)
PBMC_V3_3_ds10_20Times = downSampleBUGNTimes(getBug("PBMC_V3_3", 1), 0.1, 20)
save(PBMC_V3_3_ds10_20Times, file=paste0(figure_data_path, "PBMC_V3_3_ds10_20Times.RData"))
rmBug("PBMC_V3_3")

loadBug("PBMC_NG")
generatePooledHistogramDS("PBMC_NG")
generatePooledHistogram("PBMC_NG")
rmBug("PBMC_NG")

loadBug("PBMC_NG_2")
generatePooledHistogramDS("PBMC_NG_2")
generatePooledHistogram("PBMC_NG_2")
rmBug("PBMC_NG_2")

loadBug("PBMC_V2")
generatePooledHistogramDS("PBMC_V2")
generatePooledHistogram("PBMC_V2")
rmBug("PBMC_V2")

loadBug("EVALPBMC")
generatePooledHistogramDS("EVALPBMC")
generatePooledHistogram("EVALPBMC")
rmBug("EVALPBMC")

loadBug("PBMC_V3_3", 1)
rmBug("PBMC_V3_3")



In [None]:
!cd figureData && ls -l 