<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_Pooled.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Creates pooled data from several datasets to be used for prediction**

This notebook assembles CU histograms from multiple datasets for use in pooled prediction, a method in which lowly expressed genes get a more stable CU histogram by "borrowing" information from similar datasets. We create two variants here: pooled data from the full datasets and pooled data from downsampled datasets. The latter is intended for down-sampling experiments, such as figure 3. We generate pooled data from 6 PBMC 10X datasets. For the downsampled variant, we downsample each dataset 10 times and pool a total of 60 datasets, since so much information is lost in the downsampling.
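
To make the pooling idea concrete, here is a minimal Python sketch. It is not the repository's `generatePooledHistogram` implementation. It assumes a CU histogram maps a copies-per-UMI value to the number of molecules seen with that many copies, and that pooling simply sums histograms gene by gene. The function name `pool_cu_histograms` and the toy gene names are hypothetical.

```python
from collections import Counter, defaultdict


def pool_cu_histograms(datasets):
    """Sum per-gene CU histograms across datasets.

    datasets: list of dicts mapping gene -> {copies_per_umi: n_molecules}.
    Assumption: pooling is a plain element-wise sum per gene.
    """
    pooled = defaultdict(Counter)
    for hist_by_gene in datasets:
        for gene, hist in hist_by_gene.items():
            # Counter.update adds the counts of matching keys
            pooled[gene].update(hist)
    return {gene: dict(hist) for gene, hist in pooled.items()}


# Toy data: GENE1 is lowly expressed in both datasets
ds_a = {"GENE1": {1: 3, 2: 1}, "GENE2": {1: 10, 2: 4, 3: 1}}
ds_b = {"GENE1": {1: 2, 3: 1}}

pooled = pool_cu_histograms([ds_a, ds_b])
print(pooled["GENE1"])
```

In this toy case the pooled histogram for `GENE1` is built from more molecules than either dataset provides alone, which is the sense in which a gene "borrows" information.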

We also downsample one dataset repeatedly so that prediction can be compared with sampling noise (fig S20).
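
The repeated downsampling can be illustrated with the following Python sketch. It is a stand-in built on assumptions, not the code behind `downSampleBUGNTimes`, whose internals are not shown in this notebook. It represents each molecule by its read count, keeps every read independently with probability `fraction` (binomial thinning), and repeats this `n` times. The function names and toy counts are hypothetical.

```python
import random


def downsample_counts(counts, fraction, rng):
    """Keep each read independently with probability `fraction`.

    counts: list of read counts, one entry per molecule.
    Molecules that lose all their reads are dropped.
    """
    thinned = []
    for c in counts:
        kept = sum(1 for _ in range(c) if rng.random() < fraction)
        if kept > 0:
            thinned.append(kept)
    return thinned


def downsample_n_times(counts, fraction, n, seed=0):
    """Produce n independent downsampled replicates from the same input."""
    rng = random.Random(seed)  # seeded so the replicates are reproducible
    return [downsample_counts(counts, fraction, rng) for _ in range(n)]


# Toy input: 300 molecules with varying numbers of copies
counts = [8, 1, 3, 12, 1, 5] * 50
replicates = downsample_n_times(counts, 0.1, 20)

# The spread in surviving molecules across replicates is pure sampling noise
molecules_left = [len(r) for r in replicates]
print(min(molecules_left), max(molecules_left))
```

Comparing a prediction error against the replicate-to-replicate spread indicates how much of the error could be explained by sampling alone.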

Steps:
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process and save the data


**1. Clone the code repo and download data to process**

In [1]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 924 (delta 21), reused 25 (delta 11), pack-reused 883
Receiving objects: 100% (924/924), 7.36 MiB | 21.00 MiB/s, done.
Resolving deltas: 100% (575/575), done.


In [3]:
#download BUG data from Zenodo
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3_2.zip?download=1 && unzip 'PBMC_V3_2.zip?download=1' && rm 'PBMC_V3_2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V3.zip?download=1 && unzip 'PBMC_V3.zip?download=1' && rm 'PBMC_V3.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_NG.zip?download=1 && unzip 'PBMC_NG.zip?download=1' && rm 'PBMC_NG.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_NG_2.zip?download=1 && unzip 'PBMC_NG_2.zip?download=1' && rm 'PBMC_NG_2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/PBMC_V2.zip?download=1 && unzip 'PBMC_V2.zip?download=1' && rm 'PBMC_V2.zip?download=1'
!cd figureData && wget https://zenodo.org/record/3909758/files/EVALPBMC.zip?download=1 && unzip 'EVALPBMC.zip?download=1' && rm 'EVALPBMC.zip?download=1'


--2020-07-02 16:06:46--  https://zenodo.org/record/3909758/files/PBMC_V3_2.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1558568885 (1.5G) [application/octet-stream]
Saving to: ‘PBMC_V3_2.zip?download=1’


2020-07-02 16:08:54 (11.6 MB/s) - ‘PBMC_V3_2.zip?download=1’ saved [1558568885/1558568885]

Archive:  PBMC_V3_2.zip?download=1
   creating: PBMC_V3_2/
  inflating: PBMC_V3_2/Bug_10.RData  
  inflating: PBMC_V3_2/Bug_100.RData  
  inflating: PBMC_V3_2/Bug_20.RData  
  inflating: PBMC_V3_2/Bug_40.RData  
  inflating: PBMC_V3_2/Bug_5.RData   
  inflating: PBMC_V3_2/Bug_60.RData  
  inflating: PBMC_V3_2/Bug_80.RData  
  inflating: PBMC_V3_2/ds_summary.txt  
  inflating: PBMC_V3_2/pooledHist.RData  
  inflating: PBMC_V3_2/pooledHistDS.RData  
  inflating: PBMC_V3_2/PredEvalData.RDS  
  inflating: PBMC_V3_2/Stats.RData   
--2020-07-02 16:09:1

In [6]:
#Check that download worked
!cd figureData && ls -l && cd PBMC_V3_2 && ls -l

total 24
drwxr-xr-x 2 root root 4096 Jul  1 20:29 EVALPBMC
drwxr-xr-x 2 root root 4096 Jul  1 22:18 PBMC_NG
drwxr-xr-x 2 root root 4096 Jul  1 22:37 PBMC_NG_2
drwxr-xr-x 2 root root 4096 Jul  1 22:52 PBMC_V2
drwxr-xr-x 2 root root 4096 Jul  1 21:27 PBMC_V3
drwxr-xr-x 2 root root 4096 Jul  1 21:43 PBMC_V3_2
total 1628004
-rw-r--r-- 1 root root 352633388 Jun 30 23:44 Bug_100.RData
-rw-r--r-- 1 root root 117036831 Jun 30 23:19 Bug_10.RData
-rw-r--r-- 1 root root 187888424 Jun 30 23:22 Bug_20.RData
-rw-r--r-- 1 root root 267712541 Jun 30 23:27 Bug_40.RData
-rw-r--r-- 1 root root  65984760 Jun 30 23:17 Bug_5.RData
-rw-r--r-- 1 root root 311852745 Jun 30 23:33 Bug_60.RData
-rw-r--r-- 1 root root 335390002 Jun 30 23:40 Bug_80.RData
-rw-r--r-- 1 root root      1065 Jul  1 21:24 ds_summary.txt
-rw-r--r-- 1 root root    256079 Jul  1 14:34 pooledHistDS.RData
-rw-r--r-- 1 root root    518870 Jul  1 14:35 pooledHist.RData
-rw-r--r-- 1 root root  25673618 Jul  1 21:43 PredEvalData.RDS
-rw-r--r-- 1 
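
The same check can be scripted. The sketch below is a hypothetical helper, not part of GRNP_2020, that reports which dataset directories are absent under `figureData`. The dataset names are taken from the download cell above. `PBMC_V3_3`, which step 3 loads, is not among the archives fetched in that cell, so it is added to the check to surface the gap early.

```python
from pathlib import Path

# Archives fetched by the download cell
DOWNLOADED = ["PBMC_V3_2", "PBMC_V3", "PBMC_NG", "PBMC_NG_2", "PBMC_V2", "EVALPBMC"]


def missing_datasets(root, names):
    """Return the names that have no directory under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]


# Step 3 also expects PBMC_V3_3, so include it in the check
print(missing_datasets("figureData", DOWNLOADED + ["PBMC_V3_3"]))
```

An empty list means every expected directory is present; any name printed needs to be downloaded before the processing cell is run.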

**2. Prepare the R environment**

In [7]:
#switch to R mode
%reload_ext rpy2.ipython


  from pandas.core.index import Index as PandasIndex


In [8]:
%%R
#install the R packages
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependencies ‘bitops’, ‘chron’, ‘data.table’, ‘RCurl’, ‘XML’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/bitops_1.0-6.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 8734 bytes


**3. Process and save the data**

Here we generate the pooled CU histograms, both from full datasets and downsampled ones.

In [10]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [12]:
%%R
#Process and filter the BUG file
source(paste0(sourcePath, "ButterflyHelpers.R"))

#generate pooled histograms (for fig 3C)

loadBug("PBMC_V3")
generatePooledHistogramDS("PBMC_V3")
generatePooledHistogram("PBMC_V3") #these are not really used, but generating them takes little extra effort
rmBug("PBMC_V3")

loadBug("PBMC_V3_2")
generatePooledHistogramDS("PBMC_V3_2")
generatePooledHistogram("PBMC_V3_2")
rmBug("PBMC_V3_2")

loadBug("PBMC_V3_3")
generatePooledHistogramDS("PBMC_V3_3")
generatePooledHistogram("PBMC_V3_3")
#also downsample PBMC_V3_3 20 times to enable comparison with sampling noise (fig S20)
PBMC_V3_3_ds10_20Times = downSampleBUGNTimes(getBug("PBMC_V3_3", 1), 0.1, 20)
save(PBMC_V3_3_ds10_20Times, file=paste0(figure_data_path, "PBMC_V3_3_ds10_20Times.RData"))
rmBug("PBMC_V3_3")

loadBug("PBMC_NG")
generatePooledHistogramDS("PBMC_NG")
generatePooledHistogram("PBMC_NG")
rmBug("PBMC_NG")

loadBug("PBMC_NG_2")
generatePooledHistogramDS("PBMC_NG_2")
generatePooledHistogram("PBMC_NG_2")
rmBug("PBMC_NG_2")

loadBug("PBMC_V2")
generatePooledHistogramDS("PBMC_V2")
generatePooledHistogram("PBMC_V2")
rmBug("PBMC_V2")

loadBug("EVALPBMC")
generatePooledHistogramDS("EVALPBMC")
generatePooledHistogram("EVALPBMC")
rmBug("EVALPBMC")

loadBug("PBMC_V3_3", 1)
rmBug("PBMC_V3_3")



In [None]:
!cd figureData && ls -l 

total 208752
-rw-r--r-- 1 root root 37523302 Jul  1 20:41 Bug_100.RData
-rw-r--r-- 1 root root 17301806 Jul  1 20:39 Bug_10.RData
-rw-r--r-- 1 root root 23452646 Jul  1 20:39 Bug_20.RData
-rw-r--r-- 1 root root 25293705 Jul  1 20:40 Bug_25.RData
-rw-r--r-- 1 root root 29076462 Jul  1 20:40 Bug_40.RData
-rw-r--r-- 1 root root 11231802 Jul  1 20:39 Bug_5.RData
-rw-r--r-- 1 root root 32628714 Jul  1 20:41 Bug_60.RData
-rw-r--r-- 1 root root 35471380 Jul  1 20:41 Bug_80.RData
-rw-r--r-- 1 root root     1013 Jul  1 21:09 ds_summary.txt
-rw-r--r-- 1 root root  1760913 Jul  1 20:42 Stats.RData
Dataset: EVAL

totUMIs: 3604133
totCells: 1555
totCounts: 29913038
countsPerUMI: 8.29964876434915
UMIsPerCell: 2317.77041800643
countsPerCell: 19236.6803858521
totFracOnes: 0.276914864129598
FracMolWithUMIDistToNeighborH: 134, 739, 345, 11, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 242, 1158, 584, 16, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.109031733116355, 0.601301871440195, 0.28071