<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVALPBMC_DS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Processes the BUG files into files prepared for use in R**
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process the data
4. Generate statistics for the dataset

**1. Clone the code repo and download data to process**

In [1]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 237, done.
remote: Counting objects: 100% (237/237), done.
remote: Compressing objects: 100% (173/173), done.
remote: Total 237 (delta 107), reused 134 (delta 45), pack-reused 0
Receiving objects: 100% (237/237), 7.13 MiB | 7.95 MiB/s, done.
Resolving deltas: 100% (107/107), done.


In [2]:
#download BUG data from Zenodo (mkdir -p does not fail if the directory already exists)
!mkdir -p data
!cd data && wget https://zenodo.org/record/3911637/files/EVALPBMC_DS.zip?download=1 && unzip 'EVALPBMC_DS.zip?download=1' && rm 'EVALPBMC_DS.zip?download=1'

--2020-06-28 19:44:36--  https://zenodo.org/record/3911637/files/EVALPBMC_DS.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123937246 (118M) [application/octet-stream]
Saving to: ‘EVALPBMC_DS.zip?download=1’


2020-06-28 19:44:38 (67.8 MB/s) - ‘EVALPBMC_DS.zip?download=1’ saved [123937246/123937246]

Archive:  EVALPBMC_DS.zip?download=1
   creating: EVALPBMC_DS/
   creating: EVALPBMC_DS/bus_output/
  inflating: EVALPBMC_DS/bus_output/bug.txt  
  inflating: EVALPBMC_DS/bus_output/coll.genes.txt  
  inflating: EVALPBMC_DS/bus_output/transcripts_to_genes.txt  


In [3]:
#Check that download worked
!cd data && ls -l && cd EVALPBMC_DS/bus_output && ls -l

total 4
drwxr-xr-x 3 root root 4096 Jun 28 12:48 EVALPBMC_DS
total 498464
-rw-r--r-- 1 root root 501340444 Apr 12 19:23 bug.txt
-rw-r--r-- 1 root root    738211 Apr 12 19:23 coll.genes.txt
-rw-r--r-- 1 root root   8335642 Apr 12 18:30 transcripts_to_genes.txt


**2. Prepare the R environment**

In [4]:
#load the rpy2 extension so that cells starting with %%R are run as R code
%reload_ext rpy2.ipython


  from pandas.core.index import Index as PandasIndex


In [5]:
%%R
#install the R packages (the %%R cell magic has to be the first line of the cell)
sourcePath = "GRNP_2020/NotebookAdaptedRCode/"
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependencies ‘bitops’, ‘chron’, ‘data.table’, ‘RCurl’, ‘XML’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/bitops_1.0-6.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 8734 bytes

R[write to console]: ==========================
R[writ

**3. Process the data**

Here we discard multimapped UMIs and all UMIs belonging to cells with fewer than 200 UMIs. We also precalculate gene expression, the fraction of single-copy molecules, and similar quantities, and save them as stats (statistics) for later use when generating figures. Finally, we generate down-sampled BUGs.
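
The two filtering rules and the single-copy fraction can be illustrated on a toy table. This is only a sketch in Python (the notebook's kernel language), not the code in `BUGProcessingHelpers.R`. The record layout, the comma-separated gene field used to mark multimapping, the gene name `RPS10P1`, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Toy BUG records: (cell barcode, UMI, gene field, read count).
# Assumption: a multimapped record lists several genes separated by commas.
records = [
    ("AAAC", "GGTA", "RPS10", 3),
    ("AAAC", "CCTA", "RPS10,RPS10P1", 1),  # hypothetical multimapped record
    ("AAAC", "TTGA", "FGF23", 1),
    ("CCCG", "ACGT", "RPS10", 2),
]

def filter_bug(records, min_umis_per_cell=200):
    """Drop multimapped records, then drop cells with too few UMIs."""
    unique = [r for r in records if "," not in r[2]]
    umis_per_cell = Counter(cell for cell, _, _, _ in unique)
    return [r for r in unique if umis_per_cell[r[0]] >= min_umis_per_cell]

def fraction_single_copy(records):
    """Fraction of molecules (records) observed with exactly one read."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[3] == 1) / len(records)

# The toy data is tiny, so the threshold is lowered from 200 to 2 here.
kept = filter_bug(records, min_umis_per_cell=2)
print(kept)
print(fraction_single_copy(kept))
```

In this sketch the UMIs per cell are counted after the multimapped records are removed. Whether the R helper applies the two filters in the same order is not shown in this notebook.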

In [6]:
#create output directory (mkdir -p does not fail if the directory already exists)
!mkdir -p figureData

In [7]:
%%R
#Process and filter the BUG file (the %%R cell magic has to be the first line of the cell)
source(paste0(sourcePath, "BUGProcessingHelpers.R"))
createStandardBugsData(paste0(dataPath,"EVALPBMC_DS/"), "EVALPBMC_DS", c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1))



R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


R[write to console]: 
Attaching package: ‘qdapTools’


R[write to console]: The following object is masked from ‘package:dplyr’:

    id




[1] "Generating data for EVALPBMC_DS"
[1] "Reading BUG from data/EVALPBMC_DS/ ..."
[1] "Filtering multi-mapped reads..."
[1] "Fraction multi-mapped reads: 0.209591228591221"
[1] "Converting genes..."
[1] "Done"
[1] "Down-sampling in total 7 bugs:"
[1] "1: Down-sampling to 0.05"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "2: Down-sampling to 0.1"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "3: Down-sampling to 0.2"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "4: Down-sampling to 0.4"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "5: Down-sampling to 0.6"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "6: Down-sampling to 0.8"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "7: Down-sampling to 1"
[1] "Done"


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "Saving BUG..."
[1] "Saving BUGs..."
[1] "Saving stats..."
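
The log above shows one down-sampled BUG per fraction in `c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1)`. How `createStandardBugsData` draws the reads is not shown in this notebook. The sketch below uses one common approach, binomial thinning, in which every read is kept independently with probability equal to the fraction. The function name and the example counts are made up for illustration.

```python
import random

def downsample_counts(counts, fraction, seed=0):
    """Keep each read independently with probability `fraction`.

    `counts` holds one read count per molecule; molecules that lose
    all of their reads drop out of the result.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    thinned = (sum(rng.random() < fraction for _ in range(c)) for c in counts)
    return [c for c in thinned if c > 0]

counts = [30, 12, 1, 1, 5, 1, 2, 40]  # made-up reads per molecule
for fraction in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1):
    ds = downsample_counts(counts, fraction)
    # Print the fraction, the molecules that survive, and the reads that remain.
    print(fraction, len(ds), sum(ds))
```

With a fraction of 1 the counts are returned unchanged. At low fractions, single-copy molecules are the first to disappear, which is why the number of detected molecules falls along with the number of reads.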


**4. Generate statistics for the dataset**

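Here we create a file, ds_summary.txt, with various statistics for the dataset. This processing is rather time-consuming: in the run recorded here, ds_summary.txt was written about 25 minutes after Stats.RData.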
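
Several fields in ds_summary.txt are named `FracMolWithUMIDistToNeighbor...` and hold ten-bin histograms. Going only by the field names and the installed `stringdist` package, one plausible reading is that they count molecules by the Hamming distance from their UMI to the closest other UMI. That interpretation is an assumption, and `GenBugSummary.R` is not shown here. The sketch below computes such a histogram for a handful of made-up UMIs.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor_histogram(umis):
    """Histogram of each UMI's Hamming distance to its closest other UMI."""
    hist = Counter()
    for i, u in enumerate(umis):
        nearest = min(hamming(u, v) for j, v in enumerate(umis) if j != i)
        hist[nearest] += 1
    return hist

umis = ["AAAA", "AAAT", "CCGG", "CCGA", "TTTT"]  # made-up UMIs
hist = nearest_neighbor_histogram(umis)
print(sorted(hist.items()))
```

The all-pairs comparison is quadratic in the number of UMIs. That may be one reason the log reports down-sampling to 2000 UMIs before this kind of statistic is computed, although the notebook does not say so.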

In [10]:
%%R
source(paste0(sourcePath, "GenBugSummary.R"))
genBugSummary("EVALPBMC_DS", "FGF23", "RPS10", 10)

[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 10

In [11]:
!cd figureData/EVALPBMC_DS && ls -l && more ds_summary.txt

total 249040
-rw-r--r-- 1 root root  39106204 Jun 28 19:51 Bug.RData
-rw-r--r-- 1 root root 214485573 Jun 28 19:53 DsBugs.RData
-rw-r--r-- 1 root root       952 Jun 28 20:18 ds_summary.txt
-rw-r--r-- 1 root root   1413974 Jun 28 19:53 Stats.RData
Dataset: EVALPBMC_DS

totUMIs: 4015392
totCells: 6862
totCounts: 87378182
countsPerUMI: 21.7608099034914
UMIsPerCell: 585.163509180997
countsPerCell: 12733.6318857476
totFracOnes: 0.199118043767582
FracMolWithUMIDistToNeighborH: 910, 1007, 83, 0, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 799, 1061, 139, 1, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.455, 0.5035, 0.0415, 0, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborLFrac: 0.3995, 0.5305, 0.0695, 5e-04, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor1cpy: 993, 882, 122, 3, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor2cpy: 918, 965, 116, 1, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor>=3cpy: 911, 992, 97, 0, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor1cpyFrac: 0.4965, 0.441,
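
The per-UMI and per-cell fields in the listing above are consistent with simple ratios of the three totals, and each `...Frac` line is consistent with the matching histogram divided by its sum. The short check below recomputes them from the printed values. The variable names are chosen for this example, and the formulas are inferred from the numbers rather than taken from `GenBugSummary.R`.

```python
# Values copied from the ds_summary.txt listing above.
tot_umis = 4015392
tot_cells = 6862
tot_counts = 87378182

counts_per_umi = tot_counts / tot_umis    # reported: 21.7608099034914
umis_per_cell = tot_umis / tot_cells      # reported: 585.163509180997
counts_per_cell = tot_counts / tot_cells  # reported: 12733.6318857476

# Histogram counts from the FracMolWithUMIDistToNeighborH line.
hist_h = [910, 1007, 83, 0, 0, 0, 0, 0, 0, 0]
# Reported in FracMolWithUMIDistToNeighborHFrac: 0.455, 0.5035, 0.0415, 0, ...
frac_h = [n / sum(hist_h) for n in hist_h]

print(counts_per_umi, umis_per_cell, counts_per_cell)
print(frac_h[:3])
```

Each histogram shown in full sums to 2000, which matches the "Will process 2000 UMIs" lines in the log.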