<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**EVAL dataset: Processes the BUG files into files prepared for use in R**

This notebook processes the output of the FASTQ file processing for this dataset. The data produced here has already been generated and is downloaded by the figure generation code. The purpose of this step is to prepare the data for figure generation by filtering it and producing down-sampled datasets in addition to the original one.

Steps:
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process the data
4. Generate statistics for the dataset

**1. Clone the code repo and download data to process**

In [12]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 133, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (98/98), done.
remote: Total 1016 (delta 85), reused 66 (delta 35), pack-reused 883
Receiving objects: 100% (1016/1016), 7.39 MiB | 1.27 MiB/s, done.
Resolving deltas: 100% (639/639), done.


In [2]:
#download BUG data from Zenodo
!mkdir -p data
!cd data && wget https://zenodo.org/record/3924675/files/EVAL.zip?download=1 && unzip 'EVAL.zip?download=1' && rm 'EVAL.zip?download=1'

--2020-07-02 19:11:15--  https://zenodo.org/record/3924675/files/EVAL.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71831345 (69M) [application/octet-stream]
Saving to: ‘EVAL.zip?download=1’


2020-07-02 19:11:16 (100 MB/s) - ‘EVAL.zip?download=1’ saved [71831345/71831345]

Archive:  EVAL.zip?download=1
   creating: EVAL/
   creating: EVAL/bus_output/
  inflating: EVAL/bus_output/bug.txt  
  inflating: EVAL/bus_output/coll.genes.txt  
  inflating: EVAL/bus_output/transcripts_to_genes.txt  


In [3]:
#Check that download worked
!cd data && ls -l && cd EVAL/bus_output && ls -l

total 4
drwxr-xr-x 3 root root 4096 Jul  1 00:00 EVAL
total 300376
-rw-r--r-- 1 root root 300813629 Jun 29 14:01 bug.txt
-rw-r--r-- 1 root root    779656 Jun 29 13:58 coll.genes.txt
-rw-r--r-- 1 root root   5983758 Jun 29 13:02 transcripts_to_genes.txt
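
The same check can be done programmatically before the R processing starts. The following is a minimal sketch, not part of the repository: the helper name `missing_bus_files` is hypothetical, and the expected file names are simply those shown in the unzip listing above.

```python
from pathlib import Path

# File names taken from the unzip listing above; the helper is hypothetical.
EXPECTED_FILES = ("bug.txt", "coll.genes.txt", "transcripts_to_genes.txt")


def missing_bus_files(bus_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in bus_dir."""
    bus_dir = Path(bus_dir)
    missing = []
    for name in expected:
        path = bus_dir / name
        # Treat a zero-byte file as missing, since it usually means
        # an interrupted download or extraction.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_bus_files("data/EVAL/bus_output")` returns an empty list when all three files are present and non-empty.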


**2. Prepare the R environment**

In [5]:
#switch to R mode
%reload_ext rpy2.ipython




In [9]:
%%R
# Install the R packages
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")
install.packages("stringr")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependency ‘RCurl’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/RCurl_1.98-1.2.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 699583 bytes (683 KB)


**3. Process the data**

Here we discard multimapped UMIs, as well as all UMIs belonging to cells with fewer than 200 UMIs. We also precalculate per-gene statistics, such as gene expression and the fraction of single-copy molecules, and save them as stats (statistics) for later use when generating figures. Finally, we generate down-sampled BUGs.
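
To make the filtering rules concrete, here is a minimal Python sketch. It is not the code in `BUGProcessingHelpers.R`: the record layout `(barcode, umi, genes, count)`, the function names, and the order of the two filters are assumptions made for illustration. Only the threshold of 200 UMIs per cell comes from the text.

```python
from collections import Counter

MIN_UMIS_PER_CELL = 200  # threshold stated in the text


def filter_bug(records, min_umis=MIN_UMIS_PER_CELL):
    """Filter hypothetical BUG records of the form (barcode, umi, genes, count).

    `genes` is a tuple of gene names; more than one name means the
    molecule is multimapped. Multimapped records are dropped first,
    then cells with fewer than `min_umis` remaining UMIs (assumed order).
    """
    unique = [(bc, umi, genes[0], n)
              for bc, umi, genes, n in records if len(genes) == 1]
    umis_per_cell = Counter(bc for bc, _, _, _ in unique)
    return [r for r in unique if umis_per_cell[r[0]] >= min_umis]


def fraction_single_copy(filtered):
    """Fraction of molecules that were observed exactly once (count == 1)."""
    if not filtered:
        return 0.0
    return sum(1 for _, _, _, n in filtered if n == 1) / len(filtered)


def umis_per_gene(filtered):
    """Number of molecules (UMIs) per gene, a simple expression measure."""
    return Counter(gene for _, _, gene, _ in filtered)


def cpm(gene_counts):
    """Scale per-gene counts so that they sum to one million."""
    total = sum(gene_counts.values())
    if total == 0:
        return {}
    return {gene: n * 1e6 / total for gene, n in gene_counts.items()}
```

For example, `cpm(umis_per_gene(filter_bug(records)))` gives counts-per-million expression values for the molecules that survive both filters.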

In [15]:
#create output directory
!mkdir -p figureData

In [13]:
%%R
# First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [16]:
%%R
# Process and filter the BUG file
source(paste0(sourcePath, "BUGProcessingHelpers.R"))
createStandardBugsData(paste0(dataPath,"EVAL/"), "EVAL", c(0.05, 0.1, 0.2, 0.25, 0.4, 0.6, 0.8, 1))



[1] "Generating data for EVAL"
[1] "Reading BUG from data/EVAL/ ..."
[1] "Filtering multi-mapped reads..."
[1] "Fraction multi-mapped reads: 0.177481439356937"
[1] "Converting genes..."
[1] "Done"
[1] "Down-sampling in total 8 bugs:"
[1] "8: Down-sampling to 1"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "7: Down-sampling to 0.8"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "6: Down-sampling to 0.6"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "5: Down-sampling to 0.4"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "4: Down-sampling to 0.25"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "3: Down-sampling to 0.2"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "2: Down-sampling to 0.1"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "1: Down-sampling to 0.05"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "Done"
[1] "Saving stats..."


**4. Generate statistics for the dataset**

Here we create a file with various statistics for the dataset, which is used to generate Table S2. The file also contains some additional information about the dataset. Generating it may take several hours.
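
The field names written to `ds_summary.txt` (for example `FracMolWithUMIDistToNeighborH`), together with the `stringdist` package installed earlier, suggest that part of the summary is a histogram of each UMI's distance to its nearest neighboring UMI. That reading is an inference, not something the notebook states. Under that assumption, and using the Hamming distance with bin `d` stored at index `d - 1`, the computation could look like the following sketch. Both function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_histogram(umis, max_dist=10):
    """Histogram of the distance from each UMI to its closest other UMI.

    hist[d - 1] is the number of UMIs whose nearest neighbor is at
    Hamming distance d (assumed binning). The pairwise comparison is
    quadratic in the number of UMIs.
    """
    hist = [0] * max_dist
    for i, u in enumerate(umis):
        best = min((hamming(u, v) for j, v in enumerate(umis) if j != i),
                   default=None)
        if best is not None and 1 <= best <= max_dist:
            hist[best - 1] += 1
    return hist
```

Dividing each bin by the number of UMIs processed turns the histogram into fractions.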

In [None]:
%%R
source(paste0(sourcePath, "GenBugSummary.R"))
genBugSummary("EVAL", "Vmn1r13", "Ubb", 10)

[1] "Will process 1229 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1]

In [None]:
!cd figureData/EVAL && ls -l && more ds_summary.txt

total 208752
-rw-r--r-- 1 root root 37523302 Jul  1 20:41 Bug_100.RData
-rw-r--r-- 1 root root 17301806 Jul  1 20:39 Bug_10.RData
-rw-r--r-- 1 root root 23452646 Jul  1 20:39 Bug_20.RData
-rw-r--r-- 1 root root 25293705 Jul  1 20:40 Bug_25.RData
-rw-r--r-- 1 root root 29076462 Jul  1 20:40 Bug_40.RData
-rw-r--r-- 1 root root 11231802 Jul  1 20:39 Bug_5.RData
-rw-r--r-- 1 root root 32628714 Jul  1 20:41 Bug_60.RData
-rw-r--r-- 1 root root 35471380 Jul  1 20:41 Bug_80.RData
-rw-r--r-- 1 root root     1013 Jul  1 21:09 ds_summary.txt
-rw-r--r-- 1 root root  1760913 Jul  1 20:42 Stats.RData
Dataset: EVAL

totUMIs: 3604133
totCells: 1555
totCounts: 29913038
countsPerUMI: 8.29964876434915
UMIsPerCell: 2317.77041800643
countsPerCell: 19236.6803858521
totFracOnes: 0.276914864129598
FracMolWithUMIDistToNeighborH: 134, 739, 345, 11, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 242, 1158, 584, 16, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.109031733116355, 0.601301871440195, 0.28071
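
The derived fields in this summary follow directly from the totals, which gives a quick consistency check. The sketch below recomputes them from the values printed above. The variable names are illustrative, and reading the `...HFrac` field as the `...H` histogram divided by its sum is an assumption that the printed numbers happen to agree with.

```python
# Totals copied from ds_summary.txt above.
tot_umis = 3604133
tot_cells = 1555
tot_counts = 29913038

counts_per_umi = tot_counts / tot_umis    # reported as countsPerUMI
umis_per_cell = tot_umis / tot_cells      # reported as UMIsPerCell
counts_per_cell = tot_counts / tot_cells  # reported as countsPerCell

# Histogram copied from FracMolWithUMIDistToNeighborH; its sum (1229)
# matches the "Will process 1229 UMIs" message in the log.
hist_h = [134, 739, 345, 11, 0, 0, 0, 0, 0, 0]
hist_h_frac = [n / sum(hist_h) for n in hist_h]

print(counts_per_umi, umis_per_cell, counts_per_cell)
print(hist_h_frac[:2])
```

The recomputed values agree with `countsPerUMI`, `UMIsPerCell`, `countsPerCell`, and the first entries of `FracMolWithUMIDistToNeighborHFrac` in the file.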