**Processing the BUG files into files ready for use in R**
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process the data
4. Generate statistics for the dataset

**1. Clone the code repo and download data to process**

In [11]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 172, done.
remote: Counting objects: 100% (172/172), done.
remote: Compressing objects: 100% (120/120), done.
remote: Total 172 (delta 68), reused 115 (delta 34), pack-reused 0
Receiving objects: 100% (172/172), 7.10 MiB | 5.51 MiB/s, done.
Resolving deltas: 100% (68/68), done.


In [2]:
#download BUG data from Zenodo
!mkdir -p data
!cd data && wget https://zenodo.org/record/3911637/files/MRET2.zip?download=1 && unzip 'MRET2.zip?download=1' && rm 'MRET2.zip?download=1'

--2020-06-28 16:40:29--  https://zenodo.org/record/3911637/files/MRET2.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71831345 (69M) [application/octet-stream]
Saving to: ‘MRET2.zip?download=1’


2020-06-28 16:41:17 (1.50 MB/s) - ‘MRET2.zip?download=1’ saved [71831345/71831345]

Archive:  MRET2.zip?download=1
   creating: MRET2/
   creating: MRET2/bus_output/
  inflating: MRET2/bus_output/bug.txt  
  inflating: MRET2/bus_output/coll.genes.txt  
  inflating: MRET2/bus_output/transcripts_to_genes.txt  


In [3]:
#Check that download worked
!cd data && ls -l && cd MRET2/bus_output && ls -l

total 4
drwxr-xr-x 3 root root 4096 Jun 28 12:42 MRET2
total 300376
-rw-r--r-- 1 root root 300813629 Mar 16 12:47 bug.txt
-rw-r--r-- 1 root root    779656 Mar 16 11:26 coll.genes.txt
-rw-r--r-- 1 root root   5983758 Mar 16 09:57 transcripts_to_genes.txt
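
The same check can be scripted so that a failed or partial download is caught before the R steps run. The following is a minimal sketch, not part of the original notebook: the function name `missing_bus_files` is hypothetical, and the expected file names are taken from the unzip listing above.

```python
from pathlib import Path

# File names copied from the unzip listing above.
EXPECTED_FILES = ("bug.txt", "coll.genes.txt", "transcripts_to_genes.txt")


def missing_bus_files(bus_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in bus_dir.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    bus_dir = Path(bus_dir)
    missing = []
    for name in expected:
        path = bus_dir / name
        # A zero-byte file usually indicates an interrupted download or unzip.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path used by the download cell above.
    problems = missing_bus_files("data/MRET2/bus_output")
    print("missing or empty:", problems)
```

An empty list means all three files are present and non-empty.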


**2. Prepare the R environment**

In [4]:
#load the rpy2 extension, which enables the %%R cell magic
%reload_ext rpy2.ipython


  from pandas.core.index import Index as PandasIndex


In [14]:
%%R
#install the R packages
sourcePath = "GRNP_2020/NotebookAdaptedRCode/"
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/qdapTools_1.3.5.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 36880 bytes (36 KB)


**3. Process the data**

Here we discard multi-mapped UMIs and all UMIs belonging to cells with fewer than 200 UMIs. We also precalculate statistics such as gene expression and the fraction of single-copy molecules, and save them as stats for later use when generating figures. Finally, we generate down-sampled BUGs at the fractions 0.05, 0.1, 0.2, 0.4, 0.6, 0.8 and 1.
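
The filtering described above can be illustrated with a small Python sketch. It is not the code in `BUGProcessingHelpers.R`. It assumes each BUG record is a `(barcode, umi, genes, count)` tuple, that a multi-mapped record lists several genes separated by commas, and that the 200-UMI threshold is applied after the multi-mapped records are removed. The function names are hypothetical.

```python
from collections import Counter

MIN_UMIS_PER_CELL = 200  # threshold stated in the text


def filter_bug(records, min_umis=MIN_UMIS_PER_CELL):
    """Drop multi-mapped records, then drop cells with too few UMIs.

    ASSUMPTION: each record is (barcode, umi, genes, count) and `genes`
    is a comma-separated string; more than one entry means multi-mapped.
    Returns (kept_records, fraction_of_records_that_were_multi_mapped).
    """
    records = list(records)
    unique = [r for r in records if "," not in r[2]]
    frac_multi = 1 - len(unique) / len(records) if records else 0.0

    # Count the remaining UMI records per cell barcode.
    umis_per_cell = Counter(r[0] for r in unique)
    kept = [r for r in unique if umis_per_cell[r[0]] >= min_umis]
    return kept, frac_multi


def fraction_single_copy(records):
    """Fraction of records whose read count is exactly 1."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r[3] == 1) / len(records)
```

With a toy input and a lowered threshold, `filter_bug(records, min_umis=2)` keeps only the cells that still have at least two uniquely mapped UMIs.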

In [7]:
#create output directory
!mkdir -p figureData

In [12]:
%%R
#Process and filter the BUG file
source(paste0(sourcePath, "BUGProcessingHelpers.R"))
createStandardBugsData(paste0(dataPath,"MRET2/"), "MRET2", c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1))



[1] "Generating data for MRET2"
[1] "Reading BUG from data/MRET2/ ..."
[1] "Filtering multi-mapped reads..."
[1] "Fraction multi-mapped reads: 0.177481439356937"
[1] "Converting genes..."
[1] "Done"
[1] "Down-sampling in total 7 bugs:"
[1] "1: Down-sampling to 0.05"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "2: Down-sampling to 0.1"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "3: Down-sampling to 0.2"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "4: Down-sampling to 0.4"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "5: Down-sampling to 0.6"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "6: Down-sampling to 0.8"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "7: Down-sampling to 1"
[1] "Done"


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "Saving BUG..."
[1] "Saving BUGs..."
[1] "Saving stats..."
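
To make the down-sampling step concrete, here is one possible way to thin a BUG to a given fraction of its reads. The notebook does not show how `createStandardBugsData` does this, so the approach below (keep each read independently with probability `fraction`, then drop UMIs left with zero reads) is an assumption, as are the record layout `(barcode, umi, gene, count)` and the function name `downsample_bug`.

```python
import random


def downsample_bug(records, fraction, seed=0):
    """Thin read counts to roughly `fraction` of the original.

    ASSUMPTION: binomial thinning, i.e. each read is kept independently
    with probability `fraction`. The original R code may work differently.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    records = list(records)
    if fraction == 1:
        return records  # nothing to sample

    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    sampled = []
    for barcode, umi, gene, count in records:
        kept = sum(rng.random() < fraction for _ in range(count))
        if kept > 0:  # a UMI with no remaining reads disappears
            sampled.append((barcode, umi, gene, kept))
    return sampled


if __name__ == "__main__":
    toy = [("A", "u1", "g1", 10), ("A", "u2", "g2", 1), ("B", "u1", "g1", 4)]
    for frac in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1):  # fractions used above
        print(frac, len(downsample_bug(toy, frac)))
```

Under this scheme, single-copy UMIs are the first to vanish at low fractions, so the number of observed UMIs falls along with the read depth.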


**4. Generate statistics for the dataset**

Here we create a summary file, `ds_summary.txt`, with various statistics for the dataset. This step is rather time-consuming.
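
Several entries in the summary listing shown further down in this section are simple ratios of the reported totals, and the `...Frac` rows are the corresponding count rows divided by their sum. The sketch below recomputes them from the printed numbers as a sanity check. The function names are hypothetical and this is not how `GenBugSummary.R` is implemented.

```python
def per_unit_ratios(tot_umis, tot_cells, tot_counts):
    """Recompute the per-UMI and per-cell averages from the totals."""
    return {
        "countsPerUMI": tot_counts / tot_umis,
        "UMIsPerCell": tot_umis / tot_cells,
        "countsPerCell": tot_counts / tot_cells,
    }


def normalize_histogram(counts):
    """Turn a histogram of counts into fractions that sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Totals copied from the ds_summary.txt listing in this section.
ratios = per_unit_ratios(tot_umis=3604133, tot_cells=1555, tot_counts=29913038)

# Count row FracMolWithUMIDistToNeighborL from the same listing.
l_counts = [247, 1175, 560, 18, 0, 0, 0, 0, 0, 0]
l_fracs = normalize_histogram(l_counts)

if __name__ == "__main__":
    for name, value in ratios.items():
        print(name, value)
    print("L fractions:", l_fracs)
```

The recomputed values agree with the `countsPerUMI`, `UMIsPerCell`, `countsPerCell` and `FracMolWithUMIDistToNeighborLFrac` lines of the listing.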

In [15]:
%%R
source(paste0(sourcePath, "GenBugSummary.R"))
genBugSummary("MRET2", "Vmn1r13", "Ubb", 10)

[1] "Will process 1229 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1]

In [17]:
!cd figureData/MRET2 && ls -l && more ds_summary.txt

total 220500
-rw-r--r-- 1 root root  37523364 Jun 28 17:00 Bug.RData
-rw-r--r-- 1 root root 186708411 Jun 28 17:01 DsBugs.RData
-rw-r--r-- 1 root root      1005 Jun 28 17:36 ds_summary.txt
-rw-r--r-- 1 root root   1549191 Jun 28 17:01 Stats.RData
Dataset: MRET2

totUMIs: 3604133
totCells: 1555
totCounts: 29913038
countsPerUMI: 8.29964876434915
UMIsPerCell: 2317.77041800643
countsPerCell: 19236.6803858521
totFracOnes: 0.276914864129598
FracMolWithUMIDistToNeighborH: 134, 739, 345, 11, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 247, 1175, 560, 18, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.109031733116355, 0.601301871440195, 0.280716029292107, 0.00895036615134255, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborLFrac: 0.1235, 0.5875, 0.28, 0.009, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor1cpy: 238, 1233, 520, 9, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor2cpy: 240, 1170, 577, 13, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor>=3cpy: 212, 1253, 529, 6, 0, 0, 0, 0, 0, 0
F