<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Processes the BUG files into files prepared for use in R**
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process the data
4. Generate statistics for the dataset

**1. Clone the code repo and download data to process**

In [1]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 285, done.
remote: Counting objects: 100% (285/285), done.
remote: Compressing objects: 100% (218/218), done.
remote: Total 285 (delta 141), reused 138 (delta 48), pack-reused 0
Receiving objects: 100% (285/285), 7.15 MiB | 5.71 MiB/s, done.
Resolving deltas: 100% (141/141), done.


In [2]:
#Download the BUG data from Zenodo (-p lets the cell be rerun if data/ already exists)
!mkdir -p data
!cd data && wget https://zenodo.org/record/3911637/files/PBMC_V3_3.zip?download=1 && unzip 'PBMC_V3_3.zip?download=1' && rm 'PBMC_V3_3.zip?download=1'

--2020-06-28 21:15:23--  https://zenodo.org/record/3911637/files/PBMC_V3_3.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 258556329 (247M) [application/octet-stream]
Saving to: ‘PBMC_V3_3.zip?download=1’


2020-06-28 21:16:02 (6.70 MB/s) - ‘PBMC_V3_3.zip?download=1’ saved [258556329/258556329]

Archive:  PBMC_V3_3.zip?download=1
   creating: PBMC_V3_3/
   creating: PBMC_V3_3/bus_output/
  inflating: PBMC_V3_3/bus_output/bug.txt  
  inflating: PBMC_V3_3/bus_output/coll.genes.txt  
  inflating: PBMC_V3_3/bus_output/transcripts_to_genes.txt  


In [3]:
#Check that download worked
!cd data && ls -l && cd PBMC_V3_3/bus_output && ls -l

total 4
drwxr-xr-x 3 root root 4096 Jun 28 12:48 PBMC_V3_3
total 1232920
-rw-r--r-- 1 root root 1255736268 May 27 09:35 bug.txt
-rw-r--r-- 1 root root     779656 May 27 09:36 coll.genes.txt
-rw-r--r-- 1 root root    5983758 May 27 08:33 transcripts_to_genes.txt
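The manual `ls -l` check above can also be done programmatically. The following is a minimal sketch, not part of the GRNP_2020 repository: the helper name `missing_bus_files` is hypothetical, and the file names are simply the three entries from the unzip listing above.

```python
from pathlib import Path

# File names taken from the unzip listing above.
EXPECTED_FILES = ("bug.txt", "coll.genes.txt", "transcripts_to_genes.txt")


def missing_bus_files(bus_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in bus_dir.

    Hypothetical helper for illustration; not part of GRNP_2020.
    """
    bus_dir = Path(bus_dir)
    missing = []
    for name in expected:
        path = bus_dir / name
        # A zero-byte file usually indicates an interrupted download or unzip.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_bus_files("data/PBMC_V3_3/bus_output")
    print("missing or empty:", problems if problems else "none")
```

An empty result means all three files are present and non-empty, so the processing step can proceed.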


**2. Prepare the R environment**

In [4]:
#Load the rpy2 extension so that cells starting with %%R are run as R code
%reload_ext rpy2.ipython


  from pandas.core.index import Index as PandasIndex


In [5]:
%%R
#Install the R packages (the %%R cell magic must be the first line of the cell)
sourcePath = "GRNP_2020/NotebookAdaptedRCode/"
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependencies ‘bitops’, ‘chron’, ‘data.table’, ‘RCurl’, ‘XML’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/bitops_1.0-6.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 8734 bytes

R[write to console]: =

**3. Process the data**

Here we discard multimapped UMIs and all UMIs belonging to cells with fewer than 200 UMIs. We also precalculate statistics (stats), such as gene expression and the fraction of single-copy molecules, and save them for later use when generating figures. Finally, we generate down-sampled BUGs at the fractions 0.05, 0.1, 0.2, 0.4, 0.6, 0.8 and 1.
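The two filters described above can be sketched in a few lines of Python. This is an illustration only, not the code in `BUGProcessingHelpers.R`: it assumes each BUG record is a `(barcode, umi, genes, count)` tuple in which `genes` is a comma-separated string of gene ids, and it assumes the multimapping filter runs before the per-cell threshold. The function name `filter_bug` is hypothetical.

```python
from collections import Counter


def filter_bug(records, min_umis_per_cell=200):
    """Drop multimapped records, then drop cells with too few UMIs.

    records: iterable of (barcode, umi, genes, count) tuples, where `genes`
    is a comma-separated string of gene ids (assumed layout).
    """
    # Keep only records that map to exactly one gene.
    unique = [r for r in records if "," not in r[2]]

    # After filtering, each record is treated as one UMI seen in one cell.
    umis_per_cell = Counter(barcode for barcode, _, _, _ in unique)

    return [r for r in unique if umis_per_cell[r[0]] >= min_umis_per_cell]


# Toy data with a threshold of 2 instead of 200 to keep the example small.
records = [
    ("AAAC", "GGT", "g1", 3),
    ("AAAC", "GGA", "g1,g2", 1),  # multimapped, dropped
    ("AAAC", "TTT", "g3", 1),
    ("CCCG", "ACG", "g1", 2),     # cell with a single UMI, dropped
]
kept = filter_bug(records, min_umis_per_cell=2)
```

In this toy example only the two uniquely mapped records of cell `AAAC` survive.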

In [6]:
#Create the output directory (-p lets the cell be rerun if it already exists)
!mkdir -p figureData

In [7]:
%%R
#Process and filter the BUG file (the %%R cell magic must be the first line of the cell)
source(paste0(sourcePath, "BUGProcessingHelpers.R"))
createStandardBugsData(paste0(dataPath,"PBMC_V3_3/"), "PBMC_V3_3", c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1))



R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


R[write to console]: 
Attaching package: ‘qdapTools’


R[write to console]: The following object is masked from ‘package:dplyr’:

    id




[1] "Generating data for PBMC_V3_3"
[1] "Reading BUG from data/PBMC_V3_3/ ..."
[1] "Filtering multi-mapped reads..."
[1] "Fraction multi-mapped reads: 0.205924086085551"
[1] "Converting genes..."
[1] "Done"
[1] "Down-sampling in total 7 bugs:"
[1] "1: Down-sampling to 0.05"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "2: Down-sampling to 0.1"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "3: Down-sampling to 0.2"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "4: Down-sampling to 0.4"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "5: Down-sampling to 0.6"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "6: Down-sampling to 0.8"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "7: Down-sampling to 1"
[1] "Done"


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)

R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "Saving BUG..."
[1] "Saving BUGs..."
[1] "Saving stats..."
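One way to down-sample a BUG to a given fraction is to draw that fraction of the reads uniformly without replacement and recount how many reads each UMI retains. The sketch below illustrates this idea under that assumption; it is not taken from `BUGProcessingHelpers.R`, which may sample differently, and the function name `downsample_counts` is hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def downsample_counts(counts, fraction, seed=0):
    """Sample round(fraction * total) reads without replacement.

    counts: list of per-UMI read counts. Returns a list of the same length
    holding the down-sampled count of each UMI (zero means the UMI was lost).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    cumulative = list(accumulate(counts))
    total = cumulative[-1] if cumulative else 0
    n_keep = round(fraction * total)

    rng = random.Random(seed)
    sampled = [0] * len(counts)
    # Each read has an index in [0, total); map it back to the UMI it came from.
    for read_index in rng.sample(range(total), n_keep):
        sampled[bisect_right(cumulative, read_index)] += 1
    return sampled


counts = [3, 1, 2, 5, 1]
half = downsample_counts(counts, 0.5)
```

UMIs whose down-sampled count is zero would be removed from the down-sampled BUG, which is why the number of observed UMIs shrinks along with the number of reads.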


**4. Generate statistics for the dataset**

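Here we create a summary file, ds_summary.txt, with dataset-level statistics such as the total numbers of UMIs, cells, and counts. This step is time-consuming: in the run shown here, ds_summary.txt was written almost an hour after Stats.RData (22:28 versus 21:32 in the file listing at the end of this section).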
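The ratios in ds_summary.txt are consistent with simple quotients of the totals: 50724058 / 24224784 ≈ 2.0939 (countsPerUMI), 24224784 / 12521 ≈ 1934.73 (UMIsPerCell), and 50724058 / 12521 ≈ 4051.12 (countsPerCell). The sketch below computes these quantities from a list of BUG records. It is an illustration, not the code in `GenBugSummary.R`; the record layout `(barcode, umi, gene, count)` and the reading of totFracOnes as the fraction of UMIs with exactly one read are assumptions.

```python
def summarize_bug(records):
    """Compute dataset-level totals and ratios from BUG records.

    records: iterable of (barcode, umi, gene, count) tuples, one per UMI
    (assumed layout).
    """
    records = list(records)
    if not records:
        raise ValueError("no records to summarize")

    tot_umis = len(records)
    tot_cells = len({barcode for barcode, _, _, _ in records})
    tot_counts = sum(count for _, _, _, count in records)
    # Assumption: totFracOnes is the share of UMIs observed exactly once.
    single_copy = sum(1 for _, _, _, count in records if count == 1)

    return {
        "totUMIs": tot_umis,
        "totCells": tot_cells,
        "totCounts": tot_counts,
        "countsPerUMI": tot_counts / tot_umis,
        "UMIsPerCell": tot_umis / tot_cells,
        "countsPerCell": tot_counts / tot_cells,
        "totFracOnes": single_copy / tot_umis,
    }


toy = [
    ("AAAC", "GGT", "g1", 3),
    ("AAAC", "TTT", "g3", 1),
    ("CCCG", "ACG", "g1", 2),
    ("CCCG", "ACT", "g2", 2),
]
summary = summarize_bug(toy)
```

For the toy data this gives 4 UMIs, 2 cells, and 8 counts, so countsPerUMI is 2.0 and totFracOnes is 0.25.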

In [8]:
%%R
source(paste0(sourcePath, "GenBugSummary.R"))
genBugSummary("PBMC_V3_3", "Vmn1r13", "Ubb", 10)

[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 10
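The `FracMolWithUMIDistToNeighbor...` fields in ds_summary.txt each sum to 2000, which matches the "Will process 2000 UMIs" messages above, so they appear to be histograms over a sample of 2000 UMIs. A plausible reading is that each entry counts UMIs by the Hamming distance to their nearest neighboring UMI. The sketch below shows that computation; it is an assumption about what `GenBugSummary.R` measures, not a transcription of it, and the function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_distances(umis):
    """For each UMI, the Hamming distance to the closest other UMI in the list."""
    if len(umis) < 2:
        raise ValueError("need at least two UMIs")
    distances = []
    for i, umi in enumerate(umis):
        # Compare against every other entry; quadratic, fine for a few thousand UMIs.
        distances.append(
            min(hamming(umi, other) for j, other in enumerate(umis) if j != i)
        )
    return distances


umis = ["AAAA", "AAAT", "GGGG", "GGCC"]
histogram = Counter(nearest_neighbor_distances(umis))
```

Here `AAAA` and `AAAT` are one mismatch apart, and `GGGG` and `GGCC` are two apart, so the histogram holds two UMIs at distance 1 and two at distance 2.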

In [11]:
!cd figureData/PBMC_V3_3 && ls -l && cat ds_summary.txt

total 1154744
-rw-r--r-- 1 root root 239129923 Jun 28 21:25 Bug.RData
-rw-r--r-- 1 root root 941503384 Jun 28 21:32 DsBugs.RData
-rw-r--r-- 1 root root       970 Jun 28 22:28 ds_summary.txt
-rw-r--r-- 1 root root   1806761 Jun 28 21:32 Stats.RData
Dataset: PBMC_V3_3

totUMIs: 24224784
totCells: 12521
totCounts: 50724058
countsPerUMI: 2.09389103324925
UMIsPerCell: 1934.73236961904
countsPerCell: 4051.11876048239
totFracOnes: 0.439630256352337
FracMolWithUMIDistToNeighborH: 201, 1132, 638, 29, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 199, 1076, 707, 17, 1, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.1005, 0.566, 0.319, 0.0145, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborLFrac: 0.0995, 0.538, 0.3535, 0.0085, 5e-04, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor1cpy: 214, 1164, 602, 20, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor2cpy: 189, 1107, 689, 15, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor>=3cpy: 193, 1144, 650, 13, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighbor1cpy