<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PBMC_V3_3 dataset: Processes the BUG files into files prepared for use in R**

This notebook processes the output of the FASTQ processing step for this dataset. The files produced here are pre-generated, and the figure generation code downloads those pre-generated copies. This step prepares the data for figure generation by filtering it and by producing down-sampled datasets in addition to the original one.

Steps:
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process the data
4. Generate statistics for the dataset

The data used in this processing step is produced by the following notebook:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_3.ipynb


**1. Clone the code repo and download data to process**

In [None]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 263, done.
remote: Counting objects: 100% (263/263), done.
remote: Compressing objects: 100% (216/216), done.
remote: Total 1146 (delta 180), reused 83 (delta 47), pack-reused 883
Receiving objects: 100% (1146/1146), 7.53 MiB | 20.73 MiB/s, done.
Resolving deltas: 100% (734/734), done.


In [None]:
#download BUG data from Zenodo
!mkdir -p data
!cd data && wget 'https://zenodo.org/record/3924675/files/PBMC_V3_3.zip?download=1' && unzip 'PBMC_V3_3.zip?download=1' && rm 'PBMC_V3_3.zip?download=1'

--2020-07-02 22:13:50--  https://zenodo.org/record/3924675/files/PBMC_V3_3.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 476814352 (455M) [application/octet-stream]
Saving to: ‘PBMC_V3_3.zip?download=1’


2020-07-02 22:14:25 (13.3 MB/s) - ‘PBMC_V3_3.zip?download=1’ saved [476814352/476814352]

Archive:  PBMC_V3_3.zip?download=1
   creating: PBMC_V3_3/
   creating: PBMC_V3_3/bus_output/
  inflating: PBMC_V3_3/bus_output/bug.txt  
  inflating: PBMC_V3_3/bus_output/coll.genes.txt  
  inflating: PBMC_V3_3/bus_output/transcripts_to_genes.txt  


In [None]:
#Check that download worked
!cd data && ls -l && cd PBMC_V3_3/bus_output && ls -l

total 4
drwxr-xr-x 3 root root 4096 Jul  1 00:00 PBMC_V3_3
total 2214104
-rw-r--r-- 1 root root 2258153556 Jun 30 16:57 bug.txt
-rw-r--r-- 1 root root     738211 Jun 30 15:10 coll.genes.txt
-rw-r--r-- 1 root root    8335642 Jun 30 14:59 transcripts_to_genes.txt
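
The listing above can also be checked programmatically. The following is a minimal sketch, assuming the directory layout shown in the unzip output; the helper name `missing_bus_files` is hypothetical and not part of the GRNP_2020 code.

```python
from pathlib import Path

# File names taken from the unzip listing above.
EXPECTED_FILES = ("bug.txt", "coll.genes.txt", "transcripts_to_genes.txt")


def missing_bus_files(dataset_dir):
    """Return the expected files that are absent or empty in <dataset_dir>/bus_output.

    Hypothetical helper for illustration only.
    """
    bus_dir = Path(dataset_dir) / "bus_output"
    missing = []
    for name in EXPECTED_FILES:
        path = bus_dir / name
        # Treat a zero-byte file as a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_bus_files("data/PBMC_V3_3")
    if problems:
        print(f"missing or empty: {problems}")
    else:
        print("all files present")
```

An empty result means all three input files are in place, so the processing cells below have what they need.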


**2. Prepare the R environment**

In [None]:
#load the rpy2 extension, which enables the R cell magic used below
%reload_ext rpy2.ipython


  from pandas.core.index import Index as PandasIndex


In [None]:
%%R
#install the R packages
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")
install.packages("stringr")


R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: also installing the dependencies ‘bitops’, ‘chron’, ‘data.table’, ‘RCurl’, ‘XML’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/bitops_1.0-6.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 8734 bytes


**3. Process the data**

Here we discard multi-mapped UMIs and all UMIs belonging to cells with fewer than 200 UMIs. We also precalculate statistics such as gene expression and the fraction of single-copy molecules, and save them (as stats) for later use in figure generation. In addition, we generate down-sampled BUGs.
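
To make the filtering rules concrete, here is a small Python sketch on toy data. It is not the code in `BUGProcessingHelpers.R`: the record layout (a tuple of barcode, UMI, gene tuple, read count), the function names, and the order of the two filters are assumptions made for illustration.

```python
from collections import Counter

MIN_UMIS_PER_CELL = 200  # threshold stated in the text above


def filter_bug(records, min_umis=MIN_UMIS_PER_CELL):
    """Drop multi-mapped UMIs, then drop cells with fewer than min_umis UMIs.

    records: iterable of (barcode, umi, genes, count) tuples, where `genes`
    is a tuple of gene identifiers (hypothetical layout for this sketch).
    """
    # A UMI is treated as multi-mapped if it is assigned to more than one gene.
    unique = [r for r in records if len(r[2]) == 1]
    # Count the remaining UMIs per cell barcode.
    umis_per_cell = Counter(r[0] for r in unique)
    return [r for r in unique if umis_per_cell[r[0]] >= min_umis]


def fraction_single_copy(records):
    """Fraction of molecules (UMIs) that were observed exactly once."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[3] == 1) / len(records)


if __name__ == "__main__":
    toy = [
        ("AAAC", "TTGA", ("G1",), 3),
        ("AAAC", "CCGT", ("G2",), 1),
        ("AAAC", "GGAT", ("G1", "G3"), 2),  # multi-mapped, dropped
        ("TTTG", "ACGT", ("G1",), 1),       # its cell has too few UMIs
    ]
    kept = filter_bug(toy, min_umis=2)  # small threshold for the toy data
    print(len(kept))                    # prints 2
    print(fraction_single_copy(kept))   # prints 0.5
```

In the toy example one of the two surviving molecules has a single read, so the fraction of single-copy molecules is 0.5.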

In [None]:
#create output directory
!mkdir figureData

In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Process and filter the BUG file
source(paste0(sourcePath, "BUGProcessingHelpers.R"))
createStandardBugsData(paste0(dataPath,"PBMC_V3_3/"), "PBMC_V3_3", c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1))



R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


R[write to console]: 
Attaching package: ‘qdapTools’


R[write to console]: The following object is masked from ‘package:dplyr’:

    id




[1] "Generating data for PBMC_V3_3"
[1] "Reading BUG from data/PBMC_V3_3/ ..."
[1] "Filtering multi-mapped reads..."
[1] "Fraction multi-mapped reads: 0.162711442687754"
[1] "Converting genes..."
[1] "Done"
[1] "Down-sampling in total 7 bugs:"
[1] "7: Down-sampling to 1"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "6: Down-sampling to 0.8"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "5: Down-sampling to 0.6"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "4: Down-sampling to 0.4"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "3: Down-sampling to 0.2"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "2: Down-sampling to 0.1"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "1: Down-sampling to 0.05"
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "saving BUG..."
[1] "creating stats..."


R[write to console]: `summarise()` ungrouping output (override with `.groups` argument)



[1] "cpm normalizing..."
[1] "Done"
[1] "Saving stats..."
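
The call above requests seven down-sampled versions of the data (fractions 0.05 to 1). As an illustration of what down-sampling to a fraction can mean, the sketch below draws that fraction of the reads without replacement and recounts the reads per UMI. This is an assumed scheme on toy data; the sampling method actually used in `BUGProcessingHelpers.R` may differ, and `downsample_counts` is a hypothetical name.

```python
import random
from collections import Counter


def downsample_counts(counts, fraction, seed=0):
    """Keep round(fraction * total) reads, sampled without replacement.

    counts: list holding the read count of each UMI.
    Returns a list of the same length; a UMI may drop to zero reads.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Expand to one entry per read, labelled with the index of its UMI.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    keep = round(fraction * len(pool))
    sampled = Counter(random.Random(seed).sample(pool, keep))
    return [sampled.get(i, 0) for i in range(len(counts))]


if __name__ == "__main__":
    toy_counts = [4, 1, 3, 2]  # toy read counts for four UMIs
    for fraction in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1):
        ds = downsample_counts(toy_counts, fraction)
        # Report reads kept and the number of UMIs still observed.
        print(fraction, sum(ds), sum(1 for c in ds if c > 0))
```

Sampling reads rather than UMIs means that low-count molecules can disappear entirely at small fractions, which is why the number of observed UMIs shrinks along with the read total.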


**4. Generate statistics for the dataset**

Here we create a file with various statistics for the dataset. The file is used for generating Table S2 and also contains some additional information about the dataset. Generating it may take several hours.

In [None]:
%%R
source(paste0(sourcePath, "GenBugSummary.R"))
genBugSummary("PBMC_V3_3", "FGF23", "RPS10", 10)

[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 10

In [None]:
!cd figureData/PBMC_V3_3 && ls -l && cat ds_summary.txt

total 1869192
-rw-r--r-- 1 root root 414359244 Jul  2 00:27 Bug_100.RData
-rw-r--r-- 1 root root 136417318 Jul  2 00:04 Bug_10.RData
-rw-r--r-- 1 root root 218074220 Jul  2 00:07 Bug_20.RData
-rw-r--r-- 1 root root 310531469 Jul  2 00:11 Bug_40.RData
-rw-r--r-- 1 root root  77502057 Jul  2 00:02 Bug_5.RData
-rw-r--r-- 1 root root 362533949 Jul  2 00:16 Bug_60.RData
-rw-r--r-- 1 root root 392395794 Jul  2 00:23 Bug_80.RData
-rw-r--r-- 1 root root       976 Jul  2 03:59 ds_summary.txt
-rw-r--r-- 1 root root   2218420 Jul  2 00:27 Stats.RData
Dataset: PBMC_V3_3

totUMIs: 41540915
totCells: 5228
totCounts: 172653580
countsPerUMI: 4.15622958714318
UMIsPerCell: 7945.85214231063
countsPerCell: 33024.7857689365
totFracOnes: 0.224678392375324
FracMolWithUMIDistToNeighborH: 83, 772, 1058, 86, 1, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 84, 720, 1078, 114, 4, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.0415, 0.386, 0.529, 0.043, 5e-04, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborLFrac: 0
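
As a quick consistency check, the ratios in `ds_summary.txt` can be recomputed from the totals it reports. The sketch below copies the values printed above. It assumes that the `FracMolWithUMIDistToNeighborHFrac` entries are the `FracMolWithUMIDistToNeighborH` counts divided by their sum (2000, matching the "Will process 2000 UMIs" log lines); that reading is inferred from the numbers, not taken from `GenBugSummary.R`.

```python
# Values copied from the ds_summary.txt output above.
tot_umis = 41540915
tot_cells = 5228
tot_counts = 172653580
neighbor_dist_h = [83, 772, 1058, 86, 1, 0, 0, 0, 0, 0]

# Ratios reported in the summary.
counts_per_umi = tot_counts / tot_umis
umis_per_cell = tot_umis / tot_cells
counts_per_cell = tot_counts / tot_cells

# Assumed relation: each fraction is the count divided by the total count.
h_total = sum(neighbor_dist_h)
h_frac = [n / h_total for n in neighbor_dist_h]

print(f"countsPerUMI:  {counts_per_umi:.6f}")
print(f"UMIsPerCell:   {umis_per_cell:.4f}")
print(f"countsPerCell: {counts_per_cell:.4f}")
print(f"H total: {h_total}, H fractions: {h_frac[:5]}")
```

The recomputed values agree with the `countsPerUMI`, `UMIsPerCell`, `countsPerCell`, and `FracMolWithUMIDistToNeighborHFrac` lines in the file.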