<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_LC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**LC dataset: Process the BUG files into files prepared for use in R**

This notebook processes the output of the fastq file processing for the LC dataset. The data produced here is also available pre-generated, and the figure generation code downloads that pre-generated copy. The purpose of this processing step is to prepare the data for figure generation by filtering it and by producing downsampled datasets in addition to the original one.

Steps:
1. Clone the code repo and download data to process
2. Prepare the R environment
3. Process the data
4. Generate statistics for the dataset

**1. Clone the code repo and download data to process**

In [None]:
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


In [None]:
#download BUG data from Zenodo (quote the URL and save under a plain file name)
!mkdir -p data
!cd data && wget -O LC.zip "https://zenodo.org/record/3924675/files/LC.zip?download=1" && unzip LC.zip && rm LC.zip

In [None]:
#Check that download worked
!cd data && ls -l && cd LC/bus_output && ls -l
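
The same check can be scripted so that a failed download stops the notebook early instead of surfacing later in the R code. This is a minimal sketch, assuming the archive unpacks to `data/LC/bus_output` as the `ls` commands above expect; `check_download` is a hypothetical helper and not part of GRNP_2020.

```python
from pathlib import Path


def check_download(data_dir="data"):
    """Return the sorted file names in <data_dir>/LC/bus_output, or raise if none."""
    bus_dir = Path(data_dir) / "LC" / "bus_output"
    if not bus_dir.is_dir():
        raise FileNotFoundError(f"{bus_dir} is missing; the download or unzip step failed")
    files = sorted(p.name for p in bus_dir.iterdir())
    if not files:
        raise FileNotFoundError(f"{bus_dir} exists but is empty")
    return files
```

Calling `check_download()` in a regular Python cell right after the download lists the unpacked files or raises with the offending path.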

**2. Prepare the R environment**

In [None]:
#load the rpy2 extension, which enables the %%R cell magic
%reload_ext rpy2.ipython


In [None]:
%%R
#install the R packages (%%R must be the first line of the cell)
install.packages("qdapTools")
install.packages("dplyr")
install.packages("stringdist")
install.packages("stringr")


**3. Process the data**

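Here we discard multimapped UMIs and all UMIs belonging to cells with fewer than 1000 UMIs (the `UmisPerCellLimit` passed to `createStandardBugsData` below). We also precalculate statistics such as gene expression and the fraction of single-copy molecules, and save them as stats (statistics) for later use in figure generation. Finally, we generate down-sampled BUGs at 5%, 10%, 20%, 40%, 60% and 80% of the data, in addition to the full dataset.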
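
To make the filtering concrete, here is a small Python sketch of the idea. It is not the code in `BUGProcessingHelpers.R`. It assumes a BUG can be viewed as a list of (barcode, UMI, genes, count) records, and that "multimapped" means a UMI assigned to more than one gene. The names `filter_bug` and `fraction_single_copy` are hypothetical.

```python
from collections import Counter


def filter_bug(records, umis_per_cell_limit):
    """records: list of (barcode, umi, genes, count); genes is a tuple of gene names.

    First drops multimapped UMIs (more than one gene), then drops every cell
    that is left with fewer than umis_per_cell_limit UMIs.
    """
    unique = [r for r in records if len(r[2]) == 1]
    umis_per_cell = Counter(barcode for barcode, _, _, _ in unique)
    return [r for r in unique if umis_per_cell[r[0]] >= umis_per_cell_limit]


def fraction_single_copy(records):
    """Fraction of UMIs (molecules) that were observed with exactly one read."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[3] == 1) / len(records)


toy = [
    ("cellA", "AAAC", ("G1",), 1),
    ("cellA", "AAAG", ("G1", "G2"), 3),  # multimapped, dropped
    ("cellA", "AAAT", ("G2",), 4),
    ("cellB", "CCCA", ("G1",), 1),       # cellB keeps only one UMI, dropped at limit 2
]
kept = filter_bug(toy, umis_per_cell_limit=2)
print(len(kept), fraction_single_copy(kept))  # prints: 2 0.5
```

In this sketch the cell threshold is applied after the multimapped UMIs are removed, so a cell is judged on its uniquely mapped UMIs only; whether the R code uses the same order is not stated here.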

In [None]:
#create output directory (no error if it already exists)
!mkdir -p figureData

In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Process and filter the BUG file, and create down-sampled copies at the given fractions
source(paste0(sourcePath, "BUGProcessingHelpers.R"))
createStandardBugsData(paste0(dataPath,"LC/"), "LC", c(0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1), UmisPerCellLimit = 1000)
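
The fractions passed above control the down-sampling. One common way to down-sample read counts is binomial thinning, where each read is kept independently with probability equal to the fraction. The Python sketch below illustrates that technique only; it is an assumption that it resembles what `createStandardBugsData` does internally, and `downsample_counts` is a hypothetical name. The output file naming (`Bug_5.RData` up to `Bug_100.RData`) is inferred from the directory listing at the end of this notebook.

```python
import random


def downsample_counts(counts, fraction, seed=0):
    """Binomial thinning: keep each read independently with probability `fraction`.

    counts: reads per molecule (UMI). Molecules left with zero reads are dropped.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    thinned = []
    for c in counts:
        kept = sum(1 for _ in range(c) if rng.random() < fraction)
        if kept > 0:
            thinned.append(kept)
    return thinned


fractions = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1]
# File names as they appear in figureData/LC (naming inferred from the listing)
file_names = [f"Bug_{round(f * 100)}.RData" for f in fractions]
print(file_names)
```

With `fraction=1` every read is kept, so the result equals the input, which corresponds to the full dataset in `Bug_100.RData`.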



**4. Generate statistics for the dataset**

Here we create a file with various statistics for the dataset, which is used for generating Table S2. The file also contains some additional information about the dataset. Generating it may take several hours.
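
Three of the ratio fields in `ds_summary.txt` can be reproduced from the totals in the same file, which is a quick sanity check on a finished run. The sketch below uses the totals printed in the output further down. The relations are inferred from those numbers, not read from `GenBugSummary.R`, and `summary_ratios` is a hypothetical helper.

```python
def summary_ratios(tot_umis, tot_cells, tot_counts):
    """Derive the per-UMI and per-cell averages from the dataset totals."""
    return {
        "countsPerUMI": tot_counts / tot_umis,
        "UMIsPerCell": tot_umis / tot_cells,
        "countsPerCell": tot_counts / tot_cells,
    }


# Totals as reported in ds_summary.txt for the LC dataset
ratios = summary_ratios(tot_umis=70004066, tot_cells=21137, tot_counts=305427674)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

The printed values should agree with the `countsPerUMI`, `UMIsPerCell` and `countsPerCell` lines of the summary up to rounding.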

In [9]:
%%R
source(paste0(sourcePath, "GenBugSummary.R"))
genBugSummary("LC", "FGF23", "RPS10", 10)

[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 100
[1] 200
[1] 300
[1] 400
[1] 500
[1] 600
[1] 700
[1] 800
[1] 900
[1] 1000
[1] 1100
[1] 1200
[1] 1300
[1] 1400
[1] 1500
[1] 1600
[1] 1700
[1] 1800
[1] 1900
[1] 2000
[1] "Down-sampling to 2000 UMIs"
[1] "Will process 2000 UMIs"
[1] 10

In [10]:
!cd figureData/LC && ls -l && cat ds_summary.txt

total 2753212
-rw-r--r-- 1 root root 616490756 Jul  2 14:58 Bug_100.RData
-rw-r--r-- 1 root root 202747451 Jul  2 15:34 Bug_10.RData
-rw-r--r-- 1 root root 319878147 Jul  2 15:31 Bug_20.RData
-rw-r--r-- 1 root root 452490442 Jul  2 15:26 Bug_40.RData
-rw-r--r-- 1 root root 117384304 Jul  2 15:35 Bug_5.RData
-rw-r--r-- 1 root root 528842371 Jul  2 15:18 Bug_60.RData
-rw-r--r-- 1 root root 579122831 Jul  2 15:08 Bug_80.RData
-rw-r--r-- 1 root root       957 Jul  2 18:54 ds_summary.txt
-rw-r--r-- 1 root root   2297676 Jul  2 15:36 Stats.RData
Dataset: LC

totUMIs: 70004066
totCells: 21137
totCounts: 305427674
countsPerUMI: 4.36299905779759
UMIsPerCell: 3311.92061314283
countsPerCell: 14449.9065146426
totFracOnes: 0.267279174898212
FracMolWithUMIDistToNeighborH: 859, 947, 192, 2, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborL: 635, 1038, 326, 1, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborHFrac: 0.4295, 0.4735, 0.096, 0.001, 0, 0, 0, 0, 0, 0
FracMolWithUMIDistToNeighborLFrac: 0.3175, 0.51