<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFigS6Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Precalculates data for figure S6**

This notebook precalculates the data for supplementary figure 6, since there are some steps involved that downloads data from the internet (genome info). This notebook may take 15-30 minutes to run.

Steps:
1. Download the code and processed data
2. Setup the R environment
3.Generate the data

The data used in these calculations is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessEVAL.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_EVAL.ipynb



**1. Download the code and processed data**

In [1]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


Cloning into 'GRNP_2020'...
remote: Enumerating objects: 384, done.[K
remote: Counting objects: 100% (384/384), done.[K
remote: Compressing objects: 100% (324/324), done.[K
remote: Total 2102 (delta 285), reused 87 (delta 60), pack-reused 1718[K
Receiving objects: 100% (2102/2102), 11.18 MiB | 12.68 MiB/s, done.
Resolving deltas: 100% (1470/1470), done.


In [2]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget https://zenodo.org/record/4661263/files/EVAL.zip?download=1 && unzip 'EVAL.zip?download=1' && rm 'EVAL.zip?download=1'


--2021-04-05 15:05:34--  https://zenodo.org/record/4661263/files/EVAL.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206479312 (197M) [application/octet-stream]
Saving to: ‘EVAL.zip?download=1’


2021-04-05 15:06:28 (3.80 MB/s) - ‘EVAL.zip?download=1’ saved [206479312/206479312]

Archive:  EVAL.zip?download=1
   creating: EVAL/
  inflating: EVAL/Bug_10.RData       
  inflating: EVAL/Bug_100.RData      
  inflating: EVAL/Bug_20.RData       
  inflating: EVAL/Bug_25.RData       
  inflating: EVAL/Bug_40.RData       
  inflating: EVAL/Bug_5.RData        
  inflating: EVAL/Bug_60.RData       
  inflating: EVAL/Bug_80.RData       
  inflating: EVAL/ds_summary.txt     
  inflating: EVAL/PredEvalData.RDS   
  inflating: EVAL/Stats.RData        


In [3]:
#Check that download worked
!cd figureData && ls -l && cd EVAL && ls -l

total 4
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVAL
total 212788
-rw-r--r-- 1 root root 37523336 Jun 30  2020 Bug_100.RData
-rw-r--r-- 1 root root 17301493 Jun 30  2020 Bug_10.RData
-rw-r--r-- 1 root root 23443334 Jun 30  2020 Bug_20.RData
-rw-r--r-- 1 root root 25288320 Jun 30  2020 Bug_25.RData
-rw-r--r-- 1 root root 29057075 Jun 30  2020 Bug_40.RData
-rw-r--r-- 1 root root 11226736 Jun 30  2020 Bug_5.RData
-rw-r--r-- 1 root root 32629892 Jun 30  2020 Bug_60.RData
-rw-r--r-- 1 root root 35477251 Jun 30  2020 Bug_80.RData
-rw-r--r-- 1 root root     1025 Jul  1  2020 ds_summary.txt
-rw-r--r-- 1 root root  4167784 Jul  1  2020 PredEvalData.RDS
-rw-r--r-- 1 root root  1761192 Jun 30  2020 Stats.RData


**2. Prepare the R environment**

In [4]:
#switch to R mode
%reload_ext rpy2.ipython


In [12]:
#install the R packages and setup paths
%%R
install.packages("qdapTools")
install.packages("tidyverse")
install.packages("DescTools")
#install.packages("stringdist")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/qdapTools_1.3.5.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 36880 bytes (36 KB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[writ


Error: Bioconductor version '3.10' requires R version '3.6'; use
  `BiocManager::install(version = '3.12')` with R version 4.0; see
  https://bioconductor.org/install


In [16]:
%%R
BiocManager::install("GenomicFeatures", update=FALSE)
BiocManager::install("BSgenome", update=FALSE)
BiocManager::install("BSgenome.Mmusculus.UCSC.mm10", update=FALSE)
BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene", update=FALSE)

R[write to console]: 'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.rstudio.com


R[write to console]: Bioconductor version 3.12 (BiocManager 1.30.12), R 4.0.4 (2021-02-15)

R[write to console]: Installing package(s) 'GenomicFeatures'

R[write to console]: also installing the dependencies ‘formatR’, ‘lambda.r’, ‘futile.options’, ‘matrixStats’, ‘futile.logger’, ‘snow’, ‘MatrixGenerics’, ‘DelayedArray’, ‘BiocParallel’, ‘Rhtslib’, ‘SummarizedExperiment’, ‘GenomeInfoDbData’, ‘zlibbioc’, ‘Rsamtools’, ‘GenomicAlignments’, ‘GenomeInfoDb’, ‘GenomicRanges’, ‘XVector’, ‘Biostrings’, ‘rtracklayer’


R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/formatR_1.8.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 109533 bytes (106 KB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to co

In [17]:
%%R
BiocManager::install("biomaRt", update=FALSE)

R[write to console]: 'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.rstudio.com


R[write to console]: Bioconductor version 3.12 (BiocManager 1.30.12), R 4.0.4 (2021-02-15)

R[write to console]: Installing package(s) 'biomaRt'

R[write to console]: trying URL 'https://bioconductor.org/packages/3.12/bioc/src/contrib/biomaRt_2.46.3.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 671575 bytes (655 KB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to con

**3. Define a general function to precalculate data for a pair of datasets**

The plot is about showing the correlation (Lin's CCC) between two datasets at different number of reads.

We created a function for this to enable comparisons of more datasets.


In [18]:
#First set some path variables
%%R
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [19]:
#Import some helper code (available in other notebooks)
%%R
source(paste0(sourcePath,"ButterflyHelpers.R"))

library(tidyverse)
library(biomaRt)
library(GenomicFeatures)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(qdapTools)




R[write to console]: Loading required package: BiocGenerics

R[write to console]: Loading required package: parallel

R[write to console]: 
Attaching package: ‘BiocGenerics’


R[write to console]: The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


R[write to console]: The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union


R[write to console]: The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


R[write to console]: The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Posit

In [20]:
#############################################
#Get average transcript length and GC content for each gene
#############################################
%%R

###############
# Gene length
##############

txdbM = makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL", dataset="mmusculus_gene_ensembl", transcript_ids=NULL, circ_seqs=DEFAULT_CIRC_SEQS, filter=NULL, id_prefix="ensembl_", host="www.ensembl.org", port=80, taxonomyId=NA, miRBaseBuild=NA)

tlM = transcriptLengths(txdbM, with.cds_len=FALSE)
colnames(tlM)[c(2,3)] = c("tx","gene") #rename to match with the other datasets



#get versions
listMarts()#version 103
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
listDatasets(ensembl)

##############
#GC Content
##############

#load genome for mus musculus
#BiocManager::install("BSgenome")
#BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")
library(BSgenome.Mmusculus.UCSC.mm10)
genomeM <- BSgenome.Mmusculus.UCSC.mm10
#BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdbM <- TxDb.Mmusculus.UCSC.mm10.knownGene
transcriptsM <- exonsBy(txdbM, by="tx", use.names=TRUE)


seqsM = extractTranscriptSeqs(genomeM, transcriptsM)

#calculate gc content in the strings

gcContent <- function(x) {
  return (letterFrequency(x, c("GC"), OR="|", as.prob=TRUE))
} 


#test 1 - gcContent
b = BString("GGCCGA")
gcContent(b)#should be 5/6 = 0.8333333, ok!


gcFullLength = gcContent(seqsM)
txs = seqsM@ranges@NAMES
#need to remove the version from the transcript name
txs = substr(txs, 1, 18)

#these have transcript ids only, not gene ids. So, merge with the length to get gene id
gcs = tibble(tx = txs, gc = gcFullLength[,1])

#convert the genes
tr2g = read.table(paste0(dataPath,"EVAL/bus_output/transcripts_to_genes.txt"), stringsAsFactors = F)
lookupTable = tr2g[,2:3]
lookupTable= unique(lookupTable)
lookupTable[[1]] = substr(lookupTable[[1]], 1, str_length("ENSMUSG00000087582"))


outGenes = lookup(tlM$gene, lookupTable)
#length(outGenes)#140725
#dim(tlM) #140725      5, ok
tlM$gene = outGenes

gcsMMerged = inner_join(gcs, tlM, by="tx")
#remove NAs (genes that don't have a gene name)
gcsMMergedFilt = gcsMMerged[!is.na(gcsMMerged$gene),]

#take mean of all transcripts for each gene
gctl = gcsMMergedFilt %>% group_by(gene) %>% summarize(gc=mean(gc), txlen = mean(tx_len))




R[write to console]: Download and preprocess the 'transcripts' data frame ... 
R[write to console]: OK

R[write to console]: Download and preprocess the 'chrominfo' data frame ... 
R[write to console]: Error in GenomeInfoDb:::make_circ_flags_from_circ_seqs(chromlengths$name,  : 
  'circ_seqs' contains unrecognized chromosome names: chrM, MtDNA, mit,
  Mito, mitochondrion, dmel_mitochondrion_genome, Pltd, ChrC, Pt,
  chloroplast, Chloro, 2micron, 2-micron, 2uM




Error in GenomeInfoDb:::make_circ_flags_from_circ_seqs(chromlengths$name,  : 
  'circ_seqs' contains unrecognized chromosome names: chrM, MtDNA, mit,
  Mito, mitochondrion, dmel_mitochondrion_genome, Pltd, ChrC, Pt,
  chloroplast, Chloro, 2micron, 2-micron, 2uM


In [10]:
%%R
##########################
#Merge the data with the stats from the EVAL dataset
##########################
loadStats("EVAL")
stats = getStats("EVAL")
d = stats[,c(1,which((colnames(stats) == "FracOnes_EVAL_d_100") | (colnames(stats) == "CountsPerUMI_EVAL_d_100") | (colnames(stats) == "UMIs_EVAL_d_100")))]
colnames(d)[2:4] = c("UMIs","FSCM", "CU")
d = d[d$UMIs >= 30,]

#merge everything together
gctlcu = inner_join(gctl, d, by="gene")
dim(gctlcu)#11826

saveRDS(gctlcu, paste0(figure_data_path, "gc.RDS"))


R[write to console]: Error in inner_join(gctl, d, by = "gene") : object 'gctl' not found




Error in inner_join(gctl, d, by = "gene") : object 'gctl' not found


In [11]:
!cd figureData && ls -l

total 4
drwxr-xr-x 2 root root 4096 Jul  1  2020 EVAL
