<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFig4AC_S23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Generates figure 4 A-C and supplementary figure 23**

This notebook generates figures showing how BUTTERFLY correction improves scRNA-Seq expression estimates, by comparing predictions made from a downsampled dataset against the full dataset. It also shows the effect of "borrowing" CU histogram information from similar datasets, and estimates the sampling noise, which sets a theoretical upper limit on prediction performance in downsampling scenarios such as this one.

Steps:
1. Download the code and processed data
2. Prepare the R environment
3. Generate the figures

The data for these figures is produced by the following notebooks:

Processing of FASTQ files with kallisto and bustools:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_3.ipynb

Preprocessing of BUG files:

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/R_processing/ProcessR_PBMC_V3_3.ipynb

Precalculation of figure data: 

https://github.com/pachterlab/GRNP_2020/blob/master/notebooks/figure_generation/GenFig4AC_S23Data.ipynb


**1. Download the code and processed data**

In [None]:
#download the R code
![ -d "GRNP_2020" ] && rm -r GRNP_2020

!git clone https://github.com/pachterlab/GRNP_2020.git


In [None]:
#download processed data from Zenodo for all datasets
![ -d "figureData" ] && rm -r figureData
!mkdir figureData
!cd figureData && wget -O FigureData.zip "https://zenodo.org/record/4661263/files/FigureData.zip?download=1" && unzip FigureData.zip && rm FigureData.zip


In [None]:
#Check that download worked
!cd figureData && ls -l && cd PBMC_V3_3 && ls -l
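
The `ls -l` listing above has to be inspected by eye. As an optional alternative, the Python cell below is a minimal sketch of a programmatic check that a dataset directory exists and is not empty. The helper name `check_dataset_dir` is hypothetical and is not part of the GRNP_2020 code; the only names taken from this notebook are the `figureData` and `PBMC_V3_3` directories.

```python
from pathlib import Path


def check_dataset_dir(root, dataset):
    """Return the sorted file names in root/dataset, or raise if missing or empty.

    Hypothetical helper for illustration; not part of GRNP_2020.
    """
    dataset_dir = Path(root) / dataset
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"{dataset_dir} does not exist; the download or unzip step may have failed")
    names = sorted(p.name for p in dataset_dir.iterdir())
    if not names:
        raise FileNotFoundError(f"{dataset_dir} is empty; the archive may be incomplete")
    return names
```

After the download cell has run, `check_dataset_dir("figureData", "PBMC_V3_3")` would either return the file names or stop with an error that names the missing directory.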

**2. Prepare the R environment**

In [None]:
#switch to R mode
%reload_ext rpy2.ipython


In [None]:
%%R
#install the R packages (the paths are set in step 3)
install.packages(c("dplyr", "ggplot2", "DescTools", "ggpubr",
                   "hexbin", "reshape2", "farver"))




**3. Generate the figures**


In [None]:
%%R
#First set some path variables
source("GRNP_2020/RCode/pathsGoogleColab.R")


In [None]:
%%R
#Import the code for prediction (available in other notebooks)
source(paste0(sourcePath,"ButterflyHelpers.R"))
#source(paste0(sourcePath,"preseqHelpers.R"))
source(paste0(sourcePath,"CCCHelpers.R"))
source(paste0(sourcePath,"ggplotHelpers.R"))





In [None]:
#create figure directory
![ -d "figures" ] && rm -r figures
!mkdir figures

In [None]:
%%R
#Create and save the figures
library(ggplot2)
library(ggpubr)
library(hexbin)
library(dplyr)


ldata = readRDS(paste0(figure_data_path, "Fig4AC_ldata.RDS"))
ldata2 = readRDS(paste0(figure_data_path, "Fig4AC_ldata2.RDS"))

#generate plot data
plotdata = tibble(gene=ldata$gene, 
                  x=ldata$x, 
                  nopred=ldata$nopred - ldata$trueval,
                  pred=ldata$pred - ldata$trueval,
                  poolpred=ldata$poolpred - ldata$trueval)

#melt
plotdata.m = reshape2::melt(plotdata, id.vars=c("gene","x"), measure.vars = c("nopred", "pred", "poolpred"))

labl = labeller(variable = 
                      c("nopred" = "No Correction",
                        "pred" = "Correction",
                        "poolpred" = "Correction using Pooling"))

dfline = data.frame(x=c(0,16), y=c(0,0))

dummyData = data.frame(x=c(0,0), y=c(1.1, -1.5)) #used in a trick to set y axis range below

#after_stat() and linewidth require ggplot2 >= 3.4.0
fig4AC = ggplot(plotdata.m) +
  stat_binhex(bins=60, na.rm = TRUE, mapping=aes(x = x, y=value, fill = log(after_stat(count)))) +
  facet_wrap(facets = ~ variable, scales = "free_x", labeller = labl, ncol=3) +
  geom_line(data=dfline, mapping = aes(x=x, y=y), color="black", linewidth=1) +
  geom_blank(data = dummyData, mapping = aes(x=x, y=y)) + #trick to set y axis range
  labs(y=expression(Log[2]*" fold change (CPM)"), x=expression(Log[2]*"(CPM + 1)")) +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position = "bottom", legend.direction = "horizontal",
        strip.text.x = element_text(size = 12, face = "bold"),
        strip.background = element_blank())

print(fig4AC)

ggsave(
  paste0(figure_path, "Fig4AC.png"),
  plot = fig4AC, device = "png",
  width = 7, height = 4, dpi = 300)
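
The y axis of Fig 4A-C is a difference between two log-transformed expression values. The Python cell below is a minimal sketch of that quantity. It assumes that the precomputed columns (`nopred`, `pred`, `poolpred`, `trueval`) hold log2(CPM + 1) values, as the axis labels suggest; the actual transformation is done in the data-precalculation notebook. The function names and toy counts are hypothetical.

```python
import math


def log2_cpm(counts, pseudocount=1.0):
    """Convert raw per-gene counts to log2(CPM + pseudocount)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts sum to zero, CPM is undefined")
    return [math.log2(c * 1e6 / total + pseudocount) for c in counts]


def log2_fold_change(estimate_counts, reference_counts):
    """Per-gene difference log2(CPM_estimate + 1) - log2(CPM_reference + 1)."""
    estimate = log2_cpm(estimate_counts)
    reference = log2_cpm(reference_counts)
    return [e - r for e, r in zip(estimate, reference)]


# Toy data (hypothetical): three genes in a "full" dataset.
full = [100, 300, 600]

# A downsampling that keeps every gene in proportion leaves CPM unchanged,
# so every fold change is zero.
proportional = log2_fold_change([10, 30, 60], full)

# A downsampling that under-represents the first gene gives it a negative
# fold change, which is what the "No Correction" panel visualizes.
skewed = log2_fold_change([5, 30, 65], full)
```

In this toy case the first gene's CPM is halved, so its value in `skewed` is close to -1.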



In [None]:
%%R
#########################
# Fig S23 (Sampling noise)
#########################
#build the plot data from the precomputed values

plotdata2 = tibble(gene=ldata2$gene, 
                  x=ldata2$x, 
                  y=ldata2$sampling - ldata2$nopred)


dfline = data.frame(x=c(0,16), y=c(0,0))
dummyData = data.frame(x=c(0,0), y=c(1.1, -1.5))

#after_stat() and linewidth require ggplot2 >= 3.4.0
figS23 = ggplot(plotdata2) +
  stat_binhex(bins=60, na.rm = TRUE, mapping=aes(x = x, y=y, fill = log(after_stat(count)))) +
  geom_line(data=dfline, mapping = aes(x=x, y=y), color="black", linewidth=1) +
  geom_blank(data = dummyData, mapping = aes(x=x, y=y)) + #trick to set y axis range
  labs(y=expression(Log[2]*" fold change (CPM)"), x=expression(Log[2]*"(CPM + 1)")) +
  theme(panel.background = element_rect("white", "white", 0, 0, "white"),
        legend.position = "bottom", legend.direction = "horizontal",
        strip.text.x = element_text(size = 12, face = "bold"),
        strip.background = element_blank())

print(figS23)


ggsave(
  paste0(figure_path, "FigS23.png"),
  plot = figS23, device = "png",
  width = 3, height = 4, dpi = 300)
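
Fig S23 shows sampling noise: the spread that remains between two downsamplings of the same data even when nothing else differs. The Python cell below is a minimal sketch of that idea. It assumes binomial downsampling, where each molecule is kept independently with a fixed probability. How the precomputed `sampling` column was actually produced is defined in the data-precalculation notebook, so the function name, seeds and toy counts here are hypothetical.

```python
import random


def binomial_downsample(counts, fraction, seed):
    """Keep each molecule independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    # One Bernoulli trial per molecule; simple rather than fast.
    return [sum(1 for _ in range(c) if rng.random() < fraction) for c in counts]


# Hypothetical per-gene molecule counts, from lowly to highly expressed.
counts = [2, 20, 200, 2000]

# Two independent draws from the same counts differ only by sampling noise.
draw_a = binomial_downsample(counts, fraction=0.1, seed=1)
draw_b = binomial_downsample(counts, fraction=0.1, seed=2)

for original, a, b in zip(counts, draw_a, draw_b):
    print(original, a, b)
```

Because no prediction can recover information that the downsampling itself removed at random, the disagreement between two such draws is a natural reference for the best achievable agreement with the full dataset.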



In [None]:
%%R
#The data to present over the plots
print(paste0("CCC, no pred: ", getCCC(ldata$nopred, ldata$trueval))) #0.981275291888894
print(paste0("CCC, pred no pooling: ", getCCC(ldata$pred, ldata$trueval))) #0.993829998877551
print(paste0("CCC, pred with pooling: ", getCCC(ldata$poolpred, ldata$trueval))) #0.997028015743896
print(paste0("CCC, no pred, bin ds vs ds: ", getCCC(ldata2$nopred, ldata2$sampling)))
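
`getCCC` is defined in `CCCHelpers.R`, which is not shown in this notebook. For orientation, the Python cell below is a sketch of Lin's concordance correlation coefficient computed directly from its definition, `2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)`, using 1/n moments. Whether `getCCC` uses exactly this estimator is an assumption, and the function name and toy vectors are illustrative only.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient with 1/n (population) moments."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two observations")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for two identical constant vectors")
    return 2 * cov_xy / denominator


truth = [1.0, 2.0, 3.0]

# Perfect agreement gives 1.
perfect = concordance_correlation(truth, truth)

# A constant offset of +1 lowers the CCC to 4/7, although the Pearson
# correlation of these two vectors is exactly 1.
shifted = concordance_correlation(truth, [2.0, 3.0, 4.0])
```

Unlike the Pearson correlation, this measure penalizes systematic bias as well as scatter, which makes it a reasonable choice for judging how close corrected values come to the true values.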
