# 4.1 Generating footprints and V-plots with the ATACseqQC package
### Make sure you are using the R kernel to run this notebook

In [None]:
#change to your working directory 
username="annashch"
setwd(paste("/scratch/",username,sep=""))

The [ATACseqQC](https://www.bioconductor.org/packages/release/bioc/vignettes/ATACseqQC/inst/doc/ATACseqQC.html) package provides convenient wrappers for a number of ATAC-seq QC workflows. 
We will use this toolkit to generate footprints and V-plots for some of the transcription factors found to be enriched across conditions in the HOMER analysis (see notebook 3.6).

In [None]:
## load the needed R libraries
library(ATACseqQC)
library(GenomicRanges)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
genome <- Scerevisiae
seqlev=seqlevels(genome)

In [None]:
## indicate the Transcription Factor of interest
tf="REB1"
## indicate the experiment name; it is used below to build the path to the BAM file.
experiment="0min_SKN7"

## Generating Footprints from BAM files 

In [None]:
#use the paste command in R to provide the path to the duplicate-filtered replicate-merged bam file.  
bamfile=paste("/outputs/croo_pilot/",experiment,"/align/",experiment,".merged.nodup.bam",sep="")
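
A typo in the experiment name only surfaces later as a read error, so it can help to confirm the file exists before loading it. The snippet below is a minimal sketch using base R only; `bam_path` is a name introduced here for illustration, and the directory layout is copied from the cell above.

```r
# Build the same path with file.path(), which inserts the separators for us.
experiment <- "0min_SKN7"
bam_path <- file.path("/outputs/croo_pilot", experiment, "align",
                      paste0(experiment, ".merged.nodup.bam"))

# Warn early if the file is missing instead of failing inside readBamFile().
if (!file.exists(bam_path)) {
  message("BAM file not found: ", bam_path)
}
```

`file.path()` and `paste(..., sep="")` produce the same string here; which one to use is a matter of preference.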

In [None]:
#first, load the duplicate-filtered bam file for 0min_SKN7, indicate that the file is paired-end by 
# setting asMates=TRUE 
bam_data=readBamFile(bamfile,asMates=TRUE)

In [None]:
#visualize the loaded bam file 
bam_data

In [None]:
## shift the coordinates of 5'ends of alignments in the bam file
shiftedBamFile=paste(experiment,".merged.nodup.shifted.bam",sep='')
shifted_bam_data <- shiftGAlignmentsList(bam_data, outbam=shiftedBamFile)
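
By default, `shiftGAlignmentsList` moves the 5' end of plus-strand reads by +4 bp and the 5' end of minus-strand reads by -5 bp, so that read ends mark the center of the Tn5 insertion site. The toy function below only illustrates that arithmetic on a small data frame; `shift_five_prime` and the example coordinates are made up for this sketch and are not the package's implementation.

```r
# Toy illustration of the Tn5 offset correction (not the ATACseqQC code).
# For "+" reads the 5' end is `start`; for "-" reads the 5' end is `end`.
shift_five_prime <- function(start, end, strand, positive = 4L, negative = 5L) {
  plus <- strand == "+"
  data.frame(
    start  = ifelse(plus, start + positive, start),
    end    = ifelse(plus, end, end - negative),
    strand = strand
  )
}

reads <- data.frame(start = c(100L, 200L),
                    end = c(150L, 250L),
                    strand = c("+", "-"))
shifted <- shift_five_prime(reads$start, reads$end, reads$strand)
shifted
```

In this example the plus-strand read now starts at 104 and the minus-strand read now ends at 245.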

## REB1

In [None]:
## footprints
library(MotifDb)

#we can subset the motif database to just S. cerevisiae motifs: 
query (MotifDb, 'cerevisiae')

In [None]:
#We will generate footprints & V-plots for the S. cerevisiae transcription factor REB1.
# Let's verify that there is a motif for this TF  in the database 
query(query (MotifDb, 'cerevisiae'),tf)


In [None]:
#Get the Position Frequency Matrix for REB1
pfm=query(query (MotifDb, 'cerevisiae'),tf)
pfm=as.list(pfm)
print(pfm[[1]], digits=2)


In [None]:
#load the peak regions that overlap with REB1 (see notebook 3.6 for how these are generated) and store them as 
# a GenomicRanges object 
motif_hits=read.table("REB1.in.0min_SKN7.bed",header=FALSE,sep='\t')
colnames(motif_hits)=c("chr","start","end","id","score","strand")


motif_hits=makeGRangesFromDataFrame(motif_hits,
                                    seqinfo=seqinfo(genome),
                                    seqnames.field="chr",
                                    start.field="start",
                                    end.field="end",
                                    keep.extra.columns=TRUE)
motif_hits
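
One thing worth checking at this step is the coordinate convention. Standard BED files use 0-based, half-open intervals, while `GRanges` objects are 1-based and closed; `makeGRangesFromDataFrame` has a `starts.in.df.are.0based` argument (default `FALSE`) for this case. Whether the BED files from notebook 3.6 are 0-based is an assumption to verify against that notebook. The sketch below shows the conversion on a made-up data frame using base R only.

```r
# Made-up BED-style intervals (0-based start, exclusive end).
bed <- data.frame(chr = c("chrI", "chrII"),
                  start = c(0L, 99L),
                  end = c(10L, 110L),
                  stringsAsFactors = FALSE)

# Convert to 1-based closed coordinates: only the start moves.
one_based <- transform(bed, start = start + 1L)

# The interval width is unchanged by the conversion.
one_based$width <- one_based$end - one_based$start + 1L
one_based
```

A one-base offset is small relative to the 50 bp and 500 bp windows used below, but it does shift the footprint center.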

In [None]:
options(repr.plot.width=12, repr.plot.height=10)
sigs <- factorFootprints(shiftedBamFile, 
                         pfm=pfm[[1]], 
                         genome=genome,
                         bindingSites=motif_hits,
                         seqlev=paste0(seqlevels(genome)),
                         min.score="95%",
                         upstream=50,
                         downstream=50)
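
Conceptually, a footprint plot aggregates Tn5 cut sites by their position relative to each motif occurrence; a bound factor protects the DNA and leaves a dip at the motif. The sketch below shows only that aggregation step on invented numbers. `cut_profile`, `cuts`, and `centers` are hypothetical names, and this is not how `factorFootprints` is implemented internally (it also handles strands, normalization, and motif scoring).

```r
# Count cut sites at each offset within +/- flank of every motif center.
cut_profile <- function(cuts, centers, flank = 50L) {
  offsets <- unlist(lapply(centers, function(ctr) {
    rel <- cuts - ctr
    rel[abs(rel) <= flank]
  }))
  # A factor with fixed levels keeps offsets that have zero counts.
  table(factor(offsets, levels = -flank:flank))
}

cuts <- c(95L, 98L, 110L, 140L, 205L, 260L)   # invented cut positions
centers <- c(100L, 200L)                      # invented motif centers
profile <- cut_profile(cuts, centers, flank = 10L)
profile
```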


## Generating V-plots

In [None]:
vp <- vPlot(shiftedBamFile, 
            pfm=pfm[[1]], 
            genome=genome, 
            min.score="95%",
            bindingSites=motif_hits,
            seqlev=paste0(seqlevels(genome)),
            upstream=500, 
            downstream=500, 
            ylim=c(0, 500), 
            bandwidth=c(2, 1))
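
A V-plot places each fragment as a point: the x-axis is the distance from the fragment midpoint to the motif center, and the y-axis is the fragment length. The sketch below computes those two values for invented fragments, assuming 1-based inclusive coordinates; `vplot_points` is a hypothetical helper and not part of ATACseqQC.

```r
# Compute V-plot coordinates for fragments around one motif center.
vplot_points <- function(frag_start, frag_end, motif_center) {
  data.frame(
    distance = (frag_start + frag_end) / 2 - motif_center,  # midpoint offset
    length   = frag_end - frag_start + 1                    # fragment length
  )
}

# A short fragment and a longer one near a motif centered at 100.
pts <- vplot_points(c(90, 40), c(129, 239), motif_center = 100)
pts
```

Short fragments close to the motif center form the tip of the "V"; longer fragments sit higher on the plot.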

In [None]:
distanceDyad(vp, pch=20, cex=.5)


## De novo HOMER hit 

We can also generate a footprint/V-plot for the top-hit de novo motif from HOMER: 

![top_hits](images/top_hits_homer_SKN7.png)

Note: you can click on the "motif file matrix" link in the right-most column of the homerResults.html results file to get the input motif file for scanning: 


```
>GGGCGGCACAAG	1-GGGCGGCACAAG,BestGuess:POL011.1_XCPE1/Jaspar(0.681)	10.848594	-40.855667	0	T:9.0(5.70%),B:1.0(0.03%),P:1e-17
0.001	0.001	0.997	0.001
0.125	0.250	0.624	0.001
0.001	0.001	0.997	0.001
0.001	0.997	0.001	0.001
0.125	0.125	0.749	0.001
0.001	0.001	0.874	0.124
0.001	0.749	0.249	0.001
0.749	0.001	0.125	0.125
0.124	0.874	0.001	0.001
0.874	0.001	0.124	0.001
0.997	0.001	0.001	0.001
0.125	0.125	0.749	0.001
```
This motif is located in the output folder: 
```
/scratch/[YOUR USERNAME]/homer_SKN7_0min_vs_45min_negative/homerResults/motif1.motif
```

In [None]:
denovo1_pfm=read.table("homer_SKN7_0min_vs_45min_negative/homerResults/motif1.motif",skip = 1,header=FALSE,sep='\t')
head(denovo1_pfm)


In [None]:
#let's transpose the matrix and generate proper row names 
denovo1_pfm=t(denovo1_pfm)
rownames(denovo1_pfm)=c("A","C","G","T")
head(denovo1_pfm)
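
A quick sanity check after transposing is that every column (motif position) sums to about 1 and that the most frequent base at each position spells out the consensus in the HOMER header. The sketch below repeats the transposition on a copy of the matrix printed above, so that it runs without the motif file; `toy_pfm` and `consensus` are names introduced here.

```r
# Frequencies copied from the motif file shown above (columns: A, C, G, T).
motif_rows <- matrix(c(
  0.001, 0.001, 0.997, 0.001,
  0.125, 0.250, 0.624, 0.001,
  0.001, 0.001, 0.997, 0.001,
  0.001, 0.997, 0.001, 0.001,
  0.125, 0.125, 0.749, 0.001,
  0.001, 0.001, 0.874, 0.124,
  0.001, 0.749, 0.249, 0.001,
  0.749, 0.001, 0.125, 0.125,
  0.124, 0.874, 0.001, 0.001,
  0.874, 0.001, 0.124, 0.001,
  0.997, 0.001, 0.001, 0.001,
  0.125, 0.125, 0.749, 0.001
), ncol = 4, byrow = TRUE)

# Transpose so rows are bases and columns are motif positions.
toy_pfm <- t(motif_rows)
rownames(toy_pfm) <- c("A", "C", "G", "T")

# Each position should sum to ~1.
col_sums <- colSums(toy_pfm)

# The most frequent base per position gives the consensus: "GGGCGGCACAAG".
consensus <- paste(rownames(toy_pfm)[apply(toy_pfm, 2, which.max)], collapse = "")
consensus
```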

In [None]:
#load the peak regions that overlap with the de novo motif (see notebook 3.6 for how these are generated) and store them as 
# a GenomicRanges object 
denovo1_motif_hits=read.table("denovo1.in.0min_SKN7.bed",header=FALSE,sep='\t')
colnames(denovo1_motif_hits)=c("chr","start","end","id","score","strand")


denovo1_motif_hits=makeGRangesFromDataFrame(denovo1_motif_hits,
                                    seqinfo=seqinfo(genome),
                                    seqnames.field="chr",
                                    start.field="start",
                                    end.field="end",
                                    keep.extra.columns=TRUE)
denovo1_motif_hits

In [None]:
sigs <- factorFootprints(shiftedBamFile, 
                         pfm=denovo1_pfm, 
                         genome=genome,
                         bindingSites=denovo1_motif_hits,
                         seqlev=paste0(seqlevels(genome)),
                         min.score="95%",
                         upstream=50,
                         downstream=50)


In [None]:
vp <- vPlot(shiftedBamFile, 
            pfm=denovo1_pfm, 
            genome=genome, 
            min.score="95%",
            bindingSites=denovo1_motif_hits,
            seqlev=paste0(seqlevels(genome)),
            upstream=500, 
            downstream=500, 
            ylim=c(0, 500), 
            bandwidth=c(2, 1))

In [None]:
distanceDyad(vp, pch=20, cex=.5)


## Functions in R 

If we wanted to run the workflow above on a different experiment or TF, it would be convenient to have a small number of commands we could execute to do that. We can wrap the commands above into two R functions to achieve this: 

In [None]:
#This function reads and shifts a bam file for a given experiment 
read_and_shift_bam <- function(experiment){
    bamfile=paste("/outputs/croo_pilot/",experiment,"/align/",experiment,".merged.nodup.bam",sep="")
    bam_data=readBamFile(bamfile,asMates=TRUE)
    shiftedBamFile=paste(experiment,".merged.nodup.shifted.bam",sep='')
    shifted_bam_data <- shiftGAlignmentsList(bam_data, outbam=shiftedBamFile)
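    #return the shifted bam filename invisibly (a bare `return` would hand back the return function itself)
    return(invisible(shiftedBamFile))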
}

#This function retrieves a PFM by querying the S. cerevisiae motifs in MotifDb 
get_pfm_from_db <-function(tf)
    {
    pfm=query(query (MotifDb, 'cerevisiae'),tf)
    pfm=as.list(pfm)
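    if(length(pfm)==0)
    {
    #the query covers all S. cerevisiae motifs in MotifDb, not only JASPAR
    message(paste0("TF ",tf," not found in MotifDb"))
    #return() needs parentheses; a bare `return` does not exit the function
    return(NULL)
    }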
    return(pfm[[1]])
    
}

#This function uses the filename for a shifted bam to generate a footprint plot and a V-plot for a TF. 
make_footprint_and_vplot <- function(experiment,motifs_in_peaks_bed,pfm) {
    shiftedBamFile=paste(experiment,".merged.nodup.shifted.bam",sep='')
   
    #specify that yeast genome is used 
    genome <- Scerevisiae
    seqlev=seqlevels(genome)
    
    #generate GRanges object with motifs in peaks 
    motif_hits=read.table(motifs_in_peaks_bed,header=FALSE,sep='\t')
    colnames(motif_hits)=c("chr","start","end","id","score","strand")
    motif_hits=makeGRangesFromDataFrame(motif_hits,
                                    seqinfo=seqinfo(genome),
                                    seqnames.field="chr",
                                    start.field="start",
                                    end.field="end",
                                    keep.extra.columns=TRUE)

    #set plot size
    options(repr.plot.width=12, repr.plot.height=10)

    #make footprint plot
    sigs <- factorFootprints(shiftedBamFile, 
                         pfm=pfm, 
                         genome=genome,
                         bindingSites=motif_hits,
                         seqlev=paste0(seqlevels(genome)),
                         min.score="95%",
                         upstream=50,
                         downstream=50)
    #make V-plot              
    vp <- vPlot(shiftedBamFile, 
            pfm=pfm, 
            genome=genome, 
            bindingSites=motif_hits,
            min.score="95%",
            seqlev=paste0(seqlevels(genome)),
            upstream=500, 
            downstream=500, 
            ylim=c(0, 500), 
            bandwidth=c(2, 1)) 
    
    #make Dyad plot 
    distanceDyad(vp, 
                 pch=20, 
                 cex=.5)
    #the plots are the output; return nothing visible
    return(invisible(NULL))
}

Let's see some examples of our helper functions in action. 

Now, let's repeat our analysis for **45min_SKN7**

In [None]:
#read & shift the bam file 
read_and_shift_bam("45min_SKN7")

In [None]:
#generate footprint & V-plot 
reb1_pfm=get_pfm_from_db("REB1")
make_footprint_and_vplot("45min_SKN7","REB1.in.45min_SKN7.bed",reb1_pfm)

We can also generate a footprint/V-plot for the strongest de novo motif hit from HOMER: 

In [None]:
make_footprint_and_vplot("45min_SKN7","denovo1.in.45min_SKN7.bed",denovo1_pfm)
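
To cover several experiments and TFs in one pass, the helper calls can be driven from a small table of runs. The sketch below is one possible way to do this. It assumes the `<TF>.in.<experiment>.bed` naming used above, and that the helper functions and shifted BAM files from the earlier cells already exist; each run is skipped if the helpers or the BED file are missing. `runs` is a name introduced here.

```r
experiments <- c("0min_SKN7", "45min_SKN7")
tfs <- c("REB1")

# One row per TF/experiment combination, plus the expected BED filename.
runs <- expand.grid(tf = tfs, experiment = experiments, stringsAsFactors = FALSE)
runs$bed <- paste0(runs$tf, ".in.", runs$experiment, ".bed")

for (i in seq_len(nrow(runs))) {
  # Only run when the notebook helpers are defined and the BED file exists.
  ready <- exists("make_footprint_and_vplot") &&
           exists("get_pfm_from_db") &&
           file.exists(runs$bed[i])
  if (ready) {
    make_footprint_and_vplot(runs$experiment[i], runs$bed[i],
                             get_pfm_from_db(runs$tf[i]))
  }
}
```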