Skip to content
eiwenyang089 edited this page Aug 17, 2019 · 47 revisions

BEAPR

Allele-specific protein-RNA binding is an essential aspect that may reveal functional genetic variants (GVs) mediating post-transcriptional regulation. Recently, genome-wide detection of in vivo binding of RNA binding proteins is greatly facilitated by the enhanced crosslinking and immunoprecipitation (eCLIP) method. We developed a new computational approach, called BEAPR, to identify allele-specific binding (ASB) events in eCLIP-Seq data. BEAPR takes into account crosslinking-induced sequence propensity and variations between replicated experiments. Using simulated and actual data, we show that BEAPR largely outperforms often- used count analysis methods. Importantly, BEAPR overcomes the inherent over-dispersion problem of these methods. Complemented by experimental validations, we demonstrate that the application of BEAPR to ENCODE eCLIP-Seq data of 154 proteins helps to predict functional GVs that alter splicing or mRNA abundance. Moreover, many GVs with ASB patterns have known disease relevance. Overall, BEAPR is an effective method that helps to address the outstanding challenge of functional interpretation of GVs.

Downloading the software

You can download the latest version of the software here.

Installation

This software package was developed on Linux and it has been tested with g++ 4.4.5 and R 3.2.3. To install it, perform the following steps:

Compile the source codes (optional)

The binary file BEAPR in bin/ should be able to run on a x86 Linux OS. If not, you can compile the executable files of BEAPR by yourself.

cd /usr/local/BEAPR/src/bias
make bias
make install
cd /usr/local/BEAPR/src/postproc
make postproc
make install

Then, the two executable files, bias and postproc, of BEAPR should be located in /usr/local/BEAPR/bin/.

Calling ASB SNVs from eCLIP data.

Preparation for eCLIP data

To predict the ASB events of a RBP, BEAPR required a list of heterozygous SNVs and a eCLIP data set of the RBP as its input data. The eCLIP data set included multiple eCLIP replicates and one SMInput samples. Each eCLIP replicate or SMInput sample consisted of a BAM file for aligned eCLIP reads and a BED file that defined the genomic regions where the aligned reads were clustered, i.e., peaks. A peak was called a binding region of the RBP if the number of eCLIP reads per million within the peak was significantly higher the number within the identical genomic region in the SMInput sample. Those heterozygous SNPs located outside the binding regions were ignored in the prediction of ASB events of the RBP. If the list of heterozygous SNPs was not available, BIAS was able to predicted heterozygous SNPs from the input eCLIP data.

Here we use the ENCODE K562 SRSF1 eCLIP data set as an example to demonstrate how to prepare eCLIP data for BEAPR.

  1. From the ENCODE consortium website, download the .bam files for aligned eCLIP reads in the two replicates and the mock input sample of SRSF1, change the filenames as r1.bam, r1, and input.bam, respectively, into your data directory /usr/local/BEAPR/data/. We use clipOverlap to remove overlapping regions of read pairs.

  2. From the ENCODE consortium website, download the .bed files of eCLIP peaks in the two replicates of SRSF1. Normalize the eCLIP peaks by following the ENCODE pipeline described in here. Change the filenames of normalized peaks in the r1 and r2 replicates as r1.bed and r2.bed, respectively. These eCLIP peaks define the genomic regions where BEAPR is going to look for ASB SNVs

  3. run CLIPper on input.bam to call peaks in the SMInput file. specify the output .bed files of CLIPper as input.bed.

After the three steps, you should have the following six data files in your data directory /usr/local/BEAPR/data/.

r1.bam r1.bed r2.bam r2.bed input.bam input.bed
  1. BEAPR requires a genotype file in the VCF file format. (e.g., K562.vcf)

  2. BEAPR measures the variance of allelic read counts at all SNPs sites (including both homozygous and heterozygous SNPs). Our analysis is based on a SNP annotation from multiple SNP databases here. You can download and unzipped the text files in your own directory.

Preparation for the parameter file CMD.txt to run BEAPR

  1. To run BEAPR, you need to provide a text file specifying the values to the following parameters.

    5.1 The filenames of the .bed and .bam files for the SMInput sample

    INPUTBEDFILE=/usr/local/BEAPR/data/input.bed INPUTBAMFILE=/usr/local/BEAPR/data/input.bam

    5.2 The filenames of the .bed and .bam files for the eCLIP replicates (separated by ",")

    REPBEDFILES=/usr/local/BEAPR/data/r1.bed,/usr/local/BEAPR/data/r2.bed REPBAMFILES=/usr/local/BEAPR/data/r1.bam,/usr/local/BEAPR/data/r2.bam

    5.3 Specify the replicate ids as r1 and r2

    REPIDS=r1,r2

    5.4 Provide the output directory and an temporary directory for data processing

    TMPDIR=/usr/local/BEAPR/data/tmp OUTDIR=/usr/local/BEAPR/data/out

    5.5 Specify the filename of the genotype file. VCFNAME=/usr/local/BEAPR/data/K562.vcf

    5.6 Specify the directory of the SNPs annotation. SNPDIR=/usr/local/BEAPR/all_SNV/

    An example of CMD.txt is included in BEAPR/bin.

Calling allelic bias by BEAPR

Now you can start an ASB analysis using this command:

cd /usr/local/BEAPR/bin/
./bias CMD.txt

(Samtools 1.2 is required to run bias.) bias used an empirical Gaussian distribution to model the normalized read counts. If the difference of the two mean values was significantly greater than the expected variance of the normalized read counts, the null hypothesis was rejected. Those SNVs where the null hypothesis was rejected were considered as the candidates of ASB SNVs. If bias is successfully executed, you should have the following files in your output directory (e.g., /usr/local/BEAPR/out/).

asb_candidates.txt

Post-proccessing

The set of the candidates was subject to several filters that increased the accuracy of our ASB prediction. In brief, we removed the candidates in genomic regions that were homo-polymeric, consisted of repetitive elements or highly similar to other parts of the genome. All candidates in genes where there was allele-specific gene expression were discarded. BIAS also accounts for the reference mapping bias, which is estimated from the SMInput samples. All candidates whose observed reference allele frequency was similar to the average one estimated from the SMInput samples were also deleted. The remaining candidates were predicted as ASB SNVs.

To conduct the post processing in BEAPR, run

cd /usr/local/BEAPR/bin/
./postproc /usr/local/BEAPR/data/post_cmd.txt

, where post_cmd.txt is an input command file which consists of the following lines. (an example of the command file is in bin/)

ASARP= (the output of ASARP file)
PREDFILE= asb_candidates.txt (the output of bias)
TABFILE= hg19_regions.tab (a file to define the genomic region of genes, could be found in bin/)
OUTFILE= out.txt (the name of your output file)
REPEATMASK= hg19_repeatmsk (the repeatmask file from UCSC Genome Browser)
FASTADIR=/home/yyang/data/fasta/hg19 (a directory to keep each chr*.fa and hg19.fa (all chromosomes) )
TMPDIR= (a temporary dir for post processing)
FDR=0.1 (FDR cutoff)
MINAF=0.6 (optional, minimal allelic frequency of the major alleles)
HTZONE=K562_zone.txt (optional, any genomic regions you want to block out)

Then, the list of filtered ASB SNVs, out.txt, are available in the output directory you specified in this parameter file .

Simulating allelic read counts

To simulate allele-specific read counts that mimic those in actual eCLIP data sets, we used the eCLIP data of SRSF1 in the K562 cell line generated by ENCODE. eCLIP peaks were identified as described above. For heterozygous SNPs (dbSNP 144) located in the peaks, we obtained their total read coverage in each replicate. The empirical total read coverage distributions were used to generate independent sets of simulated read counts. For each simulated SNP, its total read coverage was sampled from the above distribution. Its allelic ratio was set to be 0.5, unless it is a simulated ASB SNP (with allelic ratio being 0.7, 0.8 or 0.9). The allelic read counts for each SNP were determined using a zero-truncated negative binomial distribution with the expected variance set to be equivalent to the observed variance between the two replicates of SRSF1 eCLIP as a function of total read coverage.

To run our script to generate simulated allelic read counts,

cd /usr/local/BEAPR/Simulation/
sh run_beapr.sh

An output file mean.out of 5000 simulated SNVs will be created in data/. The last two columns are the sum of read counts for major alleles (M_sum) and minor alleles (m_sum).