Skip to content
A simple FDR peak caller
Perl
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
DmR6.genes.gff.gz
README.md
find_peaks
peaks2genes

README.md

find_peaks

A simple FDR peak caller. The script is designed for processing DamID-seq datasets such as those generated by the damidseq_pipeline software, but will work on any DNA binding track in bedgraph or GFF format (for example, background-subtracted ChIP-seq data).

The output is a GFF file of all peaks with an FDR less than an assigned value (by default, FDR < 0.01). The mean binding intensity of the peak is represented by the score column of the GFF (column 6) and the FDR is present in the attributes column (column 9). The peak file can be loaded into genome browser software such as IGV for viewing, and associated genes can be called with the included peaks2genes script.

Installation

find_peaks requires Perl v5.010 or greater. To install, copy the find_peaks executable into your path.

Usage

Run the script via

	find_peaks [options] [list of files in bedgraph or GFF file format]

To see all usage options, run with

	find_peaks --help

The output of the script is a list of all peaks with an FDR less than the cutoff value.

Calling genes from the peaks file

The peaks file can be used as the input to the included peaks2genes script in order to call associated genes (i.e. genes in close proximity). Run

	peaks2genes --help

to see all usage options. Note that the path to a genome annotation file in GFF format needs to be set with the --genes_file option. An annotation for release 6 (DmR6.genes.gff.gz) of the Drosophila melanogaster genome is provided in the repository.

Please note that the close association of a peak with a gene does not necessarily imply direct regulation of that gene, and associations must be used as a basic guide only. There are many examples in the literature of enhancers present in unrelated gene bodies or located a very long distance away from their target genes.

Quantiles and stepping

Depending on the expected genome coverage of the protein of interest, the --min_quant and --step options may need to be adjusted. Reducing the value of --step will increase the resolution of peak detection at the expense of speed. Lowering the --min_quant value will enable the detection of smaller peak heights (but may increase the number of non-biologically relevant regions of enrichment).

Algorithm

find_peaks is roughly based on an algorithm developed by Tony Southall (doi:10.1016/j.neuron.2012.06.015). Briefly, a list of binding intensity thresholds are identified from the dataset, the dataset is randomly shuffled, and the frequency of consecutive regions (i.e. GATC fragments or bins) with a score greater than the threshold calculated. The relationship between the number of consecutive fragments and the frequency of observation in a random dataset is logarithmic, and can thus be effectively modelled for any number of fragments through linear regression. The FDR is very simply the observed / expected for a number of consecutive fragments above a given threshold.

You can’t perform that action at this time.