A simple FDR peak caller. The script is designed for processing DamID-seq datasets such as those generated by the damidseq_pipeline software, but will work on any DNA binding track in bedgraph or GFF format (for example, background-subtracted ChIP-seq data).
The output is a GFF file of all peaks with an FDR less than an assigned value (by default, FDR < 0.01). The mean binding intensity of the peak is represented by the score column of the GFF (column 6) and the FDR is present in the attributes column (column 9). The peak file can be loaded into genome browser software such as IGV for viewing, and associated genes can be called with the included peaks2genes script.
find_peaks requires Perl v5.010 or greater. To install, copy the find_peaks executable into your path.
Run the script via
find_peaks [options] [list of files in bedgraph or GFF file format]
To see all usage options, run with
find_peaks --help
The output of the script is a list of all peaks with an FDR less than the cutoff value.
The peaks file can be used as the input to the included peaks2genes script in order to call associated genes (i.e. genes in close proximity). Run
peaks2genes --help
to see all usage options. Note that the path to a genome annotation file in GFF format needs to be set with the --genes_file option. An annotation for release 6 (DmR6.genes.gff.gz) of the Drosophila melanogaster genome is provided in the repository.
Please note that the close association of a peak with a gene does not necessarily imply direct regulation of that gene, and associations must be used as a basic guide only. There are many examples in the literature of enhancers present in unrelated gene bodies or located a very long distance away from their target genes.
Depending on the expected genome coverage of the protein of interest, the --min_quant and --step options may need to be adjusted. Reducing the value of --step will increase the resolution of peak detection at the expense of speed. Lowering the --min_quant value will enable the detection of smaller peak heights (but may increase the number of non-biologically relevant regions of enrichment).
find_peaks is roughly based on an algorithm developed by Tony Southall (doi:10.1016/j.neuron.2012.06.015). Briefly, a list of binding intensity thresholds are identified from the dataset, the dataset is randomly shuffled, and the frequency of consecutive regions (i.e. GATC fragments or bins) with a score greater than the threshold calculated. The relationship between the number of consecutive fragments and the frequency of observation in a random dataset is logarithmic, and can thus be effectively modelled for any number of fragments through linear regression. The FDR is very simply the observed / expected for a number of consecutive fragments above a given threshold.