We present an in silico mutagenesis based approach to transcription factor binding sites evolution, based on a machine learning model of binding.

ljljolinq1010/A-robust-method-for-detecting-positive-selection-on-regulatory-sequences

A robust method for detecting positive selection on regulatory sequences

We developed a method to detect positive selection in the evolution of transcription factor binding sites (TFBSs), based on changes in binding affinity. The method compares the binding affinity changes observed during evolution to a null distribution. The effect of substitutions on binding affinity can be accurately predicted by deltaSVM (Lee et al. 2015), a machine learning based method that predicts the effects of regulatory variants de novo from sequence.

  1. Procedure for detecting positive selection

1). Training of the gapped k-mer support vector machine (gkm-SVM)

First, we defined a positive training set and a corresponding negative training set. The positive training set consists of the ChIP-seq narrow peaks of a transcription factor. The negative training set is an equal number of sequences randomly sampled from the genome and matched to the positive training set for length, GC content and repeat fraction. The negative training set was generated with “genNullSeqs”, a function of the gkmSVM R package (Ghandi et al. 2016). We then trained a gkm-SVM with default parameters except -l=10 (meaning that 10-mers are used as features to distinguish the positive and negative training sets). The classification performance of the trained gkm-SVM was measured with receiver operating characteristic (ROC) curves under fivefold cross-validation. Both training and cross-validation used the “gkmtrain” function of “LS-GKM: a new gkm-SVM for large-scale datasets” (Lee 2016). For details, please check https://github.com/Dongwon-Lee/lsgkm.
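As a rough illustration of the training step, the sketch below only assembles a "gkmtrain" command line and does not run it. The flags `-l` (word length) and `-x` (N-fold cross-validation) and the argument order follow our reading of the LS-GKM usage text, so please confirm them against the `gkmtrain` help output of your installed version. The helper name `gkmtrain_command` and the file names are hypothetical.

```python
from typing import List


def gkmtrain_command(pos_fasta: str, neg_fasta: str, out_prefix: str,
                     word_length: int = 10, cv_folds: int = 0) -> List[str]:
    """Build (but do not run) a gkmtrain argument list.

    Assumption: gkmtrain is called as
        gkmtrain [options] <posfile> <negfile> <outprefix>
    with -l for the word length and -x for N-fold cross-validation.
    """
    cmd = ["gkmtrain", "-l", str(word_length)]
    if cv_folds > 0:
        # Cross-validation mode; omitted when training the final model.
        cmd += ["-x", str(cv_folds)]
    cmd += [pos_fasta, neg_fasta, out_prefix]
    return cmd


# Hypothetical file names: fivefold cross-validation, then final training.
cv_cmd = gkmtrain_command("positive.fa", "negative.fa", "tf_cv", cv_folds=5)
train_cmd = gkmtrain_command("positive.fa", "negative.fa", "tf_model")
```

Either list can be passed to `subprocess.run(cmd, check=True)` once LS-GKM is installed and on the `PATH`.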

2). Generate SVM weights of all possible 10-mers based on the trained gkm-SVM

The SVM weights of all possible 10-mers were generated by using the “gkmpredict” function of “LS-GKM”.
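To score every 10-mer, "gkmpredict" needs a FASTA file containing all of them. The sketch below shows one way such a file could be produced; it is not the script used in this repository. The function names are hypothetical, and naming each record after its own k-mer is an assumption made so that the prediction output can be read as k-mer and weight pairs.

```python
from itertools import product
from typing import Iterator


def all_kmers(k: int, alphabet: str = "ACGT") -> Iterator[str]:
    """Yield every k-mer over the alphabet in lexicographic order (4**k for DNA)."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


def kmer_fasta_records(k: int) -> Iterator[str]:
    """Yield one FASTA record per k-mer, using the k-mer itself as the record name."""
    for kmer in all_kmers(k):
        yield f">{kmer}\n{kmer}\n"
```

For example, `open("all_10mers.fa", "w").writelines(kmer_fasta_records(10))` writes 4^10 = 1,048,576 records. Assuming the usual `gkmpredict <test_seqfile> <model_file> <output_file>` form, the resulting file would then be scored against the trained model.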

3). Infer ancestor sequence

The ancestor sequence was inferred from a sequence alignment of the focal species with a sister species and an outgroup.
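The text above does not specify the inference rule, so the following is only a minimal sketch that assumes simple per-column parsimony on three pre-aligned sequences of equal length. The function name `infer_ancestor`, the use of "N" for unresolved columns, and the treatment of gap characters like any other symbol are all assumptions of this example.

```python
def infer_ancestor(focal: str, sister: str, outgroup: str, unknown: str = "N") -> str:
    """Parsimony sketch of the focal/sister ancestor from three aligned sequences."""
    if not (len(focal) == len(sister) == len(outgroup)):
        raise ValueError("aligned sequences must have the same length")
    ancestor = []
    for f, s, o in zip(focal.upper(), sister.upper(), outgroup.upper()):
        if f == s:
            # Focal and sister agree: no change is needed on either branch.
            ancestor.append(f)
        elif o == f or o == s:
            # The outgroup supports one of the two states, so take that state.
            ancestor.append(o)
        else:
            # All three differ: the ancestral state is not resolved.
            ancestor.append(unknown)
    return "".join(ancestor)
```

With `focal="ACGT"`, `sister="ACGA"` and `outgroup="ACGA"`, the last column resolves to "A", which implies a substitution on the focal lineage.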

4). Infer positive selection

Once we had the SVM weights of all possible 10-mers, together with the ancestor and focal sequences, we inferred the signal of positive selection with "testPosSelec.pl". This script is in the "scripts" folder and was modified from "deltasvm.pl", a script contributed by Lee et al. (2015) that calculates deltaSVM scores.
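The idea of the test can be sketched in Python as follows. This is not a port of "testPosSelec.pl": it assumes gap-free ancestor and focal sequences of equal length over A, C, G and T, a null model that places the observed number of substitutions at random positions of the ancestor, and a one-sided p-value equal to the fraction of null deltaSVM values at least as large as the observed one. All function names and defaults are hypothetical.

```python
import random
from typing import Dict, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sequence_score(seq: str, weights: Dict[str, float], k: int) -> float:
    """Sum the SVM weights of all overlapping k-mers in seq.

    A k-mer missing from the table is looked up by its reverse complement
    (assumption: the weight table may list only one strand); otherwise it scores 0.
    """
    total = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in weights:
            total += weights[kmer]
        else:
            total += weights.get(revcomp(kmer), 0.0)
    return total


def delta_svm(ancestor: str, focal: str, weights: Dict[str, float], k: int) -> float:
    """deltaSVM of the focal sequence relative to the ancestor."""
    return sequence_score(focal, weights, k) - sequence_score(ancestor, weights, k)


def random_substitutions(ancestor: str, n_subs: int, rng: random.Random) -> str:
    """Return a copy of ancestor with n_subs random substitutions at distinct sites."""
    seq = list(ancestor)
    for pos in rng.sample(range(len(seq)), n_subs):
        seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(seq)


def positive_selection_pvalue(ancestor: str, focal: str, weights: Dict[str, float],
                              k: int = 10, n_null: int = 1000,
                              seed: int = 0) -> Tuple[float, float]:
    """Return (observed deltaSVM, one-sided empirical p-value)."""
    rng = random.Random(seed)
    n_subs = sum(a != f for a, f in zip(ancestor, focal))
    observed = delta_svm(ancestor, focal, weights, k)
    null = [delta_svm(ancestor, random_substitutions(ancestor, n_subs, rng), weights, k)
            for _ in range(n_null)]
    p_value = sum(d >= observed for d in null) / n_null
    return observed, p_value
```

Here `weights` would be read from the two-column k-mer and weight table produced in step 2). A small p-value indicates that the observed substitutions increased predicted binding affinity more than random substitutions typically do.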

  2. Scripts used to generate all figures in the paper

Please check "selection_analysis.R" in the "scripts" folder

  3. Data used to generate all figures in the paper

Please check the "data" folder

  4. References

Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. 2016. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics 32:2205–2207.

Lee D. 2016. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32:2196–2198.

Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. 2015. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47:955–961.
