Skip to content

hartonen/rePWM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

#######
#rePWM#
#######

rePWM is a simple script to build a PFM/PWM model from a dataset using a PWM/PFM seed. The PWM/PFM model is built by masking each position of the seed PWM in turn with a flat prior distribution to create an unbiased, 'multinomial' estimate of the PWM from the given data. See Figure 1b in "Jolma, Arttu, et al. "DNA-binding specificities of human transcription factors." Cell 152.1-2 (2013): 327-339." for a more detailed explanation.

rePWM uses MOODS to calculate motif matches, so MOODS should be installed and the script moods_dna.py added to the system path. Other than that, only a standard python installation and standard Linux utilities grep, cat, sort and head are needed. Instructions on how to install MOODS can be found from https://github.com/jhkorhonen/MOODS .

TEST
----

This test command run using the data in folder test/ (sequences from LoVo CTCF ChIP-nexus and the corresponding CTCF PFM obtained with MEME) should reproduce the matrix in folder test_results/

rePWM.py test/motif1.pfm test/top1000_peaks.fasta test_results/multinomial_CTCF.pfm --Bforce 200 --t 4

It will print the following output to stdout:

Creating masked matrices... done!
running MOODS:
 moods_dna.py -o ./moods.out -m ./rePWM_m0.pfm ./rePWM_m1.pfm ./rePWM_m2.pfm ./rePWM_m3.pfm ./rePWM_m4.pfm ./rePWM_m5.pfm ./rePWM_m6.pfm ./rePWM_m7.pfm ./rePWM_m8.pfm ./rePWM_m9.pfm ./rePWM_m10.pfm ./rePWM_m11.pfm ./rePWM_m12.pfm ./rePWM_m13.pfm ./rePWM_m14.pfm -s test/top1000_peaks.fasta -t 4.0 --ps 0.1 --lo-bg 0.25 0.25 0.25 0.25  ...
done!
Re-calculating the matrix...
grep "rePWM_m0.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_0tmp.out
head -200 ./moods_sorted_0tmp.out > ./moods_sorted_0.out
grep "rePWM_m1.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_1tmp.out
head -200 ./moods_sorted_1tmp.out > ./moods_sorted_1.out
grep "rePWM_m2.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_2tmp.out
head -200 ./moods_sorted_2tmp.out > ./moods_sorted_2.out
grep "rePWM_m3.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_3tmp.out
head -200 ./moods_sorted_3tmp.out > ./moods_sorted_3.out
grep "rePWM_m4.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_4tmp.out
head -200 ./moods_sorted_4tmp.out > ./moods_sorted_4.out
grep "rePWM_m5.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_5tmp.out
head -200 ./moods_sorted_5tmp.out > ./moods_sorted_5.out
grep "rePWM_m6.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_6tmp.out
head -200 ./moods_sorted_6tmp.out > ./moods_sorted_6.out
grep "rePWM_m7.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_7tmp.out
head -200 ./moods_sorted_7tmp.out > ./moods_sorted_7.out
grep "rePWM_m8.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_8tmp.out
head -200 ./moods_sorted_8tmp.out > ./moods_sorted_8.out
grep "rePWM_m9.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_9tmp.out
head -200 ./moods_sorted_9tmp.out > ./moods_sorted_9.out
grep "rePWM_m10.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_10tmp.out
head -200 ./moods_sorted_10tmp.out > ./moods_sorted_10.out
grep "rePWM_m11.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_11tmp.out
head -200 ./moods_sorted_11tmp.out > ./moods_sorted_11.out
grep "rePWM_m12.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_12tmp.out
head -200 ./moods_sorted_12tmp.out > ./moods_sorted_12.out
grep "rePWM_m13.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_13tmp.out
head -200 ./moods_sorted_13tmp.out > ./moods_sorted_13.out
grep "rePWM_m14.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_14tmp.out
head -200 ./moods_sorted_14tmp.out > ./moods_sorted_14.out
done!

Matches to the output motif using TOMTOM: http://meme-suite.org/opal-jobs/appTOMTOM_5.0.0_1529496412046-1109278779/tomtom.html
Matches to the input motif using TOMTOM: http://meme-suite.org/opal-jobs/appTOMTOM_5.0.0_1529496588319-1195667469/tomtom.html


HELP
----

Help on running rePWM can be found by typing

rePWM.py -h

assuming that the script is in your system path. This should print the following documentation:

usage: rePWM.py [-h] [--mtype {pwm,pfm}] [--p P] [--t T] [--B B]
                [--Bforce BFORCE] [--no-rc {yes,no}] [--batch {yes,no}]
                [--bg BG BG BG BG] [--ps PS] [--lo-bg LO_BG LO_BG LO_BG LO_BG]
                [--tmpdir TMPDIR] [--keeptmp {yes,no}] [--v {yes,no}]
                matrix seq [seq ...] outfile

positional arguments:
  matrix                The input position frequency matrix file (tab-
                        separeted).
  seq                   Full path to a Fasta-file containing sequences used to
                        recalculate the PFM.
  outfile               Full path to the output file (the recalculated PFM).

optional arguments:
  -h, --help            show this help message and exit
  --Bforce BFORCE       Return EXACTLY the specified amount of best matches to
                        EACH of the masked motifs. This means that the script
                        tries to make sure that each column of the final PFM
                        is constructed from same number of sequences. This
                        might not always work depending on the MOODS threshold
                        given. If there are too few sequences in some of the
                        columns, a warning is printed.
  --tmpdir TMPDIR       Directory used to store temporary files (masked
                        PFMs/PWMs, MOODS output files), default = ./
  --keeptmp {yes,no}    If yes, keep tmp files (default=no).
  --v {yes,no}          If yes (=default), print progress to screen.

Matrix file type::
  --mtype {pwm,pfm}     pwm or pfm (default=pfm).

Threshold selection in MOODS (exactly one required)::
  --p P                 Compute threshold from p-value.
  --t T                 Use specified absolute threshold.
  --B B                 Return at least the specified amount of best matches.

Search and model behaviour (optional)::
  --no-rc {yes,no}      If yes, disable matching versus the reverse complement
                        strand (default=no).
  --batch {yes,no}      If yes, do not recompute thresholds from p-values for
                        each sequence separately (recommended when dealing
                        with lots of short sequences, default=no).
  --bg BG BG BG BG      Background distribution for computing thresholds from
                        p-value with --batch (default is 0.25 for all
                        alleles).
  --ps PS               Specify pseudocount for log-odds conversion
                        (default=0.1).
  --lo-bg LO_BG LO_BG LO_BG LO_BG
                        Background distribution for log-odds conversion
                        (default is 0.25 for all alleles).

About

Validating PWMs with data using the multinomial approach

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages