hartonen/rePWM
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
####### #rePWM# ####### rePWM is a simple script to build a PFM/PWM model from a dataset using a PWM/PFM seed. The PWM/PFM model is built by masking each position of the seed PWM in turn with a flat prior distribution to create an unbiased, 'multinomial' estimate of the PWM from the given data. See Figure 1b in "Jolma, Arttu, et al. "DNA-binding specificities of human transcription factors." Cell 152.1-2 (2013): 327-339." for a more detailed explanation. rePWM uses MOODS to calculate motif matches, so MOODS should be installed and the script moods_dna.py added to the system path. Other than that, only a standard python installation and standard Linux utilities grep, cat, sort and head are needed. Instructions on how to install MOODS can be found from https://github.com/jhkorhonen/MOODS . TEST ---- This test command run using the data in folder test/ (sequences from LoVo CTCF ChIP-nexus and the corresponding CTCF PFM obtained with MEME) should reproduce the matrix in folder test_results/ rePWM.py test/motif1.pfm test/top1000_peaks.fasta test_results/multinomial_CTCF.pfm --Bforce 200 --t 4 It will print the following output to stdout: Creating masked matrices... done! running MOODS: moods_dna.py -o ./moods.out -m ./rePWM_m0.pfm ./rePWM_m1.pfm ./rePWM_m2.pfm ./rePWM_m3.pfm ./rePWM_m4.pfm ./rePWM_m5.pfm ./rePWM_m6.pfm ./rePWM_m7.pfm ./rePWM_m8.pfm ./rePWM_m9.pfm ./rePWM_m10.pfm ./rePWM_m11.pfm ./rePWM_m12.pfm ./rePWM_m13.pfm ./rePWM_m14.pfm -s test/top1000_peaks.fasta -t 4.0 --ps 0.1 --lo-bg 0.25 0.25 0.25 0.25 ... done! Re-calculating the matrix... grep "rePWM_m0.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_0tmp.out head -200 ./moods_sorted_0tmp.out > ./moods_sorted_0.out grep "rePWM_m1.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_1tmp.out head -200 ./moods_sorted_1tmp.out > ./moods_sorted_1.out grep "rePWM_m2.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_2tmp.out head -200 ./moods_sorted_2tmp.out > ./moods_sorted_2.out grep "rePWM_m3.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_3tmp.out head -200 ./moods_sorted_3tmp.out > ./moods_sorted_3.out grep "rePWM_m4.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_4tmp.out head -200 ./moods_sorted_4tmp.out > ./moods_sorted_4.out grep "rePWM_m5.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_5tmp.out head -200 ./moods_sorted_5tmp.out > ./moods_sorted_5.out grep "rePWM_m6.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_6tmp.out head -200 ./moods_sorted_6tmp.out > ./moods_sorted_6.out grep "rePWM_m7.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_7tmp.out head -200 ./moods_sorted_7tmp.out > ./moods_sorted_7.out grep "rePWM_m8.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_8tmp.out head -200 ./moods_sorted_8tmp.out > ./moods_sorted_8.out grep "rePWM_m9.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_9tmp.out head -200 ./moods_sorted_9tmp.out > ./moods_sorted_9.out grep "rePWM_m10.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_10tmp.out head -200 ./moods_sorted_10tmp.out > ./moods_sorted_10.out grep "rePWM_m11.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_11tmp.out head -200 ./moods_sorted_11tmp.out > ./moods_sorted_11.out grep "rePWM_m12.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_12tmp.out head -200 ./moods_sorted_12tmp.out > ./moods_sorted_12.out grep "rePWM_m13.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_13tmp.out head -200 ./moods_sorted_13tmp.out > ./moods_sorted_13.out grep "rePWM_m14.pfm" ./moods.out | sort -t , -k5,5 -gr > ./moods_sorted_14tmp.out head -200 ./moods_sorted_14tmp.out > ./moods_sorted_14.out done! Matches to the output motif using TOMTOM: http://meme-suite.org/opal-jobs/appTOMTOM_5.0.0_1529496412046-1109278779/tomtom.html Matches to the input motif using TOMTOM: http://meme-suite.org/opal-jobs/appTOMTOM_5.0.0_1529496588319-1195667469/tomtom.html HELP ---- Help on running rePWM can be found by typing rePWM.py -h assuming that the script is in your system path. This should print the following documentation: usage: rePWM.py [-h] [--mtype {pwm,pfm}] [--p P] [--t T] [--B B] [--Bforce BFORCE] [--no-rc {yes,no}] [--batch {yes,no}] [--bg BG BG BG BG] [--ps PS] [--lo-bg LO_BG LO_BG LO_BG LO_BG] [--tmpdir TMPDIR] [--keeptmp {yes,no}] [--v {yes,no}] matrix seq [seq ...] outfile positional arguments: matrix The input position frequency matrix file (tab- separeted). seq Full path to a Fasta-file containing sequences used to recalculate the PFM. outfile Full path to the output file (the recalculated PFM). optional arguments: -h, --help show this help message and exit --Bforce BFORCE Return EXACTLY the specified amount of best matches to EACH of the masked motifs. This means that the script tries to make sure that each column of the final PFM is constructed from same number of sequences. This might not always work depending on the MOODS threshold given. If there are too few sequences in some of the columns, a warning is printed. --tmpdir TMPDIR Directory used to store temporary files (masked PFMs/PWMs, MOODS output files), default = ./ --keeptmp {yes,no} If yes, keep tmp files (default=no). --v {yes,no} If yes (=default), print progress to screen. Matrix file type:: --mtype {pwm,pfm} pwm or pfm (default=pfm). Threshold selection in MOODS (exactly one required):: --p P Compute threshold from p-value. --t T Use specified absolute threshold. --B B Return at least the specified amount of best matches. Search and model behaviour (optional):: --no-rc {yes,no} If yes, disable matching versus the reverse complement strand (default=no). --batch {yes,no} If yes, do not recompute thresholds from p-values for each sequence separately (recommended when dealing with lots of short sequences, default=no). --bg BG BG BG BG Background distribution for computing thresholds from p-value with --batch (default is 0.25 for all alleles). --ps PS Specify pseudocount for log-odds conversion (default=0.1). --lo-bg LO_BG LO_BG LO_BG LO_BG Background distribution for log-odds conversion (default is 0.25 for all alleles).