Janne H. Korhonen edited this page Apr 4, 2018 · 8 revisions

Sequence formats

For sequences, MOODS generally expects sequence files consisting of the characters ACGTacgt. Line breaks and whitespace at the beginning and end of lines are ignored; other characters are treated as positions where no match can occur.

If the command-line tool encounters IUPAC nucleotide codes corresponding to multiple nucleotides (other than N), it will do matching for all possible interpretations of the code. The output indicates which interpretations match.
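As an illustration of what "all possible interpretations" means, the following is a minimal sketch that expands a sequence window containing IUPAC codes into the plain ACGT strings it can stand for. The function name `interpretations` and the table `IUPAC` are hypothetical names chosen for this example; this is not how MOODS implements the expansion internally.

```python
from itertools import product

# Standard IUPAC nucleotide codes. N is deliberately left out, because the
# text above states that N is not expanded.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def interpretations(window):
    """Return every ACGT string that a window with IUPAC codes can stand for."""
    try:
        choices = [IUPAC[c] for c in window.upper()]
    except KeyError:
        # N or any other character: treated as a position with no match.
        return []
    return ["".join(p) for p in product(*choices)]

print(interpretations("TAAACAAAAYA"))
```

For the window TAAACAAAAYA, the code Y stands for C or T, so the sketch returns two candidate strings; a matcher would then score each of them separately.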

Plain text

Sequences can be given as plain text files, consisting of only the sequence without any metadata. The file name will be used as the sequence name in the output.

FASTA

The command-line tool can also parse single and multi-sequence FASTA files. The FASTA header is used as the sequence name for each found sequence.

Pattern formats

PWMs and score matrices

For standard count or frequency matrices, MOODS uses the JASPAR raw position frequency matrix (.pfm) format. A .pfm file consists of four rows, specifying the counts or frequencies for nucleotides A, C, G and T, respectively:

10.00  12.00   4.00   1.00   2.00   2.00   0.00   0.00   0.00   8.00  13.00
 2.00   2.00   7.00   1.00   0.00   8.00   0.00   0.00   1.00   2.00   2.00
 3.00   1.00   1.00   0.00  23.00   0.00  26.00  26.00   0.00   0.00   4.00
11.00  11.00  14.00  24.00   1.00  16.00   0.00   0.00  25.00  16.00   7.00

In the command-line tool, count/frequency matrix input files are specified with the parameter -m: python moods_dna.py -m [filenames]. MOODS will automatically convert these to PWMs using log-likelihood scores. The parameters --ps, --log-base and --lo-bg can be used to fine-tune the transformation; see the brief theoretical introduction for details on the default log-likelihood computation.
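To make the conversion concrete, here is a minimal sketch of one common log-odds transformation applied to the count matrix above. The formula, the pseudocount of 0.01, the uniform background and the natural logarithm are all assumptions made for this example and are not taken from the MOODS documentation. They were chosen because they reproduce the values of the example score matrix shown for the parameter -S on this page. The names `parse_pfm` and `log_odds` are hypothetical.

```python
import math

PFM_TEXT = """\
10.00  12.00   4.00   1.00   2.00   2.00   0.00   0.00   0.00   8.00  13.00
 2.00   2.00   7.00   1.00   0.00   8.00   0.00   0.00   1.00   2.00   2.00
 3.00   1.00   1.00   0.00  23.00   0.00  26.00  26.00   0.00   0.00   4.00
11.00  11.00  14.00  24.00   1.00  16.00   0.00   0.00  25.00  16.00   7.00
"""

def parse_pfm(text):
    """Read four whitespace-separated rows (A, C, G, T) into lists of floats."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]

def log_odds(counts, bg=(0.25, 0.25, 0.25, 0.25), ps=0.01, log_base=None):
    """Assumed formula: log(((count + ps * bg) / (total + ps)) / bg)."""
    width = len(counts[0])
    scores = [[0.0] * width for _ in range(4)]
    for j in range(width):
        total = sum(counts[i][j] for i in range(4))
        for i in range(4):
            p = (counts[i][j] + ps * bg[i]) / (total + ps)
            s = math.log(p / bg[i])
            if log_base is not None:
                s /= math.log(log_base)  # change of base, e.g. log_base=2
            scores[i][j] = s
    return scores

pwm = log_odds(parse_pfm(PFM_TEXT))
print(" ".join(f"{s:8.4f}" for s in pwm[0]))  # scores for nucleotide A
```

Distributing the pseudocount in proportion to the background keeps zero counts from producing a score of minus infinity while leaving well-supported columns almost unchanged.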

Alternatively, PWMs and other additive scoring matrices can be given as input with the parameter -S. These should follow the same format:

 0.4306    0.6129   -0.4853   -1.8697   -1.1778   -1.1778   -7.8637   -7.8637   -7.8637    0.2076    0.6930
-1.1778   -1.1778    0.0741   -1.8697   -7.8637    0.2076   -7.8637   -7.8637   -1.8697   -1.1778   -1.1778
-0.7727   -1.8697   -1.8697   -7.8637    1.2634   -7.8637    1.3860    1.3860   -7.8637   -7.8637   -0.4853
 0.5259    0.5259    0.7670    1.3060   -1.8697    0.9006   -7.8637   -7.8637    1.3468    0.9006    0.0741

Scoring matrices given with parameter -S are used for matching without applying any transformations.
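"Additive" here means that the score of a sequence window is the sum of one matrix entry per position. The following is a minimal sketch of that idea as a naive scan; it is not the algorithm MOODS uses, and the names `score_window` and `scan` as well as the threshold argument are hypothetical.

```python
ROW = {"A": 0, "C": 1, "G": 2, "T": 3}

def score_window(matrix, window):
    """Sum the matrix entry selected by each nucleotide of the window."""
    return sum(matrix[ROW[c]][i] for i, c in enumerate(window))

def scan(matrix, seq, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    m = len(matrix[0])  # motif length = number of columns
    hits = []
    for pos in range(len(seq) - m + 1):
        window = seq[pos:pos + m].upper()
        if any(c not in ROW for c in window):
            continue  # non-ACGT characters cannot be part of a match
        s = score_window(matrix, window)
        if s >= threshold:
            hits.append((pos, s))
    return hits

# Toy matrix of length 2 that rewards "AG"; rows are A, C, G, T.
toy = [[1, 0], [0, 0], [0, 1], [0, 0]]
print(scan(toy, "TAGAG", 2))
```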

First-order PWMs

The command-line tool has limited support for first-order generalisations of PWMs using the .adm format and raw first-order scoring matrices; see Korhonen et al. (2017, Bioinformatics) for more background on high-order PWM matching.

An .adm file should consist of 16 rows describing the conditional probabilities Pᵢ( p | q) of nucleotide p appearing at position i, conditioned on nucleotide q appearing at position i-1. These rows are ordered in lexicographical order, that is, AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT. For example, the first row corresponds to probabilities Pᵢ( A | A). These 16 rows should be followed by 4 rows corresponding to a standard probability matrix presentation of the motif.

Files corresponding to this specification can be given to the MOODS command-line tool with the parameter -m, and they are converted to (first-order) log-likelihood scoring matrices:

0.806   0.815   0.249   0.169   0.088   0.312   0.289   AA
0.018   0.018   0.021   0.192   0.138   0.156   0.157   AC
0.016   0.028   0.151   0.578   0.692   0.062   0.325   AG
0.16    0.139   0.578   0.061   0.082   0.469   0.229   AT
0.261   0.228   0.216   0.243   0.033   0.345   0.002   CA
0.388   0.383   0.319   0.38    0.117   0.379   0.001   CC
0.129   0.137   0.147   0.156   0.814   0       0.997   CG
0.221   0.251   0.318   0.22    0.037   0.276   0.001   CT
0.27    0.283   0.112   0.078   0       0.001   0.275   GA
0.158   0.151   0.079   0.375   0       0.997   0       GC
0.314   0.218   0.553   0.324   0.999   0.001   0.525   GG
0.257   0.348   0.256   0.223   0.001   0.002   0.2     GT
0.039   0.04    0.032   0.004   0.034   0.203   0.189   TA
0.01    0.009   0.009   0.004   0.035   0.304   0.117   TC
0.01    0.036   0.026   0.98    0.881   0.089   0.514   TG
0.941   0.915   0.934   0.012   0.05    0.405   0.18    TT
0.349   0.319   0.298   0.098   0.01    0.002   0.003   0.004   A
0.026   0.027   0.025   0.014   0.034   0.001   0.993   0.001   C
0.029   0.024   0.041   0.077   0.933   0.995   0.001   0.993   G
0.596   0.63    0.636   0.811   0.023   0.002   0.003   0.002   T

The row annotations at the end of each line (AA, AC, ..., A, C, G, T) are optional. Furthermore, all but the first column can be omitted on the last four rows, as they are not used for the log-likelihood transformation (MOODS >=1.9.4).
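The layout described above can be read with a few lines of code. This is a minimal sketch of a reader for the 16 + 4 row layout with optional trailing annotations; the function name `parse_adm` is hypothetical, and the sketch only parses the file and does not attempt the log-likelihood conversion that MOODS performs.

```python
# Row order of the first 16 rows: AA, AC, AG, AT, CA, ..., TT.
PAIRS = [a + b for a in "ACGT" for b in "ACGT"]

def parse_adm(text):
    """Split .adm text into first-order rows and the final four A/C/G/T rows."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        try:
            float(tokens[-1])
        except ValueError:
            tokens = tokens[:-1]  # drop a trailing annotation such as "AA" or "T"
        rows.append([float(t) for t in tokens])
    if len(rows) != 20:
        raise ValueError(f"expected 20 rows, found {len(rows)}")
    first_order = dict(zip(PAIRS, rows[:16]))
    zero_order = dict(zip("ACGT", rows[16:]))
    return first_order, zero_order
```

Calling `parse_adm(open(path).read())` on a file like the example above would give a dictionary with 16 keys of 7 values each and a second dictionary with 4 keys, where `path` stands for whatever file name is in use.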

Similarly, 16-row additive scoring matrices (with scores corresponding to all pairs of adjacent positions) can be given with parameter -S, with rows corresponding to character combinations in order AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT, and columns corresponding to positions:

  1.4946    1.1748   -0.0030   -0.3868   -1.0261    0.2206    0.1436
 -2.1802   -2.5110   -2.3734   -0.2610   -0.5862   -0.4647   -0.4594
 -2.2828   -2.1137   -0.4967    0.8325    1.0118   -1.3638    0.2601
 -0.1099   -0.5791    0.8335   -1.3804   -1.0947    0.6255   -0.0868
 -2.1379   -0.0902   -0.1446   -0.0271   -1.9629    0.3194   -4.0283
 -1.7445    0.4241    0.2416    0.4163   -0.7491    0.4127   -4.2796
 -2.8329   -0.5924   -0.5241   -0.4647    1.1726   -4.6151    1.3749
 -2.3025    0.0049    0.2385   -0.1255   -1.8561    0.0980   -4.2796
 -2.0042    0.1228   -0.7908   -1.1432   -4.6151   -4.2796    0.0944
 -2.5336   -0.4977   -1.1308    0.4022   -4.6151    1.3749   -4.6151
 -1.8545   -0.1355    0.7885    0.2570    1.3778   -4.2796    0.7367
 -2.0531    0.3279    0.0235   -0.1131   -4.2786   -4.0283   -0.2207
 -0.9427   -1.7819   -1.9914   -3.6596   -1.9341   -0.2070   -0.2765
 -2.1427   -3.0891   -3.0901   -3.6596   -1.9071    0.1928   -0.7481
 -2.1427   -1.8808   -2.1825    1.3587    1.2525   -1.0161    0.7157
  2.1812    1.2902    1.3097   -2.8573   -1.5706    0.4776   -0.3247
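The following minimal sketch shows how such a 16-row matrix could be used to score a window. It assumes that column i holds the score for the pair formed by window positions i and i+1, read in sequence order, so that a matrix with 7 columns scores windows of length 8. That reading is an interpretation of the description above rather than a documented detail, and the name `score_first_order` is hypothetical.

```python
# Row index of each dinucleotide in the order AA, AC, AG, AT, CA, ..., TT.
PAIR_ROW = {a + b: 4 * i + j
            for i, a in enumerate("ACGT")
            for j, b in enumerate("ACGT")}

def score_first_order(matrix, window):
    """Sum one entry per adjacent pair of the window (assumed layout)."""
    n_pairs = len(matrix[0])
    if len(window) != n_pairs + 1:
        raise ValueError("window must be one longer than the number of columns")
    window = window.upper()
    return sum(matrix[PAIR_ROW[window[i:i + 2]]][i] for i in range(n_pairs))

# Toy matrix with two columns: rewards AA at the first pair, AC at the second.
toy = [[0.0, 0.0] for _ in range(16)]
toy[PAIR_ROW["AA"]][0] = 1.0
toy[PAIR_ROW["AC"]][1] = 2.0
print(score_first_order(toy, "AAC"))
```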

Output format

The output of the MOODS command-line tool looks like this:

...
seq1,MA0012.pfm,8429,+,8.20860460572,TAAACAAAAYA,TAAACAAAATA
seq1,MA0012.pfm,8712,-,9.41161988415,CTTTTTGTTTA,
seq1,MA0012.pfm,11573,-,8.43141521909,CAATTTGTTTA,
...

Each output line corresponds to a single match. The fields, separated by a comma (or by the separator specified with --sep), are the following:

  1. Name of the sequence (file name or FASTA header).
  2. Name of the motif file.
  3. Hit position. The first position of the sequence is 0; the end position of the match is hit position + length of the motif, which is one past the last matched character.
  4. Strand: + indicates a normal match, - indicates a reverse-complement match. Note that in the latter case, the match position is still reported in terms of the original sequence.
  5. Hit score.
  6. The matching part of the sequence in the original input. In particular, this is not reverse complemented even for - matches, i.e. this is always a substring of the original sequence.
  7. Matching sequence with ambiguous IUPAC symbols replaced with the interpretations yielding the current match. Omitted for matches without ambiguous IUPAC symbols.
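The fields listed above can be read back with a short script. This is a minimal sketch that assumes the default comma separator and that every line carries all seven fields, with the last one possibly empty as in the sample output; the names `Hit`, `parse_hit` and `motif_oriented` are hypothetical.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "sequence motif start end strand score site resolved")

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def parse_hit(line, sep=","):
    """Parse one output line into a Hit; end is start + length of the site."""
    # Split from the right so that a separator inside the sequence name
    # (for example in a FASTA header) does not shift the other fields.
    name, motif, pos, strand, score, site, resolved = (
        line.rstrip("\n").rsplit(sep, 6))
    start = int(pos)
    return Hit(name, motif, start, start + len(site), strand,
               float(score), site, resolved or None)

def motif_oriented(hit):
    """Return the matched text as the motif sees it (reverse complement for -)."""
    text = hit.resolved or hit.site
    return text.translate(COMPLEMENT)[::-1] if hit.strand == "-" else text

hit = parse_hit("seq1,MA0012.pfm,8712,-,9.41161988415,CTTTTTGTTTA,")
print(hit.start, hit.end, motif_oriented(hit))
```

Because the reported site is always a substring of the original sequence, `sequence[hit.start:hit.end]` should equal `hit.site` for both strands, which makes a convenient sanity check when post-processing results.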