Janne H. Korhonen edited this page Apr 22, 2016 · 12 revisions

This document explains the basics of using MOODS in your own workflows. It covers both the MOODS command-line utility moods_dna.py and using the MOODS Python library.

(The documentation is still somewhat sparse at the moment, but it will be continuously expanded as the various paper deadlines permit.)

Getting started with the command-line tool

MOODS 1.9.1 introduced a Python command-line script moods_dna.py for standalone PWM analysis using the MOODS algorithms. This section explains the basics of using moods_dna.py.

Input formats

Matrices. The matrices are expected in the JASPAR .pfm format or the .adm format for first-order models. A JASPAR .pfm file consists of four rows, specifying the counts or frequencies for nucleotides A, C, G and T, respectively:

 0  3 79 40 66 48 65 11 65  0
94 75  4  3  1  2  5  2  3  3
 1  0  3  4  1  0  5  3 28 88
 2 19 11 50 29 47 22 81  1  6
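As an illustration of the layout above, the following is a minimal sketch of reading such a file into a list of four rows in pure Python. The function name parse_pfm is hypothetical and is not part of MOODS; the MOODS library ships its own parsing code, which may behave differently.

```python
def parse_pfm(text):
    """Parse JASPAR .pfm text into four rows of floats (order: A, C, G, T)."""
    rows = [[float(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError("expected 4 rows (A, C, G, T), got %d" % len(rows))
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same number of columns")
    return rows


# The example matrix shown above.
example = """\
 0  3 79 40 66 48 65 11 65  0
94 75  4  3  1  2  5  2  3  3
 1  0  3  4  1  0  5  3 28 88
 2 19 11 50 29 47 22 81  1  6
"""

matrix = parse_pfm(example)
print(len(matrix), len(matrix[0]))  # prints: 4 10
```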

An .adm file should similarly give counts or frequencies for the 16 possible combinations of two consecutive nucleotides (16 rows with m-1 columns), along with the 0-order counts (4 rows with m columns). Annotations at the end of the rows are optional and will be ignored:

0.806   0.815   0.249   0.169   0.088   0.312   0.289   ADM_DI  AA
0.018   0.018   0.021   0.192   0.138   0.156   0.157   ADM_DI  AC
0.016   0.028   0.151   0.578   0.692   0.062   0.325   ADM_DI  AG
0.16    0.139   0.578   0.061   0.082   0.469   0.229   ADM_DI  AT
0.261   0.228   0.216   0.243   0.033   0.345   0.002   ADM_DI  CA
0.388   0.383   0.319   0.38    0.117   0.379   0.001   ADM_DI  CC
0.129   0.137   0.147   0.156   0.814   0       0.997   ADM_DI  CG
0.221   0.251   0.318   0.22    0.037   0.276   0.001   ADM_DI  CT
0.27    0.283   0.112   0.078   0       0.001   0.275   ADM_DI  GA
0.158   0.151   0.079   0.375   0       0.997   0       ADM_DI  GC
0.314   0.218   0.553   0.324   0.999   0.001   0.525   ADM_DI  GG
0.257   0.348   0.256   0.223   0.001   0.002   0.2     ADM_DI  GT
0.039   0.04    0.032   0.004   0.034   0.203   0.189   ADM_DI  TA
0.01    0.009   0.009   0.004   0.035   0.304   0.117   ADM_DI  TC
0.01    0.036   0.026   0.98    0.881   0.089   0.514   ADM_DI  TG
0.941   0.915   0.934   0.012   0.05    0.405   0.18    ADM_DI  TT
0.349   0.319   0.298   0.098   0.01    0.002   0.003   0.004   ADM_MONO_A
0.026   0.027   0.025   0.014   0.034   0.001   0.993   0.001   ADM_MONO_C
0.029   0.024   0.041   0.077   0.933   0.995   0.001   0.993   ADM_MONO_G
0.596   0.63    0.636   0.811   0.023   0.002   0.003   0.002   ADM_MONO_T
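The row structure described above can be checked with a short sketch like the following. It assumes the row order of the example (16 dinucleotide rows first, then 4 mononucleotide rows) and treats the first non-numeric token on a line as the start of the annotation. The names leading_numbers and parse_adm are hypothetical, and the toy input uses made-up values rather than a real model.

```python
def leading_numbers(line):
    """Return the numeric tokens at the start of a line as floats."""
    values = []
    for token in line.split():
        try:
            values.append(float(token))
        except ValueError:
            break  # first non-numeric token starts the optional annotation
    return values


def parse_adm(text):
    """Split .adm text into (dinucleotide_rows, mononucleotide_rows)."""
    rows = [leading_numbers(line) for line in text.splitlines() if line.strip()]
    if len(rows) != 20:
        raise ValueError("expected 16 + 4 = 20 rows, got %d" % len(rows))
    di_rows, mono_rows = rows[:16], rows[16:]
    m = len(mono_rows[0])
    if any(len(row) != m for row in mono_rows):
        raise ValueError("0-order rows must all have m columns")
    if any(len(row) != m - 1 for row in di_rows):
        raise ValueError("dinucleotide rows must all have m-1 columns")
    return di_rows, mono_rows


# Toy input with m = 3 (made-up values, not a real model).
toy = "\n".join(["0.25 0.25 ADM_DI XX"] * 16
                + ["0.25 0.25 0.25 ADM_MONO_X"] * 4)

di_rows, mono_rows = parse_adm(toy)
print(len(di_rows), len(di_rows[0]), len(mono_rows), len(mono_rows[0]))
# prints: 16 2 4 3
```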

By default, MOODS expects the given matrices to be count or frequency matrices and converts them to PWMs (i.e. log-likelihood ratios), but it is also possible to input matrices that are already in PWM format (see below).
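To make the count-to-PWM conversion concrete, here is a minimal sketch of one common way to compute log-likelihood ratios from a count matrix. The uniform background, the pseudocount of 1 and the natural logarithm are assumptions chosen for illustration, and the function name log_odds_sketch is hypothetical; the conversion MOODS performs may differ in these details.

```python
import math


def log_odds_sketch(counts, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Convert a 4-row count matrix (A, C, G, T) to log-likelihood ratios.

    Assumed scheme: the pseudocount is distributed over the four
    nucleotides in proportion to the background distribution.
    """
    n_cols = len(counts[0])
    pwm = [[0.0] * n_cols for _ in range(4)]
    for j in range(n_cols):
        total = sum(counts[i][j] for i in range(4)) + pseudocount
        for i in range(4):
            freq = (counts[i][j] + pseudocount * background[i]) / total
            pwm[i][j] = math.log(freq / background[i])
    return pwm


# Two toy columns: one dominated by A, one perfectly uniform.
toy_counts = [
    [10, 5],  # A
    [0, 5],   # C
    [0, 5],   # G
    [0, 5],   # T
]

toy_pwm = log_odds_sketch(toy_counts)
for row in toy_pwm:
    print(["%.3f" % score for score in row])
```

In the first column the score for A is positive and the other scores are negative; in the uniform column every score is zero, because the observed frequencies equal the background.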

Sequences. The sequences can be given either as plain text files or as fasta files, which may contain multiple sequences. Newlines and leading and trailing whitespace will be ignored, but any other characters that do not encode nucleotides will be treated as non-matchable positions (i.e., they should not appear outside fasta headers). IUPAC nucleotide ambiguity codes (WSMKRYBDHV) will be treated as coding SNPs, and MOODS will match the matrices against all possible combinations.
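The phrase "all possible combinations" can be illustrated with a small sketch that enumerates every concrete sequence an ambiguous site can stand for, using the standard IUPAC code table. This only shows the combinatorial idea and is not how MOODS implements the matching internally; the function name expand_ambiguity is hypothetical.

```python
from itertools import product

# Standard IUPAC codes for the ambiguity symbols listed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def expand_ambiguity(site):
    """List every unambiguous sequence encoded by a site with IUPAC codes.

    Raises KeyError for characters outside the table above.
    """
    choices = [IUPAC[ch] for ch in site.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_ambiguity("CCATWTATAG"))
# prints: ['CCATATATAG', 'CCATTTATAG']
```

The number of combinations is the product of the option counts at each position, so a site with many ambiguity codes expands quickly.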

Basic usage

The basic usage pattern of moods_dna.py is the following:

python moods_dna.py -m example-data/matrices/*.{pfm,adm} -s example-data/seq/chr1-5k-55k.fa -p 0.0001

Here, -m specifies the count/frequency matrix files, -s specifies the sequence files and -p gives the p-value threshold for reporting matches. For matrix files already in PWM format, use -S instead of -m.
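If you want to launch the tool from a Python script instead of a shell, note that the *.{pfm,adm} brace pattern is expanded by the shell and is not understood by Python's glob module. The sketch below builds the equivalent argument list using two separate glob patterns. The helper name build_moods_command is hypothetical, and only the -m, -s and -p options described above are used.

```python
import glob
import sys


def build_moods_command(matrix_files, sequence_file, p_value, script="moods_dna.py"):
    """Build the argument list for the basic moods_dna.py invocation."""
    if not matrix_files:
        raise ValueError("no matrix files given")
    return [sys.executable, script,
            "-m", *matrix_files,
            "-s", sequence_file,
            "-p", str(p_value)]


# Python's glob has no brace expansion, so collect each extension separately.
matrices = sorted(glob.glob("example-data/matrices/*.pfm")
                  + glob.glob("example-data/matrices/*.adm"))

if matrices:
    command = build_moods_command(
        matrices, "example-data/seq/chr1-5k-55k.fa", "0.0001")
    print(" ".join(command))
    # subprocess.run(command, check=True) would then execute it.
```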

Refer to python moods_dna.py -h for the full list of options. One option you should be aware of is --batch. If you have many short input sequences (either in separate files or in a single fasta file), --batch will speed things up significantly, as preprocessing is then done only once. The trade-off is that the p-value based threshold computation is also done only once rather than separately for each sequence, which may cause the number of hits to vary between sequences.

Output format

The output from moods_dna.py consists of lines looking like this:

seq_1,MA0001.pfm,13329,+,5.23551915867,CCATAATTGC,
seq_1,MA0001.pfm,15494,+,9.69547780506,CCATWTATAG,ccatTtatag

The comma-separated fields are:

  1. Sequence name (either file name or fasta header).
  2. Matrix file name.
  3. Hit position (first position of sequence is position 0).
  4. Indicates whether the match is versus the input strand (+) or the reverse complement (-).
  5. Match score.
  6. Hit site in the original sequence (note that if the match is -, then the match is versus the reverse complement of this).
  7. Hit sequence with SNPs applied, i.e. shows how the ambiguity codes are resolved to get this score (there can be multiple hits at the same position with different "real" hit sequences).

The output can be interpreted as .csv. The separator can be changed with the --sep parameter.
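Because the output is plain .csv, it can be read with Python's standard csv module. The following is a minimal sketch that maps the seven fields listed above to named attributes and adds a reverse-complement helper for interpreting hits on the - strand. The names Hit, parse_hits and reverse_complement are hypothetical, and the sketch assumes that sequence names do not themselves contain the separator.

```python
import csv
from dataclasses import dataclass

# Complement table covering A, C, G, T and the IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTWSMKRYBDHVacgtwsmkrybdhv",
                            "TGCAWSKMYRVHDBtgcawskmyrvhdb")


@dataclass
class Hit:
    sequence: str       # field 1: sequence name
    matrix: str         # field 2: matrix file name
    position: int       # field 3: 0-based hit position
    strand: str         # field 4: "+" or "-"
    score: float        # field 5: match score
    site: str           # field 6: hit site in the original sequence
    resolved_site: str  # field 7: hit sequence with SNPs applied (may be empty)


def parse_hits(lines, sep=","):
    """Yield a Hit for every non-empty output line."""
    for row in csv.reader(lines, delimiter=sep):
        if not row:
            continue
        if len(row) != 7:
            raise ValueError("expected 7 fields, got %d" % len(row))
        yield Hit(row[0], row[1], int(row[2]), row[3],
                  float(row[4]), row[5], row[6])


def reverse_complement(site):
    """Return the reverse complement of a site."""
    return site.translate(_COMPLEMENT)[::-1]


sample = [
    "seq_1,MA0001.pfm,13329,+,5.23551915867,CCATAATTGC,",
    "seq_1,MA0001.pfm,15494,+,9.69547780506,CCATWTATAG,ccatTtatag",
]

hits = list(parse_hits(sample))
for hit in hits:
    print(hit.position, hit.strand, hit.score, hit.site)
```

For a hit reported on the - strand, reverse_complement(hit.site) gives the sequence that was actually scored against the matrix.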

Getting started with Python libraries

The example scripts (scripts/ex-*.py) provided with the MOODS package show how to use and combine the various MOODS functions. This section explains some of the basic functions in more detail.

(Under construction...)