Brief theoretical introduction

Lucas edited this page Mar 17, 2020 · 6 revisions

Position weight matrices

Position weight matrices (PWMs), also known as position-specific scoring matrices (PSSMs) or weighted patterns, are a simple yet powerful model for sequence signals in bioinformatics. They can be used, for example, to model transcription factor binding sites in DNA, or other binding sites.

PWMs are usually obtained from empirically observed instances of binding sites by counting the occurrences of each symbol at each position; the result is represented as a count or frequency matrix. For example, a count matrix could look like this (example from the JASPAR database), where the four rows give the counts or frequencies for nucleotides A, C, G and T, respectively, and each column corresponds to one position of the binding site:

10.00  12.00   4.00   1.00   2.00   2.00   0.00   0.00   0.00   8.00  13.00
 2.00   2.00   7.00   1.00   0.00   8.00   0.00   0.00   1.00   2.00   2.00
 3.00   1.00   1.00   0.00  23.00   0.00  26.00  26.00   0.00   0.00   4.00
11.00  11.00  14.00  24.00   1.00  16.00   0.00   0.00  25.00  16.00   7.00

By default, MOODS reads matrix files in JASPAR count format (.pfm).
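As an illustration of the matrix layout described above, the following sketch parses the four rows of the example into Python lists. It is plain standard-library Python, not the MOODS parser; the names `PFM_TEXT` and `parse_count_matrix` are hypothetical and introduced only for this example.

```python
# Example count matrix from the text: rows are A, C, G, T; columns are positions.
PFM_TEXT = """\
10.00  12.00   4.00   1.00   2.00   2.00   0.00   0.00   0.00   8.00  13.00
 2.00   2.00   7.00   1.00   0.00   8.00   0.00   0.00   1.00   2.00   2.00
 3.00   1.00   1.00   0.00  23.00   0.00  26.00  26.00   0.00   0.00   4.00
11.00  11.00  14.00  24.00   1.00  16.00   0.00   0.00  25.00  16.00   7.00
"""


def parse_count_matrix(text):
    """Parse four whitespace-separated rows into a dict keyed by nucleotide."""
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError("expected exactly 4 rows (A, C, G, T)")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same number of columns")
    return dict(zip("ACGT", rows))


counts = parse_count_matrix(PFM_TEXT)

# Total count per position (sum over the four nucleotides in each column).
column_totals = [sum(col) for col in zip(*counts.values())]
print(column_totals)
```

Every column of this example sums to 26, which is what one would expect if the matrix was built from 26 aligned binding site instances.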

Log-odds scoring

By default, MOODS converts count/frequency matrices to PWMs using log-odds scoring, which defines a score for the matrix against a DNA sequence; this score is then used to identify putative sites matching the motif described by the matrix. Specifically, given a 4 × m count/frequency matrix M and a background probability π[a] for each nucleotide a, pseudocounts are added to the matrix and the values are normalised:

$$f_{a,i} = \frac{M_{a,i} + P\,\pi[a]}{\sum_{b} M_{b,i} + P}$$

where P is the pseudocount value (default 0.01 in moods_dna.py).

The frequencies are then used to compute the PWM L as

$$L_{a,i} = \log \frac{f_{a,i}}{\pi[a]}$$
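The conversion from counts to log-odds scores can be sketched in a few lines of Python. This is a minimal illustration, not the MOODS implementation: it assumes the pseudocount is distributed over the nucleotides in proportion to the background probabilities and that the natural logarithm is used. The function name `log_odds` and the toy inputs are chosen for this example only.

```python
import math


def log_odds(counts, bg, pseudocount=0.01):
    """Convert a 4 x m count matrix (rows A, C, G, T) into a log-odds PWM.

    Assumption: each entry receives pseudocount * bg[a] before normalisation.
    """
    m = len(counts[0])
    pwm = [[0.0] * m for _ in counts]
    for i in range(m):
        # Column total after pseudocounts have been added.
        col_sum = sum(row[i] + pseudocount * bg[a]
                      for a, row in enumerate(counts))
        for a, row in enumerate(counts):
            freq = (row[i] + pseudocount * bg[a]) / col_sum
            pwm[a][i] = math.log(freq / bg[a])
    return pwm


# Columns 1 and 7 of the example count matrix, with a uniform background.
counts = [[10.0, 0.0],   # A
          [2.0, 0.0],    # C
          [3.0, 26.0],   # G
          [11.0, 0.0]]   # T
bg = [0.25, 0.25, 0.25, 0.25]

pwm = log_odds(counts, bg)
for nucleotide, row in zip("ACGT", pwm):
    print(nucleotide, [round(x, 3) for x in row])
```

Nucleotides that are more frequent in a column than in the background get positive scores, and rarer ones get negative scores. The pseudocount keeps the scores finite even where the observed count is zero, as in the second column here.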

For any sequence $u = u_1 u_2 \ldots u_m$ of length m, the PWM L defines a match score as

$$S(u) = \sum_{i=1}^{m} L_{u_i,\,i}$$

Intuitively, this score is the logarithm of the ratio between the probability that the model specified by the frequency matrix f generated the sequence u and the probability that the background model π generated u.
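The likelihood-ratio interpretation can be checked numerically. The sketch below uses a small, made-up frequency matrix (`FREQ`) and a uniform background (`BG`), both hypothetical values chosen for illustration, and assumes natural-log scores. It confirms that exponentiating the match score gives the ratio of the two probabilities.

```python
import math

# Hypothetical background and per-position frequencies (each column sums to 1).
BG = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
FREQ = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def match_score(freq, bg, u):
    """Sum of per-position log-odds for a sequence u with len(u) == len(freq)."""
    if len(u) != len(freq):
        raise ValueError("sequence length must equal the matrix length")
    return sum(math.log(freq[i][c] / bg[c]) for i, c in enumerate(u))


def likelihood_ratio(freq, bg, u):
    """P(u | motif model) divided by P(u | background model)."""
    p_model = math.prod(freq[i][c] for i, c in enumerate(u))
    p_background = math.prod(bg[c] for c in u)
    return p_model / p_background


for u in ("AGT", "CCT"):
    s = match_score(FREQ, BG, u)
    print(u, round(s, 4), round(likelihood_ratio(FREQ, BG, u), 4))
```

A positive score means the sequence is more likely under the motif model than under the background, and a negative score means the opposite.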

Finding PWM matches

Given a PWM L and a sequence $s = s_1 s_2 \ldots s_n$, the specific task we consider is finding the sequence positions i at which the score given by the matrix meets or exceeds some given threshold T, that is,

$$\sum_{j=1}^{m} L_{s_{i+j-1},\,j} \ge T$$

The idea is that these matches should correspond, for example, to putative binding sites for the transcription factor described by the original count matrix.
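The matching task can be illustrated with a naive sliding-window scan. This is only a sketch of the problem statement, not the algorithm MOODS uses. The PWM values are made up, positions are reported 0-based, a score equal to the threshold is treated as a match (an assumption), and windows containing symbols other than A, C, G, T are skipped.

```python
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

# Hypothetical 4 x 3 PWM (rows A, C, G, T), for illustration only.
PWM = [
    [1.0, -2.0, 0.5],    # A
    [-1.0, 1.5, -0.5],   # C
    [0.0, -1.0, 2.0],    # G
    [-3.0, 0.5, -1.5],   # T
]


def scan(pwm, seq, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    m = len(pwm[0])
    hits = []
    for i in range(len(seq) - m + 1):
        window = seq[i:i + m]
        if any(c not in INDEX for c in window):
            continue  # skip windows with ambiguous symbols such as N
        score = sum(pwm[INDEX[c]][j] for j, c in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits


print(scan(PWM, "TTACGACGT", 4.0))  # [(2, 4.5), (5, 4.5)]
```

This naive scan takes time proportional to n × m for a sequence of length n. It is meant to make the definition concrete rather than to be efficient.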

The choice of the threshold T greatly affects the results. There are a couple of options for choosing it. By default, MOODS assumes that the threshold is given as a p-value p, and the actual threshold T is chosen so that the probability that the background distribution π generates a sequence u of length m with a score of at least T is p. Alternatively, the score threshold can be set heuristically to some suitable value.
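For a short matrix, the relationship between a p-value and a score threshold can be made concrete by brute force: enumerate all 4^m sequences, accumulate their background probabilities from the highest score downwards, and stop before the tail probability would exceed p. The sketch below does exactly that. It is exponential in m, so it is only an illustration and not how MOODS computes thresholds. The PWM values and the function name `threshold_from_pvalue` are hypothetical.

```python
import itertools
import math

# Hypothetical 4 x 3 PWM (rows A, C, G, T) and uniform background.
PWM = [
    [1.0, -2.0, 0.5],    # A
    [-1.0, 1.5, -0.5],   # C
    [0.0, -1.0, 2.0],    # G
    [-3.0, 0.5, -1.5],   # T
]
BG = [0.25, 0.25, 0.25, 0.25]


def threshold_from_pvalue(pwm, bg, p):
    """Lowest score T with P_background(score >= T) <= p, or inf if none."""
    m = len(pwm[0])
    # Background probability mass for each distinct score.
    dist = {}
    for idx in itertools.product(range(4), repeat=m):
        score = round(sum(pwm[a][i] for i, a in enumerate(idx)), 9)
        prob = math.prod(bg[a] for a in idx)
        dist[score] = dist.get(score, 0.0) + prob
    tail = 0.0
    threshold = math.inf
    for score in sorted(dist, reverse=True):
        if tail + dist[score] > p + 1e-12:
            break  # including this score would push the tail above p
        tail += dist[score]
        threshold = score
    return threshold


print(threshold_from_pvalue(PWM, BG, 0.05))  # 3.5
print(threshold_from_pvalue(PWM, BG, 0.02))  # 4.5
```

Because the score distribution is discrete, the tail probability at the returned threshold is generally somewhat below p rather than exactly equal to it. In this toy example only 3 of the 64 possible sequences score at least 3.5, giving a tail probability of about 0.047 for p = 0.05.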
