# Chapter 14 Sequence motif analysis using Bio.motifs

This chapter gives an overview of the functionality of the Bio.motifs package included in Biopython. It is
intended for people who are involved in the analysis of sequence motifs, so I’ll assume that you are familiar
with basic notions of motif analysis. In case something is unclear, please look at Section 14.10 for some
relevant links.

Most of this chapter describes the Bio.motifs package included in Biopython 1.61 onwards, which
replaces the older Bio.Motif package introduced with Biopython 1.50. Bio.Motif was in turn based on two
former Biopython modules, Bio.AlignAce and Bio.MEME. Bio.motifs provides most of their functionality with
a unified motif object implementation.

Speaking of other libraries, if you are reading this you might be interested in TAMO, another Python
library designed to deal with sequence motifs. It supports more de novo motif finders, but it is not a part
of Biopython and has some restrictions on commercial use.

## 14.1 Motif objects

### 14.1.1  Creating a motif from instances

In [1]:
from Bio import motifs  # import the Bio.motifs library

Suppose we have these instances of a DNA motif:

In [2]:
from Bio.Seq import Seq
instances = [Seq("TACAA"),
    Seq("TACGC"),
    Seq("TACAC"),
    Seq("TACCC"),
    Seq("AACCC"),
    Seq("AATGC"),
    Seq("AATGC")]

Then we can create a Motif object as follows:

In [3]:
m = motifs.create(instances)   # create a Motif object

In [4]:
print(m)

TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC



The length of the motif is defined as the sequence length, which should be the same for all instances:

In [5]:
len(m)

5

The Motif object has an attribute .counts containing the counts of each nucleotide at each position. Printing
this counts matrix shows it in an easily readable format:

In [6]:
print(m.counts)

        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00
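
To make explicit what this matrix contains, the counting can be reproduced in plain Python. This is only a minimal sketch for intuition, not how Biopython implements it; the helper name `count_matrix` is hypothetical, and the instances are written as plain strings rather than Seq objects.

```python
def count_matrix(instances, alphabet="ACGT"):
    """Count how often each letter occurs at each position (hypothetical helper)."""
    length = len(instances[0])
    if any(len(instance) != length for instance in instances):
        raise ValueError("all instances must have the same length")
    counts = {letter: [0] * length for letter in alphabet}
    for instance in instances:
        for position, letter in enumerate(instance):
            counts[letter][position] += 1
    return counts


instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
counts = count_matrix(instances)
for letter in "ACGT":
    print(letter, counts[letter])
```

The per-letter lists should agree with the rows printed by `print(m.counts)`.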



You can access these counts as a dictionary:

In [7]:
m.counts['A']

[3, 7, 0, 2, 1]

but you can also think of it as a 2D array with the nucleotide as the first dimension and the position as the second dimension:

In [8]:
m.counts['T', 0]

4

In [9]:
m.counts['T', 2]

2

In [10]:
m.counts['T', 3]

0

You can also directly access columns of the counts matrix:

In [11]:
m.counts[:, 3]

{'A': 2, 'C': 2, 'G': 3, 'T': 0}

Instead of the nucleotide itself, you can also use the index of the nucleotide in the sorted letters in the alphabet of the motif:

In [12]:
m.alphabet

'ACGT'

In [13]:
m.counts["A", :]

(3, 7, 0, 2, 1)

In [14]:
m.counts[0, :]

The motif has an associated consensus sequence, defined as the sequence of letters along the positions of the motif for which the largest value in the corresponding columns of the .counts matrix is obtained:

In [15]:
m.consensus

The motif also has an anticonsensus sequence, corresponding to the smallest values in the columns of the .counts matrix:

In [16]:
m.anticonsensus
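
As a rough illustration of these two definitions, the sketch below picks the most and the least frequent letter in each column of the counts shown earlier. The helper `pick_per_column` is hypothetical, and breaking ties in favor of the alphabetically first letter is an assumption of this sketch; Biopython's own tie handling may differ.

```python
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}


def pick_per_column(counts, chooser):
    """Apply chooser (max or min) to every column of the counts matrix.

    max() and min() return the first candidate on ties, so ties go to the
    alphabetically first letter (an assumption of this sketch).
    """
    letters = sorted(counts)
    length = len(counts[letters[0]])
    return "".join(
        chooser(letters, key=lambda letter: counts[letter][position])
        for position in range(length)
    )


consensus = pick_per_column(counts, max)
anticonsensus = pick_per_column(counts, min)
print(consensus, anticonsensus)
```

For these counts the sketch gives TACGC as the consensus and CCATG as the anticonsensus.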

You can also ask for a degenerate consensus sequence, in which ambiguous nucleotides are used for positions where there are multiple nucleotides with high counts:

In [17]:
m.degenerate_consensus

We can also get the reverse complement of a motif:

In [18]:
r = m.reverse_complement()
r.consensus

In [19]:
r.degenerate_consensus

In [20]:
print(r)
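
At the level of the counts matrix, taking the reverse complement amounts to swapping each letter with its complement and reversing the order of the columns. The following is a small sketch of that idea in plain Python, using the counts of the motif above; `reverse_complement_counts` is a hypothetical helper, not a Biopython function.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}


def reverse_complement_counts(counts):
    """Swap each letter with its complement and reverse the column order."""
    return {COMPLEMENT[letter]: row[::-1] for letter, row in counts.items()}


rc_counts = reverse_complement_counts(counts)
for letter in "ACGT":
    print(letter, rc_counts[letter])
```

Applying the helper twice returns the original counts, as you would expect from a reverse complement.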

### 14.1.2 Creating a sequence logo

If we have internet access, we can create a weblogo:

In [21]:
m.weblogo("mymotif.png")

## 14.2 Reading motifs

Creating motifs from instances by hand is a bit boring, so it’s useful to have some I/O functions for reading and writing motifs. There are no really well-established standards for storing motifs, but a couple of formats are used more than others.

### 14.2.1 JASPAR

One of the most popular motif databases is JASPAR. In addition to the motif sequence information, the JASPAR database stores a lot of meta-information for each motif. The module Bio.motifs contains a specialized class jaspar.Motif in which this meta-information is represented as attributes.

The JASPAR `sites` format lists the binding sites themselves. The parts of each sequence in capital letters are the motif instances that were found to align to each other.
* Download test file: [`Arnt.sites`](https://github.com/biopython/biopython/blob/master/Tests/motifs/Arnt.sites)

In [22]:
# Create a Motif object from these instances as follows:
from Bio import motifs
with open("Arnt.sites") as handle:
    arnt = motifs.read(handle, "sites")

In [23]:
# The instances from which this motif was created are stored in the .instances property:
print(arnt.instances[:3])

In [24]:
for instance in arnt.instances:
    print(instance)

In [25]:
# The counts matrix of this motif is automatically calculated from the instances:
print(arnt.counts)

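JASPAR also makes motifs available directly as a count matrix in the `pfm` format, without the instances from which it was created. We can create a motif for this count matrix as follows: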
* Download test file: [`SRF.pfm`](https://github.com/biopython/biopython/blob/master/Tests/motifs/SRF.pfm)

In [26]:
with open("SRF.pfm") as handle:
    srf = motifs.read(handle, "pfm")

In [27]:
print(srf.counts)

As this motif was created from the counts matrix directly, it has no instances associated with it:

In [28]:
print(srf.instances)

We can now ask for the consensus sequence of these two motifs:

In [29]:
print(arnt.counts.consensus)

In [30]:
print(srf.counts.consensus)

##### The JASPAR format jaspar
The following example shows a `jaspar` formatted file containing the three motifs Arnt, RUNX1 and MEF2A:
```
>MA0004.1 Arnt
A [ 4 19 0 0 0 0 ]
C [16 0 20 0 0 0 ]
G [ 0 1 0 20 0 20 ]
T [ 0 0 0 0 20 0 ]
>MA0002.1 RUNX1
A [10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [11 11 14 24 1 16 0 0 25 16 7 ]
>MA0052.1 MEF2A
A [ 1 0 57 2 9 6 37 2 56 6 ]
C [50 0 1 1 0 0 0 0 0 0 ]
G [ 0 0 0 0 0 0 0 0 2 50 ]
T [ 7 58 0 55 49 52 21 56 0 2 ]
```
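
To show how little structure this format has, here is a minimal stand-alone parser in plain Python. It is a sketch under the assumption that every record looks like the example above (a `>ID NAME` header followed by one bracketed row per nucleotide); `parse_jaspar` is a hypothetical helper, and in practice `motifs.parse(handle, "jaspar")` is the function to use.

```python
def parse_jaspar(text):
    """Parse jaspar-formatted text into a list of (matrix_id, name, counts) tuples."""
    records = []
    counts = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: matrix ID followed by the motif name.
            matrix_id, name = line[1:].split(maxsplit=1)
            counts = {}
            records.append((matrix_id, name, counts))
        else:
            # Count row: a letter followed by bracketed integers.
            letter, values = line.split(maxsplit=1)
            values = values.replace("[", " ").replace("]", " ")
            counts[letter] = [int(value) for value in values.split()]
    return records


example = """\
>MA0004.1 Arnt
A [ 4 19 0 0 0 0 ]
C [16 0 20 0 0 0 ]
G [ 0 1 0 20 0 20 ]
T [ 0 0 0 0 20 0 ]
>MA0002.1 RUNX1
A [10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [11 11 14 24 1 16 0 0 25 16 7 ]
"""

records = parse_jaspar(example)
for matrix_id, name, counts in records:
    print(matrix_id, name, len(counts["A"]))
```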

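In [31]:
# Parse every motif in a jaspar-formatted file. A separate loop variable is
# used so that the motif m created earlier is not overwritten.
with open("jaspar_motifs.txt") as handle:
    for jaspar_motif in motifs.parse(handle, "jaspar"):
        print(jaspar_motif)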

##### Accessing the JASPAR database
In addition to parsing these flat file formats, we can also retrieve motifs from a JASPAR SQL database.
Unlike the flat file formats, a JASPAR database allows storing of all possible meta information defined in
the JASPAR Motif class. It is beyond the scope of this document to describe how to set up a JASPAR
database (please see the main JASPAR website). Motifs are read from a JASPAR database using the
Bio.motifs.jaspar.db module. First connect to the JASPAR database using the JASPAR5 class, which
models the latest JASPAR schema:



In [32]:
# from Bio.motifs.jaspar.db import JASPAR5

# import pymysql

# JASPAR_DB_HOST = "yourhostname" # fill in these values
# JASPAR_DB_NAME = "yourdatabase"
# JASPAR_DB_USER = "yourusername"
# JASPAR_DB_PASS = "yourpassword"

# jdb = JASPAR5(
#   host=JASPAR_DB_HOST,
#   name=JASPAR_DB_NAME,
#   user=JASPAR_DB_USER,
#   password=JASPAR_DB_PASS,
#  )


In [33]:
# arnt = jdb.fetch_motif_by_id("MA0004")

In [34]:
# print(arnt)

In [35]:
# motifs = jdb.fetch_motifs_by_name("Arnt")
# print(motifs[0])

In [36]:
# motifs = jdb.fetch_motifs(
# collection="CORE",
# tax_group=["vertebrates", "insects"],
# tf_class="Winged Helix-Turn-Helix",
# tf_family=["Forkhead", "Ets"],
# min_ic=12,
# )

In [37]:
# for motif in motifs:
#     pass  # do something with the motif

##### Compatibility with Perl TFBS modules

An important thing to note is that the JASPAR Motif class was designed to be compatible with the popular Perl TFBS modules. Therefore, some specifics about the choice of defaults for background and pseudocounts, as well as how information content is computed and how sequences are searched for instances, are based on this compatibility criterion. These choices are noted in the specific subsections below.



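For example, the following converts a relative score threshold of 0.8 into an absolute PSSM score threshold and uses it to search a test sequence:

In [38]: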
test_seq = Seq("TAAGCGTGCACGCGCAACACGTGCATTA")
arnt.pseudocounts = motifs.jaspar.calculate_pseudocounts(arnt)
pssm = arnt.pssm
max_score = pssm.max
min_score = pssm.min
abs_score_threshold = (max_score - min_score) * 0.8 + min_score
for pos, score in pssm.search(test_seq, threshold=abs_score_threshold):
    rel_score = (score - min_score) / (max_score - min_score)
    print(f"Position {pos}: score = {score:5.3f}, rel. score = {rel_score:5.3f}")

### 14.2.2 MEME


MEME is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input
a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to
JASPAR files, MEME output files typically contain multiple motifs.
* Download test file: [`meme.INO_up800.classic.oops.xml`](https://github.com/biopython/biopython/blob/master/Tests/motifs/meme.INO_up800.classic.oops.xml)

In [39]:
with open("meme.INO_up800.classic.oops.xml") as handle:
    record = motifs.parse(handle, "meme")

The motifs.parse command reads the complete file directly, so you can close the file after calling motifs.parse.
The header information is stored in attributes:

In [40]:
record.version

In [41]:
record.datafile

In [42]:
record.command

In [43]:
record.alphabet

In [44]:
record.sequences

In [45]:
len(record)

In [46]:
motif = record[0]
print(motif.consensus)

In [47]:
print(motif.degenerate_consensus)

In addition to these generic motif attributes, each motif also stores its specific information as calculated by
MEME:

In [48]:
motif.num_occurrences

In [49]:
motif.length

In [50]:
evalue = motif.evalue
print("%3.1g" % evalue)

In [51]:
motif.name

In [52]:
motif.id

In [53]:
motif = record["GSKGCATGTGAAA"]

Each motif has an attribute .instances with the sequence instances in which the motif was found, providing
some information on each instance:

In [54]:
len(motif.instances)

In [55]:
motif.instances[0]

In [56]:
motif.instances[0].motif_name

In [57]:
motif.instances[0].sequence_name

In [58]:
motif.instances[0].sequence_id

In [59]:
motif.instances[0].start

In [60]:
motif.instances[0].strand

In [61]:
motif.instances[0].length

In [62]:
pvalue = motif.instances[0].pvalue
print("%5.3g" % pvalue)

### 14.2.3 TRANSFAC

TRANSFAC is a manually curated database of transcription factors, together with their genomic binding
sites and DNA binding profiles [34]. While the file format used in the TRANSFAC database is nowadays
also used by others, we will refer to it as the TRANSFAC file format.
* Download test file: [`transfac.dat`](https://github.com/biopython/biopython/blob/master/Tests/motifs/transfac.dat)

In [63]:
with open("transfac.dat") as handle:
    record = motifs.parse(handle, "TRANSFAC")

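If the file contents do not follow the TRANSFAC format strictly, parsing may fail with an error. Use strict=False to parse such files anyway:

In [64]: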
with open("transfac.dat") as handle:
    record = motifs.parse(handle, "TRANSFAC", strict=False)

In [65]:
record.version

In [66]:
motif = record[0]

In [67]:
motif.degenerate_consensus # Using the Bio.motifs.Motif property

In [68]:
motif["ID"] # Using motif as a dictionary

In [69]:
print(record)

You can export the motifs in the TRANSFAC format by capturing this output in a string and saving it in
a file:


In [70]:
text = str(record)
with open("mytransfacfile.dat", "w") as out_handle:
    out_handle.write(text)

## 14.3 Writing motifs

Speaking of exporting, let’s look at export functions in general. We can use the motif's format method to
write the motif in the simple JASPAR pfm format:

In [71]:
print(arnt.format("pfm"))

Similarly, we can use format to write the motif in the JASPAR jaspar format:

In [72]:
print(arnt.format("jaspar"))

To write the motif in a TRANSFAC-like matrix format, use

In [73]:
print(m.format("transfac"))

To write out multiple motifs, you can use motifs.write. This function can be used regardless of whether
the motifs originated from a TRANSFAC file. For example,

In [74]:
two_motifs = [arnt, srf]
print(motifs.write(two_motifs, "transfac"))

Or, to write multiple motifs in the jaspar format:

In [75]:
#two_motifs = [arnt, mef2a]
#print(motifs.write(two_motifs, "jaspar"))

## 14.4 Position-Weight Matrices

The .counts attribute of a Motif object shows how often each nucleotide appeared at each position along the
alignment. We can normalize this matrix by dividing by the number of instances in the alignment, resulting
in the probability of each nucleotide at each position along the alignment. We refer to these probabilities as
the position-weight matrix. However, beware that in the literature this term may also be used to refer to
the position-specific scoring matrix, which we discuss below.

Usually, pseudocounts are added to each position before normalizing. This avoids overfitting of the
position-weight matrix to the limited number of motif instances in the alignment, and can also prevent
probabilities from becoming zero. To add a fixed pseudocount to all nucleotides at all positions, specify a
number for the pseudocounts argument:

In [76]:
pwm = m.counts.normalize(pseudocounts=0.5)
print(pwm)    

Alternatively, pseudocounts can be a dictionary specifying the pseudocounts for each nucleotide:

In [77]:
pwm = m.counts.normalize(pseudocounts={"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6})
print(pwm)
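
The arithmetic behind this normalization is short enough to write out. The sketch below assumes that each column is divided by its own total of counts plus pseudocounts; `normalize` here is a stand-alone, hypothetical function that mimics the idea, not Biopython's method.

```python
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}


def normalize(counts, pseudocounts=0.0):
    """Turn counts into per-column probabilities, after adding pseudocounts."""
    letters = sorted(counts)
    if not isinstance(pseudocounts, dict):
        # A single number means the same pseudocount for every letter.
        pseudocounts = {letter: float(pseudocounts) for letter in letters}
    length = len(counts[letters[0]])
    pwm = {letter: [] for letter in letters}
    for position in range(length):
        total = sum(counts[letter][position] + pseudocounts[letter] for letter in letters)
        for letter in letters:
            pwm[letter].append((counts[letter][position] + pseudocounts[letter]) / total)
    return pwm


pwm_fixed = normalize(counts, pseudocounts=0.5)
pwm_dict = normalize(counts, pseudocounts={"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6})
print(["%.2f" % value for value in pwm_fixed["A"]])
print(["%.2f" % value for value in pwm_dict["A"]])
```

With a pseudocount of 0.5 for each of the four letters, the first entry for A is (3 + 0.5) / (7 + 2), and every column sums to 1.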

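The position-weight matrix has its own methods to calculate the consensus, anticonsensus, and degenerate consensus sequences:

In [78]: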
pwm.consensus

In [79]:
pwm.anticonsensus

In [80]:
pwm.degenerate_consensus

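Note that due to the pseudocounts, the degenerate consensus sequence calculated from the position-weight matrix may differ slightly from the one calculated from the instances in the motif:

In [81]: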
m.degenerate_consensus

The reverse complement of the position-weight matrix can be calculated directly from the pwm:

In [82]:
rpwm = pwm.reverse_complement()
print(rpwm)

## 14.5 Position-Specific Scoring Matrices

Using the background distribution and the PWM with pseudocounts added, it’s easy to compute the log-odds
ratios, which tell us the log odds of a particular symbol coming from the motif rather than from the
background. We can use the .log_odds() method on the position-weight matrix:

In [83]:
pssm = pwm.log_odds()
print(pssm)

Here we can see positive values for symbols more frequent in the motif than in the background and negative
values for symbols more frequent in the background. 0.0 means that it’s equally likely to see a symbol in the
background and in the motif.

This assumes that A, C, G, and T are equally likely in the background. To calculate the position-specific
scoring matrix against a background with unequal probabilities for A, C, G, T, use the background argument.
For example, against a background with a 40% GC content, use:


In [84]:
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
pssm = pwm.log_odds(background)
print(pssm)
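
Each entry of the scoring matrix is simply the base-2 logarithm of the motif probability divided by the background probability. As a sketch of that formula in plain Python, using the counts of our motif, the dictionary pseudocounts from the previous section and the 40% GC background (the function name `log_odds` is borrowed for illustration; this is not Biopython's code):

```python
import math

counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
pseudocounts = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def log_odds(counts, pseudocounts, background):
    """Compute log2(probability / background) for every letter and position."""
    length = len(counts["A"])
    pssm = {letter: [] for letter in counts}
    for position in range(length):
        total = sum(counts[letter][position] + pseudocounts[letter] for letter in counts)
        for letter in counts:
            probability = (counts[letter][position] + pseudocounts[letter]) / total
            if probability > 0:
                pssm[letter].append(math.log2(probability / background[letter]))
            else:
                # A zero probability gives a log-odds score of minus infinity.
                pssm[letter].append(-math.inf)
    return pssm


pssm_sketch = log_odds(counts, pseudocounts, background)
max_score = sum(max(pssm_sketch[letter][i] for letter in "ACGT") for i in range(5))
min_score = sum(min(pssm_sketch[letter][i] for letter in "ACGT") for i in range(5))
print("max = %4.2f, min = %4.2f" % (max_score, min_score))
```

The maximum and minimum are obtained by adding up the best and the worst entry of each column.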

The maximum and minimum score obtainable from the PSSM are stored in the .max and .min properties:

In [85]:
print("%4.2f" % pssm.max)

In [86]:
print("%4.2f" % pssm.min)

The mean and standard deviation of the PSSM scores with respect to a specific background are calculated
by the .mean and .std methods.


In [87]:
mean = pssm.mean(background)
std = pssm.std(background)
print("mean = %0.2f, standard deviation = %0.2f" % (mean, std))

## 14.6 Searching for instances

The most frequent use for a motif is to find its instances in some sequence. For the sake of this section, we
will use an artificial sequence like this:

In [88]:
test_seq = Seq("TACACTGCATTACAACCCAAGCATTA")
len(test_seq)

### 14.6.1 Searching for exact matches

The simplest way to find instances is to look for exact matches of the true instances of the motif:

In [89]:
for pos, seq in m.instances.search(test_seq):
    print("%i %s" % (pos, seq))

We can do the same with the reverse complement (to find instances on the complementary strand):

In [90]:
for pos, seq in r.instances.search(test_seq):
    print("%i %s" % (pos, seq))

### 14.6.2 Searching for matches using the PSSM score

It’s just as easy to look for positions that give rise to high log-odds scores against our motif:


In [91]:
for position, score in pssm.search(test_seq, threshold=3.0):
    print("Position %d: score = %5.3f"%(position, score))

You can also calculate the scores at all positions along the sequence:

In [92]:
pssm.calculate(test_seq)

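The scores returned by pssm.calculate are for the forward strand only. To obtain the scores on the reverse strand, you can take the reverse complement of the PSSM:

In [93]: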
rpssm = pssm.reverse_complement()

In [94]:
rpssm.calculate(test_seq)
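
What these calls compute can be sketched in plain Python: slide a window along the sequence, add up the matrix entries for the letters in the window, and repeat with the reverse-complemented matrix for the other strand. The functions below are hypothetical stand-ins written for illustration. The matrix is rebuilt from the counts of our motif with the pseudocounts and background used earlier, and reverse-strand hits are reported with negative positions (position minus sequence length), which is the convention I understand Biopython's search to use.

```python
import math

counts = {"A": [3, 7, 0, 2, 1], "C": [0, 0, 5, 2, 6], "G": [0, 0, 0, 3, 0], "T": [4, 0, 2, 0, 0]}
pseudocounts = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Every column holds the same 7 instances, so one total serves all columns.
total = sum(counts[letter][0] + pseudocounts[letter] for letter in counts)
pssm_sketch = {
    letter: [math.log2((count + pseudocounts[letter]) / total / background[letter]) for count in row]
    for letter, row in counts.items()
}


def calculate(pssm, sequence):
    """Score every window of the sequence on the forward strand."""
    length = len(pssm["A"])
    return [
        sum(pssm[letter][i] for i, letter in enumerate(sequence[start:start + length]))
        for start in range(len(sequence) - length + 1)
    ]


def reverse_complement(pssm):
    """Swap complementary letters and reverse the column order."""
    return {COMPLEMENT[letter]: row[::-1] for letter, row in pssm.items()}


def search(pssm, sequence, threshold):
    """Yield (position, score) for hits on both strands."""
    forward = calculate(pssm, sequence)
    reverse = calculate(reverse_complement(pssm), sequence)
    for start, (forward_score, reverse_score) in enumerate(zip(forward, reverse)):
        if forward_score > threshold:
            yield start, forward_score
        if reverse_score > threshold:
            yield start - len(sequence), reverse_score


hits = list(search(pssm_sketch, "TACACTGCATTACAACCCAAGCATTA", threshold=3.0))
for position, score in hits:
    print("Position %d: score = %5.3f" % (position, score))
```

Computing all window scores once and filtering afterwards keeps the sketch short; it makes no attempt to be fast.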

### 14.6.3 Selecting a score threshold

If you want to use a less arbitrary way of selecting thresholds, you can explore the distribution of PSSM
scores. Since the space for a score distribution grows exponentially with motif length, we are using an
approximation with a given precision to keep computation cost manageable:


In [95]:
distribution = pssm.distribution(background=background, precision=10**4)
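
For a motif as short as ours, the idea of picking a threshold from the score distribution can be illustrated without any approximation, by enumerating all 4^5 = 1024 possible sequences. Note that this brute-force enumeration is a different technique from the approximation described above and is only feasible for short motifs. The helpers `score_distribution` and `threshold_for_fpr` are hypothetical names used for this sketch.

```python
import itertools
import math

counts = {"A": [3, 7, 0, 2, 1], "C": [0, 0, 5, 2, 6], "G": [0, 0, 0, 3, 0], "T": [4, 0, 2, 0, 0]}
pseudocounts = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# Every column holds the same 7 instances, so one total serves all columns.
total = sum(counts[letter][0] + pseudocounts[letter] for letter in counts)
pssm_sketch = {
    letter: [math.log2((count + pseudocounts[letter]) / total / background[letter]) for count in row]
    for letter, row in counts.items()
}


def score_distribution(pssm, background):
    """Map each distinct score to its total probability under the background."""
    length = len(pssm["A"])
    distribution = {}
    for kmer in itertools.product("ACGT", repeat=length):
        # Round so that mathematically equal scores fall into the same bin.
        score = round(sum(pssm[letter][i] for i, letter in enumerate(kmer)), 6)
        probability = math.prod(background[letter] for letter in kmer)
        distribution[score] = distribution.get(score, 0.0) + probability
    return distribution


def threshold_for_fpr(distribution, fpr):
    """Return the lowest score whose tail probability stays within fpr, and that tail."""
    tail = 0.0
    threshold = None
    for score in sorted(distribution, reverse=True):
        if tail + distribution[score] > fpr:
            break
        tail += distribution[score]
        threshold = score
    return threshold, tail


distribution_sketch = score_distribution(pssm_sketch, background)
threshold, tail = threshold_for_fpr(distribution_sketch, fpr=0.01)
print("threshold = %5.3f, false-positive rate = %6.4f" % (threshold, tail))
```

Here the false-positive rate is the probability that a random background sequence scores at or above the threshold.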

## 14.7 Each motif object has an associated Position-Specific Scoring Matrix


To facilitate searching for potential TFBSs using PSSMs, both the position-weight matrix and the position-specific scoring matrix are associated with each motif:

In [96]:
from Bio import motifs
    
with open("Arnt.sites") as handle:
    motif = motifs.read(handle, "sites")

In [97]:
print(motif.counts)

In [98]:
print(motif.pwm)

In [99]:
print(motif.pssm)

The negative infinities appear here because the corresponding entry in the frequency matrix is 0, and we are
using zero pseudocounts by default:

In [100]:
for letter in "ACGT":
    print("%s: %4.2f" % (letter, motif.pseudocounts[letter]))

If you change the .pseudocounts attribute, the position-frequency matrix and the position-specific scoring
matrix are recalculated automatically:

In [101]:
motif.pseudocounts = 3.0
for letter in "ACGT":
    print("%s: %4.2f" % (letter, motif.pseudocounts[letter]))

In [102]:
print(motif.pwm)

In [103]:
print(motif.pssm)

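The position-specific scoring matrix depends on the background distribution, which is uniform by default:

In [104]:
for letter in "ACGT":
    print("%s: %4.2f" % (letter, motif.background[letter]))

If you modify the background distribution, the position-specific scoring matrix is recalculated automatically: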

In [105]:
motif.background = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
print(motif.pssm)

Setting motif.background to None resets it to a uniform distribution:

In [106]:
motif.background = None
for letter in "ACGT":
    print("%s: %4.2f" % (letter, motif.background[letter]))

If you set motif.background equal to a single value, it will be interpreted as the GC content:

In [107]:
motif.background = 0.8
for letter in "ACGT":
    print("%s: %4.2f" % (letter, motif.background[letter]))
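
The conversion from a GC content to a background distribution is simple: G and C share the GC fraction equally, and A and T share the remainder. A small sketch (the helper `background_from_gc` is hypothetical):

```python
def background_from_gc(gc_content):
    """Split the GC fraction over C and G, and the rest over A and T."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("GC content must be between 0 and 1")
    at_share = (1.0 - gc_content) / 2
    gc_share = gc_content / 2
    return {"A": at_share, "C": gc_share, "G": gc_share, "T": at_share}


gc_background = background_from_gc(0.8)
for letter in "ACGT":
    print("%s: %4.2f" % (letter, gc_background[letter]))
```

For a GC content of 0.8 this gives 0.10 for A and T, and 0.40 for C and G.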

Note that you can now calculate the mean of the PSSM scores over the background against which it was
computed:

In [108]:
print("%f" % motif.pssm.mean(motif.background))

as well as its standard deviation:

In [109]:
print("%f" % motif.pssm.std(motif.background))

and its distribution:

In [110]:
distribution = motif.pssm.distribution(background=motif.background)
threshold = distribution.threshold_fpr(0.01)
print("%f" % threshold)

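Note that the position-weight matrix and the position-specific scoring matrix are recalculated each time you access motif.pwm or motif.pssm. If you want to use the PWM or PSSM repeatedly, you can save them as a variable:

In [111]: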
pssm = motif.pssm

## 14.8 Comparing motifs


Once we have more than one motif, we might want to compare them.

Before we start comparing motifs, I should point out that motif boundaries are usually quite arbitrary.
This means we often need to compare motifs of different lengths, so comparison needs to involve some kind
of alignment. This means we have to take into account two things:
* the alignment of the motifs
* some function to compare the aligned motifs

To give an example, let us first load another motif:
* Download test file: [`REB1.pfm`](https://github.com/biopython/biopython/blob/master/Tests/motifs/REB1.pfm)

In [112]:
with open("REB1.pfm") as handle:
    m_reb1 = motifs.read(handle, "pfm")

In [113]:
m_reb1.consensus

In [114]:
print(m_reb1.counts)

To make the motifs comparable, we choose the same values for the pseudocounts and the background
distribution as our motif m:

In [115]:
m_reb1.pseudocounts = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6}

In [116]:
m_reb1.background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

In [117]:
pssm_reb1 = m_reb1.pssm

In [118]:
print(pssm_reb1)

We’ll compare these motifs using the Pearson correlation. Since we want it to resemble a distance measure,
we actually take 1 − r, where r is the Pearson correlation coefficient (PCC):

In [119]:
distance, offset = pssm.dist_pearson(pssm_reb1)
print("distance = %5.3g" % distance)
print(offset)
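
To give a feeling for what such a comparison involves, here is a simplified sketch in plain Python. It tries every ungapped alignment of two scoring matrices, treats columns missing from one of them as zeros (an assumption of this sketch), and keeps the alignment with the smallest 1 − r. The function names, the toy matrices and the sign convention of the offset are all illustrative choices of mine; Biopython's dist_pearson may differ in its details.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for constant input; treated as uncorrelated here
    return sxy / math.sqrt(sxx * syy)


def dist_pearson_sketch(first, second):
    """Return (distance, offset) minimizing 1 - r over all ungapped alignments.

    Column i of `first` is aligned with column i + offset of `second`.
    """
    len_first = len(first["A"])
    len_second = len(second["A"])
    best = (math.inf, 0)
    for offset in range(-(len_first - 1), len_second):
        start = min(0, offset)
        end = max(len_second, offset + len_first)
        xs, ys = [], []
        for letter in "ACGT":
            for column in range(start, end):
                i = column - offset
                # Columns outside a matrix are padded with zeros.
                xs.append(first[letter][i] if 0 <= i < len_first else 0.0)
                ys.append(second[letter][column] if 0 <= column < len_second else 0.0)
        distance = 1.0 - pearson(xs, ys)
        if distance < best[0]:
            best = (distance, offset)
    return best


# Toy log-odds matrices: `longer` is `short` with one all-zero column in front.
short = {"A": [1.5, -2.0, 0.2], "C": [-2.0, 1.2, -0.5], "G": [-1.0, -2.0, 1.0], "T": [0.3, 0.8, -2.0]}
longer = {letter: [0.0] + row for letter, row in short.items()}

toy_distance, toy_offset = dist_pearson_sketch(short, longer)
print("distance = %5.3g, offset = %d" % (toy_distance, toy_offset))
```

Because the second toy matrix is just the first one shifted by one column, the best alignment is found at an offset of 1 with a distance of essentially zero.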

## 14.9 De novo motif finding

### 14.9.1 MEME

Let’s assume you have run MEME on sequences of your choice with your favorite parameters and saved the
XML output in a file, here meme.psp_test.classic.zoops.xml. You can retrieve the motifs reported by MEME by
running the following piece of code:
* Download test file: [`meme.psp_test.classic.zoops.xml`](https://github.com/biopython/biopython/blob/master/Tests/motifs/meme.psp_test.classic.zoops.xml)

In [120]:
from Bio import motifs
with open("meme.psp_test.classic.zoops.xml") as handle:
    motifsM = motifs.parse(handle, "meme")

In [121]:
motifsM

The motifs returned by the MEME parser can be treated exactly like regular Motif objects (with instances). They also provide some extra functionality by adding additional information about the instances:

In [122]:
motifsM[0].consensus

In [123]:
motifsM[0].instances[0].sequence_name

In [124]:
motifsM[0].instances[0].sequence_id

In [125]:
motifsM[0].instances[0].start

In [126]:
motifsM[0].instances[0].strand

In [127]:
motifsM[0].instances[0].pvalue