coord_util is a library for reading, manipulating and writing molecule geometries. coord_util can read and write geometries in several common formats.
coord_util depends on my
coord_util also depends on
numpy (tested with 1.6.1).
coord_util contains a module for storing geometries in a useful
trajdb. In order to make use of this module, coord_util
depends on the presence of
Scripts for building all of the above dependencies in the user's home
directory (and thus not requring root permissions) are available in my
coord_util into your python
cd your_python_site-packages git clone git://github.com/plediii/coord_util
Configure and Build
This step is optional but recommended; all of the
functions are pure python. However, some functions have been
implemented more efficiently in Fortran. To access the more efficient
versions configure and build the moules via:
coord_util package includes several test suites with names in
test_foo.py. To execute all of them:
The coord_math module exposes several geometric functions which interpret one-dimensional numpy arrays holding contiguous sequence of atoms. The arguments to the coord_math functions assume that the numpy arrays are in the order:
x = np.array([x1, y1, z1, x2, y2, z2, ..., xn, yn, zn]),
x1 is the x coordinate of the first atom,
y2 is the y
coordinate of the second atom, and
zn is the z coordinate of the nth
atom. The atom index in the following functions is 1-based, so that
the first atom has index 1.
get_atom_coords returns a 1x3 array containing the coordinates of the ith atom.
get_atom_coords(np.array([1, 2, 3, 4, 5, 6]), 2) == np.array([4, 5, 6])
center_of_geometry returns a 1x3 array containing the coordinates at the average of the atom coordinates in the system.
For instance, the center of geometry of atoms arranged at the corners of a square is the center of the square:
center_of_geometry(np.array([0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,])) == np.array([0.5, 0.5, 0.0])
rmsd is a slight misnomer, since rmsd is the least root mean
square distance between two geometries. However, the convention is to
simply refer to this measure as rmsd. The non least root mean square
distance between two geometries is simply the l2-norm of the
difference vector, divided by the square root of the number of atoms:
flat_rmsd(v1, v2) == sqrt(dot(v1-v2, v1-v2)/(len(v1)/3)).
The least root mean square is the least root mean square between the two vectors, amongst all possible rotations and translations of the geometry. The intuition for using the least rmsd is that, for example, changes in a water molecule geometry by rotation or translation are irrelevant to its dynamics.
rmsd gives the least root mean square distance, we have, for
rmsd(v1, v2) == rmsd(v1, translate(v2, anything))
rmsd(v1, v2) == rmsd(v1, rotate(v2, anyangle)).
align transforms a geometry to minimize the
flat_rmsd to another
target geometry. Explicitly:
rmsd(v1, v2) == flat_rmsd(v1, align(v1, v2)).
Geometries can be aligned considering only subset of the coordinates
subalign function and the
topology module described below.
translate displaces all atoms in a geometry by a uniform vector; for
example, if we translate a geometry via,
Then, for any atom in the geometry
get_atom_coords(geom, idx) - get_atom_coords(translate(geom, delta_vector)), idx) == delta_vector.
dihedral calculates the dihedral angle between involving for atoms. Example:
dihedral(geom, idx, jdx, kdx, ldx)
atom_dist calculates the distance between the idxth and jdxth atoms. Example:
atom_dist(geom, idx, jdx).
rotate_euler rotates the geometry about the origin according to the euler angles. Example:
rotate_euler(geom, alpha, beta, gamma).
transform applies a 3x3 transformation matrix to each atom in the geometry. Example:
transform(geom, euler_rotation_matrix(alpha, beta, gamma)).
topology module exposes classes and functions for selecting and
manipulating components of geometries, especially residues of
proteins. The inspiration for the design of this module came from the
pattern used by
jQuery for selecting and manipulating components of
The need for this module in my own projects arose in the context of monte-carlo rotamer switching in protein dynamics. Rotamer switching refers to changing the geometry of a single amino acid side chain in a protein between different stable formations separated by energetic frustration. Monte-Carlo rotamer switching allows for rapid exploration of a protein's dynamical space without requiring the time normally required for transitions within the residues.
The usefulness of the
topology module goes beyond just rotamer
switching. When coupled with coordinates describing a protein
geometry, it can be used to extract specific atoms, or switch the atom
order in a clear way.
Use of the
topology module centers around the
Instances of the topology classes can be read by reading the topology
of a PDB file, or by creating them manually.
Creating topology instances
For example, to obtain the topology describing the model in "protein.pdb":
import coord_util.pdb as p top = p.read_topology('protein.pdb')
Alternatively, specific topologies can be constructed manually. For instance, we can construct a toplogy for a glycine residue, or a water molecule:
import topology as t gly = t.Molecule('GLY', ['N', 'CA', 'C', 'O']) water = t.Molecule('HOH', ['H', 'O', 'H'])
Individual molecules can be composed to create polymers. A glycine dipeptide can be constructed via:
digly = t.Polymer('GLYGLY', [gly, gly])
Polymers can be composed with water molecules to create a dipeptide solvated in 100 water molecules:
system = t.Polymer('solvagted gly', [digly] + [water] * 100)
Alternatively, chains can be constructed by creating a monomer set, and using its sequence method.
ala = t.Molecule('ALA', ['N', 'CA', 'CB', 'C', 'O']) monomers = t.Monomers([gly, ala]) gly_ala_gly = t.Polymer('GLYALAGLY', monomers.sequence(['GLY', 'ALA', 'GLY']))
coord_util project includes a module
aminoacids with the standard set of aminoacids:
from aminoacids import aminoacids gly_ala_gly = aminoacids.sequence(['ALA', 'GLY', 'ALA'])
Selecting and extracting substructures
Topology instance, when coupled with the coordinates in the PDB
file, allows us to select specific components.
Suppose we already have the topology in "protein.pdb", "top" from above. The trajectory of geometries in the PDB file can be obtained via:
xs = list(p.read_coords('protein.pdb'))
The coordinates of specific components of the geometry can be
extracted by coupling a topology instance with the coordinate
described. For instance, to extract only the CA atoms in the protein,
we can use the
Topolgy.get_atoms method to create a subtopology, and
Topology.get_coords to extract the coordinates for those atoms from
a numpy array:
ca_top = top.get_atoms('CA') ca_xs = [ca_top.get_coords(x) for x in xs]
After the above, assuming the protein consists of N residues with CA atoms, each of the ca_xs are numpy arrays of length 3N containing the coordinates of the CA atoms.
Alternatively, we could extract just the backbone of the geometry
backbone_top = top.get_atomset(['CA', 'CB', 'C', 'N', 'O'])
Atoms can also be selected with regular expressions (regex). We could have selected the backbone atoms using a regex instead:
backbone_top = top.regex_get_atoms('^(CA|CB|C|N|O)$')
We could also select for specific monomers by name using
Polymer.monomers_slice(idx, jdx) and
Besides extracting substructures from a geometry, replacing those
substructures into a geometry is also useful. In the case of the
rotamer switching problem, for instance, we want to change the
coordinates of specific side chains in the protein. This operation is
implemented by the
For instance, suppose we have the subtopology for the first monomer in a Polymer:
first = top.get_monomer_by_index(0).
x is an numpy array of length 3N (corresponding to the N atoms
x1 is a numpy array of length 3M
(corresponding to the M atoms of the first residue), we can change the
geometry of the first residue to that of
Lifting and reordering atoms
When building a geometry from scratch from its constituent
substructures, a common operation is to build a new array for the
entire geometry, and fill in each of the substructures using
subtopologies For instance, suppose we are beginning to reconstruct a
geometry from its CA atoms. We can create a numpy array for the
entire geometry, with the CA atoms already filled in using
Topology.lift_coords like so:
lift_ca = top.lift_coords(top.get_atoms('CA')) full_x = lift_ca.lift_coords(ca_x)
ca_x is a numpy array of length 3M corresponding to
the M CA atoms,
full_x will be a numpy array of length 3N
corresponding to the N atoms in the complete geometry, with the CA
coordinates fill in, and the coordinates for the remaining atoms set
A related situation is when we have coordinates for a molecule with
atoms in a different order. Suppose
top2 contains a topology for
the same molecule as
top, except that the atom names are in a
different order. Additionally, suppose
x2 contains a geometry for
the atoms in the order of
top2. Then we can change order of the
atoms to that of
top_lift_top2 = top.lift_topology(top2, reorder=True) x = top_lift_top2.lift_coords(x2)
The extra argument to
reorder, is not strictly
necessary, but it dictates that
lift_topology should raise an
exception if the topology to be lifted does not contain the same
number of atoms.
Topology lifting considers only the names of atoms in molecules, so it can be used between Polymers with different names as long as the Molecules in the corresponding positions in the Polymer tree can be lifted to one another.
Aligning sub geometries
A common procedure is to align a geometry so that a sub component is
optimally aligned with the same component of another geometry. This
can be accomplished with the
subalign function in the
module. For example, to align protein coordinates
y to protein
x along their backbone:
backbone_top = top.regex_get_atoms('^(CA|CB|C|N|O)$') aligned_y = subalign(backbone_top, x, y)
coord_util package includes reader modules for several
coordinate formats. These reader modules provide an
returning an iterator over the geometries in the file.
mdcrd format requires the number of atoms in the geometry.
import coord_util.mdcrd as mdcrd with mdcrd.open('ala.crd', num_atoms=22) as f: for geom in f: print center_of_geometry(geom)
import coord_util.rst as rst with rst.open('ala.rst') as f: for geom in f: print center_of_geometry(geom)
import coord_util.pdb as pdb with pdb.open('protein.pdb') as f: for geom in f: print center_of_geometry(geom)
The Gro format specifies coordinates in nanometers, but
converts them to angstroms for consistency with the other formats.
import coord_util.gro as gro with gro.open('ala.gro') as f: for geom in f: print center_of_geometry(geom)
coord_util includes a module for storing molecular dynamics
trajectories in a convenient database format. The purpose of
is to associate snapshots molecular dynamics trajectories with a
collection of all relevant real valued physical properties in an
efficient and easily queried pair of files. A
trajdb consists of an
hdf5 database comprising the trajectories and physical
This module requires
h5py python modules.
Opening and creating trajdbs
Opening an existing trajdb
An existing trajdb named
mytraj.db can be opened via:
import coord_util.trajdb as trajdb db = trajdb.open_trajectory_database('mytraj.db', create=False)
create argument being set to
False will cause an
exception to be raised if the file does not exist.
Creating a new trajdb
A new trajdb named
mytraj.db for geometries with
ndof degrees of
freedom (e.g., water molecules have 3*3=6 degrees of freedom) can be
import coord_util.trajdb as trajdb db = trajdb.open_trajectory_database('mytraj.db', ndof=ndof, create=True)
create argument being set to
True will cause
to overwrite any previously existing database.
ndof argument is required when creating a new database.
Adding and retrieving samples
New trajectory insertion
Suppose we have an iterator
traj_iter, which yields numpy vectors of
molecule geometries representing snapshots from a trajectory with
dt (with initial time 0). This sequence can be inserted
trajdb stored referenced by
with db.session(): for x in traj_iter: samplekey = db.new_sample(x) db.step_time(dt) db.close()
The samplekey returned by
new_sample is a unique identifier
representing the geometry in the database.
session ensures that the SQL database is only modified if all of
the samles are successfully inserted. The
close function ensures
that the samples are written to the
Multiple trajectory insertion
It is possible to denote that samples in the
trajdb come from
multiple independent trajectories by using the
switch_trajectory functions. Suppose
traj_iters is an iterator of
trajectory iterators. Then the following will insert multiple
trajectories each with initial time 0 and time step
with db.session(): for traj_iter in traj_iters: trajkey = db.new_trajectory() for x in traj_iter: samplekey = db.new_sample(x) db.step_time(dt) db.close()
trajkey returned from
new_trajectory is a unique identifier
representing the trajectory in the database.
Alternatively, a previous inserted trajectory can be "resumed" by
db.switch_trajectory with an existing
Geometries inserted into a
trajdb can be retrieved as long as the
hdf5 database is present, and the samplekey of the desired geometry is
known. The geometry associated with a given samplekey can be
get_sample such as:
x = db.get_sample(samplekey)
x will be a numpy array holding the geometry of the
samplekeyth geometry. All the samplekeys for the database can be
retrieved by the
select function (described in more detail below):
for samplekey in db.select([db.samplekeys.samplekey]): x = db.get_sample(samplekey) print 'Sample %s is %s % (samplekey, x)
To simply iterate over the trajectories (in samplekey order), use the
db.iter_coordinates function. The previous example is equivalent to
for samplekey, x in db.iter_coordinates(): print 'Sample %s is %s % (samplekey, x)
Adding physical properties
In the most common case, molecule samples are described through a smaller set (i.e., smaller than the number of degrees of freedom) of real valued functions of the complete geometry. Examples of such functions are dihedral angles, bond lengths, or rmsd from a specific geometry.
trajdb module provides functions to associate real valued
properties to the samplekeys describing the geometries in the
database, and to query the database for based on these properties.
The physical properties for a trajdb can be queried without the presence of the (often much larger) full geometries (in the hdf5 database).
Adding new physical properties
Suppose we want to describe molecules by the dihedral involving atoms 1, 2, 3 and 4; named psi. This attribute can be done in the following way:
def psis(): for samplekey, x in db.iter_coordinates(): yield samplekey, dihedral(x, 1, 2, 3, 4) db.add_sample_table('psi') # the above adds the attribute psi to `db` db.insert(db.psi, psis())
Retrieving physical properties
By default, time and trajectory keys are associated with each sample. We can select these values, along with any additional properties in the following way:
for row in db.select([db.trajectories.trajectorykey, db.times.time, db.psis.value]): print 'Trajectory %d at time %f had psi=%f' % row
Selecting properties by range
To restrict the samples selected, the
select function includes a where keyword:
for row in db.select([db.trajectories.trajectorykey, db.times.time], where=((db.psis.value>0.)&(db.psis.value<3.14/2))): print 'Trajectory %d at time %f had psi between 0 and pi/2' % row
coord_util module includes efficient functions for searching for
nearest neighbors in a set of numpy arrays. The code is the Geometric
Near-neighbor Access Tree (GNAT), and closely follows the strategy
outlined by Brin (infolab.standford.edu/~sergey/near.html).
'Nearest' in this context refers to any metric, but the
This GNAT library is designed to work with a trajdb, but can easily be freed from this constraint (see test_gnat.py).
db is a
tradb, we can build a GNAT indexing the
import coord_util.gnat as g gnat = g.build_gnat(db)
Given a sample
x, the samplekeys of all neighbors closer than
according to the metric can be obtained using the
for samplekey in gnat.query(x, r): print 'Sample %d is closer than %f' % (samplekey, r)
The nearest neighbor can be identified by the
print 'The nearest neighbor is sample %d' % (g.neighbor_query(x))
A GNAT would not be very useful if we had to recreate the structure
everytime we wanted to do queries. The GNAT can be saved to the
trajdb by using the
The GNAT stored in the
trajdb can be restored via the
restored_gnat = g.load_gnat(db)