# Working with topologies

- Ground-up construction
- OpenMM
- Loading PDB files
- Custom substructure loading

<div class="alert alert-info" style="max-width: 700px; margin-left: auto; margin-right: auto;">
    ℹ️ OpenFF <code>Topology</code> objects are little more than collections (lists) of OpenFF <code>Molecule</code> objects!
</div>

For starters, let's look at the `Topology` docstring:

In [None]:
from openff.toolkit import Topology

?Topology

## Ground-up construction


Topologies can always be assembled by constructing individual molecules and adding them together; the methods shown here make common operations easier.

To convert a single `Molecule` to a `Topology`, you can use either `Molecule.to_topology()` or `Topology.from_molecules()`.

In [None]:
from openff.toolkit import Molecule

ligand = Molecule.from_smiles("Fc1ccc(/C=C/c2cc(NCCCCN3CCOCC3)c3cc(Cl)ccc3n2)cn1")
ligand.generate_conformers(n_conformers=1)

ligand

In [None]:
from viz import visualize_topology

topology = ligand.to_topology()

# Equivalent
topology = Topology.from_molecules(molecules=[ligand])

visualize_topology(topology)

From here we can add as many other molecules as we wish. For example, create a water molecule and add it to this topology 100 times.

In [None]:
water = Molecule.from_mapped_smiles("[H:2][O:1][H:3]")

for index in range(100):
    topology.add_molecule(water)

topology.n_molecules

<div class="alert alert-info" style="max-width: 500px; margin-left: auto; margin-right: auto;">
    ℹ️ Positions are <i>optional</i> in <code>Molecule</code> (and by extension <code>Topology</code>) objects, so visualizing this topology in 3D doesn't make sense. Using it in a simulation would require assigning positions using a tool like Packmol or PDBFixer.
</div>

Keeping in mind that topologies are just collections of molecules, we can look up individual molecules by index with the `Topology.molecule()` method.

In [None]:
topology.molecule(0), topology.molecule(1), topology.molecule(-1)

In [None]:
topology.molecule(0)

## Serialization

OpenFF topologies, like molecules, can be serialized using common dict-like serialization formats such as JSON, YAML, XML, etc. These files are somewhat human-readable and tend not to be space-efficient on disk.

<div class="alert alert-info" style="max-width: 700px; margin-left: auto; margin-right: auto;">
    ℹ️ These files are best when written and read by the OpenFF Toolkit; other tools aren't likely to be able to parse these files. There is also a chance, prior to a stable 1.0 version of the toolkit, that one version of the toolkit will not be able to read files written by a different version.
</div>

Let's write this topology out to a JSON file - then open it up in your favorite text editor, or explore it using IPython's fancy JSON explorer.

In [None]:
with open("topology.json", "w") as file:
    file.write(topology.to_json())

In [None]:
import json

from IPython.display import JSON

JSON(json.loads(topology.to_json()), expanded=False)

The key value of serialization is the ability to dump something from memory onto disk and read it back in later. This might seem trivial for this topology, which we could easily regenerate, but maybe you're working with a more complex topology or you want to move files around HPC resources.

In [None]:
with open("topology.json") as file:
    loaded_topology = Topology.from_json(file.read())

assert loaded_topology.n_molecules == 101

loaded_topology.molecule(0)
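
If file size matters when moving serialized topologies between machines, one option is to compress the JSON text with the standard library. The sketch below is only an illustration under stated assumptions: `payload` is a made-up placeholder standing in for the string returned by `topology.to_json()`, and its structure is not the toolkit's actual schema.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Placeholder for `topology.to_json()`; the dict layout here is invented for the demo.
payload = json.dumps({"molecules": [{"name": "water", "elements": ["O", "H", "H"]}] * 100})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "topology.json.gz"

    # Write the JSON text through gzip in text mode.
    with gzip.open(path, "wt", encoding="utf-8") as file:
        file.write(payload)
    compressed_size = path.stat().st_size

    # Read it back; the result is the original JSON string.
    with gzip.open(path, "rt", encoding="utf-8") as file:
        restored = file.read()

print(f"{len(payload)} characters of JSON stored in {compressed_size} bytes")
```

With a real topology, the restored string could then be passed to `Topology.from_json()` exactly as in the cell above.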

# Loading PDB files

_Author: Jeff Wagner_

PDB files are a common way to represent biopolymer structures, but they don't contain all of the chemical information that we need to parameterize a molecule in OpenFF (they're missing bond orders and formal charges, so those must be determined by cross-referencing against the known chemistry of amino acids and other building blocks). PDB files are used widely in the molecular simulation community for legacy reasons, so OpenFF has implemented functionality to load them.

The best way to load PDB files for use with OpenFF is using the [`Topology.from_pdb` method](https://docs.openforcefield.org/projects/toolkit/en/stable/api/generated/openff.toolkit.topology.Topology.html#openff.toolkit.topology.Topology.from_pdb). Lots of file names end in `.pdb` but not all of them are loadable by the OpenFF Toolkit. Specifically, we require the loaded PDBs to have a few things:

For protein atoms:
* Atom and residue names must be consistent with the [Chemical Components Dictionary](https://www.wwpdb.org/data/ccd)
* Only the 20 canonical amino acids are supported by default (including protonated/deprotonated variants)
* All hydrogens must be explicit
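
Before handing a file to the toolkit, it can help to run a rough pre-check on the protein records. The sketch below uses only the standard library and is not what the toolkit does internally: it flags residue names outside the 20 standard three-letter codes and residues that contain no hydrogen atoms at all. It assumes fixed-width `ATOM` records, knows nothing about capping groups or alternative naming schemes, and the `atom_line` helper exists only to build demo input with dummy coordinates.

```python
from collections import defaultdict

# Three-letter residue names of the 20 canonical amino acids.
CANONICAL_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def atom_line(serial, name, res_name, res_seq, element, record="ATOM"):
    """Build a fixed-width PDB atom record (demo helper, dummy coordinates)."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {res_name:>3} A{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )


def check_protein_records(pdb_text):
    """Return (unrecognized residue names, residues with no hydrogen atoms)."""
    atoms_by_residue = defaultdict(list)
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Key each residue by (chain ID, residue number, residue name).
        residue = (line[21], line[22:26].strip(), line[17:20].strip())
        element = line[76:78].strip().upper()
        if not element:
            # Heuristic fallback: first letter of the atom name, skipping digits.
            element = line[12:16].strip().lstrip("0123456789")[:1].upper()
        atoms_by_residue[residue].append(element)

    unknown_names = sorted(
        {res[2] for res in atoms_by_residue if res[2] not in CANONICAL_RESIDUES}
    )
    missing_h = [res for res, elements in atoms_by_residue.items() if "H" not in elements]
    return unknown_names, missing_h


demo = "\n".join([
    atom_line(1, "N", "GLY", 1, "N"),
    atom_line(2, "CA", "GLY", 1, "C"),
    atom_line(3, "HA2", "GLY", 1, "H"),
    atom_line(4, "N", "HID", 2, "N"),   # not one of the 20 standard codes
    atom_line(5, "CA", "HID", 2, "C"),  # and no hydrogens in this residue
])
unknown_names, missing_h = check_protein_records(demo)
print("Residue names to inspect:", unknown_names)
print("Residues without hydrogens:", missing_h)
```

Anything this reports is a prompt to look at the file by hand, not proof that loading will fail.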

For small molecules, waters, and ions:
* The element of each atom must be identified in the final column of the file
* Bonds must be identified in the CONECT records at the bottom of the file
* All unique small molecules must be identified in the `unique_molecules` keyword argument
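
The first two requirements for small molecules, waters, and ions can be checked by reading the fixed-width records directly. This is a minimal standard-library sketch, not part of the OpenFF API: it lists `HETATM` atoms with an empty element field (columns 77-78) and `HETATM` atoms that never appear in a `CONECT` record. The `hetatm_line` helper is hypothetical and leaves coordinates blank because the check does not read them.

```python
def hetatm_line(serial, name, res_name, element=""):
    """Build a fixed-width HETATM record for the demo (coordinates left blank)."""
    head = f"HETATM{serial:>5} {name:<4} {res_name:>3} A   1"
    return head.ljust(76) + f"{element:>2}"


def check_hetero_records(pdb_text):
    """Return (serials with no element symbol, serials absent from CONECT)."""
    missing_element = []
    hetatm_serials = set()
    bonded_serials = set()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            serial = int(line[6:11])
            hetatm_serials.add(serial)
            # The element symbol occupies columns 77-78.
            if not line[76:78].strip():
                missing_element.append(serial)
        elif line.startswith("CONECT"):
            # Serial numbers sit in 5-character fields after the record name.
            fields = [line[i:i + 5] for i in range(6, len(line.rstrip()), 5)]
            bonded_serials.update(int(field) for field in fields if field.strip())
    unbonded = sorted(hetatm_serials - bonded_serials)
    return missing_element, unbonded


demo = "\n".join([
    hetatm_line(1, "O", "HOH", "O"),
    hetatm_line(2, "H1", "HOH", "H"),
    hetatm_line(3, "H2", "HOH", "H"),
    hetatm_line(4, "NA", "NA"),        # element column left blank on purpose
    hetatm_line(5, "CL", "CL", "CL"),  # monoatomic ion, so no bonds expected
    "CONECT    1    2    3",
    "CONECT    2    1",
    "CONECT    3    1",
])
missing_element, unbonded = check_hetero_records(demo)
print("Atoms missing an element symbol:", missing_element)
print("Atoms with no CONECT entry:", unbonded)
```

Atoms without a `CONECT` entry are only candidates for review, since monoatomic ions legitimately have no bonds.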

<div class="alert alert-info" style="max-width: 700px; margin-left: auto; margin-right: auto;">
    ℹ️ OpenFF <code>Topology</code> and <code>Molecule</code> objects store much more information than is in the PDB files they might be generated from. Don't be surprised if OpenFF takes a few more seconds to load a PDB file than other tools.
</div>

## Loading a PDB with just a protein

`6hvi_prepared.pdb` is a prepared protein from the [Merck free energy perturbation study](https://github.com/MCompChem/fep-benchmark/blob/master/pfkfb3/6hvi_prepared.pdb). It was prepared using commercial tools from Schrödinger, and doesn't have any (solvent) waters, ions, or small molecules.

In [None]:
_6hvi = Topology.from_pdb("../pdb/6hvi_prepared.pdb")

visualize_topology(_6hvi)

From here, you might use Packmol or PDBFixer to add water or other solvents without leaving Python (or there are tools available for water addition in AMBER, GROMACS, and many commercial packages). An example of using PDBFixer can be found in our ["Toolkit Showcase" example](https://docs.openforcefield.org/en/latest/examples/openforcefield/openff-toolkit/toolkit_showcase/toolkit_showcase.html).

## Loading a PDB with a protein, waters, and ions

This PDB file comes from the [AMBER tutorial on making solvated simulations](https://ambermd.org/tutorials/basic/tutorial7/index.php). It is slightly modified to create an output suitable for loading into the OpenFF Toolkit (TIP3P water is used instead of OPC, a solvent box is used instead of an octahedron, and the N terminus of the protein is made neutral).

In [None]:
RAMP1 = Topology.from_pdb("../pdb/RAMP1_solv_box_ions.pdb")

visualize_topology(RAMP1)

In [None]:
from counts import count

count(RAMP1)

## Loading a PDB with one or more small molecules/ligands

The following PDB is taken from the [GROMACS protein-ligand tutorial](http://www.mdtutorials.com/gmx/complex/index.html) (with manually-added CONECT records) and contains a protein, water, and a small molecule ligand.

While the chemistry of water, ions, and proteins can be deduced from known templates (this is what happened under the hood in the previous two examples), small molecules are trickier. The number of possible small molecules is tremendous, so storing them all in a database is not feasible. And because PDB files lack some information that OpenFF needs (bond orders, stereochemistry), we ask the user to fill in the gaps. When a small molecule is included in a PDB file, the user must provide an OpenFF `Molecule` with the correct chemistry via the `unique_molecules` keyword argument. (This is just like what we did with `Topology.from_openmm` earlier.)

If you try to load a PDB file containing a small molecule, without providing the chemistry of the small molecule, you'll get an error message identifying the atoms that couldn't be loaded:

In [None]:
Topology.from_pdb("../pdb/gromacs_solv_complex.pdb")

Any appropriate OpenFF Molecule object can be used to define the small molecule ligand when loading the PDB file. It's fine to just provide a SMILES string identifying the ligand chemistry, since the coordinates will come from the PDB file.

In [None]:
jz4 = Molecule.from_smiles("CCCC1=CC=CC=C1O")
complex = Topology.from_pdb("../pdb/gromacs_solv_complex.pdb", unique_molecules=[jz4])

count(complex)

In [None]:
visualize_topology(complex)