## Find neighbours of atoms in PDB structures using Biopython

A PDB structure contains the (x, y, z) coordinates of every atom. Biopython provides `NeighborSearch`, an implementation of a fast algorithm that finds all atoms within a certain distance of a given coordinate.

In [1]:
import os.path
from Bio.PDB import PDBParser, NeighborSearch

Define the directory where the Chothia-renumbered PDB files are stored.

In [2]:
PDB_DIR = "../data/pdbs"

Given the `pdb_id`, build the path to the corresponding PDB file.

In [3]:
pdb_id = "9ds1"
filename = os.path.join(PDB_DIR, f"{pdb_id}_chothia.pdb")

First, we need to parse the PDB file.

In [4]:
parser = PDBParser(PERMISSIVE=1)
structure = parser.get_structure(pdb_id, filename)

Then we need to initialise the data structure that is used for the neighbour search. For demonstration we index all atoms of the structure, which can be wasteful. Later, it can be advantageous to index only the atoms of the antigen, for example.

In [19]:
atoms = list(structure.get_atoms())
ns = NeighborSearch(atoms)
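As a minimal sketch of indexing only the antigen, the example below builds a small in-memory structure and creates a `NeighborSearch` from the atoms of a single chain. The chain IDs, residues and coordinates are made up for illustration and are not taken from 9ds1; with a parsed structure the same pattern would apply to `structure[0][chain_id]`.

```python
import numpy as np
from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure

# Toy structure with made-up coordinates: antibody chain 'H', antigen chain 'G'.
toy = Structure("toy")
toy.add(Model(0))
rows = [
    ("H", 52, "SER", (0.0, 0.0, 0.0)),
    ("G", 13, "TYR", (3.0, 0.0, 0.0)),
    ("G", 14, "GLY", (10.0, 0.0, 0.0)),
]
for chain_id, resseq, resname, xyz in rows:
    if chain_id not in toy[0]:
        toy[0].add(Chain(chain_id))
    residue = Residue((" ", resseq, " "), resname, "")
    toy[0][chain_id].add(residue)
    # Atom(name, coord, bfactor, occupancy, altloc, fullname, serial_number, element)
    residue.add(Atom("CA", np.array(xyz, dtype="f"), 0.0, 1.0, " ", " CA ", 0, element="C"))

# Index only the antigen atoms instead of every atom in the structure.
antigen_atoms = list(toy[0]["G"].get_atoms())
ns_antigen = NeighborSearch(antigen_atoms)

# Query with an antibody atom; only antigen atoms can be returned.
query = toy[0]["H"][52]["CA"]
hits = ns_antigen.search(query.coord, 4.0)
for hit in hits:
    print(hit.get_parent().get_resname(), hit.get_id())
```

Because the search index contains antigen atoms only, the query atom itself and its antibody neighbours never appear in the result.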

Now we can use the `ns.search(coords, distance)` method to find all atoms that are within `distance` angstroms of the position specified by the `[x, y, z]` array `coords`.

We can retrieve the position of an atom using its `coord` property:

In [23]:
atom = structure[0]['H'][52]['N']
atom.coord

array([ 14.805, -56.923,  25.988], dtype=float32)

To find all atoms in the vicinity of an atom, we simply pass the coordinates of the atom to the `ns.search` method.

In [24]:
close_atoms = ns.search(atom.coord, 4.0)

`close_atoms` now contains the `Atom` objects of all atoms that are within 4.0 angstroms of the supplied coordinates. To find the residues and chains these atoms belong to, we can either apply the `get_parent()` method repeatedly, or simply use the `get_full_id()` method.

As we want the output as a DataFrame, we first create a list of dicts, and then create the DataFrame.
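The two routes can be compared on a single atom. The sketch below builds one residue in memory (the IDs and coordinates are made up for illustration) and recovers the chain ID, residue ID and atom name both ways.

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure

# Build a one-atom toy structure top-down (made-up IDs and coordinates).
toy = Structure("toy")
model = Model(0)
chain = Chain("H")
residue = Residue((" ", 52, "A"), "GLY", "")  # id = (het flag, number, insertion code)
toy.add(model)
model.add(chain)
chain.add(residue)
toy_atom = Atom("N", np.array([0.0, 0.0, 0.0], dtype="f"), 0.0, 1.0, " ", " N  ", 1, element="N")
residue.add(toy_atom)

# Route 1: walk up the hierarchy with get_parent().
parent_residue = toy_atom.get_parent()
parent_chain = parent_residue.get_parent()
via_parents = (parent_chain.get_id(), parent_residue.get_id(), toy_atom.get_id())

# Route 2: unpack the full id in one step.
structure_id, model_id, chain_id, residue_id, atom_id = toy_atom.get_full_id()
via_full_id = (chain_id, residue_id, atom_id[0])  # atom_id is (name, altloc)

# Residue number including insertion code; a blank icode (' ') is stripped.
resnum = f"{residue_id[1]}{residue_id[2].strip()}"
print(via_parents, via_full_id, resnum)
```

The full id is convenient when only identifiers are needed, while `get_parent()` gives access to the residue object itself, for example to read its name.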

In [120]:
import pandas as pd

data = []

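for close_atom in close_atoms:
    # full id: (structure id, model id, chain id, residue id, (atom name, altloc))
    s, m, c, r, a = close_atom.get_full_id()
    # residue id: (het flag, number, insertion code); strip the blank icode ' '
    resnum = str(r[1]) + r[2].strip()

    # to get the resname we need to access the residue the atom belongs to
    residue = close_atom.get_parent()
    resname = residue.get_resname()
    data.append(dict(chain=c, resnum=resnum, resname=resname, atom=a[0]))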


df = pd.DataFrame(data)

df

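|   | chain | resnum | resname | atom |
|---|-------|--------|---------|------|
| 0 | H | 52 | SER | OG |
| 1 | H | 56 | SER | O |
| 2 | H | 52 | SER | N |
| 3 | H | 51 | ILE | CG1 |
| 4 | H | 52 | SER | O |
| 5 | H | 51 | ILE | O |
| 6 | H | 52 | SER | CB |
| 7 | H | 52 | SER | CA |
| 8 | H | 52 | SER | C |
| 9 | H | 52A | GLY | N |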


Now we have learned how to find all atoms within the vicinity of a given atom. 

We can use this to find the atomic contact points between the residues of an antibody and the residues of its bound antigen.
We say that an antibody residue is in contact with an antigen residue if the distance between an atom of one and an atom of the other is less than a threshold distance. A threshold of 4 angstroms is often used.

Write a function `atomic_contact_points(ab_chain, ag_chain, distance)` that loops over the atoms in `ab_chain`, finds all atoms of `ag_chain` that are within `distance`, and returns a list of dicts with the keys
- ab_resnum
- ab_resname
- ab_atom
- ag_resnum
- ag_resname
- ag_atom

Only report residues that are amino acids, i.e. those with `het_flag == ' '`.

In [None]:
def atomic_contact_points(ab_chain, ag_chain, distance):
    ...
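One possible way to fill in the stub is sketched below; it is not the only solution. It indexes the antigen atoms once with `NeighborSearch` and queries that index for every antibody atom. The helper names `_is_amino_acid`, `_resnum` and `toy_chain`, as well as the toy chains and their coordinates, are made up for illustration.

```python
import numpy as np
from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Residue import Residue


def _is_amino_acid(residue):
    # Following the text above: a blank het flag marks a standard residue.
    return residue.get_id()[0] == " "


def _resnum(residue):
    # Sequence number plus insertion code, e.g. '52A'; a blank icode is dropped.
    _, seq, icode = residue.get_id()
    return f"{seq}{icode.strip()}"


def atomic_contact_points(ab_chain, ag_chain, distance):
    ag_atoms = [a for a in ag_chain.get_atoms() if _is_amino_acid(a.get_parent())]
    if not ag_atoms:
        return []  # NeighborSearch cannot be built from an empty atom list
    ns = NeighborSearch(ag_atoms)
    contacts = []
    for ab_atom in ab_chain.get_atoms():
        ab_res = ab_atom.get_parent()
        if not _is_amino_acid(ab_res):
            continue
        for ag_atom in ns.search(ab_atom.coord, distance):
            ag_res = ag_atom.get_parent()
            contacts.append(dict(
                ab_resnum=_resnum(ab_res), ab_resname=ab_res.get_resname(),
                ab_atom=ab_atom.get_id(),
                ag_resnum=_resnum(ag_res), ag_resname=ag_res.get_resname(),
                ag_atom=ag_atom.get_id(),
            ))
    return contacts


def toy_chain(chain_id, rows):
    """Build a chain from (residue id, resname, atom name, element, xyz) rows."""
    chain = Chain(chain_id)
    for res_id, resname, atom_name, element, xyz in rows:
        residue = Residue(res_id, resname, "")
        chain.add(residue)
        residue.add(Atom(atom_name, np.array(xyz, dtype="f"), 0.0, 1.0, " ",
                         atom_name, 0, element=element))
    return chain


# Toy data: one antibody residue plus a water (het flag 'W') that must be skipped.
ab = toy_chain("H", [((" ", 52, "A"), "GLY", "N", "N", (0.0, 0.0, 0.0)),
                     (("W", 100, " "), "HOH", "O", "O", (1.0, 0.0, 0.0))])
ag = toy_chain("G", [((" ", 13, " "), "TYR", "OH", "O", (3.0, 0.0, 0.0)),
                     ((" ", 14, " "), "GLY", "CA", "C", (10.0, 0.0, 0.0))])
contacts = atomic_contact_points(ab, ag, 4.0)
print(contacts)
```

Building the index on the antigen side means each antibody atom costs one search, rather than comparing every antibody atom with every antigen atom.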

We want to create a DataFrame that contains atomic contacts for all the structures in our cleaned summary file.

The desired result is a DataFrame with columns

- `pdb_id`, e.g. '9ds1'
- `ab_chain`, e.g. 'H' 
- `ab_chaintype`, 'heavy' or 'light'
- `ab_resnum`, including icode, e.g. '52A'
- `ab_resname`
- `ab_atom`
- `ag_chain`, e.g. 'G'
- `ag_resnum`, e.g. '13'
- `ag_resname`, e.g. 'TYR'
- `ag_atom`


We can accomplish this using our `atomic_contact_points` function from above.

- read the cleaned summary file into a pandas DataFrame
- for each row of the summary DataFrame
    - read the PDB file
    - find atomic contact points for the heavy chain (`Hchain`)
    - find atomic contact points for the light chain (`Lchain`)
    - add the columns `pdb_id` and `ab_chaintype` ('heavy' or 'light')

Finally, concatenate the results into a single DataFrame.
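The steps above could be sketched as follows. This is a rough outline under stated assumptions: the summary column names `pdb` and `antigen_chain` are guesses (only `Hchain` and `Lchain` are named above), `collect_contacts` and its `load_model` and `contact_fn` parameters are hypothetical, and the demo uses stand-in objects instead of parsed structures. As a variant of concatenating per-chain DataFrames, the sketch accumulates all records in one list and builds the DataFrame once.

```python
import pandas as pd

COLUMNS = ["pdb_id", "ab_chain", "ab_chaintype", "ab_resnum", "ab_resname", "ab_atom",
           "ag_chain", "ag_resnum", "ag_resname", "ag_atom"]


def collect_contacts(summary, load_model, contact_fn, distance=4.0):
    """Gather atomic contacts for every row of the summary DataFrame.

    load_model(pdb_id) must return a mapping from chain id to chain;
    contact_fn(ab_chain, ag_chain, distance) must return a list of dicts.
    """
    records = []
    for row in summary.itertuples(index=False):
        model = load_model(row.pdb)              # assumed column name: 'pdb'
        ag_chain = model[row.antigen_chain]      # assumed column name: 'antigen_chain'
        for chaintype, ab_id in (("heavy", row.Hchain), ("light", row.Lchain)):
            if pd.isna(ab_id):
                continue  # skip entries without this chain
            for contact in contact_fn(model[ab_id], ag_chain, distance):
                records.append({"pdb_id": row.pdb, "ab_chain": ab_id,
                                "ab_chaintype": chaintype,
                                "ag_chain": row.antigen_chain, **contact})
    return pd.DataFrame(records, columns=COLUMNS)


# Demo with stand-ins: a one-row summary, a fake model and a fake contact function.
summary = pd.DataFrame({"pdb": ["9ds1"], "Hchain": ["H"], "Lchain": [None],
                        "antigen_chain": ["G"]})
fake_model = {"H": "heavy-chain placeholder", "G": "antigen-chain placeholder"}
fake_contact = dict(ab_resnum="52A", ab_resname="GLY", ab_atom="N",
                    ag_resnum="13", ag_resname="TYR", ag_atom="OH")
all_contacts = collect_contacts(summary,
                                load_model=lambda pdb_id: fake_model,
                                contact_fn=lambda ab, ag, d: [fake_contact])
print(all_contacts)
```

With real data, `load_model` would parse the PDB file and return its first model, and `contact_fn` would be `atomic_contact_points`.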

