## Find neighbours of atoms in PDB structures using Biopython

A PDB structure contains the (x, y, z) coordinates of every atom it describes. Biopython provides `NeighborSearch`, a fast KD-tree-based search that finds all atoms within a given distance of a given coordinate.

In [2]:
import os.path
from Bio.PDB import PDBParser, NeighborSearch

Define the directory where the Chothia-renumbered PDB files are stored.

In [3]:
PDB_DIR = "../data/pdbs"

Given the `pdb_id`, build the path to the corresponding PDB file.

In [4]:
pdb_id = "9ds1"
filename = os.path.join(PDB_DIR, f"{pdb_id}_chothia.pdb")

First, we need to parse the PDB file.

In [5]:
parser = PDBParser(PERMISSIVE=1)
structure = parser.get_structure(pdb_id, filename)

Then we need to initialise the data structure that performs the neighbour search. For demonstration we index all atoms of the structure. This is wasteful if only some atoms can ever be relevant hits: for antibody-antigen contacts, for example, it is enough to index the atoms of the antigen.

In [6]:
atoms = list(structure.get_atoms())
ns = NeighborSearch(atoms)
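
As a rough sketch of restricting the index to one chain, the example below builds a tiny in-memory structure so that it runs without a PDB file. The helper `make_chain` and all coordinates are hypothetical. With the real data one would pass `structure[0]['A'].get_atoms()` instead, assuming chain `A` is the antigen.

```python
import numpy as np
from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Residue import Residue


def make_chain(chain_id, ca_coords):
    """Hypothetical helper: a chain of GLY residues with one CA atom each."""
    chain = Chain(chain_id)
    for i, xyz in enumerate(ca_coords, start=1):
        residue = Residue((" ", i, " "), "GLY", "")
        chain.add(residue)
        coord = np.array(xyz, dtype="f")
        residue.add(Atom("CA", coord, 0.0, 1.0, " ", " CA ", i, element="C"))
    return chain


# Toy coordinates, not taken from any real structure.
heavy = make_chain("H", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)])
antigen = make_chain("A", [(0.0, 3.0, 0.0), (20.0, 0.0, 0.0)])

# Index only the antigen atoms; heavy-chain atoms can then never be returned.
ns_antigen = NeighborSearch(list(antigen.get_atoms()))

query = heavy[1]["CA"].coord
hits = ns_antigen.search(query, 4.0)

# Report (chain id, residue number, atom name) for every hit.
hit_ids = [(a.get_parent().get_parent().id, a.get_parent().id[1], a.get_id())
           for a in hits]
print(hit_ids)
```

The neighbouring heavy-chain atom at 3.8 angstroms is not reported because it was never added to the index.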

Now we can use the `ns.search(coords, distance)` method to find all atoms that are within `distance` angstroms from the position specified by the `[x, y, z]` array `coords`.

We can retrieve the position of an atom using its `coord` property:

In [12]:
atom = structure[0]['H'][52]['N']
atom.coord

array([ 14.805, -56.923,  25.988], dtype=float32)

To find all atoms in the vicinity of an atom, we simply pass the atom's coordinates to the `ns.search` method.

In [13]:
close_atoms = ns.search(atom.coord, 4.0)
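
`NeighborSearch.search` also takes an optional `level` argument (`"A"` for atoms, the default, `"R"` for residues, `"C"` for chains). The sketch below illustrates the difference on a toy chain with made-up coordinates. The names `toy`, `ns_toy` and `center` are chosen for illustration only.

```python
import numpy as np
from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Residue import Residue

# Toy chain with two atoms per residue; coordinates are made up.
toy = {
    51: [("N", (0.0, 0.0, 0.0)), ("CA", (1.5, 0.0, 0.0))],
    52: [("N", (3.0, 0.0, 0.0)), ("CA", (4.5, 0.0, 0.0))],
    53: [("N", (30.0, 0.0, 0.0)), ("CA", (31.5, 0.0, 0.0))],
}
chain = Chain("H")
serial = 0
for resseq, atoms in toy.items():
    residue = Residue((" ", resseq, " "), "GLY", "")
    chain.add(residue)
    for name, xyz in atoms:
        serial += 1
        coord = np.array(xyz, dtype="f")
        residue.add(Atom(name, coord, 0.0, 1.0, " ", name, serial, element=name[0]))

ns_toy = NeighborSearch(list(chain.get_atoms()))
center = chain[52]["N"].coord

near_atoms = ns_toy.search(center, 2.0)                # Atom objects (level="A")
near_residues = ns_toy.search(center, 2.0, level="R")  # unique parent residues

atom_labels = sorted((a.get_parent().id[1], a.get_id()) for a in near_atoms)
residue_numbers = sorted(r.id[1] for r in near_residues)
print(atom_labels)
print(residue_numbers)
```

Note that the query atom itself is among the hits, because it is part of the index and lies at distance zero from its own coordinates.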

To check the result, compute the actual distance between the query atom and the first hit:

In [22]:
import math
math.sqrt(sum((atom.coord - close_atoms[0].coord)**2))

2.9126884759651213
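
The same distance can be obtained more compactly. The sketch below compares the manual formula with `np.linalg.norm` and with the subtraction operator that Biopython defines on `Atom` objects, using two detached toy atoms with made-up coordinates.

```python
import math

import numpy as np
from Bio.PDB.Atom import Atom

# Two detached toy atoms; the coordinates are chosen so the distance is 3.
a1 = Atom("N", np.array([0.0, 0.0, 0.0], dtype="f"), 0.0, 1.0, " ", " N  ", 1, element="N")
a2 = Atom("O", np.array([1.0, 2.0, 2.0], dtype="f"), 0.0, 1.0, " ", " O  ", 2, element="O")

d_manual = math.sqrt(sum((a1.coord - a2.coord) ** 2))   # formula used above
d_norm = float(np.linalg.norm(a1.coord - a2.coord))     # Euclidean norm of the difference
d_operator = float(a1 - a2)                             # Atom - Atom returns the distance

print(d_manual, d_norm, d_operator)
```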

`close_atoms` now contains the `Atom` objects of all atoms that are within 4.0 angstroms of the supplied coordinates. To find the residues and chains these atoms belong to, we can either apply the `get_parent()` method repeatedly, or simply use the `get_full_id()` method.

As we want the output as a DataFrame, we first create a list of dicts and then build the DataFrame from it.

In [9]:
import pandas as pd

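data = []

for close_atom in close_atoms:
    # full id: (structure, model, chain, (het_flag, resseq, icode), (name, altloc))
    s, m, c, r, a = close_atom.get_full_id()
    # residue number plus insertion code; the icode is ' ' when absent
    resnum = f"{r[1]}{r[2].strip()}"

    # to get the resname we need to access the residue the atom belongs to
    residue = close_atom.get_parent()
    resname = residue.get_resname()
    data.append(dict(chain=c, resnum=resnum, resname=resname, atom=a[0]))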


df = pd.DataFrame(data)

df

Unnamed: 0,chain,resnum,resname,atom
0,H,52,SER,OG
1,H,56,SER,O
2,H,52,SER,N
3,H,51,ILE,CG1
4,H,52,SER,O
5,H,51,ILE,O
6,H,52,SER,CB
7,H,52,SER,CA
8,H,52,SER,C
9,H,52A,GLY,N


Now we have learned how to find all atoms within the vicinity of a given atom. 

We can use this to find the atomic contact points between residues of an antibody and residues of its bound antigen.
We say that a residue of the antibody is in contact with a residue of the antigen if the distance
between any pair of their atoms is less than a threshold distance. Often a threshold of 4 angstroms is used.

Write a function `atomic_contact_points(ab_chain, ag_chain, distance)` that loops over the atoms in `ab_chain`, finds all atoms of `ag_chain` that are within `distance`, and returns a DataFrame with one row per atom pair and the columns
- ab_resnum
- ab_icode
- ab_resname
- ab_atom
- ag_resnum
- ag_icode
- ag_resname
- ag_atom

Only report pairs in which both residues are standard residues rather than hetero groups or waters, i.e. het_flag == ' ' (the first element of the residue id).

In [10]:
def atomic_contact_points(ab_chain, ag_chain, distance):
    res = []
    ns = NeighborSearch(list(ag_chain.get_atoms()))
    for ab_atom in ab_chain.get_atoms():
        ab_res = ab_atom.get_parent()
        close_ag_atoms = ns.search(ab_atom.coord, distance)
        for ag_atom in close_ag_atoms:
            ag_res = ag_atom.get_parent()
            if ab_res.id[0] == ' ' and ag_res.id[0] == ' ':
                tmp = dict(ab_resnum = ab_res.id[1],
                           ab_icode =  ab_res.id[2],
                           ab_resname = ab_res.get_resname(),
                           ab_atom = ab_atom.id,
                           ag_resnum = ag_res.id[1],
                           ag_icode = ag_res.id[2],
                           ag_resname = ag_res.get_resname(),
                           ag_atom = ag_atom.id)
                res.append(tmp)

    return pd.DataFrame(res)

In [11]:
cp = atomic_contact_points(structure[0]['H'], structure[0]['A'], 4.0)
cp

Unnamed: 0,ab_resnum,ab_icode,ab_resname,ab_atom,ag_resnum,ag_icode,ag_resname,ag_atom
0,55,,ILE,C,10,,GLY,CA
1,55,,ILE,O,9,,PRO,CB
2,55,,ILE,O,9,,PRO,CG
3,55,,ILE,O,9,,PRO,C
4,55,,ILE,O,10,,GLY,N
5,55,,ILE,O,10,,GLY,CA
6,55,,ILE,O,9,,PRO,O
7,55,,ILE,CG2,9,,PRO,CB
8,55,,ILE,CG2,9,,PRO,CG
9,55,,ILE,CG2,10,,GLY,N
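
One way to gain confidence in a contact search like this is to compare it with a brute-force distance matrix. The sketch below does so on random toy atoms. The helper `toy_atoms`, the atom names and the coordinates are all hypothetical and unrelated to the structure above.

```python
import numpy as np
from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom

rng = np.random.default_rng(0)


def toy_atoms(n, prefix):
    """Hypothetical helper: n detached carbon atoms placed at random in a 10 angstrom box."""
    coords = rng.uniform(0.0, 10.0, size=(n, 3)).astype("f")
    return [Atom(f"{prefix}{i}", xyz, 0.0, 1.0, " ", f"{prefix}{i}", i, element="C")
            for i, xyz in enumerate(coords)]


ab_atoms = toy_atoms(30, "B")   # stand-in for antibody atoms
ag_atoms = toy_atoms(40, "G")   # stand-in for antigen atoms
cutoff = 4.0

# KD-tree route, as in atomic_contact_points: index one side, query with the other.
ns_ag = NeighborSearch(ag_atoms)
kd_pairs = {(ab.get_id(), ag.get_id())
            for ab in ab_atoms
            for ag in ns_ag.search(ab.coord, cutoff)}

# Brute-force route: full pairwise distance matrix via broadcasting.
ab_xyz = np.array([a.coord for a in ab_atoms], dtype=float)
ag_xyz = np.array([a.coord for a in ag_atoms], dtype=float)
dist = np.linalg.norm(ab_xyz[:, None, :] - ag_xyz[None, :, :], axis=-1)
bf_pairs = {(ab_atoms[i].get_id(), ag_atoms[j].get_id())
            for i, j in zip(*np.nonzero(dist <= cutoff))}

print(len(kd_pairs), kd_pairs == bf_pairs)
```

The brute-force matrix needs memory proportional to the product of the two atom counts, which is why the KD-tree search is preferable for whole chains.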


In [177]:
cp.groupby(['ab_resnum', 'ab_icode', 'ab_resname']).agg(
    natomcontacts=('ag_resnum', 'size'),
    nrescontacts=('ag_resnum', 'nunique'),
).reset_index()

Unnamed: 0,ab_resnum,ab_icode,ab_resname,natomcontacts,nrescontacts
0,55,,ILE,10,2
1,56,,SER,9,3
2,58,,TYR,5,2
3,100,A,ILE,5,2
4,100,B,TRP,1,1


We are also interested in the occurrence of residue numbers in the heavy and light chains.
Write a function `residue_occurrence(chain)` that creates as output a DataFrame with columns 
- ab_resnum
- ab_icode
- ab_resname

Restrict the output to residues with het_flag == ' ' and resnum <= 128.

In [178]:
def residue_occurrence(chain):
    results = []
    for res in chain.get_residues():
        if res.id[0] == ' ' and res.id[1] <= 128:
            tmp = dict(ab_resnum = res.id[1],
                       ab_icode = res.id[2],
                       ab_resname = res.get_resname())
            results.append(tmp)

    return pd.DataFrame(results)

In [179]:
ro = residue_occurrence(structure[0]['H'])
ro

Unnamed: 0,ab_resnum,ab_icode,ab_resname
0,1,,GLU
1,2,,VAL
2,3,,GLN
3,4,,LEU
4,5,,LEU
...,...,...,...
134,124,,ALA
135,125,,PRO
136,126,,SER
137,127,,SER


In [180]:
ro.insert(loc = 0, column = 'chain_type', value = 'heavy')
ro.insert(loc = 0, column = 'pdb_id', value = pdb_id)
ro

Unnamed: 0,pdb_id,chain_type,ab_resnum,ab_icode,ab_resname
0,9ds1,heavy,1,,GLU
1,9ds1,heavy,2,,VAL
2,9ds1,heavy,3,,GLN
3,9ds1,heavy,4,,LEU
4,9ds1,heavy,5,,LEU
...,...,...,...,...,...
134,9ds1,heavy,124,,ALA
135,9ds1,heavy,125,,PRO
136,9ds1,heavy,126,,SER
137,9ds1,heavy,127,,SER
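
To cover both chains, the labelled tables can be stacked with `pd.concat`. The sketch below uses small made-up DataFrames in place of real `residue_occurrence` output. The helper `label_occurrence`, the id `"toy"` and the residue values are hypothetical.

```python
import pandas as pd


def label_occurrence(ro, pdb_id, chain_type):
    """Hypothetical helper: prepend pdb_id and chain_type columns to a copy of ro."""
    out = ro.copy()
    out.insert(loc=0, column="chain_type", value=chain_type)
    out.insert(loc=0, column="pdb_id", value=pdb_id)
    return out


# Toy stand-ins for the output of residue_occurrence (values are made up).
ro_heavy = pd.DataFrame({"ab_resnum": [1, 2], "ab_icode": [" ", " "],
                         "ab_resname": ["GLU", "VAL"]})
ro_light = pd.DataFrame({"ab_resnum": [1, 2], "ab_icode": [" ", " "],
                         "ab_resname": ["ASP", "ILE"]})

combined = pd.concat(
    [label_occurrence(ro_heavy, "toy", "heavy"),
     label_occurrence(ro_light, "toy", "light")],
    ignore_index=True,   # renumber rows 0..n-1 across both chains
)
print(combined)
```

Working on a copy keeps the original table unchanged, so the labelling step can be re-run without raising a duplicate-column error from `insert`.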
