# PDB Preparer
This notebook prepares the geometries on which we will perform quantum mechanical calculations. The procedure is to download the data from the web, patch any mistakes in the metadata, add hydrogens to the structures, and perform an initial relaxation.

In [1]:
from os.path import join
from futile.Utils import ensure_dir
from os import system
geomdir = "raw-structures"
tempdir = "temp"
ensure_dir(tempdir)
outdir = "processed-structures"
ensure_dir(outdir)
solvdir = "solvated-structures"
ensure_dir(solvdir)
picdir = "pictures"
ensure_dir(picdir)
zipfile = "https://www.diamond.ac.uk/dam/jcr:6423a0d7-9b25-4dc1-b44d-6d6665fd6e32/Mpro_All_PDBs%20-%20ver%202020-03-24.zip"
excelfile = "https://www.diamond.ac.uk/dam/jcr:cb44b3b1-fb14-4376-b172-ce45cbd66b48/Mpro%20full%20XChem%20screen%20-%20hits%20summary%20-%20ver-2020-03-25.xlsx"
fdir = "Mpro_All_PDBs - ver 2020-03-24"

## Metadata Preparation
The metadata for each structure is stored in an Excel spreadsheet on the web. Here we download this data and patch it up. First, we download the structure archive and the spreadsheet, skipping any file that already exists locally.

In [2]:
from os.path import exists
from urllib.request import urlretrieve
if not exists("files.zip"):
    urlretrieve(zipfile, 'files.zip')
if not exists("data.xlsx"):
    urlretrieve(excelfile, 'data.xlsx')

Unpack the archive and rename the extracted directory to something sensible.

In [3]:
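from os import rename
from shutil import rmtree
from zipfile import ZipFile
# GNU tar cannot read zip archives, so extract with the standard library.
with ZipFile("files.zip") as archive:
    archive.extractall()
rmtree(geomdir, ignore_errors=True)  # clear leftovers from a previous run
rename(fdir, geomdir)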

Now load the data stored in the Excel spreadsheet into a pandas DataFrame. The result is cached as a pickle so that later runs can skip parsing the spreadsheet.

In [4]:
from pickle import load
pname = "raw-data.pickle"
if exists(pname):
    with open(pname, "rb") as ifile:
        data = load(ifile)
else:
    from pandas import read_excel
    from pickle import dump
    data = read_excel("data.xlsx")
    with open(pname, "wb") as ofile:
        dump(data, ofile)
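
The load-or-build pattern in the cell above can be factored into a small helper. The following is a minimal sketch, not part of the original notebook: the names `load_cached` and `build_table` are hypothetical, and a plain dictionary stands in for the DataFrame so that the example runs without the spreadsheet.

```python
import os
import pickle
import tempfile

def load_cached(path, build):
    """Return the object pickled at `path`, building and caching it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as ifile:
            return pickle.load(ifile)
    obj = build()
    with open(path, "wb") as ofile:
        pickle.dump(obj, ofile)
    return obj

# Toy demonstration: the builder runs only on the first call.
calls = []
def build_table():
    calls.append(1)
    return {"Crystal ID": ["toy-0001", "toy-0002"]}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy.pickle")
    first = load_cached(cache, build_table)   # builds and writes the cache
    second = load_cached(cache, build_table)  # reads the cache
```

In the notebook, `build` would wrap the `read_excel("data.xlsx")` call. Note that a stale cache is never refreshed by this scheme; delete the pickle by hand if the spreadsheet changes.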

At least one structure has the ligand in two separate positions. We split each such structure into one input file per ligand position and update the table to match.

In [5]:
from BigDFT.IO import read_pdb
from copy import deepcopy
from pandas import concat

geomlist = deepcopy(data["Crystal ID"])
droplist = []
for i, geom in enumerate(geomlist):
    with open(join(geomdir, geom+".pdb")) as ifile:
        structure = read_pdb(ifile)
    ligands = [x for x in structure if "LIG" in x]
    if len(ligands) > 1:
        for j, lig in enumerate(ligands):
            # Create the split PDB file, keeping only the lines of ligand j
            with open(join(geomdir, geom+"-"+str(j)+".pdb"), "w") as ofile:
                with open(join(geomdir, geom+".pdb")) as ifile:
                    for line in ifile:
                        split = line.split()
                        # split[4] is read below, so require at least five tokens
                        if len(split) > 4 and split[3] == "LIG":
                            lineid = split[3] + ":" + str(split[4])[1:]
                            if lineid != lig:
                                continue
                        ofile.write(line)

            # Correct the data frame: add one row per split file
            row = data.iloc[[i]].copy()
            row["Crystal ID"] += "-"+str(j)
            # DataFrame.append was removed in pandas 2.0; use concat instead
            data = concat([data, row], ignore_index=True)
        droplist.append(i)

# Delete the no longer needed rows
data.drop(droplist, inplace=True)
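
The table update in the cell above replaces one row by several copies whose IDs carry a numeric suffix. As a minimal sketch of that step in isolation, assuming a toy table with made-up IDs and a hypothetical helper named `split_row`:

```python
import pandas as pd

def split_row(frame, position, n_copies, column="Crystal ID"):
    """Replace the row at `position` by `n_copies` rows with suffixed IDs."""
    copies = []
    for j in range(n_copies):
        row = frame.iloc[[position]].copy()  # one-row frame keeps the dtypes
        row[column] = row[column] + "-" + str(j)
        copies.append(row)
    kept = frame.drop(frame.index[position])
    return pd.concat([kept] + copies, ignore_index=True)

# Toy table with hypothetical IDs, not taken from the real spreadsheet.
toy = pd.DataFrame({"Crystal ID": ["toy-a", "toy-b", "toy-c"],
                    "Site": ["s1", "s2", "s3"]})
result = split_row(toy, 1, 2)
```

Unlike the notebook cell, this helper returns a new frame and leaves its input untouched, which makes the step easier to check on its own.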

The covalent bonding of the ligand is recorded in the LINK lines of each PDB file. We extract that information here.

In [6]:
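covalent = {}
for geom in data["Crystal ID"]:
    # Default to None so that every structure has an entry
    covalent[geom] = None
    with open(join(geomdir, geom+".pdb")) as ifile:
        for line in ifile:
            # LINK records begin the line; ignore other lines mentioning "LINK"
            if line.startswith("LINK"):
                split = line.split()
                linkid = split[5] + ":" + split[7]
                covalent[geom] = linkid
# Look up by ID so that the column stays aligned with the rows
data["Link"] = [covalent[geom] for geom in data["Crystal ID"]]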
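
PDB is a fixed-column format, so whitespace splitting of a LINK record can shift the token positions when a chain letter and a four-digit residue number run together. The sketch below reads the two partners of a LINK record by column instead. The column positions follow the wwPDB format description; the helper names `link_line` and `parse_link` and the sample atoms are hypothetical, and this is an alternative to the cell above rather than the notebook's own method.

```python
def link_line(resseq1, resseq2):
    """Format a minimal LINK record with hypothetical atoms for demonstration."""
    return "{:<12}{:<4} {:>3} {:1}{:>4}{:<16}{:<4} {:>3} {:1}{:>4}".format(
        "LINK", " SG", "CYS", "A", resseq1, "", " C1", "LIG", "A", resseq2)

def parse_link(line):
    """Split a PDB LINK record into its two partners using fixed columns."""
    def partner(offset):
        # offset is the 0-based start of the atom-name field
        return {
            "atom": line[offset:offset + 4].strip(),
            "resname": line[offset + 5:offset + 8].strip(),
            "chain": line[offset + 9:offset + 10].strip(),
            "resseq": line[offset + 10:offset + 14].strip(),
        }
    # Atom names start at columns 13 and 43 (1-based) in a LINK record.
    return partner(12), partner(42)

near = link_line(145, 404)    # chain and residue number are separate tokens
far = link_line(145, 1001)    # chain and residue number fuse into "A1001"

first_near, second_near = parse_link(near)
first_far, second_far = parse_link(far)
```

Here `near.split()` and `far.split()` produce different numbers of tokens, while `parse_link` returns the same fields for both lines.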

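There are some mistakes in the data set, which we fix manually here. For the structures listed below, we strip chlorine (`Cl`) or bromine (`Br`) from the SMILES string and store the result in a new column, `Modified Compound SMILES`.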

In [7]:
updated_data = deepcopy(data)
for i, row in data.iterrows():
    geom = row["Crystal ID"]
    smi = row["Compound SMILES"]
    
    if geom in ["Mpro-x0705", "Mpro-x0708", "Mpro-x0731", "Mpro-x0736", "Mpro-x0771",
                "Mpro-x0786", "Mpro-x1412"]:
        updated_data.at[i, "Modified Compound SMILES"] = smi.replace("Cl", "")
    if geom in ["Mpro-x0978", "Mpro-x0981"]:
        updated_data.at[i, "Modified Compound SMILES"] = smi.replace("Br", "")
        
data = updated_data
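
The same kind of patch can be driven by a lookup table rather than by hard-coded lists. This is a minimal sketch on a toy table: the IDs, SMILES strings, and the names `patches` and `patch_smiles` are all hypothetical. One deliberate difference from the cell above is that rows without a patch keep their original SMILES in the new column instead of being left empty.

```python
import pandas as pd

# Hypothetical IDs and SMILES strings, for illustration only.
toy = pd.DataFrame({
    "Crystal ID": ["toy-1", "toy-2", "toy-3"],
    "Compound SMILES": ["CC(=O)CCl", "BrCC(N)=O", "CCO"],
})

# Map each ID that needs patching to the substring to strip.
patches = {"toy-1": "Cl", "toy-2": "Br"}

def patch_smiles(row):
    """Return the SMILES with the listed substring removed, else unchanged."""
    remove = patches.get(row["Crystal ID"])
    smi = row["Compound SMILES"]
    return smi.replace(remove, "") if remove else smi

toy["Modified Compound SMILES"] = toy.apply(patch_smiles, axis=1)
```

As in the notebook, `str.replace` removes every occurrence of the substring, so a molecule with two chlorine atoms would lose both.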

This concludes our modifications to the dataset, which can now be written to file.

In [8]:
from pickle import dump
with open("updated-data.pickle", "wb") as ofile:
    dump(data, ofile)