# Docking basics

This is a notebook intended to be run in Colab. It is notebook 2 of the series:

1. Intro to RDKit: [![colab demo](https://img.shields.io/badge/Run_RDKit_intro-f9ab00?logo=googlecolab)](https://colab.research.google.com/github/matteoferla/DTC-compchem-practical/blob/main/1-basics.ipynb) — Overview of RDKit functionality
2. Intro to Forcefields & docking: [![colab demo](https://img.shields.io/badge/Run_Docking_intro-f9ab00?logo=googlecolab)](https://colab.research.google.com/github/matteoferla/DTC-compchem-practical/blob/main/2-redocking.ipynb) — Overview of forcefields in PyRosetta and redocking
3. Merging: [![colab demo](https://img.shields.io/badge/Run_RDKit_intro-f9ab00?logo=googlecolab)](https://colab.research.google.com/github/matteoferla/DTC-compchem-practical/blob/main/3-merging.ipynb) — Example of merging

## Overview
In this notebook we will use PyRosetta to learn the basics of forcefields and to understand what happens in a docking protocol.
Finally, we will dock small molecules with a known bound structure in order to compare the results with the empirical data.

There are myriad docking algorithms. In this notebook we use PyRosetta because its terms can be dissected easily
and because it will be used in the next notebook. Other options include Gold, rDock, OpenEye Dock, etc. Each has its pros and cons: for example, Glide is one of the top-ranking docking programs, but it is part of the impressive Schrödinger suite, which is expensive.

For the analysis, we will use fragment screen data from [Fragalysis](https://fragalysis.diamond.ac.uk/),
the app that provides an interface to the various datasets in XChem, Prof. Frank von Delft's group at Diamond.
For what is what consult [this table](https://github.com/matteoferla/munged-Fragalysis-targets/blob/main/targets.md).
In this practical we will be using it for the data, but you are welcome to explore it.
You will be shown it properly in the Diamond visit.
Additionally, a key idea is that fragment binding sites are in no way of equal importance to a researcher,
e.g. designing an inhibitor for an enzyme requires knowledge of where and how catalysis occurs.
This is also beyond the scope of this practical, but worth keeping in mind.

In [None]:
#@title Installation
local_debug = False
if local_debug:
    raise Exception('CURRENTLY IN DEBUG MODE.... REMEMBER TO CLEAR ALL CELLS!')
#@markdown Press the play button on the top right hand side of this cell
#@markdown once you have checked the settings.
#@markdown You will be notified that this notebook is not from Google, that is normal.

## Install all requirements and get some goodies
!pip install git+https://github.com/matteoferla/DTC-compchem-practical.git
# this will be called as:
# import DTC_compchem_practical as dtc

## Jupyter lab? use `trident-chemwidgets`
!pip install git+https://github.com/matteoferla/JSME_notebook_hack.git
!pip install --upgrade plotly

# The next line is only valid for today without the Odin+Eduroam network
# ie. your IP address is one of these https://help.it.ox.ac.uk/ip-addresses#collapse2202811
#!pip install https://www.stats.ox.ac.uk/~ferla/pyrosetta-2022.46+release.f0c6fca0e2f-cp39-cp39-linux_x86_64.whl
!pip install https://www.stats.ox.ac.uk/~ferla/pyrosetta-2022.47+release.d2aee95a6b7-cp37-cp37m-linux_x86_64.whl
# Normally you have different ways of installing pyrosetta, e.g.
# pip install pyrosetta_help
# PYROSETTA_USERNAME=👾👾👾 PYROSETTA_PASSWORD=👾👾👾 install_pyrosetta

from google.colab import output  # noqa (It's a colaboratory specific repo)
output.enable_custom_widget_manager()

In [None]:
#@title Download off Fragalysis
#@markdown Choose a target
target_name = '👾👾👾'   #@param {type:"string"}
if local_debug:
    target_name = 'MID2A'

from rdkit import Chem
from IPython.display import display
from typing import Dict
import DTC_compchem_practical as dtc

#@markdown This will add the variables `pdb_filename`, `metadata_filename` and `sdf_filename`.
filenames: Dict[str, str] = dtc.download_fragalysis(target_name, 'input')
pdb_filename: str = filenames['reference.pdb']
metadata_filename: str = filenames['metadata.csv']
sdf_filename: str = filenames['combined.sdf']

In [None]:
#@title Make an apo structure
#@markdown Next we crudely remove HETATM record lines to get an apo structure.
#@markdown This is a quick, but not great, approach.

#@markdown [PDB file format](https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html)
#@markdown and [PDB for general overview](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/primary-sequences-and-the-pdb-format)

from io import StringIO
with open(pdb_filename) as fh:
    pdb_block: str = fh.read()

# keep every line that is not a HETATM record (the record name starts the line)
apo_block = '\n'.join(line for line in pdb_block.split('\n') if not line.startswith('HETATM'))

with open(f'input/{target_name}_reference.clean.pdb', 'w') as fh:
    fh.write(apo_block)

# This is w/o ligand
import nglview as nv

view = nv.NGLWidget()
# change `apo_block` to `pdb_block` for the original:
view.add_component(StringIO(apo_block), ext='pdb')
view

## Questions
> What is stored in a HETATM record? (see the PDB file format links in the cell above)

👾👾👾

> What is an "apo structure"?

👾👾👾

> Why is crudely removing heteroligand atoms bad, and what could be done to fix it?

👾👾👾

In [None]:
#@title Make a combined table
#@markdown Fragalysis does not give attributes in the SDF entries. These are instead stored in `metadata.csv`.

from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd

mol_df = pd.concat([PandasTools.LoadSDF(sdf_filename).set_index('ID'),
                       pd.read_csv(metadata_filename, index_col=0).set_index('crystal_name')
                      ], axis=1)
mol_df.to_pickle(f'input/{target_name}_df.p')

mol_df

In [None]:
#@title Initialise PyRosetta
import pyrosetta, logging
import pyrosetta_help as ph

import types
# short aliases for commonly used PyRosetta namespaces
prn: types.ModuleType = pyrosetta.rosetta.numeric
prc: types.ModuleType = pyrosetta.rosetta.core
prp: types.ModuleType = pyrosetta.rosetta.protocols
prs: types.ModuleType = prc.select.residue_selector


# capture to log
logger = ph.configure_logger()
logger.handlers[0].setLevel(logging.ERROR)  # logging.ERROR = 40, i.e. warnings are hidden
extra_options = ph.make_option_string(no_optH=False,
                                      ex1=None,
                                      ex2=None,
                                      mute='all',
                                      ignore_unrecognized_res=False,
                                      load_PDB_components=False,
                                      ignore_waters=True)
pyrosetta.init(extra_options=extra_options)

pose = pyrosetta.Pose()
pyrosetta.rosetta.core.import_pose.pose_from_pdbstring(pose, apo_block)

In [None]:
#@title Residue Topology
#@markdown As seen previously, a molecule is a graph where the nodes (atoms) may be connected by edges (bonds),
#@markdown and the nodes/atoms have a partial charge property.
#@markdown In molecular mechanics, blocks of atoms are called 'residues', be they ligands or polymer units.
#@markdown For several algorithms, such as those using forcefields, the residue needs to be "prepared"
#@markdown by adding how it bonds and its charges. Autodock uses the pdbqt format, which extends the PDB format with partial charges and atom types,
#@markdown while other tools have different formats. Rosetta has `.params` files, which add atom types and the relationship between
#@markdown atoms in dihedral space, not Cartesian space.
#@markdown A residue type / topology is the universal definition of a residue, not a specific residue.

#@markdown This cell outputs the params file for the molecule 'CO' (methanol).
#@markdown The format is specific to this toolkit, but the idea is common:
#@markdown for an atom you need an atom name and a partial charge and... an 'atom type'.
#@markdown An atom type combines element, hybridisation, VdW radius, etc. Like a residue type, it is a universal definition and not a specific atom.
from rdkit_to_params import Params

topo = Params.from_smiles('CO', name='LIG')
display(topo)


## Questions

> Why does the partial charge reside in an atom of a residue type not an atom type?

👾👾👾

> Why is bond order often absent in residue types/topologies?

👾👾👾

In [None]:
#@title Atomtype inspection
#@markdown Let's have a gander at what AtomTypes look like
import importlib_resources

print(
    importlib_resources.read_text('pyrosetta.database.chemical.atom_type_sets.fa_standard', 'atom_properties.txt')
)

## Difference in Gibbs free energy

The following cells will talk of 'score' or 'change in Gibbs free energy'.

A [change in Gibbs free energy](https://en.wikipedia.org/wiki/Gibbs_free_energy), written as ∆∆G (altered ∆G, say of a mutated protein, minus the reference/wild-type ∆G) or
∆G_bind (ligand specific: ∆G of the bound complex minus ∆G of a pose where the ligand is well away from the protein),
is the difference in potential energy relative to the reference state; a negative value means that energy is released.
Yesterday, Monday 28th, you were introduced to the Levinthal paradox and the folding funnel:
an unfolded protein rolls down the energy funnel, releasing energy and finding a _stable_ configuration.

An energy difference (the activation energy, E_a) is also seen in the [Arrhenius equation](https://en.wikipedia.org/wiki/Arrhenius_equation) for rates:

$$k=Ae^{\frac {-E_{\rm {a}}}{RT}}$$

and its variants (e.g. the [Eyring equation](https://en.wikipedia.org/wiki/Eyring_equation)). That is, the change in potential is proportional to the logarithm of the rate, which is why rates and related quantities are occasionally written in log form (e.g. pIC50 or pKd).
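
As a minimal worked sketch of this logarithmic relationship, the snippet below assumes the standard equilibrium relation ∆G = RT·ln(Kd) with a 1 M reference concentration. The function names are illustrative and are not part of the practical's library.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_from_kd(kd_molar: float, temperature: float = 298.15) -> float:
    """Binding free energy in kcal/mol from a dissociation constant in molar."""
    return R_KCAL * temperature * math.log(kd_molar)


def pkd(kd_molar: float) -> float:
    """Negative decimal logarithm of the dissociation constant."""
    return -math.log10(kd_molar)


# millimolar (fragment-like), micromolar and nanomolar binders
for kd in (1e-3, 1e-6, 1e-9):
    print(f'Kd = {kd:.0e} M, pKd = {pkd(kd):.0f}, dG = {dg_from_kd(kd):.1f} kcal/mol')
```

Under these assumptions, each tenfold improvement in Kd corresponds to roughly 1.4 kcal/mol at 25°C, which is why a linear scale in energy maps onto a logarithmic scale in affinity.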

It is named after a person (J. Willard Gibbs) and, in its simplest form, is the change in enthalpy minus temperature times the change in entropy (∆G = ∆H − T∆S),
although do be aware that heat capacity factors into it in macromolecular rate theory.

The [_RT_ denominator](https://en.wikipedia.org/wiki/KT_(energy)#RT) in the Arrhenius equation is the gas constant (R) times temperature (T), where R is the Boltzmann constant (k_B) times the Avogadro constant (N_A), i.e. R&#183;T = N_A&#183;k_B&#183;T.

At 25°C, RT is approximately 0.59 kcal/mol, rising only slightly to 0.62 kcal/mol at 37°C (cf. RT & Maxwell–Boltzmann distribution); the mean translational kinetic energy of a molecule, 3/2 RT, is about 0.9 kcal/mol at either temperature. A hydrogen bond is roughly -1 kcal/mol, a salt bridge is roughly -2 kcal/mol and pi-pi interactions are around -1.5 kcal/mol.

The unit is either kcal/mol or kJ/mol (1 kcal = 4.184 kJ). Do be vigilant about which is being used, as this is a common stumbling point. (But no, luckily Americans do not use horsepower per pound-mole.)
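
A small sketch of the thermal energy RT in both units may help when comparing numbers from different tools. The helper name `thermal_energy` is hypothetical; the constants are the standard gas constant and the thermochemical calorie.

```python
R_JOULE = 8.314462618   # gas constant in J/(mol*K)
KJ_PER_KCAL = 4.184     # thermochemical calorie


def thermal_energy(temp_celsius: float, unit: str = 'kcal/mol') -> float:
    """Return RT at the given temperature in kcal/mol or kJ/mol."""
    rt_kj = R_JOULE * (temp_celsius + 273.15) / 1000
    if unit == 'kJ/mol':
        return rt_kj
    if unit == 'kcal/mol':
        return rt_kj / KJ_PER_KCAL
    raise ValueError(f'Unknown unit: {unit}')


for temp in (25, 37):
    print(f'{temp} C: RT = {thermal_energy(temp):.3f} kcal/mol '
          f'= {thermal_energy(temp, "kJ/mol"):.3f} kJ/mol')
```

Raising an error on an unrecognised unit, rather than silently defaulting, is a deliberate choice here given how easily the two units are confused.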

A way to predict what this may be is to sum the various contributions of energy, such as Coulombic charge interactions, van der Waals interactions, etc.
A key term is the [Lennard-Jones potential](https://en.wikipedia.org/wiki/Lennard-Jones_potential):

$$V_{\text{LJ}}(r)=4\varepsilon \left[\left({\frac {\sigma }{r}}\right)^{12}-\left({\frac {\sigma }{r}}\right)^{6}\right]$$

In Rosetta it is split into an attractive term (the "six term") and a repulsive term (the "twelve term").
In many scorefunctions the output is not in kcal/mol or kJ/mol, and may not even correlate linearly with them.
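
The split of the Lennard-Jones potential into its two terms can be sketched in plain Python. This is the textbook form of the equation above, not Rosetta's actual implementation, which modifies the functional form; the `epsilon` and `sigma` defaults are illustrative values, not parameters of any real atom type.

```python
def lj_terms(r: float, epsilon: float = 0.2, sigma: float = 3.4) -> tuple:
    """Return the (attractive, repulsive) parts of the Lennard-Jones potential.

    r and sigma are in Angstrom, epsilon is the well depth in kcal/mol.
    """
    sr6 = (sigma / r) ** 6
    attractive = -4 * epsilon * sr6      # the "six term"
    repulsive = 4 * epsilon * sr6 ** 2   # the "twelve term"
    return attractive, repulsive


for r in (3.0, 3.4, 3.8, 5.0, 8.0):
    atr, rep = lj_terms(r)
    print(f'r = {r:.1f} A: attractive = {atr:7.3f}, repulsive = {rep:7.3f}, total = {atr + rep:7.3f}')
```

In this form the total is zero at r = sigma and reaches its minimum of -epsilon at r = 2^(1/6)·sigma; at shorter distances the twelve term grows far faster than the six term.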

A very common forcefield is the Amber forcefield, which is used by AlphaFold2 as seen yesterday. Forcefields are not all the same and may have different terms: for example, in the next few examples the scorefunction does not take into account bond lengths, angles and dihedrals. Another common difference is whether the polarisability of bromide or aromatic systems is modelled (via a "Drude particle" as an approximation). But these are all in the realm of molecular mechanics, whereas for more accurate, and more computationally expensive, calculations one has to venture into the realm of quantum mechanics. For example, whether two pi-pi interacting rings will stack in a T-shaped or a parallel arrangement depends on their dipole moments or their orbitals.

## Solvation

In the model used, the solvent is [_implicit_](https://en.wikipedia.org/wiki/Implicit_solvation). In MD simulations, the solvent is instead most often explicit. A common implicit model is the [Born equation](https://en.wikipedia.org/wiki/Born_equation); however, this is not the one used by Rosetta.
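
To give a feel for what an implicit solvent term looks like, here is a minimal sketch of the Born equation for a single spherical ion. It is an illustration only and is not how Rosetta models solvation; the function name, the 1 Å radius and the relative permittivity of 78.4 for water are assumptions made for the example.

```python
import math

E_CHARGE = 1.602176634e-19     # elementary charge in C
EPSILON_0 = 8.8541878128e-12   # vacuum permittivity in F/m
N_AVOGADRO = 6.02214076e23     # Avogadro constant in 1/mol


def born_solvation_energy(charge: float,
                          radius_angstrom: float,
                          relative_permittivity: float = 78.4) -> float:
    """Born solvation free energy in kJ/mol for an ion of the given charge and radius."""
    radius_m = radius_angstrom * 1e-10
    prefactor = (charge * E_CHARGE) ** 2 / (8 * math.pi * EPSILON_0 * radius_m)
    energy_joule = -prefactor * (1 - 1 / relative_permittivity)
    return energy_joule * N_AVOGADRO / 1000


# a hypothetical monovalent ion with a radius of 1 Angstrom, in water
print(f'{born_solvation_energy(charge=1, radius_angstrom=1.0):.1f} kJ/mol')
```

The energy scales with the square of the charge and inversely with the radius, so small, highly charged ions are the most strongly stabilised by a polar solvent in this model.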

In [None]:
#@title Forcefields
#@markdown In this experiment we will place a chloride ion and a sodium ion at different distances
#@markdown to see how the terms behave.
# For extra fun, we can change the charge of the chloride:

cl_charge: float = -1.0 #@param {type:"number"}

import DTC_compchem_practical as dtc
import numpy as np
import pandas as pd

combined_scores = {}

#@markdown A `pyrosetta.ScoreFunction` is a callable which returns the score (~∆G for some ScoreFunction instances)
#@markdown of a pose.
scorefxn: pyrosetta.ScoreFunction = pyrosetta.get_fa_scorefxn()
for offset in np.arange(0,10, 0.1):
    test: pyrosetta.Pose = pyrosetta.pose_from_sequence('Z[NA]')
    xyz = prn.xyzVector_double_t(test.residues[1].xyz(1))
    xyz.x +=offset
    dtc.add_mod_cl(test,
                   gasteiger=cl_charge,
                   xyz = xyz)
    scores = {st.name: scorefxn.score_by_scoretype(test, st, True) for st in scorefxn.get_nonzero_weighted_scoretypes()}
    scores['distance'] = (test.residue(1).xyz(1) - test.residue(2).xyz(1)).norm()
    combined_scores[offset] = scores

df = pd.DataFrame.from_dict(combined_scores, orient='index').round(2)
ndf=(df-df.min())/abs(df.max()-df.min())
#ndf.columns = map(ph.weights.term_meanings, ndf.columns.values)
import plotly.express as px

#@markdown For a dictionary of what the columns mean see `ph.weights.term_meanings`.
#@markdown `fa_atr` is the six-term of the LJ potential, `fa_rep` is the twelve-term.
#@markdown `fa_elec` is the Coulombic interaction (charge).
#@markdown `fa_sol` is a term to model the _implicit_ solvent used.
# plot each score term against the measured Na-Cl distance
fig = px.line(df.set_index('distance'))
fig.update_xaxes(title="Distance [Å]")
fig.update_yaxes(title="Energy [kcal/mol]")
fig

## Questions

> What force dominates when atoms are too close?

👾👾👾

> What happens between 2-4 Å? (Zoom into the interactive plotly figure)

👾👾👾

In [None]:
#@markdown Let's look at how the sidechains look with a given molecule in the _reference_ pose

# avoid issues with multiple chains for simplicity of going through things: don't do this at home
chainA = pose.split_by_chain()[1]

mol_i = 0   #@param {type:"integer"}
mol = mol_df.ROMol.iloc[mol_i]  # positional lookup, as the index holds names, not integers
# add it to the pose
# let's pretend by magic:
combined = dtc.add_mol_in_pose(chainA, mol )

# show it
view = nv.show_rosetta(combined)
dtc.add_neighbors(view, '[LIG]', radius=6)
view

In [None]:
#@markdown Let's repack the sidechains and have a look to see if anything changed.
#@markdown As we saw in the video, sidechains and backbones may differ between bound ligands.

#@markdown To do this, the pose is altered by a mover (called a 'sampler' in some other tools).
#@markdown Once an instance of a mover is set up, it is applied to a pose with the method
#@markdown `mover.apply(pose)`.

#@markdown Some movers do a fixed/deterministic operation or set of operations (some of these with random parameters).
#@markdown Others iterate over and over smaller operations and accept the outcome based on a criterion (cf. Monte Carlo).

# select the neighbourhood
lig_i = [i+1 for i, r in enumerate(combined.residues) if r.name3() == 'LIG'][-1]
lig_sele = prs.ResidueIndexSelector(lig_i)
neigh_sele = prs.NeighborhoodResidueSelector(lig_sele, 6, False)

# minimise the neighbouring sidechains
movemap = pyrosetta.MoveMap()
movemap.set_bb(False)
movemap.set_jump(False)
movemap.set_chi(allow_chi=neigh_sele.apply(combined) )  # repack these sidechains
relax: prp.moves.Mover = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 5)
relax.set_movemap_disables_packing_of_fixed_chi_positions(True)
relax.set_movemap(movemap)
relax.apply(combined)

# have a gander
view = nv.show_rosetta(combined)
dtc.add_neighbors(view, '[LIG]', radius=6)
view
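
The "accept the outcome based on a criterion" idea mentioned in the cell above can be sketched with a toy example in plain Python. It uses the Metropolis criterion on a one-dimensional score; it is a hedged illustration and not PyRosetta code, and `metropolis_accept` and `toy_monte_carlo` are hypothetical names.

```python
import math
import random


def metropolis_accept(old_score: float, new_score: float, kT: float = 1.0, rng=random) -> bool:
    """Always accept improvements; accept a worse score with probability exp(-delta/kT)."""
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / kT)


def toy_monte_carlo(score, x0: float, n_steps: int = 2000, step: float = 0.5,
                    kT: float = 1.0, seed: int = 0) -> float:
    """Random-walk sampling of a 1D score function, returning the best position visited."""
    rng = random.Random(seed)
    x = best = x0
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)           # small random move
        if metropolis_accept(score(x), score(trial), kT, rng):
            x = trial                                  # move accepted
            if score(x) < score(best):
                best = x                               # remember the lowest score seen
    return best


best = toy_monte_carlo(lambda x: (x - 2.0) ** 2, x0=-5.0)
print(f'best position found: {best:.2f}')
```

Occasionally accepting a worse score is what lets the sampler climb out of local minima; a lower `kT` makes it greedier, a higher `kT` makes it explore more.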

In [None]:
#@title Distort and dock molecule
#@markdown In this section the bound ligand is first wiggled around randomly
#@markdown and then its position is corrected by docking.
#@markdown The details of the commands run are not important.

from io import StringIO

# wiggle the ligand
rot_mag_in=180 #@param {type:"integer"}
trans_mag_in=3 #@param {type:"integer"}
initial = dtc.add_mol_in_pose(chainA, mol )
perturbed: pyrosetta.Pose = initial.clone()
pert_mover: prp.moves.Mover = prp.rigid.RigidBodyPerturbNoCenterMover(rb_jump_in=perturbed.num_jump(),
                                                                        rot_mag_in=rot_mag_in,
                                                                        trans_mag_in=trans_mag_in)
pert_mover.apply(perturbed)

# dock it
docked: pyrosetta.Pose = perturbed.clone()
lig_idx: int = [i+1 for i, r in enumerate(docked.residues) if r.name3() == 'LIG'][-1]
docked.pdb_info().set_resinfo(res=lig_idx, chain_id='X', pdb_res=1)
docked.remove_constraints()
pyrosetta.rosetta.protocols.docking.setup_foldtree(docked, 'A_B', pyrosetta.Vector1([1]))
docking: prp.moves.Mover = pyrosetta.rosetta.protocols.docking.DockMCMProtocol()
docking.set_scorefxn( pyrosetta.create_score_function('ligand') )
docking.apply(docked)

# separated
separated: pyrosetta.Pose = docked.clone()
protein, ligand = separated.split_by_chain()
ligand.translate()

#@markdown PyRosetta can do RMSD calculations, but as RDKit is a more common tool and can do RMSD, we will use that,
#@markdown specifically `rdMolAlign.CalcRMS`, not `rdMolAlign.GetBestRMS`, as we do not want to align the molecules.
#@markdown This holds true even in PyMOL.
from rdkit.Chem import rdMolAlign

# distinct names, so that the integer `mol_i` parameter set earlier is not overwritten
mol_initial: Chem.Mol = Chem.MolFromPDBBlock(ph.get_pdbstr(initial.split_by_chain()[1]))
mol_perturbed: Chem.Mol = Chem.MolFromPDBBlock(ph.get_pdbstr(perturbed.split_by_chain()[1]))
mol_docked: Chem.Mol = Chem.MolFromPDBBlock(ph.get_pdbstr(docked.split_by_chain()[1]))

print('RMSD of initial vs perturbed', rdMolAlign.CalcRMS(mol_initial, mol_perturbed))
print('RMSD of initial vs docked', rdMolAlign.CalcRMS(mol_initial, mol_docked))

import nglview as nv

view = nv.show_rosetta(docked, color='gainsboro')
view.component_0.add_representation('hyperball', '[LIG]', colorValue='#F8766D')
view.add_component(StringIO(Chem.MolToMolBlock(mol)), ext='mol', colorValue='#00B4C4')
display(view)
#@markdown **Q**: What is a Monte Carlo method? Hint: it is not a method written in monégasque.
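
The reason for computing the RMSD in place, without aligning first, can be shown with a minimal sketch in plain Python. It assumes both coordinate lists have the same atoms in the same order; RDKit's `CalcRMS` additionally accounts for symmetry-equivalent atom mappings, which this toy function (a hypothetical helper) does not.

```python
import math


def rmsd_in_place(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two coordinate sets, without any superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError('Both coordinate sets need the same number of atoms')
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))


# three atoms, and a copy translated rigidly by 2 Angstrom along x
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 2.0, y, z) for x, y, z in reference]
print(rmsd_in_place(reference, shifted))
```

A rigidly shifted ligand has an in-place RMSD equal to the shift, whereas an aligned RMSD would be zero. For redocking it is the former that says whether the ligand went back to the crystallographic position.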

In [None]:
#@title Create docking algorithm
#@markdown TBC
#@markdown ...

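# MonteCarlo object, the RigidBodyMover, PackRotamers, and the MinMover
from pyrosetta.rosetta.protocols.minimization_packing import MinMover
from pyrosetta.rosetta.protocols.moves import SequenceMover, TrialMover, RepeatMover

scorefxn = pyrosetta.get_score_function()

# a SequenceMover applies the movers added to it one after the other
seq_mover = SequenceMover()
n_moves = 1
movemap = pyrosetta.MoveMap()
...
# mover for conformer resampling
...
# mover for small translation
...
# mover for small rotation
...
# mover for repacking the sidechains
min_mover = MinMover()
min_mover.movemap(movemap)
min_mover.score_function(scorefxn)
seq_mover.add_mover(min_mover)


kT = 1.0
mc = pyrosetta.MonteCarlo(combined, scorefxn, kT)
#mc.boltzmann(pose)
# a TrialMover applies the sequence, then accepts or rejects the result via `mc`
trial_mover = TrialMover(seq_mover, mc)

n_repeats = 10
repeat_mover = RepeatMover(trial_mover, n_repeats)
repeat_mover.apply(combined)  # the same pose the MonteCarlo object was initialised with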

#@markdown **Q**: What is a Monte Carlo method? Hint: it is not a method written in monégasque.