# Merging

This is a notebook intended to be run in Colab. It is notebook 3 of the series:

1. Intro to RDKit: [![colab demo](https://img.shields.io/badge/Run_RDKit_intro-f9ab00?logo=googlecolab)](https://colab.research.google.com/github/matteoferla/DTC-compchem-practical/blob/main/1-basics.ipynb) — Overview of RDKit functionality
2. Intro to Forcefields & docking: [![colab demo](https://img.shields.io/badge/Run_Docking_intro-f9ab00?logo=googlecolab)](https://colab.research.google.com/github/matteoferla/DTC-compchem-practical/blob/main/2-redocking.ipynb) — Overview of forcefields in PyRosetta and redocking
3. Merging: [![colab demo](https://img.shields.io/badge/Run_RDKit_intro-f9ab00?logo=googlecolab)](https://colab.research.google.com/github/matteoferla/DTC-compchem-practical/blob/main/3-merging.ipynb) — Example of merging

## Overview
In this part we will merge hits and place them.

If you are running out of time, check out the [merger playground notebook](https://colab.research.google.com/github/matteoferla/Fragmenstein/blob/master/colab_playground.ipynb) instead.

For a version that does not rely on Fragalysis data, see [github::matteoferla::Fragmenstein::colab_fragmenstein.ipynb](https://colab.research.google.com/github/matteoferla/Fragmenstein/blob/master/colab_fragmenstein.ipynb).

# Fragmenstein
Fragmenstein is a position-based fragment-merging Python 3 tool.

<img src="https://github.com/matteoferla/Fragmenstein/raw/master/images/overview.png" width="800" alt="logo">

In its merging/linking operation, under the coordination of the class Victor,
the class Monster finds spatially overlapping atoms and stitches them together (with RDKit),
then the class Igor reanimates (minimises in PyRosetta) them within the protein site restraining the atoms to original positions.
As this compound may not be purchasable, one can use the placement operation to
make a stitched-together molecule based on a template.
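
As a rough illustration of what "spatially overlapping atoms" means, the sketch below pairs up atoms from two toy fragments that lie within a distance cutoff of each other. It is not Fragmenstein's actual algorithm: the function name `find_overlaps`, the cutoff value and the coordinates are all hypothetical, and plain coordinate tuples stand in for RDKit conformers.

```python
from itertools import product
from math import dist
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def find_overlaps(coords_a: Sequence[Coord],
                  coords_b: Sequence[Coord],
                  cutoff: float = 1.0) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where atom i of A lies within `cutoff` Å of atom j of B.

    Hypothetical helper for illustration only (brute force over all atom pairs).
    """
    return [(i, j)
            for (i, a), (j, b) in product(enumerate(coords_a), enumerate(coords_b))
            if dist(a, b) <= cutoff]


# Two toy 'fragments' (made-up coordinates in Å): the last atom of A sits
# almost on top of the first atom of B.
frag_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
frag_b = [(3.2, 0.1, 0.0), (4.7, 0.0, 0.0)]

overlaps = find_overlaps(frag_a, frag_b, cutoff=1.0)
print(overlaps)  # [(2, 0)]
```

Atoms paired in this way are the ones a merging step could collapse into a single atom. Atoms that are close but not overlapping would instead need a linker.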

# Further details

## Details

This notebook performs several operations as examples:

* It optionally minimises the template structure.
* It optionally extracts the hits from provided PDB structures.
* It combines the provided hits combinatorially.
* It searches the Enamine REAL database for the molecules most similar to the user-chosen molecule (via the API of John Irwin's SmallWorld server).
* It places them.
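
To give a feel for how quickly the combinatorial step grows, here is a small sketch that enumerates pairs of hits with the standard library. Whether the order within a pair matters for the merging is an assumption left open here, so both counts are shown. The hit names are made up.

```python
from itertools import combinations, permutations
from math import comb, perm

# Made-up hit names standing in for the molecules
hit_names = ['hit-A', 'hit-B', 'hit-C']

# pairs where order does not matter: (A, B) is the same as (B, A)
unordered_pairs = list(combinations(hit_names, 2))
# pairs where order matters: (A, B) and (B, A) are attempted separately
ordered_pairs = list(permutations(hit_names, 2))

print(len(unordered_pairs), len(ordered_pairs))  # 3 6

# Number of pairwise attempts for 20 hits
n_hits = 20
print(comb(n_hits, 2), perm(n_hits, 2))  # 190 380
```

The count grows quadratically with the number of hits, which is why capping the number of hits keeps the run time manageable on a single core.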

NB: Whereas Fragmenstein can deal with covalent ligands
and can interconvert a few cysteine-reactive warheads,
this notebook does not do any of that due to corner case mayhem. The demo data (MPro from the [Covid Moonshot](https://covid.postera.ai/covid) available from [Fragalysis](https://fragalysis.diamond.ac.uk/)) will use covalent residues.

Fragmenstein can _partially_ work without PyRosetta, but this notebook does not work that way.

See also:

* https://github.com/matteoferla/Fragmenstein
* https://fragmenstein.readthedocs.io
* https://github.com/matteoferla/Python_SmallWorld_API
* https://github.com/matteoferla/pyrosetta_help

# Installation and initialisation

In [None]:
#@title Installation
local_debug = True
if local_debug:
    raise Exception('CURRENTLY IN DEBUG MODE.... REMEMBER TO CLEAR ALL CELLS!')
#@markdown Press the play button on the top right hand side of this cell
#@markdown once you have checked the settings.
#@markdown You will be notified that this notebook is not from Google, that is normal.

## Install all requirements and get some goodies
!pip install git+https://github.com/matteoferla/DTC-compchem-practical.git
# this will be called as:
# import DTC_compchem_practical as dtc

## Jupyter lab? use `trident-chemwidgets`
!pip install git+https://github.com/matteoferla/JSME_notebook_hack.git
!pip install --upgrade plotly

# The next line is only valid for today without the Odin+Eduroam network
# ie. your IP address is one of these https://help.it.ox.ac.uk/ip-addresses#collapse2202811
#!pip install https://www.stats.ox.ac.uk/~ferla/pyrosetta-2022.46+release.f0c6fca0e2f-cp39-cp39-linux_x86_64.whl
!pip install https://www.stats.ox.ac.uk/~ferla/pyrosetta-2022.47+release.d2aee95a6b7-cp37-cp37m-linux_x86_64.whl
# Normally you have different ways of installing pyrosetta, e.g.
# pip install pyrosetta_help
# PYROSETTA_USERNAME=👾👾👾 PYROSETTA_PASSWORD=👾👾👾 install_pyrosetta

from google.colab import output  # noqa (It's a colaboratory specific repo)
output.enable_custom_widget_manager()

In [None]:
#@title Start PyRosetta
#@markdown Leave alone and just run it. 
#@markdown The only setting that you might want to change is `ignore_waters`
#@markdown as merging a ligand with its watershell may be a reasonable thing to do
#@markdown in extreme circumstances —like not even Chuck Norris could find a followup.
import pyrosetta, logging
import pyrosetta_help as ph

#@markdown Do not optimise hydrogen on loading:
no_optH = False  #@param {type:"boolean"}
#@markdown Ignore (True) or raise error (False) if novel residue (e.g. ligand) —  **don't tick this**.
ignore_unrecognized_res = False  #@param {type:"boolean"}
#@markdown Use autogenerated PDB component residues? These are often weird (bad geometry, wrong match, protonated etc.), so it is best to do it properly and parameterise the ligand: **don't tick this**.
load_PDB_components = False  #@param {type:"boolean"}
#@markdown Ignore all waters:
ignore_waters = True  #@param {type:"boolean"}

extra_options = ph.make_option_string(no_optH=no_optH,
                                      ex1=None,
                                      ex2=None,
                                      mute='all',
                                      ignore_unrecognized_res=ignore_unrecognized_res,
                                      load_PDB_components=load_PDB_components,
                                      ignore_waters=ignore_waters)

# capture to log
# circuitous as I needed to debug...
logger = ph.configure_logger()
logger.handlers[0].setLevel(logging.WARNING)  # logging.WARNING = 30
pyrosetta.init(extra_options=extra_options,
               #set_logging_handler=True
               )

# Data loading

In [None]:
#@title Download off Fragalysis
#@markdown Choose a target
target_name = '👾👾👾'   #@param {type:"string"}
if local_debug:
    target_name = 'MID2A'

from rdkit import Chem
from IPython.display import display
from typing import Dict
import DTC_compchem_practical as dtc

#@markdown This will add the variables `pdb_filename`, `metadata_filename` and `sdf_filename`.
filenames: Dict[str, str] = dtc.download_fragalysis(target_name, 'input')
pdb_filename: str = filenames['reference.pdb']
metadata_filename: str = filenames['metadata.csv']
sdf_filename: str = filenames['combined.sdf']

In [None]:
#@title Make an apo structure & ligand table as before

from io import StringIO
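with open(pdb_filename) as fh:
    holoblock: str = fh.read()

# drop the HETATM records (ligands, waters, ions) to get the apo structure
apo_block: str = '\n'.join(line for line in holoblock.split('\n') if 'HETATM' not in line)
# `pdbblock` is the name used by the later cells
pdbblock: str = apo_block

with open(f'input/{target_name}_reference.clean.pdb', 'w') as fh:
    fh.write(apo_block)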

from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd

mol_df = pd.concat([PandasTools.LoadSDF(sdf_filename).set_index('ID'),
                       pd.read_csv(metadata_filename, index_col=0).set_index('crystal_name')
                      ], axis=1)

# Tweaks
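
The first cell below can optionally pick up every params file sitting in the working folder. A minimal sketch of that lookup follows. The helper name `find_params_files` is hypothetical and the files are dummies created in a temporary folder. The point to note is that `os.path.splitext` returns a `(root, extension)` tuple, so only its second element is compared.

```python
import os
import tempfile
from typing import List


def find_params_files(folder: str = '.') -> List[str]:
    """List the filenames in `folder` that end in `.params` (hypothetical helper)."""
    return sorted(name for name in os.listdir(folder)
                  if os.path.splitext(name)[1] == '.params')


# Demonstration on dummy files in a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ('LIG.params', 'notes.txt', 'template.pdb'):
        open(os.path.join(tmp, name), 'w').close()
    found = find_params_files(tmp)

print(found)  # ['LIG.params']
```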

In [None]:
#@title Optionally prepare it (1/3)
#@markdown This will be done by 

#@markdown 1. loading it in PyRosetta
#@markdown 2. Optionally energy minimising around a target
#@markdown 3. Optionally remove some molecules

#@markdown ### Step 1
#@markdown If the model has novel ligands, they will be loaded.
#@markdown But to do this a residue type  (=topology) needs to be made or loaded.
#@markdown These are saved as "params files".
#@markdown These following options control both the "acceptor" and "donor" poses (if uploaded).
#@markdown ### Params
#@markdown * Some compounds are parameterised in the database folder of rosetta,
#@markdown others in the PDB component database (if loaded).
#@markdown * Uses the params defined in the cell of the acceptor pose.
#@markdown * If there is no topology available one will be made.
#@markdown * If a params file is present in the working folder it will use it.
#@markdown * See below or visit https://params.mutanalyst.com/ to generate them (upload them with the folder icon on the left).

#@markdown This forces it (a bit silly):
force_parameterisation = False  #@param {type:"boolean"}
#@markdown If it needs to be parameterised make it protonated for pH 7?
neutralize_params = True  #@param {type:"boolean"}
save_params = True  #@param {type:"boolean"}

#@markdown If a params file is present in the working folder it will use it.
#@markdown Leave this blank... otherwise  (comma separated w/ no rando spaces):
extra_params_files_to_use = ''  #@param {type:"string"}
extra_params = [f for f in extra_params_files_to_use.split(',') if f]
use_all_folder_params = False  #@param {type:"boolean"}
if use_all_folder_params:
    import os
    # os.path.splitext returns (root, extension): compare the extension only
    present_params = [filename for filename in os.listdir() if os.path.splitext(filename)[1] == '.params']
else:
    present_params = []
print('loading pose...')
template_pose = ph.ligands.load.parameterized_pose_from_pdbblock(pdbblock,
                                                                 wanted_ligands=[],
                                                                 force_parameterisation=force_parameterisation,
                                                                 neutralise_params=neutralize_params,
                                                                 save_params=save_params,
                                                                 overriding_params=extra_params + present_params)

In [None]:
#@title Optional energy minimisation around a target (2/3)
#@markdown (Requires previous cell run)
assert 'template_pose' in globals(), 'Step 1 was not run'

#@markdown If use density map is true, you will be prompted to upload a density map.
#@markdown Upload a f0fc ccp4 or a mrc map. (not a ccp4 difference map,
#@markdown a mtz reciprocal space map or a pirate treasure map)
#@markdown The map needs to be in the same position as the template.
use_density_map = False  #@param {type:"boolean"}
#@markdown The whole structure could be minimised, but that would be pointlessly time-consuming
#@markdown for this task.
#@markdown Specify what residue (amino acid or ligand) to centre around
center_residue_index = 1  #@param {type:"integer"}
center_residue_chain = 'A'  #@param {type:"string"}
center_index: int = template_pose.pdb_info().pdb2pose(res=center_residue_index, chain=center_residue_chain)
assert center_index != 0, 'That residue does not exist!'

#@markdown Specify which neighbouring residues to select in one of three ways:

#@markdown (1) Cutoff distance for the neighbouring residues (in Ångströms) (centroid to centroid)?
#@markdown set to zero to not use.
neighborhood_radius = 1  #@param {type:"integer"}
#@markdown (2) Cutoff distance for the neighbouring residues (in Ångströms) (closest atom to closest atom)?
#@markdown set to zero to not use.
cc_neighborhood_radius = 0  #@param {type:"integer"}
#@markdown (3) Max number of neighbouring residues to choose?
#@markdown set to zero to not use.
n_neighbors = 0  #@param {type:"integer"}

#@markdown ## Minimisation
#@markdown How many cycles of FastRelax to use? 3–15
cycles = 3  #@param {type:"integer"}
#@markdown to change scorefunctions and so forth edit the code.

#@markdown Note. The class Igor has two classmethods that normally
#@markdown perform these template minimisation steps (`Igor.download_map` and `Igor.relax_with_ED`),
#@markdown but this is a cruder and quicker (local) minimisation.

# Get map
if use_density_map:
    import os
    from google.colab import files  # noqa (Colab-specific upload helper)
    uploaded = files.upload()
    assert len(uploaded) == 1, 'wrong number of files (only one plz)'
    filename = list(uploaded.keys())[0]
    mapblock = list(uploaded.values())[0]
    # save the map next to the downloaded Fragalysis data and load it from there
    map_filename = os.path.join('input', filename)
    with open(map_filename, 'wb') as fh:
        fh.write(mapblock)
    # this can be done with `Igor.relax_with_ED`, but I wanted the option here to do
    # it with or without the map
    ed = ph.prep_ED(template_pose, map_filename)
    assert ed.matchPose(template_pose) > 0.5, 'This is a rubbish fit. Upload the right map.'

# prep scorefunction
scorefxn = pyrosetta.get_fa_scorefxn()
if use_density_map:
    scorefxn.set_weight(pyrosetta.rosetta.core.scoring.ScoreType.elec_dens_fast,
                        30)

selector = pyrosetta.rosetta.core.select.residue_selector
resi_sele = selector.ResidueIndexSelector(center_index)
if neighborhood_radius != 0:
    neighbor_sele = selector.NeighborhoodResidueSelector(resi_sele,
                                                         distance=neighborhood_radius,
                                                         include_focus_in_subset=True)
elif cc_neighborhood_radius != 0:
    neighbor_sele = selector.CloseContactResidueSelector()
    neighbor_sele.central_residue_group_selector(resi_sele)
    neighbor_sele.threshold(cc_neighborhood_radius)
elif n_neighbors != 0:
    neighbor_sele = selector.NumNeighborsSelector(n_neighbors, 20)
    # Ah. True. NumNeighborsSelector does not work in PyRosetta.
    raise NotImplementedError
else:
    raise ValueError('Set one of neighborhood_radius, cc_neighborhood_radius or n_neighbors to a non-zero value')

# relax
movemap = pyrosetta.MoveMap()
movemap.set_bb(allow_bb=neighbor_sele.apply(template_pose))
movemap.set_chi(allow_chi=neighbor_sele.apply(template_pose))
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, cycles)
relax.set_movemap(movemap)
relax.apply(template_pose)

pdbblock = ph.get_pdbstr(template_pose)

# Enter Victor Fragmenstein's laboratory!
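
The cell below ranks the results by one of several metrics and later keeps only placements where more than half of the atoms were constrained. The toy sketch below shows how the choice of metric changes the order. The candidate values are invented, and treating ligand efficiency as ∆∆G divided by the number of heavy atoms is an assumption about how the `LE` column is derived. The constrained ratio follows the formula used in the cell.

```python
from typing import Dict, List


def ligand_efficiency(ddG: float, n_heavy: int) -> float:
    """Assumed definition: binding energy per heavy atom (more negative is better)."""
    return ddG / n_heavy


def constrained_ratio(n_constrained: int, n_unconstrained: int) -> float:
    """Fraction of atoms that were constrained to the positions of the hits."""
    return n_constrained / (n_constrained + n_unconstrained)


# Invented example values, for illustration only
candidates: List[Dict] = [
    dict(name='small', ddG=-6.0, n_heavy=12, n_constrained=10, n_unconstrained=2),
    dict(name='medium', ddG=-8.0, n_heavy=20, n_constrained=15, n_unconstrained=5),
    dict(name='large', ddG=-9.0, n_heavy=30, n_constrained=12, n_unconstrained=18),
]

# ascending sort: the most negative value comes first
by_ddG = [c['name'] for c in sorted(candidates, key=lambda c: c['ddG'])]
by_LE = [c['name'] for c in
         sorted(candidates, key=lambda c: ligand_efficiency(c['ddG'], c['n_heavy']))]
# keep only candidates with more than half of their atoms constrained
faithful = [c['name'] for c in candidates
            if constrained_ratio(c['n_constrained'], c['n_unconstrained']) > 1 / 2]

print(by_ddG)    # ['large', 'medium', 'small']
print(by_LE)     # ['small', 'medium', 'large']
print(faithful)  # ['small', 'medium']
```

In this toy set the largest compound wins on ∆∆G simply by having more atoms, whereas ligand efficiency favours the smallest one, which mirrors the trade-off described in the ranking options.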

In [None]:
#@title Merge/link -> find similars -> place similars
#@markdown Three step process:

#@markdown 1. the hits are combined pairwise
#@markdown 2. the mergers are queried in the SmallWorld server against the Enamine REAL DB
#@markdown 3. the purchasable similars are placed

#@markdown In the documentation the example uses `sqlitedict.SqliteDict`
#@markdown as this avoids dramas from segfaults from `KeyboardInterrupt` or funky entries.

# Okay, the code below contains some black magic.
# a Chem.Mol is sent down the pipe to the subprocess pickled.
# But this loses its properties (`mol.HasProp`).
# unless this dark ritual is performed:
# https://github.com/matteoferla/Fragmenstein/blob/master/documentation/mol_properties.md

#@markdown &#9888; In this notebook, ligand efficiency against the filtered set is used for ranking.
#@markdown Ranking is a topic in itself, so only simple ranking options are presented here:
ranking = '∆∆G' #@param ["LE", "∆∆G", "comRMSD"]
#@markdown ∆∆G: this has the issue that a greater number of atoms will result in a lower score,
#@markdown even if each is not contributing much.
#@markdown LE: Ligand efficiency is the most correct way to rank, but will result in similarly sized compounds to the hits, which is not desirable in fragment building.
#@markdown comRMSD: by sorting by combined RMSD the most faithful hits will be placed first.

#@markdown Joining cutoff: distance in Ångström, passed to `Victor.monster_joining_cutoff`
joining_cutoff = 5  #@param {type:"integer"}
#@markdown Quick reanimation (for the impatient)
quick_reananimation = True  #@param {type:"boolean"}
#@markdown Covalent residue (cysteine only out of the box).
#@markdown Set as '' if noncovalent. '145A' is for MPro demo data.
covalent_resi = '145A' #@param {type:"string"}
if covalent_resi in ('', 'none', 'None', 'False', 'false', '0'):
    covalent_resi = None

# ============================================================================================
# ## Define the process

#@markdown The mergers may not be purchasable.
#@markdown As a result, purchasable similars can be sought here in Enamine REAL.
find_similars = True #@param {type:"boolean"}
topN_to_pick = 10  #@param {type:"integer"}
place_similars = True #@param {type:"boolean"}
#@markdown For the placement of similars use the original hits or the unminimised merger?
use_originals = False  #@param {type:"boolean"}

#@markdown How many hits to merge in the first place?
n_hits = 20   #@param {type:"integer"}
hits = mol_df.sample(min(n_hits, len(mol_df))).ROMol  # cap at the number of available hits

place_similars = find_similars and place_similars


import os, re
import pyrosetta, logging
import pandas as pd
from rdkit import Chem
from fragmenstein import Victor, Laboratory

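# `output_folder` is not defined earlier in this notebook: fall back to 'output'
output_folder = globals().get('output_folder', 'output')
os.makedirs(output_folder, exist_ok=True)
Victor.work_path = output_folder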
Victor.monster_throw_on_discard = True  # stop this merger if a fragment cannot be used.
Victor.monster_joining_cutoff = joining_cutoff  # Å
Victor.quick_reanimation = quick_reananimation  # for the impatient
Victor.error_to_catch = Exception  # stop the whole laboratory otherwise
#Victor.enable_stdout(logging.ERROR)
Victor.enable_logfile(os.path.join(output_folder, 'demo.log'), logging.ERROR)

# calculate !
lab = Laboratory(pdbblock=pdbblock, covalent_resi=covalent_resi)
n_cores = 1  #@param {type:"integer"}
combinations:pd.DataFrame = lab.combine(hits, n_cores=n_cores)


# =============================================================================================
# ## plot results


import plotly.express as px
from IPython.display import display

fig = px.histogram(combinations,
                   x='outcome',
                   category_orders={'outcome': lab.category_labels},
                   title='Distribution of Combination outcome')
fig.show()

# =============================================================================================
# ## Reverse the warhead...
# this is a really unusual and janky way of doing it as one ought to know the metadata already...

from rdkit.Chem import AllChem
from fragmenstein import Victor
from typing import *
from warnings import warn

warhead_names = []
unreacted_smiles = []
for i, row in combinations.iterrows():
    unrxn, wn = Victor.guess_warhead(row.smiles) #: Tuple[str, str]
    warhead_names.append(wn)
    unreacted_smiles.append(unrxn)

combinations['unreacted_smiles'] = unreacted_smiles
combinations['warhead_type'] = warhead_names

# save

combinations.to_csv('combinations.csv')
combinations.to_pickle('combinations.p')

# =============================================================================================
# ## top 10
best_combinations = combinations.loc[combinations.outcome == 'acceptable'].sort_values(ranking).reset_index(drop=True).head(topN_to_pick)
if len(best_combinations):
    print(f'Top {topN_to_pick} mergers/linkers sorted by {ranking}')
    #PandasTools.AddMoleculeColumnToFrame(best_combinations,'smiles','molecule',includeFingerprints=False)
    display(best_combinations.drop(['unmin_binary', 'min_binary'], axis=1))
else:
    display(combinations.error)
    reasons = combinations.error.astype(str).str.split(r'^(\w+)\:', expand=True)[1].value_counts().to_dict()
    raise RuntimeError(f'The combinations failed because {reasons}')

# =============================================================================================
# ### Place purchasable similars

from smallworld_api import SmallWorld
from warnings import warn

sws = SmallWorld()
# this call requires an internet connection
chemical_databases:pd.DataFrame = sws.retrieve_databases()

if find_similars:
    similars = sws.search_many(best_combinations.unreacted_smiles,
                               dist=25,
                               db=sws.REAL_dataset,
                               tolerated_exceptions=Exception)

    similars['inspirations'] = similars.query_index.map( best_combinations.regarded.to_dict() )
    similars['merger'] = similars.query_index.map( best_combinations.smiles.to_dict() )
    similars['merger_∆∆G'] = similars.query_index.map( best_combinations['∆∆G'].to_dict() )
    similars['inspiration_mols'] = similars.query_index.map( best_combinations.hit_mols.to_dict() )
    similars['merger_unminimized_mol'] = similars.query_index.map( best_combinations.unminimized_mol.to_dict() )
    similars['merger_minimized_mol'] = similars.query_index.map( best_combinations.minimized_mol.to_dict() )
    similars.to_csv('similars.csv')
    similars.to_pickle('similars.p')

    display(similars[['smiles', 'name', 'topodist', 'inspirations', 'merger', 'merger_∆∆G']])

# ============ place the similars ==================
if place_similars:
    if use_originals:
        similars['hits'] = similars.inspiration_mols
    else:
        # make a list of one, the unminimised merger
        similars['hits'] = similars.merger_unminimized_mol.apply(lambda m: [m])

    lab = Laboratory(pdbblock=pdbblock, covalent_resi=covalent_resi)
    placements:pd.DataFrame = lab.place(similars, expand_isomers=False, n_cores=n_cores)
    display(placements)
    placements['const_ratio'] = placements['N_constrained_atoms'] / (
                placements['N_constrained_atoms'] + placements['N_unconstrained_atoms'])

    from rdkit import Chem
    from typing import *

    m = similars.drop_duplicates('name').set_index('name').to_dict()
    placements['merger_∆∆G'] = placements['name'].map(m['merger_∆∆G'])
    placements['merger_minimized_mol'] = placements['name'].map(m['merger_minimized_mol'])
    placements['merger_unminimized_mol'] = placements['name'].map(m['merger_unminimized_mol'])
    placements.rename(columns={'unminimized_mol': 'enamine_unminimized_mol',
                               'minimized_mol': 'enamine_minimized_mol'}, inplace=True)
    placements['merger_inspiration_names'] = placements['name'].map(m['inspirations'])
    placements['merger_inspiration_mols'] = placements.hit_mols
    nan_to_list = lambda value: value if isinstance(value, list) else []
    placements['disregarded'] = placements.disregarded.apply(nan_to_list)
    placements['regarded'] = placements.regarded.apply(nan_to_list)

    placements.to_csv('placements.csv')
    placements.to_pickle('placements.p')

    # NB: more than 2 in 3 constrained is actually uncommon with enamine real for larger mergers.
    # hence the 1 in 2

    best_placements = placements.loc[
        (placements.outcome == 'acceptable')
        & (placements.const_ratio > 1/2) ].sort_values(ranking).reset_index(drop=True).head(topN_to_pick)
    if len(best_placements):
        print(f'Top {topN_to_pick} placements sorted by {ranking}')
        #PandasTools.AddMoleculeColumnToFrame(best_combinations,'smiles','molecule',includeFingerprints=False)
        # noisy_fields = ['hit_mols', 'merger_unminimized_mol',
        #                           'merger_unminimized_mol',
        #                           'unmin_binary', 'min_binary']
        noisy_fields = []
        display( best_placements.drop(noisy_fields, axis=1) )
    else:
        display(placements.error)
        reasons = placements.error.astype(str).str.split(r'^(\w+)\:', expand=True)[1].value_counts().to_dict()
        raise RuntimeError(f'The placements failed because {reasons}')

#PandasTools.AddMoleculeColumnToFrame(best_placements,'smiles','molecule',includeFingerprints=False)


# =============================================================================================
# ### Results redux

from IPython.display import clear_output, HTML, display
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

headerify: Callable[[str], HTML] = lambda header: HTML(f'<h3>{header}</h3>')

clear_output()
if place_similars:
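    # separate name so that `fig` still holds the combination histogram shown under Step 1
    placement_fig = px.histogram(placements,
                                 x='outcome',
                                 category_orders={'outcome': lab.category_labels},
                                 title='Distribution of Placement outcome')
    placement_fig.show()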


display(headerify('Provided hits'))
display_mols(hits)
display(headerify('Step 1. Combine'))
fig.show()
print(f'Top {topN_to_pick} mergers/linkers sorted by {ranking}')
display(best_combinations.drop(['unmin_binary', 'min_binary'], axis=1))
display_mols(best_combinations.unminimized_mol)

def show3D_combined(index_to_show):
    # the slider below indexes `best_combinations`, not `best_placements`
    row = best_combinations.iloc[index_to_show]
    print(f'Green: hits {row["regarded"]}')
    print('Cyan: merger')
    display(make_3Dview(pdbblock, {'greenCarbon': row.hit_mols,
                           'cyanCarbon': [row.minimized_mol]}))

print('In the sorted combinations table (`best_combinations`) which index do you want to see:')
scale = widgets.IntSlider(min=0,max=len(best_combinations)-1, step=1, value=0)
interact(show3D_combined, index_to_show=scale)
#display(similars)
if place_similars:

    def show3D_placed(index_to_show):
        row = best_placements.iloc[index_to_show]
        print(f'Green: hits {row["merger_inspiration_names"]}')
        print('Cyan: merger')
        print(f'Magenta: Enamine Real purchasable {row["name"]}')
        display(make_3Dview(pdbblock, {'greenCarbon': row.merger_inspiration_mols,
                               'cyanCarbon': [row.merger_minimized_mol],
                               'magentaCarbon': [row.enamine_minimized_mol]}))

    display(headerify('Step 2. Placement of purchasable similars'))
    display(headerify(f'Top {topN_to_pick} Placements'))
    display( best_placements.drop(noisy_fields, axis=1) )

    print('In the sorted placements table (`best_placements`) which index do you want to see:')
    scale = widgets.IntSlider(min=0,max=len(best_placements)-1, step=1, value=0)
    interact(show3D_placed, index_to_show=scale)
elif find_similars:
    display(headerify('Purchasable similars'))
    display(similars)