<a href="https://colab.research.google.com/github/pablo-arantes/Cloud-Bind/blob/main/Virtual_Screening.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hi there!**

This is a Jupyter notebook for running a virtual screening protocol. It includes deep-learning-based molecular docking with the GNINA docking software, the Protein-Ligand Atomistic Conformational Ensemble Resolver (PLACER), and the AEV-PLIG binding affinity predictor.

The main goal of this notebook is to demonstrate how to harness the power of cloud computing to predict drug binding structures in an affordable and practical way.

---

 **This notebook is NOT a standard protocol for molecular docking calculations!** It is just a simple docking pipeline illustrating each step of the process.

---


**Bugs**
- If you encounter any bugs, please report the issue to https://github.com/pablo-arantes/Cloud-Bind/issues

**Acknowledgments**
- We would like to thank the [GNINA](https://github.com/gnina/gnina) team for doing an excellent job open sourcing the software.
- We would like to thank the [Roitberg](https://roitberg.chem.ufl.edu/) team for developing the fantastic [TorchANI](https://github.com/aiqm/torchani).
- We would like to thank [@ruiz_moreno_aj](https://twitter.com/ruiz_moreno_aj) for his work on [Jupyter Dock](https://github.com/AngelRuizMoreno/Jupyter_Dock).
- We would like to thank the ChemosimLab ([@ChemosimLab](https://twitter.com/ChemosimLab)) team for their incredible [ProLIF](https://prolif.readthedocs.io/en/latest/index.html#) (Protein-Ligand Interaction Fingerprints) tool.
- We would like to thank the [OpenBPMD](https://github.com/Gervasiolab/OpenBPMD) team for their open source implementation of binding pose metadynamics (BPMD).
- Also, credit to [David Koes](https://github.com/dkoes) for his awesome [py3Dmol](https://3dmol.csb.pitt.edu/) plugin.
- Finally, we would like to thank the [Making it rain](https://github.com/pablo-arantes/making-it-rain) team, **Pablo R. Arantes** ([@pablitoarantes](https://twitter.com/pablitoarantes)), **Marcelo D. Polêto** ([@mdpoleto](https://twitter.com/mdpoleto)), **Conrado Pedebos** ([@ConradoPedebos](https://twitter.com/ConradoPedebos)) and **Rodrigo Ligabue-Braun** ([@ligabue_braun](https://twitter.com/ligabue_braun)), for their amazing work.
- For related notebooks see: [Cloud-Bind](https://github.com/pablo-arantes/Cloud-Bind)

In [None]:
#@title **Install Conda Colab**
#@markdown This will restart the kernel (session). That is expected, so don't worry.
# !pip install -q condacolab
# import condacolab
# condacolab.install()

!pip install -q condacolab
import condacolab
condacolab.install_from_url("https://github.com/conda-forge/miniforge/releases/download/25.3.1-0/Miniforge3-Linux-x86_64.sh")


In [None]:
import condacolab
condacolab.check()

In [None]:
#@title **Install dependencies**
#@markdown It will take a few minutes, please, have a coffee and wait. ;-)
# install dependencies
%%capture
import sys
import tarfile
import os
import subprocess
import sys
#subprocess.run("rm -rf /usr/local/conda-meta/pinned", shell=True)
subprocess.run("pip -q install py3Dmol", shell=True)
subprocess.run("pip install git+https://github.com/pablo-arantes/biopandas", shell=True)
subprocess.run("pip install bio", shell=True)
subprocess.run("pip install torch torchvision", shell=True)
subprocess.run("pip install torchani", shell=True)
subprocess.run("pip install ase", shell=True)
subprocess.run("pip install pandas", shell=True)
subprocess.run("pip install seaborn", shell=True)
subprocess.run("pip install openmm", shell=True)
subprocess.run("pip install datamol", shell=True)
subprocess.run("conda install -c conda-forge pdbfixer -y", shell=True)
subprocess.run("pip install parmed", shell=True)
subprocess.run("conda install -c conda-forge openbabel -y", shell=True)
subprocess.run("pip install rdkit", shell=True)
subprocess.run("wget https://github.com/gnina/gnina/releases/download/v1.3/gnina", shell=True)
subprocess.run("chmod +x gnina", shell=True)
subprocess.run("git clone https://github.com/pablo-arantes/AEV-PLIG.git", shell=True)
subprocess.run("pip install logmd==0.1.30", shell=True)
subprocess.run("pip install MDAnalysis", shell=True)
subprocess.run("pip install posebusters --upgrade", shell=True)
subprocess.run("mamba install -c conda-forge pymol-open-source -y", shell=True)
subprocess.run("pip install qcelemental", shell=True)
subprocess.run("pip install torch-geometric", shell=True)
subprocess.run("wget https://github.com/rdk/p2rank/releases/download/2.5/p2rank_2.5.tar.gz", shell=True)
file = tarfile.open('p2rank_2.5.tar.gz')
file.extractall('/content/')
file.close()
os.remove('p2rank_2.5.tar.gz')
subprocess.run("pip install medchem", shell=True)

#load dependencies
import parmed as pmd
from biopandas.pdb import PandasPdb
import urllib.request
import numpy as np
import py3Dmol
import platform
import scipy.cluster.hierarchy
from scipy.spatial.distance import squareform
import scipy.stats as stats
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import griddata
import seaborn as sb
from statistics import mean, stdev
from matplotlib import colors
from IPython.display import set_matplotlib_formats
from rdkit import Chem
import datamol as dm
import seaborn as sns
from concurrent.futures import ProcessPoolExecutor

In [None]:
#@title **Install PLACER dependencies**
#@markdown Please, continue drinking your coffee and wait. ;-)

#@markdown Protein-Ligand Atomistic Conformational Ensemble Resolver (PLACER) is a graph neural network that operates entirely at the atomic level; the nodes of the graph are the atoms in the system. PLACER was trained to recapitulate the correct atom positions from partially corrupted input structures from the Cambridge Structural Database and the Protein Data Bank. PLACER accurately generates structures of diverse organic small molecules given knowledge of their atom composition and bonding, and given a description of the larger protein context, can accurately build up structures of small molecules and protein side chains; used in this way PLACER has competitive performance on protein-small molecule docking given approximate knowledge of the binding site. PLACER is a rapid and stochastic denoising network, which enables generation of ensembles of solutions to model conformational heterogeneity.

#@markdown Reference: https://www.biorxiv.org/content/10.1101/2024.09.25.614868v1



#install dependencies
%%capture
import sys
import tarfile
import os
import subprocess
import sys

commands = [
    "git clone https://github.com/pablo-arantes/PLACER.git",
    "mamba env create -f /content/PLACER/envs/placer_env_lite.yml"
]


for cmd in commands:
    subprocess.run(cmd, shell=True)

## Using Google Drive to store simulation data

Google Colab does not allow users to keep data on its computing nodes. However, we can use Google Drive to read, write, and store our simulation files. Therefore, we suggest that you:

1.   Create a folder in your own Google Drive and copy the necessary input files there.
2.   Copy the path of your created directory. We will use it below.

In [None]:
#@title ### **Import Google Drive**
#@markdown Click the "Run" button to make your Google Drive accessible.
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

In [None]:
#@title **Check if you correctly allocated GPU nodes**

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)


---
# **Loading the necessary input files**

At this point, we should have all libraries and dependencies installed.

**Important**: Make sure the PDB file points to the correct structure.

Below, you should provide the names of all input files and the path of the Google Drive folder containing them.

**Please, don't use spaces in file and folder names; use, e.g., MyDrive/protein_ligand.**
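
As a quick sanity check before running the cells below, the two requirements above (input files present in the folder, no spaces in names) can be verified with a small helper. This is only a minimal sketch: `check_inputs` and the example file names are hypothetical and are not part of the notebook.

```python
from pathlib import Path

def check_inputs(folder, filenames):
    """Return a list of human-readable problems found for the given inputs."""
    problems = []
    folder = Path(folder)
    if " " in str(folder):
        problems.append(f"folder path contains spaces: {folder}")
    for name in filenames:
        if " " in name:
            problems.append(f"file name contains spaces: {name}")
        if not (folder / name).is_file():
            problems.append(f"file not found: {folder / name}")
    return problems

# Hypothetical usage with the notebook's default folder and example file names
for problem in check_inputs("/content/drive/MyDrive/", ["protein.pdb", "ligands.smi"]):
    print(problem)
```

An empty result means that every listed file was found and that none of the names contains a space.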

In [None]:
#@title **Please, provide the necessary input files below for receptor**:
#@markdown **Important:** Run the cell to prepare your receptor and select your reference residue for the construction of an optimal box size for the docking calculations.

#@markdown Choose between uploading your own PDB file (pdb_file) or using a PDB ID (Query_PDB_ID) to download the correct file. The appropriate chains can be selected as well.

from openmm.app.pdbfile import PDBFile

import requests
import warnings
warnings.filterwarnings('ignore')
import os
from Bio.PDB import PDBParser, PDBIO, Select
from Bio.PDB import is_aa
import pandas as pd
from pdbfixer import PDBFixer

Google_Drive_Path = '/content/drive/MyDrive/' #@param {type:"string"}
workDir = Google_Drive_Path

workDir2 = os.path.join(workDir)
workDir_check = os.path.exists(workDir2)
# Create the working directory, including missing parent folders, if needed
os.makedirs(workDir2, exist_ok=True)

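# Remove residue lists left over from a previous run.
# Each file is checked on its own, so a missing one no longer raises an error.
for old_list in ("name_residues.txt", "name_residues_receptor.txt"):
  if os.path.exists(os.path.join(workDir, old_list)):
    os.remove(os.path.join(workDir, old_list))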

temp = os.path.join(workDir, "temp.pdb")
receptor = os.path.join(workDir, "receptor.pdb")
ligand = os.path.join(workDir, "ligand.sdf")

# Choose PDB source: Upload or PDB ID
PDB_Source = "Uploaded_PDB" # @param ["PDB_ID","Uploaded_PDB"]

if PDB_Source == "Uploaded_PDB":
    pdb_file = 'protein.pdb' #@param {type:"string"}
    outfnm = os.path.join(workDir, pdb_file)

elif PDB_Source == "PDB_ID":
    Query_PDB_ID = '9BDQ' #@param {type:"string"}
    pdbfn = Query_PDB_ID + ".pdb"
    url = 'https://files.rcsb.org/download/' + pdbfn
    outfnm = os.path.join(workDir, pdbfn)
    try:
      response = requests.get(url)
      response.raise_for_status()  # Raise an exception for bad responses (4xx or 5xx)
      with open(outfnm, 'wb') as outfile:
          outfile.write(response.content)
      print(f"File downloaded to: {outfnm}")
      print(f"File size: {os.path.getsize(outfnm)} bytes")
    except requests.exceptions.RequestException as e:
      print(f"Error downloading PDB file: {e}")

print(outfnm)
# Read the PDB file
ppdb = PandasPdb().read_pdb(outfnm)
selected_chains = ['A'] # @param {"type":"raw"}

# Check if selected_chains is empty or not provided:
if 'selected_chains' not in locals() or not selected_chains:
    print("No chains selected, processing all chains.")
    # Filter chains for ATOM and HETATM records
    ppdb.df['ATOM'] = ppdb.df['ATOM']
else:
    ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['chain_id'].isin(selected_chains)]
    ppdb.df['HETATM'] = ppdb.df['HETATM'][ppdb.df['HETATM']['chain_id'].isin(selected_chains)]

# Remove water molecules (HOH)
ppdb.df['HETATM'] = ppdb.df['HETATM'][ppdb.df['HETATM']['residue_name'] != 'HOH']

# Save the filtered PDB file
ppdb.to_pdb(path=temp, records=['ATOM', 'HETATM'], gz=False, append_newline=True)

# Prepare receptor (additional filtering on top of the chain selection saved in temp)
ppdb = PandasPdb().read_pdb(temp)

# Remove water molecules (HOH)
ppdb.df['HETATM'] = ppdb.df['HETATM'][ppdb.df['HETATM']['residue_name'] != 'HOH']

# Remove OXT atoms and hydrogen atoms
ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['atom_name'] != 'OXT']
ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H']

# Save the filtered receptor PDB file
ppdb.to_pdb(path=receptor, records=['ATOM', 'HETATM'], gz=False, append_newline=True)

fixer = PDBFixer(filename=receptor)
fixer.removeHeterogens()
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.addMissingHydrogens(pH=7.4)
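# Use a context manager so the receptor file is flushed and closed before it is read again
with open(receptor, 'w') as receptor_file:
  PDBFile.writeFile(fixer.topology, fixer.positions, receptor_file)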


path = '/content/'


def is_het(residue):
    res = residue.id[0]
    return res != " " and res != "W"

def aa(residue):
    res = residue.id[0]
    return res != "W"


class ResidueSelect(Select):
    def __init__(self, chain, residue):
        self.chain = chain
        self.residue = residue

    def accept_chain(self, chain):
        return chain.id == self.chain.id

    def accept_residue(self, residue):
        return residue == self.residue and aa(residue)

def extract_ligands(path):
    # List every non-water residue of the chain-filtered structure (temp)
    pdb = PDBParser().get_structure(temp, temp)
    # Open the list once, instead of once per residue, so the file is closed properly
    with open(os.path.join(workDir, "name_residues.txt"), "a") as out:
      for model in pdb:
        for chain in model:
          for residue in chain:
            if not aa(residue):
              continue
            print(f"saving {residue}", file=out)

extract_ligands(path)

def extract_ligands2(path):
    # Same listing for the prepared receptor
    pdb = PDBParser().get_structure(receptor, receptor)
    with open(os.path.join(workDir, "name_residues_receptor.txt"), "a") as out:
      for model in pdb:
        for chain in model:
          for residue in chain:
            if not aa(residue):
              continue
            print(f"saving {residue}", file=out)

extract_ligands2(path)


dataset = pd.read_csv(os.path.join(workDir, 'name_residues.txt'), delimiter = " ", header=None)
df = pd.DataFrame(dataset)
df = df.iloc[:, [2]]
new = df.to_numpy()

dataset2 = pd.read_csv(os.path.join(workDir, 'name_residues_receptor.txt'), delimiter = " ", header=None)
df2 = pd.DataFrame(dataset2)
df2 = df2.iloc[:, [2]]
new2 = df2.to_numpy()

b = 1
res_number = []
for j in new2:
  res_number.append(b)
  b += 1

print("Residue" + " - "  + "Number" )
a = 1
for j in new:
  print(', '.join(j) + " - "  + str(a))
  a += 1
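
The chain selection and water removal performed in the cell above with BioPandas can be pictured at the level of raw PDB records. The following is a simplified, standard-library-only sketch and not the notebook's implementation: `filter_pdb_lines` is a hypothetical helper, the three example records are made up, and the code relies on the fixed-column PDB layout (residue name in columns 18-20, chain ID in column 22).

```python
def filter_pdb_lines(lines, selected_chains=None):
    """Keep ATOM/HETATM records of the selected chains and drop waters (HOH)."""
    kept = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        residue_name = line[17:20].strip()
        chain_id = line[21]
        if residue_name == "HOH":
            continue
        # An empty or missing selection keeps every chain
        if selected_chains and chain_id not in selected_chains:
            continue
        kept.append(line)
    return kept

# Made-up records: two protein atoms on chains A and B, and one water on chain A
example = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET B   1      26.266  25.413   2.842  1.00 10.38           C",
    "HETATM    3  O   HOH A 101      10.000  10.000  10.000  1.00 20.00           O",
]
print(filter_pdb_lines(example, selected_chains=["A"]))
```

With `selected_chains=["A"]`, only the first record survives: the second belongs to chain B and the third is a water molecule.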

In [None]:
#@title **Predict ligand-binding pockets from your protein structure using P2Rank**:
#@markdown **P2Rank** is a stand-alone command line program that predicts ligand-binding pockets from a protein structure. It achieves high prediction success rates without relying on an external software for computation of complex features or on a database of known protein-ligand templates.
#@markdown P2Rank makes predictions by scoring and clustering points on the protein's solvent accessible surface. Ligandability score of individual points is determined by a machine learning based model trained on the dataset of known protein-ligand complexes. For more details see [here](https://github.com/rdk/p2rank).

import subprocess
import csv
import os
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

!apt-get install openjdk-17-jdk-headless -qq > /dev/null
!update-alternatives --set java /usr/lib/jvm/java-17-openjdk-amd64/bin/java
!update-alternatives --set javac /usr/lib/jvm/java-17-openjdk-amd64/bin/javac

print(f"Receptor file: {receptor}")
output_p2rank = os.path.join(workDir, "output_p2rank")
print(f"P2Rank output directory: {output_p2rank}")

p2rank = "/content/p2rank_2.5/prank predict -f " + str(receptor) + " -o " + str(output_p2rank)
print(f"P2Rank command: {p2rank}")

# Write the command to a small shell script and run it
with open('p2rank.sh', 'w') as f:
  f.write(p2rank + "\n")
subprocess.run("chmod 700 p2rank.sh", shell=True)
result = subprocess.run("./p2rank.sh", shell=True, capture_output=True, text=True)
print("P2Rank stdout:", result.stdout)
print("P2Rank stderr:", result.stderr)


# Check if the output file exists
output_file = os.path.join(workDir, "output_p2rank/receptor.pdb_predictions.csv")
if os.path.exists(output_file):
  with open(output_file, 'r') as file:
    csvreader = csv.reader(file)
    residue = []
    score = []
    center_x = []
    center_y = []
    center_z = []
    for row in csvreader:
      residue.append(row[9:10])
      score.append(row[2:3])
      center_x.append(row[6:7])
      center_y.append(row[7:8])
      center_z.append(row[8:9])

  for pocket in range(1, len(residue)):  # row 0 holds the CSV header
    residue_ids = str((residue[pocket])[0]).split()
    score_end = str((score[pocket])[0]).split()
    center_x_end = str((center_x[pocket])[0]).split()
    center_y_end = str((center_y[pocket])[0]).split()
    center_z_end = str((center_z[pocket])[0]).split()
    print("Pocket " + str(pocket))
    print("Score = " + score_end[0])
    # Residue IDs look like "A_123"; strip the two-character chain prefix
    final_residues = [int(res_id[2:]) for res_id in residue_ids]
    print("Selected Residues = " + str(final_residues))
    print("Center x = "+ str(center_x_end[0]), "Center y = "+ str(center_y_end[0]), "Center z = "+ str(center_z_end[0]) + "\n")
else:
  print(f"Error: P2Rank output file not found at: {output_file}")
  print("Please, check if P2Rank executed successfully and generated the expected output.")
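
The cell above picks the score, pocket center and residue list by column position. Reading the same table by column name is an alternative that is easier to follow. The sketch below assumes the header names `name`, `score`, `center_x`, `center_y`, `center_z` and `residue_ids`, padded with spaces; this matches the positions used above but is still an assumption, and the sample row is made up. `read_pockets` is a hypothetical helper.

```python
import csv
import io

def read_pockets(handle):
    """Parse P2Rank-style predictions into a list of pocket dictionaries."""
    reader = csv.reader(handle)
    # Header names are padded with spaces, so strip them before use
    header = [name.strip() for name in next(reader)]
    pockets = []
    for row in reader:
        record = dict(zip(header, (value.strip() for value in row)))
        pockets.append({
            "name": record["name"],
            "score": float(record["score"]),
            "center": tuple(float(record[k]) for k in ("center_x", "center_y", "center_z")),
            # Residue IDs are assumed to look like "A_123": chain, underscore, number
            "residues": [int(r.split("_")[1]) for r in record["residue_ids"].split()],
        })
    return pockets

# Made-up sample that mimics the assumed layout (header plus one pocket)
sample = io.StringIO(
    "name     ,  rank,   score, probability, sas_points, surf_atoms,   center_x,   center_y,   center_z, residue_ids, surf_atom_ids\n"
    "pocket1  ,     1,   12.50,       0.650,         80,         40,    10.5000,    -3.2500,     7.0000, A_45 A_46 A_102, 300 301 310\n"
)
for pocket in read_pockets(sample):
    print(pocket)
```

Splitting on the underscore, rather than slicing off the first two characters, keeps working if a chain identifier is longer than one character.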

In [None]:
#@title **Please, provide the pocket or residue number for the selection**:
#@markdown **Important:** The selected pocket or residues will be used as a reference to build an optimal box for the ligand during docking. To select more than one residue, separate the numbers with commas (e.g., 147,150,155,160). **Please, DO NOT USE SPACES BETWEEN THEM.**


import re
import csv

if os.path.exists(os.path.join(workDir, "name_residue.txt")):
  os.remove(os.path.join(workDir, "name_residue.txt"))
else:
  pass

# Convert a comma-separated string into a list of strings
def Convert(string):
    return string.split(",")

def extract_ligands(path,residues):
    pdb = PDBParser().get_structure(temp, temp)
    io = PDBIO()
    io.set_structure(pdb)
    i = 1
    name_residues = []
    for model in pdb:
      for chain in model:
        for residue in chain:
          if not aa(residue):
            continue
          if i == int(residues):
            # print(residues)
            print((f"saving {residue}"), file=open(os.path.join(workDir, "name_residue.txt"), "a",))
            io.save(f"res_{i}_certo.pdb", ResidueSelect(chain, residue))
          i += 1

Selection = "Pocket" #@param ["Pocket", "Residues"]

number = '1' #@param {type:"string"}

if Selection == "Pocket":
  file = str((residue[int(number)])[0]).split()
  score_end = str((score[int(number)])[0]).split()
  center_x_end = str((center_x[int(number)])[0]).split()
  center_y_end = str((center_y[int(number)])[0]).split()
  center_z_end = str((center_z[int(number)])[0]).split()
  center_x_gnina = float(center_x_end[0])
  center_y_gnina = float(center_y_end[0])
  center_z_gnina = float(center_z_end[0])
  print("Pocket " + str(number))
  print("Score = " + score_end[0])
  print("Center x = "+ str(center_x_end[0]), "Center y = "+ str(center_y_end[0]), "Center z = "+ str(center_z_end[0]) + "\n")
  final_residues = []
  for i in range(0,len(file)):
    test = file[i]
    final_residues.append(int(test[2:]))
  residues_num = final_residues
else:
  residues_num = Convert(number)

filenames=[]
for k in range(0, len(residues_num)):
  extract_ligands(path, residues_num[k])
  filenames.append(f"res_{residues_num[k]}_certo.pdb")


with open('selection_merge.pdb', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

# Copy the merged selection, skipping the END records that were
# written after each saved residue
with open('/content/selection_merge.pdb', 'r') as file1, open('/content/selection_merge_end.pdb', 'w') as file2:
    for line in file1:
        if not line.startswith("END"):
            file2.write(line)

dataset = pd.read_csv(os.path.join(workDir, "name_residue.txt"), delimiter = " ", header=None)
df = pd.DataFrame(dataset)
df = df.iloc[:, [2]]
new = df.to_numpy()

print("Selected Residue" + " - "  + "Number" )
for j, i in zip(new, range(0, len(residues_num))):
# for j in new:
  print(', '.join(j) + " - "  + str(residues_num[i]))
res_box = '/content/selection_merge_end.pdb'
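
The merged selection written above serves as the reference for the docking box. The idea behind such a box can be sketched as the bounding box of the selected atoms, enlarged by a buffer on every side. This is an illustration under stated assumptions, not the docking program's actual routine: `bounding_box` is a hypothetical helper, the two records are made up, and coordinates are read from the fixed PDB columns 31-54.

```python
def bounding_box(pdb_lines, buffer=4.0):
    """Return (center, size) of the box enclosing all atoms plus a buffer per side."""
    coords = [
        (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        for line in pdb_lines
        if line.startswith(("ATOM", "HETATM"))
    ]
    if not coords:
        raise ValueError("no ATOM/HETATM records found")
    mins = [min(axis) for axis in zip(*coords)]
    maxs = [max(axis) for axis in zip(*coords)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # The buffer is added on both sides of each axis
    size = tuple((hi - lo) + 2 * buffer for lo, hi in zip(mins, maxs))
    return center, size

# Made-up selection with two atoms
selection = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
]
center, size = bounding_box(selection, buffer=4.0)
print("center:", center)
print("size:", size)
```

A larger buffer gives the ligand more room to move, at the cost of a larger search space.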

In [None]:
#@title **Receptor Visualization**:
#@markdown Now that the protein has been sanitized and the selection has been chosen, it is a good idea to visualize and check the protein (gray) and your selection (green).

view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

view.addModel(open(receptor,'r').read(),format='pdb')
Prot=view.getModel()
Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})
view.addSurface(py3Dmol.VDW,{'opacity':0.6,'color':'white'})


view.addModel(open(res_box,'r').read(),format='pdb')  # the selection is stored as a PDB file
ref_m = view.getModel()
ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})

view.zoomTo()
view.show()

In [None]:
#@title **Please, provide the necessary input files for the ligand**:

#@markdown Type the SMILES or a filename (SMI, CSV or SDF format) of your molecule. **Ex: C=CC(=O)OC, molecules.smi, molecules.csv or molecules.sdf**

#@markdown Just remember that if you want to use a SMI, CSV or SDF file, you should first upload the file here in Colab or to your Google Drive, and then provide the path to the file.

#@markdown If you don't know the exact SMILES for your molecule, please, check https://pubchem.ncbi.nlm.nih.gov/

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem import rdMolTransforms
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdDepictor
from rdkit.Chem import PandasTools
from IPython.display import SVG
import ipywidgets as widgets
import rdkit
from rdkit.Chem.Draw import IPythonConsole
AllChem.SetPreferCoordGen(True)
from IPython.display import Image
from openbabel import pybel
import matplotlib.image as mpimg


import os

import py3Dmol


Type = "smiles" #@param ["smiles", "csv", "sdf"]

smiles_or_filename = "ligands.smi" #@param {type:"string"}
smiles_or_filename = os.path.join(workDir, smiles_or_filename)

if Type == "smiles":
  mol_list = dm.read_smi(smiles_or_filename)
  data = dm.to_df(mol_list, smiles_column='Smiles')
  data["mol"] = data["Smiles"].apply(dm.to_mol)
  mols = data["mol"].tolist()
  mols = [dm.fix_mol(mol) for mol in mols]
  mols = [dm.sanitize_mol(mol, sanifix=True, charge_neutral=False) for mol in mols if mol is not None]
  mols = [dm.standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True) for mol in mols if mol is not None]
  data["mol"] = mols

elif Type == "csv":
  # Load the CSV file into a pandas DataFrame
  data = pd.read_csv(smiles_or_filename)

  #@markdown Column name containing the SMILES data (**.csv option only**)
  smiles_column = 'Smiles' #@param {type:"string"}

  # Drop rows with missing SMILES data if necessary
  data.dropna(subset=[smiles_column], inplace=True)

  # Convert the SMILES column to RDKit molecules
  data["mol"] = data[smiles_column].apply(dm.to_mol)

  mols = data["mol"].tolist()
  mols = [dm.fix_mol(mol) for mol in mols]
  mols = [dm.sanitize_mol(mol, sanifix=True, charge_neutral=False) for mol in mols if mol is not None]
  mols = [dm.standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True) for mol in mols if mol is not None]
  data["mol"] = mols

elif Type == "sdf":
  mol_list = dm.read_sdf(smiles_or_filename)
  smi_file_path = os.path.join(workDir, "molecules.smi")
  smi_list = dm.to_smi(mol_list, smi_file_path)
  data = dm.to_df(mol_list, smiles_column='Smiles')
  data["mol"] = mol_list
  mols = data["mol"].tolist()
  mols = [dm.fix_mol(mol) for mol in mols]
  mols = [dm.sanitize_mol(mol, sanifix=True, charge_neutral=False) for mol in mols if mol is not None]
  mols = [dm.standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True) for mol in mols if mol is not None]
  data["mol"] = mols

data
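
For reference, a SMI file is plain text. The sketch below assumes the common layout of one molecule per line, with the SMILES string first and an optional name after whitespace. It uses only the standard library and does not replace the datamol reader used above; `parse_smi` and the molecule names are hypothetical.

```python
def parse_smi(text):
    """Split SMILES-file text into (smiles, name) pairs; the name may be empty."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of spaces or tabs
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else ""
        entries.append((smiles, name))
    return entries

# Hypothetical file content with two molecules
example_smi = "C=CC(=O)OC methyl_acrylate\nCCO ethanol\n"
for smiles, name in parse_smi(example_smi):
    print(name, smiles)
```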

In [None]:
#@title **Small molecule filtering**:

#@markdown Choose the type of filter you want to apply; **Options: Lipinski Rule of 5 (Ro5), Rule of 4 (Ro4), Astex Rule of 3 (Ro3), picking the centroids of clusters (Centroids) or No Filters applied**

Type = "NoFilter" #@param ["Ro4", "Ro5", "Ro3", "Centroids", "NoFilter"]

data['ID'] = range(1, len(data) + 1)

df_descr = dm.descriptors.batch_compute_many_descriptors(mols)

if Type == "Ro4":
  filtered = pd.concat([data, df_descr], axis=1)
  filtered = filtered[filtered["mw"] >= 400]
  filtered = filtered[filtered["n_lipinski_hba"] >= 4]
  filtered = filtered[filtered["n_rings"] >= 4]
  filtered = filtered[filtered["clogp"] >= 4]
  filtered

elif Type == "Ro5":
  filtered = pd.concat([data, df_descr], axis=1)
  filtered = filtered[filtered["mw"] <= 500]
  filtered = filtered[filtered["n_lipinski_hba"] <= 10]
  filtered = filtered[filtered["n_lipinski_hbd"] <= 5]
  filtered = filtered[filtered["clogp"] <= 5]
  filtered

elif Type == "Ro3":
  filtered = pd.concat([data, df_descr], axis=1)
  filtered = filtered[filtered["mw"] <= 300]
  filtered = filtered[filtered["n_lipinski_hba"] <= 3]
  filtered = filtered[filtered["n_lipinski_hbd"] <= 3]
  filtered = filtered[filtered["clogp"] <= 3]
  filtered = filtered[filtered["n_rotatable_bonds"] <= 3]
  filtered

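elif Type == "Centroids":
  #@markdown Only applicable to the **Centroids** filter
  n_centroids = "50" #@param {type:"string"}
  cutoff_value = "0.3" #@param {type:"string"}
  clusters, mol_clusters = dm.cluster_mols(mols, cutoff=float(cutoff_value))
  indices, centroids = dm.pick_centroids(mols, npick=int(n_centroids), threshold=float(cutoff_value), method="sphere", n_jobs=-1)
  print(str(n_centroids) + " centroids picked")
  # Drop any table left over from a previous run so that the optimization cell
  # uses the centroids; pop() does not fail when 'filtered' was never defined
  globals().pop('filtered', None)

elif Type == "NoFilter":
  filtered = pd.concat([data, df_descr], axis=1)

else:
  pass

# Show the filtered table (it does not exist when the Centroids option is selected)
filtered if 'filtered' in globals() else None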
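
The rule-based filters above are plain threshold checks on the computed descriptors. As a minimal sketch, the Ro5 branch can be written as a function over one molecule's descriptor values. The keys are the column names used in the cell above; `passes_ro5` is a hypothetical helper and the two example molecules are made up.

```python
def passes_ro5(descriptors):
    """Apply the Ro5 cut-offs used in the cell above to one molecule."""
    return (
        descriptors["mw"] <= 500
        and descriptors["n_lipinski_hba"] <= 10
        and descriptors["n_lipinski_hbd"] <= 5
        and descriptors["clogp"] <= 5
    )

# Made-up descriptor values, for illustration only
candidates = [
    {"ID": 1, "mw": 320.4, "n_lipinski_hba": 5, "n_lipinski_hbd": 2, "clogp": 2.1},
    {"ID": 2, "mw": 612.8, "n_lipinski_hba": 12, "n_lipinski_hbd": 6, "clogp": 6.3},
]
print([c["ID"] for c in candidates if passes_ro5(c)])
```

Only the first molecule passes, since the second one exceeds all four thresholds.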

In [None]:
#@title **Ligand Optimization with RDKit**:

#@markdown Choose the output name for your optimized molecules file **(sdf format)**.
import warnings
warnings.filterwarnings('ignore')
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem import rdMolTransforms
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdDepictor
from rdkit.Chem import SDWriter
from IPython.display import SVG
import ipywidgets as widgets
import rdkit
from rdkit.Chem.Draw import IPythonConsole
AllChem.SetPreferCoordGen(True)
from IPython.display import Image
from openbabel import pybel
import os, sys, glob


import py3Dmol

Optimized_Mol_File = "lig_opt.sdf" #@param {type:"string"}
Optimized_Mol_File = os.path.join(workDir, Optimized_Mol_File)

# SDF writer
writer = Chem.SDWriter(Optimized_Mol_File)

if 'filtered' in globals():
# Iterating through DataFrame molecules
  for i, (mol, mol_id) in enumerate(zip(filtered["mol"], filtered["ID"])):
      print(f"Processing molecule {i+1}/{len(filtered)}...")
      if mol is None:  # Check to see if molecule is valid
          print(f"Invalid molecule on line {i}. Ignoring...")
          continue

    # Add hydrogens
      hmol = Chem.AddHs(mol)

    # Embedding 3D coordinates
      status = AllChem.EmbedMolecule(hmol, maxAttempts=500, useRandomCoords=True)
      if status != 0:
         print(f"Embedding failed for molecule on line {i}. Ignoring...")
         continue

    # MMFF energy optimization
      mp = AllChem.MMFFGetMoleculeProperties(hmol)
      ff = AllChem.MMFFGetMoleculeForceField(hmol, mp)
      AllChem.OptimizeMolecule(ff, maxIters=1000)

    # Save temporary molecule files
      mol_file = os.path.join(workDir, f"{i}.mol")
      Chem.MolToMolFile(hmol, mol_file)


      mol2 = Chem.MolFromMolFile(mol_file, removeHs=False)
      if mol2 is not None:
          mol2.SetProp("_Name", f"mol_{mol_id}")
          writer.write(mol2, confId=0)
      else:
          print(f"Error saving/reading optimized molecule on line {i}.")

      # Remove only the temporary file written above; a wildcard would also
      # delete unrelated .mol/.xyz files stored in the working folder
      os.remove(mol_file)

# Close the SDF writer
  writer.close()

elif 'centroids' in globals():# Assuming 'centroids' is a list or iterable containing the molecules
  for i, mol in enumerate(centroids):
      print(f"Processing molecule {i+1}/{len(centroids)}...")
      if mol is None:  # Check if the molecule is invalid
          print(f"Invalid molecule on line {i}. Ignoring...")
          continue

    # Add hydrogens
      hmol = Chem.AddHs(mol)

    # Embed 3D coordinates
      status = AllChem.EmbedMolecule(hmol, maxAttempts=500, useRandomCoords=True)
      if status != 0:  # Check if conformation generation failed
          print(f"Embedding failed for molecule on line {i}. Ignoring...")
          continue

    # Geometry optimization with MMFF
      mp = AllChem.MMFFGetMoleculeProperties(hmol)
      ff = AllChem.MMFFGetMoleculeForceField(hmol, mp)
      AllChem.OptimizeMolecule(ff, maxIters=1000)

    # Save optimized molecule temporarily
      mol_file = os.path.join(workDir, f"{i}.mol")
      Chem.MolToMolFile(hmol, mol_file)

    # Reload the optimized molecule to include in the SDF
      mol2 = Chem.MolFromMolFile(mol_file, removeHs=False)
      if mol2 is not None:
        # Set the molecule title as "mol_X", where X is the sequential number
          mol2.SetProp("_Name", f"mol_{i+1}")  # Use i+1 to start numbering from 1
          writer.write(mol2, confId=0)
      else:
          print(f"Error saving/reading optimized molecule on line {i}.")

    # Remove only the temporary file written above; a wildcard would also
    # delete unrelated .mol/.xyz files stored in the working folder
      os.remove(mol_file)
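# Close the SDF writer so that the centroid molecules are flushed to disk
  writer.close()
else:
  # Nothing to optimize; still close the SDF writer
  writer.close()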

In [None]:
#@title **Parameters for the docking calculation:**

#@markdown Please choose the name of the output file from the docking calculation **(do not add file extension)**:

Output_file = "output_docking_lig" #@param {type:"string"}
Output_file = os.path.join(workDir, Output_file)
#@markdown Amount of buffer space to add to the generated box (Angstroms):

size = 20 #@param {type:"slider", min:1, max:20, step:1}

#@markdown Exhaustiveness of the global search (roughly proportional to time):
exhaustiveness = 10 #@param {type:"slider", min:2, max:64, step:2}

#@markdown Explicit random seed:
seed = "0" #@param {type:"string"}

#@markdown Convolutional neural network (CNN) parameter:

cnn_scoring = "rescore (default)" #@param ["none", "rescore (default)", "refinement", "all"]
if cnn_scoring == "rescore (default)":
  cnn_scoring = "rescore"
  scoring_vinardo = " "
elif cnn_scoring == "none":
  scoring_vinardo = " --scoring vinardo "
else:
  scoring_vinardo = " "

#@markdown **cnn_scoring** determines at what points of the docking procedure that the CNN scoring function is used.

#@markdown **none** - No CNNs used for docking. Here, uses all the empirical scoring from [Vinardo](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155183) scoring function.

#@markdown **rescore** (default) - CNN used for reranking of final poses. Least computationally expensive CNN option.

#@markdown **refinement** - CNN used to refine poses after Monte Carlo chains and for final ranking of output poses. 10x slower than rescore when using a GPU.

#@markdown **all** - CNN used as the scoring function throughout the whole procedure. Extremely computationally intensive and not recommended.

#@markdown The default CNN scoring function is an ensemble of 5 models selected to balance pose prediction performance and runtime: dense, general_default2018_3, dense_3, crossdock_default2018, and redock_default2018.

import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

docking_output_gz = os.path.join(workDir, Output_file + ".sdf.gz")
docking_output = os.path.join(workDir, Output_file + ".sdf")

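# Remove results from a previous run. Both files are checked because a
# leftover .sdf file would stop gunzip from writing the new output.
for old_output in (docking_output_gz, docking_output):
  if os.path.exists(old_output):
    os.remove(old_output)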


if Selection == "Pocket":
  gnina = "./gnina -r " + str(receptor) + " -l " +  str(Optimized_Mol_File) + " --center_x " + str(center_x_gnina) +  " --center_y " + str(center_y_gnina) +  " --center_z " + str(center_z_gnina) + " --size_x " + str(size) +  " --size_y " + str(size) +  " --size_z " + str(size) + " --cnn_scoring " + str(cnn_scoring) + " --exhaustiveness " + str(exhaustiveness) + " -o " + str(docking_output_gz) + str(scoring_vinardo) +  "--num_modes 10 " + "--seed " + str(int(seed))
else:
  gnina = "./gnina -r " + str(receptor) + " -l " +  str(Optimized_Mol_File) + " --autobox_ligand " + str(res_box) +  " --autobox_add " + str(size) + " --cnn_scoring " + str(cnn_scoring) + " --exhaustiveness " + str(exhaustiveness) + " -o " + str(docking_output_gz) + str(scoring_vinardo) +  "--num_modes 10 " + "--seed " + str(int(seed))

zip_gnina = "gunzip " + str(docking_output_gz)

# Write the docking and decompression commands to a shell script
with open('gnina.sh', 'w') as f:
    f.write(gnina + "\n")
    f.write(zip_gnina + "\n")

!chmod 700 gnina.sh 2>&1 1>/dev/null
!bash gnina.sh

import gzip
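
The command above is assembled by string concatenation, which breaks as soon as a path contains a space. A minimal sketch of an alternative is shown below: it builds the same call as an argument list and only turns it into a shell string at the end. The helper name `build_gnina_command` and the `box` dictionary layout are assumptions made for illustration; the flags are the ones already used in the cell.

```python
import shlex

def build_gnina_command(receptor, ligand, output, box, cnn_scoring="rescore",
                        exhaustiveness=10, num_modes=10, seed=0):
    """Return the gnina call as an argument list (hypothetical helper).

    `box` is either {"center": (x, y, z), "size": s} for an explicit pocket
    or {"autobox_ligand": path, "autobox_add": padding} for an automatic box.
    """
    cmd = ["./gnina", "-r", str(receptor), "-l", str(ligand)]
    if "center" in box:
        for axis, value in zip("xyz", box["center"]):
            cmd += [f"--center_{axis}", str(value)]
        for axis in "xyz":
            cmd += [f"--size_{axis}", str(box["size"])]
    else:
        cmd += ["--autobox_ligand", str(box["autobox_ligand"]),
                "--autobox_add", str(box["autobox_add"])]
    cmd += ["--cnn_scoring", cnn_scoring]
    if cnn_scoring == "none":
        # Without CNN scoring, the notebook switches to the Vinardo function
        cmd += ["--scoring", "vinardo"]
    cmd += ["--exhaustiveness", str(exhaustiveness),
            "--num_modes", str(num_modes),
            "--seed", str(int(seed)),
            "-o", str(output)]
    return cmd

# Placeholder file names, for illustration only
cmd = build_gnina_command("receptor.pdb", "my ligands.sdf", "out.sdf.gz",
                          {"autobox_ligand": "ref.pdb", "autobox_add": 4})
print(shlex.join(cmd))  # shell-safe string; the path with a space is quoted
```

The list form can be passed directly to `subprocess.run(cmd)`, or written to `gnina.sh` through `shlex.join(cmd)` if the shell script is still wanted.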

In [None]:
import pandas as pd
from rdkit.Chem import rdFMCS,AllChem, Draw, PandasTools
import seaborn as sns
from openbabel import pybel

#@title **Docking Analysis:**

#@markdown Please choose which parameter will be used to sort the results **(CNN_VS is recommended for Virtual Screening studies)**:
Parameter = "CNN_VS" #@param ["minimizedAffinity", "CNNscore", "CNN_VS", "CNNaffinity"]

#@markdown Please fill in the blanks with the sdf file name from the docking output:

# Input file path
Docking_output = "output_docking_lig.sdf" #@param {type:"string"}
Docking_output = os.path.join(workDir, Docking_output)

# Read the entire content of the file
with open(Docking_output, "r") as infile:
    lines = infile.readlines()

# Process the lines to update the titles
molecule_count = 0  # Counter for the current molecule (e.g., mol_1, mol_2)
solution_count = 0  # Counter for the solution number (e.g., _1, _2, ..., _10)
current_molecule = None  # Stores the current molecule title (e.g., mol_1)

for i, line in enumerate(lines):
    # Check if the line contains a molecule title (e.g., mol_1)
    if line.startswith("mol_"):
        # Extract the molecule number (e.g., 1 from mol_1)
        molecule_number = line.strip().split("_")[1]

        # If this is a new molecule, reset the solution counter
        if molecule_number != current_molecule:
            current_molecule = molecule_number
            solution_count = 0

        # Increment the solution counter
        solution_count += 1

        # Create the new title (e.g., mol_1_1, mol_1_2, etc.)
        new_title = f"mol_{molecule_number}_{solution_count}\n"

        # Update the line in the list
        lines[i] = new_title

# Write the updated content back to the input file
with open(Docking_output, "w") as outfile:
    outfile.writelines(lines)

# Load and process docking results
VinaPoses=PandasTools.LoadSDF(Docking_output)
AllPoses=pd.concat([VinaPoses])

# List to store scores and titles
scores = []

# Read the SDF file
for mol in pybel.readfile('sdf', Docking_output):
    molecule_title = mol.title.strip()

    # Create a dictionary to store the scores
    score_data = {
        'Molecule_Solution': molecule_title,
        'minimizedAffinity': float(mol.data['minimizedAffinity']),
        'CNNscore': float(mol.data['CNNscore']),
        'CNNaffinity': float(mol.data['CNNaffinity']),
        'CNN_VS': float(mol.data['CNN_VS'])
    }

    scores.append(score_data)

# Create a DataFrame from the list of dictionaries
scores_df = pd.DataFrame(scores)

# Reorder columns
scores_df = scores_df[['Molecule_Solution', 'minimizedAffinity', 'CNNscore', 'CNNaffinity', 'CNN_VS']]

# Sort the DataFrame based on the selected parameter
if Parameter == "minimizedAffinity":
    scores_sorted = scores_df.sort_values(by=Parameter, ascending=True).reset_index(drop=True)
else:
    scores_sorted = scores_df.sort_values(by=Parameter, ascending=False).reset_index(drop=True)

scores_sorted.to_csv(os.path.join(workDir, Parameter + "_sorted.csv"), index=False)
#scores_sorted

# Write the sorted molecules to a new SDF file
# Dictionary to store molecules by their titles
molecule_dict = {}

# Read the SDF file again and store molecules in the dictionary
for mol in pybel.readfile('sdf', Docking_output):
    molecule_title = mol.title.strip()
    molecule_dict[molecule_title] = mol

# Write the molecules to a new SDF file in the order specified by the sorted DataFrame
#@markdown Please choose the name of the output sdf file sorted according to the Parameter selected:
Output_sdf = "lig_sorted_CNN_VS.sdf" #@param {type:"string"}
Output_sdf = os.path.join(workDir, Output_sdf)

with open(Output_sdf, 'w') as outfile:
    for molecule_solution in scores_sorted['Molecule_Solution']:
        if molecule_solution in molecule_dict:
            mol = molecule_dict[molecule_solution]
            outfile.write(mol.write(format='sdf'))
        else:
            print(f"Warning: Molecule '{molecule_solution}' not found in the SDF file.")

print(f"Sorted SDF file saved to {Output_sdf}")
scores_sorted
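
The sort direction in the cell above depends on the metric: `minimizedAffinity` is an energy, so the most negative value ranks first, while the CNN metrics rank from high to low. The sketch below reproduces that rule with plain Python on a few made-up scores; the helper `rank_poses` and the numbers are illustrative only.

```python
# Hypothetical scores for three poses; the keys mirror the SDF data fields
poses = [
    {"Molecule_Solution": "mol_1_1", "minimizedAffinity": -7.2, "CNN_VS": 4.1},
    {"Molecule_Solution": "mol_2_1", "minimizedAffinity": -9.0, "CNN_VS": 3.5},
    {"Molecule_Solution": "mol_3_1", "minimizedAffinity": -6.1, "CNN_VS": 5.0},
]

# Metrics where a larger value means a better pose
HIGHER_IS_BETTER = {"CNNscore", "CNNaffinity", "CNN_VS"}

def rank_poses(poses, parameter):
    """Return the poses ordered from best to worst for the chosen metric."""
    return sorted(poses, key=lambda p: p[parameter],
                  reverse=parameter in HIGHER_IS_BETTER)

best_by_energy = rank_poses(poses, "minimizedAffinity")[0]["Molecule_Solution"]
best_by_cnn_vs = rank_poses(poses, "CNN_VS")[0]["Molecule_Solution"]
print(best_by_energy, best_by_cnn_vs)
```

The two metrics can disagree on the best pose, as they do here, which is why the choice of `Parameter` matters for the molecules carried into the next steps.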

In [None]:
# @title **Filter for the top 20 molecules in the output sdf file (preparation for PLACER step)**
#@markdown This step filters the sorted SDF file to keep only the top solutions from the docking step. The number of molecules can be changed by modifying the value of the **num_molecules** variable in the code.

from rdkit import Chem
import pandas as pd
import os

sdf_file = Output_sdf

def filter_top_molecules(sdf_file, num_molecules=20):
    """Filters the top N molecules from an SDF file, keeping original titles.

    Args:
        sdf_file: Path to the SDF file.
        num_molecules: Number of top molecules to extract (default is 20).

    Returns:
        A list of RDKit Mol objects representing the top molecules.
        Returns an empty list if the file does not exist or if an error occurs.
    """
    if not os.path.exists(sdf_file):
        print(f"Error: File not found - {sdf_file}")
        return []

    try:
        suppl = Chem.SDMolSupplier(sdf_file)
        top_molecules = []
        for mol in suppl:
            if mol is not None:  # Check for valid molecules
                top_molecules.append(mol)
                # Count valid molecules only, so an unreadable record cannot disable the cutoff
                if len(top_molecules) == num_molecules:
                    break  # Stop after extracting the desired number

        return top_molecules

    except Exception as e:
        print(f"An error occurred: {e}")
        return []

top_molecules = filter_top_molecules(Output_sdf)  # Using Output_sdf

if top_molecules:
    print(f"Successfully extracted {len(top_molecules)} molecules.")

    writer = Chem.SDWriter(os.path.join(workDir, "top_20_molecules.sdf"))
    for mol in top_molecules:
        writer.write(mol)
    writer.close()
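
When taking the top N entries from a reader that can return `None` for unreadable records, the cutoff should count valid records rather than file positions. A minimal sketch of that idea with `itertools.islice` follows; `take_valid` is a hypothetical helper and the list stands in for an SDF supplier.

```python
from itertools import islice

def take_valid(records, n):
    """Return the first n records that are not None."""
    valid = (record for record in records if record is not None)
    return list(islice(valid, n))

# Stand-in for a supplier in which the second record failed to parse
records = ["mol_1_1", None, "mol_2_1", "mol_3_1"]
top = take_valid(records, 2)
print(top)
```

Because the generator is consumed lazily, reading stops as soon as N valid records have been collected.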

In [None]:
#@title **Convert molecules to PDB files (preparation for PLACER step)**


import os
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem

def sdf_to_individual_pdbs(sdf_file, output_dir):
    """
    Convert each molecule in an SDF file to individual PDB files.

    Args:
        sdf_file (str): Path to the input SDF file
        output_dir (str): Directory to save PDB files
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Read the SDF file
    supplier = Chem.SDMolSupplier(sdf_file)

    for i, mol in enumerate(supplier):
        if mol is not None:
            # Get the molecule title (first line of SDF record)
            title = mol.GetProp("_Name") if mol.HasProp("_Name") else f"molecule_{i+1}"

            # Clean the title to make it filesystem-safe
            # Spaces are replaced too, so the name matches the .sdf file written for the same pose
            safe_title = "".join(c if c.isalnum() or c in "_-" else "_" for c in title)
            safe_title = safe_title.strip()

            # Generate output path
            pdb_file = os.path.join(output_dir, f"{safe_title}.pdb")
            hmol = Chem.AddHs(mol, addCoords=True)
            # Write to PDB file
            Chem.MolToPDBFile(hmol, pdb_file)

            print(f"Saved: {pdb_file}")
        else:
            print(f"Warning: Failed to process molecule {i+1}")

# Input and output locations come from the notebook form, not from command-line arguments
sdf_file = "lig_sorted_CNN_VS.sdf" #@param {type:"string"}
sdf_file = os.path.join(workDir, sdf_file)
output_dir = os.path.join(workDir, 'conformer_docking')

sdf_to_individual_pdbs(sdf_file, output_dir)

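The file name of each pose is derived from its SDF title, so the title has to be made filesystem-safe first. The sketch below isolates that step; `safe_filename` is a hypothetical helper that follows the same character rule (letters, digits, `_` and `-` are kept, everything else becomes `_`) and falls back to a default name when the title is empty.

```python
def safe_filename(title, fallback):
    """Return a filesystem-safe version of an SDF title (illustrative helper)."""
    cleaned = "".join(c if c.isalnum() or c in "_-" else "_" for c in title.strip())
    return cleaned or fallback

print(safe_filename("mol_253_3", "molecule_1"))   # unchanged
print(safe_filename("mol 1/2", "molecule_1"))     # space and slash replaced
print(safe_filename("", "molecule_1"))            # empty title uses the fallback
```

Using one rule for both the `.pdb` and `.sdf` outputs keeps the two files of a pose paired by name.
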
In [None]:
#@title **Convert single SDF to multiple SDF files (preparation for PLACER step)**


import os
from rdkit import Chem

def split_sdf_to_individual_files(input_sdf, output_dir):
    """
    Split an SDF file into individual molecule files.

    Args:
        input_sdf (str): Path to input SDF file
        output_dir (str): Directory to save individual molecule files
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Read the SDF file
    supplier = Chem.SDMolSupplier(input_sdf)

    for i, mol in enumerate(supplier):
        if mol is not None:
            # Get molecule title (or create one if none exists)
            if mol.HasProp("_Name"):
                title = mol.GetProp("_Name")
            else:
                title = f"molecule_{i+1}"

            # Clean the title to make it filesystem-safe
            safe_title = "".join(c if c.isalnum() or c in "_-" else "_" for c in title)
            safe_title = safe_title.strip()

            # Generate output path
            output_file = os.path.join(output_dir, f"{safe_title}.sdf")

            hmol = Chem.AddHs(mol, addCoords=True)
            # Write molecule to individual SDF file
            writer = Chem.SDWriter(output_file)
            writer.write(hmol)
            writer.close()

            print(f"Saved: {output_file}")
        else:
            print(f"Warning: Failed to process molecule {i+1}")

# Input and output locations come from the notebook form, not from command-line arguments
input_sdf = "lig_sorted_CNN_VS.sdf" #@param {type:"string"}
input_sdf = os.path.join(workDir, input_sdf)
output_dir = os.path.join(workDir, 'conformer_docking')

split_sdf_to_individual_files(input_sdf, output_dir)

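In the SDF format every record ends with a line containing `$$$$`, and the first line of a record is its title. The sketch below splits a multi-record SDF string on that delimiter using only the standard library. It is not a replacement for RDKit (no chemistry is parsed, no hydrogens are added), and the record bodies in the example are placeholders rather than valid mol blocks.

```python
def split_sdf_records(sdf_text):
    """Split SDF text into one string per record, keeping the $$$$ terminator."""
    records, current = [], []
    for line in sdf_text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "$$$$":
            records.append("".join(current))
            current = []
    return records

def record_title(record):
    """Return the title, which is the first line of an SDF record."""
    return record.splitlines()[0].strip()

# Placeholder content: real records contain a full mol block and data fields
sdf_text = (
    "mol_1_1\n(mol block)\n$$$$\n"
    "mol_1_2\n(mol block)\n$$$$\n"
)
records = split_sdf_records(sdf_text)
print([record_title(r) for r in records])
```
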
In [None]:
%%bash
#@title **Calculating the protein-small molecule conformational ensembles with PLACER**
# Activate the conda environment
eval "$(conda shell.bash hook)"
conda activate placer_env

# Run the Python script
python <<'EOF'
import sys, os
import glob
import warnings
warnings.filterwarnings("ignore")

Google_Drive_Path = '/content/drive/MyDrive/' #@param {type:"string"}
workDir = Google_Drive_Path
output_dir = os.path.join(workDir, 'conformer_docking')
#@markdown Number of samples to generate, 50-100 is a good number in most cases.
n_samples = 50 #@param {type:"slider", min:10, max:200, step:10}

# PLACER imports and model setup (done once, before the loop)
import json
import pandas as pd
import matplotlib.pyplot as plt
sys.path.append("/content/PLACER/")
import PLACER

# Initialize the PLACER model with the default checkpoint
placer = PLACER.PLACER()

receptor_file = os.path.join(workDir, "receptor.pdb")

# Get all molecule files in the output directory
mol_files = sorted(glob.glob(os.path.join(output_dir, "mol_*.pdb")))

# Process each molecule
for mol_file in mol_files:
    # Extract the mode_number from filename (e.g., "mol_253_3.pdb" -> "253_3")
    base_name = os.path.basename(mol_file)
    mode_number = base_name.replace("mol_", "").replace(".pdb", "")

    # Check that the corresponding SDF file exists before building the complex
    ligand = os.path.join(output_dir, f"mol_{mode_number}.sdf")
    if not os.path.exists(ligand):
        print(f"Warning: Ligand file {ligand} not found, skipping {mol_file}")
        continue

    output_file = os.path.join(output_dir, f"pose_{mode_number}.pdb")  # Save in output_dir

    # Combine receptor and ligand into a single PDB file
    with open(receptor_file, 'r') as infile1:
        receptor_lines = infile1.readlines()
    # Drop trailing blank lines and END records so the ligand atoms can follow the receptor
    while receptor_lines and (not receptor_lines[-1].strip() or receptor_lines[-1].startswith("END")):
        receptor_lines.pop()
    with open(output_file, 'w') as outfile:
        outfile.writelines(receptor_lines)
        with open(mol_file, 'r') as infile2:
            outfile.write(infile2.read())

    receptor = output_file

    print(f"\nProcessing molecule {mode_number}...")

    print(f"""
    ###############################################
    # Predicting for mode {mode_number} #
    ###############################################
    """)

    pdbfile = receptor
    with open(pdbfile, "r") as pdb_handle:
        pdbstr = pdb_handle.read()

    ODIR = os.path.join(output_dir, "outputs_PLACER")  # Save outputs in output_dir

    inp_dict = {"ligand_reference": {"UNL": ligand},
                "name": os.path.basename(pdbfile).replace(".pdb", ""),
                "pdb": pdbstr}

    pl_inp = PLACER.PLACERinput()
    pl_inp.create_from_dict(inp_dict)
    outputs_denovo2 = placer.run(pl_inp, int(n_samples))

    # Ranking outputs by prmsd
    outputs_denovo2 = PLACER.utils.rank_outputs(outputs_denovo2, "prmsd")

    # Dumping output models to PDB
    os.makedirs(ODIR, exist_ok=True)
    print(f"Writing outputs to {ODIR}")
    PLACER.protocol.dump_output(outputs_denovo2, f"{ODIR}/{pl_inp.name()}")

    df = pd.DataFrame.from_dict({k: [outputs_denovo2[n][k] for n in outputs_denovo2] for k in outputs_denovo2[0].keys() if k not in ["item", "model", "center"]})

    print(f"Completed processing for mode {mode_number}")

print("\nAll molecules processed successfully!")
EOF
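
Before PLACER runs, each ligand pose is appended to the receptor to form a single complex PDB, which requires removing the receptor's terminating `END` record first. A minimal sketch of that merge on in-memory text is given below; `merge_pdb_text` is a hypothetical helper, and the two-line PDB fragments are shortened placeholders rather than real structures.

```python
def merge_pdb_text(receptor_text, ligand_text):
    """Append ligand records to receptor records, dropping the receptor's trailing END."""
    lines = receptor_text.splitlines()
    # Remove trailing blank lines and END/ENDMDL records only; atom records are kept
    while lines and (not lines[-1].strip() or lines[-1].startswith("END")):
        lines.pop()
    return "\n".join(lines) + "\n" + ligand_text

# Shortened placeholder records, for illustration only
receptor_text = "ATOM      1  N   MET A   1\nATOM      2  CA  MET A   1\nEND\n"
ligand_text = "HETATM    1  C1  UNL     1\nEND\n"

complex_text = merge_pdb_text(receptor_text, ligand_text)
print(complex_text)
```

Stripping only trailing `END` lines means a receptor file that lacks the record does not lose its last atom.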

In [None]:
#@title **Batch Filter PLACER Ensembles**

#@markdown **Filter unphysical samples (e.g., steric clashes) from the PLACER ensemble**

#@markdown This cell applies structural filters to ligands in an ensemble to remove unphysical conformations while retaining the full system for valid frames.

#@markdown The ligand is evaluated on each frame using the following criteria:

#@markdown - ✅ **Steric Clashes** – Checks if atoms are too close together.
#@markdown - ✅ **Bond Lengths** – Identifies abnormally long or short bonds.
#@markdown - ✅ **Bond Angles** – Flags angles outside a reasonable range.

#@markdown Frames where the ligand passes all filters are saved into a new trajectory file, preserving the entire molecular system.


import os
import glob
import MDAnalysis as mda
from MDAnalysis.analysis import align
import numpy as np
from scipy.spatial.distance import pdist

Google_Drive_Path = '/content/drive/MyDrive/' #@param {type:"string"}
workDir = Google_Drive_Path
output_dir = os.path.join(workDir, 'conformer_docking')

def process_placer_ensemble(pdb_path, output_dir):
    """Process a single PLACER ensemble"""
    try:
        u = mda.Universe(pdb_path, pdb_path)

        # Align protein structure
        average = align.AverageStructure(u, u, select='protein and name CA', ref_frame=0).run()
        ref = average.results.universe
        aligner = align.AlignTraj(u, ref, select='protein and name CA', in_memory=True).run()

        # Save aligned topology
        pdb_file = u.select_atoms("all")
        topology_path = os.path.join(output_dir, "temp_topology.pdb")
        pdb_file.write(topology_path)

        # Reload with proper topology
        u = mda.Universe(topology_path, pdb_path)
        all_atoms = u.select_atoms("all")

        # Extract mode_number from filename
        base_name = os.path.basename(pdb_path)
        mode_number = base_name.split('_model.pdb')[0].split('pose_')[-1]

        # Define output path
        # new_traj = os.path.join(output_dir, f"Filter_PLACER/PLACER_filter.pdb")
        new_traj_dir = os.path.join(output_dir, "Filter_PLACER")
        os.makedirs(new_traj_dir, exist_ok=True)
        new_traj = os.path.join(new_traj_dir, f"PLACER_filter_{mode_number}.pdb")


        # Guess the ligand bonds once, on the first frame, and reuse the selection for every frame
        ligand_sel = get_ligand_selection(u)

        with mda.Writer(new_traj, all_atoms.n_atoms) as W:
            for ts in u.trajectory:
                if ligand_is_realistic(ligand_sel):
                    W.write(all_atoms)

        print(f"✅ Processed {mode_number}: Saved to {new_traj}")
        os.remove(topology_path)  # Clean up temporary file
        return True

    except Exception as e:
        print(f"❌ Failed to process {pdb_path}: {str(e)}")
        return False

# Ligand geometry filters
def get_ligand_selection(universe):
    ligand = universe.select_atoms("resname UNL")
    if len(ligand) == 0:
        raise ValueError("Ligand selection is empty. Check if 'resname UNL' is correct.")
    ligand.guess_bonds()
    return ligand

def has_steric_clashes(ligand, min_distance=1.0):
    """Check if ligand atoms are too close together (steric clashes)."""
    if len(ligand) < 2:
        return False
    distances = pdist(ligand.positions)  # All pairwise atom distances
    return bool(np.any(distances < min_distance))

def has_unusual_bond_lengths(ligand, min_bond=0.9, max_bond=1.8):
    """Check for unrealistic bond lengths in the ligand.

    Bonds to sulfur, bromine, or iodine can legitimately exceed 1.8 Å;
    raise max_bond for such ligands.
    """
    if len(ligand.bonds) == 0:
        return False
    # to_indices() returns universe-wide atom indices, so positions must be
    # looked up on the universe rather than on the ligand selection
    positions = ligand.universe.atoms.positions
    for i, j in ligand.bonds.to_indices():
        dist = np.linalg.norm(positions[i] - positions[j])
        if dist < min_bond or dist > max_bond:
            return True
    return False

def has_unusual_angles(ligand, min_angle=80, max_angle=180):
    """Check for unrealistic bond angles."""
    if not hasattr(ligand, "angles") or len(ligand.angles) == 0:
        return False
    # Angle indices are universe-wide as well
    positions = ligand.universe.atoms.positions
    for i, j, k in ligand.angles.to_indices():
        a, b, c = positions[i], positions[j], positions[k]
        v1 = a - b
        v2 = c - b
        cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        theta = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
        if theta < min_angle or theta > max_angle:
            return True
    return False

def ligand_is_realistic(ligand):
    return not (
        has_steric_clashes(ligand) or
        has_unusual_bond_lengths(ligand) or
        has_unusual_angles(ligand)
    )

# Main processing loop
placer_outputs = glob.glob(os.path.join(output_dir, "outputs_PLACER/pose_*_model.pdb"))

if not placer_outputs:
    print("❌ No PLACER outputs found in:", os.path.join(output_dir, "outputs_PLACER"))
else:
    print(f"🔍 Found {len(placer_outputs)} PLACER outputs to process")
    for pdb_path in placer_outputs:
        process_placer_ensemble(pdb_path, output_dir)
    print("✅ All processing complete!")
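
The three filters reduce to simple geometry on atomic coordinates: a clash is any pair of atoms closer than a cutoff, and a bond angle is the angle between the two vectors that leave the central atom. The sketch below shows both calculations with the standard library only, on made-up coordinates in Å for a bent three-atom chain. The function names and the example geometry are illustrative and independent of MDAnalysis.

```python
import math
from itertools import combinations

def has_clash(coords, min_distance=1.0):
    """True if any two atoms are closer than min_distance (in Å)."""
    return any(math.dist(p, q) < min_distance for p, q in combinations(coords, 2))

def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    cos_theta = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp against rounding errors before taking the arccosine
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Made-up C-C-C chain: 1.53 Å bonds and a 112 degree angle at the central atom
bond = 1.53
central = (0.0, 0.0, 0.0)
first = (bond, 0.0, 0.0)
third = (bond * math.cos(math.radians(112)), bond * math.sin(math.radians(112)), 0.0)

good_frame = [first, central, third]
bad_frame = [first, central, (0.5, 0.0, 0.0)]  # third atom collapsed onto the chain

print(has_clash(good_frame), round(angle_deg(first, central, third), 1))
print(has_clash(bad_frame))
```

A frame is kept only when every such check passes, which is the rule `ligand_is_realistic` applies.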

In [None]:
#@title **Batch Calculate Binding Affinity for All PLACER Ensembles**

#!mamba install -c conda-forge pymol-open-source -y
#!pip install qcelemental
#!pip install torch-geometric

import warnings
warnings.filterwarnings('ignore')
import MDAnalysis as mda
from MDAnalysis.analysis import rms, align
import numpy as np
from pdbfixer import PDBFixer
from pymol import cmd
import glob
import csv
from pathlib import Path
import os
import pandas as pd
import subprocess
import matplotlib.pyplot as plt
from openmm.app import PDBFile
import qcelemental as qcel

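#@markdown Add hydrogens to the receptor (PDBFixer, pH 7.4) and to the ligand (PyMOL) before prediction:
Add_hydrogens = True #@param {type:"boolean"}

#@markdown Use every Nth frame of the filtered ensemble (1 = use all frames):
Skip = 1 #@param {type:"slider", min:1, max:100, step:1}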

Google_Drive_Path = '/content/drive/MyDrive/' #@param {type:"string"}
workDir = Google_Drive_Path
output_dir = os.path.join(workDir, 'conformer_docking')

# Initialize results storage
all_results = []

# Find all filtered PLACER outputs
placer_outputs = glob.glob(os.path.join(output_dir, "Filter_PLACER/PLACER_filter_*.pdb"))

if not placer_outputs:
    print("❌ No PLACER outputs found in:", os.path.join(output_dir, "Filter_PLACER"))
else:
    print(f"🔍 Found {len(placer_outputs)} PLACER outputs to process")

    for filtered_pdb in placer_outputs:
        cmd.delete("all")
        # Extract mode_number from filename
        base_name = os.path.basename(filtered_pdb)
        mode_number = base_name.split('PLACER_filter_')[-1].split('.pdb')[0]

        print(f"\nProcessing ensemble {mode_number}...")

        # Create unique directory for this mode
        pdb_dir = os.path.join(output_dir, f'PDBs_affinity_{mode_number}')
        os.makedirs(pdb_dir, exist_ok=True)

        # Clear existing files
        if os.path.exists('/content/input_ensemble.csv'):
            os.remove('/content/input_ensemble.csv')

        # Load the trajectory - Use the filtered PDB for both topology and trajectory
        u = mda.Universe(filtered_pdb, filtered_pdb)


        # Align the ensemble on the protein C-alpha atoms
        average = align.AverageStructure(u, u, select='protein and name CA', ref_frame=0).run()
        ref = average.results.universe
        aligner = align.AlignTraj(u, ref, select='protein and name CA', in_memory=True).run()

        pdb_file = u.select_atoms("all")
        topology_path = os.path.join(pdb_dir, f"topology_{mode_number}.pdb") # Save topology in pdb_dir
        pdb_file.write(topology_path)

        u1 = mda.Universe(topology_path, filtered_pdb)

        # Prepare CSV file
        csv_file_path = Path('/content/input_ensemble.csv')
        with open(csv_file_path, mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(['unique_id', 'sdf_file', 'pdb_file'])

        # Process frames
        protein_atoms = u1.select_atoms("protein")
        ligand_atoms = u1.select_atoms("resname UNL")
        i = 0

        for ts in u1.trajectory[0:len(u1.trajectory):int(Skip)]:
            i += 1
            try:
                # Write protein and ligand frames
                with mda.Writer(os.path.join(pdb_dir, f'receptor_frame{i}.pdb'), protein_atoms.n_atoms) as W:
                    W.write(protein_atoms)
                with mda.Writer(os.path.join(pdb_dir, f'ligand_frame{i}.pdb'), ligand_atoms.n_atoms) as W:
                    W.write(ligand_atoms)

                # Convert to SDF and add hydrogens
                name_pymol = f'ligand_frame{i}'
                pdb_file = os.path.join(pdb_dir, f'ligand_frame{i}.pdb')
                sdf_file = os.path.join(pdb_dir, f'ligand_frame{i}.sdf')

                if Add_hydrogens:
                    # Protein hydrogens
                    fixer = PDBFixer(filename=os.path.join(pdb_dir, f'receptor_frame{i}.pdb'))
                    fixer.removeHeterogens()
                    fixer.findMissingResidues()
                    fixer.findMissingAtoms()
                    fixer.addMissingAtoms()
                    fixer.addMissingHydrogens(pH=7.4)
                    receptor_pdb = os.path.join(pdb_dir, f'receptor_frame{i}_H.pdb')
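                    # Close the file explicitly so it is fully written before AEV-PLIG reads it
                    with open(receptor_pdb, 'w') as receptor_handle:
                        PDBFile.writeFile(fixer.topology, fixer.positions, receptor_handle)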

                    # Ligand hydrogens
                    cmd.delete("all")
                    cmd.load(pdb_file, name_pymol)
                    cmd.h_add(name_pymol)
                    cmd.save(sdf_file, name_pymol, format="sdf")
                    cmd.delete("all")
                else:
                    cmd.delete("all")
                    receptor_pdb = os.path.join(pdb_dir, f'receptor_frame{i}.pdb')
                    cmd.load(pdb_file, name_pymol)
                    cmd.save(sdf_file, name_pymol, format="sdf")
                    cmd.delete("all")

                # Record in CSV
                with open(csv_file_path, mode='a', newline='') as file:
                    writer = csv.writer(file)
                    writer.writerow([i, sdf_file, receptor_pdb])

            except Exception as e:
                print(f"⚠️ Error processing frame {i} of {mode_number}: {str(e)}")
                continue

        # Run AEV-PLIG prediction
        result = subprocess.run(
            f"python /content/AEV-PLIG/process_and_predict.py --dataset_csv=/content/input_ensemble.csv --data_name=ensemble_{mode_number} --trained_model_name=model_GATv2Net_ligsim90_fep_benchmark",
            shell=True,
            capture_output=True,
            text=True
        )

        if result.returncode != 0:
            print(f"❌ AEV-PLIG failed for {mode_number}: {result.stderr}")
            continue

        # Process predictions
        csv_file_path = Path(f'/content/AEV-PLIG/output/predictions/ensemble_{mode_number}_predictions.csv')

        if not csv_file_path.exists():
            print(f"⚠️ No predictions file found for {mode_number}")
            continue

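        # Convert each prediction, treated as a pK value, to a binding free energy:
        # dG = RT ln(10**-pK) = -RT ln(10) * pK
        r = 8.3145  # Gas constant, J/mol/K
        t = 297  # Temperature, K
        dg_kcal_list = []
        with open(csv_file_path, mode='r', newline='') as file:
            csv_reader = csv.DictReader(file)
            for row in csv_reader:
                preds_values = float(row['preds'])
                k = 10**-preds_values  # Dissociation constant implied by the pK
                dg_j = r * t * np.log(k)
                dg_kcal = dg_j / 4184  # J/mol -> kcal/mol
                dg_kcal_list.append(dg_kcal)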

        # Store results
        dg_average = np.mean(dg_kcal_list)
        dg_std = np.std(dg_kcal_list)
        all_results.append({
            'mode_number': mode_number,
            'average_affinity': dg_average,
            'std_dev': dg_std,
            'num_frames': len(dg_kcal_list)
        })

        print(f"✅ Processed {mode_number}: Avg affinity = {dg_average:.4f} ± {dg_std:.4f} kcal/mol")

        # Save individual results
        df = pd.read_csv(csv_file_path)
        frames = df['unique_id']
        new_df = pd.DataFrame({
            'Frames': frames,
            'Binding Affinity (kcal/mol)': dg_kcal_list
        })
        output_csv = os.path.join(workDir, f'Affinity_AEV-PLIG_PLACER_{mode_number}.csv')
        new_df.to_csv(output_csv, index=False)

        # Plot individual results
        plt.figure(figsize=(10, 5))
        sc = plt.scatter(frames, dg_kcal_list, c=dg_kcal_list, cmap='viridis', s=50)
        cbar = plt.colorbar(sc)
        cbar.set_label('Binding Affinity (kcal/mol)', fontsize=14, fontweight='bold')
        plt.axhline(dg_average, color='red', linestyle='--', label='Average')
        plt.ylim(np.min(dg_kcal_list)-1, np.max(dg_kcal_list)+1)
        plt.xlabel('Frames', fontsize=12, fontweight='bold')
        plt.ylabel('Binding Affinity (kcal/mol)', fontsize=12, fontweight='bold')
        plt.title(f'Binding Affinity for Mode {mode_number}', fontsize=12, fontweight='bold')
        plt.legend()
        plt.grid(True, linestyle='--', alpha=0.5)
        plt.tight_layout()
        plot_path = os.path.join(workDir, f"Affinity_AEV-PLIG_PLACER_{mode_number}.png")
        plt.savefig(plot_path, dpi=600, bbox_inches='tight')
        plt.close()

        # Clean up
        cmd.delete("all")

    # Save summary of all results
    summary_df = pd.DataFrame(all_results)
    summary_csv = os.path.join(workDir, 'AEV-PLIG_Summary_Results.csv')
    summary_df.to_csv(summary_csv, index=False)

    print("\n✅ All processing complete! Summary saved to:", summary_csv)
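
The affinity values reported above come from converting each prediction, treated as a pK, to a free energy through ΔG = RT ln(10^(-pK)) = -RT ln(10) · pK. The sketch below isolates that conversion using the same constants as the cell (R = 8.3145 J/mol/K, T = 297 K, 4184 J per kcal); the function name `pk_to_dg_kcal` is an assumption made for illustration.

```python
import math

R = 8.3145           # Gas constant, J/(mol*K)
J_PER_KCAL = 4184.0  # Joules per kilocalorie

def pk_to_dg_kcal(pk, temperature=297.0):
    """Convert a pK value to a binding free energy in kcal/mol."""
    return -R * temperature * math.log(10) * pk / J_PER_KCAL

# A pK of 6 corresponds to a micromolar binder
dg = pk_to_dg_kcal(6.0)
print(f"{dg:.2f} kcal/mol")
```

At 297 K each pK unit is worth about 1.36 kcal/mol, so a spread of one pK unit across the ensemble appears as a spread of roughly that size in the plots.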