<a href="https://colab.research.google.com/github/phonglam3103/Cheminformatics/blob/main/Standardize_visualize_IUPAC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook reads a CSV file in which the SMILES of the compounds are stored in a column named "SMILES". The following steps are then applied in sequence:
- Sanitize (Kekulize, check valencies, set aromaticity, conjugation and hybridization)
- Normalize functional groups
- Uncharge molecule (not enabled by default)
- Get the parent fragment (not enabled by default)
- Reionize the molecule (not enabled by default)
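
The steps above, including the optional ones, can be sketched as a single function. This is a minimal illustration rather than the notebook's exact code: `standardize_smiles` and its keyword flags are hypothetical names introduced here, while the `rdMolStandardize` calls are the same ones used in the standardization cell further down.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smi, parent=False, uncharge=False, reionize=False):
    """Return a standardized canonical SMILES, or None if `smi` cannot be parsed.

    Hypothetical helper for illustration; the keyword flags are not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smi)  # parses and sanitizes by default
    if mol is None:
        return None

    mol = rdMolStandardize.Normalize(mol)  # normalize functional groups
    if parent:
        mol = rdMolStandardize.FragmentParent(mol)  # keep the parent fragment
    if uncharge:
        mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges where possible
    if reionize:
        mol = rdMolStandardize.Reionize(mol)  # re-apply RDKit's reionization rules
    return Chem.MolToSmiles(mol)


# Sodium acetate, with the optional parent and uncharge steps switched on
print(standardize_smiles("CC(=O)[O-].[Na+]", parent=True, uncharge=True))
```

With all flags left at `False`, the function mirrors the default behaviour described in the list: only parsing, sanitization and normalization are applied.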

In [None]:
# @title Install prerequisites: rdkit, XlsxWriter, STOUT
%pip install rdkit chemical-converters XlsxWriter git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git

In [None]:
# @title Data import
from google.colab import drive, files
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw, PandasTools
from IPython.display import HTML, display
from rdkit.Chem.rdmolfiles import MolFromSmiles
from rdkit.Chem.MolStandardize import rdMolStandardize


# Data input
print("Upload csv file containing the SMILES column")
uploaded = files.upload()
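
Because the later cells read `df["SMILES"]` directly, it may help to confirm that the column exists before standardizing. The sketch below is one possible check, not part of the original notebook: `load_smiles_table` is a hypothetical helper, and dropping rows with an empty SMILES is an assumption about the desired behaviour.

```python
import io

import pandas as pd


def load_smiles_table(source, column="SMILES"):
    """Read a CSV and confirm the expected SMILES column is present.

    Hypothetical helper; `source` may be a path or a file-like object.
    """
    df = pd.read_csv(source)
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found; available columns: {list(df.columns)}")
    # Assumption: rows with a missing SMILES are dropped so later parsing never receives NaN
    return df.dropna(subset=[column]).reset_index(drop=True)


# In-memory CSV used only for demonstration
demo = io.StringIO("Name,SMILES\nethanol,CCO\nmissing,\nbenzene,c1ccccc1\n")
table = load_smiles_table(demo)
print(table["SMILES"].tolist())
```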


In [None]:
# @title Standardize process
df = pd.read_csv(list(uploaded.keys())[0])
pd.set_option('display.max_rows', None)

smis = df["SMILES"]
uncharger = rdMolStandardize.Uncharger()

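ms = []
for smi in smis:
    m = MolFromSmiles(smi)
    if m is None:
        # MolFromSmiles returns None for unparsable input; fail with a clear message
        raise ValueError(f"Could not parse SMILES: {smi!r}")

    # Sanitize: MolFromSmiles already sanitizes by default; this pass re-runs
    # every operation except SANITIZE_CLEANUP and SANITIZE_PROPERTIES
    Chem.SanitizeMol(m, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_CLEANUP ^ Chem.SANITIZE_PROPERTIES)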

    # Normalize functional groups
    cm = rdMolStandardize.Normalize(m)

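    # Get parent fragment and uncharge (disabled by default; uncomment to enable)
    #im = uncharger.uncharge(rdMolStandardize.FragmentParent(cm))

    # Reionize (disabled by default; requires im from the step above)
    #rm = rdMolStandardize.Reionize(im)

    # Append to list (use im or rm instead of cm if the optional steps are enabled)
    ms.append(Chem.MolToSmiles(cm))
df["SMILES_std"] = ms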
PandasTools.AddMoleculeColumnToFrame(df, 'SMILES_std', 'Molecule')

In [None]:
# @title IUPAC naming using STOUT (use at your own risk: STOUT is a neural machine translation model, not a rule-based naming system - https://doi.org/10.1186/s13321-021-00512-4)
from STOUT import translate_forward
iupacs = [translate_forward(SMILES) for SMILES in df["SMILES"]]
df["IUPAC"] = iupacs
PandasTools.SaveXlsxFromFrame(df, 'SMILES_processed.xlsx', molCol='Molecule')


In [None]:
After the whole process completes, the processed data is saved as "SMILES_processed.xlsx".