# Discovering telomerase inhibitors with machine learning: Data collection

Telomerase is the enzyme that maintains telomeres, the protective caps at the ends of chromosomes. It is active in roughly 85 to 90% of cancers but inactive in most normal somatic cells, so inhibiting telomerase is a promising strategy for cancer treatment.

In this notebook, we will build a dataset of biologically active compounds that are known or suspected to inhibit telomerase. The dataset is meant for training a classifier on numerical features extracted from the SMILES structures of these compounds, which can then be used to screen chemical libraries for novel telomerase inhibitors.

**Note:** There is no need to run this notebook. The final dataset is provided as [`data/data.csv`](data/data.csv).

In [1]:
%pip install -qU numpy pandas scikit-learn scipy matplotlib seaborn rdkit pubchempy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import pandas as pd

from pathlib import Path

In [3]:
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [5]:
DATA_DIR = Path("data")

TRAIN_PATH = DATA_DIR / "data.csv"
SCREEN_PATH = DATA_DIR / "screen.csv"

## Training data

### Data collection

The positive samples in the training set were gathered from [scientific literature](#references) and the [Selleck Anti-Aging Compound Library](https://www.selleckchem.com/screening/anti-aging-compound-library.html). We will retrieve their isomeric SMILES strings from sources such as PubChem, and combine them with the [Selleck FDA-approved Drug Library](https://www.selleckchem.com/screening/fda-approved-drug-library.html) to obtain the final dataset.

In [None]:
positive_compounds = """
1. BIBR 1532
1. Costunolide
1. EGCG
1. BMS-582949
1. Hypericin
1. MST-312
1. MST-295
1. MST-199
1. Boldine
1. Telomestatin
1. Sulfoquinovosyl diacylglycerol
1. Curcumin
1. Genistein
1. Pterostilbene
1. Sulforaphane
1. Retinoic acid
1. Tocotrienol
1. Quercetin
1. Berberine
1. Pristimerin
1. Oleanane
1. Gambogic acid
1. Gambogenic acid
1. TELIN
1. Imetelstat
1. Geldanamycin
1. AZT-TP
1. 1α,25(OH)2VD3
1. Ceramide
1. MKT077
1. Tamoxifen
1. BRACO19
1. AS1410
1. RHPS4
1. 5-azacytidine
1. TMPyP4 tosylate
1. Abacavir
1. L2H2-6OTD
1. Telomerase-IN-1
1. Telomerase-IN-2
1. Telomerase-IN-3
1. Telomerase-IN-6
1. 360A iodide
1. BMVC
1. 7-Deaza-2′-deoxyguanosine 5′-triphosphate
1. 2′,3′-Dideoxyguanosine 5′-triphosphate
1. 6-Thio-2′-Deoxyguanosine
1. Triethylene tetramine
1. 10H-indolo[3,2-b]quinoline
1. TMPI""".split("\n1. ")
positive_compounds = [c for c in positive_compounds if c]
print(len(positive_compounds), positive_compounds)

In [None]:
import pubchempy as pcp

positives_data = []
for c in positive_compounds:
  # Query PubChem by compound name; an empty list means no match was found
  results = pcp.get_compounds(c, "name")
  if results:
    result = results[0]  # keep the top hit
    positives_data.append((c, result.cid, result.isomeric_smiles))
  else:
    print(f"Failed to find {c}")

PubChem thankfully contained most of the positive samples. We will add the remaining samples manually, either by searching for the SMILES strings on the web or by deriving them from their IUPAC names.
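
The split between compounds that resolve automatically and those that need manual curation can be sketched as a small helper. This is only an illustration: `resolve_names` and its `lookup` argument are hypothetical names, and `lookup` stands in for whatever query is used (here a plain dictionary rather than a real PubChem call).

```python
from typing import Callable, Iterable, Optional


def resolve_names(
    names: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Split names into those that resolve to a SMILES string and those that do not.

    `lookup` is a hypothetical callable standing in for a database query;
    it returns a SMILES string, or None when nothing is found.
    """
    found: dict[str, str] = {}
    missing: list[str] = []
    for name in names:
        try:
            smiles = lookup(name)
        except Exception as exc:  # e.g. a network error from a real service
            print(f"Lookup failed for {name}: {exc}")
            smiles = None
        if smiles:
            found[name] = smiles
        else:
            missing.append(name)
    return found, missing


# Toy stand-in for the remote database (illustrative values only)
fake_db = {"Ethanol": "CCO", "Methanol": "CO"}
found, missing = resolve_names(["Ethanol", "TELIN", "Methanol"], fake_db.get)
print(found)    # {'Ethanol': 'CCO', 'Methanol': 'CO'}
print(missing)  # ['TELIN']
```

Keeping the `missing` list makes it explicit which names still need a manually sourced SMILES string.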

In [None]:
positives_data.append(("MST-295", None, "OC=1C=C(C(=O)NC2=CC(=CC=C2)NC(C2=CC(=C(C=C2)O)O)=O)C=CC1O"))  # N,Nʹ-bis(3,4-dihydroxybenzoyl)-1,3-phenylenediamine; https://opsin.ch.cam.ac.uk/
positives_data.append(("MST-199", None, "OC=1C=C(C=CC1O)C=1OC2=CC=CC=C2C(C1NC(C1=CC(=C(C=C1)O)O)=O)=O"))  # N-[2-(3,4-dihydroxyphenyl)-4-oxo-4H-chromen-3-yl]-3,4-dihydroxybenzamide; https://opsin.ch.cam.ac.uk/
positives_data.append(("TELIN", None, "CC1=NN(C(=O)C1(Cl)Cl)c2ccc(Cl)cc2Cl"))  # 4,4-dichloro-1-(2,4-dichlorophenyl)-3-methyl-5-pyrazolone; https://cactus.nci.nih.gov/chemical/structure
positives_data.append(("1α,25(OH)2VD3", None, r"[H][C@@]1(CC[C@@]2([H])C(\CCC[C@]12C)=C\C=C3\C[C@@H](O)C[C@H](O)C3=C)[C@H](C)CCCC(C)(C)O"))  # Sigma-Aldrich; raw string because the SMILES contains backslashes
positives_data.append(("C6-Ceramide", None, "C(NC(CCCCC)=O)(C(C=CCCCCCCCCCCCCC)O)CO"))  # https://www.larodan.com/product/c6-ceramide/
positives_data.append(("AS1410", None, "FC1=C(F)C=C(CNC2=C3C=CC(NC(=O)CCN4CCCC4)=CC3=NC5=C2C=CC(NC(=O)CCN6CCCC6)=C5)C=C1"))  # https://drugs.ncats.io/substance/6H8FDD5016
positives_data.append(("7-Deaza-2′-deoxyguanosine 5′-triphosphate", None, "NC1=NC(=O)c2ccn(C3CC(O)C(COP(O)(=O)OP(O)(=O)OP(O)(O)=O)O3)c2N1"))  # Sigma-Aldrich
positives_data.append(("2′,3′-Dideoxyguanosine 5′-triphosphate", None, "[Na+].OP([O-])(=O)OP(O)(=O)OP(O)(=O)OC[C@@H]1CC[C@@H](O1)n2cnc3C(=O)NC(=N)Nc23"))  # Sigma-Aldrich
positives_data.append(("6-Thio-2′-Deoxyguanosine", None, "S=C1C2=C(N([C@H]3C[C@H](O)[C@@H](CO)O3)C=N2)N=C(N)N1"))  # Sigma-Aldrich

In [None]:
from rdkit.Chem import MolFromMolFile, MolToSmiles, CanonSmiles

tmpi = MolFromMolFile(str(DATA_DIR / "CB61365618.mol"))  # https://www.chemicalbook.com/ChemicalProductProperty_EN_CB61365618.htm
positives_data.append(("TMPI", None, CanonSmiles(MolToSmiles(tmpi))))
tmpi

In [None]:
positives = pd.DataFrame(positives_data, columns=["name", "cid", "smiles"])
positives

With the positive samples in hand, we will then gather negative samples from the Selleck FDA-approved Drug Library.

In [None]:
fda = pd.read_csv(DATA_DIR / "fda.csv")
fda

In [None]:
# Remove drugs that are known to target telomerase
negatives = fda[~fda["Information"].str.contains("telomerase", case=False, na=False)]
negatives = negatives[["Name", "SMILES"]].rename(str.lower, axis=1).dropna()
negatives

In [None]:
data = pd.concat([positives.drop("cid", axis=1).assign(label=1), negatives.assign(label=0)])
data

We see that some of the positive compounds also appear in the FDA-approved Drug Library, where they were given the incorrect (negative) label. We will remove these duplicate rows from the set of negatives, since the compounds are actually known to have telomerase-inhibiting activity.

In [None]:
data[data.duplicated("name")]

In [None]:
data = data[~data.duplicated("name")]
data = data.reset_index(drop=True)
data
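
The deduplication above relies on row order: `duplicated("name")` marks every occurrence of a name after the first, so placing the positives first in the concatenation means the positive row is the one that survives. A plain-Python sketch of that first-occurrence-wins logic follows; `keep_first_by_name` is a hypothetical helper and the rows are toy values, not taken from the actual libraries.

```python
def keep_first_by_name(rows):
    """Keep only the first row seen for each name; later repeats are dropped."""
    seen = set()
    kept = []
    for row in rows:
        if row["name"] not in seen:
            seen.add(row["name"])
            kept.append(row)
    return kept


# Toy rows for illustration only
positives = [{"name": "Tamoxifen", "label": 1}]
negatives = [{"name": "Tamoxifen", "label": 0}, {"name": "Aspirin", "label": 0}]

deduped = keep_first_by_name(positives + negatives)
print(deduped)
# [{'name': 'Tamoxifen', 'label': 1}, {'name': 'Aspirin', 'label': 0}]
```

Note that the comparison is an exact string match, so names that differ only in case or spelling would not be treated as duplicates.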

In [None]:
data["label"].value_counts()

### Data preprocessing

Using the RDKit library, we will extract about 210 molecular descriptors from the SMILES strings of the compounds (the exact number depends on the RDKit version). These numerical properties will be used as features for the machine learning model.

In [4]:
from rdkit.Chem import MolFromSmiles
from rdkit.Chem.Descriptors import CalcMolDescriptors

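def extract_descriptors(smiles):
  """Compute all RDKit molecular descriptors for a SMILES string."""
  mol = MolFromSmiles(smiles)
  if mol is None:  # RDKit returns None for SMILES it cannot parse
    raise ValueError(f"Could not parse SMILES: {smiles!r}")

  return CalcMolDescriptors(mol)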

In [None]:
descriptors = data.loc[:, ["smiles"]].apply(lambda x: extract_descriptors(x["smiles"]), axis=1, result_type="expand")

In [None]:
descriptors

In [None]:
final = data.join(descriptors).drop_duplicates(descriptors.columns)
final.value_counts("label")

In [None]:
final.to_csv(TRAIN_PATH, index=False)

## Screening data

Once a model has been trained, it can be used to screen a chemical library for potential telomerase inhibitors. We will build the screening dataset from the [Prestwick and LOPAC libraries](https://zenodo.org/records/7870357/files/list_of_compounds_for_training.csv).

In [10]:
data = pd.read_csv(TRAIN_PATH)

In [26]:
from rdkit.Chem import CanonSmiles

screen_data = pd.read_csv(DATA_DIR / "prestwick_lopac.csv")
screen_data = screen_data.iloc[:, :5]  # Drop precomputed descriptors
screen_data["smiles"] = screen_data["SMILES"].apply(CanonSmiles)
screen_data["name"] = screen_data["Name"]
screen_data = screen_data.drop(["senolytic", "Source", "Library", "SMILES", "Name"], axis=1)
screen_data

index,smiles,name
0,Nc1nc2[nH]nnc2c(=O)[nH]1,Azaguanine-8
1,NC(=O)NC1NC(=O)NC1=O,Allantoin
2,CC(=O)Nc1nnc(S(N)(=O)=O)s1,Acetazolamide
3,CN(C)C(=N)NC(=N)N,Metformin hydrochloride
4,COc1ccc(CC2c3cc(OC)c(OC)cc3CC[N+]2(C)CCC(=O)OC...,Atracurium besylate
...,...,...
2518,COc1cc(/C=C/C(=O)CC(=O)/C=C/c2ccc(O)c(OC)c2)ccc1O,Curcumin
2519,Cc1nc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)cc(N2CCN(C...,Dasatinib
2520,CC1(C)CCC(c2ccc(Cl)cc2)=C(CN2CCN(c3ccc(C(=O)NS...,Navitoclax
2521,Cc1c(-c2ccc(N3CCc4cccc(C(=O)Nc5nc6ccccc6s5)c4C...,A1331852


We will take care to remove any compounds that are already in the training set, matching on either the SMILES string or the case-insensitive name, and recompute the molecular descriptors for the remaining compounds:

In [27]:
screen_data = screen_data[~screen_data["smiles"].isin(data["smiles"]) & ~screen_data["name"].str.lower().isin(data["name"].str.lower())]
screen_data

index,smiles,name
0,Nc1nc2[nH]nnc2c(=O)[nH]1,Azaguanine-8
3,CN(C)C(=N)NC(=N)N,Metformin hydrochloride
5,CC(=O)OCC(=O)[C@@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=...,Isoflupredone acetate
6,CCCc1ncc(C[n+]2ccccc2C)c(N)n1,Amprolium hydrochloride
7,NS(=O)(=O)c1cc2c(cc1Cl)NCNS2(=O)=O,Hydrochlorothiazide
...,...,...
2516,CN(C)CC[C@H](CSc1ccccc1)Nc1ccc(S(=O)(=O)NC(=O)...,ABT-737
2517,O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)ccc12,Fisetin
2520,CC1(C)CCC(c2ccc(Cl)cc2)=C(CN2CCN(c3ccc(C(=O)NS...,Navitoclax
2521,Cc1c(-c2ccc(N3CCc4cccc(C(=O)Nc5nc6ccccc6s5)c4C...,A1331852
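
The filter above drops a screening compound when either its SMILES string or its lowercased name already occurs in the training set. Using both criteria matters because the same molecule can be written with different SMILES strings or listed under different names. A minimal plain-Python sketch of the same rule on toy data (`exclude_known` is a hypothetical helper name):

```python
def exclude_known(screen_rows, train_rows):
    """Drop screening rows whose SMILES or case-insensitive name is in the training set."""
    train_smiles = {row["smiles"] for row in train_rows}
    train_names = {row["name"].lower() for row in train_rows}
    return [
        row
        for row in screen_rows
        if row["smiles"] not in train_smiles
        and row["name"].lower() not in train_names
    ]


# Toy data for illustration only
train = [{"name": "Ethanol", "smiles": "CCO"}]
screen = [
    {"name": "ethanol", "smiles": "OCC"},        # same name, different SMILES spelling
    {"name": "Ethyl alcohol", "smiles": "CCO"},  # different name, same SMILES
    {"name": "Methanol", "smiles": "CO"},        # genuinely new compound
]

remaining = exclude_known(screen, train)
print([row["name"] for row in remaining])  # ['Methanol']
```

A string comparison of SMILES only catches a match when both sides were written the same way, which is one reason the name check is a useful second line of defence.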


In [15]:
screen_descriptors = screen_data.loc[:, ["smiles"]].apply(lambda x: extract_descriptors(x["smiles"]), axis=1, result_type="expand")
screen_descriptors

index,MaxAbsEStateIndex,MaxEStateIndex,MinAbsEStateIndex,MinEStateIndex,qed,SPS,MolWt,HeavyAtomMolWt,ExactMolWt,NumValenceElectrons,...,fr_sulfide,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea
0,10.962593,10.962593,0.049583,-0.387731,0.430316,10.545455,152.117,148.085,152.044659,56.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,7.067778,7.067778,0.113426,-0.214167,0.248785,8.222222,129.167,118.079,129.101445,52.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,16.884033,16.884033,0.085680,-1.997379,0.679307,47.066667,420.477,391.245,420.194817,164.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,5.985117,5.985117,0.587225,0.587225,0.830783,10.555556,243.334,224.182,243.160423,94.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,11.633654,11.633654,0.014270,-4.069104,0.654458,18.235294,297.745,289.681,296.964475,94.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2516,13.350173,13.350173,0.140221,-4.426475,0.057731,14.035714,813.446,768.086,812.258138,292.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2517,12.082041,12.082041,0.087182,-0.653875,0.510622,10.857143,286.239,276.159,286.047738,106.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2520,14.177964,14.177964,0.001862,-6.052872,0.104649,18.338462,974.634,919.194,973.295510,350.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2521,13.522989,13.522989,0.025570,-1.061671,0.185260,24.187500,658.828,620.524,658.272610,244.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [28]:
screen_data.join(screen_descriptors).to_csv(SCREEN_PATH, index=False)

## References

- <https://www.selleckchem.com/screening/anti-aging-compound-library.html>

  - BIBR 1532 (incidentally, discovered through random screening)
  - Costunolide
  - EGCG
  - BMS-582949
  - Hypericin
  - MST-312

- <https://www.selleckchem.com/screening/fda-approved-drug-library.html>

  - Boldine
  - Costunolide
  - EGCG
  - Hypericin

- <https://pubs.acs.org/doi/full/10.1021/ja005780q>

  - Telomestatin

- <https://www.mdpi.com/1422-0067/19/2/478>

  - EGCG
  - 1α,25(OH)2VD3
  - Sulfoquinovosyl diacylglycerol
  - Curcumin
  - Genistein
  - Pterostilbene
  - Resveratrol?
  - Ceramide
  - Sulforaphane
  - Retinoic acid
  - Tocotrienol
  - Geldanamycin
  - 3′-azido-3′-deoxythymidine triphosphate (AZT-TP)

- <https://www.mdpi.com/1422-0067/19/1/13>

  - Quercetin
  - EGCG
  - Boldine
  - Berberine
  - Pristimerin
  - Oleanane
  - Gambogic acid
  - Gambogenic acid

- <https://www.embopress.org/doi/full/10.1093/emboj/20.24.6958>

  - BIBR 1532
  - BIBR 1591

- <https://www.sciencedirect.com/science/article/abs/pii/S0006291X04013671>

  - TELIN

- <https://www.nejm.org/doi/full/10.1056/nejmoa1503479>

  - Imetelstat (GRN163L)

- <https://aacrjournals.org/mct/article/1/9/657/233889/Telomere-Shortening-and-Growth-Inhibition-of-Human>

  - MST-312
  - MST-295
  - MST-199

- <https://pubmed.ncbi.nlm.nih.gov/10463599/>

  - MKT077
  - FJ5002 (can't find SMILES)

- <https://www.researchgate.net/publication/278717635_Targeting_Telomerase_Therapeutic_Options_for_Cancer_Treatment>

  - BRACO19
  - AS1410
  - RHPS4
  - Tamoxifen

- <https://aacrjournals.org/clincancerres/article-pdf/6/7/2868/2076507/df070002868.pdf>

  - 5-azacytidine

- <https://www.medchemexpress.com/Targets/Telomerase.html?effectName=Inhibitor>

  - TMPyP4 tosylate
  - Abacavir
  - L2H2-6OTD
  - Telomerase-IN-1
  - Telomerase-IN-2
  - Telomerase-IN-3
  - Telomerase-IN-6 (lead-optimized BIBR 1591)
  - 360A iodide
  - BMVC
  - 7-Deaza-2′-deoxyguanosine 5′-triphosphate

- <https://www.sigmaaldrich.com/AM/en/search/telomerase?focus=products&page=1&perpage=30&sort=relevance&term=telomerase&type=product>

  - Telomerase Inhibitor IX (MST-312)
  - 2′,3′-Dideoxyguanosine 5′-triphosphate
  - 6-Thio-2′-Deoxyguanosine

- <https://www.sciencedirect.com/science/article/abs/pii/S0753332207002600>

  - Triethylene tetramine

- <https://www.sciencedirect.com/science/article/abs/pii/S0960894X00003784>

  - 10H-indolo[3,2-b]quinoline

- <https://www.sciencedirect.com/science/article/pii/S1044579X1500022X>

  - TMPI