GitHub

Identifying The Kind Behind SMILES – Anatomical Therapeutic Chemical Classification using Structure-Only Representations

ATC (Anatomical Therapeutic Chemical) classification for compounds/drugs plays an important role in drug development and basic research. However, previous methods depend on interactions extracted from STITCH dataset which may make it depends on lab experiments.

We present a pilot study to explore the possibility of conducting the ATC prediction solely based on the molecular structures. The motivation is to eliminate the reliance on the costly lab experiments so that the characteristics of a drug can be pre-assessed for better decision-making and effort-saving before the actual development.

Our contributions are as follows:

We construct a new benchmark consisting of 4545 compounds which is with larger scale than the one used in previous study.
A light-weight prediction model is proposed. The model is with better explainability in the sense that it is consists of a straightforward tokenization that extracts and embeds statistically and physicochemically meaningful tokens, and a deep network backed by a set of pyramid kernels to capture multi-resolution chemical structural characteristics.

For details, please refer to our paper [1], which will be available in briefings in Bioinformatics soon. Besides, an online ATC-codes predictor is available now. http://aimars.net:8090/ATC_SMILES/

Quick tour

This is a pytorch implementation of the ATC-CNN proposed in our paper [1]. If you don't care details, just start training in ./run.py.

Step 1: Prepare dataset

To run the code, please make sure you have prepared canonical SMILES data ( Computed by OEChem 2.3.0 ) following the same data structure in ./data/ATC_SMILES.csv.

Step 2: Tokenization

We have provided tokenized ATC-SMILES dataset, if you want to use other canonical SMILES, please use tokenization API in ./utils.py.

Step 3: Embedding

We adopt the Word2Vec and Skip-gram model to train token embeddings, you can use our pre-trained embeddings in ./data/embedding_SMILES_Vocab.npz.

Tokenization

SMILES Tokenization:

from Utils import get_splited_smis
#Use our ATC-SMILES dataset, the splited file has already existed,You can set the parameters as follows:
get_splited_smis(filePath='./data/splited_smis.txt',smiList=None)
#Use other canonical SMILES sequence, Please prepare the SMILES list as follows:
SMILES_List=['CCCC(=O)',...,'CCCC(=O)C']
get_splited_smis(filePath='',smiList=SMILES_List)

Embedding

We adopt the Word2Vec and Skip-gram model to train token embeddings:

./models/TextCNN.py
self.embedding_pretrained=torch.tensor(np.load(dataset+'/data/embedding_SMILES_Vocab.npz')["embeddings"].astype('float32')) if embedding != 'random' else None

Start training

To train a model with our ATC-SMILES dataset:

./run.py
python run.py

Datasets

ATC-SMILES dataset proposed in our paper [1] with 4545 drugs/compounds.

If you plan to use other smiles files, please refer to the data structure in ATC_SMILES.csv.

./data/ATC_SMILES.csv

	KEGG_Drug_ID	CanSmiles	Lable
0	C00018	CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O	[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1	C00220	CN(C)C1=CC2=C(C=C1)N= C3C=CC(=N+C)C=C3S2.[Cl-]	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
...	...	...	...
4544	D11817	CNC(=O)C1=NN=C(C= C1NC2=CC=CC(=C2OC)C3=NN (C=N3)C)NC(=O)C4CC4	[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Chen-2012 dataset proposed in paper[2] with 3883 drugs/compounds.

Chen-2012 dataset is available in Chen-2012 .

ATC-SMILES-Aligned dataset proposed in our paper [1] with 3785 drugs/compounds.

ATC-SMILES is with a larger scale than that of Chen-2012 [2], but there are some mis-aligned items. We remove these items to generate a subset consisting of 3785 drugs/compounds, and we call this set ATC-SMILES-Aligned. ATC-SMILES-Aligned can be downloaded in ATC-SMILES-Aligned.csv.

Citation

[1] Yi Cao, Zhen-Qun Yang, Xu-Lu Zhang, Wenqi Fan, Yaowei Wang, Jiajun Shen, Dong-Qing Wei, Qing Li, and Xiao-Yong Wei. Identifying The Kind Behind SMILES – Anatomical Therapeutic Chemical Classification using Structure-Only Representations, Briefings in Bioinformatics, 2022, DOI:10.1093/bib/bbac346.

@ARTICLE{  
author={Yi Cao, Zhen-Qun Yang, Xu-Lu Zhang, Wenqi Fan, Yaowei Wang, Jiajun Shen, Dong-Qing Wei, Qing Li, and Xiao-Yong Wei.},  
journal={Briefings in Bioinformatics},   
title={Identifying The Kind Behind SMILES – Anatomical Therapeutic Chemical Classification using Structure-Only Representations},   
year={2022},   
doi={DOI:10.1093/bib/bbac346}}

Reference

[2] Chen L, Zeng WM, Cai YD, Feng KY, Chou KC. Predicting Anatomical Therapeutic Chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities. PLoS One. 2012;7(4):e35254.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
data		data
log		log
models		models
saveModelFile		saveModelFile
README.md		README.md
Res.csv		Res.csv
run.py		run.py
tokens.png		tokens.png
train_eval.py		train_eval.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Identifying The Kind Behind SMILES – Anatomical Therapeutic Chemical Classification using Structure-Only Representations

Quick tour

Tokenization

Embedding

Start training

Datasets

Citation

Reference

About

Releases

Packages

Languages

lookwei/ATC_CNN

Folders and files

Latest commit

History

Repository files navigation

Identifying The Kind Behind SMILES – Anatomical Therapeutic Chemical Classification using Structure-Only Representations

Quick tour

Tokenization

Embedding

Start training

Datasets

Citation

Reference

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages