Over the last decade, computer-assisted chemical synthesis has re-emerged as a heavily researched subject in Chemoinformatics. Even though the idea of utilizing computers for chemical synthesis has existed for as long as computers themselves, the high level of reliability and innovation expected of such approaches has been repeatedly proven difficult to achieve. In recent years, however, utilizing machine learning has proven particularly promising, with novel approaches emerging frequently.
Unfortunately, the available open-source computer-assisted chemical synthesis data are lacking in quality and quantity, stored in various formats, or published behind a paywall, all of which can represent significant barriers to entry, especially for novice researchers. The main objective of the NeoChemSynthWave: Data project is to lower such barriers to entry by systematically curating and facilitating access to relevant data sources.
A minimal virtual environment can be created using Conda as follows:
conda env create -f environment.yaml
conda activate ncsw-data
The package can be locally installed using pip as follows:
pip install --no-build-isolation -e .
The following types of computer-assisted chemical synthesis data are currently supported:
The following chemical compound data sources are currently supported:
The ChEMBL [1] database is a manually curated collection of bioactive chemical compounds with drug-like properties that brings together chemical, bioactivity, and genomic data to aid the translation of genomic information into effective new drugs. The following versions are currently supported:
Version | DOI |
---|---|
v_release_{release_number} | https://doi.org/10.6019/CHEMBL.database.{release_number} |
e.g., v_release_34 | e.g., https://doi.org/10.6019/CHEMBL.database.34 |
The ZINC20 [2] database is a collection of commercially available chemical compounds for virtual screening. The following versions are currently supported:
Version | DOI |
---|---|
v_{building_block_subset_name} | https://doi.org/10.1021/acs.jcim.0c00675 |
e.g., v_bb_10 | e.g., https://doi.org/10.1021/acs.jcim.0c00675 |
v_{catalog_name} | https://doi.org/10.1021/acs.jcim.0c00675 |
e.g., v_wuxi-v | e.g., https://doi.org/10.1021/acs.jcim.0c00675 |
The following chemical reaction data sources are currently supported:
- United States Patent and Trademark Office (USPTO)
- Open Reaction Database (ORD)
- Chemical Reaction Database (CRD)
- Rhea
- RDB7
- Miscellaneous
The United States Patent and Trademark Office (USPTO) [3] dataset is an open-source collection of chemical reactions constructed by text-mining patent grant and application documents. The following versions are currently supported:
Version | DOI |
---|---|
v_1976_to_2013_by_20121009_lowe_d_m | https://doi.org/10.6084/m9.figshare.12084729.v1 |
v_50k_by_20161122_schneider_n_et_al | https://doi.org/10.1021/acs.jcim.6b00564 |
v_15k_by_20170418_coley_c_w_et_al | https://doi.org/10.1021/acscentsci.7b00064 |
v_1976_to_2016_by_20121009_lowe_d_m | https://doi.org/10.6084/m9.figshare.5104873.v1 |
v_50k_by_20171116_coley_c_w_et_al | https://doi.org/10.1021/acscentsci.7b00355 |
v_480k_or_mit_by_20171229_jin_w_et_al | https://doi.org/10.48550/arXiv.1709.04555 |
v_480k_or_mit_by_20180622_schwaller_p_et_al | https://doi.org/10.1039/C8SC02339E |
v_stereo_by_20180622_schwaller_p_et_al | https://doi.org/10.1039/C8SC02339E |
v_1k_tpl_by_20210705_schwaller_p_et_al | https://doi.org/10.1038/s42256-020-00284-w |
The Open Reaction Database (ORD) [4] is an open-source collection of chemical reactions designed to support machine learning and related efforts in chemical reaction prediction, chemical synthesis planning, and experiment design. The following versions are currently supported:
Version | DOI |
---|---|
v_release_v_0_1_0 | https://doi.org/10.1021/jacs.1c09820 |
v_main | https://doi.org/10.1021/jacs.1c09820 |
The Chemical Reaction Database (CRD) [5] is a collection of chemical reactions from scientific and patent literature. The following versions are currently supported:
Version | DOI |
---|---|
v_reaction_smiles_2001_to_2021 | https://doi.org/10.6084/m9.figshare.20279733.v1 |
v_reaction_smiles_2001_to_2023 | https://doi.org/10.6084/m9.figshare.22491730.v1 |
v_reaction_smiles_2023 | https://doi.org/10.6084/m9.figshare.24921555.v1 |
The Rhea [6] database is an open-source expert-curated knowledgebase of chemical and transport reactions of biological interest. The following versions are currently supported:
Version | DOI |
---|---|
v_release_{release_number} | https://doi.org/10.1021/acs.jcim.0c00675 |
e.g., v_release_133 | e.g., https://doi.org/10.1021/acs.jcim.0c00675 |
The RDB7 [7][8] dataset is an open-source collection of diverse chemical reactions whose transition states contain up to seven heavy atoms. The following versions are currently supported:
Version | DOI |
---|---|
v_20200508_grambow_c_et_al_v_1_0_0 | https://doi.org/10.5281/zenodo.3581267 |
v_20200508_grambow_c_et_al_v_1_0_1 | https://doi.org/10.5281/zenodo.3715478 |
v_20200508_grambow_c_et_al_add_on_v_1_0_0 | https://doi.org/10.5281/zenodo.3731554 |
v_20220718_spiekermann_k_et_al_v_1_0_0 | https://doi.org/10.5281/zenodo.5652098 |
v_20220718_spiekermann_k_et_al_v_1_0_1 | https://doi.org/10.5281/zenodo.6618262 |
The following miscellaneous chemical reaction data sources are currently supported:
Data Source | DOI |
---|---|
v_20131008_kraut_h_et_al | https://doi.org/10.1021/ci400442f |
v_20161014_wei_j_n_et_al | https://doi.org/10.1021/acscentsci.6b00219 |
The following chemical reaction rule data sources are currently supported:
The RetroRules [9] database is an open-source collection of chemical reaction rules for metabolic pathway discovery and metabolic engineering. The following versions are currently supported:
Version | DOI |
---|---|
v_release_rr01_rp2_hs | https://doi.org/10.5281/zenodo.5827427 |
v_release_rr02_rp2_hs | https://doi.org/10.5281/zenodo.5828017 |
v_release_rr02_rp3_hs | https://doi.org/10.5281/zenodo.5827977 |
v_release_rr02_rp3_nohs | https://doi.org/10.5281/zenodo.5827969 |
The following miscellaneous chemical reaction rule data sources are currently supported:
Data Source | DOI |
---|---|
v_20180421_avramova_s_et_al | https://doi.org/10.5281/zenodo.1209313 |
The following updates are currently planned for version v.2024.07:
- Complete the development of the /ncsw_data/storage sub-package.
- Create the /documentation directory.
- Create the /notebooks directory.
- Create the /scripts directory.
- Create the /tests directory.
- Publish the package on PyPI.
The contents of this repository are published under the MIT license. Please refer to individual references for more details regarding the license information of external resources utilized within this repository.
If you are interested in contributing to this repository by reporting bugs, suggesting improvements, or submitting feedback, feel free to use GitHub Issues.
[1] Zdrazil, B., Felix, E., Hunter, F., Manners, E.J., Blackshaw, J., Corbett, S., de Veij, M., Ioannidis, H., Lopez, D.M., Mosquera, J.F., Magarinos, M.P., Bosc, N., Arcila, R., Kizilören, T., Gaulton, A., Bento, A.P., Adasme, M.F., Monecke, P., Landrum, G.A., and Leach, A.R. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods, Nucleic Acids Research, 52, D1, 2024, D1180-D1192.
[2] Irwin, J.J., Tang, K.G., Young, J., Dandarchuluun, C., Wong, B.R., Khurelbaatar, M., Moroz, Y.S., Mayfield, J., and Sayle, R.A. ZINC20 - A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model., 2020, 60, 12, 6065-6073.
[3] Lowe, D.M. Extraction of Chemical Structures and Reactions from the Literature. Ph.D. Thesis, University of Cambridge, Department of Chemistry, Pembroke College, 2012.
[4] Kearnes, S.M., Maser, M.R., Wleklinski, M., Kast, A., Doyle, A.G., Dreher, S.D., Hawkins, J.M., Jensen, K.F., and Coley, C.W. The Open Reaction Database. J. Am. Chem. Soc., 2021, 143, 45, 18820–18826.
[5] The Chemical Reaction Database (CRD): https://kmt.vander-lingen.nl. Accessed on: June 1st, 2024.
[6] Bansal, P., Morgat, A., Axelsen, K.B., Muthukrishnan, V., Coudert, E., Aimo, L., Hyka-Nouspikel, N., Gasteiger, E., Kerhornou, A., Neto, T.B., Pozzato, M., Blatter, M., Ignatchenko, A., Redaschi, N., and Bridge, A. Rhea, the Reaction Knowledgebase in 2022, Nucleic Acids Research, 50, D1, 2022, D693–D700.
[7] Grambow, C.A., Pattanaik, L., and Green, W.H. Reactants, Products, and Transition States of Elementary Chemical Reactions based on Quantum Chemistry. Sci. Data, 7, 137, 2020.
[8] Spiekermann, K., Pattanaik, L., and Green, W.H. Fast Predictions of Reaction Barrier Heights: Toward Coupled-cluster Accuracy. Sci. Data, 9, 417, 2022.
[9] Duigou, T., du Lac, M., Carbonell, P., and Faulon, J. RetroRules: A Database of Reaction Rules for Engineering Biology. Nucleic Acids Research, 47, D1, 2018, D1229–D1235.