Add pointers to data processing script & ChEMBL dataset update #135

AlanHassen · 2022-01-13T09:15:11Z

The overall question:
Is it possible to describe the preprocessing and data origin for different datasets?

Explanation:
I am currently looking into using ChEMBL via TDC. For reproducibility, it is essential to know which version of the dataset is provided here and how it was preprocessed (cleaned, filtered, and so on). This matters especially because ChEMBL has recurring releases. In the source code (TDC/tdc/utils/load.py), the file is downloaded from "https://dataverse.harvard.edu/api/access/datafile/" without any explanation.

A possible solution:
Add a "data origin/preprocessing" section to the documentation.

kexinhuang12345 · 2022-01-15T07:08:44Z

Hi Alan, thank you for raising this important point. We have a repo that tracks the preprocessing scripts for the majority of the datasets: https://github.com/kexinhuang12345/data_process. However, it is not cleaned up yet. Good data provenance is important, and we will work towards it by linking to these processing scripts on the website.

As for ChEMBL, unfortunately, the processing script seems to be missing. To address that, we plan to include the most up-to-date ChEMBL version in the coming release and document the ChEMBL version on the website. If you have already used the current data, for now you could refer to it as the "TDC version" to make things clear. Hope this helps!

kexinhuang12345 · 2022-01-24T16:02:14Z

Hi, ChEMBL V29 is now available in release 0.3.5. You can load it via:

from tdc.generation import MolGen

# Load the ChEMBL V29 dataset by its versioned name
data = MolGen(name='ChEMBL_V29')
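
For the reproducibility concern raised at the top of this thread, one option is to record which file was actually used next to the results. The following is only a minimal sketch using the Python standard library, not part of TDC: the helper names `file_sha256` and `provenance_record` are hypothetical, and the demo file is a stand-in for wherever the downloaded dataset ends up on disk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path, dataset_name, package_version):
    """Build a small dict describing which data file was used.

    dataset_name and package_version are supplied by the caller;
    nothing here queries TDC itself.
    """
    path = Path(path)
    return {
        "dataset": dataset_name,
        "package_version": package_version,
        "file_name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": file_sha256(path),
    }


if __name__ == "__main__":
    # Stand-in file so the sketch runs without downloading anything.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "chembl_demo.csv"
        demo.write_text("smiles\nCCO\n")
        record = provenance_record(demo, "ChEMBL_V29", "0.3.5")
        print(json.dumps(record, indent=2))
```

Storing such a record with experiment outputs makes it possible to tell later whether two runs used byte-identical data, even if the hosted file changes between releases.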

kexinhuang12345 added the enhancement New feature or request label Jan 15, 2022

kexinhuang12345 changed the title ~~Is it possible to describe the preprocessing and data origin for different datasets?~~ Add pointers to data processing script & ChEMBL dataset update Jan 15, 2022

kexinhuang12345 self-assigned this Jan 23, 2022
