The overall question:
Is it possible to describe the preprocessing and data origin for different datasets?
Explanation:
I am currently looking into using ChEMBL via TDC. However, it would be essential to know which version of the dataset is provided here and how it is preprocessed (cleaned...) for reproducibility purposes. This is especially important because ChEMBL has recurring releases. In the source code, the file is downloaded from "https://dataverse.harvard.edu/api/access/datafile/" without any explanation (TDC/tdc/utils/load.py).
A Solution:
Add a section on data origin and preprocessing to the documentation.
Hi Alan, thank you for raising this important point. We have a repo that tracks the preprocessing scripts for the majority of the datasets: https://github.com/kexinhuang12345/data_process. However, it has not been cleaned up yet. Good data provenance is important, and we will work towards it by linking to these processing scripts on the website.
As for ChEMBL, unfortunately, the processing script seems to be missing. To address that, we plan to ship the most up-to-date ChEMBL version in the coming release and document the ChEMBL version on the website. If you have already used the current data, for now you could refer to it by the TDC version to make things clear. Hope this helps!
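Until the ChEMBL version is documented, one way to pin down exactly which data was used is to record the installed TDC version together with a checksum of the downloaded file. The following is only a minimal sketch of that idea, not part of TDC: the helper names `sha256_of` and `provenance_record` are hypothetical, and it assumes the package is installed under the distribution name `PyTDC`.

```python
import hashlib
from importlib import metadata
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(data_file, package="PyTDC"):
    """Collect the package version and a file checksum for reporting.

    `package` is the assumed distribution name; the version is None
    when that distribution is not installed.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    data_file = Path(data_file)
    return {
        "package": package,
        "package_version": version,
        "file": data_file.name,
        "sha256": sha256_of(data_file),
    }
```

Calling `provenance_record` on the locally cached dataset file (wherever TDC stored it on your machine) gives a small dictionary that can be saved next to the experiment results, so the exact file can be identified later even if the hosted data changes.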
kexinhuang12345 changed the title from "Is it possible to describe the preprocessing and data origin for different datasets?" to "Add pointers to data processing script & ChEMBL dataset update" on Jan 15, 2022