Thanks a lot for providing FS-mol. Very valuable to the community!
I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says that these are IC50/EC50 values:
ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).
However, the code here suggests that percentage may also have been accepted as a unit when the dataset was created:
`if df.iloc[0]["standard_units"] == "%":`
When checking some assays in the train task list (anecdotally), I found that there are indeed assays that use % as the unit, e.g.:
- https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/assay_chembl_id%3ACHEMBL3591894.
- and the corresponding distribution of the numeric_label (and bool_label):

I am not sure to what extent it makes sense to apply a log transformation to percentage values in the range [0, 100]. However, this is done in the FS-Mol dataset, and the community is slowly starting to do the same (I guess because only IC50 / EC50 values are assumed?), see https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf):
but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well
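To illustrate the concern about log-transforming percentages, here is a minimal sketch with made-up numbers (none of them are taken from FS-Mol or ChEMBL). It assumes a plain `log10` transform, which may differ from what the dataset or downstream code actually applies. Concentrations spanning several orders of magnitude become evenly spaced, whereas percentages in [0, 100] produce `-inf` at 0 and get compressed near the top of the range.

```python
import numpy as np

# Hypothetical values for illustration only, not taken from FS-Mol.
ic50_nm = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # concentrations in nM
percent = np.array([0.0, 1.0, 10.0, 50.0, 90.0, 100.0])  # e.g. % inhibition

# Concentrations span orders of magnitude, so log10 spreads them evenly.
log_ic50 = np.log10(ic50_nm)

# Percentages are bounded: log10(0) is -inf, and 90% vs 100% become
# almost indistinguishable after the transform.
with np.errstate(divide="ignore"):
    log_percent = np.log10(percent)

print("log10(IC50):", log_ic50)
print("log10(%):   ", log_percent)
```

With these numbers, the difference between 90% and 100% shrinks to less than 0.05 log units, while the difference between 1% and 10% is a full log unit.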
The community is slowly starting to use FS-Mol in a regression context as well. It would be great to get clarification on this IC50 / EC50 versus percentage issue, or perhaps to have those assays explicitly labeled.
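As a rough idea of what such labeling could look like, here is a sketch that flags assays by unit. It assumes an activity table with the ChEMBL-style columns `assay_chembl_id` and `standard_units`. The function name `flag_percentage_assays`, the assay IDs, and the rows are all hypothetical, and this is not how the FS-Mol pipeline currently works.

```python
import pandas as pd


def flag_percentage_assays(activities: pd.DataFrame) -> pd.DataFrame:
    """Return one row per assay with flags for the use of '%' as the unit.

    Hypothetical helper: assumes the columns 'assay_chembl_id' and
    'standard_units' are present in the activity table.
    """
    is_percent = activities["standard_units"].eq("%")
    summary = is_percent.groupby(activities["assay_chembl_id"]).agg(["any", "all"])
    return summary.rename(columns={"any": "has_percent", "all": "only_percent"})


# Made-up rows for illustration; the assay IDs are placeholders.
activities = pd.DataFrame(
    {
        "assay_chembl_id": ["ASSAY_A", "ASSAY_A", "ASSAY_B", "ASSAY_B", "ASSAY_C", "ASSAY_C"],
        "standard_units": ["nM", "nM", "%", "%", "%", "nM"],
    }
)

flags = flag_percentage_assays(activities)
print(flags)
```

Assays where `only_percent` is true could then be excluded from regression benchmarks or handled without a log transformation, and assays with mixed units could be inspected by hand.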
Thanks a lot for looking into that. Greatly appreciated!