
Clarification on how to use FS-MOL in a regression context #57

@juliabuhmann


Thanks a lot for providing FS-mol. Very valuable to the community!

I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says that those are IC50/EC50 values:

ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).

However, the code here suggests that percentage might also have been used as a unit during the creation of the dataset:

if df.iloc[0]["standard_units"] == "%":
which is totally fine I guess when only using it to extract activity / non-activity :)
When checking some assays in the train task list (anecdotally), there are indeed assays that use % as the unit, e.g.:

I am not sure to what extent it makes sense to apply a log transformation to percentage values in the range [0, 100]. However, this is done in the FS-Mol dataset, and the community is slowly starting to do the same (I guess because only IC50 / EC50 values are assumed?), see https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf)

but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well
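To make the concern concrete, here is a minimal sketch of why the same log transform means different things for the two kinds of values. This is not the FS-Mol or ADKF-IFT code. The function names and the example numbers are made up for illustration, and a plain log10 is assumed as the transform.

```python
import math


def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_nm * 1e-9)


def naive_log_of_percent(percent: float) -> float:
    """Apply a plain log10 to a percentage value (hypothetical transform)."""
    return math.log10(percent)


# Made-up example values, not taken from the dataset.
ic50_values_nm = [1.0, 100.0, 10000.0]
percent_values = [1.0, 50.0, 100.0]

# Concentrations span orders of magnitude, so the log scale is natural:
# each factor of 100 in IC50 shifts pIC50 by 2.
pic50 = [pic50_from_nanomolar(v) for v in ic50_values_nm]

# Percentages are bounded, so the log compresses the upper half of the range
# (50% and 100% end up about 0.3 apart) and stretches the lower end.
# Note that math.log10(0.0) raises ValueError, so a 0% reading cannot be
# transformed this way at all.
log_percent = [naive_log_of_percent(v) for v in percent_values]

print(pic50)
print(log_percent)
```

A regression target built this way mixes two scales that are not comparable, which is why knowing the unit per assay matters.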

The community is slowly starting to use FS-Mol in a regression context as well. It would be great to get clarification on this IC50 / EC50 versus percentage issue, or perhaps to have those assays explicitly labeled.
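As a rough illustration of what such explicit labeling could look like, the sketch below tags each assay by the units of its measurements. It uses plain Python dictionaries rather than the actual FS-Mol data structures. Only the field name standard_units comes from the snippet quoted above. The assay_id field, the label names, and all values are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical measurement records. "standard_units" follows the quoted
# snippet; "assay_id" and all values are made up for illustration.
measurements = [
    {"assay_id": "A", "standard_units": "nM", "value": 12.0},
    {"assay_id": "A", "standard_units": "nM", "value": 850.0},
    {"assay_id": "B", "standard_units": "%", "value": 73.0},
    {"assay_id": "C", "standard_units": "nM", "value": 40.0},
    {"assay_id": "C", "standard_units": "%", "value": 15.0},
]


def label_assays_by_unit(rows):
    """Label each assay as 'percentage', 'concentration', or 'mixed'."""
    units_per_assay = defaultdict(set)
    for row in rows:
        units_per_assay[row["assay_id"]].add(row["standard_units"])

    labels = {}
    for assay_id, units in units_per_assay.items():
        if units == {"%"}:
            labels[assay_id] = "percentage"
        elif "%" in units:
            labels[assay_id] = "mixed"
        else:
            labels[assay_id] = "concentration"
    return labels


assay_labels = label_assays_by_unit(measurements)

# Keep only assays whose values are all concentrations for regression.
regression_assays = sorted(
    assay_id for assay_id, label in assay_labels.items() if label == "concentration"
)

print(assay_labels)
print(regression_assays)
```

With a label like this stored per task, regression users could filter out percentage-based assays before applying any log transform.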
Thanks a lot for looking into that. Greatly appreciated!
