Clarification on how to use FS-MOL in a regression context

Thanks a lot for providing FS-Mol. It is very valuable to the community!

I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says that they are IC50/EC50 values:
>  ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).

However, the code here suggests that measurements with a percentage unit may also have been included when the dataset was created: https://github.com/microsoft/FS-Mol/blob/fa336aed734132ae7899edec1992228ba59d5aca/fs_mol/preprocessing/utils/cleaning_utils.py#L144. I guess that is totally fine when the values are only used to derive active / inactive labels :)
When checking a few assays in the train task list (anecdotally), I did indeed find assays that use % as the unit, e.g.:
- https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/assay_chembl_id%3ACHEMBL3591894.
- and the corresponding distribution of the numeric_label (and bool_label):
![image](https://user-images.githubusercontent.com/2922439/235627448-b3f45df7-0719-4d6d-80a4-8e7eaf761834.png)


I am not sure to what extent it makes sense to apply a log transformation to percentage values in the range [0, 100]. However, this is done in the FS-Mol dataset, and the community is slowly starting to do the same (I guess because only IC50 / EC50 values are assumed?) --> https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf)
>  but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well
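
To make the concern concrete, here is a minimal sketch with made-up numbers. It is not the FS-Mol or ADKF-IFT preprocessing code; the helper names and the example values are my own assumptions. It only illustrates that a log transform behaves differently on concentrations (IC50 in nM, which span orders of magnitude) and on percentages in [0, 100]:

```python
import math

def neg_log10_molar(value_nm: float) -> float:
    """Hypothetical helper: convert a concentration in nM to -log10(molar), i.e. a pIC50-style value."""
    return -math.log10(value_nm * 1e-9)

def can_log_transform(value: float) -> bool:
    """Hypothetical helper: a logarithm is only defined for strictly positive values."""
    return value > 0

# Made-up example values, not taken from the dataset.
ic50_nm = [1.0, 10.0, 100.0, 1000.0, 10000.0]   # concentrations spanning several decades
percent = [1.0, 25.0, 50.0, 75.0, 100.0]        # percentage readouts in (0, 100]

# On concentrations, each tenfold step becomes one evenly spaced unit.
pic50 = [neg_log10_molar(v) for v in ic50_nm]

# On percentages, the scale is heavily distorted.
log_percent = [math.log10(v) for v in percent]
lower_span = log_percent[1] - log_percent[0]        # 1% -> 25%
upper_half_span = log_percent[-1] - log_percent[2]  # 50% -> 100%

print("pIC50-style values:", pic50)
print("log10(percent):    ", log_percent)
print("span 1% -> 25%:    ", lower_span)
print("span 50% -> 100%:  ", upper_half_span)
print("0% transformable?  ", can_log_transform(0.0))
```

In this sketch the whole upper half of the percentage range is squeezed into a much smaller interval than the step from 1% to 25%, and a readout of 0% cannot be transformed at all. That is why knowing the unit per assay matters before treating the log-transformed value as a regression target.
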

The community is slowly starting to use FS-Mol in a regression context as well. It would be great to get clarification on this IC50 / EC50 versus percentage issue, or perhaps to have those assays explicitly labeled?
Thanks a lot for looking into that. Greatly appreciated!
