Should regression targets be log scaled? #55

cyrusmaher · 2021-02-23T03:52:26Z

I've found for a number of the ADME regression targets (volume of distribution, half life, etc.) that there is much clearer signal when I regress against the log of the target. The target distributions are fairly heavy-tailed in these cases, so without this transform, a handful of points can drive the loss.

Does it make sense to check whether regression targets should be transformed, and if so, to update the dataset generation accordingly?
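To illustrate the concern about heavy tails, here is a minimal sketch on synthetic log-normal data (not a TDC dataset). The helper `top_share` and the choice of `sigma=2.0` are assumptions made for this illustration only. It measures how much of the total squared deviation from the mean comes from the top 1% of points, before and after taking the log.

```python
import numpy as np

def top_share(y, frac=0.01):
    """Share of the total squared deviation from the mean that is
    contributed by the `frac` of points with the largest deviations."""
    sq = (y - y.mean()) ** 2
    k = max(1, int(len(y) * frac))
    return np.sort(sq)[-k:].sum() / sq.sum()

# Synthetic heavy-tailed target (illustrative only, not real ADME data)
rng = np.random.default_rng(0)
y_raw = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)
y_log = np.log(y_raw)

share_raw = top_share(y_raw)
share_log = top_share(y_log)
print(f"top 1% share of squared deviation, raw: {share_raw:.2f}")
print(f"top 1% share of squared deviation, log: {share_log:.2f}")
```

On the raw scale the top 1% of points accounts for a much larger share of the squared error than on the log scale, which is the sense in which a handful of points can drive the loss.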

kexinhuang12345 · 2021-02-23T16:30:03Z

Hey Cyrus, that's a good point! That is also why we provided a log transformation function. You can try:

from tdc.single_pred import ADME
data = ADME(name = 'VDss_Lombardo')
data.convert_to_log()
data.label_distribution()

We want to keep the raw data as well, but we will add a note on the website about this important information.

I think our current ADMET benchmark does not use it, so it may be a good idea to use a log scale for VDss and half-life. I will try that out to see whether the performance makes sense. For which other datasets have you seen this issue? Thanks!

cyrusmaher · 2021-02-23T18:57:38Z

Hi Kexin, thanks for your quick reply! I went ahead and ran this for all the TDC endpoints that I'm working with. I only see benefits for VD, clearance, and half life. Note that for the log transform, I use a robust version so that it works for non-positive numbers...

import numpy as np
def robust_log(x):
    # Sign-preserving log, defined for zero and negative inputs too
    return np.sign(x) * np.log(np.abs(x) + 1)
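Since a model trained on transformed targets predicts on the transformed scale, an inverse is usually needed as well. The following is a sketch, not code from the thread: `robust_log_inverse` is a hypothetical helper name, and `np.log1p` / `np.expm1` are swapped in for `np.log(abs(x) + 1)` because they compute the same quantity with better accuracy near zero.

```python
import numpy as np

def robust_log(x):
    """Sign-preserving log transform: sign(x) * log(1 + |x|)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def robust_log_inverse(y):
    """Inverse of robust_log (hypothetical helper): sign(y) * (exp(|y|) - 1)."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y))

values = np.array([-100.0, -0.5, 0.0, 0.5, 100.0])
transformed = robust_log(values)
recovered = robust_log_inverse(transformed)
print(transformed)
print(recovered)
```

The transform maps zero to zero, is odd-symmetric, and round-trips through its inverse, so predictions can be mapped back to the original units for reporting.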

kexinhuang12345 · 2021-02-24T03:46:53Z

Thanks so much for making this table! It looks like you are using the old version of TDC. We have found that eDrug3D is very noisy, so we replaced it with higher-quality datasets. You can check them out on our website.

Would it be fast to generate these numbers for the new datasets? If not, let me know; I can also run some code to test the difference.

Also, regarding the log transformation, I made a PR to incorporate your point: #56. One difference is that instead of using +1 for numerical stability, I use 1e-10, since some raw values are quite small and +1 would make a noticeable difference.
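The effect of the offset on small values can be sketched as follows. This assumes the offset is simply added inside the log for positive inputs; the exact form used in the PR is not shown in this thread, and the sample values are made up for illustration.

```python
import numpy as np

# Illustrative raw values spanning several orders of magnitude
y = np.array([0.001, 0.01, 0.1, 1.0, 10.0])

log_plus_one = np.log(y + 1.0)   # offset of +1
log_tiny = np.log(y + 1e-10)     # offset of 1e-10

# Distance between 0.001 and 0.1 (a 100x difference) on each scale
spread_plus_one = log_plus_one[2] - log_plus_one[0]
spread_tiny = log_tiny[2] - log_tiny[0]

print("log(y + 1):    ", np.round(log_plus_one, 4))
print("log(y + 1e-10):", np.round(log_tiny, 4))
print("spread for 0.001 vs 0.1:", spread_plus_one, spread_tiny)
```

With +1, values well below 1 are squeezed into a narrow band near zero, so a 100-fold difference nearly disappears. With 1e-10 the same pair stays about log(100) apart. The trade-off is that an exact zero maps to log(1e-10), roughly -23, which can itself become an outlier.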

cyrusmaher · 2021-03-03T15:01:17Z

Hi Kexin, good catch adding a smaller number! I got called off to do covid variant work, but I should be able to get back to this next week. Once it's ready, I'll add the updated table here.

kexinhuang12345 · 2021-03-03T18:21:03Z

Sounds good, thanks a lot!!!

cyrusmaher · 2021-03-18T04:14:48Z

Here you go! I updated the robust log computation and added the latest datasets:

kexinhuang12345 · 2021-03-18T22:17:24Z

Thanks so much! This looks good. It seems the currently supported datasets do not require log transformation. Closing for now! Feel free to reopen if you have any questions!

kexinhuang12345 closed this as completed Mar 18, 2021
