Should regression targets be log scaled? #55

Closed
cyrusmaher opened this issue Feb 23, 2021 · 7 comments

@cyrusmaher

I've found for a number of the ADME regression targets (volume of distribution, half life, etc.) that there is much clearer signal when I regress against the log of the target. The target distributions are fairly heavy-tailed in these cases, so without this transform, a handful of points can drive the loss.

Does it make sense to check whether regression targets should be transformed, and if so, to update the dataset generation accordingly?
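
One quick way to check whether a target is heavy-tailed is to compare its skewness before and after the log transform. Below is a minimal sketch using only NumPy. The `skewness` helper and the synthetic log-normal sample are illustrative assumptions, not part of TDC or any real endpoint.

```python
import numpy as np

def skewness(y):
    """Sample skewness (third standardized moment).

    Large positive values suggest a heavy right tail.
    """
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Synthetic stand-in for a heavy-tailed, strictly positive target.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.5, size=5000)

raw_skew = skewness(y)
log_skew = skewness(np.log(y))
print(f"raw skewness: {raw_skew:.2f}, log skewness: {log_skew:.2f}")
```

If the skewness drops to near zero after the transform, a handful of extreme points is less likely to dominate a squared-error loss.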

@kexinhuang12345
Collaborator

Hey Cyrus, that's a good point! That is also why we provided a log transformation function. You can try:

from tdc.single_pred import ADME

data = ADME(name='VDss_Lombardo')
data.convert_to_log()      # log-transform the labels
data.label_distribution()  # inspect the label distribution after the transform

We want to keep the raw data as well, but we will add a note on the website about this important information.

I don't think our current ADMET benchmark uses it, so it may be a good idea to use a log scale for VDss and half life. I will try that out to see if the performance makes sense. In which other datasets have you seen this issue? Thanks!

@cyrusmaher
Author

cyrusmaher commented Feb 23, 2021

Hi Kexin, thanks for your quick reply! I went ahead and ran this for all the TDC endpoints that I'm working with. I only see benefits for VD, clearance, and half life. Note that for the log transform, I use a robust version so it also works for non-positive numbers:

import numpy as np

def robust_log(x):
    return np.sign(x) * np.log(np.abs(x) + 1)  # signed log(1 + |x|)
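
A short usage sketch of the signed transform above: it maps zero to zero, is symmetric about the origin, and can be inverted to bring predictions back to the original scale. The `robust_exp` inverse is a hypothetical helper added here for illustration; it does not appear in the thread or in TDC.

```python
import numpy as np

def robust_log(x):
    # Signed log(1 + |x|), defined for zero and negative inputs.
    return np.sign(x) * np.log(np.abs(x) + 1)

def robust_exp(z):
    # Hypothetical inverse of robust_log: sign(z) * (exp(|z|) - 1).
    return np.sign(z) * np.expm1(np.abs(z))

x = np.array([-100.0, -0.5, 0.0, 0.5, 100.0])
z = robust_log(x)        # transformed targets
x_back = robust_exp(z)   # round trip to the original scale
print(z)
print(x_back)
```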

image

@kexinhuang12345
Collaborator

Thanks so much for making this table! It looks like you are using an old version of TDC. We found that eDrug3D is very noisy, so we replaced it with higher-quality datasets. You can check them out on our website.

Would it be quick to generate these numbers for the new datasets? If not, let me know; I can also run some code to test the difference.

Also, regarding the log transformation, I made a PR to incorporate your point: #56. One difference is that instead of using +1 for numeric stability, I use 1e-10, since some raw values are pretty small and +1 would make a difference.
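
To illustrate why the offset matters for small values, here is a minimal sketch comparing `log(x + 1)` with `log(x + 1e-10)`. The sample values are made up for illustration, and the sketch assumes the offset is simply added inside the log; it is not the code from the PR.

```python
import numpy as np

# Hypothetical raw targets spanning several orders of magnitude.
x = np.array([0.001, 0.01, 0.1, 1.0, 10.0])

log_plus_one = np.log(x + 1)      # offset of 1 compresses the small values
log_plus_eps = np.log(x + 1e-10)  # tiny offset keeps them well separated

# Range covered by the transformed values under each offset.
spread_plus_one = log_plus_one.max() - log_plus_one.min()
spread_plus_eps = log_plus_eps.max() - log_plus_eps.min()
print(f"spread with +1: {spread_plus_one:.2f}, with +1e-10: {spread_plus_eps:.2f}")
```

With +1, every value below 1 is squeezed close to zero, so differences between small targets are mostly lost. The trade-off with a tiny offset is that an exact zero maps to a large negative number (log(1e-10) is about -23).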

@cyrusmaher
Author

Hi Kexin, good catch adding a smaller number! I got called off to do covid variant work, but I should be able to get back to this next week. Once it's ready, I'll add the updated table here.

@kexinhuang12345
Collaborator

Sounds good, thanks a lot!!!

@cyrusmaher
Author

Here you go! I updated the robust log computation and added the latest datasets:
image

@kexinhuang12345
Collaborator

Thanks so much! This looks good. It seems the currently supported datasets do not require log transformation. Closing for now! Feel free to reopen if you have any questions!
