Should regression targets be log scaled? #55
Comments
Hey Cyrus, that's a good point! That is also why we provided a log transformation function. You can try:
We want to keep the raw data as well, but we will add a note on the website about this important point. I don't think our current ADMET benchmark uses the log transformation, so it may be a good idea to use a log scale for VDss and half life. I will try that out to see whether the performance makes sense. In which other datasets have you seen this issue? Thanks!
Hi Kexin, thanks for your quick reply! I went ahead and ran this for all the TDC endpoints that I'm working with. I only see benefits for VD, clearance, and half life. Note that for the log transform, I use a robust version so it works for non-positive numbers:
def robust_log(x):
    return np.sign(x) * np.log(abs(x) + 1)
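For reference, a minimal self-contained sketch of this signed log transform together with an inverse, assuming NumPy. The inverse function `robust_log_inverse` is a hypothetical helper added here for illustration and is not part of the thread or of TDC; it would be needed to map model predictions back to the original scale.

```python
import numpy as np

def robust_log(x):
    """Signed log transform: sign(x) * log(|x| + 1). Defined for any real x."""
    x = np.asarray(x, dtype=float)
    # log1p(a) computes log(1 + a) accurately for small a
    return np.sign(x) * np.log1p(np.abs(x))

def robust_log_inverse(y):
    """Hypothetical inverse of robust_log: sign(y) * (exp(|y|) - 1)."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y))

# Illustrative values only (not taken from any TDC dataset)
values = np.array([-5.0, 0.0, 0.5, 100.0])
transformed = robust_log(values)
recovered = robust_log_inverse(transformed)
```

Because the transform is odd and strictly increasing, it preserves the ordering and sign of the targets, and zero maps to zero.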
Thanks so much for making this table! It looks like you are using the old version of TDC. We have found that eDrug3D is very noisy, so we replaced it with higher-quality datasets. You can check them out on our website. Would it be fast to generate these numbers for the new datasets? If not, let me know, and I can run some code to test the difference. Also, regarding the log transformation, I made a PR to incorporate your point: #56. One difference is that instead of using +1 for numeric stability, I use 1e-10, since some raw values are pretty small and +1 would make a difference.
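To illustrate why the choice of offset matters for small values, here is a rough sketch comparing +1 against 1e-10. It assumes the transform has the form log(x + eps) for non-negative x; the exact form used in the PR is not shown in this thread, and `log_with_offset` is a hypothetical name.

```python
import numpy as np

def log_with_offset(x, eps):
    """Assumed form of the transform: log(x + eps), for non-negative x."""
    x = np.asarray(x, dtype=float)
    return np.log(x + eps)

# Two illustrative small targets that differ by a factor of 10
small = np.array([0.01, 0.1])

# With eps = 1, log(1 + x) is close to x for small x, so the
# tenfold ratio is barely reflected on the transformed scale.
gap_plus_one = np.diff(log_with_offset(small, 1.0))[0]

# With eps = 1e-10, the gap is close to log(10), as for a plain log.
gap_tiny = np.diff(log_with_offset(small, 1e-10))[0]
```

One trade-off with a very small offset: an exact zero maps to log(1e-10), roughly -23, which can itself become an outlier on the transformed scale.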
Hi Kexin, good catch adding a smaller number! I got called off to do covid variant work, but I should be able to get back to this next week. Once it's ready, I'll add the updated table here.
Sounds good, thanks a lot!!!
Thanks so much! This looks good. It seems the currently supported datasets do not require a log transformation. Closing for now! Feel free to reopen if you have any questions!
I've found for a number of the ADME regression targets (volume of distribution, half life, etc.) that there is much clearer signal when I regress against the log of the target. The target distributions are fairly heavy-tailed in these cases, so without this transform, a handful of points can drive the loss.
Does it make sense to check whether regression targets should be transformed, and if so, to update the dataset generation accordingly?
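The point about a handful of points driving the loss can be illustrated with a small sketch. It uses synthetic log-normal targets, not TDC data, and a constant mean predictor as a stand-in for a model; `top_share` is a hypothetical helper that measures what fraction of the total squared error comes from the worst 1% of points.

```python
import numpy as np

def top_share(y, frac=0.01):
    """Fraction of total squared error (vs. the mean) from the worst `frac` of points."""
    sq_err = (y - y.mean()) ** 2
    k = max(1, int(frac * len(y)))
    return np.sort(sq_err)[-k:].sum() / sq_err.sum()

# Synthetic heavy-tailed targets (an assumption, chosen only for illustration)
rng = np.random.default_rng(0)
y_raw = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)
y_log = np.log(y_raw)

share_raw = top_share(y_raw)  # share of loss from the worst 1% on the raw scale
share_log = top_share(y_log)  # same quantity after the log transform

print(f"raw scale: {share_raw:.2%} of squared error from the worst 1% of points")
print(f"log scale: {share_log:.2%} of squared error from the worst 1% of points")
```

A similar check on a real regression target would be one way to decide whether a transform is worth applying before training.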