Same drug-target pair has different affinities in Davis #98

Closed
luoyunan opened this issue Sep 7, 2021 · 6 comments

@luoyunan

luoyunan commented Sep 7, 2021

Describe the bug
The Davis dataset is assumed to contain a unique affinity value for a drug-target pair. However, in TDC, there are duplicated drug-target pairs with different affinity values.

To Reproduce

from tdc.multi_pred import DTI
data = DTI('DAVIS', path='./data/TDC')
df = data.get_data()
df = df.drop(columns=['Drug', 'Target'])
df = df[(df['Drug_ID'] == 25243800) & (df['Target_ID'] == 'RET(V804M)')]
print(df)

Expected behavior
The output is given below: four different Y values are recorded for drug 25243800 and target RET(V804M), whereas a single value is expected for each drug-target pair.

        Drug_ID   Target_ID      Y
18196  25243800  RET(V804M)    4.8
18197  25243800  RET(V804M)    4.0
18198  25243800  RET(V804M)  350.0
18199  25243800  RET(V804M)  340.0
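
To see how widespread the problem is, one could scan the whole table instead of a single pair. The following is a minimal sketch with pandas, assuming only the `Drug_ID`, `Target_ID`, and `Y` columns shown above. The helper name `find_conflicting_rows` is hypothetical, and the demo frame reuses the four rows from the output plus one made-up placeholder row.

```python
import pandas as pd


def find_conflicting_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose (Drug_ID, Target_ID) pair has more than one distinct Y."""
    # For each row, count the distinct labels recorded for its pair.
    n_labels = df.groupby(["Drug_ID", "Target_ID"])["Y"].transform("nunique")
    return df[n_labels > 1]


# Four rows copied from the output above, plus one placeholder row
# (Drug_ID 0 / TARGET_X is invented purely for illustration).
demo = pd.DataFrame(
    {
        "Drug_ID": [25243800, 25243800, 25243800, 25243800, 0],
        "Target_ID": ["RET(V804M)"] * 4 + ["TARGET_X"],
        "Y": [4.8, 4.0, 350.0, 340.0, 1.0],
    }
)
conflicts = find_conflicting_rows(demo)
print(conflicts)
```

On the real dataset the same helper could be called as `find_conflicting_rows(data.get_data())`.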

Environment:

  • TDC version: 0.3.0
  • davis.tab version on dataverse: 2021-01-09 (UNF:6:x6TTv0Um70rEZT/eL8eCtA==)

Additional context
Compared with the raw data of the Davis et al. paper, it looks like the four affinity values shown above should be assigned to targets RET, RET(M918T), RET(V804L), and RET(V804M), respectively. It seems all four target IDs were overwritten by RET(V804M).

@kexinhuang12345
Collaborator

Thanks for pointing out the bug. Great catch. I think the issue is that RET(M918T), RET(V804L), and RET(V804M) are three variants of the RET target, so they all share the same target sequence while their target IDs should differ. The target sequence-SMILES pairs themselves are still correct, but the target IDs are wrong. We will correct the target IDs of these targets in the next release.

@kexinhuang12345 kexinhuang12345 added the bug Something isn't working label Sep 9, 2021
@kexinhuang12345 kexinhuang12345 self-assigned this Sep 9, 2021
@luoyunan
Author

Thanks! But if we use unique IDs (e.g., RET(M918T), RET(V804L), etc.) and the same sequence for those mutants, I think there would still be ambiguity for the ML model? In other words, the same inputs (X) are mapped to different affinity values (y) in the data.

I think there are two potential ways to address the issue:

  1. Use different IDs and their corresponding sequence. For example, for RET(M918T), we change the 918th AA from M to T.
  2. For a protein with multiple mutants, keep only the strongest binding affinity value. This is what the KIBA dataset did when integrating the Davis dataset.
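
Option 1 could be sketched roughly as follows. This is only an illustration, assuming the mutation is a simple substitution written as wild-type residue, 1-based position, mutant residue (as in `M918T`). The function name `apply_point_mutation` is hypothetical, and the toy sequence is not the real RET sequence.

```python
import re


def apply_point_mutation(sequence: str, mutation: str) -> str:
    """Apply a substitution such as 'M918T' (1-based position) to a protein sequence."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        # Anything that is not a plain substitution cannot be handled here.
        raise ValueError(f"not a simple substitution: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    index = position - 1
    if sequence[index] != wild_type:
        # Guard against applying the mutation to the wrong sequence or numbering.
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# Toy sequence for illustration only.
mutated = apply_point_mutation("MKTAYIAK", "K2R")
print(mutated)
```

The wild-type check matters because a numbering mismatch between the target name and the stored sequence would otherwise silently produce a wrong mutant.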

@kexinhuang12345
Collaborator

Thank you for the suggestion! I think both solutions make lots of sense. I will discuss this with the team and arrive at a final solution. Will keep you posted here!

@kexinhuang12345
Collaborator

kexinhuang12345 commented Oct 9, 2021

Hi, we decided to follow option 2, mainly because option 1 involves several gene names with no clear gene sequence modification.

@kexinhuang12345
Collaborator

Reopening: Haoran points out that there are also issues with BindingDB and KIBA. We will discuss whether to (1) keep only the highest binding affinity for each pair, or (2) keep all of them and provide a function for various removal schemes, e.g. remove highest, retain mean, etc. It may also be useful to know the variance of the experimental results, to reduce the effect of outliers.
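
As a rough illustration of the variance idea, per-pair spread statistics could be computed like this. It is a sketch with pandas, assuming the `Drug_ID`, `Target_ID`, and `Y` columns from the issue. The helper name `pair_spread` is hypothetical, and the demo values are the four rows reported at the top of this issue.

```python
import pandas as pd


def pair_spread(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise the labels of every drug-target pair that has repeated measurements."""
    stats = (
        df.groupby(["Drug_ID", "Target_ID"])["Y"]
        .agg(["count", "mean", "std", "min", "max"])
        .reset_index()
    )
    # Pairs measured once have no spread to report.
    return stats[stats["count"] > 1].reset_index(drop=True)


demo = pd.DataFrame(
    {
        "Drug_ID": [25243800] * 4,
        "Target_ID": ["RET(V804M)"] * 4,
        "Y": [4.8, 4.0, 350.0, 340.0],
    }
)
spread = pair_spread(demo)
print(spread)
```

A large `std` relative to `mean`, or a wide gap between `min` and `max`, would flag pairs worth inspecting before choosing a removal scheme.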

@kexinhuang12345
Collaborator

An update: for DAVIS/KIBA, we have updated the datasets to keep the max affinity for duplicated DTI pairs. For BindingDB, we provide a function that lets users decide how to deal with them. You can now use

from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd')
data.harmonize_affinities(mode = 'max_affinity')

The currently supported modes are max_affinity and mean.
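
For readers who want to see what such a step does conceptually, here is a plain pandas sketch of the two modes. It is not TDC's implementation. The function name `harmonize_pairs` and the `lower_is_stronger` flag are hypothetical, and the sketch assumes that `Y` is a Kd-like quantity where a smaller value means stronger binding, so "max affinity" is taken as the minimum `Y`.

```python
import pandas as pd


def harmonize_pairs(
    df: pd.DataFrame, mode: str = "max_affinity", lower_is_stronger: bool = True
) -> pd.DataFrame:
    """Collapse duplicated (Drug_ID, Target_ID) pairs to a single Y value."""
    if mode == "max_affinity":
        # ASSUMPTION: for Kd-like labels the strongest binder has the smallest value.
        reducer = "min" if lower_is_stronger else "max"
    elif mode == "mean":
        reducer = "mean"
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return df.groupby(["Drug_ID", "Target_ID"])["Y"].agg(reducer).reset_index()


# Values taken from the duplicated pair reported at the top of this issue.
demo = pd.DataFrame(
    {
        "Drug_ID": [25243800] * 4,
        "Target_ID": ["RET(V804M)"] * 4,
        "Y": [4.8, 4.0, 350.0, 340.0],
    }
)
strongest = harmonize_pairs(demo, mode="max_affinity")
averaged = harmonize_pairs(demo, mode="mean")
print(strongest)
print(averaged)
```

Note that this sketch keeps only the ID and label columns. If the labels were on a scale where larger means stronger (for example a pKd-style transform), `lower_is_stronger=False` would be the appropriate setting.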
