Source code for DIPS split #18

Open
anton-bushuiev opened this issue Jan 11, 2023 · 6 comments

Comments

@anton-bushuiev

Hi 👋! The paper mentions: "For DIPS, the split is based on protein family to separate similar proteins". Is there source code for this split? I could only find a random split in paritition_dips.py.

@AxelGiottonini

Hey!

I don't remember finding any code for the split, but you can certainly create a simple script that clusters your proteins using foldseek or something similar, together with dgl, networkx, or any other graph library you want. The only thing you then need to output is the list of files, in the same format as in the original split definitions.

Sincerely meow!
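
As a starting point for such a script, here is a minimal sketch of reading a clustering result into Python. It assumes a two-column, tab-separated table with the cluster representative first and the member second, which is the layout that `foldseek easy-cluster` writes to its `*_cluster.tsv` file as far as I know; please check against your Foldseek version. The function names and chain IDs are made up for illustration.

```python
from collections import defaultdict


def read_clusters(lines):
    """Map each member ID to its cluster representative.

    Assumption: every non-empty line is 'representative<TAB>member'.
    """
    member_to_rep = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        member_to_rep[member] = rep
    return member_to_rep


def group_by_cluster(member_to_rep):
    """Invert the mapping: representative -> list of members."""
    clusters = defaultdict(list)
    for member, rep in member_to_rep.items():
        clusters[rep].append(member)
    return dict(clusters)


# Hypothetical chain IDs; in practice, pass the lines of the cluster TSV file.
example = ["1abc_A\t1abc_A", "1abc_A\t2xyz_B", "3def_C\t3def_C"]
member_to_rep = read_clusters(example)
clusters = group_by_cluster(member_to_rep)
```

The resulting `member_to_rep` mapping is what a graph library (or a plain union-find) would need in order to link proteins that fall into the same cluster.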

@anton-bushuiev
Author

Hi, @AxelGiottonini!

Thank you very much for your response. Foldseek looks perfect, I did not know about it. What exactly do you mean by using a graph library? Do you mean clustering PPIs using graph metrics based on their EquiDock graph representations? Also, I am still curious how exactly PPIs were split based on the folds of the individual interacting partners. If PPI1 has partners with folds A and B, and PPI2 has partners with folds C and D, are they considered separate if {A, B} != {C, D}, or only under the stricter condition that {A, B} and {C, D} are disjoint 🤔? This may be important from the perspective of data leakage.
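
To make the two candidate criteria concrete, here is a small sketch in plain Python. The fold labels and function names are purely illustrative and not taken from the EquiDock code.

```python
def same_fold_pair(ppi_a, ppi_b):
    """Lenient criterion: PPIs count as similar only if their fold sets are equal."""
    return set(ppi_a) == set(ppi_b)


def share_any_fold(ppi_a, ppi_b):
    """Strict criterion: PPIs count as similar if they share at least one fold."""
    return not set(ppi_a).isdisjoint(ppi_b)


ppi1 = ("A", "B")
ppi2 = ("B", "C")

# Lenient: {A, B} != {B, C}, so the two PPIs could land in different splits.
lenient_allows_separation = not same_fold_pair(ppi1, ppi2)

# Strict: they share fold B, so they would have to stay in the same split.
strict_allows_separation = not share_any_fold(ppi1, ppi2)
```

Here the lenient criterion would allow fold B to appear on both sides of the split, which is exactly the leakage scenario in question.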

@AxelGiottonini

What I did in a previous project was to cluster the proteins using foldseek (all vs all) and to create a graph with all the proteins as vertices, putting edges between paired proteins (receptor - ligand) and between proteins in the same cluster. Then I used the biggest clusters to create the training set and the smallest ones for validation and testing (90-5-5).

Another option could be to characterize the binding pocket and split the data according to this characterization, but I lack the knowledge to do that kind of thing.
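
A rough sketch of this graph-based split, using only the standard library (a union-find instead of networkx). It assumes that "clusters" here means the connected components of the combined graph, and the greedy largest-first assignment is just one way to approximate the target fractions. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components as sets, largest first (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(groups.values(), key=len, reverse=True)


def split_components(components, train_frac=0.9, val_frac=0.05):
    """Assign whole components to train, then validation, then test."""
    total = sum(len(c) for c in components)
    train, val, test = set(), set(), set()
    for comp in components:  # largest components are visited first
        if len(train) < train_frac * total:
            train |= comp
        elif len(val) < val_frac * total:
            val |= comp
        else:
            test |= comp
    return train, val, test


# Toy data: r* are receptors, l* are ligands.
proteins = ["r1", "l1", "r2", "l2", "r3", "l3", "r4", "l4"]
pair_edges = [("r1", "l1"), ("r2", "l2"), ("r3", "l3"), ("r4", "l4")]
cluster_edges = [("r1", "r2"), ("l2", "l3")]  # same structural cluster

components = connected_components(proteins, pair_edges + cluster_edges)
# Fractions are exaggerated because the toy data has only two components.
train, val, test = split_components(components, train_frac=0.5, val_frac=0.25)
```

Because whole components are assigned to a single split, no receptor-ligand pair and no structural cluster is divided between training and evaluation data. With real data, many small components are needed to get close to 90-5-5.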

@anton-bushuiev
Author

Thank you for sharing!

Yes, I am also considering creating a split based on interface similarity using a tool like this.

@AxelGiottonini

You're welcome! I had not looked for such a tool, but it seems promising!

Also, when I was working with EquiDock, I got poor accuracy when considering only the ligand RMSD (as the receptor RMSD is always 0). I'll share my code and results in the next few days, but could you consider sharing your results if something similar occurred?

@anton-bushuiev
Author

Hi! I do not use EquiDock; I was mainly interested in the data split. I am working on a related problem of predicting the change in binding affinity upon mutation (based on the SKEMPI2 data). It is about learning from already bound structures, so it's a bit different.
