ilrmsdmatrix module #685
…dock3 into ilrmsd_clustering
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #685 +/- ##
==========================================
+ Coverage 70.25% 71.22% +0.97%
==========================================
Files 78 80 +2
Lines 6967 7261 +294
==========================================
+ Hits 4895 5172 +277
- Misses 2072 2089 +17
One comment - further I leave it to the experts
from haddock.modules.analysis.ilrmsdmatrix import DEFAULT_CONFIG as DEFAULT_ILRMSD_CONFIG
from haddock.modules.analysis.ilrmsdmatrix import HaddockModule as IlrmsdmatrixModule
DATA_DIR = Path(Path(__file__).parent.parent / "tests" / "golden_data")
imo it's best if you don't share input between the unit and the integration tests, since this adds a cross-test dependency and can cause indirect side effects
this only imports a text file that is not used by the ilrmsdmatrix unit test. I can add a small ensemble of conformations, but that means adding more data to the repository. And more and more data will have to be added for the next integration tests.
If you think that having complete independence between the two folders is crucial, I'll do it, but it does not sound super necessary to me
It is important, yes
@VGPReys @rvhonorato I should have implemented your suggestions in the code, thanks for the review!
it doesn't make much sense to let the users modify this parameter. 10k is already a lot; if there are more input models you should never use RMSD-based clustering (at least in docking-related contexts)
Any idea of the timing for clustering 10K models with RMSD?
clustering is never a problem, the matrix calculation is. With the native Python implementation in HADDOCK3, the ilrmsdmatrix in the glycan example took ~10 minutes for 1k models on 10 CPU cores. Since it scales quadratically, we should probably limit the number of models to 4-5k instead of 10k
I would make that a parameter - up to the user to decide how much time they want to spend on it.
If you select less than the number of available models, will that be done automatically on the top ranked ones?
Or would you need a seletop step before that, e.g. to reduce from 20k to 5k before clustering?
OK about making it a parameter, but the max value should not exceed 20k imo, so as to avoid memory problems and super long executions
Ok - the max param can be 10K (expert users can modify this)
And this means a seletop step would be needed first to cluster less than the total number of models
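To make the "cluster only the top-ranked models" idea concrete, here is a minimal sketch of capping the input at a maximum number of models. It is not the seletop module's code: the `Model` class, the `select_top` function, and the assumption that a lower score means a better rank are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical stand-in for a docking model (not a HADDOCK3 class)."""
    name: str
    score: float


def select_top(models: list[Model], max_models: int) -> list[Model]:
    """Return at most `max_models` models, best score first.

    Assumption: a lower score means a better-ranked model.
    """
    if max_models < 1:
        raise ValueError("max_models must be at least 1")
    # sorted() returns a new list, so the caller's list is left untouched.
    return sorted(models, key=lambda m: m.score)[:max_models]


models = [Model("a", -10.0), Model("b", -50.0), Model("c", -30.0)]
top = select_top(models, max_models=2)
print([m.name for m in top])
```

With a cap like this applied before the matrix step, the number of pairwise distances is bounded by the cap rather than by the size of the docking run.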
Shouldn't we simply improve the code to make this faster? Adding this as a parameter and all these contour conditions just to avoid programming?
not sure code improvements can make much of a difference here: for sure the computation takes longer than it should, but the matrix calculation scales quadratically with the number of models. Even if we made the code faster by a factor of 10, the execution would still take a lot of time for 10k input models, because 50 million distances have to be computed. As for the code improvements, it's a big choice, as it means adding extra dependencies and spending a lot of time on it. And there aren't so many cases in which we really need that efficiency. I am not saying I am against that, just that it must be discussed with the other developers
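The scaling argument can be checked with a few lines of arithmetic. The sketch below assumes that runtime is proportional to the number of unique model pairs, n(n-1)/2, and extrapolates from the single data point quoted in this thread (~10 minutes for 1k models on 10 CPU cores). The function names are illustrative and the extrapolated timings are rough estimates, not measurements.

```python
def n_pairs(n_models: int) -> int:
    """Number of unique pairwise distances for n_models models."""
    return n_models * (n_models - 1) // 2


# Single data point quoted in the thread: ~10 minutes for 1k models on 10 cores.
BASELINE_MODELS = 1_000
BASELINE_MINUTES = 10.0


def estimated_minutes(n_models: int, speedup: float = 1.0) -> float:
    """Extrapolate runtime, assuming cost is proportional to the pair count.

    `speedup` models a hypothetical faster implementation (e.g. 10.0).
    """
    ratio = n_pairs(n_models) / n_pairs(BASELINE_MODELS)
    return BASELINE_MINUTES * ratio / speedup


for n in (1_000, 5_000, 10_000):
    print(
        f"{n:>6} models: {n_pairs(n):>11,} pairs, "
        f"~{estimated_minutes(n) / 60:.1f} h as is, "
        f"~{estimated_minutes(n, speedup=10.0) / 60:.1f} h with a 10x speedup"
    )
```

Under these assumptions, 10k models give just under 50 million pairs and roughly 100 times the 1k runtime, so even a tenfold speedup would leave the job well over an hour on the same hardware.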
You are about to submit a new Pull Request. Before continuing make sure you read the contributing guidelines and that you comply with the following criteria:
- tox tests pass. Run tox command inside the repository folder
- -test.cfg examples execute without errors. Inside examples/ run python run_tests.py -b
Closes #684