Implement cluster-based output in caprieval
#353
Conversation
1. What is the difference between `caprieval_rank` and `cluster_rank`? To me these should be the same, i.e. only the rank based on the HADDOCK score is relevant here. The `cluster_id` is, I assume, simply the cluster number returned by `cluster_fcc`.
2. Also, what are the RMSD, DockQ and other metrics reported? The average, or the best within the number of selected models within one cluster?
3. And maybe also good to add the standard deviations to the average values.
4. Also, the single-structure file is sorted based on the HADDOCK score. Why not do the same for the cluster-based file?
Let's say that Cluster ID 4 has the second-lowest i-RMSD and is ranked 7 in relation to scores; then its `caprieval_rank` would be 2 while its `cluster_rank` would be 7.

Sure, I can disable this ranking according to the CAPRI metrics in the cluster-based output.

The metrics reported for the clustering are given according to a new parameter added to the module, `clt_threshold`. When the number of elements is smaller than the threshold, the averages are still divided by `clt_threshold` and the cluster is flagged with `under_eval=yes`.

Indeed good point, will add that.

See the explanation above (point 1), but to clarify: I have explained this before when the ranking was implemented in #213 (comment), but again here for completeness. Using the following parameters, this is how the single-structure output would look; the same logic applies to the cluster-based output (see point 1 above).
> What is the difference between caprieval_rank and cluster_rank?

Ok, clear. Need to add the documentation to the yaml file to make things clear then.
> To me these should be the same, i.e. only the rank based on the HADDOCK score is relevant here.
>
> Sure, I can disable this ranking according to the CAPRI metrics in the cluster-based output.

No need for that. But the default behaviour for `caprieval_rank` should be the score.
> Also, what are the RMSD, DockQ and other metrics reported? The average, or the best within the number of selected models within one cluster?
>
> The metrics reported for the clustering are given according to the `clt_threshold` parameter, i.e. the average in that case.
> And maybe also good to add the standard deviations to the average values.
>
> Indeed good point, will add that.
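The behaviour discussed in this exchange (average the CAPRI metrics over the top `clt_threshold` models of a cluster, and report the standard deviation alongside the average) can be sketched as follows; the dict layout and function name are illustrative, not the actual `caprieval` code:

```python
import statistics

def cluster_stats(models, clt_threshold=4, metric="irmsd"):
    """Average a metric (with stdev) over the top-N models of one cluster.

    `models` is a list of dicts holding a HADDOCK score and CAPRI metrics;
    this layout is an illustration, not the real caprieval data structure.
    """
    # models inside each cluster are ranked by score (lower is better)
    top = sorted(models, key=lambda m: m["score"])[:clt_threshold]
    values = [m[metric] for m in top]
    mean = sum(values) / len(values)
    stdev = statistics.pstdev(values)
    return mean, stdev

cluster = [
    {"score": -120.0, "irmsd": 1.2},
    {"score": -115.0, "irmsd": 1.5},
    {"score": -110.0, "irmsd": 2.1},
    {"score": -90.0, "irmsd": 4.0},
    {"score": -80.0, "irmsd": 9.0},  # ignored: outside the top 4 by score
]
mean, stdev = cluster_stats(cluster, clt_threshold=4)
print(f"{mean:.2f} +/- {stdev:.2f}")
```

The fifth model is excluded before averaging, so a few poor outliers in a large cluster do not drag its reported metrics down.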
> Also, the single-structure file is sorted based on the HADDOCK score. Why not do the same for the cluster-based file?
>
> See the explanation above (point 1), but to clarify: `caprieval` has the parameters `rankby` and `sortby`. By default all models are ranked and sorted by score, however they can be ranked and sorted by any of the metrics.

Default should always be score (because we might use it also when we don't have a reference).
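The `rankby`/`sortby` split can be sketched like this: ranks are assigned by one metric while the output order follows another. The helper below is hypothetical, not the actual module code, and assumes lower values are better for both metrics:

```python
def sort_and_rank(models, rankby="score", sortby="score"):
    """Assign ranks by one metric, then order the output by another.

    Illustrative sketch: lower is better for both score and i-RMSD here;
    the real caprieval supports more metrics.
    """
    ranked = sorted(models, key=lambda m: m[rankby])
    for rank, model in enumerate(ranked, start=1):
        model["caprieval_rank"] = rank
    return sorted(models, key=lambda m: m[sortby])

models = [
    {"name": "mdl_1", "score": -100.0, "irmsd": 1.0},
    {"name": "mdl_2", "score": -120.0, "irmsd": 2.5},
    {"name": "mdl_3", "score": -110.0, "irmsd": 4.0},
]
# default: rank and sort by score
by_score = sort_and_rank(models)
# rank by score, but present the table sorted by i-RMSD
by_irmsd = sort_and_rank(models, sortby="irmsd")
```

With `sortby="irmsd"`, `mdl_1` comes first in the output even though its score-based `caprieval_rank` is 3, which is exactly the decoupling described above.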
Yes, it's always score by default.
Codecov Report
```diff
@@            Coverage Diff             @@
##             main     #353      +/-   ##
==========================================
+ Coverage   59.49%   61.41%   +1.91%
==========================================
  Files          65       65
  Lines        4022     4266     +244
==========================================
+ Hits         2393     2620     +227
- Misses       1629     1646      +17
```
Added the standard deviation and the documentation to the yaml file.
examples/docking-protein-protein/docking-protein-protein-cltsel-test.cfg
@rvhonorato - I updated the description. I hope this matches what is done in the code.
Please check that the descriptions I edited in the yaml file match what happens in the code
examples/docking-protein-protein/docking-protein-protein-cltsel-test.cfg
This is not allowed:

```yaml
long: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Negabat igitur ullam esse artem,
quae ipsa a se proficisceretur; Si longus, levis. Nihilo beatiorem esse Metellum quam Regulum.
```

Give spaces (indent the continuation lines):

```yaml
long: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Negabat igitur ullam esse artem,
  quae ipsa a se proficisceretur; Si longus, levis. Nihilo beatiorem esse Metellum quam Regulum.
```
Yep, looks good!
Everything looks good. Having a look at `haddock3/src/haddock/modules/analysis/caprieval/capri.py`, lines 611 to 616 at ed55a99: what do you think?
Is `n` then set to `clt_threshold`?
Yes, `n` is `clt_threshold`. Because if you don't mark them, and make the correction you suggest above, they'll be artificially better. For example: here we'd select cluster 1 instead of cluster 2, and that's what will happen if we correct the scores. Is this the desired behavior?
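A toy numeric sketch of that artificial improvement (all scores are invented for illustration; lower scores are better):

```python
clt_threshold = 4
cluster_1 = [-100.0, -90.0]               # only 2 models (under-populated)
cluster_2 = [-80.0, -75.0, -70.0, -65.0]  # 4 models

# dividing by the real number of models: cluster 1 wins (-95.0 vs -72.5),
# even though its average rests on just two models
corrected_1 = sum(cluster_1) / len(cluster_1)
corrected_2 = sum(cluster_2) / len(cluster_2)

# dividing by clt_threshold regardless: cluster 1 is penalised (-47.5)
# and cluster 2 wins, at the cost of an under-evaluated average
under_eval_1 = sum(cluster_1) / clt_threshold
```

So the choice of divisor flips which cluster ranks first, which is the trade-off this thread is debating.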
But if the threshold is 4, why would you even consider cluster 1 in this case?
For low-level analysis purposes I would keep the under-populated clusters in the `capri_clt.tsv` output. Does this sound meaningful?
Yes, the reported scores should be the correct ones, i.e. divided by the correct number of models.
I think in the v2 version we automatically exclude those at some point; here I'm considering them nonetheless, maybe not the best.

Yes, sounds correct!
Just for clarity, which should we do then?
> Just for clarity, which should we do then?
>
> not consider a given cluster if N < `clt_threshold`

If we define a threshold then we should use it! Thus indeed not consider clusters that have less than the threshold (even if they might be reported in the file).

> correct the scores of the cluster if N < `clt_threshold`

The correct average score should be reported (i.e. based on the real number of models in the cluster).
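The behaviour agreed on here (report the true average, but exclude under-populated clusters from the ranking while still listing them) might look roughly like this; the function and field names are illustrative, not the actual `caprieval` implementation:

```python
def summarize_clusters(clusters, clt_threshold):
    """Summarize clusters with true averages, flagging under-populated ones.

    Illustrative sketch: clusters maps cluster id -> list of HADDOCK scores.
    Under-populated clusters are still reported, but left out of the ranking.
    """
    rows = []
    for clt_id, scores in clusters.items():
        top = sorted(scores)[:clt_threshold]
        rows.append({
            "cluster_id": clt_id,
            "n": len(scores),
            "score": sum(top) / len(top),  # real average, real divisor
            "under_eval": "yes" if len(scores) < clt_threshold else "no",
        })
    # rank only the properly populated clusters (lower score is better)
    eligible = [r for r in rows if r["under_eval"] == "no"]
    for rank, row in enumerate(sorted(eligible, key=lambda r: r["score"]), 1):
        row["cluster_rank"] = rank
    return rows

rows = summarize_clusters(
    {1: [-100.0, -90.0], 2: [-80.0, -75.0, -70.0, -65.0]},
    clt_threshold=4,
)
```

With the toy numbers from before, cluster 1 keeps its honest average but gets no rank, while cluster 2 is ranked first.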
Pull Request template

You are about to submit a new Pull Request. Before continuing make sure you read the contributing guidelines and you comply with the following criteria:

- … as explained in the contributing guidelines. You wrote explanatory comments for those tricky parts.
- `tox` tests pass. Run the `tox` command inside the repository folder.
- `-test.cfg` examples execute without errors. Inside `examples/`, run `python run_tests.py -b`.
- … other programming languages to HADDOCK3.
- … small functions instead. But, you can use classes if there's a purpose.
- … by the HADDOCK team.
Closes #352 by implementing a cluster-based output to `caprieval`; it now has two output files: `capri_ss.tsv` for single-structure and `capri_clt.tsv` for cluster-based. If there is no cluster information in the models, then only `capri_ss.tsv` is generated.

The models inside each cluster are ranked by score, so by setting `clt_threshold=4` (new parameter) the average metrics will be calculated over the top 4 of each cluster.

There is a scenario in which `clt_threshold` is larger than the number of models in a cluster, which results in the metrics being under-evaluated; e.g. a cluster has 2 models but `clt_threshold=4`, so the values of these 2 models will be divided by 4. When this happens, the module will set `under_eval=yes` in `capri_clt.tsv`.

Example 1: `capri_clt.tsv`

Example 2: `capri_clt.tsv`

Example 3 (`rankby/sortby=irmsd`): `capri_clt.tsv`, `capri_ss.tsv` (unchanged)
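A `caprieval` block in a haddock3 run configuration using the new parameter might look like this. This is a sketch: only `rankby`, `sortby`, and `clt_threshold` come from the discussion above, and the values are chosen for illustration.

```toml
[caprieval]
# rank and sort models by the HADDOCK score (the default);
# any CAPRI metric, e.g. "irmsd", can be used instead (see Example 3)
rankby = "score"
sortby = "score"
# average cluster metrics over the top 4 models of each cluster (new parameter)
clt_threshold = 4
```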