
Implement cluster-based output in caprieval #353

Merged (15 commits, Mar 4, 2022)

Conversation

@rvhonorato (Member) commented Mar 3, 2022

Pull Request template

You are about to submit a new Pull Request. Before continuing make sure
you read the contributing guidelines and you comply
with the following criteria:

  • Your code is well documented. Functions and classes have proper docstrings
    as explained in contributing guidelines. You wrote explanatory comments for
    those tricky parts.
  • You wrote tests for the new code.
  • tox tests pass. Run tox command inside the repository folder.
  • -test.cfg examples execute without errors. Inside examples/ run
    python run_tests.py -b
  • You have stuck to Python. Talk with us first if you really need to add
    other programming languages to HADDOCK3
  • code follows our coding style
  • You avoided structuring the code into classes as much as possible, using
    small functions instead. But, you can use classes if there's a purpose.
  • PR does not add any install dependencies unless permission was granted
    by the HADDOCK team
  • PR does not break licensing
  • You get extra bonus if you write tests for already existing code :godmode:

Closes #352 by implementing a cluster-based output to caprieval;

It now produces two output files: capri_ss.tsv for the single-structure analysis and capri_clt.tsv for the cluster-based analysis. If there is no cluster information in the models, only capri_ss.tsv is generated.

The models inside each cluster are ranked by score, so setting clt_threshold=4 (new parameter) computes the average metrics over the top 4 models of each cluster.

There is a scenario in which clt_threshold > number_of_models_in_a_cluster, which results in the metrics being under-evaluated; e.g. if a cluster has only 2 models but clt_threshold=4, the values of these 2 models will still be divided by 4. When this happens, the module sets under_eval=yes in capri_clt.tsv.

Example 1: capri_clt.tsv

########################################
# `caprieval` cluster-based analysis
#
# > rankby_key=score
# > rank_ascending=True
# > sortby_key=score
# > sort_ascending=True
# > clt_threshold=4
#
# NOTE: if under_eval=yes, it means that there were less models in a cluster than
#    clt_threshold, thus these values were under evaluated.
#   You might need to tweak the value of clt_threshold or change some parameters
#    in `clustfcc` depending on your analysis.
#
########################################
caprieval_rank	cluster_rank	cluster_id	n	under_eval	score	irmsd	fnat	lrmsd	dockq
1	1	2	2	yes	-11.379	1.608	0.146	2.915	0.192
2	2	1	2	yes	-8.921	2.616	0.056	4.638	0.108

Example 2: capri_clt.tsv

########################################
# `caprieval` cluster-based analysis
#
# > rankby_key=score
# > rank_ascending=True
# > sortby_key=score
# > sort_ascending=True
# > clt_threshold=4
#
# NOTE: if under_eval=yes, it means that there were less models in a cluster than
#    clt_threshold, thus these values were under evaluated.
#   You might need to tweak the value of clt_threshold or change some parameters
#    in `clustfcc` depending on your analysis.
#
########################################
caprieval_rank	cluster_rank	cluster_id	n	under_eval	score	irmsd	fnat	lrmsd	dockq
1	1	1	4	-	-33.461	2.358	0.354	4.406	0.479
2	2	3	4	-	-26.601	10.927	0.111	17.980	0.104
3	3	5	4	-	-20.855	5.686	0.111	10.384	0.193
4	4	6	4	-	-20.489	6.591	0.076	11.419	0.161
5	5	2	4	-	-18.540	9.972	0.069	16.070	0.104
6	6	8	4	-	-16.757	3.774	0.278	7.214	0.333
7	7	4	4	-	-14.531	2.544	0.361	4.915	0.457
8	8	7	4	-	-14.480	10.449	0.104	17.195	0.107

Example 3: capri_clt.tsv (rankby/sortby=irmsd)

########################################
# `caprieval` cluster-based analysis
#
# > rankby_key=irmsd
# > rank_ascending=True
# > sortby_key=irmsd
# > sort_ascending=True
# > clt_threshold=4
#
# NOTE: if under_eval=yes, it means that there were less models in a cluster than
#    clt_threshold, thus these values were under evaluated.
#   You might need to tweak the value of clt_threshold or change some parameters
#    in `clustfcc` depending on your analysis.
#
########################################
caprieval_rank	cluster_rank	cluster_id	n	under_eval	score	irmsd	fnat	lrmsd	dockq
1	1	1	4	-	-33.461	2.358	0.354	4.406	0.479
2	7	4	4	-	-14.531	2.544	0.361	4.915	0.457
3	6	8	4	-	-16.757	3.774	0.278	7.214	0.333
4	3	5	4	-	-20.855	5.686	0.111	10.384	0.193
5	4	6	4	-	-20.489	6.591	0.076	11.419	0.161
6	5	2	4	-	-18.540	9.972	0.069	16.070	0.104
7	8	7	4	-	-14.480	10.449	0.104	17.195	0.107
8	2	3	4	-	-26.601	10.927	0.111	17.980	0.104

capri_ss.tsv (unchanged)

model	caprieval_rank	score	irmsd	fnat	lrmsd	ilrmsd	dockq	cluster-id	cluster-ranking	model-cluster-ranking
../01_rigidbody/rigidbody_16.pdb	1	-24.729	3.051	0.361	5.911	4.083	0.410	2	1	1
../01_rigidbody/rigidbody_12.pdb	2	-21.579	5.714	0.111	10.224	8.199	0.195	1	2	1
../01_rigidbody/rigidbody_20.pdb	3	-20.785	3.383	0.222	5.748	4.111	0.358	2	1	2
../01_rigidbody/rigidbody_2.pdb	4	-14.105	4.751	0.111	8.330	6.732	0.237	1	2	2

@rvhonorato rvhonorato added enhancement Enhancing an existing feature or adding a new one m|caprieval Improvements in caprieval module labels Mar 3, 2022
@rvhonorato rvhonorato self-assigned this Mar 3, 2022
@amjjbonvin (Member) left a comment

What is the difference between caprieval_rank and cluster_rank?
To me these should be the same, i.e. only the rank based on the HADDOCK score is relevant here.

The cluster_id is I assume simply the cluster number returned by cluster_fcc.

Also, what are the RMSD, DockQ and other metrics reported? The average, or the best among the selected models within one cluster?

And maybe also good to add the standard deviations to the average values.

@amjjbonvin
Member

Also the single structure file is sorted based on the HADDOCK score. Why not do the same for the cluster-based file?

@rvhonorato
Member Author

rvhonorato commented Mar 3, 2022

> What is the difference between caprieval_rank and cluster_rank?

cluster_rank is always based on the score, and its topN is given by a parameter in clustfcc; caprieval_rank can vary according to the metric. Let's say you want caprieval to display the clusters with the lowest i-rmsd first (this is Example 3 I posted above), in detail below:

Let's say that:
Cluster ID 1 is the best scoring cluster, and it contains the lowest i-rmsds, then:

cluster_id = 1
cluster_rank = 1 (best score)
caprieval_rank = 1 (lowest i-rmsd)

Cluster ID 4 has the second lowest i-rmsds and is ranked 7 in relation to scores, then:

cluster_id = 4
cluster_rank = 7 (7th best in relation to score)
caprieval_rank = 2 (second lowest in i-rmsd)
> To me these should be the same, i.e. only the rank based on the HADDOCK score is relevant here.

Sure, I can disable this ranking according to the capri metrics in the cluster-based output.

> Also, what are the RMSD, DockQ and other metrics reported? The average, or the best among the selected models within one cluster?

The metrics reported for the clustering are given according to a new parameter added to caprieval in this PR, the clt_threshold parameter, example:

clt_threshold = 4
Cluster #1 has 20 elements -> the top 4 are selected and their sum is divided by 4
Cluster #5 has 5 elements -> the top 4 are selected and their sum is divided by 4
Cluster #10 has 2 elements -> the top 2 are selected and their sum is still divided by 4

When the number of elements is smaller than clt_threshold, the cluster will be under-sampled and flagged with yes in the under_eval column of the output file. These clusters can then be easily filtered out during more advanced analysis.
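To make the mechanics concrete, here is a minimal sketch of the averaging behavior described above. This is a hypothetical illustration, not the actual caprieval code; `cluster_average` is an invented name.

```python
# Hypothetical sketch of clt_threshold averaging (not the caprieval implementation).

def cluster_average(scores, clt_threshold):
    """Average the top-N scores, always dividing by clt_threshold."""
    top = sorted(scores)[:clt_threshold]      # best (lowest) scores first
    under_eval = len(scores) < clt_threshold  # flagged as "yes" in capri_clt.tsv
    return sum(top) / clt_threshold, under_eval

# Cluster with 20 models: sum of the top 4 divided by 4 -> normal case
print(cluster_average([-30, -28, -25, -22, -10] + [0] * 15, 4))  # (-26.25, False)
# Cluster with only 2 models: their sum is still divided by 4 -> under-evaluated
print(cluster_average([-30, -25], 4))  # (-13.75, True)
```

Note how the two-model cluster's average (-13.75) is worse than the true mean of its models (-27.5), which is exactly the under-evaluation the flag marks.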

> And maybe also good to add the standard deviations to the average values.

Indeed good point, will add that.

> Also the single structure file is sorted based on the HADDOCK score. Why not do the same for the cluster-based file?

See the explanation above (point 1), but to clarify: caprieval has the parameters rankby and sortby. By default all models are ranked and sorted by score, but they can be ranked and sorted by any of the metrics.

For example, you may want rankby=score and sortby=irmsd, or, for parametrization purposes, rankby=irmsd and sortby=score.

I have explained this before when the ranking was implemented in #213 (comment), but again here for completeness:

Using the following parameters, this is how the single-structure output would look:

rankby = 'score'
sortby = 'irmsd'

             model            caprieval_rank     score     irmsd      fnat     lrmsd    ilrmsd
01_rigidbody/rigidbody_15.pdb           7   -22.486     1.367     0.500     3.158     2.285
01_rigidbody/rigidbody_4.pdb            1   -33.049     1.621     0.389     3.589     2.908
01_rigidbody/rigidbody_9.pdb           17   -13.026     2.549     0.417     7.081     4.720
01_rigidbody/rigidbody_16.pdb           3   -24.729     2.950     0.361     7.472     4.502
01_rigidbody/rigidbody_20.pdb          10   -20.785     3.289     0.222     6.922     4.261

The same logic applies to the cluster-based output; see the explanation above (point 1).
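The decoupled rankby/sortby behavior can be sketched as follows. This is a hypothetical illustration with made-up values in the spirit of the table above, not the caprieval code:

```python
# Sketch of rankby vs sortby: ranks come from one metric,
# row order comes from another (hypothetical data).
models = [
    {"model": "rigidbody_4.pdb", "score": -33.049, "irmsd": 1.621},
    {"model": "rigidbody_15.pdb", "score": -22.486, "irmsd": 1.367},
    {"model": "rigidbody_16.pdb", "score": -24.729, "irmsd": 2.950},
]

# rankby = "score": caprieval_rank follows ascending score
for rank, m in enumerate(sorted(models, key=lambda m: m["score"]), start=1):
    m["caprieval_rank"] = rank

# sortby = "irmsd": rows are displayed by ascending irmsd
for m in sorted(models, key=lambda m: m["irmsd"]):
    print(m["model"], m["caprieval_rank"], m["score"], m["irmsd"])
```

The printed rows are ordered by irmsd while caprieval_rank still reflects the score, mirroring the shuffled rank column in the table above.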

@amjjbonvin
Member

amjjbonvin commented Mar 3, 2022 via email

@rvhonorato
Member Author

Yes, it's always score by default, both ranking and sorting! Will add the standard deviations.

@codecov-commenter

codecov-commenter commented Mar 4, 2022

Codecov Report

Merging #353 (5268d1e) into main (528d7b6) will increase coverage by 1.91%.
The diff coverage is 86.52%.


@@            Coverage Diff             @@
##             main     #353      +/-   ##
==========================================
+ Coverage   59.49%   61.41%   +1.91%     
==========================================
  Files          65       65              
  Lines        4022     4266     +244     
==========================================
+ Hits         2393     2620     +227     
- Misses       1629     1646      +17     
Impacted Files Coverage Δ
src/haddock/modules/analysis/caprieval/__init__.py 23.07% <0.00%> (+0.43%) ⬆️
src/haddock/modules/analysis/caprieval/capri.py 82.35% <84.34%> (+0.47%) ⬆️
tests/test_module_caprieval.py 100.00% <100.00%> (ø)
tests/test_module_topoaa.py 100.00% <0.00%> (ø)
tests/test_gear_prepare_run.py 100.00% <0.00%> (ø)
tests/test_gear_expandable_parameters.py 100.00% <0.00%> (ø)
src/haddock/gear/expandable_parameters.py 100.00% <0.00%> (ø)
src/haddock/core/defaults.py 85.71% <0.00%> (+1.09%) ⬆️
src/haddock/gear/prepare_run.py 48.71% <0.00%> (+9.34%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rvhonorato
Member Author

Added the standard deviations and the documentation to defaults.yaml

src/haddock/modules/analysis/caprieval/defaults.yaml (9 review comments, outdated and resolved)
@rvhonorato - I updated the description. I hope this matches what is done in the code.

@amjjbonvin amjjbonvin left a comment


Please check that the descriptions I edited in the yaml file match what happens in the code

@joaomcteixeira
Copy link
Member

this is not allowed:

long: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Negabat igitur ullam esse artem,
quae ipsa a se proficisceretur; Si longus, levis. Nihilo beatiorem esse Metellum quam Regulum.

add leading spaces to indent the continuation lines:

long: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Negabat igitur ullam esse artem,
   quae ipsa a se proficisceretur; Si longus, levis. Nihilo beatiorem esse Metellum quam Regulum.

@joaomcteixeira joaomcteixeira added the :godmode: label Mar 4, 2022
@rvhonorato
Member Author

> Please check that the descriptions I edited in the yaml file match what happens in the code

yep looks good!


@joaomcteixeira joaomcteixeira left a comment


Everything looks good.

@rvhonorato rvhonorato merged commit 850cf65 into main Mar 4, 2022
@rvhonorato rvhonorato deleted the caprieval_anaclust branch March 4, 2022 16:28
@mgiulini
Contributor

mgiulini commented Apr 8, 2022

Having a look at the capri_clt.tsv files I noticed that the clusters with a population lower than clt_threshold are correctly flagged with under_eval = yes. Nonetheless, the fact that the values of irmsd, fnat, dockq, lrmsd and ilrmsd are still divided by clt_threshold could be misleading. I would keep the under_eval flag but correct the values by introducing an additional check here:

def _calc_stats(data, n):
    """Calculate the mean and stdev."""
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    stdev = sqrt(var)
    return mean, stdev

what do you think?

@amjjbonvin
Member

Is n then set to clt_threshold ? If that's the case then indeed this should be corrected.

@rvhonorato
Member Author

rvhonorato commented Apr 8, 2022

Yes, n is clt_threshold, and yes, that's exactly why under_eval is there: because the sum is divided by a larger number.

Because if you don't mark them and make the correction you suggest above, they'll look artificially better, for example:

# cluster 1 
model1 -25
model2 -30
average score = -27.5

# cluster 2
model1 -35
model2 -25
model3 -30
model4 -10 
average score = -25

Here we'd select cluster 1 instead of cluster 2, and that's what will happen if we apply the correction instead of dividing by clt_threshold.

Is this the desired behavior?
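The arithmetic in the example above can be checked with a short sketch (values taken from the comment; variable names are illustrative):

```python
clt_threshold = 4
cluster1 = [-25, -30]            # 2 models
cluster2 = [-35, -25, -30, -10]  # 4 models

# Corrected averages (divide by actual cluster size): cluster 1 wins
print(sum(cluster1) / len(cluster1))  # -27.5
print(sum(cluster2) / len(cluster2))  # -25.0

# Dividing by clt_threshold instead: cluster 2 wins
print(sum(cluster1) / clt_threshold)  # -13.75
print(sum(cluster2) / clt_threshold)  # -25.0
```

This shows the trade-off under discussion: the corrected average ranks the 2-model cluster first, while dividing by clt_threshold penalizes it below the fully populated cluster.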

@amjjbonvin
Member

amjjbonvin commented Apr 8, 2022 via email

@mgiulini
Contributor

mgiulini commented Apr 8, 2022

For low-level analysis purposes I would keep the under-populated clusters in the capri_clt.tsv file together with the flag and the correct averages. I would exclude them from plot routines and other high-level analysis.

Does this sound meaningful?

@amjjbonvin
Member

amjjbonvin commented Apr 8, 2022 via email

@rvhonorato
Member Author

> But if the threshold is 4, why would you even consider cluster1 in this case?

I think in the v2 version we automatically exclude those at some point; here I'm considering them nonetheless, which is maybe not the best.

> For low-level analysis purposes I would keep the under-populated clusters in the capri_clt.tsv file together with the flag and the correct averages. I would exclude them from plot routines and other high-level analysis.

Yes sounds correct!

@rvhonorato
Member Author

Just for clarity, which should we do then?

  1. not consider a given cluster if N < clt_threshold
  2. correct the scores of the cluster if N < clt_threshold

@amjjbonvin
Member

amjjbonvin commented Apr 8, 2022 via email

Labels
enhancement Enhancing an existing feature or adding a new one m|caprieval Improvements in caprieval module
Development

Successfully merging this pull request may close these issues.

Add cluster-based analysis to caprieval
5 participants