
Implement cluster-based output in caprieval #353

Merged (15 commits, Mar 4, 2022)

Conversation

@rvhonorato (Member) commented Mar 3, 2022

Pull Request template

You are about to submit a new Pull Request. Before continuing make sure
you read the contributing guidelines and you comply
with the following criteria:

  • Your code is well documented. Functions and classes have proper docstrings
    as explained in contributing guidelines. You wrote explanatory comments for
    those tricky parts.
  • You wrote tests for the new code.
  • tox tests pass. Run tox command inside the repository folder.
  • -test.cfg examples execute without errors. Inside examples/ run
    python run_tests.py -b
  • You have stuck to Python. Talk with us first if you really need to add
    other programming languages to HADDOCK3
  • code follows our coding style
  • You avoided structuring the code into classes as much as possible, using
    small functions instead. But, you can use classes if there's a purpose.
  • PR does not add any install dependencies unless permission was granted
    by the HADDOCK team
  • PR does not break licensing
  • You get extra bonus if you write tests for already existing code :godmode:

Closes #352 by implementing a cluster-based output to caprieval;

It now produces two output files: capri_ss.tsv for the single-structure analysis and capri_clt.tsv for the cluster-based analysis. If there is no cluster information in the models, only capri_ss.tsv is generated.

The models inside each cluster are ranked by score, so setting clt_threshold=4 (new parameter) computes the average metrics over the top 4 models of each cluster.

There is a scenario in which clt_threshold > number_of_models_in_a_cluster, which results in the metrics being under-evaluated; e.g. if a cluster has only 2 models but clt_threshold=4, the values of these 2 models will still be divided by 4. When this happens, the module sets under_eval=yes in capri_clt.tsv.

Example 1: capri_clt.tsv

########################################
# `caprieval` cluster-based analysis
#
# > rankby_key=score
# > rank_ascending=True
# > sortby_key=score
# > sort_ascending=True
# > clt_threshold=4
#
# NOTE: if under_eval=yes, it means that there were less models in a cluster than
#    clt_threshold, thus these values were under evaluated.
#   You might need to tweak the value of clt_threshold or change some parameters
#    in `clustfcc` depending on your analysis.
#
########################################
caprieval_rank	cluster_rank	cluster_id	n	under_eval	score	irmsd	fnat	lrmsd	dockq
1	1	2	2	yes	-11.379	1.608	0.146	2.915	0.192
2	2	1	2	yes	-8.921	2.616	0.056	4.638	0.108

Example 2: capri_clt.tsv

########################################
# `caprieval` cluster-based analysis
#
# > rankby_key=score
# > rank_ascending=True
# > sortby_key=score
# > sort_ascending=True
# > clt_threshold=4
#
# NOTE: if under_eval=yes, it means that there were less models in a cluster than
#    clt_threshold, thus these values were under evaluated.
#   You might need to tweak the value of clt_threshold or change some parameters
#    in `clustfcc` depending on your analysis.
#
########################################
caprieval_rank	cluster_rank	cluster_id	n	under_eval	score	irmsd	fnat	lrmsd	dockq
1	1	1	4	-	-33.461	2.358	0.354	4.406	0.479
2	2	3	4	-	-26.601	10.927	0.111	17.980	0.104
3	3	5	4	-	-20.855	5.686	0.111	10.384	0.193
4	4	6	4	-	-20.489	6.591	0.076	11.419	0.161
5	5	2	4	-	-18.540	9.972	0.069	16.070	0.104
6	6	8	4	-	-16.757	3.774	0.278	7.214	0.333
7	7	4	4	-	-14.531	2.544	0.361	4.915	0.457
8	8	7	4	-	-14.480	10.449	0.104	17.195	0.107

Example 3: capri_clt.tsv (rankby/sortby=irmsd)

########################################
# `caprieval` cluster-based analysis
#
# > rankby_key=irmsd
# > rank_ascending=True
# > sortby_key=irmsd
# > sort_ascending=True
# > clt_threshold=4
#
# NOTE: if under_eval=yes, it means that there were less models in a cluster than
#    clt_threshold, thus these values were under evaluated.
#   You might need to tweak the value of clt_threshold or change some parameters
#    in `clustfcc` depending on your analysis.
#
########################################
caprieval_rank	cluster_rank	cluster_id	n	under_eval	score	irmsd	fnat	lrmsd	dockq
1	1	1	4	-	-33.461	2.358	0.354	4.406	0.479
2	7	4	4	-	-14.531	2.544	0.361	4.915	0.457
3	6	8	4	-	-16.757	3.774	0.278	7.214	0.333
4	3	5	4	-	-20.855	5.686	0.111	10.384	0.193
5	4	6	4	-	-20.489	6.591	0.076	11.419	0.161
6	5	2	4	-	-18.540	9.972	0.069	16.070	0.104
7	8	7	4	-	-14.480	10.449	0.104	17.195	0.107
8	2	3	4	-	-26.601	10.927	0.111	17.980	0.104

capri_ss.tsv (unchanged)

model	caprieval_rank	score	irmsd	fnat	lrmsd	ilrmsd	dockq	cluster-id	cluster-ranking	model-cluster-ranking
../01_rigidbody/rigidbody_16.pdb	1	-24.729	3.051	0.361	5.911	4.083	0.410	2	1	1
../01_rigidbody/rigidbody_12.pdb	2	-21.579	5.714	0.111	10.224	8.199	0.195	1	2	1
../01_rigidbody/rigidbody_20.pdb	3	-20.785	3.383	0.222	5.748	4.111	0.358	2	1	2
../01_rigidbody/rigidbody_2.pdb	4	-14.105	4.751	0.111	8.330	6.732	0.237	1	2	2

@rvhonorato rvhonorato added enhancement Enhancing an existing feature or adding a new one m|caprieval Improvements in caprieval module labels Mar 3, 2022
@rvhonorato rvhonorato self-assigned this Mar 3, 2022
@amjjbonvin (Member) left a comment

What is the difference between caprieval_rank and cluster_rank?
To me these should be the same, i.e. only the rank based on the HADDOCK score is relevant here.

The cluster_id is I assume simply the cluster number returned by cluster_fcc.

Also, what are the RMSD, DockQ and other metrics reported? The average, or the best among the selected models within one cluster?

And maybe also good to add the standard deviations to the average values.

@amjjbonvin
Member

Also the single structure file is sorted based on the HADDOCK score. Why not do the same for the cluster-based file?

@rvhonorato
Member Author

rvhonorato commented Mar 3, 2022

> What is the difference between caprieval_rank and cluster_rank?

cluster_rank is always based on the score, and its topN is given by a parameter in clustfcc; caprieval_rank can vary according to the metric. Let's say you want caprieval to display the clusters with the lowest i-rmsd first (this is Example 3 I posted above), in detail below:

Let's say that:
Cluster ID 1 is the best scoring cluster, and it contains the lowest i-rmsds, then:

cluster_id = 1
cluster_rank = 1 (best score)
caprieval_rank = 1 (lowest i-rmsd)

Cluster ID 4 has the second lowest i-rmsds and is ranked 7 in relation to scores, then:

cluster_id = 4
cluster_rank = 7 (7th best in relation to score)
caprieval_rank = 2 (second lowest in i-rmsd)
> To me these should be the same, i.e. only the rank based on the HADDOCK score is relevant here.

Sure, I can disable this ranking according to the capri metrics in the cluster-based output.

> Also, what are the RMSD, DockQ and other metrics reported? The average, or the best among the selected models within one cluster?

The metrics reported for the clustering are given according to a new parameter added to caprieval in this PR, the clt_threshold parameter, example:

clt_threshold = 4
Cluster #1 has 20 elements -> the top 4 are selected and their sum is divided by 4
Cluster #5 has 5 elements -> the top 4 are selected and their sum is divided by 4
Cluster #10 has 2 elements -> the top 2 are selected and their sum is still divided by 4

When the number of elements is smaller than clt_threshold, the cluster will be under-sampled and flagged with yes in the under_eval column of the output file. These clusters can then be easily filtered out during more advanced analysis.
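To make the mechanics concrete, here is a minimal sketch of the averaging behavior described above. This is a hypothetical illustration, not the actual caprieval code; `cluster_average` is an invented name.

```python
# Hypothetical sketch of clt_threshold averaging (not the caprieval implementation).

def cluster_average(scores, clt_threshold):
    """Average the top-N scores, always dividing by clt_threshold."""
    top = sorted(scores)[:clt_threshold]      # best (lowest) scores first
    under_eval = len(scores) < clt_threshold  # flagged as "yes" in capri_clt.tsv
    return sum(top) / clt_threshold, under_eval

# Cluster with 20 models: sum of the top 4 divided by 4 -> normal case
print(cluster_average([-30, -28, -25, -22, -10] + [0] * 15, 4))  # (-26.25, False)
# Cluster with only 2 models: their sum is still divided by 4 -> under-evaluated
print(cluster_average([-30, -25], 4))  # (-13.75, True)
```

Note how the two-model cluster's average (-13.75) is worse than the true mean of its models (-27.5), which is exactly the under-evaluation the flag marks.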

> And maybe also good to add the standard deviations to the average values.

Indeed good point, will add that.

> Also the single structure file is sorted based on the HADDOCK score. Why not do the same for the cluster-based file?

See the explanation above (point 1), but to clarify: caprieval has the parameters rankby and sortby. By default all models are ranked and sorted by score, but they can be ranked and sorted by any of the metrics.

For example, you may want rankby=score and sortby=irmsd, or, for parametrization purposes, rankby=irmsd and sortby=score.

I have explained this before when the ranking was implemented in #213 (comment), but again here for completeness:

Using the following parameters, this is how the single-structure output would look:

rankby = 'score'
sortby = 'irmsd'

             model            caprieval_rank     score     irmsd      fnat     lrmsd    ilrmsd
01_rigidbody/rigidbody_15.pdb           7   -22.486     1.367     0.500     3.158     2.285
01_rigidbody/rigidbody_4.pdb            1   -33.049     1.621     0.389     3.589     2.908
01_rigidbody/rigidbody_9.pdb           17   -13.026     2.549     0.417     7.081     4.720
01_rigidbody/rigidbody_16.pdb           3   -24.729     2.950     0.361     7.472     4.502
01_rigidbody/rigidbody_20.pdb          10   -20.785     3.289     0.222     6.922     4.261

The same logic applies to the cluster-based output; see the explanation above (point 1).
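The decoupled rankby/sortby behavior can be sketched as follows. This is a hypothetical illustration with made-up values in the spirit of the table above, not the caprieval code:

```python
# Sketch of rankby vs sortby: ranks come from one metric,
# row order comes from another (hypothetical data).
models = [
    {"model": "rigidbody_4.pdb", "score": -33.049, "irmsd": 1.621},
    {"model": "rigidbody_15.pdb", "score": -22.486, "irmsd": 1.367},
    {"model": "rigidbody_16.pdb", "score": -24.729, "irmsd": 2.950},
]

# rankby = "score": caprieval_rank follows ascending score
for rank, m in enumerate(sorted(models, key=lambda m: m["score"]), start=1):
    m["caprieval_rank"] = rank

# sortby = "irmsd": rows are displayed by ascending irmsd
for m in sorted(models, key=lambda m: m["irmsd"]):
    print(m["model"], m["caprieval_rank"], m["score"], m["irmsd"])
```

The printed rows are ordered by irmsd while caprieval_rank still reflects the score, mirroring the shuffled rank column in the table above.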

@amjjbonvin
Member

amjjbonvin commented Mar 3, 2022 via email

@rvhonorato
Member Author

Yes, it's always score by default, both ranking and sorting! Will add the standard deviations.

@codecov-commenter

codecov-commenter commented Mar 4, 2022

Codecov Report

Merging #353 (5268d1e) into main (528d7b6) will increase coverage by 1.91%.
The diff coverage is 86.52%.


@@            Coverage Diff             @@
##             main     #353      +/-   ##
==========================================
+ Coverage   59.49%   61.41%   +1.91%     
==========================================
  Files          65       65              
  Lines        4022     4266     +244     
==========================================
+ Hits         2393     2620     +227     
- Misses       1629     1646      +17     
Impacted Files Coverage Δ
src/haddock/modules/analysis/caprieval/__init__.py 23.07% <0.00%> (+0.43%) ⬆️
src/haddock/modules/analysis/caprieval/capri.py 82.35% <84.34%> (+0.47%) ⬆️
tests/test_module_caprieval.py 100.00% <100.00%> (ø)
tests/test_module_topoaa.py 100.00% <0.00%> (ø)
tests/test_gear_prepare_run.py 100.00% <0.00%> (ø)
tests/test_gear_expandable_parameters.py 100.00% <0.00%> (ø)
src/haddock/gear/expandable_parameters.py 100.00% <0.00%> (ø)
src/haddock/core/defaults.py 85.71% <0.00%> (+1.09%) ⬆️
src/haddock/gear/prepare_run.py 48.71% <0.00%> (+9.34%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rvhonorato
Member Author

Added the standard deviations and the documentation to defaults.yaml

src/haddock/modules/analysis/caprieval/defaults.yaml (9 review comments, outdated and resolved)
@rvhonorato - I updated the description. I hope this matches what is done in the code.

@amjjbonvin amjjbonvin left a comment


Please check that the descriptions I edited in the yaml file match what happens in the code

@joaomcteixeira
Copy link
Member

this is not allowed:

long: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Negabat igitur ullam esse artem,
quae ipsa a se proficisceretur; Si longus, levis. Nihilo beatiorem esse Metellum quam Regulum.

add leading spaces to indent the continuation lines:

long: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Negabat igitur ullam esse artem,
   quae ipsa a se proficisceretur; Si longus, levis. Nihilo beatiorem esse Metellum quam Regulum.

@joaomcteixeira joaomcteixeira added the :godmode: label Mar 4, 2022
@rvhonorato
Member Author

> Please check that the descriptions I edited in the yaml file match what happens in the code

yep looks good!


@joaomcteixeira joaomcteixeira left a comment


Everything looks good.

@rvhonorato rvhonorato merged commit 850cf65 into main Mar 4, 2022
@rvhonorato rvhonorato deleted the caprieval_anaclust branch March 4, 2022 16:28
@mgiulini
Contributor

mgiulini commented Apr 8, 2022

Having a look at the capri_clt.tsv files I noticed that the clusters with a population lower than clt_threshold are correctly flagged with under_eval = yes. Nonetheless, the fact that the values of irmsd, fnat, dockq, lrmsd and ilrmsd are still divided by clt_threshold could be misleading. I would keep the under_eval flag but correct the values by introducing an additional check here:

def _calc_stats(data, n):
    """Calculate the mean and stdev."""
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    stdev = sqrt(var)
    return mean, stdev

what do you think?

@amjjbonvin
Member

Is n then set to clt_threshold ? If that's the case then indeed this should be corrected.

@rvhonorato
Member Author

rvhonorato commented Apr 8, 2022

Yes, n is clt_threshold, and yes, that's exactly why under_eval is there: because the sum is divided by a larger number.

Because if you don't mark them and make the correction you suggest above, they'll look artificially better, for example:

# cluster 1 
model1 -25
model2 -30
average score = -27.5

# cluster 2
model1 -35
model2 -25
model3 -30
model4 -10 
average score = -25

Here we'd select cluster 1 instead of cluster 2, and that's what will happen if we apply the correction instead of dividing by clt_threshold.

Is this the desired behavior?
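The arithmetic in the example above can be checked with a short sketch (values taken from the comment; variable names are illustrative):

```python
clt_threshold = 4
cluster1 = [-25, -30]            # 2 models
cluster2 = [-35, -25, -30, -10]  # 4 models

# Corrected averages (divide by actual cluster size): cluster 1 wins
print(sum(cluster1) / len(cluster1))  # -27.5
print(sum(cluster2) / len(cluster2))  # -25.0

# Dividing by clt_threshold instead: cluster 2 wins
print(sum(cluster1) / clt_threshold)  # -13.75
print(sum(cluster2) / clt_threshold)  # -25.0
```

This shows the trade-off under discussion: the corrected average ranks the 2-model cluster first, while dividing by clt_threshold penalizes it below the fully populated cluster.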

@amjjbonvin
Member

amjjbonvin commented Apr 8, 2022 via email

@mgiulini
Contributor

mgiulini commented Apr 8, 2022

For low-level analysis purposes I would keep the under-populated clusters in the capri_clt.tsv file together with the flag and the correct averages. I would exclude them from plot routines and other high-level analysis.

Does this sound meaningful?

@amjjbonvin
Member

amjjbonvin commented Apr 8, 2022 via email

@rvhonorato
Member Author

> But if the threshold is 4, why would you even consider cluster1 in this case?

I think in the v2 version we automatically exclude those at some point; here I'm considering them nonetheless, which is maybe not the best.

> For low-level analysis purposes I would keep the under-populated clusters in the capri_clt.tsv file together with the flag and the correct averages. I would exclude them from plot routines and other high-level analysis.

Yes sounds correct!

@rvhonorato
Member Author

Just for clarity, which should we do then?

  1. not consider a given cluster if N < clt_threshold
  2. correct the scores of the cluster if N < clt_threshold

@amjjbonvin
Member

amjjbonvin commented Apr 8, 2022 via email

Labels
enhancement Enhancing an existing feature or adding a new one m|caprieval Improvements in caprieval module
Development

Successfully merging this pull request may close these issues.

Add cluster-based analysis to caprieval
5 participants