haddock3-analyse: create report html page based on individual plots #611
@mgiulini can you please review this pull request? I couldn't add a reviewer to the |
Thanks for checking this. It fails on finding the best structures from For |
Be careful with the number of best structures: it depends on the clustering settings and other parameters. Also, the tests do not use the full sampling, which might explain some of the issues you are having in the analysis.
src/haddock/libs/libplots.py
Outdated
def _add_links(best_struct_df):
    table_df = best_struct_df.copy()
    for col_name in table_df.columns[1:]:
        file_name = best_struct_df[col_name].apply(lambda row: Path(row).name)
        # href empty for local files
        # TODO fix links to Download and Visibility
        # we can use html href syntax but html should be responsive!
        dl_text = "Download"
        vis_text = "Visibility"
        table_df[col_name] = file_name + ", " + dl_text + ", " + vis_text
    return table_df
I guess this is the function that should create all the links to the structures and to their visualization. The problem is that a standard run might have hundreds (if not thousands) of clusters, and including all the links here would probably result in a huge html report. What do you think?
Good point. I am wondering how this issue was handled in haddock 2.4. Here, the table scrolls vertically. I committed the report.html generated for -m 5, see my commit. If the file size of the report is a concern, we need to think about exporting it to another format, perhaps a csv file!
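A minimal sketch of the csv idea, assuming the HTML table only needs the first rows and the full table can live in a separate file. The helper name `split_report_table` and its `max_rows` parameter are hypothetical and not part of this PR.

```python
import io

import pandas as pd


def split_report_table(best_struct_df, max_rows=10):
    """Return the rows meant for the HTML table and the full table as CSV text.

    Hypothetical helper: the name and the max_rows default are assumptions.
    """
    # only the first rows go into the HTML report
    html_part = best_struct_df.head(max_rows)
    # the complete table is serialised as CSV, without the index column
    buffer = io.StringIO()
    best_struct_df.to_csv(buffer, index=False)
    return html_part, buffer.getvalue()


# toy table with 25 clusters and a single structure column
toy = pd.DataFrame({
    "Cluster ID": range(1, 26),
    "Nr 1 best structure": [f"cluster_{i}_model_1.pdb" for i in range(1, 26)],
})
html_part, csv_text = split_report_table(toy)
```

The CSV text could then be written next to report.html, so the report stays small even for runs with thousands of clusters.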
src/haddock/libs/libplots.py
Outdated
    dfcl : pandas DataFrame
        DataFrame of capri table with new column names
    """
    dfcl = dfcl.sort_values(by=["cluster_id"])
Related to the previous comment: the run might give rise to hundreds or thousands of clusters. In that case I am not sure it makes sense to list all of them, and for sure they should not be ordered by cluster-id (the best cluster, with the lowest model-cluster-ranking, might have cluster_id = 2000).
see my comment above.
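The ordering concern can be sketched as follows, assuming the cluster table carries a rank column. The column name `cluster_rank`, the `score` values and the helper `order_clusters_by_rank` are assumptions made for illustration; only `cluster_id` appears in the diff above.

```python
import pandas as pd


def order_clusters_by_rank(dfcl, rank_col="cluster_rank", top=None):
    """Order clusters best-first by rank and optionally keep only the top ones.

    rank_col is an assumed column name, not taken from the PR.
    """
    ordered = dfcl.sort_values(by=[rank_col]).reset_index(drop=True)
    return ordered if top is None else ordered.head(top)


# toy cluster table: the best-ranked cluster has the largest cluster_id
dfcl = pd.DataFrame({
    "cluster_id": [1, 2, 2000],
    "cluster_rank": [3, 2, 1],
    "score": [-80.0, -95.5, -120.3],
})
ordered = order_clusters_by_rank(dfcl)
top_two = order_clusters_by_rank(dfcl, top=2)
```

Sorting by rank puts cluster 2000 first, which sorting by `cluster_id` would have pushed to the bottom of the table.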
src/haddock/libs/libplots.py
Outdated
def find_best_struct(ss_file, number_of_struct=4):
    """
    Find best structures.

    It inspects model-cluster-ranking recorded in capri_ss.tsv file and finds
    the best models (models with lower ranks). By default, it selects the 4 best
    models.
I agree with @amjjbonvin here: the number of structures to choose depends on many parameters, and the data we show here can disagree with those reported in previous steps of the workflow (for example the clustfcc step).
I see three potential solutions here:
- we report mean and standard deviation over the full set of structures belonging to a cluster
- we choose 4 as the default number (in this case we should state it clearly, e.g. "data refer to the first 4 structures")
- (not so immediate) we make the analysis read this number from the capri_clt.tsv file, where the value of clt_threshold is reported
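The first two options can be sketched with a single hypothetical helper: with `n_best=None` it reports mean and standard deviation over every structure of a cluster, and with `n_best=4` it would restrict the statistics to the first 4 structures. The `score` column and the function name `cluster_stats` are assumptions. `cluster-id` and `model-cluster-ranking` are the column names used in the diff.

```python
import pandas as pd


def cluster_stats(dfss, value_col="score", n_best=None):
    """Mean and standard deviation of value_col per cluster.

    Hypothetical helper. n_best=None uses every structure of a cluster,
    an integer keeps only the n_best top-ranked structures of each cluster.
    """
    ordered = dfss.sort_values(by=["cluster-id", "model-cluster-ranking"])
    if n_best is not None:
        ordered = ordered.groupby("cluster-id").head(n_best)
    return ordered.groupby("cluster-id")[value_col].agg(["mean", "std"])


# toy single-structure table with two clusters
dfss = pd.DataFrame({
    "cluster-id": [1, 1, 1, 2, 2],
    "model-cluster-ranking": [1, 2, 3, 1, 2],
    "score": [-14.0, -12.0, -10.0, -22.0, -20.0],
})
full = cluster_stats(dfss)
first_two = cluster_stats(dfss, n_best=2)
```

Whichever option is chosen, the report should state which set of structures the numbers refer to.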
""" | ||
    dfss = read_capri_table(ss_file)
    dfss = dfss.sort_values(by=["cluster-id", "model-cluster-ranking"])
    # TODO need a check for "Unclustered"
I wonder how to implement this check. I would maybe report the top number_of_struct unclustered structures in the table as well.
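One possible shape for such a check, as a sketch only. It assumes that unclustered models carry a marker value in the `cluster-id` column (here `"-"`) and that a global rank column exists (here `caprieval_rank`). Both names, and the helper `top_unclustered`, are assumptions to be checked against the real capri_ss.tsv.

```python
import pandas as pd

UNCLUSTERED = "-"  # assumed marker for models that belong to no cluster


def top_unclustered(dfss, number_of_struct=4, rank_col="caprieval_rank"):
    """Return the best-ranked unclustered models (hypothetical helper)."""
    # compare as strings so that mixed int/str cluster ids are handled
    is_unclustered = dfss["cluster-id"].astype(str) == UNCLUSTERED
    return dfss[is_unclustered].sort_values(by=[rank_col]).head(number_of_struct)


# toy table: one clustered model and three unclustered ones
dfss = pd.DataFrame({
    "model": ["a.pdb", "b.pdb", "c.pdb", "d.pdb"],
    "cluster-id": [1, "-", "-", "-"],
    "caprieval_rank": [1, 4, 2, 3],
})
best_unclustered = top_unclustered(dfss, number_of_struct=2)
```

The resulting rows could be appended to the best-structure table as an extra "Unclustered" row.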
src/haddock/libs/libplots.py
Outdated
    max_number_of_struct = dfss.groupby("cluster-id").count()["model-cluster-ranking"].min()
    number_of_struct = min(number_of_struct, max_number_of_struct)
    best_struct_df = dfss.groupby("cluster-id").head(number_of_struct).copy()
    number_of_cluster = len(best_struct_df["cluster-id"].unique())
    col_names = [
        f"Nr {number + 1} best structure" for number in range(number_of_struct)
        ] * number_of_cluster
    best_struct_df = best_struct_df.assign(Structure=col_names)
    best_struct_df = best_struct_df.pivot_table(
        index=["cluster-id"],
        columns=["Structure"],
        values="model",
        aggfunc=lambda x: x,
        )
    best_struct_df.reset_index(inplace=True)
    best_struct_df.rename(columns={"cluster-id": "Cluster ID"}, inplace=True)
Could you add some comments here and there to help clarify the procedure?
sure, I added comments. Let me know if something is still unclear.
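For readers following the procedure, here is a commented, self-contained sketch of the same idea on toy data. It is not the code of the PR: it takes a DataFrame instead of a file, and it swaps the repeated `col_names` list and the identity `pivot_table` for `groupby(...).cumcount()` and `DataFrame.pivot`. The function name `best_struct_table` is hypothetical.

```python
import pandas as pd


def best_struct_table(dfss, number_of_struct=4):
    """Build a table with one row per cluster and one column per best model."""
    # order the models of each cluster from best to worst rank
    dfss = dfss.sort_values(by=["cluster-id", "model-cluster-ranking"])
    # never ask for more models than the smallest cluster contains,
    # so that every cluster fills every column
    smallest = dfss.groupby("cluster-id")["model"].count().min()
    number_of_struct = min(number_of_struct, smallest)
    # keep the first number_of_struct models of each cluster
    best = dfss.groupby("cluster-id").head(number_of_struct).copy()
    # position of each model inside its cluster (1, 2, ...) gives the column
    position = best.groupby("cluster-id").cumcount() + 1
    best["Structure"] = "Nr " + position.astype(str) + " best structure"
    # long to wide: clusters as rows, "Nr k best structure" as columns
    table = best.pivot(index="cluster-id", columns="Structure", values="model")
    table = table.reset_index().rename(columns={"cluster-id": "Cluster ID"})
    table.columns.name = None
    return table


# toy data: cluster 1 has three models, cluster 2 has two
dfss = pd.DataFrame({
    "cluster-id": [2, 1, 1, 2, 1],
    "model-cluster-ranking": [2, 1, 2, 1, 3],
    "model": ["m2b.pdb", "m1a.pdb", "m1b.pdb", "m2a.pdb", "m1c.pdb"],
})
table = best_struct_table(dfss)
```

With the toy data the default of 4 is reduced to 2, because the smallest cluster only has two models.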
Co-authored-by: Marco Giulini <54807167+mgiulini@users.noreply.github.com>
On the HADDOCK server we report by default the top 10 clusters. For the visualisation all should be plotted, but maybe only the top 10 color-coded and labelled. And the order is based on the scoring function and not the cluster number, as Marco explained. I.e. the |
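The plotting rule described here (plot everything, colour and label only the top 10) could be sketched like this. The helper `style_clusters`, the palette and the grey fallback colour are assumptions. The input list is expected to be ordered best-first by the scoring function, not by cluster number.

```python
def style_clusters(ranked_cluster_ids, top=10, other_color="lightgray"):
    """Map each cluster id to a colour and an optional legend label.

    Hypothetical helper: ranked_cluster_ids must already be ordered
    best-first by the scoring function.
    """
    palette = ["red", "blue", "green", "orange", "purple",
               "brown", "pink", "olive", "cyan", "black"]
    styles = {}
    for position, cluster_id in enumerate(ranked_cluster_ids):
        if position < top:
            # top clusters get their own colour and a legend entry
            styles[cluster_id] = {
                "color": palette[position % len(palette)],
                "label": f"Cluster {cluster_id}",
            }
        else:
            # remaining clusters are still plotted, but in grey and unlabelled
            styles[cluster_id] = {"color": other_color, "label": None}
    return styles


# twelve clusters ordered by score, best first
ranked = [7, 2000, 3, 1, 12, 5, 9, 4, 8, 6, 11, 10]
styles = style_clusters(ranked)
```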
Codecov Report
Patch coverage:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   73.81%   74.58%   +0.76%
==========================================
  Files         110      111       +1
  Lines        7347     7565     +218
==========================================
+ Hits         5423     5642     +219
+ Misses       1924     1923       -1
... and 4 files with indirect coverage changes
As html file from tests/golden_data is 831Kb
- zero pad best structure column names
- tried html in cell, but failed
- dont write clt_table.html as already part of report.html
- add title to report.html
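A small sketch of the zero padding, presumably needed because pivoting sorts column labels as strings, so "Nr 10" would otherwise land before "Nr 2". The helper name `best_struct_col_names` is hypothetical.

```python
def best_struct_col_names(number_of_struct):
    """Return zero-padded column names so that string sorting keeps the order."""
    # pad to the number of digits of the largest index, e.g. 2 digits for 10
    width = len(str(number_of_struct))
    return [
        f"Nr {number:0{width}d} best structure"
        for number in range(1, number_of_struct + 1)
    ]


padded = best_struct_col_names(10)
unpadded = [f"Nr {number} best structure" for number in range(1, 11)]
```

Sorting `padded` leaves it unchanged, while sorting `unpadded` moves "Nr 10 best structure" to the second position.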
The number of best structures has been set to 10. Replaced examples/analysis/report.html with a script to generate report.html from tests/golden_data.
Hi there, I tried to run the updated version of the PR. The reports look nice for the example runs, so I guess this PR is almost ready to be merged. Just a few comments from my side:
- currently the report contains 10 box plots. In the first row there are HADDOCK-score, interface rmsd, ligand-rmsd, interface-ligand-RMSD, and VDW energy, while the remaining terms lie in the second row. From a docking perspective it makes more sense to put the energetic terms (HADDOCK-SCORE, VDW, ELECTROSTATIC, RESTRAINTS and DESOLVATION) in one row and the structural comparison observables (all the other terms) in another row. Could you do this?
- I don't know if it makes sense to introduce the report.py script in the analysis, as the report is now properly generated whenever the postprocess option is set to true. I think it's better if we provide a comprehensive description of the analysis (for ex. here: create detailed documentation for haddock3-analyse #628), rather than leaving a python script there. I wouldn't add the report.html either, as it will probably change shape and content many times and we don't want to update it every time. Do you agree with me here?
- could you please add some docstrings to test_libplots.py so that it is possible to understand what we're testing?
- next time, please do not correct the linting with other linting software, but run the tox -e lint command, as this PR breaks the linting (I will correct it)
Besides this minor stuff, great addition!
PS: @amjjbonvin if you need this to be merged now, we could fix these minor points in other PRs
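The requested plot arrangement (first bullet) can be sketched as a small layout function: energetic terms in the first row, everything else in the second. The term keys used below (`score`, `vdw`, `elec`, `air`, `desolv`, `irmsd`, ...) and the helper `layout_box_plots` are hypothetical placeholders for whatever names the report code uses.

```python
def layout_box_plots(terms, energy_terms):
    """Assign a (row, column) grid position to every plotted term.

    Row 1 holds the energetic terms in the order given by energy_terms,
    row 2 holds all remaining (structural) terms in their original order.
    """
    energy = [term for term in energy_terms if term in terms]
    structural = [term for term in terms if term not in energy_terms]
    layout = {}
    for row, row_terms in enumerate((energy, structural), start=1):
        for col, term in enumerate(row_terms, start=1):
            layout[term] = (row, col)
    return layout


# assumed keys: the plot order before the reordering, mixing both kinds of terms
current_order = ["score", "irmsd", "lrmsd", "ilrmsd", "vdw",
                 "elec", "air", "desolv", "fnat", "dockq"]
energy_terms = ["score", "vdw", "elec", "air", "desolv"]
layout = layout_box_plots(current_order, energy_terms)
```

The (row, column) pairs are 1-based, which matches how subplot grids are commonly addressed in plotting libraries.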
yes, I reordered the list of plots, see my commit.
agreed, it is removed now.
it is added.
if I understood correctly |
Indeed HADDOCK3 does not have auto-linter because I set up the CI methods in haddock3, and I don't like automatic code edits 😉. But, you are now free to change Keep in mind linting is just an agreement between developers to homogenize the writing style so that developers know "how to look" at the code anywhere in the project. So, there's human freedom to change it. The current haddock3 settings reflect my writing preferences. Cheers, |
looks good now, time to merge it into the main branch!
as for the lint, yes, no worries, it will be fixed at some time
You are about to submit a new Pull Request. Before continuing make sure you read the contributing guidelines and you comply with the following criteria:
- Your PR is about CNS
- You wrote tests for the new code
- tox tests pass. Run tox command inside the repository folder
- -test.cfg examples execute without errors. Inside examples/ run python run_tests.py -b
- Your PR is about writing documentation for already existing code 🔥
- Your PR is about writing tests for already existing code

This PR extends the functionality of the command line haddock3-analyse. After running the analyse, two HTML files are generated:
- report.html contains all of the plots (tables, boxes and scatters),
- clt_table.html contains tables that show a summary of the capri tables.

Example: this report.html is generated by running haddock3-analyse -r scenario2a-NMR-epitope-pass -m 9 on the results of scenario2a-NMR-epitope-pass introduced in the tutorial HADDOCK3-antibody-antigen.

This pull request:
- clt_table.html in analysis dir
- report.html in analysis dir
- report.html file to example/analysis dir

Todo list for the future:
- clt_table.html
- clt_table.html

Note: Linter fails because the matplotlib has not been added to the dependencies.