
haddock3-analyse: create report html page based on individual plots #611

Merged (39 commits) on Mar 23, 2023

Conversation

SarahAlidoost (Contributor) commented Feb 8, 2023

You are about to submit a new Pull Request. Before continuing, make sure you have read the contributing guidelines and that you comply with the following criteria:

  • You have stuck to Python. Talk with us before adding other programming languages to HADDOCK3
  • Your PR is about CNS
  • Your code is well documented: proper docstrings and explanatory comments for those tricky parts
  • You structured the code into small functions as much as possible. You can use classes if there's a (state) purpose
  • Your code follows our coding style
  • You wrote tests for the new code
  • tox tests pass. Run tox command inside the repository folder
  • -test.cfg examples execute without errors. Inside examples/ run python run_tests.py -b
  • PR does not add any install dependencies unless permission granted by the HADDOCK team
  • PR does not break licensing
  • Your PR is about writing documentation for already existing code 🔥
  • Your PR is about writing tests for already existing code :godmode:

This PR extends the functionality of the haddock3-analyse command-line tool. After running the analysis, two HTML files are generated: report.html contains all of the plots (tables, box plots and scatter plots), and clt_table.html contains tables that summarize the CAPRI tables.

Example: this report.html was generated by running haddock3-analyse -r scenario2a-NMR-epitope-pass -m 9 on the results of the scenario2a-NMR-epitope-pass run introduced in the HADDOCK3-antibody-antigen tutorial.

This pull request:

  • fixes the colors of the box plots; scatter and box plots now use the same colors
  • fixes linter errors in existing code
  • generates clt_table.html in the analysis dir
  • generates report.html in the analysis dir (a sketch of the general idea follows this list)
  • adds a report.html file to the examples/analysis dir
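For context, the general idea behind the report page, as a minimal sketch only: combine the individual plot pages already written to the analysis directory into a single report.html. The file patterns, layout and helper name below are assumptions, not the actual cli_analyse implementation.

```python
# Hypothetical sketch: assemble report.html from the individual plot HTML files
# found in the analysis directory. Names and layout are assumptions, not the
# actual cli_analyse implementation.
from pathlib import Path


def build_report(analysis_dir: str, title: str = "Analysis report") -> Path:
    """Write a report.html that embeds every plot page found in analysis_dir."""
    analysis_path = Path(analysis_dir)
    sections = []
    for plot_file in sorted(analysis_path.glob("*.html")):
        if plot_file.name == "report.html":
            continue  # never embed a previously generated report into itself
        sections.append(
            f"<h2>{plot_file.stem}</h2>\n"
            f'<iframe src="{plot_file.name}" width="100%" height="600"></iframe>'
        )
    body = "\n".join(sections)
    report = f"<html><head><title>{title}</title></head><body>\n{body}\n</body></html>"
    out_path = analysis_path / "report.html"
    out_path.write_text(report)
    return out_path
```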

Todo list for the future:

  • enable the Download links for the best structures in clt_table.html
  • enable the Visibility (visualization) links for the best structures in clt_table.html

Note: the linter fails because matplotlib has not been added to the dependencies.

SarahAlidoost marked this pull request as ready for review on February 22, 2023 11:35
SarahAlidoost (Contributor Author) commented:

@mgiulini can you please review this pull request? I couldn't add a reviewer to the Reviewers list.

SarahAlidoost (Contributor Author) commented Feb 23, 2023

> Hi @SarahAlidoost, thanks for the contribution, the report looks very nice!
>
> I think, though, that this PR must be checked and tested a bit more thoroughly: I tried to run haddock3-analyse on one of the examples folders (examples/docking-antibody-antigen/run1-ranairCDR-cltsel-test), and it failed. In this case I used modules 1, 5 and 13.
>
> The reason why it failed on module 1 is clear (there's a typo in the column model-cluster-ranking in capri_ss.tsv) and will be corrected (see #626), while for the other two modules the situation is a bit more obscure (the error message reads: Length of values (80) does not match length of index (20)).
>
> Can you try to test the addition on some examples to see if there are other problems?

Thanks for checking this. It fails when finding the best structures from the capri_ss file while creating the tables. The number of best structures is 4 and is hard-coded in the module, see here. For -m 5, looking at examples/docking-antibody-antigen/run1-ranairCDR-cltsel-test/05_caprieval/capri_ss.tsv, it seems that the number of best structures should be 1. The same issue occurs for -m 13. Is the number of best structures defined by the user or by the code? In the former case, we can introduce an argument to the CLI and create a better error message (a sketch of such an option follows below).

For -m 15 and -m 2, there are two more problems: the columns cluster-ranking and model-cluster-ranking are filled with "-". These are "Unclustered" entries, right? Another problem is a typo in a column name: self.model-cluster-ranking should be model-cluster-ranking.
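A minimal sketch of what such a CLI option could look like, assuming an argparse-based command line; the long option names, the --n-best flag and its default are illustrative, not the actual cli_analyse arguments:

```python
# Hypothetical sketch: expose the number of "best structures" as a CLI option
# instead of hard-coding it. Flag names and defaults are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="haddock3-analyse")
parser.add_argument("-r", "--run-dir", required=True, help="run directory to analyse")
parser.add_argument("-m", "--modules", nargs="+", type=int, help="module numbers to analyse")
parser.add_argument(
    "--n-best",
    type=int,
    default=4,
    help="number of best structures to report per cluster (default: 4)",
)

args = parser.parse_args()
if args.n_best < 1:
    parser.error("--n-best must be a positive integer")
```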

SarahAlidoost (Contributor Author) commented Feb 23, 2023

@mgiulini in my last commit I fixed the issue with the number of best structures. Now the analyse command should work for -m 5 and -m 13. I also added a todo for -m 2 and -m 15, i.e. the Unclustered cases.

mgiulini self-requested a review on February 23, 2023 13:36
amjjbonvin (Member) commented:

Be careful with the number of best structures. This depends on the clustering settings and other parameters.

Also, the tests do not use the full sampling, which might explain some of the issues you are having in the analysis.

Comment on lines 747 to 757
def _add_links(best_struct_df):
    table_df = best_struct_df.copy()
    for col_name in table_df.columns[1:]:
        file_name = best_struct_df[col_name].apply(lambda row: Path(row).name)
        # href empty for local files
        # TODO fix links to Download and Visibility
        # we can use html href syntax but html should be responsive!
        dl_text = "Download"
        vis_text = "Visibility"
        table_df[col_name] = file_name + ", " + dl_text + ", " + vis_text
    return table_df
Contributor:

I guess that this is the function that should create all the links to the structures and to their visualization. The problem here is that in a standard run we might have hundreds (if not thousands) of clusters, and including all the links here would probably end up in a huge HTML report. What do you think?

Contributor Author:

Good point. I am wondering how this issue was handled in HADDOCK 2.4. Here, the table uses vertical scrolling. I committed the report.html generated for -m 5, see my commit. If the file size of the report is a concern, we need to think about exporting it to another format, perhaps a CSV file (see the sketch below).
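A minimal sketch of that CSV alternative, assuming the table is a pandas DataFrame like the one returned by find_best_struct; the helper name and output file name are illustrative:

```python
# Hypothetical sketch: write the best-structure table to CSV instead of
# embedding it in report.html. The output file name is an assumption.
from pathlib import Path

import pandas as pd


def export_best_struct_csv(best_struct_df: pd.DataFrame, analysis_dir: str) -> Path:
    """Write the best-structure table to <analysis_dir>/clt_table.csv."""
    out_path = Path(analysis_dir) / "clt_table.csv"
    best_struct_df.to_csv(out_path, index=False)
    return out_path
```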

    dfcl : pandas DataFrame
        DataFrame of capri table with new column names
    """
    dfcl = dfcl.sort_values(by=["cluster_id"])
Contributor:

Related to the previous comment: the run might give rise to hundreds or thousands of clusters. In that case I am not sure it makes sense to list all of them, and they should certainly not be ordered by cluster-id (the best cluster, i.e. the one with the lowest model-cluster-ranking, might have cluster_id = 2000).

Contributor Author:

see my comment above.

src/haddock/libs/libplots.py: outdated review thread (resolved)
Comment on lines 703 to 709
def find_best_struct(ss_file, number_of_struct=4):
    """
    Find best structures.

    It inspects model-cluster-ranking recorded in capri_ss.tsv file and finds
    the best models (models with lower ranks). By default, it selects the 4 best
    models.
Contributor:

I agree with @amjjbonvin here: the number of structures to choose depends on many parameters, and the data we show here can disagree with what is reported in previous steps of the workflow (for example the clustfcc step).

I see three potential solutions here:

  1. we report mean and standard deviation over the full set of structures belonging to a cluster
  2. we choose 4 as the default number (in this case we should state it clearly, e.g. "data refer to the first 4 structures")
  3. (not so immediate) we make the analysis read this number from the capri_clt.tsv file, where the value of clt_threshold is reported

"""
dfss = read_capri_table(ss_file)
dfss = dfss.sort_values(by=["cluster-id", "model-cluster-ranking"])
# TODO need a check for "Unclustered"
Contributor:

I wonder how to implement this check. I would maybe report the top number_of_struct unclustered structures in the table as well.
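A minimal sketch of one way that check could look, assuming unclustered models carry "-" in the cluster-id column and that capri_ss.tsv has a score column; the helper name and the handling itself are assumptions:

```python
# Hypothetical sketch: treat unclustered models ("-" in the cluster-id column)
# as their own group so the top `number_of_struct` of them still appear in the
# table. Column names are assumed to match capri_ss.tsv.
import pandas as pd


def split_off_unclustered(dfss: pd.DataFrame, number_of_struct: int = 4) -> pd.DataFrame:
    """Return clustered rows plus the best-scoring unclustered models, if any."""
    unclustered_mask = dfss["cluster-id"] == "-"
    clustered = dfss[~unclustered_mask]
    # keep the best unclustered models; a lower HADDOCK score is better
    unclustered = (
        dfss[unclustered_mask].sort_values(by="score").head(number_of_struct).copy()
    )
    unclustered["cluster-id"] = "Unclustered"
    return pd.concat([clustered, unclustered], ignore_index=True)
```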

Comment on lines 727 to 742
    # the smallest cluster size limits how many "best" structures can be shown
    max_number_of_struct = dfss.groupby("cluster-id").count()["model-cluster-ranking"].min()
    number_of_struct = min(number_of_struct, max_number_of_struct)
    # keep the top-ranked models of every cluster
    best_struct_df = dfss.groupby("cluster-id").head(number_of_struct).copy()
    number_of_cluster = len(best_struct_df["cluster-id"].unique())
    # label the selected rows "Nr 1 best structure", "Nr 2 best structure", ...
    col_names = [
        f"Nr {number + 1} best structure" for number in range(number_of_struct)
        ] * number_of_cluster
    best_struct_df = best_struct_df.assign(Structure=col_names)
    # pivot to one row per cluster and one column per best structure
    best_struct_df = best_struct_df.pivot_table(
        index=["cluster-id"],
        columns=["Structure"],
        values="model",
        aggfunc=lambda x: x,
        )
    best_struct_df.reset_index(inplace=True)
    best_struct_df.rename(columns={"cluster-id": "Cluster ID"}, inplace=True)
Contributor:

Could you add some comments here and there to help clarify the procedure?

Contributor Author:

Sure, I added comments. Let me know if something is still unclear.

amjjbonvin (Member) commented:

On the HADDOCK server we report the top 10 clusters by default. For the visualisation all clusters should be plotted, but maybe only the top 10 should be color-coded and labelled.

The order should be based on the scoring function and not on the cluster number, as Marco explained, i.e. on the score column of capri_clt.tsv (a sketch of that ordering follows).
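A minimal sketch of that ordering, assuming capri_clt.tsv is read into a pandas DataFrame with a score column; the helper name is illustrative:

```python
# Hypothetical sketch: order clusters by HADDOCK score (best first) rather than
# by cluster_id, and keep only the top 10 for colour-coding and labelling.
import pandas as pd


def top_clusters_by_score(dfcl: pd.DataFrame, n_top: int = 10) -> pd.DataFrame:
    """Return the n_top best clusters, ordered by ascending HADDOCK score."""
    # lower (more negative) HADDOCK scores are better, so sort ascending
    return dfcl.sort_values(by="score").head(n_top)
```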

codecov-commenter commented Mar 7, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.76% 🎉

Comparison is base (02ed904) 73.81% compared to head (96e6952) 74.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   73.81%   74.58%   +0.76%     
==========================================
  Files         110      111       +1     
  Lines        7347     7565     +218     
==========================================
+ Hits         5423     5642     +219     
+ Misses       1924     1923       -1     
Impacted Files Coverage Δ
src/haddock/clis/cli_analyse.py 73.75% <100.00%> (+0.33%) ⬆️
src/haddock/libs/libplots.py 96.17% <100.00%> (+4.17%) ⬆️
tests/test_libplots.py 100.00% <100.00%> (ø)

... and 4 files with indirect coverage changes


As html file from tests/golden_data is 831Kb
- zero pad best structure column names
- tried html in cell, but failed
- dont write clt_table.html as already part of report.html
- add title to report.html
sverhoeven (Contributor) commented Mar 10, 2023

The number of best structures has been set to 10.

Replaced examples/analysis/report.html with a script that generates report.html from tests/golden_data.
The report.html is 831 Kb. Should report.html be added to the repo or to GH pages?

report.zip

mgiulini (Contributor) left a comment:

Hi there, I tried to run the updated version of the PR. The reports look nice for the example runs, so I guess this PR is almost ready to be merged. Just a few comments from my side:

  • Currently the report contains 10 box plots. In the first row there are HADDOCK score, interface RMSD, ligand RMSD, interface-ligand RMSD, and VDW energy, while the remaining terms lie in the second row. From a docking perspective it makes more sense to put the energetic terms (HADDOCK-SCORE, VDW, ELECTROSTATIC, RESTRAINTS and DESOLVATION) in one row and the structural comparison observables (all the other terms) in another row. Could you do this?
  • I don't know if it makes sense to introduce the report.py script in the analysis, as the report is now properly generated whenever the postprocess option is set to true. I think it's better if we provide a comprehensive description of the analysis (for example in "create detailed documentation for haddock3-analyse #628"), rather than leaving a Python script there. I wouldn't add the report.html either, as it will probably change shape and content many times and we don't want to update it every time. Do you agree with me here?
  • Could you please add some docstrings to test_libplots.py so that it is possible to understand what we're testing?
  • Please, next time do not correct the linting with other linting tools; run the tox -e lint command instead, as this PR breaks the linting (I will correct it).

Besides this minor stuff, great addition!

PS: @amjjbonvin, if you need this to be merged now, we could fix these minor points in other PRs.

SarahAlidoost (Contributor Author):

> Currently the report contains 10 box plots. In the first row there are HADDOCK score, interface RMSD, ligand RMSD, interface-ligand RMSD, and VDW energy, while the remaining terms lie in the second row. From a docking perspective it makes more sense to put the energetic terms (HADDOCK-SCORE, VDW, ELECTROSTATIC, RESTRAINTS and DESOLVATION) in one row and the structural comparison observables (all the other terms) in another row. Could you do this?

Yes, I reordered the list of plots, see my commit. A sketch of the reordering is shown below.
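For reference, a minimal sketch of what that reordering could look like; the column names are assumptions and may not match the ones used in libplots.py:

```python
# Hypothetical sketch: first row of box plots shows energetic terms, second row
# shows structural comparison observables. Column names are assumptions.
ENERGY_TERMS = ["score", "vdw", "elec", "air", "desolv"]
STRUCTURE_TERMS = ["irmsd", "lrmsd", "ilrmsd", "fnat", "dockq"]

# energetics first, then structural metrics
PLOT_ORDER = ENERGY_TERMS + STRUCTURE_TERMS
```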

> I don't know if it makes sense to introduce the report.py script in the analysis, as the report is now properly generated whenever the postprocess option is set to true. I think it's better if we provide a comprehensive description of the analysis (for example in "create detailed documentation for haddock3-analyse #628"), rather than leaving a Python script there. I wouldn't add the report.html either, as it will probably change shape and content many times and we don't want to update it every time. Do you agree with me here?

Agreed, it is removed now.

> Could you please add some docstrings to test_libplots.py so that it is possible to understand what we're testing?

It is added.

> Please, next time do not correct the linting with other linting tools; run the tox -e lint command instead, as this PR breaks the linting (I will correct it).

If I understood correctly, tox -e lint only shows the errors without fixing them. Are there any tools to auto-fix linter errors? Also, running tox -e lint on src/haddock/libs/libplots.py returns two "line too long" errors. Can these errors be skipped?

joaomcteixeira (Member) commented:

Hi @SarahAlidoost

Indeed tox -e lint shows the lint errors according to the configuration in the tox.ini file. The current setup uses flake8 for linting. You can have a look at the tox.ini file for more details; it's a toml-like file. You can also visit https://flake8.pycqa.org/en/latest/ for all info on flake8.

HADDOCK3 does not have an auto-linter because I set up the CI methods in haddock3, and I don't like automatic code edits 😉. But you are now free to swap flake8 for whatever other linter you may prefer. I recently found Ruff (see discussion), which may be a good alternative. Ruff can be used inside tox as well.

Keep in mind linting is just an agreement between developers to homogenize the writing style so that developers know "how to look" at the code anywhere in the project. So, there's human freedom to change it. The current haddock3 settings reflect my writing preferences.

Cheers,

mgiulini (Contributor) left a comment:

Looks good now, time to merge it into the main branch!
As for the lint, yes, no worries, it will be fixed at some point.

mgiulini merged commit 8077309 into haddocking:main on Mar 23, 2023
SarahAlidoost deleted the create_report branch on March 29, 2023 08:31