
haddock3-analyse: create report html page based on individual plots #611

Merged (39 commits) on Mar 23, 2023

Conversation

SarahAlidoost (Contributor) commented Feb 8, 2023

You are about to submit a new Pull Request. Before continuing, make sure you have read the contributing guidelines and that you comply with the following criteria:

  • You have stuck to Python. Talk with us before adding other programming languages to HADDOCK3
  • Your PR is about CNS
  • Your code is well documented: proper docstrings and explanatory comments for those tricky parts
  • You structured the code into small functions as much as possible. You can use classes if there's a (state) purpose
  • Your code follows our coding style
  • You wrote tests for the new code
  • tox tests pass. Run tox command inside the repository folder
  • -test.cfg examples execute without errors. Inside examples/ run python run_tests.py -b
  • PR does not add any install dependencies unless permission granted by the HADDOCK team
  • PR does not break licensing
  • Your PR is about writing documentation for already existing code 🔥
  • Your PR is about writing tests for already existing code :godmode:

This PR extends the functionality of the haddock3-analyse command-line tool. After running the analysis, two HTML files are generated: report.html contains all of the plots (tables, box plots and scatter plots), and clt_table.html contains tables that summarize the CAPRI tables.

Example: this report.html was generated by running haddock3-analyse -r scenario2a-NMR-epitope-pass -m 9 on the results of the scenario2a-NMR-epitope-pass run introduced in the HADDOCK3-antibody-antigen tutorial.

This pull request:

  • fixes the colors of the box plots; scatter and box plots now use the same colors
  • fixes linter errors in existing code
  • generates clt_table.html in the analysis dir
  • generates report.html in the analysis dir (a sketch of the general idea follows this list)
  • adds a report.html file to the examples/analysis dir
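For context, the general idea behind the report page, as a minimal sketch only: combine the individual plot pages already written to the analysis directory into a single report.html. The file patterns, layout and helper name below are assumptions, not the actual cli_analyse implementation.

```python
# Hypothetical sketch: assemble report.html from the individual plot HTML files
# found in the analysis directory. Names and layout are assumptions, not the
# actual cli_analyse implementation.
from pathlib import Path


def build_report(analysis_dir: str, title: str = "Analysis report") -> Path:
    """Write a report.html that embeds every plot page found in analysis_dir."""
    analysis_path = Path(analysis_dir)
    sections = []
    for plot_file in sorted(analysis_path.glob("*.html")):
        if plot_file.name == "report.html":
            continue  # never embed a previously generated report into itself
        sections.append(
            f"<h2>{plot_file.stem}</h2>\n"
            f'<iframe src="{plot_file.name}" width="100%" height="600"></iframe>'
        )
    body = "\n".join(sections)
    report = f"<html><head><title>{title}</title></head><body>\n{body}\n</body></html>"
    out_path = analysis_path / "report.html"
    out_path.write_text(report)
    return out_path
```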

Todo list for the future:

  • enable the Download links for the best structures in clt_table.html
  • enable the Visibility (visualization) links for the best structures in clt_table.html

Note: the linter fails because matplotlib has not been added to the dependencies.

SarahAlidoost marked this pull request as ready for review on February 22, 2023 11:35
SarahAlidoost (Contributor Author) commented:

@mgiulini can you please review this pull request? I couldn't add a reviewer to the Reviewers list.

SarahAlidoost (Contributor Author) commented Feb 23, 2023

> Hi @SarahAlidoost, thanks for the contribution, the report looks very nice!
>
> I think, though, that this PR must be checked and tested a bit more thoroughly: I tried to run haddock3-analyse on one of the examples folders (examples/docking-antibody-antigen/run1-ranairCDR-cltsel-test), and it failed. In this case I used modules 1, 5 and 13.
>
> The reason why it failed on module 1 is clear (there's a typo in the column model-cluster-ranking in capri_ss.tsv) and will be corrected (see #626), while for the other two modules the situation is a bit more obscure (the error message reads: Length of values (80) does not match length of index (20)).
>
> Can you try to test the addition on some examples to see if there are other problems?

Thanks for checking this. It fails when finding the best structures from the capri_ss file while creating the tables. The number of best structures is 4 and is hard-coded in the module, see here. For -m 5, looking at examples/docking-antibody-antigen/run1-ranairCDR-cltsel-test/05_caprieval/capri_ss.tsv, it seems that the number of best structures should be 1. The same issue occurs for -m 13. Is the number of best structures defined by the user or by the code? In the former case, we can introduce an argument to the CLI and create a better error message (a sketch of such an option follows below).

For -m 15 and -m 2, there are two more problems: the columns cluster-ranking and model-cluster-ranking are filled with "-". These are "Unclustered" entries, right? Another problem is a typo in a column name: self.model-cluster-ranking should be model-cluster-ranking.
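A minimal sketch of what such a CLI option could look like, assuming an argparse-based command line; the long option names, the --n-best flag and its default are illustrative, not the actual cli_analyse arguments:

```python
# Hypothetical sketch: expose the number of "best structures" as a CLI option
# instead of hard-coding it. Flag names and defaults are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="haddock3-analyse")
parser.add_argument("-r", "--run-dir", required=True, help="run directory to analyse")
parser.add_argument("-m", "--modules", nargs="+", type=int, help="module numbers to analyse")
parser.add_argument(
    "--n-best",
    type=int,
    default=4,
    help="number of best structures to report per cluster (default: 4)",
)

args = parser.parse_args()
if args.n_best < 1:
    parser.error("--n-best must be a positive integer")
```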

SarahAlidoost (Contributor Author) commented Feb 23, 2023

@mgiulini in my last commit I fixed the issue with the number of best structures. Now the analyse command should work for -m 5 and -m 13. I also added a todo for -m 2 and -m 15, i.e. the Unclustered cases.

mgiulini self-requested a review on February 23, 2023 13:36
amjjbonvin (Member) commented:

Be careful with the number of best structures. This depends on the clustering settings and other parameters.

Also, the tests do not use the full sampling, which might explain some of the issues you are having in the analysis.

Comment on lines 747 to 757
def _add_links(best_struct_df):
    table_df = best_struct_df.copy()
    for col_name in table_df.columns[1:]:
        file_name = best_struct_df[col_name].apply(lambda row: Path(row).name)
        # href empty for local files
        # TODO fix links to Download and Visibility
        # we can use html href syntax but html should be responsive!
        dl_text = "Download"
        vis_text = "Visibility"
        table_df[col_name] = file_name + ", " + dl_text + ", " + vis_text
    return table_df
Contributor:

I guess that this is the function that should create all the links to the structures and to their visualization. The problem here is that in a standard run we might have hundreds (if not thousands) of clusters, and including all the links here would probably end up in a huge HTML report. What do you think?

Contributor Author:

Good point. I am wondering how this issue was handled in HADDOCK 2.4. Here, the table uses vertical scrolling. I committed the report.html generated for -m 5, see my commit. If the file size of the report is a concern, we need to think about exporting it to another format, perhaps a CSV file (see the sketch below).
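A minimal sketch of that CSV alternative, assuming the table is a pandas DataFrame like the one returned by find_best_struct; the helper name and output file name are illustrative:

```python
# Hypothetical sketch: write the best-structure table to CSV instead of
# embedding it in report.html. The output file name is an assumption.
from pathlib import Path

import pandas as pd


def export_best_struct_csv(best_struct_df: pd.DataFrame, analysis_dir: str) -> Path:
    """Write the best-structure table to <analysis_dir>/clt_table.csv."""
    out_path = Path(analysis_dir) / "clt_table.csv"
    best_struct_df.to_csv(out_path, index=False)
    return out_path
```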

    dfcl : pandas DataFrame
        DataFrame of capri table with new column names
    """
    dfcl = dfcl.sort_values(by=["cluster_id"])
Contributor:

Related to the previous comment: the run might give rise to hundreds or thousands of clusters. In that case I am not sure it makes sense to list all of them, and they should certainly not be ordered by cluster-id (the best cluster, i.e. the one with the lowest model-cluster-ranking, might have cluster_id = 2000).

Contributor Author:

see my comment above.

src/haddock/libs/libplots.py: outdated review thread (resolved)
Comment on lines 703 to 709
def find_best_struct(ss_file, number_of_struct=4):
    """
    Find best structures.

    It inspects model-cluster-ranking recorded in capri_ss.tsv file and finds
    the best models (models with lower ranks). By default, it selects the 4 best
    models.
Contributor:

I agree with @amjjbonvin here: the number of structures to choose depends on many parameters, and the data we show here can disagree with what is reported in previous steps of the workflow (for example the clustfcc step).

I see three potential solutions here:

  1. we report mean and standard deviation over the full set of structures belonging to a cluster
  2. we choose 4 as the default number (in this case we should state it clearly, e.g. "data refer to the first 4 structures")
  3. (not so immediate) we make the analysis read this number from the capri_clt.tsv file, where the value of clt_threshold is reported

"""
dfss = read_capri_table(ss_file)
dfss = dfss.sort_values(by=["cluster-id", "model-cluster-ranking"])
# TODO need a check for "Unclustered"
Contributor:

I wonder how to implement this check. I would maybe report the top number_of_struct unclustered structures in the table as well.
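A minimal sketch of one way that check could look, assuming unclustered models carry "-" in the cluster-id column and that capri_ss.tsv has a score column; the helper name and the handling itself are assumptions:

```python
# Hypothetical sketch: treat unclustered models ("-" in the cluster-id column)
# as their own group so the top `number_of_struct` of them still appear in the
# table. Column names are assumed to match capri_ss.tsv.
import pandas as pd


def split_off_unclustered(dfss: pd.DataFrame, number_of_struct: int = 4) -> pd.DataFrame:
    """Return clustered rows plus the best-scoring unclustered models, if any."""
    unclustered_mask = dfss["cluster-id"] == "-"
    clustered = dfss[~unclustered_mask]
    # keep the best unclustered models; a lower HADDOCK score is better
    unclustered = (
        dfss[unclustered_mask].sort_values(by="score").head(number_of_struct).copy()
    )
    unclustered["cluster-id"] = "Unclustered"
    return pd.concat([clustered, unclustered], ignore_index=True)
```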

Comment on lines 727 to 742
    # the smallest cluster size limits how many "best" structures can be shown
    max_number_of_struct = dfss.groupby("cluster-id").count()["model-cluster-ranking"].min()
    number_of_struct = min(number_of_struct, max_number_of_struct)
    # keep the top-ranked models of every cluster
    best_struct_df = dfss.groupby("cluster-id").head(number_of_struct).copy()
    number_of_cluster = len(best_struct_df["cluster-id"].unique())
    # label the selected rows "Nr 1 best structure", "Nr 2 best structure", ...
    col_names = [
        f"Nr {number + 1} best structure" for number in range(number_of_struct)
        ] * number_of_cluster
    best_struct_df = best_struct_df.assign(Structure=col_names)
    # pivot to one row per cluster and one column per best structure
    best_struct_df = best_struct_df.pivot_table(
        index=["cluster-id"],
        columns=["Structure"],
        values="model",
        aggfunc=lambda x: x,
        )
    best_struct_df.reset_index(inplace=True)
    best_struct_df.rename(columns={"cluster-id": "Cluster ID"}, inplace=True)
Contributor:

Could you add some comments here and there to help clarify the procedure?

Contributor Author:

Sure, I added comments. Let me know if something is still unclear.

amjjbonvin (Member) commented:

On the HADDOCK server we report the top 10 clusters by default. For the visualisation all clusters should be plotted, but maybe only the top 10 should be color-coded and labelled.

The order should be based on the scoring function and not on the cluster number, as Marco explained, i.e. on the score column of capri_clt.tsv (a sketch of that ordering follows).
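A minimal sketch of that ordering, assuming capri_clt.tsv is read into a pandas DataFrame with a score column; the helper name is illustrative:

```python
# Hypothetical sketch: order clusters by HADDOCK score (best first) rather than
# by cluster_id, and keep only the top 10 for colour-coding and labelling.
import pandas as pd


def top_clusters_by_score(dfcl: pd.DataFrame, n_top: int = 10) -> pd.DataFrame:
    """Return the n_top best clusters, ordered by ascending HADDOCK score."""
    # lower (more negative) HADDOCK scores are better, so sort ascending
    return dfcl.sort_values(by="score").head(n_top)
```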

codecov-commenter commented Mar 7, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.76% 🎉

Comparison is base (02ed904) 73.81% compared to head (96e6952) 74.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   73.81%   74.58%   +0.76%     
==========================================
  Files         110      111       +1     
  Lines        7347     7565     +218     
==========================================
+ Hits         5423     5642     +219     
+ Misses       1924     1923       -1     
Impacted Files Coverage Δ
src/haddock/clis/cli_analyse.py 73.75% <100.00%> (+0.33%) ⬆️
src/haddock/libs/libplots.py 96.17% <100.00%> (+4.17%) ⬆️
tests/test_libplots.py 100.00% <100.00%> (ø)

... and 4 files with indirect coverage changes


As html file from tests/golden_data is 831Kb
- zero pad best structure column names
- tried html in cell, but failed
- dont write clt_table.html as already part of report.html
- add title to report.html
sverhoeven (Contributor) commented Mar 10, 2023

The number of best structures has been set to 10.

Replaced examples/analysis/report.html with a script that generates report.html from tests/golden_data.
The report.html is 831 Kb. Should report.html be added to the repo or to GH pages?

report.zip

mgiulini (Contributor) left a comment:

Hi there, I tried to run the updated version of the PR. The reports look nice for the example runs, so I guess this PR is almost ready to be merged. Just a few comments from my side:

  • Currently the report contains 10 box plots. In the first row there are HADDOCK score, interface RMSD, ligand RMSD, interface-ligand RMSD, and VDW energy, while the remaining terms lie in the second row. From a docking perspective it makes more sense to put the energetic terms (HADDOCK-SCORE, VDW, ELECTROSTATIC, RESTRAINTS and DESOLVATION) in one row and the structural comparison observables (all the other terms) in another row. Could you do this?
  • I don't know if it makes sense to introduce the report.py script in the analysis, as the report is now properly generated whenever the postprocess option is set to true. I think it's better if we provide a comprehensive description of the analysis (for example in "create detailed documentation for haddock3-analyse #628"), rather than leaving a Python script there. I wouldn't add the report.html either, as it will probably change shape and content many times and we don't want to update it every time. Do you agree with me here?
  • Could you please add some docstrings to test_libplots.py so that it is possible to understand what we're testing?
  • Please, next time do not correct the linting with other linting tools; run the tox -e lint command instead, as this PR breaks the linting (I will correct it).

Besides this minor stuff, great addition!

PS: @amjjbonvin, if you need this to be merged now, we could fix these minor points in other PRs.

SarahAlidoost (Contributor Author):

> Currently the report contains 10 box plots. In the first row there are HADDOCK score, interface RMSD, ligand RMSD, interface-ligand RMSD, and VDW energy, while the remaining terms lie in the second row. From a docking perspective it makes more sense to put the energetic terms (HADDOCK-SCORE, VDW, ELECTROSTATIC, RESTRAINTS and DESOLVATION) in one row and the structural comparison observables (all the other terms) in another row. Could you do this?

Yes, I reordered the list of plots, see my commit. A sketch of the reordering is shown below.
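For reference, a minimal sketch of what that reordering could look like; the column names are assumptions and may not match the ones used in libplots.py:

```python
# Hypothetical sketch: first row of box plots shows energetic terms, second row
# shows structural comparison observables. Column names are assumptions.
ENERGY_TERMS = ["score", "vdw", "elec", "air", "desolv"]
STRUCTURE_TERMS = ["irmsd", "lrmsd", "ilrmsd", "fnat", "dockq"]

# energetics first, then structural metrics
PLOT_ORDER = ENERGY_TERMS + STRUCTURE_TERMS
```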

> I don't know if it makes sense to introduce the report.py script in the analysis, as the report is now properly generated whenever the postprocess option is set to true. I think it's better if we provide a comprehensive description of the analysis (for example in "create detailed documentation for haddock3-analyse #628"), rather than leaving a Python script there. I wouldn't add the report.html either, as it will probably change shape and content many times and we don't want to update it every time. Do you agree with me here?

Agreed, it is removed now.

> Could you please add some docstrings to test_libplots.py so that it is possible to understand what we're testing?

It is added.

> Please, next time do not correct the linting with other linting tools; run the tox -e lint command instead, as this PR breaks the linting (I will correct it).

If I understood correctly, tox -e lint only shows the errors without fixing them. Are there any tools to auto-fix linter errors? Also, running tox -e lint on src/haddock/libs/libplots.py returns two "line too long" errors. Can these errors be skipped?

joaomcteixeira (Member) commented:

Hi @SarahAlidoost

Indeed tox -e lint shows the lint errors according to the configuration in the tox.ini file. The current setup uses flake8 for linting. You can have a look at the tox.ini file for more details; it's a toml-like file. You can also visit https://flake8.pycqa.org/en/latest/ for all info on flake8.

HADDOCK3 does not have an auto-linter because I set up the CI methods in haddock3, and I don't like automatic code edits 😉. But you are now free to swap flake8 for whatever other linter you may prefer. I recently found Ruff (see discussion), which may be a good alternative. Ruff can be used inside tox as well.

Keep in mind linting is just an agreement between developers to homogenize the writing style so that developers know "how to look" at the code anywhere in the project. So, there's human freedom to change it. The current haddock3 settings reflect my writing preferences.

Cheers,

mgiulini (Contributor) left a comment:

Looks good now, time to merge it into the main branch!
As for the lint, yes, no worries, it will be fixed at some point.

mgiulini merged commit 8077309 into haddocking:main on Mar 23, 2023
SarahAlidoost deleted the create_report branch on March 29, 2023 08:31