# Predator Revision Document

In [1]:
from datetime import datetime

print("\033[32m{}\033[0m".format(datetime.now().strftime("%B %d, %Y %H:%M:%S")))

May 23, 2022 11:34:00


## Reviewers' comments

1. **Is the manuscript technically sound, and do the data support the conclusions?**

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

&nbsp;

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

&nbsp;

2. **Has the statistical analysis been performed appropriately and rigorously?**

&nbsp;

Reviewer #1: I Don't Know

Reviewer #2: Yes

Reviewer #3: Yes

&nbsp;

3. **Have the authors made all data underlying the findings in their manuscript fully available?**

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

&nbsp;

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

&nbsp;

4. **Is the manuscript presented in an intelligible fashion and written in standard English?**

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

&nbsp;

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

&nbsp;

5. **Review Comments to the Author**

### Reviewer #1

Reviewer #1: The proposed research presents a new machine learning approach that improves the prediction of the impact of cancer somatic mutations on specific protein-protein interactions over the already existing models. It overcomes the limitations of current state-of-the-art methods by extracting the most informative features from those tools to train a new and more accurate ensemble model based on random forests to classify mutations as disruptive or nondisruptive for specific interactions. After the exhaustive model performance evaluation, an extensive analysis on TCGA data is provided at multiple levels, leading to some interesting biological findings on mutual exclusivity of mutational patterns. The main limitation of the work is clearly stated: the low availability of structure data needed to train the model is a limit for prediction accuracy. It's clear and readable, but some major aspects need to be considered:

1) It's stated that the IntAct "disruptive" label is a subcategory of the "decreasing" label. Why are redundant triplets with both labels assumed to be "disruptive" and not "decreasing"? The sentence needs a better explanation of the reason.
2) Neither a pre-trained model nor source code with random seeds to generate it from scratch is provided, limiting the possibility to repeat the analysis and to use the proposed approach.

Minor issues regard:

1) Methods sometimes anticipate results without referring to figures (e.g. "We choose N as 10 as it gives the ...") and results sometimes describe procedures not mentioned in methods (e.g. "To estimate the significance of mutual exclusivity...").
2) Methods section doesn't mention the scripting languages/environments used and their versions.
3) Title for "Training random forests" section is missing.
4) COAD is missing in the text when listing cancer types (page 7, Results, Prediction in TCGA data).
5) Check punctuation.

### Reviewer #2

Reviewer #2: In their manuscript “Predator: Predicting the impact of cancer somatic mutations on protein-protein interactions”, Berber et al describe a new machine learning model to predict the effect of missense mutations on protein-protein interactions and apply this model to predict the mutations occurring in cancer that could affect protein-protein interactions.
The problem is interesting and very relevant to the field. However, several prediction models have already been described, for instance ELASPIC, which is used in the current work, and the way the paper is written makes it difficult to fully evaluate what Predator adds to the latest version of ELASPIC. I think the authors should make a stronger effort to explain the added value of Predator and present their results in a more transparent way.

Major comments:

Prediction Performance on the IntAct Dataset

1. The number of mutations predicted to influence an interaction in the TCGA datasets that were already present in the training set should be mentioned. The authors should give the list of the 164 proteins included in the training dataset.



<span style="color:blue">We compared the IntAct mutation interactions (740 in total, 439 unique after conversion to the primary isoform) with the TCGA interactions for which we made predictions (21250 unique interactions, which is not the full set of TCGA interactions). Only 4 of them overlapped:
</span>

|    | UniProt_ID   | Mutation   | Interactor_UniProt_ID   |
|---:|:-------------|:-----------|:------------------------|
|  0 | P42773       | T69A       | Q00534                  |
|  1 | Q8IYM1       | D197N      | Q8WYJ6                  |
|  2 | Q8IYM1       | D197N      | Q14141                  |
|  3 | P21860       | G284R      | P04626                  |

code: Mutations_IntAct_TCGA.ipynb
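The overlap check described above can be sketched as a set intersection over triplets. This is only an illustration and not the contents of the referenced notebook: it assumes the comparison is made on the full (UniProt_ID, Mutation, Interactor_UniProt_ID) triplet, the function name `overlapping_triplets` is hypothetical, and the rows marked as placeholders are invented for the example.

```python
# Triplets are (UniProt_ID, Mutation, Interactor_UniProt_ID).
# Two rows are copied from the table above; rows marked "placeholder"
# are invented for illustration and are not real records.
intact_triplets = [
    ("P42773", "T69A", "Q00534"),
    ("P21860", "G284R", "P04626"),
    ("P21860", "G284R", "P04626"),  # duplicate row, collapsed by set()
    ("P00000", "A1V", "P11111"),    # placeholder
]
tcga_triplets = [
    ("P42773", "T69A", "Q00534"),
    ("P21860", "G284R", "P04626"),
    ("P22222", "K2R", "P33333"),    # placeholder
]


def overlapping_triplets(training, predicted):
    """Return the sorted unique triplets present in both collections."""
    return sorted(set(training) & set(predicted))


overlap = overlapping_triplets(intact_triplets, tcga_triplets)
for uniprot_id, mutation, interactor in overlap:
    print(uniprot_id, mutation, interactor)
```

Deduplicating with `set()` before intersecting mirrors the "unique interactions" counts quoted in the response, so repeated rows do not inflate the overlap.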




2. The authors define the “TP as the number of correctly classified nondisruptive interactions and TN as the number of correctly identified disruptive interactions”
This is a bit counter-intuitive as one would expect the positive mutations to be the disruptive ones.
3. The authors should discuss better the features they used (i.e. EL2_score, Matrix score). Are these scores the output of the ELASPIC2 core (EL2core) and ELASPIC2 interface (EL2interface) machine learning models? If this is the case, then it would be interesting to test Predator using only this score and remove the input features that were used by ELASPIC2 to get this output score.
Also, it is not clear why the 10 “best” features include some properties of the wt residue and not of the mutant one (and vice versa), for instance, the “solvation_polar_wt” and not the “solvation_polar_mut” or the “van_der_waals_mut” and not the “van_der_waals_wt”. One would expect a change in the property of a residue to be more informative than the absolute property of each residue (wt or mutant) independently.
4. It is not clear why the authors change the way they train the model for the final validation. In the Materials and Methods, they first state that the model is trained on 80% of the dataset and validated on the remaining 20%, but they finally use what they call the leave-one-protein-out evaluation, which implies changing the training dataset. The authors should elaborate on this.
5. It could be interesting to check whether the FN are mutations more likely to increase an interaction or to have no effect.
6. The authors should make their model available for the reviewers to test.
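Comment 2 concerns which class is treated as positive. The sketch below, on made-up labels, shows how that choice changes precision and recall while leaving accuracy unchanged. The label strings, the toy predictions, and the helper names `confusion_counts` and `summarize` are assumptions for illustration only. They are not taken from the manuscript.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with `positive` treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, positive):
    """Return accuracy, precision and recall for the chosen positive class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels, invented for illustration (3 disruptive, 5 nondisruptive).
y_true = ["disruptive"] * 3 + ["nondisruptive"] * 5
y_pred = ["disruptive", "disruptive", "nondisruptive",
          "nondisruptive", "nondisruptive", "nondisruptive",
          "nondisruptive", "disruptive"]

as_disruptive = summarize(y_true, y_pred, positive="disruptive")
as_nondisruptive = summarize(y_true, y_pred, positive="nondisruptive")
print(as_disruptive)
print(as_nondisruptive)
```

Because recall and precision depend on the positive class, stating the convention next to any reported table avoids the ambiguity the reviewer points out.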

Prediction of Disrupted Interactions in the TCGA Data

7. It is not clear how the mutations observed in the TCGA dataset are pre-processed to give the triplets and how many different proteins are represented in those triplets.
8. Also, it would be interesting to know how many of these triplets were already present in the training dataset.
9. The way the results are presented could be confusing because the real output of the predictor is not presented. Indeed, we expect the output to be about mutations and their effect on interactions, but Supplementary Fig. 5 presents the results in terms of interactions, and Table 2 lists the affected genes and gives the score by patient. It would be interesting to have a supplementary data file with the output of Predator giving the prediction (Disruptive or not disruptive) for each triplet.
10. The fact that the proteins obtained as a result of the prediction have been described in the literature as involved in cancer is not really a validation. To validate their model, the authors should present experimental data confirming that the mutations predicted as “disruptive” really disrupt the mentioned interaction. In that sense, the only real validation is provided for the mutation H1047R in the kinase domain of PIK3CA, which has been shown to disrupt the interaction between PIK3CA and PIK3R1. Other interactions whose predicted disruption could play a role in cancer have been described in the literature, but the mutation predicted to disrupt that interaction is not mentioned and therefore it is not possible to validate its effect. Again, a supplementary data file with the output of Predator and a comment on whether the triplet has been experimentally validated would be useful.
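One possible layout for the per-triplet supplementary file requested in comments 9 and 10 is sketched below. The column names, the `Experimentally_validated` field, and every row value are placeholders assumed for illustration. None of them are actual Predator output.

```python
import csv
import io

# Hypothetical column layout: one row per (protein, mutation, interactor) triplet.
FIELDS = [
    "UniProt_ID",
    "Mutation",
    "Interactor_UniProt_ID",
    "Prediction",
    "Experimentally_validated",
]


def write_predictions(rows, handle):
    """Write one CSV row per triplet to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder rows, invented for illustration only.
rows = [
    {"UniProt_ID": "PROTEIN_A", "Mutation": "X100Y",
     "Interactor_UniProt_ID": "PROTEIN_B", "Prediction": "Disruptive",
     "Experimentally_validated": "unknown"},
    {"UniProt_ID": "PROTEIN_A", "Mutation": "X100Y",
     "Interactor_UniProt_ID": "PROTEIN_C", "Prediction": "Nondisruptive",
     "Experimentally_validated": "unknown"},
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_predictions(rows, buffer)
print(buffer.getvalue())
```

Keeping one row per triplet means the same mutation can carry different predictions for different interactors, which is the granularity the reviewer asks to see.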

Minor comments

1. Correct legend supp. Figure 1 (Predator_SHAP_Top_10)
2. Typo in the legend of fig 1 : “The first step is to pre-process the “traning” data”. It should be “training”.
3. Discrepancy between supplementary table 1 showing 103 features (89 if the items classified as “id or labels” are removed) and the text talking about 86 features.
4. In table 1, the recall value that should be in bold is the one obtained with SAAMBE_3D, not the one from Predator.
5. Does Supplementary Table 2 report the total number of mutations or the total number of missense mutations?
6. p10, lines 368-369: Revise sentence “The most frequently disrupted interaction of ERBB2 is with its interaction with SRPK1 (6 patients).”

### Reviewer #3

Reviewer #3: Berber et al present a new method for inferring the mutations that impact protein interactions based on a reparameterization of the ELASPIC pipeline. The original ELASPIC2 pipeline pulls data from a variety of sources and is trained against the Spearman correlation coefficient. The new method is trained specifically on the IntAct cancer dataset of disruptive interface mutations and is trained as a binary classifier. PREDATOR, in my understanding, therefore uses a filtered version of the ELASPIC2 feature set, a different target (classification vs. Spearman correlation), and a different training set (IntAct vs. multiple sources). This results in some subtle differences between the two methods. The results are similar to ELASPIC2 overall but improved in some aspects; in particular, PREDATOR has higher sensitivity for disruptive mutations while retaining the same accuracy as the original method (Table 1). Separation of training and testing data is fairly rigorous (both leave-one-protein-out and per-mutation cross-validation).
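The leave-one-protein-out separation mentioned above can be sketched as a grouped split in which every triplet of one protein is held out together. This is a minimal illustration under the assumption that grouping is done on the mutated protein (the first element of each triplet). The manuscript's exact protocol is not reproduced here, and the function name and sample identifiers are hypothetical.

```python
from collections import defaultdict


def leave_one_protein_out(samples):
    """Yield (held_out_protein, train_indices, test_indices) per protein.

    `samples` is a list of (protein, mutation, interactor) triplets.
    Assumption: the grouping key is the first element of each triplet.
    """
    by_protein = defaultdict(list)
    for index, (protein, _mutation, _interactor) in enumerate(samples):
        by_protein[protein].append(index)
    for protein in sorted(by_protein):
        test_idx = by_protein[protein]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        yield protein, train_idx, test_idx


# Placeholder triplets, invented for illustration only.
samples = [
    ("P_A", "M1", "P_X"),
    ("P_A", "M2", "P_Y"),
    ("P_B", "M3", "P_X"),
    ("P_C", "M4", "P_Z"),
]

splits = list(leave_one_protein_out(samples))
for protein, train_idx, test_idx in splits:
    print(protein, "train:", train_idx, "test:", test_idx)
```

Holding out all triplets of a protein at once prevents mutations of the same protein from appearing on both sides of a split, which a plain per-row split does not guarantee.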

Comments: PREDATOR is formed by a reparameterization of ELASPIC energy terms to the IntAct datasets. Since PREDATOR heavily relies on ELASPIC, it would be useful to know when and ideally how the results differ. Are there any consistent patterns when the predictions are at variance? For instance, disulfide bonding across the interface is a relatively rare interaction that may have a big impact when it occurs. This is included in ELASPIC via FOLDX but excluded a priori in PREDATOR. From Figs. S2 and S4, it appears alignment-based features are weighed more heavily in PREDATOR. Part of this is likely the classification vs. ranking difference: a ranking method may be more reliant on physics-based energy terms designed to pick out small differences between largely conserved mutations at the expense of making occasional large mistakes that lead to errors in classification.

Along this line, it would be interesting to see, for example, if this is a limitation of structural modeling. Are proteins where a PDB of the complex does not exist predicted better by PREDATOR? Is there a trend in the error relative to PREDATOR with % sequence identity of the homology model when a PDB of the complex does not exist? Does repeating the ELASPIC result but uploading an Alphafold model of the complex (presumed to be more accurate) improve the result when the cases are at variance? Are the interfaces where the two methods are at variance unusually flexible? Are clashes unusually high in these models? It isn’t necessary to answer all these questions, but it would be helpful to have some understanding of why PREDATOR is outperforming ELASPIC.