5. Indices
During execution, LAGOON-MCL uses several indices to evaluate the results:
- Alignment quality indices: Two indices are used to assess the quality of alignments against the MMseqs2 alphafoldDB database.
- Homogeneity score: Measures the consistency of annotations within each cluster.
The homogeneity score is calculated as follows:
Where:
- $Hom_{score}$: the homogeneity score
- $N_{label}$: the number of different labels in the cluster
- $N_{seq}$: the number of sequences in the cluster
The homogeneity score ranges between 0 and 1.
If the score is less than 0, an algorithm is applied to select only the annotations that best explain the sequences in the cluster.

    Initialize list_label as an empty list
    Initialize set_nodes as an empty set
    Initialize dict_size as an empty dictionary
    Create a dictionary dict_labels with:
        {label: {set containing the identifiers linked to the label}}
    For each label in dict_labels:
        Add the size of dict_labels[label] to dict_size
        Add all elements of dict_labels[label] to set_nodes
    While set_nodes is not empty:
        Initialize max_label as the label with the largest size in dict_size
        Add max_label to list_label
        Remove elements of dict_labels[max_label] from set_nodes
        Update dict_labels by removing the elements associated with max_label
        Recalculate the sizes of the subsets in dict_size

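The selection procedure above is a greedy cover: at each step it keeps the label that explains the most sequences not yet explained. The Python sketch below is one possible reading of it, not the LAGOON-MCL implementation. The function name `select_labels` and the label and sequence identifiers are hypothetical, and the step "removing the elements associated with max_label" is assumed to mean that the covered identifiers are dropped from every remaining label.

```python
def select_labels(dict_labels):
    """Greedily pick labels until every identifier is covered.

    dict_labels maps each label to the set of identifiers carrying it.
    """
    # Work on copies so the caller's sets are left untouched.
    remaining = {label: set(ids) for label, ids in dict_labels.items()}
    # Every identifier that still needs to be explained by a label.
    set_nodes = set().union(*remaining.values())
    list_label = []

    while set_nodes:
        # Label covering the most still-unexplained identifiers
        # (ties go to the label that appears first in the dictionary).
        max_label = max(remaining, key=lambda label: len(remaining[label]))
        covered = remaining.pop(max_label)
        list_label.append(max_label)
        set_nodes -= covered
        # Assumption: covered identifiers are removed from the other labels,
        # which is equivalent to recalculating the subset sizes.
        for ids in remaining.values():
            ids -= covered

    return list_label


# Hypothetical cluster: four sequences, three candidate labels.
example = {
    "label_A": {"seq1", "seq2", "seq3"},
    "label_B": {"seq3", "seq4"},
    "label_C": {"seq2"},
}
selected = select_labels(example)
print(selected)
```

In this example `label_C` is never selected, because `label_A` already explains its only sequence.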
To assess the quality of coverage between two sequences, LAGOON-MCL uses two measurements:
- Query coverage: The proportion of the query sequence covered by the alignment.
- Subject coverage: The proportion of the subject sequence covered by the alignment.
From these two values, two indices are calculated:
- Mutual coverage index: Reflects the overall coverage of both sequences.
- Disparity index: Measures the balance of the alignment between the sequences.
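Query coverage is calculated as: $$ queryCoverage = \frac{queryStopPosition - queryStartPosition + 1}{querySequenceLength} $$
Where:
- queryCoverage: proportion of the query sequence covered by the alignment
- queryStartPosition: alignment start position on the query sequence
- queryStopPosition: alignment stop position on the query sequence
- querySequenceLength: length (in amino acids) of the query sequence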
Subject coverage is calculated as: $$ subjectCoverage = \frac{subjectStopPosition - subjectStartPosition + 1}{subjectSequenceLength} $$
Where:
- subjectCoverage: proportion of the subject sequence covered by the alignment
- subjectStartPosition: alignment start position on the subject sequence
- subjectStopPosition: alignment stop position on the subject sequence
- subjectSequenceLength: length (in amino acids) of the subject sequence
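The worked examples in this section compute each coverage as `(stop - start + 1) / length`. A minimal Python sketch of that calculation follows. The helper name `coverage` is hypothetical, and the input checks assume 1-based, inclusive positions with `start <= stop`, which matches the examples but is not stated explicitly for every input LAGOON-MCL may receive.

```python
def coverage(start, stop, length):
    """Fraction of a sequence spanned by the alignment.

    Positions are assumed to be 1-based and inclusive, so an alignment
    from position 1 to position `length` covers the whole sequence.
    """
    if length <= 0:
        raise ValueError("sequence length must be positive")
    if not 1 <= start <= stop <= length:
        raise ValueError("expected 1 <= start <= stop <= length")
    return (stop - start + 1) / length


# Values taken from Example 1 of the table in this section.
query_coverage = coverage(start=1, stop=570, length=804)
subject_coverage = coverage(start=1, stop=570, length=570)
print(round(query_coverage, 3), round(subject_coverage, 3))
```

The `+ 1` accounts for the inclusive end position: an alignment from residue 1 to residue 570 spans 570 residues, not 569.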
The mutual coverage index measures the overall coverage between two sequences.
This index ranges from 0 to 1:
- 0: poor coverage — the query and subject sequences are barely covered by the alignment
- 1: perfect coverage — both sequences are completely covered by the alignment
It is calculated as: $$ coverageIndex = \frac{queryCoverage + subjectCoverage}{2} $$
Where:
- coverageIndex: mutual coverage index
- queryCoverage: coverage proportion of the query sequence
- subjectCoverage: coverage proportion of the subject sequence
The disparity index measures the alignment balance between the query and subject sequences.
It indicates whether the alignment covers both sequences evenly.
The index ranges from 0 to 1:
- 0: the alignment covers the query and subject sequences equally
- 1: there is a large difference between the coverage of the query and subject sequences
It is calculated as: $$ disparityIndex = \left| queryCoverage - subjectCoverage \right| $$
Where:
- disparityIndex: disparity index
- queryCoverage: coverage proportion of the query sequence
- subjectCoverage: coverage proportion of the subject sequence
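Both indices are simple functions of the two coverage values: the mean and the absolute difference. The following Python sketch illustrates this; the function names are hypothetical and the input coverages are made-up values chosen for easy checking.

```python
def coverage_index(query_coverage, subject_coverage):
    """Mean of the two coverages: 0 is poor, 1 is perfect coverage."""
    return (query_coverage + subject_coverage) / 2


def disparity_index(query_coverage, subject_coverage):
    """Absolute coverage difference: 0 is balanced, 1 is fully imbalanced."""
    return abs(query_coverage - subject_coverage)


# Made-up alignment: half of the query and all of the subject are covered.
cov = coverage_index(0.5, 1.0)
disp = disparity_index(0.5, 1.0)
print(cov, disp)
```

The two indices are complementary: an alignment can score well on the mean while still being lopsided, which is why both are reported.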
| | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| querySequenceLength | 804 | 743 | 71 |
| queryStartPosition | 1 | 617 | 1 |
| queryStopPosition | 570 | 711 | 71 |
| subjectSequenceLength | 570 | 722 | 532 |
| subjectStartPosition | 1 | 431 | 460 |
| subjectStopPosition | 570 | 527 | 532 |
Example 1, using the positions from the table above:
$$ queryCoverage = \frac{570 - 1 + 1}{804} = \frac{570}{804} \approx 0.709 $$
$$ subjectCoverage = \frac{570 - 1 + 1}{570} = \frac{570}{570} = 1 $$
The mutual coverage index is:
$$ coverageIndex = \frac{0.709 + 1}{2} \approx 0.854 $$
The disparity index is:
$$ disparityIndex = |0.709 - 1| \approx 0.291 $$
- The coverage index of ~0.85 indicates that a large portion of both sequences is included in the alignment, suggesting good overall coverage.
- The disparity index of ~0.29 shows that the query sequence is less covered than the subject sequence, highlighting some imbalance in the alignment.
Example 2, using the positions from the table above:
$$ queryCoverage = \frac{711 - 617 + 1}{743} = \frac{95}{743} \approx 0.128 $$
$$ subjectCoverage = \frac{527 - 431 + 1}{722} = \frac{97}{722} \approx 0.134 $$
The mutual coverage index is:
$$ coverageIndex = \frac{0.128 + 0.134}{2} \approx 0.131 $$
The disparity index, computed from the unrounded coverages, is:
$$ disparityIndex = |0.128 - 0.134| \approx 0.0065 $$
- The disparity index is close to 0, indicating that the two sequences are aligned over roughly the same proportion of their lengths.
- The coverage index of ~0.13 shows that only a small portion of both sequences is included in the alignment, indicating poor overall alignment quality.
Example 3, using the positions from the table above:
$$ queryCoverage = \frac{71 - 1 + 1}{71} = \frac{71}{71} = 1 $$
$$ subjectCoverage = \frac{532 - 460 + 1}{532} = \frac{73}{532} \approx 0.137 $$
The mutual coverage index is:
$$ coverageIndex = \frac{1 + 0.137}{2} \approx 0.569 $$
The disparity index is:
$$ disparityIndex = |1 - 0.137| \approx 0.863 $$
- The coverage index of ~0.57 indicates that, on average, just over half of the two sequences are covered by the alignment.
- The disparity index is very high (~0.86), showing that the alignment is extremely imbalanced: the query sequence is fully covered, but the subject sequence is barely aligned.
- This alignment would generally be considered low quality due to the imbalance.
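The three examples can be recomputed directly from the table. The sketch below is only a convenience for checking the arithmetic; the `Alignment` class and its field names are hypothetical and are not part of LAGOON-MCL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One table column: lengths and 1-based inclusive alignment positions."""
    query_length: int
    query_start: int
    query_stop: int
    subject_length: int
    subject_start: int
    subject_stop: int

    def indices(self):
        """Return (coverageIndex, disparityIndex) for this alignment."""
        q = (self.query_stop - self.query_start + 1) / self.query_length
        s = (self.subject_stop - self.subject_start + 1) / self.subject_length
        return (q + s) / 2, abs(q - s)


# Values copied from the table in this section.
EXAMPLES = {
    "Example 1": Alignment(804, 1, 570, 570, 1, 570),
    "Example 2": Alignment(743, 617, 711, 722, 431, 527),
    "Example 3": Alignment(71, 1, 71, 532, 460, 532),
}

results = {name: aln.indices() for name, aln in EXAMPLES.items()}
for name, (cov, disp) in results.items():
    print(f"{name}: coverageIndex={cov:.3f} disparityIndex={disp:.4f}")
```

Working from the unrounded coverages avoids small discrepancies such as the one in Example 2, where subtracting the rounded values 0.128 and 0.134 would give 0.006 rather than 0.0065.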