## Table of contents
* [Introduction](#title-1)
* [Alignment quality indexes](#title-2)
* [Homogeneity score](#title-3)
* [Query sequence coverage by subject sequence](#subtitle-3-1)
* [Subject sequence coverage by query sequence](#subtitle-3-2)
* [Mutual coverage index](#subtitle-3-3)
* [Disparity index](#subtitle-3-4)
* [A few examples](#subtitle-3-5)
* [Example 1](#example-1)
* [Example 2](#example-2)
* [Example 3](#example-3)
## Introduction
During execution, LAGOON-MCL uses several indices to evaluate the results:
- **Alignment quality indices:** Two indices are used to assess the quality of alignments against the MMseqs2 **alphafoldDB** database.
- **Homogeneity score:** Measures the consistency of annotations within each cluster.
## Homogeneity score
The **homogeneity score** is calculated as follows:
$$
N_{label} > 1 \quad \Rightarrow \quad Hom_{score} = 1 - \frac{N_{label}}{N_{seq}}
$$
$$
N_{label} = 1 \quad \Rightarrow \quad Hom_{score} = 1
$$
$$
N_{label} = 0 \quad \Rightarrow \quad Hom_{score} = NA
$$
Where:
- $Hom_{score}$ : the homogeneity score
- $N_{label}$ : the number of different labels in the cluster
- $N_{seq}$ : the number of sequences in the cluster
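
As an illustration only, the three cases above can be written as a small Python function. The name `homogeneity_score` and the use of `None` to represent `NA` are assumptions made for this sketch, not part of LAGOON-MCL.

```python
from typing import Optional


def homogeneity_score(n_label: int, n_seq: int) -> Optional[float]:
    """Return the homogeneity score of a cluster, or None when it is NA."""
    if n_seq <= 0:
        raise ValueError("n_seq must be a positive number of sequences")
    if n_label == 0:
        return None  # no label in the cluster: the score is NA
    if n_label == 1:
        return 1.0  # a single label: the cluster is fully homogeneous
    # More than one label: the score drops as the number of labels grows.
    return 1.0 - n_label / n_seq
```

For example, a cluster of 10 sequences carrying 2 distinct labels gets `1 - 2/10`, while a cluster of 10 sequences carrying 15 distinct labels gets a negative value.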
When $N_{label} \le N_{seq}$, the homogeneity score ranges between **0 and 1**.
If the score is less than 0 (that is, the cluster carries more distinct labels than sequences), an algorithm is applied to select only the annotations that **best explain the sequences** in the cluster:
```text
Initialize list_label as an empty list
Initialize set_nodes as an empty set
Initialize dict_size as an empty dictionary
Create a dictionary dict_labels with:
    {label: {set containing the identifiers linked to the label}}
For each label in dict_labels:
    Set dict_size[label] to the size of dict_labels[label]
    Add all elements of dict_labels[label] to set_nodes
While set_nodes is not empty:
    Initialize max_label as the label with the largest size in dict_size
    Add max_label to list_label
    Remove elements of dict_labels[max_label] from set_nodes
    Update dict_labels by removing the elements associated with max_label
    Recalculate the sizes of the subsets in dict_size
```
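
The pseudocode above can be read as a greedy selection, sketched below in Python. This is one possible reading and not the LAGOON-MCL source code: the function name `select_labels` is hypothetical, and the step "removing the elements associated with max_label" is assumed to mean that identifiers already explained by the chosen label are removed from every remaining label.

```python
def select_labels(dict_labels):
    """Greedily pick labels until every identifier is explained by one of them.

    dict_labels maps each label to the set of identifiers linked to it.
    """
    # Work on copies so the caller's sets are left untouched.
    remaining = {label: set(ids) for label, ids in dict_labels.items()}
    # set_nodes holds the identifiers not yet explained by a selected label.
    set_nodes = set().union(*remaining.values())
    list_label = []

    while set_nodes:
        # Label that currently explains the most unexplained identifiers.
        max_label = max(remaining, key=lambda label: len(remaining[label]))
        covered = remaining.pop(max_label)
        list_label.append(max_label)
        set_nodes -= covered
        # Assumption: identifiers already explained no longer count
        # towards the size of the other labels.
        for ids in remaining.values():
            ids -= covered

    return list_label


labels = {
    "kinase": {"s1", "s2", "s3"},
    "transferase": {"s2", "s3"},
    "membrane": {"s4"},
}
print(select_labels(labels))  # ['kinase', 'membrane']
```

In this toy input, `transferase` is dropped because both of its sequences are already explained by `kinase`.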
## Alignment quality indexes
To assess the quality of coverage between two sequences, LAGOON-MCL uses two measurements:
- **Query coverage:** The proportion of the query sequence covered by the subject sequence.
- **Subject coverage:** The proportion of the subject sequence covered by the query sequence.
From these two values, two indices are calculated:
- **Mutual coverage index:** Reflects the overall coverage of both sequences.
- **Disparity index:** Measures the balance of the alignment between the sequences.
## Query sequence coverage by subject sequence
$$
queryCoverage = \frac{queryStopPosition - queryStartPosition + 1}{querySequenceLength}
$$
Where:
- **`queryCoverage`**: proportion of the query sequence covered by the alignment
- **`queryStartPosition`**: alignment start position on the query sequence
- **`queryStopPosition`**: alignment stop position on the query sequence
- **`querySequenceLength`**: length of the query sequence
## Subject sequence coverage by query sequence
$$
subjectCoverage = \frac{subjectStopPosition - subjectStartPosition + 1}{subjectSequenceLength}
$$
Where:
- **`subjectCoverage`**: proportion of the subject sequence covered by the alignment
- **`subjectStartPosition`**: alignment start position on the subject sequence
- **`subjectStopPosition`**: alignment stop position on the subject sequence
- **`subjectSequenceLength`**: length (in amino acids) of the subject sequence
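
Both coverage formulas have the same shape, so a single helper can serve for the query and the subject. The sketch below is illustrative: the function name `coverage` is hypothetical, and positions are assumed to be 1-based and inclusive, which is what the `+ 1` in the formulas implies.

```python
def coverage(start: int, stop: int, length: int) -> float:
    """Proportion of a sequence spanned by the alignment.

    start and stop are assumed to be 1-based, inclusive positions.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return (stop - start + 1) / length


# Alignment spanning positions 1 to 570 of a 570-residue subject.
print(coverage(1, 570, 570))  # 1.0
```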
## Mutual Coverage Index
The **mutual coverage index** measures the overall coverage between two sequences.
This index ranges from **0** to **1**:
- **0**: poor coverage — the query and subject sequences are barely covered by the alignment
- **1**: perfect coverage — both sequences are completely covered by the alignment
It is calculated as:
$$
coverageIndex = \frac{queryCoverage + subjectCoverage}{2}
$$
Where:
- **`coverageIndex`**: mutual coverage index
- **`queryCoverage`**: coverage proportion of the query sequence
- **`subjectCoverage`**: coverage proportion of the subject sequence
## Disparity Index
The **disparity index** measures the alignment balance between the query and subject sequences. \
It indicates whether the alignment covers both sequences evenly.
The index ranges from **0** to **1**:
- **0**: the alignment covers the query and subject sequences equally
- **1**: there is a large difference between the coverage of the query and subject sequences
It is calculated as:
$$
disparityIndex = \left| queryCoverage - subjectCoverage \right|
$$
Where:
- **`disparityIndex`**: disparity index
- **`queryCoverage`**: coverage proportion of the query sequence
- **`subjectCoverage`**: coverage proportion of the subject sequence
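
As a minimal sketch, the mutual coverage index and the disparity index can be computed from two coverage proportions as follows. The function names are hypothetical and chosen only to mirror the symbols used in the formulas.

```python
def coverage_index(query_coverage: float, subject_coverage: float) -> float:
    """Mean of the two coverage proportions."""
    return (query_coverage + subject_coverage) / 2


def disparity_index(query_coverage: float, subject_coverage: float) -> float:
    """Absolute difference between the two coverage proportions."""
    return abs(query_coverage - subject_coverage)


# Half of the query and all of the subject are covered.
print(coverage_index(0.5, 1.0))   # 0.75
print(disparity_index(0.5, 1.0))  # 0.5
```

Reading the two values together is what makes them useful: a high coverage index with a low disparity index means both sequences are largely and evenly covered.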
## A few examples
| | Example 1 | Example 2 | Example 3 |
|-----------------------|-----------|-----------|-----------|
| querySequenceLength | 804 | 743 | 71 |
| queryStartPosition | 1 | 617 | 1 |
| queryStopPosition | 570 | 711 | 71 |
| subjectSequenceLength | 570 | 722 | 532 |
| subjectStartPosition | 1 | 431 | 460 |
| subjectStopPosition | 570 | 527 | 532 |
#### Example 1
Using the Example 1 values from the table above:
$$
queryCoverage = \frac{570 - 1 + 1}{804} = \frac{570}{804} \approx 0.709
$$
$$
subjectCoverage = \frac{570 - 1 + 1}{570} = \frac{570}{570} = 1
$$
The **mutual coverage index** is:
$$
coverageIndex = \frac{0.709 + 1}{2} \approx 0.854
$$
The **disparity index** is:
$$
disparityIndex = |0.709 - 1| \approx 0.291
$$
- The **coverage index** of ~0.85 indicates that a large portion of both sequences is included in the alignment, suggesting good overall coverage.
- The **disparity index** of ~0.29 shows that the query sequence is less covered than the subject sequence, highlighting some imbalance in the alignment.
#### Example 2
Using the Example 2 values from the table above:
$$
queryCoverage = \frac{711 - 617 + 1}{743} = \frac{95}{743} \approx 0.128
$$
$$
subjectCoverage = \frac{527 - 431 + 1}{722} = \frac{97}{722} \approx 0.134
$$
The **mutual coverage index** is:
$$
coverageIndex = \frac{0.128 + 0.134}{2} \approx 0.131
$$
The **disparity index** is:
$$
disparityIndex = |0.128 - 0.134| \approx 0.006
$$
- The **disparity index** is close to 0, indicating that the two sequences are aligned over roughly the same proportion of their lengths.
- The **coverage index** of ~0.13 shows that only a small portion of both sequences is included in the alignment, indicating poor overall alignment quality.
#### Example 3
Using the Example 3 values from the table above:
$$
queryCoverage = \frac{71 - 1 + 1}{71} = \frac{71}{71} = 1
$$
$$
subjectCoverage = \frac{532 - 460 + 1}{532} = \frac{73}{532} \approx 0.137
$$
The **mutual coverage index** is:
$$
coverageIndex = \frac{1 + 0.137}{2} \approx 0.569
$$
The **disparity index** is:
$$
disparityIndex = |1 - 0.137| \approx 0.863
$$
- The **coverage index** of ~0.57 indicates that, on average, just over half of the two sequences are covered by the alignment.
- The **disparity index** is very high (~0.86), showing that the alignment is extremely imbalanced: the query sequence is fully covered, but the subject sequence is barely aligned.
- This alignment would generally be considered low quality due to the imbalance.
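
The three worked examples can be checked with a short script. This is a sketch for verification only: the dictionary keys and the helper name `alignment_indices` are hypothetical, and the input values are copied from the table of examples.

```python
examples = {
    "Example 1": {"q_len": 804, "q_start": 1, "q_stop": 570,
                  "s_len": 570, "s_start": 1, "s_stop": 570},
    "Example 2": {"q_len": 743, "q_start": 617, "q_stop": 711,
                  "s_len": 722, "s_start": 431, "s_stop": 527},
    "Example 3": {"q_len": 71, "q_start": 1, "q_stop": 71,
                  "s_len": 532, "s_start": 460, "s_stop": 532},
}


def alignment_indices(e):
    """Return (queryCoverage, subjectCoverage, coverageIndex, disparityIndex)."""
    query_cov = (e["q_stop"] - e["q_start"] + 1) / e["q_len"]
    subject_cov = (e["s_stop"] - e["s_start"] + 1) / e["s_len"]
    cov_index = (query_cov + subject_cov) / 2
    disparity = abs(query_cov - subject_cov)
    return query_cov, subject_cov, cov_index, disparity


# Round to three decimals, as in the worked examples.
results = {
    name: tuple(round(value, 3) for value in alignment_indices(e))
    for name, e in examples.items()
}

for name, values in results.items():
    print(name, values)
```

The indices are computed from the unrounded coverages and rounded only at the end, so the last digit can differ slightly from a calculation done on already rounded values.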