Table of contents

* [Introduction](#title-1)
* [Homogeneity score](#title-3)
* [Alignment quality indexes](#title-2)
    * [Query sequence coverage by subject sequence](#subtitle-3-1)
    * [Subject sequence coverage by query sequence](#subtitle-3-2)
    * [Mutual coverage index](#subtitle-3-3)
    * [Disparity index](#subtitle-3-4)
    * [A few examples](#subtitle-3-5)
        * [Example 1](#example-1)
        * [Example 2](#example-2)
        * [Example 3](#example-3)

Introduction

During execution, LAGOON-MCL uses several indices to evaluate the results:

- **Alignment quality indices:** Two indices are used to assess the quality of alignments against the MMseqs2 **alphafoldDB** database.
- **Homogeneity score:** Measures the consistency of annotations within each cluster.

Homogeneity score

The **homogeneity score** is calculated as follows:

$$ N_{label} > 1 \quad \Rightarrow \quad Hom_{score} = 1 - \frac{N_{label}}{N_{seq}} $$

$$ N_{label} = 1 \quad \Rightarrow \quad Hom_{score} = 1 $$

$$ N_{label} = 0 \quad \Rightarrow \quad Hom_{score} = NA $$

Where:

- $Hom_{score}$: the homogeneity score
- $N_{label}$: the number of different labels in the cluster
- $N_{seq}$: the number of sequences in the cluster

The homogeneity score normally ranges between **0 and 1**. A score below 0 means that the cluster carries more distinct labels than sequences ($N_{label} > N_{seq}$), which can only happen when some sequences have several labels. In that case, an algorithm is applied to select only the annotations that **best explain the sequences** in the cluster.

```text
Initialize list_label as an empty list
Initialize set_nodes as an empty set
Initialize dict_size as an empty dictionary
Create a dictionary dict_labels with: {label: {set containing the identifiers linked to the label}}

For each label in dict_labels:
    Set dict_size[label] to the size of dict_labels[label]
    Add all elements of dict_labels[label] to set_nodes

While set_nodes is not empty:
    Initialize max_label as the label with the largest size in dict_size
    Add max_label to list_label
    Remove elements of dict_labels[max_label] from set_nodes
    Update dict_labels by removing the elements associated with max_label from every set
    Recalculate the sizes of the subsets in dict_size
```
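As an illustration, the score and the selection step described above could be sketched in Python as follows. This is a minimal sketch, not LAGOON-MCL's actual code: the function names `homogeneity_score` and `select_labels` are hypothetical, `None` stands in for `NA`, and ties between equally large labels are broken by dictionary order, which the pseudocode does not specify.

```python
from typing import Dict, List, Optional, Set


def homogeneity_score(n_label: int, n_seq: int) -> Optional[float]:
    """Return the homogeneity score, or None (NA) when the cluster has no label."""
    if n_label == 0:
        return None
    if n_label == 1:
        return 1.0
    return 1.0 - n_label / n_seq


def select_labels(dict_labels: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick labels until every annotated identifier is covered."""
    # Work on copies so the caller's sets are left untouched.
    remaining = {label: set(ids) for label, ids in dict_labels.items()}
    set_nodes = set().union(*remaining.values())
    list_label = []
    while set_nodes:
        # Label that explains the most identifiers not yet covered.
        max_label = max(remaining, key=lambda label: len(remaining[label]))
        covered = remaining.pop(max_label)
        list_label.append(max_label)
        set_nodes -= covered
        # Drop the covered identifiers from every other label's set,
        # which also updates the sizes used in the next iteration.
        for ids in remaining.values():
            ids -= covered
    return list_label


# Hypothetical cluster: 3 sequences, 4 distinct labels.
cluster = {"A": {"s1", "s2", "s3"}, "B": {"s1"}, "C": {"s2"}, "D": {"s3"}}
print(homogeneity_score(len(cluster), 3))  # negative, since 4 labels > 3 sequences
print(select_labels(cluster))              # ['A']
```

In this made-up cluster, label `A` alone explains all three sequences, so the greedy selection keeps it and discards the three labels that each cover a single sequence.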

Alignment quality indexes

To assess the quality of coverage between two sequences, LAGOON-MCL uses two measurements:

- **Query coverage:** The proportion of the query sequence covered by the subject sequence.
- **Subject coverage:** The proportion of the subject sequence covered by the query sequence.

From these two values, two indices are calculated:

- **Mutual coverage index:** Reflects the overall coverage of both sequences.
- **Disparity index:** Measures the balance of the alignment between the sequences.

Query sequence coverage by subject sequence

$$ queryCoverage = \frac{queryStopPosition - queryStartPosition + 1}{querySequenceLength} $$

Where:

- **`queryCoverage`**: proportion of the query sequence covered by the alignment
- **`queryStartPosition`**: alignment start position on the query sequence
- **`queryStopPosition`**: alignment stop position on the query sequence
- **`querySequenceLength`**: length of the query sequence

Subject sequence coverage by query sequence

$$ subjectCoverage = \frac{subjectStopPosition - subjectStartPosition + 1}{subjectSequenceLength} $$

Where:

- **`subjectCoverage`**: proportion of the subject sequence covered by the alignment
- **`subjectStartPosition`**: alignment start position on the subject sequence
- **`subjectStopPosition`**: alignment stop position on the subject sequence
- **`subjectSequenceLength`**: length (in amino acids) of the subject sequence
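The query and subject coverage formulas have the same form, so a single helper can serve for both. The following is a minimal Python sketch, assuming 1-based inclusive positions (which is what the `+ 1` in the numerator implies). The function name `coverage` and the input checks are illustrative and not taken from LAGOON-MCL.

```python
def coverage(start: int, stop: int, length: int) -> float:
    """Fraction of a sequence spanned by an alignment.

    Positions are assumed to be 1-based and inclusive, so an alignment
    from position 1 to position `length` covers the whole sequence.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 1 <= start <= stop <= length:
        raise ValueError("expected 1 <= start <= stop <= length")
    return (stop - start + 1) / length


# A 100-residue sequence aligned from position 11 to position 60.
print(coverage(11, 60, 100))  # 0.5
```

The same call computes `queryCoverage` when given the query positions and length, and `subjectCoverage` when given the subject ones.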

Mutual coverage index

The **mutual coverage index** measures the overall coverage between two sequences. This index ranges from **0** to **1**:

- **0**: poor coverage, where the query and subject sequences are barely covered by the alignment
- **1**: perfect coverage, where both sequences are completely covered by the alignment

It is calculated as:

$$ coverageIndex = \frac{queryCoverage + subjectCoverage}{2} $$

Where:

- **`coverageIndex`**: mutual coverage index
- **`queryCoverage`**: coverage proportion of the query sequence
- **`subjectCoverage`**: coverage proportion of the subject sequence

Disparity index

The **disparity index** measures the alignment balance between the query and subject sequences. It indicates whether the alignment covers both sequences evenly. The index ranges from **0** to **1**:

- **0**: the alignment covers the query and subject sequences equally
- **1**: there is a large difference between the coverage of the query and subject sequences

It is calculated as:

$$ disparityIndex = \left| queryCoverage - subjectCoverage \right| $$

Where:

- **`disparityIndex`**: disparity index
- **`queryCoverage`**: coverage proportion of the query sequence
- **`subjectCoverage`**: coverage proportion of the subject sequence
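Both indices are simple functions of the two coverage values. A minimal Python sketch follows; the function name `alignment_indices` is hypothetical and the coverages are assumed to be already computed as proportions between 0 and 1.

```python
from typing import Tuple


def alignment_indices(query_coverage: float, subject_coverage: float) -> Tuple[float, float]:
    """Return (coverageIndex, disparityIndex) for a pair of coverage proportions."""
    coverage_index = (query_coverage + subject_coverage) / 2
    disparity_index = abs(query_coverage - subject_coverage)
    return coverage_index, disparity_index


# Query fully covered, subject half covered.
print(alignment_indices(1.0, 0.5))  # (0.75, 0.5)
```

The two indices are best read together. The mean alone cannot distinguish an even alignment from a lopsided one, and the pair recovers both coverages: the larger one is `coverageIndex + disparityIndex / 2` and the smaller one is `coverageIndex - disparityIndex / 2`.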

A few examples

|                       | Example 1 | Example 2 | Example 3 |
|-----------------------|-----------|-----------|-----------|
| querySequenceLength   | 804       | 743       | 71        |
| queryStartPosition    | 1         | 617       | 1         |
| queryStopPosition     | 570       | 711       | 71        |
| subjectSequenceLength | 570       | 722       | 532       |
| subjectStartPosition  | 1         | 431       | 460       |
| subjectStopPosition   | 570       | 527       | 532       |

#### Example 1

Using the values in the Example 1 column:

$$ queryCoverage = \frac{570 - 1 + 1}{804} = \frac{570}{804} \approx 0.709 $$

$$ subjectCoverage = \frac{570 - 1 + 1}{570} = \frac{570}{570} = 1 $$

The **mutual coverage index** is:

$$ coverageIndex = \frac{0.709 + 1}{2} \approx 0.854 $$

The **disparity index** is:

$$ disparityIndex = |0.709 - 1| \approx 0.291 $$

- The **coverage index** of ~0.85 indicates that a large portion of both sequences is included in the alignment, suggesting good overall coverage.
- The **disparity index** of ~0.29 shows that the query sequence is less covered than the subject sequence, highlighting some imbalance in the alignment.

#### Example 2

Using the values in the Example 2 column:

$$ queryCoverage = \frac{711 - 617 + 1}{743} = \frac{95}{743} \approx 0.128 $$

$$ subjectCoverage = \frac{527 - 431 + 1}{722} = \frac{97}{722} \approx 0.134 $$

The **mutual coverage index** is:

$$ coverageIndex = \frac{0.128 + 0.134}{2} \approx 0.131 $$

The **disparity index**, computed from the unrounded coverages, is:

$$ disparityIndex = \left| \frac{95}{743} - \frac{97}{722} \right| \approx 0.0065 $$

- The **disparity index** is close to 0, indicating that the two sequences are aligned over roughly the same proportion of their lengths.
- The **coverage index** of ~0.13 shows that only a small portion of both sequences is included in the alignment, indicating poor overall alignment quality.

#### Example 3

Using the values in the Example 3 column:

$$ queryCoverage = \frac{71 - 1 + 1}{71} = \frac{71}{71} = 1 $$

$$ subjectCoverage = \frac{532 - 460 + 1}{532} = \frac{73}{532} \approx 0.137 $$

The **mutual coverage index** is:

$$ coverageIndex = \frac{1 + 0.137}{2} \approx 0.569 $$

The **disparity index** is:

$$ disparityIndex = |1 - 0.137| \approx 0.863 $$

- The **coverage index** of ~0.57 indicates that, on average, just over half of the two sequences are covered by the alignment.
- The **disparity index** is very high (~0.86), showing that the alignment is extremely imbalanced: the query sequence is fully covered, but the subject sequence is barely aligned.
- This alignment would generally be considered low quality due to the imbalance.
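The three worked examples can be reproduced with a short script. The sketch below is illustrative only: the `Alignment` record and the `indices` function are hypothetical names, and the numbers are copied from the examples table.

```python
from typing import NamedTuple, Tuple


class Alignment(NamedTuple):
    query_length: int
    query_start: int
    query_stop: int
    subject_length: int
    subject_start: int
    subject_stop: int


def indices(aln: Alignment) -> Tuple[float, float, float, float]:
    """Return (queryCoverage, subjectCoverage, coverageIndex, disparityIndex)."""
    query_cov = (aln.query_stop - aln.query_start + 1) / aln.query_length
    subject_cov = (aln.subject_stop - aln.subject_start + 1) / aln.subject_length
    return query_cov, subject_cov, (query_cov + subject_cov) / 2, abs(query_cov - subject_cov)


# Values taken from the examples table.
EXAMPLES = {
    "Example 1": Alignment(804, 1, 570, 570, 1, 570),
    "Example 2": Alignment(743, 617, 711, 722, 431, 527),
    "Example 3": Alignment(71, 1, 71, 532, 460, 532),
}

for name, aln in EXAMPLES.items():
    q, s, cov, disp = indices(aln)
    print(f"{name}: query={q:.3f} subject={s:.3f} "
          f"coverageIndex={cov:.3f} disparityIndex={disp:.4f}")
```

Because the script works on unrounded coverages, its last digits can differ slightly from hand calculations that start from the rounded values.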