5. Indices

Table of contents

Introduction
Alignment quality indices
Homogeneity score

Introduction

During execution, LAGOON-MCL uses several indices to evaluate the results:

Alignment quality indices: Two indices are used to assess the quality of alignments against the MMseqs2 alphafoldDB database.
Homogeneity score: Measures the consistency of annotations within each cluster.

Homogeneity score

The homogeneity score is calculated as follows:

$$ N_{label} > 1 \quad \Rightarrow \quad Hom_{score} = 1 - \frac{N_{label}}{N_{seq}} $$

$$ N_{label} = 1 \quad \Rightarrow \quad Hom_{score} = 1 $$

$$ N_{label} = 0 \quad \Rightarrow \quad Hom_{score} = NA $$

Where:

$Hom_{score}$ : the homogeneity score
$N_{label}$ : the number of different labels in the cluster
$N_{seq}$ : the number of sequences in the cluster
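
The three cases above can be sketched as a small function. This is a minimal illustration, not LAGOON-MCL's own code: the function name `homogeneity_score` and the use of `None` to represent NA are assumptions made for this example.

```python
from typing import Optional


def homogeneity_score(n_label: int, n_seq: int) -> Optional[float]:
    """Return the homogeneity score of a cluster, or None when it is NA.

    n_label: number of different labels in the cluster
    n_seq:   number of sequences in the cluster
    """
    if n_seq <= 0:
        raise ValueError("n_seq must be a positive integer")
    if n_label < 0:
        raise ValueError("n_label cannot be negative")
    if n_label == 0:
        return None  # no label in the cluster: the score is NA
    if n_label == 1:
        return 1.0  # a single label: the cluster is fully homogeneous
    return 1.0 - n_label / n_seq  # more than one label
```

For instance, a cluster of 10 sequences with 2 different labels gets `1 - 2/10`, while a cluster with more labels than sequences yields a negative raw value.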

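The homogeneity score normally ranges between 0 and 1.
The raw value is negative only when $N_{label} > N_{seq}$, which requires at least one sequence to carry several labels. In that case, an algorithm is applied to select only the annotations that best explain the sequences in the cluster: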

Initialize list_label as an empty list
Initialize set_nodes as an empty set
Initialize dict_size as an empty dictionary

Create a dictionary dict_labels with:
    {label: {set containing the identifiers linked to the label}}

For each label in dict_labels:
    Add the size of dict_labels[label] to dict_size
    Add all elements of dict_labels[label] to set_nodes

While set_nodes is not empty:

    Initialize max_label as the label with the largest size in dict_size

    Add max_label to list_label
    Remove elements of dict_labels[max_label] from set_nodes

    Update dict_labels by removing the identifiers covered by max_label from every label's set

    Recalculate the sizes of the subsets in dict_size
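
The pseudocode above follows a greedy pattern: repeatedly keep the label that explains the most sequences not yet explained. Below is a minimal Python sketch of that idea. The function name `select_labels`, the example labels, and the tie-breaking rule (alphabetical order) are assumptions made for this illustration and are not taken from LAGOON-MCL.

```python
def select_labels(dict_labels):
    """Greedily pick labels until every annotated identifier is covered.

    dict_labels maps each label to the set of sequence identifiers
    linked to that label.
    """
    # Work on copies so the caller's dictionary is left untouched.
    remaining = {label: set(ids) for label, ids in dict_labels.items()}
    set_nodes = set().union(*remaining.values())
    list_label = []

    while set_nodes:
        # Label linked to the most identifiers that are still uncovered.
        # Ties go to the alphabetically first label (assumption for determinism).
        max_label = max(sorted(remaining), key=lambda label: len(remaining[label]))
        covered = remaining.pop(max_label)
        list_label.append(max_label)
        set_nodes -= covered
        # Drop the covered identifiers from the other labels; their sizes
        # are recomputed through len() on the next iteration.
        for ids in remaining.values():
            ids -= covered

    return list_label


example = {
    "kinase": {"s1", "s2", "s3"},
    "transferase": {"s3", "s4", "s5"},
    "membrane": {"s4"},
}
selected = select_labels(example)
print(selected)
```

In this hypothetical cluster, `kinase` is kept first (three sequences), then `transferase` (two sequences still unexplained); `membrane` is dropped because its only sequence is already explained.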

Alignment quality indices

To assess the quality of coverage between two sequences, LAGOON-MCL uses two measurements:

Query coverage: The proportion of the query sequence covered by the subject sequence.
Subject coverage: The proportion of the subject sequence covered by the query sequence.

From these two values, two indices are calculated:

Mutual coverage index: Reflects the overall coverage of both sequences.
Disparity index: Measures the balance of the alignment between the sequences.

Query sequence coverage by subject sequence

$$ queryCoverage = \frac{queryStopPosition - queryStartPosition + 1}{querySequenceLength} $$

Where:

queryCoverage: proportion of the query sequence covered by the alignment
queryStartPosition: alignment start position on the query sequence
queryStopPosition: alignment stop position on the query sequence
querySequenceLength: length of the query sequence

Subject sequence coverage by query sequence

$$ subjectCoverage = \frac{subjectStopPosition - subjectStartPosition + 1}{subjectSequenceLength} $$

Where:

subjectCoverage: proportion of the subject sequence covered by the alignment
subjectStartPosition: alignment start position on the subject sequence
subjectStopPosition: alignment stop position on the subject sequence
subjectSequenceLength: length (in amino acids) of the subject sequence
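
Both coverage formulas have the same shape, so a single helper can serve for the query and the subject. The sketch below assumes 1-based, inclusive positions with start not greater than stop, which is what the `+ 1` in the formulas implies; the function name `coverage` is chosen for this example only.

```python
def coverage(start: int, stop: int, length: int) -> float:
    """Proportion of a sequence covered by the alignment.

    Positions are assumed to be 1-based and inclusive.
    """
    if length <= 0:
        raise ValueError("length must be a positive integer")
    if not 1 <= start <= stop <= length:
        raise ValueError("expected 1 <= start <= stop <= length")
    return (stop - start + 1) / length


# Query covered from position 1 to 570 out of 804 residues.
query_coverage = coverage(start=1, stop=570, length=804)
# Subject covered from position 1 to 570 out of 570 residues.
subject_coverage = coverage(start=1, stop=570, length=570)
print(round(query_coverage, 3), subject_coverage)
```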

Mutual coverage index

The mutual coverage index measures the overall coverage between two sequences.
This index ranges from 0 to 1:

0: poor coverage — the query and subject sequences are barely covered by the alignment
1: perfect coverage — both sequences are completely covered by the alignment

It is calculated as:

$$ coverageIndex = \frac{queryCoverage + subjectCoverage}{2} $$

Where:

coverageIndex: mutual coverage index
queryCoverage: coverage proportion of the query sequence
subjectCoverage: coverage proportion of the subject sequence

Disparity index

The disparity index measures the alignment balance between the query and subject sequences.
It indicates whether the alignment covers both sequences evenly.

The index ranges from 0 to 1:

0: the alignment covers the query and subject sequences equally
1: there is a large difference between the coverage of the query and subject sequences

It is calculated as:

$$ disparityIndex = \left| queryCoverage - subjectCoverage \right| $$

Where:

disparityIndex: disparity index
queryCoverage: coverage proportion of the query sequence
subjectCoverage: coverage proportion of the subject sequence
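
The two indices are simple functions of the two coverage proportions. The following sketch shows them side by side; the function names and the input check are assumptions made for this example.

```python
def _check_proportion(value: float, name: str) -> None:
    """Reject coverage values outside the [0, 1] interval."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must lie between 0 and 1")


def coverage_index(query_coverage: float, subject_coverage: float) -> float:
    """Mutual coverage index: mean of the two coverage proportions."""
    _check_proportion(query_coverage, "query_coverage")
    _check_proportion(subject_coverage, "subject_coverage")
    return (query_coverage + subject_coverage) / 2


def disparity_index(query_coverage: float, subject_coverage: float) -> float:
    """Disparity index: absolute difference of the two coverage proportions."""
    _check_proportion(query_coverage, "query_coverage")
    _check_proportion(subject_coverage, "subject_coverage")
    return abs(query_coverage - subject_coverage)


# A fully covered query aligned to half of the subject.
print(coverage_index(1.0, 0.5))   # 0.75
print(disparity_index(1.0, 0.5))  # 0.5
```

Reading the two values together is what matters: a high mutual coverage index with a low disparity index describes an alignment that spans most of both sequences.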

A few examples

Parameter	Example 1	Example 2	Example 3
querySequenceLength	804	743	71
queryStartPosition	1	617	1
queryStopPosition	570	711	71
subjectSequenceLength	570	722	532
subjectStartPosition	1	431	460
subjectStopPosition	570	527	532

Example 1

Using the Example 1 positions from the table, the coverages are:

$$ queryCoverage = \frac{570 - 1 + 1}{804} = \frac{570}{804} \approx 0.709 $$

$$ subjectCoverage = \frac{570 - 1 + 1}{570} = \frac{570}{570} = 1 $$

The mutual coverage index is:

$$ coverageIndex = \frac{0.709 + 1}{2} \approx 0.854 $$

The disparity index is:

$$ disparityIndex = |0.709 - 1| \approx 0.291 $$

The coverage index of ~0.85 indicates that a large portion of both sequences is included in the alignment, suggesting good overall coverage.
The disparity index of ~0.29 shows that the query sequence is less covered than the subject sequence, highlighting some imbalance in the alignment.

Example 2

Using the Example 2 positions from the table, the coverages are:

$$ queryCoverage = \frac{711 - 617 + 1}{743} = \frac{95}{743} \approx 0.128 $$

$$ subjectCoverage = \frac{527 - 431 + 1}{722} = \frac{97}{722} \approx 0.134 $$

The mutual coverage index is:

$$ coverageIndex = \frac{0.128 + 0.134}{2} \approx 0.131 $$

The disparity index, computed from the unrounded coverages, is:

$$ disparityIndex = \left| \frac{95}{743} - \frac{97}{722} \right| \approx 0.0065 $$

The disparity index is close to 0, indicating that the two sequences are aligned over roughly the same proportion of their lengths.
The coverage index of ~0.13 shows that only a small portion of both sequences is included in the alignment, indicating poor overall alignment quality.

Example 3

Using the Example 3 positions from the table, the coverages are:

$$ queryCoverage = \frac{71 - 1 + 1}{71} = \frac{71}{71} = 1 $$

$$ subjectCoverage = \frac{532 - 460 + 1}{532} = \frac{73}{532} \approx 0.137 $$

The mutual coverage index is:

$$ coverageIndex = \frac{1 + 0.137}{2} \approx 0.569 $$

The disparity index is:

$$ disparityIndex = |1 - 0.137| \approx 0.863 $$

The coverage index of ~0.57 indicates that, on average, just over half of the two sequences are covered by the alignment.
The disparity index is very high (~0.86), showing that the alignment is extremely imbalanced: the query sequence is fully covered, but the subject sequence is barely aligned.
This alignment would generally be considered low quality due to the imbalance.
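
The three worked examples can be recomputed from the raw positions with a short script. This is an illustrative sketch only: the `Alignment` class and its attribute names are assumptions made for this example, and positions are taken as 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Alignment coordinates, 1-based and inclusive (hypothetical container)."""
    query_length: int
    query_start: int
    query_stop: int
    subject_length: int
    subject_start: int
    subject_stop: int

    @property
    def query_coverage(self) -> float:
        return (self.query_stop - self.query_start + 1) / self.query_length

    @property
    def subject_coverage(self) -> float:
        return (self.subject_stop - self.subject_start + 1) / self.subject_length

    @property
    def coverage_index(self) -> float:
        return (self.query_coverage + self.subject_coverage) / 2

    @property
    def disparity_index(self) -> float:
        return abs(self.query_coverage - self.subject_coverage)


# Values copied from the examples table.
EXAMPLES = {
    "Example 1": Alignment(804, 1, 570, 570, 1, 570),
    "Example 2": Alignment(743, 617, 711, 722, 431, 527),
    "Example 3": Alignment(71, 1, 71, 532, 460, 532),
}

for name, aln in EXAMPLES.items():
    print(
        f"{name}: queryCoverage={aln.query_coverage:.3f} "
        f"subjectCoverage={aln.subject_coverage:.3f} "
        f"coverageIndex={aln.coverage_index:.3f} "
        f"disparityIndex={aln.disparity_index:.4f}"
    )
```

The indices are computed from unrounded coverages, so the last digit can differ slightly from a calculation done on the rounded values shown in the formulas.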
