# Proof-of-concept: ECG similarity search engine

This Jupyter notebook presents a proof-of-concept system for retrieving ECGs based on similarity across multiple ML-derived metrics. It includes:

1. Synthetic data generation with clinically meaningful structure

2. Feature preprocessing 

3. Feature indexing using FAISS

4. A flexible querying mechanism supporting feature selection and weighting, with illustrative query examples


The notebook complements the accompanying report by demonstrating the core implementation and validating the system’s performance under realistic conditions.

Let us start by importing the needed packages.

In [6]:
import os
import sys
import time
import numpy as np

sys.path.append(os.path.abspath(".."))

from src.data_generator import DataGenerator
from src.data_preprocessor import DataPreprocessor
from src.similarity_searcher import SimilaritySearcher

### 1. Synthetic data generation

We will use `DataGenerator` to simulate a synthetic dataset of N ECGs. For this example, we pick N = 100,000.

In [2]:
N = 100000
generator = DataGenerator(num_ecgs=N)
data = generator.generate_data()
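
The internals of `DataGenerator` are not shown in this notebook. As a rough illustration of what "clinically meaningful structure" could look like, the sketch below samples a cluster label for each ECG and then draws its heart rate from a cluster-specific distribution. Only the cluster names come from the query outputs further down; the proportions, means, and standard deviations are made-up values, and `sample_heart_rates` is a hypothetical helper, not part of `src`.

```python
import numpy as np

# Hypothetical cluster parameters: (proportion, mean heart rate, std).
# Only the cluster names appear in this notebook; the numbers are assumptions.
CLUSTERS = {
    "normal": (0.60, 70.0, 8.0),
    "pvc_heavy": (0.15, 75.0, 10.0),
    "ischemia": (0.15, 72.0, 9.0),
    "afib_prone": (0.10, 110.0, 15.0),
}


def sample_heart_rates(num_ecgs, seed=0):
    """Return (labels, heart_rates) with one entry per synthetic ECG."""
    rng = np.random.default_rng(seed)
    names = list(CLUSTERS)
    proportions = np.array([CLUSTERS[n][0] for n in names])
    # Draw a cluster index for every ECG according to the assumed proportions.
    cluster_ids = rng.choice(len(names), size=num_ecgs, p=proportions)
    means = np.array([CLUSTERS[n][1] for n in names])[cluster_ids]
    stds = np.array([CLUSTERS[n][2] for n in names])[cluster_ids]
    # Each heart rate is drawn from the distribution of its own cluster.
    heart_rates = rng.normal(means, stds)
    labels = np.array(names)[cluster_ids]
    return labels, heart_rates


labels, heart_rates = sample_heart_rates(1_000)
```

Because clusters with close mean heart rates overlap, a search on heart rate alone cannot separate them well, which is the effect the first query example illustrates.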

### 2. Feature preprocessing

Next, we use `DataPreprocessor` to apply the appropriate preprocessing to each feature group, namely:

- Heart rate: Standardized (mean 0, std 1).

- Risk scores: Each of the 5 condition scores is standardized independently.

- Embeddings: Standardized and reduced with PCA for compactness.

- Beat-type proportions: Used as-is (already normalized between 0 and 1).

This produces a unified matrix of preprocessed vectors, ready for indexing.
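
The four steps above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the `DataPreprocessor` implementation: the column widths, the number of PCA components, and the helper names `standardize` and `pca_reduce` are all hypothetical.

```python
import numpy as np


def standardize(x):
    """Scale each column to mean 0 and standard deviation 1."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x - mean) / std


def pca_reduce(x, n_components):
    """Project centered data onto its top principal components via SVD."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
heart_rate = rng.normal(75, 12, size=(500, 1))
risk_scores = rng.uniform(0, 1, size=(500, 5))
embedding = rng.normal(size=(500, 32))       # embedding width is an assumption
beat_props = rng.dirichlet(np.ones(3), 500)  # rows already lie in [0, 1]

processed = np.hstack([
    standardize(heart_rate),
    standardize(risk_scores),                # each score column scaled independently
    pca_reduce(standardize(embedding), n_components=8),
    beat_props,                              # used as-is
])
print(processed.shape)  # (500, 17)
```

A real preprocessor would also store the fitted means, standard deviations, and principal directions so that the same transformation can later be applied to query vectors.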

In [3]:
preprocessor = DataPreprocessor()
processed_data = preprocessor.fit_transform(data)

### 3. Feature indexing 

We use the `SimilaritySearcher` class to build and manage two types of FAISS indexes:

- Single Index (orchestrated by `SingleIndexer` class): Builds the index over the entire feature vector for full-vector similarity search.

- Hybrid Index (orchestrated by `HybridIndexer` class): Builds separate FAISS indexes for each feature group (e.g., heart rate, risk scores, embeddings), supporting modular and interpretable queries.

Index construction is done in batches for scalability and memory efficiency.

Given the size of the dataset, we use `IndexHNSWFlat` (approximate search) for all feature groups.
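
To make the hybrid layout and the batched construction concrete, here is a small sketch that slices the full matrix into contiguous column blocks and fills one index per group, one batch at a time. `ExactIndex` is a brute-force stand-in written for this example, not a FAISS class; it only mimics the `add` / `search` pattern that FAISS indexes expose, with `search` returning distances first and indices second. The group widths and `build_group_indexes` are assumptions as well.

```python
import numpy as np


class ExactIndex:
    """Brute-force L2 stand-in for a FAISS index (not a FAISS class)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, batch):
        self.vectors = np.vstack([self.vectors, batch.astype(np.float32)])

    def search(self, queries, k):
        # Squared L2 distance between every query and every stored vector.
        diffs = queries[:, None, :] - self.vectors[None, :, :]
        dists = (diffs ** 2).sum(axis=2)
        order = np.argsort(dists, axis=1)[:, :k]
        return np.take_along_axis(dists, order, axis=1), order


def build_group_indexes(matrix, group_shapes, batch_size=1000):
    """Build one index per feature group from contiguous column blocks."""
    indexes, start = {}, 0
    for name, width in group_shapes.items():
        index = ExactIndex(width)
        block = matrix[:, start:start + width]
        for row in range(0, len(block), batch_size):
            index.add(block[row:row + batch_size])  # add one batch at a time
        indexes[name] = index
        start += width
    return indexes


# Hypothetical group widths; the real ones come from the preprocessor.
group_shapes = {"heart_rate": 1, "risk_scores": 5, "embedding": 8, "beat_props": 3}
matrix = np.random.default_rng(0).normal(size=(2500, 17)).astype(np.float32)
indexes = build_group_indexes(matrix, group_shapes)
```

The single index corresponds to the same loop run once over all columns. Swapping the stand-in for an approximate index such as HNSW changes the cost and accuracy of `search`, not the overall structure.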

In [4]:
hnsw_index_config = {
    "heart_rate": "hnsw",
    "risk_scores": "hnsw",
    "embedding": "hnsw",
    "beat_props": "hnsw",
}
searcher = SimilaritySearcher(
    full_matrix=processed_data,
    group_shapes=preprocessor.group_shapes,
    group_index_types=hnsw_index_config,
)

SingleIndex: building index with type <hnsw>: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]
HybridIndexer: building for group <heart_rate> with index type <hnsw>: 100%|██████████| 1/1 [00:00<00:00,  1.29it/s]
HybridIndexer: building for group <risk_scores> with index type <hnsw>: 100%|██████████| 1/1 [00:01<00:00,  1.19s/it]
HybridIndexer: building for group <embedding> with index type <hnsw>: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]
HybridIndexer: building for group <beat_props> with index type <hnsw>: 100%|██████████| 1/1 [00:01<00:00,  1.15s/it]


### 4. Querying

We use the `SimilaritySearcher` to perform similarity queries on the indexed ECG feature vectors.

- Users can specify which feature groups to consider in the similarity computation (e.g., ["embedding", "heart_rate"]).

- Each group can be assigned a custom weight, allowing fine-grained control over the influence of different features.

- If all feature groups are selected, the system uses the full-vector index for fast retrieval.

- Otherwise, it uses the hybrid index, combining results from selected groups using normalized distances and weighted scoring.
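
The hybrid path described in the last bullet can be sketched as follows. The notebook only states that distances are normalized and combined with weights, and that each group returns `top_k × margin_factor` candidates; the min-max normalization, the penalty of 1.0 for candidates missing from a group, the default `margin_factor=3`, and the function `hybrid_search` itself are assumptions made for illustration. Brute-force distances replace the per-group index lookups.

```python
import numpy as np


def hybrid_search(blocks, query_blocks, weights, top_k, margin_factor=3):
    """Rank rows by a weighted sum of normalized per-group distances.

    blocks maps group name -> (N, d) matrix; query_blocks maps
    group name -> (d,) query vector; weights maps group name -> float.
    """
    n_candidates = top_k * margin_factor
    per_group = {}
    for name, block in blocks.items():
        # Brute-force L2 distances stand in for a per-group index lookup.
        dists = np.linalg.norm(block - query_blocks[name], axis=1)
        candidates = np.argsort(dists)[:n_candidates]
        cand_d = dists[candidates]
        # Min-max normalize so groups on different scales are comparable.
        span = cand_d.max() - cand_d.min()
        normed = (cand_d - cand_d.min()) / span if span > 0 else np.zeros_like(cand_d)
        per_group[name] = dict(zip(candidates.tolist(), normed.tolist()))

    total_weight = sum(weights[name] for name in per_group)
    union = set().union(*per_group.values())
    # A candidate missing from a group's list gets the worst score (1.0).
    combined = {
        cid: sum(weights[g] * per_group[g].get(cid, 1.0) for g in per_group) / total_weight
        for cid in union
    }
    ranked = sorted(combined, key=combined.get)[:top_k]
    return ranked, [combined[cid] for cid in ranked]


rng = np.random.default_rng(0)
blocks = {"heart_rate": rng.normal(size=(200, 1)), "beat_props": rng.normal(size=(200, 3))}
query = {name: block[0] for name, block in blocks.items()}  # query with row 0
ranked, scores = hybrid_search(blocks, query, {"heart_rate": 1.0, "beat_props": 2.0}, top_k=5)
```

Fetching more than `top_k` candidates per group matters because a row that ranks moderately in every group can beat a row that ranks first in one group and poorly in the others.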

Before querying, we must preprocess the input ECG vector using the same transformations applied during indexing.

In [7]:
# Let's assume we want to search for the nearest neighbors of a specific ECG record by its ID (and we know it exists)
query_id = "ecg_99999"

# Extract the query vector for the specified ID and preprocess it
query_idx = data.index.get_loc(query_id)
query_vector = data.loc[[query_id]]
query_processed = preprocessor.transform(query_vector)

# Extract the cluster for the query vector
query_cluster = query_vector["cluster"].iloc[0]

Now, let's define a simple function to run the similarity search on different feature groups.

In [8]:
def run_similarity_query(
    searcher,
    query_processed,
    query_idx,
    query_cluster,
    top_k=100,
    selected_groups=None,
    database_size=None,
    df=data,
):
    """
    Run a similarity query and report key stats.

    Parameters
    ----------
        searcher: SimilaritySearcher instance.
        query_processed: Preprocessed query vector (1xD).
        query_idx: Index of the query in the database (used to exclude from results).
        query_cluster: Cluster label of the query ECG.
        top_k: Number of neighbors to retrieve (default = 100).
        selected_groups: Feature groups to query over.
        database_size: Optional, for logging database size.
        df: DataFrame containing metadata (e.g., clusters).
    """

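    # Perform the search; perf_counter is a monotonic, high-resolution clock
    # better suited to timing sub-millisecond queries than time.time()
    start_time = time.perf_counter()
    indices, distances = searcher.search(
        query_processed,
        top_k=top_k + 1,  # fetch extra to allow removing the query itself
        selected_groups=selected_groups,
    )
    elapsed_time = time.perf_counter() - start_time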

    # Post-process results to remove the query itself
    indices = np.asarray(indices).flatten()
    distances = np.asarray(distances).flatten()

    if query_idx in indices:
        mask = indices != query_idx
        indices = indices[mask]
        distances = distances[mask]

    # Trim to top-k
    top_k_actual = min(top_k, len(indices))
    indices = indices[:top_k_actual]
    distances = distances[:top_k_actual]

    # Print results
    print(f"Query by group(s): {selected_groups}")
    print(f"Number of ECGs in database: {database_size or len(df)}")
    print(f"Top-k retrieved: {top_k_actual}")
    print(f"Query time: {elapsed_time:.4f} seconds")

    # Print cluster distribution of the retrieved ECGs
    cluster_counts = df.iloc[indices]["cluster"].value_counts(normalize=True)
    print(f"\nQuery ECG cluster: {query_cluster}")
    print("Cluster distribution among retrieved ECGs:")
    print(cluster_counts.round(3))

##### Example 1: Heart rate only

In [9]:
run_similarity_query(
    searcher,
    query_processed,
    query_idx,
    query_cluster,
    top_k=100,
    selected_groups=["heart_rate"],
)

Query by group(s): ['heart_rate']
Number of ECGs in database: 100000
Top-k retrieved: 100
Query time: 0.0005 seconds

Query ECG cluster: pvc_heavy
Cluster distribution among retrieved ECGs:
cluster
normal        0.72
pvc_heavy     0.13
ischemia      0.12
afib_prone    0.03
Name: proportion, dtype: float64


**Analysis**: We use only `heart_rate` for the similarity search, which triggers the hybrid index (`HybridIndexer`) backed by `IndexHNSWFlat`. Since only one feature group is selected, the system executes a single top-k search without margin expansion, normalization, or score aggregation. The retrieved ECGs come mainly from clusters that have a similar heart rate by design (e.g. "normal" and "ischemia"), in proportions that reflect the dataset's cluster distribution (e.g. "normal" is the majority cluster). The query time is well below the 1-second limit.

##### Example 2: Heart rate and beat type proportions

In [10]:
run_similarity_query(
    searcher,
    query_processed,
    query_idx,
    query_cluster,
    top_k=100,
    selected_groups=["heart_rate", "beat_props"],
)

Query by group(s): ['heart_rate', 'beat_props']
Number of ECGs in database: 100000
Top-k retrieved: 100
Query time: 0.0100 seconds

Query ECG cluster: pvc_heavy
Cluster distribution among retrieved ECGs:
cluster
normal        0.52
pvc_heavy     0.37
ischemia      0.09
afib_prone    0.02
Name: proportion, dtype: float64


**Analysis**: We now include the `beat_props` feature group, and the system retrieves a higher proportion of examples from the cluster of the queried ECG. Because more than one feature group is selected, the system now retrieves `top_k × margin_factor` candidates from each group, then applies normalization and score aggregation. The additional index lookup, margin expansion, and post-processing slightly increase the query time, but it remains well below the 1-second threshold.

##### Example 3: Heart rate, beat type proportions and risk of AFib

In [11]:
run_similarity_query(
    searcher,
    query_processed,
    query_idx,
    query_cluster,
    top_k=100,
    selected_groups=["heart_rate", "beat_props", "risk_afib"],
)

Query by group(s): ['heart_rate', 'beat_props', 'risk_afib']
Number of ECGs in database: 100000
Top-k retrieved: 100
Query time: 0.0857 seconds

Query ECG cluster: pvc_heavy
Cluster distribution among retrieved ECGs:
cluster
pvc_heavy     0.50
normal        0.42
ischemia      0.06
afib_prone    0.02
Name: proportion, dtype: float64


**Analysis**: When we include another feature group, `risk_afib`, the proportion of samples from the cluster of the query vector increases further. Note that we queried on `risk_afib`, but the system used the combined `risk_scores` index, so the results would be the same whichever risk score we queried on. The query time increased because of the additional feature group but is still well below the 1-second limit.
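
One way such a mapping from a requested feature name to its index group could work is shown below. This is a hypothetical sketch: the prefix rule and the function `resolve_groups` are assumptions, since the notebook does not show how `SimilaritySearcher` resolves `risk_afib`.

```python
GROUPS = ("heart_rate", "risk_scores", "embedding", "beat_props")


def resolve_groups(selected):
    """Map requested feature names to the index groups that serve them."""
    resolved = []
    for name in selected:
        if name in GROUPS:
            group = name
        elif name.startswith("risk_"):
            # Assumed rule: any single risk score is served by the shared index.
            group = "risk_scores"
        else:
            raise ValueError(f"Unknown feature group: {name}")
        if group not in resolved:
            resolved.append(group)
    return resolved


print(resolve_groups(["heart_rate", "beat_props", "risk_afib"]))
# ['heart_rate', 'beat_props', 'risk_scores']
```

Under this rule every individual risk score resolves to the same group, which is why the choice of risk score does not change the results.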

##### Example 4: All feature groups

In [12]:
run_similarity_query(searcher, query_processed, query_idx, query_cluster, top_k=100)

Query by group(s): None
Number of ECGs in database: 100000
Top-k retrieved: 100
Query time: 0.0003 seconds

Query ECG cluster: pvc_heavy
Cluster distribution among retrieved ECGs:
cluster
pvc_heavy    1.0
Name: proportion, dtype: float64


**Analysis**: In this example we do not specify any selected groups, so the similarity search runs through `SingleIndexer` on the index built over all the features. This reduces the search time dramatically, as there is a single `HNSW` search. In terms of cluster distribution among the retrieved examples, all of them belong to the cluster of the queried ECG.