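In Query-by-Example Spoken Term Detection (QbE-STD), the embedding-matching scheme is the step that discriminates between words. The Python cells below sketch one way to implement it: generate embeddings with A2E-Net, average the query embeddings into a basis embedding, score each acoustic segment against it with cosine similarity, smooth the scores, and threshold them.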

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
embedding_model = tf.keras.models.load_model("model_path")

## Generate Embeddings from A2E-Net

This step assumes you have a trained A2E-Net model (loaded above as `embedding_model`), a sequence of acoustic segments to search (`acoustic_segments`), and a sequence of acoustic signals of the spoken query term (`spoken_query_terms`):

In [None]:
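# Each element is a NumPy array of acoustic features for one segment or one query example.
# The empty lists are placeholders; fill them with your own data before running the cells below.
acoustic_segments = []   # list of NumPy arrays
spoken_query_terms = []  # list of NumPy arrays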

# Generate embeddings for acoustic_segments and spoken_query_terms
segment_embeddings = [embedding_model.predict(np.expand_dims(segment, axis=0)) for segment in acoustic_segments]
query_embeddings = [embedding_model.predict(np.expand_dims(term, axis=0)) for term in spoken_query_terms]

## Compute Basis Embedding 

In [None]:
# Compute the basis embedding c_b by averaging the embeddings of spoken_query_terms
c_b = np.mean(query_embeddings, axis=0)
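
To make the shapes in the averaging step concrete, here is a minimal sketch with made-up numbers. It assumes each call to `predict` on a single-item batch returns an array of shape `(1, D)`; the `demo_` names and values are hypothetical and not part of the pipeline.

```python
import numpy as np

# Three made-up query embeddings, each shaped (1, 4) like a single-item predict() output
demo_query_embeddings = [
    np.array([[1.0, 0.0, 2.0, 4.0]]),
    np.array([[3.0, 0.0, 2.0, 0.0]]),
    np.array([[2.0, 3.0, 2.0, 2.0]]),
]

# Averaging over axis 0 collapses the list dimension and keeps the (1, D) shape
demo_c_b = np.mean(demo_query_embeddings, axis=0)
print(demo_c_b.shape)  # (1, 4)
```

Under that assumption the basis embedding stays two-dimensional, which is the input shape `cosine_similarity` expects.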

## Calculate Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

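# Cosine similarity between each segment embedding and the basis embedding c_b.
# cosine_similarity expects 2-D inputs; with (1, D) embeddings each result is a (1, 1) array.
cosine_scores = [cosine_similarity(embedding, c_b) for embedding in segment_embeddings]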
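
For reference, the same score can be computed for all segments at once with plain NumPy. This is a sketch, not the notebook's method: `cosine_to_basis` is a hypothetical helper, and the toy 2-D embeddings are invented for illustration.

```python
import numpy as np

def cosine_to_basis(embeddings, basis):
    """Cosine similarity between each row of `embeddings` (N, D) and `basis` (D,)."""
    embeddings = np.asarray(embeddings, dtype=float)
    basis = np.asarray(basis, dtype=float).ravel()
    # Product of the row norms and the basis norm; the floor guards against division by zero
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(basis)
    return embeddings @ basis / np.maximum(norms, 1e-12)

# Toy embeddings: parallel, orthogonal, and at 45 degrees to the basis
demo_segments = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demo_basis = np.array([1.0, 0.0])
demo_cosine = cosine_to_basis(demo_segments, demo_basis)
print(demo_cosine)
```

The three toy scores come out as 1, 0, and roughly 0.707, matching the angles between each vector and the basis. To use the helper on the notebook's data you would first stack the `(1, D)` embeddings, for example with `np.vstack(segment_embeddings)`.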

## Apply Simple Moving Average (SMA)

In [None]:
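def simple_moving_average(sequence, window_size):
    # Unweighted mean over each full window; mode='valid' drops partial windows,
    # so the output has len(sequence) - window_size + 1 elements.
    return np.convolve(sequence, np.ones(window_size)/window_size, mode='valid')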

In [None]:
window_size = 3
smoothed_scores = simple_moving_average(np.array(cosine_scores).flatten(), window_size)
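
Because `mode='valid'` shortens the sequence, the indices of the smoothed scores no longer line up one-to-one with the segments. The sketch below uses made-up scores (the `demo_` names are hypothetical) to show the effect.

```python
import numpy as np

demo_raw = np.array([0.1, 0.2, 0.9, 0.9, 0.9, 0.2])
demo_window = 3

# Same computation as simple_moving_average, written inline so the cell is self-contained
demo_smoothed = np.convolve(demo_raw, np.ones(demo_window) / demo_window, mode="valid")

# 6 raw scores and a window of 3 give 6 - 3 + 1 = 4 smoothed scores
print(len(demo_raw), len(demo_smoothed))

# Smoothed index i averages raw scores i .. i + demo_window - 1
for i, value in enumerate(demo_smoothed):
    print(i, (i, i + demo_window - 1), round(float(value), 3))
```

Here smoothed index 2 is the only one that averages three high scores (raw indices 2 to 4), so any index returned by a later threshold step refers to a window of segments rather than a single segment.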

## Spoken Term Detection (Word-Searching)

In [None]:
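threshold = 0.8  # example value; tune it on held-out data
# Indices refer to smoothed_scores: index i covers segments i .. i + window_size - 1
word_occurrences = np.where(smoothed_scores >= threshold)[0]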
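
The thresholding step returns one index per smoothed score, so a single spoken occurrence usually shows up as a run of consecutive indices. One possible post-processing step, sketched here under that assumption, merges each run into a span of segment indices. `group_detections` is a hypothetical helper, not part of the original scheme, and the indices are made up.

```python
import numpy as np

def group_detections(indices, window_size):
    """Merge consecutive smoothed-score indices into (first_segment, last_segment) spans."""
    indices = np.asarray(indices, dtype=int)
    if indices.size == 0:
        return []
    # Start a new run wherever neighbouring indices are more than 1 apart
    breaks = np.where(np.diff(indices) > 1)[0] + 1
    runs = np.split(indices, breaks)
    # With mode='valid', smoothed index i covers segments i .. i + window_size - 1
    return [(int(run[0]), int(run[-1]) + window_size - 1) for run in runs]

demo_hits = np.array([2, 3, 4, 10, 11])
demo_spans = group_detections(demo_hits, window_size=3)
print(demo_spans)
```

With these inputs the two runs map to segments 2 through 6 and 10 through 13. On the notebook's data the call would be `group_detections(word_occurrences, window_size)`.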