# Embedding Space Optimization for DAS Data

This notebook demonstrates how to **fine-tune** a pre-trained embedding model 
to maximize the separation between two classes (noise vs. event) in embedding space. 

**Key Steps**:

1. Load a pre-trained model and produce embeddings for your training data.  
2. Compute the centroids (means) for event and noise embeddings.  
3. Define a custom distance-based loss that pulls each sample toward its class centroid.  
4. Fine-tune the embedding layer with that loss.  
5. Classify by comparing distances to the noise/event centroids (with or without a threshold).  
6. Evaluate performance with a confusion matrix, a classification report, and PR/ROC curves.  


In [1]:
# notebook cell 1: imports
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from my_scripts.embedding_optimization import (
    compute_embeddings,
    compute_class_centers,
    fine_tune_embedding_model,
    classify_by_distance,
    evaluate_performance,
    plot_pr_curve,
    plot_roc_curve
)




## Step 1: Load Data and Pre-trained Model

We assume you already have:
- `X_train.npy`, `y_train.npy`
- `X_test.npy`, `y_test.npy`
- A pre-trained model saved under `saved_models/` (the cells below load 
  `saved_models/model_imb`) that contains a layer named `embedding_layer`.


In [2]:
# notebook cell 2: load data
X_train = np.load('train_data/X_train.npy')
y_train = np.load('train_data/y_train.npy')
X_test  = np.load('test_data/X_test.npy')
y_test  = np.load('test_data/y_test.npy')

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test  shape:", X_test.shape)
print("y_test  shape:", y_test.shape)

base_model_path = 'saved_models/model_imb'  # Example path


X_train shape: (4884, 288, 695, 1)
y_train shape: (4884,)
X_test  shape: (1221, 288, 695, 1)
y_test  shape: (1221,)


## Step 2: Obtain Initial Embeddings and Compute Centroids

We build an embedding model that shares the pre-trained model's input and 
outputs the activations of `embedding_layer`, then compute the centroid 
(mean embedding) of the noise and event classes.
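
The centroid computation can be sketched in plain NumPy. This is only an illustration, not the code behind `compute_class_centers`: the function name `class_centers_sketch` is hypothetical, and the label encoding (1 = event, 0 = noise) is an assumption that may not match `y_train`.

```python
import numpy as np

def class_centers_sketch(embeddings, labels, event_label=1, noise_label=0):
    """Return (event_center, noise_center), the mean embedding of each class.

    Assumption: labels use 1 for event and 0 for noise.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    event_mask = labels == event_label
    noise_mask = labels == noise_label
    if not event_mask.any() or not noise_mask.any():
        raise ValueError("both classes must be present to compute centroids")
    # Average over the sample axis, leaving one vector per class.
    event_center = embeddings[event_mask].mean(axis=0)
    noise_center = embeddings[noise_mask].mean(axis=0)
    return event_center, noise_center

# Toy data: four 2-D embeddings, two per class.
toy_embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_event_center, toy_noise_center = class_centers_sketch(toy_embeddings, toy_labels)
print("event center:", toy_event_center)
print("noise center:", toy_noise_center)
```

Each centroid has the same dimensionality as a single embedding, which is why the real run reports shape `(128,)` for both.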


In [3]:
import tensorflow as tf
from tensorflow.keras.models import load_model, Model

# Load entire model
pretrained_model = load_model(base_model_path)
# Create an embedding model that outputs from the 'embedding_layer'
embedding_model = Model(
    inputs=pretrained_model.input,
    outputs=pretrained_model.get_layer('embedding_layer').output
)

# Produce embeddings for training data
embeddings_train = compute_embeddings(embedding_model, X_train, batch_size=64)
# Compute centroids
event_center, noise_center = compute_class_centers(embeddings_train, y_train)

print("Event center shape:", event_center.shape)
print("Noise center shape:", noise_center.shape)


2024-12-27 23:39:14.167655: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-12-27 23:39:17.451765: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3910325760 exceeds 10% of free system memory.


Event center shape: (128,)
Noise center shape: (128,)


## Step 3: Fine-Tune the Embedding Model with Distance Loss

We'll optimize the embedding space by pulling each sample closer 
to its class centroid. The `distance_margin` argument (10.0 here) sets the margin used by the loss.
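
The loss itself lives in `my_scripts.embedding_optimization` and is not shown here. As a rough sketch of one plausible reading, the function below pulls each sample toward its own centroid and adds a hinge penalty when the sample sits closer than `distance_margin` to the opposite centroid. The name `centroid_margin_loss_sketch`, the Euclidean distance, the squared terms, and the 1 = event encoding are all assumptions. It uses NumPy for readability; a training version would need the equivalent TensorFlow ops.

```python
import numpy as np

def centroid_margin_loss_sketch(embeddings, labels, event_center, noise_center,
                                distance_margin=10.0, event_label=1):
    """Mean of (pull toward own centroid) + (hinge push from the other centroid).

    Assumption: this is one possible distance-based loss, not necessarily
    the one used by fine_tune_embedding_model.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    is_event = np.asarray(labels) == event_label
    dist_event = np.linalg.norm(embeddings - event_center, axis=1)
    dist_noise = np.linalg.norm(embeddings - noise_center, axis=1)
    # Pick, per sample, the distance to its own and to the opposite centroid.
    dist_own = np.where(is_event, dist_event, dist_noise)
    dist_other = np.where(is_event, dist_noise, dist_event)
    pull = dist_own ** 2
    # Penalize only samples closer than the margin to the wrong centroid.
    push = np.maximum(0.0, distance_margin - dist_other) ** 2
    return float(np.mean(pull + push))

toy_event_center = np.array([10.0, 0.0])
toy_noise_center = np.array([0.0, 0.0])

# Samples sitting exactly on their own centroid incur no loss.
loss_ideal = centroid_margin_loss_sketch(
    np.array([[0.0, 0.0], [10.0, 0.0]]), np.array([0, 1]),
    toy_event_center, toy_noise_center)

# A noise sample halfway between the centroids is penalized by both terms.
loss_midpoint = centroid_margin_loss_sketch(
    np.array([[5.0, 0.0]]), np.array([0]),
    toy_event_center, toy_noise_center)

print(loss_ideal, loss_midpoint)
```

Under this reading, a larger margin demands more clearance from the opposite centroid before the push term vanishes.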


In [4]:
fine_tuned_model, history = fine_tune_embedding_model(
    base_model_path=base_model_path,
    X_train=X_train,
    y_train=y_train,
    event_center=event_center,
    noise_center=noise_center,
    distance_margin=10.0,
    batch_size=64,
    epochs=50,  # Kept low for demonstration; you can go higher (e.g. 500)
    checkpoint_path='saved_models/optimize/model_epoch_{epoch:03d}.h5'
)


## Step 4: Use the Fine-Tuned Model to Generate New Embeddings

We now recompute embeddings for both the train and test sets 
with the fine-tuned embedding model. These embeddings feed the distance-based classification in Step 5.


In [5]:
embeddings_train_up = compute_embeddings(fine_tuned_model, X_train, batch_size=64)
embeddings_test_up  = compute_embeddings(fine_tuned_model, X_test,  batch_size=64)

# Fine-tuning moves the embeddings, so recompute the centroids from the
# updated train embeddings. Reusing the original (event_center, noise_center)
# is also possible.
updated_event_center, updated_noise_center = compute_class_centers(embeddings_train_up, y_train)



## Step 5: Decide on a Distance Threshold and Classify

Here we illustrate **two** ways to classify:

1. **Min-Distance Rule**: Classify as event if distance_to_event < distance_to_noise.  
2. **Noise-Center Threshold**: If distance_to_noise > some threshold => event.  

The cell below runs both: the min-distance rule first, then the noise-center threshold.
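
The two rules can be sketched as follows. This is a minimal illustration, not the code behind `classify_by_distance`: the name `classify_sketch`, the Euclidean distance, and the 1 = event output encoding are assumptions. The return order mirrors how the notebook unpacks the real function.

```python
import numpy as np

def classify_sketch(embeddings, event_center, noise_center, threshold=None):
    """Return (y_pred, dist_event, dist_noise) with y_pred = 1 for event.

    Assumption: Euclidean distance; the real helper may use another metric.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    dist_event = np.linalg.norm(embeddings - event_center, axis=1)
    dist_noise = np.linalg.norm(embeddings - noise_center, axis=1)
    if threshold is None:
        # Min-distance rule: event if the event centroid is the closer one.
        y_pred = (dist_event < dist_noise).astype(int)
    else:
        # Noise-center threshold: event if far enough from the noise centroid.
        y_pred = (dist_noise > threshold).astype(int)
    return y_pred, dist_event, dist_noise

toy_event_center = np.array([10.0, 0.0])
toy_noise_center = np.array([0.0, 0.0])
toy_points = np.array([[1.0, 0.0], [4.0, 0.0], [9.0, 0.0]])

pred_min_dist, _, _ = classify_sketch(toy_points, toy_event_center, toy_noise_center)

# Threshold taken from hypothetical training noise distances (99th percentile).
toy_train_noise_dist = np.array([0.5, 1.0, 1.5, 2.0])
toy_threshold = np.percentile(toy_train_noise_dist, 99.0)
pred_thresh, _, _ = classify_sketch(
    toy_points, toy_event_center, toy_noise_center, threshold=toy_threshold)

print(pred_min_dist, pred_thresh, toy_threshold)
```

The middle point shows how the rules can disagree: it is closer to the noise centroid, yet farther from it than almost all training noise, so only the threshold rule flags it as an event.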


In [6]:
# notebook cell 6: distance-based classification

# 1) Classify with no explicit threshold => whichever center is closer
y_pred_min_dist, dist_event, dist_noise = classify_by_distance(
    embeddings_test_up, 
    updated_event_center, 
    updated_noise_center, 
    threshold=None
)

evaluate_performance(y_test, y_pred_min_dist, title="Min-Distance Classification")

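# 2) Classify by threshold on noise distance:
#    if distance_to_noise > threshold => event
# Pick the threshold from the training set, e.g. the 99th percentile of the
# distances from training noise samples to the noise center.
_, _, dist_noise_train = classify_by_distance(
    embeddings_train_up,
    updated_event_center,
    updated_noise_center,
    threshold=None
)
# Assumption: label 0 marks noise in y_train; adjust if your encoding differs.
noise_dist_train = np.asarray(dist_noise_train)[y_train == 0]
threshold = np.percentile(noise_dist_train, 99.0)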

y_pred_thresh, dist_event_test, dist_noise_test = classify_by_distance(
    embeddings_test_up, 
    updated_event_center, 
    updated_noise_center, 
    threshold=threshold
)
evaluate_performance(y_test, y_pred_thresh, title="Threshold Classification")



## Step 6: Precision-Recall and ROC Curves

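We'll treat `distance_to_event` or `distance_to_noise` as a "score."  
PR/ROC curves conventionally expect a larger score to mean more event-like. `distance_to_noise` already has that orientation, so it can be used as is. `distance_to_event` has the opposite one (smaller distance => more event-like), so multiply it by -1 first.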
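
The effect of score orientation can be checked on toy data. The sketch below computes ROC-AUC directly as the probability that a random event sample outscores a random noise sample, with ties counting half. The name `roc_auc_sketch` and the 1 = event encoding are assumptions, and the pairwise comparison is only meant for small arrays. It does not reproduce `plot_roc_curve`.

```python
import numpy as np

def roc_auc_sketch(y_true, scores, positive_label=1):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == positive_label]
    neg = scores[y_true != positive_label]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

toy_y = np.array([0, 0, 1, 1])                    # assumption: 1 = event
toy_dist_noise = np.array([1.0, 2.0, 8.0, 9.0])   # events lie far from noise center
toy_dist_event = np.array([9.0, 8.0, 2.0, 1.0])   # events lie near event center

auc_noise = roc_auc_sketch(toy_y, toy_dist_noise)       # correct orientation
auc_event_raw = roc_auc_sketch(toy_y, toy_dist_event)   # wrong orientation
auc_event_neg = roc_auc_sketch(toy_y, -toy_dist_event)  # negated: correct again

print(auc_noise, auc_event_raw, auc_event_neg)
```

An AUC well below 0.5 on real data is a quick sign that the score has the wrong sign.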


In [7]:
# Use distance to the noise center as the score: larger => more event-like
pr_auc = plot_pr_curve(y_test, dist_noise_test, label_name='Distance to Noise Center')
roc_auc = plot_roc_curve(y_test, dist_noise_test, label_name='Distance to Noise Center')
print(f"PR-AUC: {pr_auc:.3f}, ROC-AUC: {roc_auc:.3f}")

