### Speaker Identification: Testing & Open-Set Prediction

This notebook **tests speaker identification** on new audio samples using a **hybrid open-set recognition approach**.

It combines two techniques:

1. **Classifier-Based Prediction**  
   A custom PyTorch model trained using ECAPA-TDNN embeddings from known speakers.  
   This model performs traditional **closed-set classification** — it chooses one of the known speaker classes.

2. **Cosine Similarity to Speaker Centroids**  
   We calculate the **cosine similarity** between the test embedding and precomputed **speaker centroids** (average embeddings).  
   This allows us to detect when a voice is **unfamiliar or closer to a different speaker** than the classifier predicts.
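
The centroid comparison can be illustrated without any ML framework. The sketch below uses made-up 2-D vectors in place of the 192-dimensional ECAPA-TDNN embeddings; the helper names (`l2_normalize`, `centroid`, `cosine_similarity`) and the speaker keys are hypothetical and exist only for this illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def centroid(embeddings):
    """Average a speaker's embeddings, then L2-normalize the mean."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy embeddings for two enrolled speakers.
speaker_a = [[1.0, 0.0], [0.8, 0.2]]
speaker_b = [[0.0, 1.0], [0.1, 0.9]]
centroids_demo = {"a": centroid(speaker_a), "b": centroid(speaker_b)}

# Score a test embedding against every centroid and keep the closest one.
test_embedding = [0.9, 0.1]
scores = {name: cosine_similarity(test_embedding, c) for name, c in centroids_demo.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The test vector points in the same direction as speaker `a`'s centroid, so `a` gets the highest score; a voice far from every centroid would score low against all of them.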

---

### Final Prediction Logic: Hybrid Decision

The `hybrid_speaker_predict()` function returns one of three outcomes. The unknown-speaker check (Case 3) runs first, so Cases 1 and 2 apply only when the best cosine score reaches the threshold:

- **Case 1: Classifier and Cosine Agree**  
  → High-confidence match:  
  `"I am damn sure — it’s [SpeakerName]!"`

- **Case 2: Classifier and Cosine Disagree**  
  → The nearest centroid belongs to a different speaker than the classifier's pick, so the message names both and favors the cosine match:  
  `"Classifier might be confused with [X], but I guess this is [Y]!"`

- **Case 3: Unknown Speaker**  
  → The best cosine similarity is below the threshold (default 0.5), so no known speaker is close enough:  
  `"❓ This voice doesn’t match anyone I know — Unknown Speaker."`
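
The three cases reduce to a small pure function once the classifier's label and the per-speaker cosine scores are known. This is a minimal sketch, not the notebook's function: `hybrid_decision` and its tuple return value are hypothetical, chosen so the branching can be tested without loading any model.

```python
def hybrid_decision(pred_label, cosine_scores, threshold=0.5):
    """Return (case, label) from a classifier label and centroid scores.

    pred_label:    label chosen by the closed-set classifier
    cosine_scores: dict mapping label -> cosine similarity to that centroid
    threshold:     minimum best score for the voice to count as known
    """
    best_label = max(cosine_scores, key=cosine_scores.get)
    best_score = cosine_scores[best_label]

    # The unknown check comes first: a low best score overrides the classifier.
    if best_score < threshold:
        return ("unknown", None)
    if pred_label == best_label:
        return ("agree", best_label)
    # Disagreement: the nearest centroid wins over the classifier's pick.
    return ("mismatch", best_label)


print(hybrid_decision("x", {"x": 0.8, "y": 0.3}))  # classifier and cosine agree
print(hybrid_decision("x", {"x": 0.6, "y": 0.7}))  # cosine prefers another speaker
print(hybrid_decision("x", {"x": 0.2, "y": 0.4}))  # nobody is close enough
```

Returning a structured result rather than a display string keeps the decision separate from the wording shown to the user.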

---

### What Happens in This Notebook?

- Loads the **trained classifier** and **centroids**
- Loads the **SpeechBrain ECAPA-TDNN model** to extract speaker embeddings
- Preprocesses uploaded/recorded audio
- Uses the **hybrid prediction function** to make a robust speaker guess
- Optionally runs inside a **Gradio interface** for real-time testing

---

### Why This Hybrid Method?

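- **Aims to be more robust on noisy real-world audio**, because a confident answer requires the classifier and the cosine match to agree
- **Can flag unknown voices** outside the training set via the cosine threshold
- **Makes classifier disagreements visible** by reporting the nearest centroid alongside the classifier's pick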


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from models import EmbeddingClassifierBN, preprocess_audio
import torch
from speechbrain.inference.speaker import SpeakerRecognition  # `speechbrain.pretrained` is deprecated in SpeechBrain 1.0+
import joblib
from IPython.display import Audio
import gradio as gr
from collections import defaultdict
import torch.nn.functional as F



In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [4]:
# Load the label encoder 
label_encoder = joblib.load("label_encoder.pkl")
class_names = list(label_encoder.classes_)

# === Constants (same as training) ===
TARGET_SAMPLE_RATE = 16000
MAX_AUDIO_DURATION_SEC = 12
MAX_SAMPLES = TARGET_SAMPLE_RATE * MAX_AUDIO_DURATION_SEC

# --- Load the ECAPA-TDNN model for embedding extraction:
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
verification.eval()

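# Size the output layer from the label encoder so it stays in sync with the saved classes
model = EmbeddingClassifierBN(embedding_dim=192, num_classes=len(class_names)).to(device)
# weights_only=True restricts unpickling to tensors, which is all a state dict needs
state_dict = torch.load("saved_models/best_model_overall_fold_2.pt", map_location=device, weights_only=True)
model.load_state_dict(state_dict)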
model.eval()





EmbeddingClassifierBN(
  (fc1): Linear(in_features=192, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=256, out_features=256, bias=True)
  (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (final): Linear(in_features=128, out_features=34, bias=True)
  (relu): ReLU()
  (drop): Dropout(p=0.3, inplace=False)
)
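
The constants in the cell above suggest that each clip is brought to a fixed length of `MAX_SAMPLES` = 16000 × 12 = 192000 samples. `preprocess_audio` is imported from `models` and its body is not shown in this notebook, so the following is only a guess at what the length-fixing step might look like. The `pad_or_truncate` helper is hypothetical, and it works on a plain list of floats rather than a tensor so that it runs on its own.

```python
TARGET_SAMPLE_RATE = 16000
MAX_AUDIO_DURATION_SEC = 12
MAX_SAMPLES = TARGET_SAMPLE_RATE * MAX_AUDIO_DURATION_SEC  # 192000 samples

def pad_or_truncate(samples, max_samples=MAX_SAMPLES):
    """Return exactly `max_samples` values (assumed behavior, not the notebook's code).

    Longer clips are cut at the end; shorter clips are zero-padded at the end.
    """
    if len(samples) >= max_samples:
        return list(samples[:max_samples])
    return list(samples) + [0.0] * (max_samples - len(samples))


short_clip = [0.1] * 5          # far shorter than 12 seconds
long_clip = [0.1] * 200_000     # slightly longer than 12 seconds
print(len(pad_or_truncate(short_clip)), len(pad_or_truncate(long_clip)))
```

If the real preprocessing does zero-pad, the relative length passed to the encoder would ideally reflect the unpadded portion; whether that matters here depends on how `preprocess_audio` is written.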

In [5]:
# Load precomputed embeddings and labels from final_dataset.pt
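# map_location="cpu" lets the file load even if it was saved from GPU tensors on another machine
data = torch.load("final_dataset.pt", map_location="cpu")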
X_embeddings = data["embeddings"]  # Tensor of shape [N, 192]
y_labels = data["labels"]          # Tensor of shape [N]

# Group embeddings by speaker label and compute centroids
speaker_embeddings = defaultdict(list)
for emb, label in zip(X_embeddings, y_labels):
    speaker_embeddings[int(label)].append(emb)

centroids = {}
for label, emb_list in speaker_embeddings.items():
    centroid = torch.stack(emb_list, dim=0).mean(dim=0)
    centroid = F.normalize(centroid, p=2, dim=0)
    # Move centroid to the same device as your model
    centroids[label] = centroid.to(device)

print("Computed centroids for {} speakers.".format(len(centroids)))


Computed centroids for 34 speakers.




In [6]:

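def hybrid_speaker_predict(filepath, threshold=0.5):
    """Identify the speaker in an audio file, or report an unknown speaker.

    Combines the classifier's prediction with cosine similarity to the
    speaker centroids. `threshold` is the minimum best cosine score for
    a voice to count as known.
    """
    # Step 1: Preprocess & embed
    waveform = preprocess_audio(filepath)
    if waveform is None:
        return "Failed to process audio."

    waveform = waveform.to(device)
    wav_lens = torch.tensor([1.0], device=device)  # relative length: use the full clip

    with torch.no_grad():
        embedding = verification.encode_batch(waveform, wav_lens=wav_lens).view(1, -1).to(device)
        logits = model(embedding)
        # argmax over logits picks the same class as argmax over softmax probabilities
        pred_class = torch.argmax(logits, dim=1).item()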

    # Step 2: Cosine similarity with all centroids
    emb = F.normalize(embedding.view(-1), p=2, dim=0)
    cosine_scores = {
        label: F.cosine_similarity(emb.unsqueeze(0), centroid.unsqueeze(0).to(device)).item()
        for label, centroid in centroids.items()
    }
    best_cosine_label, best_score = max(cosine_scores.items(), key=lambda x: x[1])

    # Step 3: Decision Logic
    if best_score < threshold:
        return "❓ This voice doesn’t match anyone I know — Unknown Speaker."

    if pred_class == best_cosine_label:
        return f"  I am damn sure — it’s {class_names[pred_class]}!"

    return (
        f" Classifier might be confused with {class_names[pred_class]}, "
        f"but I guess this is  {class_names[best_cosine_label]}!"
    )
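
When a prediction looks surprising, it can help to see the runner-up speakers as well as the winner. The sketch below ranks a `cosine_scores` dictionary like the one built in Step 2. The `top_k_matches` helper, the scores, and the speaker names are all hypothetical and are not part of the notebook.

```python
def top_k_matches(cosine_scores, class_names, k=3):
    """Return the k best (speaker_name, score) pairs, highest score first."""
    ranked = sorted(cosine_scores.items(), key=lambda item: item[1], reverse=True)
    return [(class_names[label], score) for label, score in ranked[:k]]


# Made-up scores keyed by integer label, mirroring the centroid dictionary's keys.
demo_scores = {0: 0.42, 1: 0.81, 2: 0.65}
demo_names = ["asha", "ben", "chen"]

for name, score in top_k_matches(demo_scores, demo_names, k=2):
    print(f"{name}: {score:.2f}")
```

A small gap between the first and second entries is a hint that the match is ambiguous even when the best score clears the threshold.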


In [7]:
filepath1 = "testing/sheeba.mp3"

Audio(filepath1)


In [8]:

filepath2 = "testing/abhishek.mp3"

Audio(filepath2)

In [9]:
filepath3 = "testing/sicheng.mp3"

Audio(filepath3)

In [10]:

filepath4 = "testing/Aakash_not_in_training_set.mp3"

Audio(filepath4)

In [11]:

result = hybrid_speaker_predict(filepath1, threshold=0.5)
print("🔊 Prediction:", result)


🔊 Prediction:   I am damn sure — it’s marysheeba!


In [12]:
result = hybrid_speaker_predict(filepath2, threshold=0.5)
print("🔊 Prediction:", result)


🔊 Prediction:   I am damn sure — it’s abhishek!


In [13]:

result = hybrid_speaker_predict(filepath3, threshold=0.5)
print("🔊 Prediction:", result)

🔊 Prediction:   I am damn sure — it’s sicheng!


In [14]:

result = hybrid_speaker_predict(filepath4, threshold=0.5)
print("🔊 Prediction:", result)

🔊 Prediction: ❓ This voice doesn’t match anyone I know — Unknown Speaker.
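
The runs above all use `threshold=0.5`. One way such a value could be chosen is from held-out clips: collect the best centroid score for clips of enrolled speakers ("genuine") and for clips of people outside the training set ("impostor"), then pick the cut-off that misclassifies the fewest. The sketch below assumes that setup; `pick_threshold` is a hypothetical helper and every score in it is invented for illustration, not measured from this notebook.

```python
def pick_threshold(genuine_scores, impostor_scores):
    """Return (threshold, errors) minimizing false rejects plus false accepts.

    A score counts as "known" when score >= threshold, matching the
    notebook's rule that best_score < threshold means unknown.
    """
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best_threshold, best_errors = None, None
    for t in candidates:
        false_rejects = sum(1 for s in genuine_scores if s < t)
        false_accepts = sum(1 for s in impostor_scores if s >= t)
        errors = false_rejects + false_accepts
        # Strict comparison keeps the lowest threshold among ties.
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# Invented best-centroid scores for illustration only.
genuine = [0.72, 0.65, 0.81, 0.58]
impostor = [0.31, 0.44, 0.27, 0.49]
threshold, errors = pick_threshold(genuine, impostor)
print(threshold, errors)
```

With real data the two score groups usually overlap, so the chosen threshold trades false accepts against false rejects rather than separating them perfectly.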


In [15]:
def predict_speaker_gradio(audio_path):
    if audio_path is None:
        return "🚫 No audio file provided."

    try:
        prediction = hybrid_speaker_predict(audio_path, threshold=0.5)
        return f"🎤 {prediction}"
    except Exception as e:
        return f"❌ Error: {str(e)}"


In [16]:
iface = gr.Interface(
    fn=predict_speaker_gradio,
    inputs=gr.Audio(type="filepath", label="Upload Audio"),
    outputs=gr.Textbox(label="Who is the Black Sheep?"),
    title="🔊 Guess the Speaker",
    description="🎙️ Please record or upload at least 12 seconds of clear audio for best results.",
    theme="default",
)


🎤 Microphone recording may be blocked when the app is embedded in a Jupyter notebook, because Gradio renders its interface inside an iframe (a sandboxed HTML container) where the browser can deny microphone access.
To use the mic, open the app in its own browser tab (e.g., Chrome) via the localhost link.

In [17]:
iface.launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


