# 🎧 Call Quality Analyzer — Colab Notebook

**What:** Analyze a sales call (YouTube) and return the talk-time ratio,
the number of questions, the longest monologue, overall sentiment, and one actionable insight.  
**Test file:** https://www.youtube.com/watch?v=4ostqJD3Psc  
**Notes:** Uses lightweight models (Whisper tiny, Resemblyzer, VADER) so it runs on free Colab within the time limit.


In [19]:
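# Install dependencies (run once).
# - yt-dlp: download YouTube audio
# - faster-whisper: fast Whisper implementation (ASR)
# - resemblyzer: speaker embeddings
# - scikit-learn: KMeans clustering for diarization
# - librosa, soundfile: audio I/O
# - nltk: sentiment (VADER)
# - pydub, webrtcvad: installed alongside; not imported directly in this notebook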
!pip -q install yt-dlp faster-whisper resemblyzer pydub librosa nltk scikit-learn webrtcvad soundfile

In [20]:
# small helper imports & pipeline timer
import time, json, os
start_all = time.time()

In [21]:
## Step 1 — Download & prepare audio
# Download the YouTube audio and convert it to 16 kHz mono WAV (the sample rate Whisper and Resemblyzer operate on).

YT_URL="https://www.youtube.com/watch?v=4ostqJD3Psc"
!yt-dlp -f bestaudio -x --audio-format wav -o "input.%(ext)s" "$YT_URL"
!ffmpeg -y -i input.wav -ac 1 -ar 16000 call_audio.wav -loglevel error
# Optional: keep only the first 600 s (10 min) of a long recording
# !ffmpeg -y -t 600 -i input.wav -ac 1 -ar 16000 call_audio.wav -loglevel error

[youtube] Extracting URL: https://www.youtube.com/watch?v=4ostqJD3Psc
[youtube] 4ostqJD3Psc: Downloading webpage
[youtube] 4ostqJD3Psc: Downloading tv simply player API JSON
[youtube] 4ostqJD3Psc: Downloading tv client config
[youtube] 4ostqJD3Psc: Downloading tv player API JSON
[info] 4ostqJD3Psc: Downloading 1 format(s): 251
[download] input.wav has already been downloaded
[ExtractAudio] Destination: input.wav
Deleting original file input.orig.wav (pass -k to keep)
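
Before transcription, it can help to confirm that the converted file really is 16 kHz mono. The following is a minimal sketch using only the standard-library `wave` module; the helper names `describe_wav` and `is_asr_ready` are hypothetical and not part of the notebook, and the check assumes ffmpeg wrote plain PCM WAV.

```python
import wave

def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_sec) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / float(rate)
    return rate, channels, duration

def is_asr_ready(path, expected_rate=16000):
    """True if the file is mono at the expected sample rate."""
    rate, channels, _ = describe_wav(path)
    return rate == expected_rate and channels == 1
```

In the notebook this could be called as `is_asr_ready("call_audio.wav")` right after the ffmpeg step.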


In [22]:
from faster_whisper import WhisperModel
print("Transcription step: loading model (tiny.en) — this is optimized for CPU / free Colab")
t0 = time.time()
model = WhisperModel("tiny.en", compute_type="int8")  # tiny.en + int8 = fast on CPU
segments, info = model.transcribe("call_audio.wav", word_timestamps=True, vad_filter=True)
segments = list(segments)  # transcribe() returns a lazy generator; list() runs the actual decoding
asr_time = time.time() - t0

full_text = " ".join([s.text.strip() for s in segments])
print(f"ASR done in {asr_time:.2f}s | language={info.language} | segments={len(segments)}")
print("Transcript preview:", full_text[:500].replace("\n"," "))

Transcription step: loading model (tiny.en) — this is optimized for CPU / free Colab
ASR done in 18.12s | language=en | segments=32
Transcript preview: Thank you for calling Nissan. My name is Lauren. Can I have your name? Yeah, my name is John Smith. Thank you, John. How can I help you? I was just calling about to see how much it would cost to update the map in my car. I'd be happy to help you with that today. Did you receive a mailer from us? I did. Do you need the customer number? Yes, please. Okay. It's 1-5-2-4-3. Thank you. And the year making bottle of your vehicle? Yeah, I have a 2009 Nissan Altamond. Oh, nice car. Yeah, thank you. We re


In [23]:
## Step 2 — Diarization (Who spoke when)
# We compute short overlapping audio windows → speaker embeddings (Resemblyzer) → KMeans(2) clustering → merge adjacent windows to get speaker turns.

import librosa, numpy as np
from resemblyzer import VoiceEncoder
from sklearn.cluster import KMeans

# Load audio (16k mono)
wav, sr = librosa.load("call_audio.wav", sr=16000, mono=True)
total_dur = len(wav) / sr

# Window settings (heuristic: 1.5 s windows with a 0.5 s hop, so consecutive windows overlap by 1.0 s)
win = 1.5   # seconds
hop = 0.5   # seconds

# Build frames
frames = []
t = 0.0
while t < total_dur:
    s = int(t*sr)
    e = int(min((t+win)*sr, len(wav)))
    if e - s >= int(0.3*sr):  # keep frames >= 300ms
        # Note: the nominal end t+win can exceed total_dur for the last windows.
        frames.append((t, t+win, wav[s:e]))
    t += hop

# Quick energy-based VAD to skip silence frames
def energy(x): return float(np.mean(x**2))
energies = np.array([energy(f[2]) for f in frames])
thr = np.percentile(energies, 60)  # keep top ~40% energy (heuristic)
voiced_idx = np.where(energies >= thr)[0]
voiced_frames = [frames[i] for i in voiced_idx]

# Compute embeddings
encoder = VoiceEncoder(device="cpu")
embs = [encoder.embed_utterance(x.astype("float32")) for (_,_,x) in voiced_frames]

# Cluster (fallback if too few embeddings)
if len(embs) >= 2:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embs)
else:
    labels = [0] * len(embs)

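# Merge consecutive voiced windows with the same label into speaker turns.
# Durations are approximate: skipped silence between same-label windows is absorbed
# into the turn, and turns overlap by up to (win - hop) at speaker changes.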
speaker_segments = []
for (s,e,_), lab in zip(voiced_frames, labels):
    if not speaker_segments or lab != speaker_segments[-1]["spk"]:
        speaker_segments.append({"start": s, "end": e, "spk": int(lab)})
    else:
        speaker_segments[-1]["end"] = e
print(f"Estimated speaker segments: {len(speaker_segments)} (approx)")


Loaded the voice encoder model on cpu in 0.02 seconds.
Estimated speaker segments: 20 (approx)
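
The diarization turns and the ASR segments are produced independently, so the transcript itself carries no speaker labels. One way to combine them, sketched here under assumptions, is to give each ASR segment the speaker whose turn overlaps it the most. The function names `overlap` and `label_segments` are hypothetical, and the sketch expects plain dicts; the notebook's faster-whisper segments could be converted with `[{"start": s.start, "end": s.end, "text": s.text} for s in segments]`.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(asr_segments, speaker_turns):
    """Attach a 'spk' key to each ASR segment based on maximum time overlap.

    asr_segments:  list of dicts with 'start', 'end', 'text'
    speaker_turns: list of dicts with 'start', 'end', 'spk'
    Segments that overlap no turn get spk=None.
    """
    labeled = []
    for seg in asr_segments:
        best_spk, best_ov = None, 0.0
        for turn in speaker_turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_ov:
                best_spk, best_ov = turn["spk"], ov
        labeled.append({**seg, "spk": best_spk})
    return labeled
```

With labeled segments, per-speaker metrics such as questions asked by each speaker become possible, although the quality depends on how accurate the two-cluster diarization is.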


In [24]:
# --- talk-time & longest monologue ---
talk_time = {0: 0.0, 1: 0.0}
for seg in speaker_segments:
    dur = seg["end"] - seg["start"]
    talk_time[seg["spk"]] += dur
total = talk_time[0] + talk_time[1] + 1e-9
ratio0 = 100.0 * talk_time[0] / total
ratio1 = 100.0 * talk_time[1] / total
longest = max((s["end"]-s["start"]) for s in speaker_segments) if speaker_segments else 0.0

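# --- question counting (hybrid): count '?' marks, plus sentences that open with
# an interrogative word but were transcribed without a '?' ---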
import re
q_marks = full_text.count('?')
interrogatives = set(["what","why","how","when","where","which","who","can","could","would","will","is","are","do","does","did"])
sentences = re.split(r'(?<=[\.\!\?])\s+', full_text.lower())
extra_q = sum(1 for s in sentences if s.strip() and s.split()[0] in interrogatives and '?' not in s)
num_questions = q_marks + extra_q

# Print clean report
report = {
  "talk_time_ratio": {"A": round(ratio0,1), "B": round(ratio1,1)},
  "num_questions": int(num_questions),
  "longest_monologue_sec": round(longest,2),
  "transcript_preview": full_text[:400].replace("\n"," ")
}

print("=== Call Metrics ===")
print(json.dumps(report, indent=2))

=== Call Metrics ===
{
  "talk_time_ratio": {
    "A": 53.8,
    "B": 46.2
  },
  "num_questions": 8,
  "longest_monologue_sec": 13.0,
  "transcript_preview": "Thank you for calling Nissan. My name is Lauren. Can I have your name? Yeah, my name is John Smith. Thank you, John. How can I help you? I was just calling about to see how much it would cost to update the map in my car. I'd be happy to help you with that today. Did you receive a mailer from us? I did. Do you need the customer number? Yes, please. Okay. It's 1-5-2-4-3. Thank you. And the year maki"
}
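
Because the diarization windows overlap (1.5 s windows, 0.5 s hop), summing merged turn durations can count some audio twice at speaker changes. One possible alternative, sketched here as an assumption rather than the notebook's method, credits each voiced window with only its hop length. The names `talk_time_by_hop` and `talk_time_ratios` are hypothetical.

```python
from collections import defaultdict

def talk_time_by_hop(labels, hop=0.5):
    """Credit each voiced window with `hop` seconds for its cluster label."""
    totals = defaultdict(float)
    for lab in labels:
        totals[int(lab)] += hop
    return dict(totals)

def talk_time_ratios(totals):
    """Convert per-speaker seconds into percentages rounded to one decimal."""
    total = sum(totals.values())
    if total == 0:
        return {spk: 0.0 for spk in totals}
    return {spk: round(100.0 * sec / total, 1) for spk, sec in totals.items()}
```

In the notebook this could be called as `talk_time_ratios(talk_time_by_hop(labels, hop))`. The resulting percentages would generally differ from the ones reported above, since the two methods measure duration differently.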


In [25]:
import nltk
nltk.download('vader_lexicon', quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(full_text)
compound = scores['compound']
# Conventional VADER thresholds on the compound score
if compound >= 0.05:
    sentiment = 'positive'
elif compound <= -0.05:
    sentiment = 'negative'
else:
    sentiment = 'neutral'
print("Sentiment:", sentiment, scores)


Sentiment: positive {'neg': 0.006, 'neu': 0.75, 'pos': 0.244, 'compound': 0.997}
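
VADER's compound score is a normalized sum of word valences, so on a long transcript it tends to saturate near +1 or -1, as the 0.997 above suggests. A hedged alternative is to score each sentence separately and average the results. The sketch below takes the scorer as a parameter so it stays self-contained; `mean_sentence_compound` and `label_sentiment` are hypothetical names, and the sentence splitter is the same simple regex approach used for question counting.

```python
import re

def mean_sentence_compound(text, score_fn):
    """Average a per-sentence score over the text.

    score_fn: callable mapping a sentence string to a float in [-1, 1].
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(score_fn(s) for s in sentences) / len(sentences)

def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a label using symmetric thresholds."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

With the analyzer from this cell, a call could look like `mean_sentence_compound(full_text, lambda s: sia.polarity_scores(s)["compound"])`.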


In [26]:
# Build insight using simple rules
insights = []
if ratio0 > 70 or ratio1 > 70:
    insights.append("Talk-time is unbalanced (>70% by one speaker). Prompt the quieter speaker with open-ended questions.")
if (num_questions / max(0.01, (total_dur/60.0))) < 0.5:  # <0.5 q per minute ≈ <5 per 10 mins
    insights.append("Increase question density; ask more open-ended questions.")
if sentiment == "negative":
    insights.append("Sentiment is negative: address objections early and confirm value.")

if not insights:
    insights.append("Good balance detected. Close with a clear next step and timeline.")

# Heuristic guess (unverified): assume the speaker with more talk time is the sales rep.
rep = "Speaker A" if ratio0 >= ratio1 else "Speaker B"
cust = "Speaker B" if rep == "Speaker A" else "Speaker A"

final_report = {
  "talk_time_ratio": {"A": round(ratio0,1), "B": round(ratio1,1)},
  "num_questions": int(num_questions),
  "longest_monologue_sec": round(longest,2),
  "sentiment": sentiment,
  "insight": insights[0],
  "bonus_guess": {"sales_rep": rep, "customer": cust}
}

# Pretty print & save
print("=== FINAL REPORT ===")
print(json.dumps(final_report, indent=2))

# Save to JSON file
outname = "call_quality_results.json"
with open(outname, "w") as f:
    json.dump(final_report, f, indent=2)
print("Saved:", outname)


=== FINAL REPORT ===
{
  "talk_time_ratio": {
    "A": 53.8,
    "B": 46.2
  },
  "num_questions": 8,
  "longest_monologue_sec": 13.0,
  "sentiment": "positive",
  "insight": "Good balance detected. Close with a clear next step and timeline.",
  "bonus_guess": {
    "sales_rep": "Speaker A",
    "customer": "Speaker B"
  }
}
Saved: call_quality_results.json


In [27]:
print("Total pipeline time (s):", round(time.time() - start_all, 2))


Total pipeline time (s): 29.02

