# 🎧 Audio → Whisper (ASR) → Speaker Diarization → Summary

### Speech-to-Text System (EN2002a.wav - Audio file)

In [2]:
!pip install openai-whisper
!pip install jiwer
!pip install soundfile


Collecting jiwer
  Using cached jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.14.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading rapidfuzz-3.14.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m50.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rapidfuzz, jiwer
Successfully installed jiwer-4.0.0 rapidfuzz-3.14.1


## Load the Audio File (EN2002b.wav) and copy to the audiopath("/EN2002a.wav")

In [20]:
from google.colab import files
uploaded = files.upload()
filename = list(uploaded.keys())[0]
print("Uploaded:", filename)

IndexError: list index out of range

In [3]:
!pip install openai-whisper jiwer librosa soundfile transformers torch pyannote.audio datasets
!apt-get install ffmpeg -y
! pip install pyannote.audio== 3.1.1


Collecting pyannote.audio
  Downloading pyannote_audio-3.4.0-py2.py3-none-any.whl.metadata (11 kB)
Collecting asteroid-filterbanks>=0.4 (from pyannote.audio)
  Downloading asteroid_filterbanks-0.4.0-py3-none-any.whl.metadata (3.3 kB)
Collecting lightning>=2.0.1 (from pyannote.audio)
  Downloading lightning-2.5.5-py3-none-any.whl.metadata (39 kB)
Collecting pyannote.core<6.0,>=5.0.0 (from pyannote.audio)
  Downloading pyannote.core-5.0.0-py3-none-any.whl.metadata (1.4 kB)
Collecting pyannote.database<6.0,>=5.0.1 (from pyannote.audio)
  Downloading pyannote.database-5.1.3-py3-none-any.whl.metadata (1.1 kB)
Collecting pyannote.metrics<4.0,>=3.2 (from pyannote.audio)
  Downloading pyannote.metrics-3.2.1-py3-none-any.whl.metadata (1.3 kB)
Collecting pyannote.pipeline<4.0,>=3.0.1 (from pyannote.audio)
  Downloading pyannote.pipeline-3.0.1-py3-none-any.whl.metadata (897 bytes)
Collecting pytorch_metric_learning>=2.1.0 (from pyannote.audio)
  Downloading pytorch_metric_learning-2.9.0-py3-none-

### Real-Time Transcription with Whisper(EN2002a.wav)

In [6]:
import os, time, tempfile
import librosa, soundfile as sf
import whisper
from jiwer import wer

filename = "/content/EN2002a.wav"
# Load Whisper model
stt_model = whisper.load_model("medium")

# Load audio
y, sr = librosa.load(filename, sr=16000)

# Chunk parameters
chunk_sec = 5
overlap_sec = 0.5
step = int((chunk_sec - overlap_sec) * sr)

results = []

print("=== Simulated Real-time Whisper STT ===")
for start in range(0, len(y), step):
    end = min(start + chunk_sec*sr, len(y))
    chunk = y[start:end]

    # Save chunk temporarily
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmpf:
        sf.write(tmpf.name, chunk, sr)
        tmp_path = tmpf.name

    # Transcribe chunk
    result = stt_model.transcribe(tmp_path, language="en", verbose=False)
    os.unlink(tmp_path)

    text = result["text"].strip()
    if text:
        print(f"\n[FINAL] {text}")
        results.append(text)

# Full transcript
hypothesis = " ".join(results)
print("\n=== Full Transcript ===")
print(hypothesis)

# Optional: evaluate WER if reference transcript is available
reference = "when winter going to the meetings as talking about the stuff of the meetings yeah nevertheless the here get a certificate right so we really need to talk about the bodies are we just go around and everyone says what they what they've been doing that without anything live say yeah sounds like you've done could do that i've got a browser now which whoops no good stuff when of mine as well mississippi given this this wasn't a good start okay sorry yeah say yeah we've got browser read which comes up automatically with or transcription box and the topics and then when you go on the menu you can select some ization box which pops up and an audio player one am and as i think search works as well say pop up a search and loads up just the background when nice empty and am so when you start you have to either open open up seek the observation or do such an open it through that that the sense m the transcription"
print("\nSTT WER:", wer(reference, hypothesis))


# Save final transcript to .txt file
output_transcript_path = "final_meeting_transcript11.txt"

with open(output_transcript_path, "w", encoding="utf-8") as f:
    f.write(hypothesis)

print(f"\n✅ Transcript saved to: {output_transcript_path}")



=== Simulated Real-time Whisper STT ===


100%|██████████| 500/500 [00:39<00:00, 12.70frames/s]



[FINAL] I wonder how much of the meetings is talking about the stuff at the meetings. Yeah, exactly.


100%|██████████| 500/500 [00:25<00:00, 20.00frames/s]



[FINAL] Look at all this stuff man!


100%|██████████| 500/500 [00:27<00:00, 18.34frames/s]



[FINAL] Okay. Right. So, what do we need to?


100%|██████████| 500/500 [00:24<00:00, 20.81frames/s]



[FINAL] need to talk about.


100%|██████████| 500/500 [00:42<00:00, 11.88frames/s]



[FINAL] What they've been doing. Has anybody done anything? No. No? No. No. No. No.


100%|██████████| 500/500 [01:35<00:00,  5.26frames/s]



[FINAL] Yeah. Okay. Sounds like you're...


100%|██████████| 500/500 [00:28<00:00, 17.60frames/s]



[FINAL] Sounds like you've done something. Okay, well I've got a browser now.


  0%|          | 0/500 [00:23<?, ?frames/s]
100%|██████████| 500/500 [00:24<00:00, 20.23frames/s]



[FINAL] Oh, I dropped mine as well.


100%|██████████| 500/500 [00:24<00:00, 20.32frames/s]



[FINAL] This wasn't a good start.


100%|██████████| 500/500 [00:23<00:00, 21.47frames/s]



[FINAL] Okay.


100%|██████████| 500/500 [00:25<00:00, 19.35frames/s]



[FINAL] Yeah, so yeah, we've got a browser with


100%|██████████| 500/500 [00:26<00:00, 19.19frames/s]



[FINAL] which comes up automatically with the transcription box and the topics.


100%|██████████| 500/500 [00:27<00:00, 18.44frames/s]



[FINAL] and then when you go on the menu you can select the summarisation button.


100%|██████████| 500/500 [00:25<00:00, 19.74frames/s]



[FINAL] which pops up and an audio player.


100%|██████████| 500/500 [00:27<00:00, 17.99frames/s]



[FINAL] And I think the search works as well. So you can pop up a search.


100%|██████████| 500/500 [00:26<00:00, 18.73frames/s]



[FINAL] pop up a search and it loads up just the background windows empty.


100%|██████████| 500/500 [00:26<00:00, 18.98frames/s]



[FINAL] and so when you start you have to either open...


100%|██████████| 500/500 [00:26<00:00, 18.53frames/s]



[FINAL] open a particular observation or do a search and open it through that.


100%|██████████| 500/500 [00:26<00:00, 19.16frames/s]



[FINAL] Does that make sense?


100%|██████████| 500/500 [00:26<00:00, 18.54frames/s]



[FINAL] The transcription box has got a summarized button which doesn't do anything.


100%|██████████| 500/500 [00:25<00:00, 19.83frames/s]



[FINAL] anything yet. Now I wonder how...


100%|██████████| 500/500 [00:28<00:00, 17.51frames/s]



[FINAL] So how we want to do, you know, we want a pop-up window with the speaker characterization.


100%|██████████| 500/500 [00:25<00:00, 19.64frames/s]



[FINAL] Either we could do that, you know,


100%|██████████| 500/500 [00:29<00:00, 16.80frames/s]



[FINAL] you know, when does the pop-up come? Either we can, when you click on the ID, any sort of...


100%|██████████| 500/500 [00:25<00:00, 19.40frames/s]



[FINAL] sort of ID in the transcription box or we can put


100%|██████████| 500/500 [00:27<00:00, 18.34frames/s]



[FINAL] Put an extra button extra few buttons next to the summarize button so that you actually


100%|██████████| 500/500 [00:28<00:00, 17.55frames/s]



[FINAL] Actually in the transcription box, but so you click on a button and then the herbs in this whatever that speaker


100%|██████████| 500/500 [00:26<00:00, 19.02frames/s]



[FINAL] speaker characterization is. The problem is the left.


100%|██████████| 500/500 [00:26<00:00, 19.18frames/s]



[FINAL] The left click is already used because it highlights that.


100%|██████████| 500/500 [00:27<00:00, 18.17frames/s]



[FINAL] That part of speech or whatever that what he's saying. The left click.


100%|██████████| 500/500 [00:26<00:00, 18.81frames/s]



[FINAL] The left click highlights it. Can you double click?


100%|██████████| 500/500 [00:29<00:00, 16.94frames/s]



[FINAL] I don't know what happens when you double click actually. But that'd be a bit annoying if you have


100%|██████████| 500/500 [00:26<00:00, 18.94frames/s]



[FINAL] if you have left click for one thing and double click for something else.


100%|██████████| 500/500 [00:27<00:00, 18.30frames/s]



[FINAL] I think so yeah when you left click it you can for example set set the marker there


100%|██████████| 500/500 [00:25<00:00, 19.37frames/s]



[FINAL] there so that the audio goes from there I think.


100%|██████████| 500/500 [00:27<00:00, 18.27frames/s]



[FINAL] So we can't put it on left click. We could put it on right click.


100%|██████████| 500/500 [00:29<00:00, 16.77frames/s]



[FINAL] could for example have a little menu that pops up. Yeah that might be quite good that's why. So right click on


100%|██████████| 500/500 [00:26<00:00, 18.81frames/s]



[FINAL] right click on it menu, you know, and you can click on speaker characterization.


100%|██████████| 500/500 [00:25<00:00, 19.68frames/s]



[FINAL] What else can we have in the menu?


100%|██████████| 500/500 [00:28<00:00, 17.52frames/s]



[FINAL] That's a good point. I don't know. Otherwise we just have a right click or two.


100%|██████████| 500/500 [00:26<00:00, 19.13frames/s]



[FINAL] I don't know about you but usually in Windows right click


100%|██████████| 500/500 [00:25<00:00, 19.83frames/s]



[FINAL] right click, it opens a menu.


100%|██████████| 500/500 [00:29<00:00, 17.05frames/s]



[FINAL] That's what nobody does, yeah. It might be a bit weird for it to just start bringing up stuff.


100%|██████████| 500/500 [00:27<00:00, 18.42frames/s]



[FINAL] Yeah, to just have sort of the same sort of idiom throughout. Yeah.


100%|██████████| 500/500 [00:26<00:00, 18.80frames/s]



[FINAL] Yeah. Or just, you know, or just the button.


100%|██████████| 500/500 [00:26<00:00, 19.15frames/s]



[FINAL] I guess a bottom makes a bit more sense because otherwise...


100%|██████████| 500/500 [00:26<00:00, 18.57frames/s]



[FINAL] Otherwise, you don't really know that... Oh, what if I...


100%|██████████| 500/500 [00:27<00:00, 18.35frames/s]



[FINAL] I write quick now what happens then? Well no, it's visual.


100%|██████████| 500/500 [00:26<00:00, 18.56frames/s]



[FINAL] the menu pop up that well oh yeah it's more obvious isn't it


100%|██████████| 500/500 [00:29<00:00, 17.20frames/s]



[FINAL] It's already a proof, isn't it? It's got a button. Yeah, that's true. Yeah.


100%|██████████| 500/500 [00:24<00:00, 20.19frames/s]



[FINAL] Yeah, it's more intuitive.


100%|██████████| 500/500 [00:27<00:00, 18.06frames/s]



[FINAL] I don't know where's the menu. I think it's more flexible that way.


100%|██████████| 500/500 [00:25<00:00, 19.29frames/s]



[FINAL] Well, I guess in Meeting Browser 1.1.


100%|██████████| 500/500 [00:25<00:00, 19.64frames/s]



[FINAL] You can have a menu instead of a button.


100%|██████████| 500/500 [00:26<00:00, 19.07frames/s]



[FINAL] This one. But actually what we could do is on...


100%|██████████| 500/500 [00:27<00:00, 18.40frames/s]



[FINAL] In that menu as well we could have an option that pops up a window with all the


100%|██████████| 500/500 [00:26<00:00, 18.80frames/s]



[FINAL] with all the meetings that that user has been in. Like.


100%|██████████| 500/500 [00:26<00:00, 19.09frames/s]



[FINAL] Like a search for that user. Yeah, is that gonna be useful or too


100%|██████████| 500/500 [00:30<00:00, 16.48frames/s]



[FINAL] or too much. I guess I said what are you part of the menu that comes down.


100%|██████████| 500/500 [00:27<00:00, 18.12frames/s]



[FINAL] it comes down it says give me all your meetings. Yeah so when you right click on it.


100%|██████████| 500/500 [00:26<00:00, 18.66frames/s]



[FINAL] one option will be give me all your meetings, characterise the speaker.


100%|██████████| 500/500 [00:23<00:00, 20.96frames/s]



[FINAL] Yeah.


100%|██████████| 500/500 [00:28<00:00, 17.54frames/s]



[FINAL] Or is that too much? I don't know. I mean, it's more...


100%|██████████| 500/500 [00:25<00:00, 19.95frames/s]



[FINAL] It's more like part of the browsing sort of.


100%|██████████| 500/500 [00:25<00:00, 19.92frames/s]



[FINAL] Sort of thing more than the speaker characteristics as Steve mentioned


100%|██████████| 500/500 [00:26<00:00, 19.19frames/s]



[FINAL] Steve mentioned, I don't really know how to involve the speaking characters.


100%|██████████| 500/500 [00:26<00:00, 18.53frames/s]



[FINAL] to the browser. Yeah, maybe tie that in with dialog acts or something.


100%|██████████| 500/500 [00:28<00:00, 17.43frames/s]



[FINAL] as well as mean speaker characterization tied into, I don't know, just some way to sort of leverage it.


100%|██████████| 500/500 [00:29<00:00, 17.22frames/s]



[FINAL] leverage that information? Like we have it, we're doing it, but how does it help the browsing?


100%|██████████| 500/500 [00:28<00:00, 17.37frames/s]



[FINAL] or what can it give us? Well, you mean the speaker characterization? Oh, well, yeah, I mean.


100%|██████████| 500/500 [00:28<00:00, 17.78frames/s]



[FINAL] I mean, I don't know about that. I mean, the speaker characterization is a bit.


100%|██████████| 500/500 [00:31<00:00, 15.78frames/s]



[FINAL] It's a bit like you said, I mean, it's like a nice thing. We don't, we're not actually sure what the. You know,


100%|██████████| 500/500 [00:28<00:00, 17.80frames/s]



[FINAL] what the user wants with it but it's interesting so why not give it to him. Yeah yeah.


100%|██████████| 500/500 [00:27<00:00, 18.20frames/s]



[FINAL] Yeah. I think we should do it. I mean, I guess maybe.


100%|██████████| 500/500 [00:26<00:00, 18.90frames/s]



[FINAL] Maybe we could have that as another thing of like well if you get speak


100%|██████████| 500/500 [00:26<00:00, 19.17frames/s]



[FINAL] speak a characterization that someone talks a lot in a certain topic.


100%|██████████| 500/500 [00:25<00:00, 19.52frames/s]



[FINAL] topic and then if you click on that topic you get


100%|██████████| 500/500 [00:24<00:00, 20.37frames/s]



[FINAL] everything that's in that topic.


100%|██████████| 500/500 [00:32<00:00, 15.54frames/s]



[FINAL] Is that still another level further? I mean, we're not even there yet. No, no, no. Does that pop up the speaker? Yeah.


100%|██████████| 500/500 [00:29<00:00, 17.02frames/s]



[FINAL] of the speaker characterization yet. You have to first define what happens when you click on a user. With dev.


100%|██████████| 500/500 [00:27<00:00, 18.31frames/s]



[FINAL] with their speaker characterisation first before we cross over to the other bridge.


100%|██████████| 500/500 [00:25<00:00, 19.64frames/s]



[FINAL] So should we try to do a right click menu then?


100%|██████████| 500/500 [00:27<00:00, 18.41frames/s]



[FINAL] Yeah, alright. Is that just from the main? Even if I...


100%|██████████| 500/500 [00:25<00:00, 19.90frames/s]



[FINAL] Even if any one thing comes up.


100%|██████████| 500/500 [00:29<00:00, 16.87frames/s]



[FINAL] if we're on the users versus like a topic or something, have a right click for these various things. Yes.


100%|██████████| 500/500 [00:29<00:00, 17.15frames/s]



[FINAL] Yes, wait, yeah, there's a second thing about the topics. The topic window, you wanna...


100%|██████████| 500/500 [00:26<00:00, 18.54frames/s]



[FINAL] to somehow be able to click on a topic and it pops up with all the meetings.


100%|██████████| 500/500 [00:31<00:00, 16.02frames/s]



[FINAL] the meetings that contain that. Right. Yeah. Yeah. Yeah. So we could do that in a similar way.


100%|██████████| 500/500 [00:25<00:00, 19.72frames/s]



[FINAL] way do it right click as well.


100%|██████████| 500/500 [00:25<00:00, 19.50frames/s]



[FINAL] and thrown. Yeah. On the speaker.


100%|██████████| 500/500 [00:25<00:00, 19.42frames/s]



[FINAL] So we have basically two options of browsing there.


100%|██████████| 500/500 [00:25<00:00, 19.69frames/s]



[FINAL] the meetings is by either searching and opening.


100%|██████████| 500/500 [00:27<00:00, 17.99frames/s]



[FINAL] individual observations and then we have the interlinking by right-clicking.


100%|██████████| 500/500 [00:25<00:00, 19.46frames/s]



[FINAL] right click basically. Right. Actually yeah like another comment


100%|██████████| 500/500 [00:30<00:00, 16.15frames/s]



[FINAL] Another common thing, like if we have it so it's set up, if you double click and it goes to a default thing on the menu.


100%|██████████| 500/500 [00:28<00:00, 17.38frames/s]



[FINAL] menu. I mean, that's kind of a typical kind of double click if you double click on


100%|██████████| 500/500 [00:30<00:00, 16.49frames/s]



[FINAL] I will click on a particular topic or something, then it goes to sort of like a default view. So it will be.


100%|██████████| 500/500 [00:27<00:00, 18.09frames/s]



[FINAL] So it'll be like all the cases of that topic, either whatever we decide.


100%|██████████| 500/500 [00:29<00:00, 16.72frames/s]



[FINAL] decide the default to be, whether it's in all meetings or just in one meeting, and just show up in the particular.


100%|██████████| 500/500 [00:29<00:00, 16.90frames/s]



[FINAL] instances of that topic in the meeting. Like, and just think to have the default there. So if somebody.


100%|██████████| 500/500 [00:33<00:00, 15.01frames/s]



[FINAL] So if somebody knows exactly what they want to do, and we know that that's the most common thing that's gonna happen. So that don't work. Yeah.


100%|██████████| 500/500 [00:31<00:00, 15.65frames/s]



[FINAL] So the double click won't do that and save them the trouble of right clicking and choosing the item on the menu. I didn't see there's anything.


100%|██████████| 500/500 [00:27<00:00, 18.47frames/s]



[FINAL] anything obvious that would be able to... It might come to us.


100%|██████████| 500/500 [00:35<00:00, 14.17frames/s]



[FINAL] I'm sorry, I started playing with it. What was that? I didn't quite understand. Do you mean? Like we have to, the right click. Just.


100%|██████████| 500/500 [00:27<00:00, 17.98frames/s]



[FINAL] the right click menu, right clicking, but instead of like have a default.


100%|██████████| 500/500 [00:24<00:00, 20.47frames/s]



[FINAL] default double click.


100%|██████████| 500/500 [00:31<00:00, 16.07frames/s]



[FINAL] Yeah. Use the double click. Like, I don't know, show the speaker characterization. Yeah. Right.


100%|██████████| 500/500 [00:29<00:00, 16.76frames/s]



[FINAL] Yeah, for instance. Then there's a problem with a lot of windows popping up again. And they pop up without...


100%|██████████| 500/500 [00:28<00:00, 17.28frames/s]



[FINAL] pop up without you even being quite sure what it is. You know what I mean? Yeah.


100%|██████████| 500/500 [00:26<00:00, 18.54frames/s]



[FINAL] Yeah. But if you do double click on something, don't you...


100%|██████████| 500/500 [00:27<00:00, 18.26frames/s]



[FINAL] expect something to happen then? Yeah, maybe. Sort of. Well.


100%|██████████| 500/500 [00:29<00:00, 17.03frames/s]



[FINAL] Sometimes you do it by accident, you go, go away! Stop coming up! I don't need closets.


100%|██████████| 500/500 [00:28<00:00, 17.42frames/s]



[FINAL] Yeah. It's not the end of the world. Especially if you have a touchpad.


100%|██████████| 500/500 [00:24<00:00, 20.51frames/s]



[FINAL] Yeah, yeah, oh yeah.


100%|██████████| 500/500 [00:26<00:00, 19.07frames/s]



[FINAL] I'll check what it does on double click at the moment.


100%|██████████| 500/500 [00:26<00:00, 18.60frames/s]



[FINAL] because I think it might do something, but... That's a good start.


100%|██████████| 500/500 [00:26<00:00, 19.11frames/s]



[FINAL] But I don't, I really don't see anything obvious.


100%|██████████| 500/500 [00:25<00:00, 19.76frames/s]



[FINAL] I mean if you can think of something.


100%|██████████| 500/500 [00:25<00:00, 19.67frames/s]



[FINAL] I don't know. Sorry.


100%|██████████| 500/500 [00:24<00:00, 20.13frames/s]



[FINAL] I guess it depends on the...


100%|██████████| 500/500 [00:25<00:00, 19.87frames/s]



[FINAL] Yeah, what you're clicking on.


100%|██████████| 500/500 [00:24<00:00, 20.31frames/s]



[FINAL] Yeah, another thing...


100%|██████████| 500/500 [00:26<00:00, 18.64frames/s]



[FINAL] Single audio files. Do we want to use them or not?


100%|██████████| 500/500 [00:28<00:00, 17.28frames/s]



[FINAL] We don't know yet, do we? Do you mean single channel? Yeah, single channel.


100%|██████████| 500/500 [00:27<00:00, 17.92frames/s]



[FINAL] that single channel files do we want them? I mean probably not but we might need them for


100%|██████████| 500/500 [00:27<00:00, 17.93frames/s]



[FINAL] them for this fluency. If we tried to do it in audio. We might use them in voice.


100%|██████████| 500/500 [00:27<00:00, 18.37frames/s]



[FINAL] use them to work with them but do we want to integrate them so that when you


100%|██████████| 500/500 [00:29<00:00, 17.07frames/s]



[FINAL] when you browse you can... You might want to be able to say, oh just give me what that person's saying.


100%|██████████| 500/500 [00:29<00:00, 17.17frames/s]



[FINAL] saying. Do you though? No. It's not going to make much sense if you listen to it on its own.


100%|██████████| 500/500 [00:27<00:00, 18.25frames/s]



[FINAL] Unless you can't hear it properly or something, somebody's talking over somebody else or something.


100%|██████████| 500/500 [00:26<00:00, 19.17frames/s]



[FINAL] We'll just say no for now and put it in later.


100%|██████████| 500/500 [00:25<00:00, 19.53frames/s]



[FINAL] Yeah, yeah, because otherwise it will be


100%|██████████| 500/500 [00:29<00:00, 16.76frames/s]



[FINAL] it will increase the amount of data we need by a lot, won't it? Yeah. The footprint will be bigger.


100%|██████████| 500/500 [00:23<00:00, 20.96frames/s]



[FINAL] Here we go.


100%|██████████| 500/500 [00:26<00:00, 18.91frames/s]



[FINAL] Yeah, what do people think in general about the Windows cluster?


100%|██████████| 500/500 [00:31<00:00, 15.90frames/s]



[FINAL] Windows cluttering that you mentioned. Do you think? Do you think it's really that? I think we probably need that, don't we?


100%|██████████| 500/500 [00:30<00:00, 16.47frames/s]



[FINAL] I don't have the full flexibility we need to have. That's what I feel as well. I mean, I'm very much.


100%|██████████| 500/500 [00:25<00:00, 19.99frames/s]



[FINAL] very much I want to have everything customizable.


100%|██████████| 500/500 [00:28<00:00, 17.57frames/s]



[FINAL] The thing is typically though is someone going to have five windows open at the same time? Probably not.


100%|██████████| 500/500 [00:28<00:00, 17.66frames/s]



[FINAL] Probably not, no. Probably just two or three of whatever they're looking at, like dialog X.


100%|██████████| 500/500 [00:27<00:00, 18.22frames/s]



[FINAL] yeah but yeah it depends I mean I really want to put the summary in somewhere you know


100%|██████████| 500/500 [00:28<00:00, 17.55frames/s]



[FINAL] And the summary, for example, is typical. You open the window, you read through it, you might


100%|██████████| 500/500 [00:28<00:00, 17.30frames/s]



[FINAL] you might close it again straight away. But it would be good to have it there. I think it's fine.


100%|██████████| 500/500 [00:28<00:00, 17.31frames/s]



[FINAL] But, well, couldn't we, I mean, you probably, I don't know if you wanna have like.


100%|██████████| 500/500 [00:25<00:00, 19.40frames/s]



[FINAL] like the full transcription and the summary at the same time.


100%|██████████| 500/500 [00:25<00:00, 19.44frames/s]



[FINAL] So maybe you can just choose the same window for...


100%|██████████| 500/500 [00:28<00:00, 17.26frames/s]



[FINAL] for transcription and summary. It's like have a tab there. I was just gonna say tab.


100%|██████████| 500/500 [00:26<00:00, 18.67frames/s]



[FINAL] Yeah, tabs are nice. Yeah, tabs are good. Mazele style.


100%|██████████| 500/500 [00:31<00:00, 15.79frames/s]



[FINAL] That's true. Yeah, yeah, yeah. That's right. I'm not sure. Yeah, I don't know. Anyway.


100%|██████████| 500/500 [00:24<00:00, 20.19frames/s]



[FINAL] Does anybody know how to do that?


100%|██████████| 500/500 [00:30<00:00, 16.36frames/s]



[FINAL] You probably can. I'm sure. I'm sure you can do anything. If you really knew what you were doing.


100%|██████████| 500/500 [00:27<00:00, 18.03frames/s]



[FINAL] I don't know, it's a bit of a cripple with Java's windowing toolkit.


100%|██████████| 500/500 [00:27<00:00, 18.39frames/s]



[FINAL] Sometimes it's a bit of a cripple. I can have a look.


100%|██████████| 500/500 [00:29<00:00, 17.15frames/s]



[FINAL] Yeah. Yeah. Or just, I don't know. But yeah, that might be.


100%|██████████| 500/500 [00:28<00:00, 17.75frames/s]



[FINAL] So that might be an idea, yeah. Contents of the same window. Yeah, yeah.


100%|██████████| 500/500 [00:24<00:00, 20.50frames/s]



[FINAL] from transcription to summary.


100%|██████████| 500/500 [00:27<00:00, 18.41frames/s]



[FINAL] Yeah, we even just have buttons on it, toggle buttons or something on the toolbar.


100%|██████████| 500/500 [00:25<00:00, 19.80frames/s]



[FINAL] toolbar for that window. Hmm.


100%|██████████| 500/500 [00:30<00:00, 16.17frames/s]



[FINAL] Yeah, yeah, that could probably be. It doesn't have to be a tab or whatever. No, no, it could be like...


100%|██████████| 500/500 [00:26<00:00, 18.93frames/s]



[FINAL] like transcription summary like two letters and you just press on.


100%|██████████| 500/500 [00:27<00:00, 18.44frames/s]



[FINAL] or check marks even have two of them. That will reduce the number of windows.


100%|██████████| 500/500 [00:26<00:00, 19.12frames/s]



[FINAL] otherwise Photoshop has a lot of windows if you ever use Photoshop


100%|██████████| 500/500 [00:30<00:00, 16.15frames/s]



[FINAL] Photoshop's got loads and you're always forever going, oh, get out of the way. There's always stuff in the way, isn't there?


100%|██████████| 500/500 [00:53<00:00,  9.34frames/s]



[FINAL] I don't like it. What do you mean people get used to it? I mean when you use it.


100%|██████████| 500/500 [00:26<00:00, 18.94frames/s]



[FINAL] you use at Windows or Linux for that matter, you have a lot of windows open.


100%|██████████| 500/500 [00:25<00:00, 19.66frames/s]



[FINAL] open. You do, yeah.


100%|██████████| 500/500 [00:26<00:00, 18.55frames/s]



[FINAL] Unless Kenny, I don't suppose you can minimize them and stuff like that.


100%|██████████| 500/500 [00:30<00:00, 16.49frames/s]



[FINAL] Yeah, you can't minimize. Can you do that? Yeah. Yeah. Okay. Well, that's all right.


100%|██████████| 500/500 [00:31<00:00, 15.81frames/s]



[FINAL] Yeah, of course you can. Like a lot of these can be, like once we get the disfluencies out, then we can.


100%|██████████| 500/500 [00:28<00:00, 17.33frames/s]



[FINAL] out then we can you know transcript without the disfluency transcript with a different disfluency.


100%|██████████| 500/500 [00:33<00:00, 15.08frames/s]



[FINAL] Yeah, that's true. Have all these sort of using the same window. And so we can do a lot of. Yeah, they're definitely gonna be in the same.


100%|██████████| 500/500 [00:27<00:00, 18.19frames/s]



[FINAL] Maybe in the same window, yeah. And probably mutually. Exclusive.


100%|██████████| 500/500 [00:25<00:00, 19.55frames/s]



[FINAL] Yeah, that was the plan for the differences anyway.


100%|██████████| 500/500 [00:27<00:00, 18.00frames/s]



[FINAL] Yeah, do you want to do that with summaries then? Put the summaries in.


100%|██████████| 500/500 [00:27<00:00, 18.21frames/s]



[FINAL] in the same window. Meet you next week. Yeah, well...


100%|██████████| 500/500 [00:25<00:00, 19.25frames/s]



[FINAL] I don't know, but transcripts, sounds reasonable to have.


100%|██████████| 500/500 [00:25<00:00, 19.53frames/s]



[FINAL] to have like transcripts like maybe three alternatives then like full meeting


100%|██████████| 500/500 [00:29<00:00, 16.89frames/s]



[FINAL] meeting without disfluencies and summarize. Yeah. Do you wanna do that? That can make sense. Yeah.


100%|██████████| 500/500 [00:24<00:00, 20.56frames/s]



[FINAL] Yeah, yeah, yes


100%|██████████| 500/500 [00:23<00:00, 21.19frames/s]



[FINAL] Yeah.


100%|██████████| 500/500 [00:24<00:00, 20.35frames/s]



[FINAL] And finally, the prototype.


100%|██████████| 500/500 [00:27<00:00, 18.40frames/s]



[FINAL] spoke about what kind of prototype could be produced? By the end of February?


100%|██████████| 500/500 [00:23<00:00, 20.98frames/s]



[FINAL] February?


100%|██████████| 500/500 [00:28<00:00, 17.62frames/s]



[FINAL] change today, you know? And it kind of just, it just develops. I know.


100%|██████████| 500/500 [00:25<00:00, 19.26frames/s]



[FINAL] aiming for anything. Do we want to aim for something?


100%|██████████| 500/500 [00:25<00:00, 19.49frames/s]



[FINAL] Um... I'm-I'm-


100%|██████████| 500/500 [00:29<00:00, 16.93frames/s]



[FINAL] I'm pretty much getting there anyway, to be honest. It's not going to be a lot more work.


100%|██████████| 500/500 [00:26<00:00, 18.80frames/s]



[FINAL] I guess it is. I mean, just to try and I mean for us to


100%|██████████| 500/500 [00:27<00:00, 18.06frames/s]



[FINAL] to be able to do something that we can include with your thing. So, so easy.


100%|██████████| 500/500 [00:28<00:00, 17.68frames/s]



[FINAL] That's what we need. Something just slightly more than just that on its own. Yeah.


100%|██████████| 500/500 [00:26<00:00, 19.20frames/s]



[FINAL] Yeah, just so see the integration works sort of between the different


100%|██████████| 500/500 [02:32<00:00,  3.28frames/s]



[FINAL] Yeah, definitelyope.


100%|██████████| 500/500 [00:32<00:00, 15.18frames/s]



[FINAL] So I guess that's what we need to aim for. I think that's kind of what he meant as well, isn't it? Yeah. Seeing it. Yeah.


100%|██████████| 500/500 [00:27<00:00, 17.96frames/s]



[FINAL] Seeing it sort of start to do stuff rather than just be. Oh, I see.


100%|██████████| 500/500 [00:27<00:00, 17.92frames/s]



[FINAL] A prototype of the GUI. Yeah. Well, I guess, I mean...


100%|██████████| 500/500 [00:25<00:00, 19.25frames/s]



[FINAL] That was probably our intention but we might not have written it down.


100%|██████████| 500/500 [00:23<00:00, 21.58frames/s]



[FINAL] Yeah.


100%|██████████| 500/500 [00:27<00:00, 18.18frames/s]



[FINAL] Yeah. Actually, is the code accessible, like the GUI stuff that you've done?


100%|██████████| 500/500 [00:28<00:00, 17.53frames/s]



[FINAL] stuff that you've done? Like, can we just like copy like the latest stuff out of your directory?


100%|██████████| 500/500 [00:29<00:00, 17.07frames/s]



[FINAL] Like if we're gonna just try and take a look or something. I don't know, can we access other people?


100%|██████████| 500/500 [00:29<00:00, 17.11frames/s]



[FINAL] If you make it world readable then anybody can listen. If you just tell us the directory and just make...


100%|██████████| 500/500 [00:32<00:00, 15.59frames/s]



[FINAL] and just make the file readable, then we know the directory to get in then. And just, that might be. That might be. That might be.


100%|██████████| 400/400 [00:27<00:00, 14.42frames/s]


[FINAL] I don't know if it's readable at the moment, but you can sure have it.

=== Full Transcript ===
I wonder how much of the meetings is talking about the stuff at the meetings. Yeah, exactly. Look at all this stuff man! Okay. Right. So, what do we need to? need to talk about. What they've been doing. Has anybody done anything? No. No? No. No. No. No. Yeah. Okay. Sounds like you're... Sounds like you've done something. Okay, well I've got a browser now. Oh, I dropped mine as well. This wasn't a good start. Okay. Yeah, so yeah, we've got a browser with which comes up automatically with the transcription box and the topics. and then when you go on the menu you can select the summarisation button. which pops up and an audio player. And I think the search works as well. So you can pop up a search. pop up a search and it loads up just the background windows empty. and so when you start you have to either open... open a particular observation or do a search and open it through that. Doe




## Speaker Diarization

In [3]:
from pyannote.audio import Pipeline
from pyannote.core import Segment
import json

token = "your token code(huggingface)"

filename = "/content/EN2002a.wav"

# Load pre-trained speaker diarization model
diar_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",use_auth_token=token)

# Run diarization
diarization = diar_pipeline(filename)

# Collect segments
segments = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    segments.append({
        "start": turn.start,
        "end": turn.end,
        "speaker": speaker
    })

print("\n=== Speaker Segments ===")
for seg in segments:
    print(f"[{seg['speaker']}] {seg['start']:.2f}-{seg['end']:.2f} sec")




# Save to JSON file
output_filename = "diarization_segments.json"
with open(output_filename, "w") as f:
    json.dump(segments, f, indent=4)

print(f"\n✅ Segments saved to {output_filename}")


# Save to RTTM
rttm_filename = "diarization_segments.rttm"
with open(rttm_filename, "w") as f:
    diarization.write_rttm(f)
print(f"✅ RTTM saved to {rttm_filename}")


  torchaudio.list_audio_backends()


config.yaml:   0%|          | 0.00/500 [00:00<?, ?B/s]

  available_backends = torchaudio.list_audio_backends()
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _speechbrain_save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _speechbrain_load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _recover


pytorch_model.bin:   0%|          | 0.00/17.7M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/318 [00:00<?, ?B/s]

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cu126. Bad things might happen unless you revert torch to 1.x.


hyperparams.yaml: 0.00B [00:00, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/hyperparams.yaml' -> '/root/.cache/torch/pyannote/speechbrain/hyperparams.yaml'
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _load
DEBUG:speechbrain.utils.checkpoints:Registered parameter transfer hook for _load
  wrapped_fwd = torch.cuda.amp.custom_fwd(fwd, cast_inputs=cast_inputs)
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load_if_possible
DEBUG:speechbrain.utils.parameter_transfer:Collecting files (or symlinks) for pretraining in /root/.cache/torch/pyann

embedding_model.ckpt:   0%|          | 0.00/83.3M [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/embedding_model.ckpt' -> '/root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["embedding_model"] = /root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt
INFO:speechbrain.utils.fetching:Fetch mean_var_norm_emb.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


mean_var_norm_emb.ckpt:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/mean_var_norm_emb.ckpt' -> '/root/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["mean_var_norm_emb"] = /root/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt
INFO:speechbrain.utils.fetching:Fetch classifier.ckpt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


classifier.ckpt:   0%|          | 0.00/5.53M [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/classifier.ckpt' -> '/root/.cache/torch/pyannote/speechbrain/classifier.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["classifier"] = /root/.cache/torch/pyannote/speechbrain/classifier.ckpt
INFO:speechbrain.utils.fetching:Fetch label_encoder.txt: Fetching from HuggingFace Hub 'speechbrain/spkrec-ecapa-voxceleb' if not cached


label_encoder.txt: 0.00B [00:00, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--spkrec-ecapa-voxceleb/snapshots/0f99f2d0ebe89ac095bcc5903c4dd8f72b367286/label_encoder.txt' -> '/root/.cache/torch/pyannote/speechbrain/label_encoder.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["label_encoder"] = /root/.cache/torch/pyannote/speechbrain/label_encoder.ckpt
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: embedding_model, mean_var_norm_emb, classifier, label_encoder
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): embedding_model -> /root/.cache/torch/pyannote/speechbrain/embedding_model.ckpt
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): mean_var_norm_emb -> /root/.cache/torch/pyannote/speechbrain/mean_var_norm_emb.ckpt
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): classifier -> /root/.cac


=== Speaker Segments ===
[SPEAKER_00] 0.03-7.51 sec
[SPEAKER_00] 8.55-10.36 sec
[SPEAKER_00] 11.83-36.62 sec
[SPEAKER_01] 13.18-13.46 sec
[SPEAKER_01] 15.12-15.56 sec
[SPEAKER_01] 19.10-23.47 sec
[SPEAKER_00] 38.03-39.15 sec
[SPEAKER_00] 41.51-45.51 sec
[SPEAKER_00] 48.28-66.72 sec
[SPEAKER_01] 49.47-49.85 sec
[SPEAKER_00] 68.17-84.76 sec
[SPEAKER_00] 86.45-87.16 sec
[SPEAKER_00] 87.98-103.95 sec
[SPEAKER_00] 106.16-134.85 sec
[SPEAKER_01] 106.23-106.66 sec
[SPEAKER_00] 136.60-144.06 sec
[SPEAKER_01] 138.69-141.11 sec
[SPEAKER_01] 142.10-143.59 sec
[SPEAKER_00] 145.14-160.66 sec
[SPEAKER_01] 152.36-152.80 sec
[SPEAKER_00] 162.66-163.97 sec

✅ Segments saved to diarization_segments.json
✅ RTTM saved to diarization_segments.rttm


###  Generate Speaker-Tagged Transcript


In [7]:
# Map transcript chunks to speaker segments
final_transcript = []

for seg in segments:
    # Estimate chunk index
    chunk_idx = int(seg['start'] // (chunk_sec - overlap_sec))
    text_chunk = results[chunk_idx] if chunk_idx < len(results) else ""
    final_transcript.append(f"[{seg['speaker']}] {text_chunk}")

print("\n=== Speaker-Tagged Transcript ===")
for line in final_transcript:
    print(line)



# === Save to TXT File ===
speaker_transcript_path = "speaker_tagged_transcript.txt"

with open(speaker_transcript_path, "w", encoding="utf-8") as f:
    for line in final_transcript:
        f.write(line + "\n")

print(f"\n✅ Speaker-tagged transcript saved to: {speaker_transcript_path}")


=== Speaker-Tagged Transcript ===
[SPEAKER_00] I wonder how much of the meetings is talking about the stuff at the meetings. Yeah, exactly.
[SPEAKER_00] Look at all this stuff man!
[SPEAKER_00] Okay. Right. So, what do we need to?
[SPEAKER_01] Okay. Right. So, what do we need to?
[SPEAKER_01] need to talk about.
[SPEAKER_01] What they've been doing. Has anybody done anything? No. No? No. No. No. No.
[SPEAKER_00] This wasn't a good start.
[SPEAKER_00] Okay.
[SPEAKER_00] Yeah, so yeah, we've got a browser with
[SPEAKER_01] Yeah, so yeah, we've got a browser with
[SPEAKER_00] pop up a search and it loads up just the background windows empty.
[SPEAKER_00] The transcription box has got a summarized button which doesn't do anything.
[SPEAKER_00] The transcription box has got a summarized button which doesn't do anything.
[SPEAKER_00] you know, when does the pop-up come? Either we can, when you click on the ID, any sort of...
[SPEAKER_01] you know, when does the pop-up come? Either we can, w

### DER + Accuray of one audiofile(EN2002a.wav)

In [30]:
from pyannote.core import Timeline
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.database.util import load_rttm

# === STEP 1: Load RTTM files ===

def load_annotation(rttm_path):
    rttm = load_rttm(rttm_path)
    return list(rttm.values())[0]  # Assumes only one recording per file

# Load reference and hypothesis
reference = load_annotation("reference.rttm")
hypothesis = load_annotation("diarization_segments.rttm")

# === STEP 2: Crop hypothesis to reference extent (to match durations) ===
uem = Timeline(segments=[reference.get_timeline().extent()])
hypothesis = hypothesis.crop(uem, mode="intersection")

# === STEP 3: Compute DER ===
metric = DiarizationErrorRate()
der = metric(reference, hypothesis)

# === STEP 4: Print results ===
print(f"\n📉 Diarization Error Rate (DER): {der:.2%}")
print(f"✅ Diarization Accuracy: {(1 - der):.2%}")



📉 Diarization Error Rate (DER): 45.24%
✅ Diarization Accuracy: 54.76%




In [26]:
reference_rttm_content = """SPEAKER EN2002a 1 0.000 5.000 <NA> <NA> Speaker1 <NA> <NA>
SPEAKER EN2002a 1 5.000 5.000 <NA> <NA> Speaker2 <NA> <NA>
SPEAKER EN2002a 1 10.000 5.000 <NA> <NA> Speaker1 <NA> <NA>"""

# Save to reference.rttm
with open("reference.rttm", "w") as f:
    f.write(reference_rttm_content)

print("✅ Saved reference.rttm")


✅ Saved reference.rttm


In [7]:

from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.core import Annotation, Segment

# Reference (ground truth annotation)
reference = Annotation()
reference[Segment(0, 4)] = "Speaker1"
reference[Segment(4, 8)] = "Speaker2"

# Hypothesis (model prediction from pyannote.audio)
hypothesis = Annotation()
hypothesis[Segment(0, 3.5)] = "Speaker1"
hypothesis[Segment( 1.806, 8.553)] = "Speaker2"

# Compute DER
metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"Diarization Error Rate: {der:.2%}")
print(f"Accuracy: {(1 - der):.2%}")

Diarization Error Rate: 34.34%
Accuracy: 65.66%


#### Summarization using HuggingFace and Model( T5 - small)

In [8]:
from transformers import pipeline
import os

os.environ["HUGGINGFACE_TOKEN"] = "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"

# Load summarization model (e.g., t5-small or facebook/bart-large-cnn)
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")

# Prepare text for summarization
diarized_text = " ".join([line for line in final_transcript])

# Generate summary
summary = summarizer(diarized_text, max_length=150, min_length=50, do_sample=False)
summary_text = summary[0]['summary_text']
print("\n=== Meeting Summary ===")
print(summary[0]['summary_text'])

# Save summary to text file
summary_output_path = "meeting_summary.txt"
with open(summary_output_path, "w", encoding="utf-8") as f:
    f.write(summary_text)

print(f"\n✅ Summary saved to: {summary_output_path}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



=== Meeting Summary ===
[SPEAKER_00] I wonder how much of the meetings is talking about the stuff at the meetings . the transcription box has got a summaryd button which doesn't do anything . right click on the ID, or a menu that pops up .

✅ Summary saved to: meeting_summary.txt


### ROUGE Evaluation(T5- model)

In [10]:
# Create a fake reference for testing
reference_summary = """ going to the meetings as talking about the stuff of the meetings yeah nevertheless the here get a certificate right so we really need to talk about the bodies are we just go around and everyone says what they what they've been doing that without anything live say yeah sounds like you've done could do that i've got a browser now which whoops no good stuff when of mine as well mississippi given this this wasn't a good start okay sorry yeah say yeah we've got browser read which comes up automatically with or transcription box and the topics and then when you go on the menu you can select some ization box which pops up and an audio player one am and as i think search works as well say pop up a search and loads up just the background when nice empty and am so when you start you have to either open open up seek the observation or do such an open it through that that the sense m the transcription"""
with open("reference_summary.txt", "w") as f:
    f.write(reference_summary)


In [1]:
! pip install evaluate
! pip install rouge-score

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.6
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=24d1da4aa2b647b7d8e2162a4a5e02f09e159aad51ba6f622c6bb308af306ba6
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [16]:
import evaluate

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Read model summary (already generated)
with open("meeting_summary.txt", "r") as f:
    generated_summary = f.read()

# Read reference summary
with open("reference_summary.txt", "r") as f:
    reference_summary = f.read()

# Compute ROUGE
scores = rouge.compute(predictions=[generated_summary], references=[reference_summary])

# Print scores
print("\n📊 ROUGE Evaluation Results:")
for metric, score in scores.items():
    print(f"{metric.upper()}: {score:.4f}")



📊 ROUGE Evaluation Results:
ROUGE1: 0.2545
ROUGE2: 0.1009
ROUGEL: 0.1636
ROUGELSUM: 0.1636
