
# Conversational Speaker Diarization

This project uses a large language model (GPT-4) to perform speaker diarization on conversation transcripts produced by the OpenAI Whisper API. Speaker diarization segments spoken content into portions attributed to individual speakers. The result is a clear, structured output that makes the interactions between speakers easier to analyze.

In [None]:
# Uncomment if you are running on Google Colab

#!sudo apt update && sudo apt install ffmpeg

In [None]:
!pip install -q openai==0.27.8
!pip install -q auditok==0.2.0
!pip install -q SoundFile==0.10.3.post1
!pip install -q numpy==1.24.4
!pip install -q tiktoken==0.4.0
!pip install -q --force-reinstall https://github.com/yt-dlp/yt-dlp/archive/master.tar.gz

### Set OpenAI Key

In [29]:
import os
import subprocess
import pprint

import openai 

openai.api_key = os.environ.get(
    "OPENAI_API_KEY", "sk-***"
)

### Define Speech to Text pipeline

##### Specify input file to process

Use the code below if you want to process a local audio file:

In [None]:
audio_file_path = "/path/to/audio/file"
# splitext keeps file names that contain dots intact (only the extension is removed)
fname = os.path.splitext(os.path.basename(audio_file_path))[0]
proccessed_audio_file = f"./{fname}.mp3"

subprocess.run(["ffmpeg", "-i", audio_file_path, "-ar",
                    "16000", "-ac", "1", "-y", proccessed_audio_file])

Use the code below instead if you want to process a YouTube video:

In [None]:
import yt_dlp as youtube_dl

# Set the YouTube video ID
video_id = "96daW-XQpmE"

ydl_opts = {
            'format': 'bestvideo+bestaudio/best',
            'outtmpl': os.path.join('./', '%(title)s.%(ext)s'),
            'extractor_lazy': True,
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    info_dict = ydl.extract_info(
        f"https://www.youtube.com/watch?v={video_id}", download=True)
    video_path = os.path.join(
        './', ydl.prepare_filename(info_dict))

# splitext keeps video titles that contain dots intact (only the extension is removed)
fname = os.path.splitext(os.path.basename(video_path))[0]
proccessed_audio_file = f"./{fname}.mp3"

subprocess.run(["ffmpeg", "-i", video_path, "-ar",
                    "16000", "-ac", "1", "-y", proccessed_audio_file])

Check the file size. By default, the Whisper API only accepts files smaller than 25 MB. If your audio file is larger than that, you can split it into shorter speech segments with a voice activity detection (VAD) algorithm and transcribe each segment separately.

In [21]:
# Check the file size in MB
size_in_mb = os.path.getsize(proccessed_audio_file) / (1024 * 1024)
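
The same check can be wrapped in a small helper so that the later branching reads more clearly. This is only a sketch: `exceeds_upload_limit` and its `limit_mb` parameter are illustrative names that do not appear elsewhere in this notebook, and the 25 MB default is taken from the note above.

```python
import os


def exceeds_upload_limit(path, limit_mb=25):
    """Return True if the file at `path` is larger than `limit_mb` megabytes.

    `limit_mb` defaults to the 25 MB figure quoted above (an assumption
    that should be checked against the current API documentation).
    """
    size_in_mb = os.path.getsize(path) / (1024 * 1024)
    return size_in_mb > limit_mb
```

A call such as `exceeds_upload_limit(proccessed_audio_file)` would then decide whether the audio needs to be split before transcription.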

#### Define Voice activity detection

In [22]:
import auditok

whisper_sample_rate = 16000

def vad_audiotok(audio_content):
    """
    Perform voice activity detection using the auditok package.

    :param audio_content: Bytes of audio data.
    :return: Chunks containing speech detected in the audio.
    """
    audio_regions = auditok.split(
        audio_content,
        sr=whisper_sample_rate,
        ch=1,
        sw=2,
        min_dur=0.5,
        max_dur=30,
        max_silence=0.3,
        energy_threshold=30
    )
    return audio_regions


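def audio_process(wav_path, is_byte=False):
    """
    Process audio data, performing voice activity detection and segmenting the audio.

    :param wav_path: Path to the audio file or audio bytes.
    :param is_byte: If True, pass wav_path to the VAD unchanged instead of
        reading the file into bytes first.
    :return: Segmented audio chunks containing detected speech.
    """
    if not is_byte:
        with open(wav_path, 'rb') as f:
            wav_bytes = f.read()
    else:
        wav_bytes = wav_path
    # sf is soundfile, imported in the Whisper wrapper cell below
    wav, sr = sf.read(wav_path)
    # vad_audiotok is a module-level function, so no `self.` prefix is needed
    audio_regions = vad_audiotok(wav_bytes)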
    wav_segments = []
    for r in audio_regions:
        start = r.meta.start
        end = r.meta.end
        segment = wav[int(start * sr):int(end * sr)]
        wav_segments.append(segment)
    return wav_segments
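
To make the idea behind voice activity detection more concrete, here is a minimal energy-based sketch. It is not auditok's implementation: `energy_vad`, `frame_ms` and `threshold_db` are hypothetical names chosen for illustration, and the threshold is expressed in dB relative to full scale for float samples in [-1, 1], which is not the same scale as auditok's `energy_threshold`.

```python
import numpy as np


def energy_vad(samples, sample_rate, frame_ms=30, threshold_db=-35.0, max_silence=0.3):
    """Return (start_sec, end_sec) tuples for spans whose frame level exceeds threshold_db.

    A span is closed once more than `max_silence` seconds of consecutive
    quiet frames have been seen after the last loud frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    max_silent_frames = int(max_silence * 1000 / frame_ms)

    regions = []
    start = None      # index of the first frame of the current region
    silent_run = 0    # consecutive quiet frames inside the current region

    for i in range(n_frames):
        frame = np.asarray(samples[i * frame_len:(i + 1) * frame_len], dtype=np.float64)
        rms = np.sqrt(np.mean(frame ** 2))
        level_db = 20 * np.log10(max(rms, 1e-10))  # guard against log(0)

        if level_db > threshold_db:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > max_silent_frames:
                end = i - silent_run + 1  # frame after the last loud frame
                regions.append((start * frame_len / sample_rate,
                                end * frame_len / sample_rate))
                start = None
                silent_run = 0

    if start is not None:
        # Audio ended while a region was still open
        end = n_frames - silent_run
        regions.append((start * frame_len / sample_rate,
                        end * frame_len / sample_rate))
    return regions
```

The `min_dur` and `max_dur` constraints used in the `auditok.split` call above are left out here to keep the sketch short.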

### Init Whisper API Wrapper

In [23]:
import uuid

import soundfile as sf

# specify whisper's model
model_name = "whisper-1"

def transcribe(audio_file):
    """
    Transcribe the provided audio using the OpenAI API.

    :param audio_file: Audio samples (e.g. one segment returned by audio_process).
    :return: Transcription text from the audio.
    """
    # Save the audio samples as a temporary WAV file
    temp_wav_path = f"./{uuid.uuid4()}.wav"
    with sf.SoundFile(temp_wav_path, 'w', samplerate=whisper_sample_rate, channels=1) as f:
        f.write(audio_file)

    try:
        # Transcribe using OpenAI API; the file is closed before cleanup
        with open(temp_wav_path, 'rb') as auf:
            response = openai.Audio.transcribe(model_name, auf)
    finally:
        # Clean up temporary file
        os.remove(temp_wav_path)
    return response['text']

def transcribe_raw(audio_file):
    """
    Transcribe the provided audio using the OpenAI API without saving a temporary file.

    :param audio_file: Path to the audio file.
    :return: Transcription text from the audio.
    """
    # Transcribe using OpenAI API; the context manager closes the file afterwards
    with open(audio_file, 'rb') as auf:
        response = openai.Audio.transcribe(model_name, auf)
    return response['text']


In [24]:
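import time

if size_in_mb > 25:
    # File exceeds the Whisper API limit: split on speech regions and transcribe each chunk
    wav_segments = audio_process(
        proccessed_audio_file, is_byte=True)
    transcript = []
    for segment in wav_segments:
        transcript.append(transcribe(segment))
        time.sleep(0.006)
    # Join with a space so words at segment boundaries do not run together
    transcript = ' '.join(transcript)
else:
    transcript = transcribe_raw(proccessed_audio_file)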
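
The loop above pauses for a fixed interval between requests. When many segments are sent in a row, retrying failed calls with an increasing delay may be more robust. The helper below is a generic sketch of exponential backoff; `call_with_backoff` and all of its parameters are hypothetical names introduced here, not part of the openai package.

```python
import time


def call_with_backoff(fn, *args, retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponentially growing delays.

    The delay before retry number k (0-based) is base_delay * 2**k seconds.
    The last failure is re-raised once `retries` attempts have been made.
    `sleep` is injectable so the helper can be tested without waiting.
    """
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In this notebook it could be used as `call_with_backoff(transcribe, segment, retry_on=(openai.error.RateLimitError,))`, assuming rate limiting is the failure you want to retry on.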

In [30]:
pprint.pprint(transcript)

('Hi guys, welcome. I would like to thank you guys for coming. Before we start '
 'I would like to have your parents consent. Thank you. Alright before we '
 'start I would like to remind you guys about the confidentiality guidelines. '
 'Your confidentiality as a student is really important. Everything that you '
 'say in the group counseling session, this group counseling session is '
 "confidential and it's private. So that means that it cannot be shared with "
 'other people from this group. Okay, what is said here stays here. Alright, '
 "however there's some exceptions as required by law and ethical standards. "
 'Harm to self-others, abuse or neglect, and if it is required by law to '
 'attend a hearing or legal proceedings, then I cannot guarantee that your '
 'information will be kept confidential. Anyone have any questions? So is '
 'everybody alright with these ground rules? Well just to make sure that '
 'everybody understands, everything is going to be confidential. So tha

### Define Prompt For Diarization

Extract dialogue involving multiple speakers from the transcript.

In [26]:
import tiktoken

tt_encoding = tiktoken.get_encoding("cl100k_base")
openai_model = "gpt-4"

def token_counter(passage):
    """
    Count the number of tokens in a given passage.

    Parameters:
        passage (str): The input text passage.

    Returns:
        int: The total number of tokens in the passage.
    """
    tokens = tt_encoding.encode(passage)
    total_tokens = len(tokens)
    return total_tokens

def extract_dialogue(transcript):
    """
    Extract dialogue involving multiple speakers from text.

    Parameters:
        transcript (str): The text containing the conversation.

    Returns:
        str: Extracted dialogue in the specified format.
    """
    prompt = """Perform speaker diarization on the given text to identify and extract conversations involving multiple speakers. Present the dialogue in the following structured format:
    Speaker 1:
    Speaker 2:
    Speaker 3:
    ..."""

    while True:
        try:
            messages = [
                    {"role": "system", "content": prompt},
                ]
            user_message = {"role": "user",
                            "content": transcript.replace('\n', '')}
            messages.append(user_message)
            tokens_per_message = 4
            max_token = 8191 - (token_counter(prompt) + token_counter(
                transcript) + (len(messages)*tokens_per_message) + 3)
            response = openai.ChatCompletion.create(
                model=openai_model,
                messages=messages,
                max_tokens=max_token,
                temperature=1,
                top_p=1,
                presence_penalty=0,
                frequency_penalty=0,
            )
            bot_response = response["choices"][0]["message"]["content"].strip(
            )
            return bot_response

        except openai.error.RateLimitError:
            # Rate limited: retry. The message list is rebuilt at the top of the loop.
            continue
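
The `max_tokens` arithmetic inside `extract_dialogue` can be checked in isolation. The sketch below reproduces the same calculation on plain token counts, with the constants (8191, 4 tokens per message, 3 extra tokens) copied from the function above; `completion_budget` is a hypothetical helper name and the counts in the usage note are made up for illustration.

```python
def completion_budget(prompt_tokens, transcript_tokens, n_messages=2,
                      context_window=8191, tokens_per_message=4, reply_priming=3):
    """Return how many tokens remain for the model's reply.

    Mirrors the calculation in extract_dialogue: the context window minus
    the prompt, the transcript, and a fixed per-message overhead.
    """
    overhead = n_messages * tokens_per_message + reply_priming
    return context_window - (prompt_tokens + transcript_tokens + overhead)
```

For example, with a 40-token prompt and a 3000-token transcript, `completion_budget(40, 3000)` returns 5140. A result of zero or below means the transcript does not fit in the context window and would have to be split before it is sent.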


##### Get Diarization Results

In [27]:
dialogue = extract_dialogue(transcript)

In [31]:
pprint.pprint(dialogue)

('Speaker 1:\n'
 'Hi guys, welcome. I would like to thank you guys for coming... [Counseling '
 'explanation]... Okay, what is said here stays here... Do you guys have any '
 'other things that you probably want to add as ground rules?... Okay, sounds '
 "good... question for you guys... Okay, all right, I see... We're going to "
 "play a game... Okay, all right. who wants to go next?... Okay... What's your "
 'name first?... Oh, sorry, next question. Can I say my name? ... You guys can '
 'call me Sam... What would be your perfect job?... Okay. What would be your '
 'perfect job?... Okay, that sounds cool. Alright, your turn... Okay, all '
 'right... Okay, so we ran out of questions... [Self-image explanation]...so, '
 "let's just, let's do an activity... [Instructions for activity]... Any "
 'questions?... Okay, who would like to go first?... Okay... Who wants to go '
 'next?... What would be something, one thing that you would like to change '
 'about yourself?... Okay. Okay. Alec..
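
For further analysis, the diarized text can be turned into a list of speaker turns. The parser below is a minimal sketch that assumes the model followed the `Speaker N:` format requested in the prompt; `parse_dialogue` is a hypothetical helper, and real model output may deviate from this format.

```python
import re

# Matches a line that starts a new turn, e.g. "Speaker 2: Thanks."
SPEAKER_RE = re.compile(r"^(Speaker \d+):\s*(.*)$")


def parse_dialogue(text):
    """Split diarized text into a list of (speaker, utterance) tuples.

    Lines that do not start with a speaker label are appended to the
    previous turn; lines before the first label are ignored.
    """
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            turns.append([match.group(1), match.group(2)])
        elif turns:
            turns[-1][1] = (turns[-1][1] + " " + line).strip()
    return [tuple(turn) for turn in turns]
```

Calling `parse_dialogue(dialogue)` on the result above would give one tuple per labeled turn, which can then be grouped or counted per speaker.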