No speaker labels in txt format with diarization enabled #801

veenified · 2024-05-11T21:35:11Z

I have been using WhisperX for transcribing multi-speaker audio files and I enabled diarization to distinguish between different speakers. However, I noticed that the TXT format output does not include speaker labels, which are crucial for my use case to identify who is speaking at any given time.

Could you provide some insights on why the speaker labels are missing in the TXT output when diarization is enabled? Is this an intended behavior or a potential oversight? Additionally, if this feature is not currently supported, are there any plans to include speaker labels in future updates of the TXT format output?

Thank you for your assistance and for the great work on this tool!

PS: Here's a sample command if it helps...
whisperx --hf_token <your_hf_token> --print_progress True --language en --diarize --compute_type int8 voice_chat.mp4 -o ~/transcriptions -f txt --min_speakers 4 --max_speakers 12


veenified · 2024-05-14T18:48:19Z

In case anyone is interested...
I put together some Python that converts the "srt" output into something closer to what I am looking for from the "txt" format.

import re
import sys
import os

def reformat_srt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    pattern = re.compile(r'\d+\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(\[SPEAKER_\d+\]: .+?(?=\n\n|\Z))', re.DOTALL)
    matches = pattern.findall(content)

    # Remove underline markup
    matches = [(start, end, re.sub(r'<u>|</u>', '', text.strip())) for start, end, text in matches]

    # Drop duplicate lines while keeping the original order
    unique_lines = []
    seen = set()
    for _, _, text in matches:
        if text not in seen:
            seen.add(text)
            unique_lines.append(text)

    reformatted = '\n'.join(unique_lines)

    return reformatted

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python clean_srt.py <srt_input_filename> [output_directory]")
        sys.exit(1)

    input_file_path = sys.argv[1]
    output_dir = sys.argv[2] if len(sys.argv) > 2 else os.getcwd()

    # Generate the output filename
    base_name = os.path.basename(input_file_path)
    name, _ = os.path.splitext(base_name)
    output_file_path = os.path.join(output_dir, f"{name}-clean.txt")

    reformatted_content = reformat_srt(input_file_path)

    with open(output_file_path, 'w', encoding='utf-8') as file:
        file.write(reformatted_content)

SeeknnDestroy · 2024-05-15T15:28:16Z

Can you please elaborate on the text format? Are you using WhisperX on the command line or as a Python library? Could you share an example snippet and your conclusions so far?

veenified · 2024-05-15T16:29:17Z

@SeeknnDestroy,
I am using WhisperX from the command line, as shown in the issue description and the example command.
The text format that I would like to see when diarization is enabled via the --diarize flag is something like this:

[SPEAKER_08]: Can you show me Jill's last email on that subject.
[SPEAKER_05]: Sure, let me share my screen. I don't know if there's a response to my questions, but we can review it together.

The Python that I included above is the workaround I implemented for the time being. It takes the "srt" output and simplifies it to a "txt" format that includes SPEAKER labels, like the example above.

nkilm · 2024-05-26T18:44:49Z

@veenified Currently the WriteTXT class writes only the transcript text to the file.

We can modify it as follows:

class WriteTXT(ResultWriter):
    extension: str = "txt"

    def write_result(self, result: dict, file: TextIO, options: dict):
        for segment in result["segments"]:
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            speaker = segment.get("speaker", "Unknown")
            text = segment["text"].strip()
            print(f"{start}\t{end}\t{speaker}\t{text}", file=file, flush=True)

The output .txt will be in this format, with start-end timestamps along with speaker labels.
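
For illustration, here is a minimal standalone sketch of the line layout that the modified writer would produce. The `fmt_ts` helper, the `write_tab_separated` function, and the sample segments are made up for this example; WhisperX's own `format_timestamp` may render times differently.

```python
import io


def fmt_ts(seconds: float) -> str:
    """Hypothetical stand-in for format_timestamp: renders HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def write_tab_separated(result: dict, file) -> None:
    """Write one line per segment: start, end, speaker, text (tab-separated)."""
    for segment in result["segments"]:
        start = fmt_ts(segment["start"])
        end = fmt_ts(segment["end"])
        # Segments without a speaker label fall back to "Unknown".
        speaker = segment.get("speaker", "Unknown")
        text = segment["text"].strip()
        print(f"{start}\t{end}\t{speaker}\t{text}", file=file)


# Made-up sample data shaped like a diarized result.
sample = {"segments": [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_08", "text": " Can you show me the email?"},
    {"start": 2.5, "end": 4.0, "text": " Sure."},
]}
buf = io.StringIO()
write_tab_separated(sample, buf)
print(buf.getvalue(), end="")
```

Each segment becomes one line with four tab-separated fields, so the file can also be opened as a TSV in a spreadsheet.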

veenified · 2024-06-14T14:16:34Z

@nkilm This works great!
I've switched over to using this for my TXT format, because I always use diarization.

I would suggest keying off the --diarize flag to decide whether the conventional TXT format is used or the revised format with SPEAKERs identified.

I tried to do this and submit a pull request, but I am failing to pass the diarization flag/parameter through to utils.py as an option.
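
One possible way to avoid passing the flag down at all is to let the writer look at the data instead. The sketch below is untested against the WhisperX code base and assumes that segments only carry a "speaker" key when diarization was run; `format_txt_lines` is a hypothetical helper name, not an existing WhisperX function.

```python
def format_txt_lines(result: dict) -> list[str]:
    """Return TXT lines, adding speaker labels only if any segment has one."""
    segments = result["segments"]
    # Assumption: a "speaker" key is present only when diarization was run.
    has_speakers = any("speaker" in seg for seg in segments)
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        if has_speakers:
            # Unlabelled segments in a diarized result fall back to "Unknown".
            speaker = seg.get("speaker", "Unknown")
            lines.append(f"[{speaker}]: {text}")
        else:
            # No diarization data: keep the conventional plain-text line.
            lines.append(text)
    return lines


# Made-up sample data for both cases.
plain = {"segments": [{"text": " Hello there."}]}
diarized = {"segments": [
    {"speaker": "SPEAKER_08", "text": " Can you show me the email?"},
    {"text": " Sure."},
]}
print(format_txt_lines(plain))
print(format_txt_lines(diarized))
```

With this approach the conventional TXT output is unchanged for runs without diarization, and no extra option has to reach utils.py.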

nkilm · 2024-06-14T16:29:03Z

@veenified

I would suggest keying off the --diarize flag to decide whether the conventional TXT format is used or the revised format with SPEAKERs identified.

Could you explain more about what you are trying to achieve?

I tried to do this and submit a pull request, but I am failing to pass the diarization flag/parameter through to utils.py as an option.

Is the PR still open? I'll see if I can help.

ypsilonkah · 2024-08-01T10:25:04Z

Hi,
First, thank you @veenified for your workaround.
I was looking for/wondering if it is possible to get plain txt files where the active speaker is only mentioned in the first line of their active speech block. At the moment my txt files have many line breaks and each line lists the active speaker; I would prefer paragraphs for each block and then a new paragraph for a new speaker, who gets identified in the first line.

I also use WhisperX on the command line ( whisperx file.mp3 --model large-v3 --language de --align_model WAV2VEC2_LARGE_LV60K_960H --diarize --print_progress True --min_speakers 2 --max_speakers 2 --hf_token HFTOKEN)

Example:

Speaker 0: Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. At vero eos et accusam et 
justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea 
takimata sanctus est Lorem ipsum dolor sit amet.

Speaker 1: Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. At vero eos et accusam et 
justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea 
takimata sanctus est Lorem ipsum dolor sit amet.
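
A layout like the one above could be produced by merging consecutive segments from the same speaker before writing. This is only a sketch under assumptions: it expects a result dictionary with a "segments" list whose entries have "text" and an optional "speaker" key, and `speaker_paragraphs` is a hypothetical helper, not part of WhisperX. Labels come out as they appear in the data (for example SPEAKER_00), not as "Speaker 0".

```python
import textwrap


def speaker_paragraphs(result: dict, width: int = 70) -> str:
    """Merge consecutive segments by the same speaker into one paragraph each."""
    blocks = []  # each entry is [speaker, [segment texts]]
    for seg in result["segments"]:
        speaker = seg.get("speaker", "Unknown")
        text = seg["text"].strip()
        if blocks and blocks[-1][0] == speaker:
            # Same speaker as the previous segment: extend the current block.
            blocks[-1][1].append(text)
        else:
            # Speaker changed: start a new block.
            blocks.append([speaker, [text]])
    # The speaker label appears once, at the start of each wrapped paragraph.
    paragraphs = [
        textwrap.fill(f"{speaker}: {' '.join(texts)}", width=width)
        for speaker, texts in blocks
    ]
    return "\n\n".join(paragraphs)


# Made-up sample data.
sample_result = {"segments": [
    {"speaker": "SPEAKER_00", "text": " Hello."},
    {"speaker": "SPEAKER_00", "text": " How are you?"},
    {"speaker": "SPEAKER_01", "text": " Fine, thanks."},
]}
print(speaker_paragraphs(sample_result))
```

The `width` parameter controls where lines wrap inside a paragraph; a blank line separates speaker turns.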
