# Data

## Crawling

First, we need to define the video ids in the `data/video_ids.txt` file. The file should contain one video id per line. For example:

```
video_id_1
video_id_2
...
video_id_n
```

This can be done by creating the file manually or by defining the variable `VIDEO_SEARCH_START_DATE` in the `src/config.py` file. `VIDEO_SEARCH_START_DATE` should be a `datetime` object, e.g. `datetime(YYYY, MM, DD)`, and the script will search for all videos uploaded within 15 days after this date.
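
As a rough illustration of the second option, the sketch below defines a start date and derives a 15-day window from it. The example date, the `search_window` helper, and the reading of the window as "start date up to 15 days later" are assumptions made for illustration; the actual search logic lives in the crawler script.

```python
from datetime import datetime, timedelta

# Value that would live in src/config.py (the date itself is only an example).
VIDEO_SEARCH_START_DATE = datetime(2024, 1, 1)


def search_window(start: datetime, days: int = 15) -> tuple[datetime, datetime]:
    """Return the (start, end) datetimes of the assumed upload-date window."""
    return start, start + timedelta(days=days)


window_start, window_end = search_window(VIDEO_SEARCH_START_DATE)
print(window_start.date(), "->", window_end.date())
```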

Then we can use the following command to download the video data.

In [1]:
# !python src/asr/utils/collect_data.py

It will download all the audio files from the videos in the `data/video_ids.txt` file and save them in the `data/raw/audios` folder, with the name `video_id.mp3`. The subtitles will be saved in the `data/raw/subtitles` folder with the name `video_id.vtt`. The metadata will be saved in the `data/metadata` folder with the name `video_id.json`.
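
The layout described above can be sketched as a small path helper, which is handy for checking what has already been downloaded. The helper names `expected_outputs` and `missing_outputs` are hypothetical. Also note that the notebook later reads a subtitle file named `<video_id>-en-auto.vtt`, so the real subtitle name may carry a language suffix that this sketch does not model.

```python
from pathlib import Path

DATA_DIR = Path("data")


def expected_outputs(video_id: str, data_dir: Path = DATA_DIR) -> dict[str, Path]:
    """Map each artifact type to the path described above for one video id."""
    return {
        "audio": data_dir / "raw" / "audios" / f"{video_id}.mp3",
        "subtitle": data_dir / "raw" / "subtitles" / f"{video_id}.vtt",
        "metadata": data_dir / "metadata" / f"{video_id}.json",
    }


def missing_outputs(video_id: str, data_dir: Path = DATA_DIR) -> list[str]:
    """Return the artifact types that are not on disk yet."""
    outputs = expected_outputs(video_id, data_dir)
    return [kind for kind, path in outputs.items() if not path.exists()]


print(expected_outputs("video_id_1")["audio"])
```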

The metadata will contain the following information:
- `video_id`: the video id
- `title`: the video title
- `description`: the video description
- `tags`: the video tags
- `category`: the video category
- `duration`: the video duration
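
A minimal sketch of reading such a metadata record and checking that the listed fields are present is shown below. The value types in the example (a list for `tags`, a number of seconds for `duration`) and the `validate_metadata` helper are assumptions, not something the crawler guarantees.

```python
import json

REQUIRED_KEYS = {"video_id", "title", "description", "tags", "category", "duration"}


def validate_metadata(raw: str) -> dict:
    """Parse a metadata JSON string and check that every listed field is present."""
    meta = json.loads(raw)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"metadata is missing fields: {sorted(missing)}")
    return meta


# Illustrative record; all values are made up.
example = json.dumps(
    {
        "video_id": "video_id_1",
        "title": "Example title",
        "description": "Example description",
        "tags": ["example"],
        "category": "Education",
        "duration": 120,
    }
)
print(validate_metadata(example)["title"])
```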

The subtitles will be saved in the VTT format. The VTT format is a simple text format that contains the subtitles in the following format:

```vtt
WEBVTT
Kind: captions
Language: en

00:00:04.376 --> 00:00:08.463
My first job as an investor
was when I was 24 years old,

00:00:08.505 --> 00:00:12.217
and I'm almost 50 today,
so that is half a lifetime ago.

...
```
The script saves the original subtitles provided by the video's author when they are available. When the author's subtitles are not accessible, it falls back to YouTube's automatic captions.
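
To make the VTT layout concrete, here is a minimal parsing sketch that turns cue blocks like the ones above into start/end times in milliseconds plus text. It is not the project's `parse_subtitle` function; `parse_vtt` and `to_ms` are hypothetical helpers, and the sketch assumes `HH:MM:SS.mmm` timestamps and ignores the inline timing tags that automatic captions often contain.

```python
import re

CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert timestamp parts to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_vtt(text: str) -> list[dict]:
    """Collect cues as dicts with 'start', 'end' (milliseconds) and 'text'."""
    cues, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        match = CUE_RE.match(stripped)
        if match:
            parts = match.groups()
            current = {"start": to_ms(*parts[:4]), "end": to_ms(*parts[4:]), "text": ""}
            cues.append(current)
        elif current is not None:
            if stripped:
                # Join multi-line cue text with single spaces
                current["text"] = (current["text"] + " " + stripped).strip()
            else:
                current = None  # a blank line ends the cue
    return cues


sample = """WEBVTT
Kind: captions
Language: en

00:00:04.376 --> 00:00:08.463
My first job as an investor
was when I was 24 years old,

00:00:08.505 --> 00:00:12.217
and I'm almost 50 today,
so that is half a lifetime ago.
"""
cues = parse_vtt(sample)
print(len(cues), cues[0]["start"], cues[0]["end"])
```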

## Preprocessing

After downloading the data, we preprocess the audio and subtitles. The subtitles are converted to a JSON file in which each subtitle is a dictionary with the following keys:

```json
{
    "data": [
        {
            "id": 0,
            "speaker": "A",
            "text": " Directions, in this part, you will be asked to refer to information on the screen in order to answer three questions.",
            "start": 0,
            "end": 1020 // in milliseconds
        },
        {
            "id": 1,
            "speaker": "A",
            "text": " Directions, in this part, you will be asked to refer to information on the screen in order to answer three questions.",
            "start": 2003,
            "end": 10200 // in milliseconds
        }
        // ....
    ]
}
```
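
The structure above can be produced from a list of timed cues with a few lines of Python. This is only a sketch: `to_subtitle_json` is a hypothetical helper, the single speaker label `"A"` mirrors the example, and the cue texts are placeholders. Note that strict JSON does not allow `//` comments, so the annotations in the listing above are explanatory only and would not appear in a real file.

```python
import json


def to_subtitle_json(cues: list[dict], speaker: str = "A") -> dict:
    """Wrap timed cues in the {"data": [...]} layout shown above."""
    return {
        "data": [
            {
                "id": idx,
                "speaker": speaker,
                "text": cue["text"],
                "start": cue["start"],  # milliseconds
                "end": cue["end"],  # milliseconds
            }
            for idx, cue in enumerate(cues)
        ]
    }


example_cues = [
    {"start": 0, "end": 1020, "text": "first subtitle"},
    {"start": 2003, "end": 10200, "text": "second subtitle"},
]
subtitle_json = to_subtitle_json(example_cues)
print(json.dumps(subtitle_json, indent=2))
```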

In [2]:
import json
from pprint import pprint
import pandas as pd

from asr.utils.parser import parse_subtitle
import config as cfg

subtitles = parse_subtitle(f"{cfg.SUBTITLE_RAW_PATH}/{cfg.TWO_PEOPLE_VIDEO_ID}-en-auto.vtt")
with open(f"{cfg.SUBTITLE_PROCESSED_PATH}/{cfg.TWO_PEOPLE_VIDEO_ID}.json", "w+") as f:
    json.dump(subtitles, f, indent=2, default=str)

1161 overall subtitles
1161 without overlap subtitles
1161 after filtering
83 merged
not cool 


In [3]:
df = pd.DataFrame(subtitles)
df.head()

| | ts_start | ts_end | original_phrase | sub_file | duration | idx | phrase | hash |
|---|---|---|---|---|---|---|---|---|
| 0 | 00:00:00.080000 | 00:00:14.719000 | let's say that you are designing a let's say... | /space/hotel/phit/personal/asr/data/raw/subtit... | 14.639 | 0 | let's say that you are designing a | 065479dc321578faf12de4c03b5082b80893134e9363c2... |
| 1 | 00:00:14.719000 | 00:00:28.800000 | marketplace is selling a gun how would you go ... | /space/hotel/phit/personal/asr/data/raw/subtit... | 14.081 | 16 | marketplace is selling a gun how would you go ... | d54cb81716f422937117c190414326a5bc946f56cb8912... |
| 2 | 00:00:28.800000 | 00:00:43.680000 | listings what happens with those identificatio... | /space/hotel/phit/personal/asr/data/raw/subtit... | 14.88 | 32 | listings what happens with those identificatio... | 02581d6c4b20027dcab817fe4fe5260753f2fd8cea1e01... |
| 3 | 00:00:43.680000 | 00:00:57.360000 | and then a user can flag the listing if they s... | /space/hotel/phit/personal/asr/data/raw/subtit... | 13.68 | 44 | and then a user can flag the listing if they s... | 0748072ad938406083885a698ffe51336765456031604c... |
| 4 | 00:00:57.360000 | 00:01:12.159000 | determine it as a gun and that's the only thin... | /space/hotel/phit/personal/asr/data/raw/subtit... | 14.799 | 58 | determine it as a gun and that's the only thin... | 0f925ccf7ff103e2d8dd13b28e468be09571e3992d84e9... |


# Model

In [4]:
# !pip install git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560
# !pip install dora-search lameenc openunmix wget Cython
# !pip install --no-build-isolation "nemo_toolkit[asr]==1.23.0"
# !pip install --no-deps git+https://github.com/facebookresearch/demucs#egg=demucs
# !pip install git+https://github.com/oliverguhr/deepmultilingualpunctuation.git

In [5]:
# !pip install -r requirements.txt

In [6]:
import os
import wget
from omegaconf import OmegaConf
import json
import shutil
import whisperx
import torch
from pydub import AudioSegment
from nemo.collections.asr.models.msdd_models import NeuralDiarizer
from deepmultilingualpunctuation import PunctuationModel
import re
import logging
import nltk
from whisperx.alignment import DEFAULT_ALIGN_MODELS_HF, DEFAULT_ALIGN_MODELS_TORCH
from whisperx.utils import LANGUAGES, TO_LANGUAGE_CODE

This version of torchaudio is old. SpeechBrain no longer tries using the torchaudio global backend mechanism in recipes, so if you encounter issues, update torchaudio.


In [7]:
punct_model_langs = [
    "en",
    "fr",
    "de",
    "es",
    "it",
    "nl",
    "pt",
    "bg",
    "pl",
    "cs",
    "sk",
    "sl",
]
wav2vec2_langs = list(DEFAULT_ALIGN_MODELS_TORCH.keys()) + list(
    DEFAULT_ALIGN_MODELS_HF.keys()
)

whisper_langs = sorted(LANGUAGES.keys()) + sorted(
    [k.title() for k in TO_LANGUAGE_CODE.keys()]
)


def create_config(output_dir):
    DOMAIN_TYPE = "telephonic"  # Can be meeting, telephonic, or general based on domain type of the audio file
    CONFIG_FILE_NAME = f"diar_infer_{DOMAIN_TYPE}.yaml"
    CONFIG_URL = f"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/{CONFIG_FILE_NAME}"
    MODEL_CONFIG = os.path.join(output_dir, CONFIG_FILE_NAME)
    if not os.path.exists(MODEL_CONFIG):
        MODEL_CONFIG = wget.download(CONFIG_URL, output_dir)

    config = OmegaConf.load(MODEL_CONFIG)

    data_dir = os.path.join(output_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    meta = {
        "audio_filepath": os.path.join(output_dir, "mono_file.wav"),
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "rttm_filepath": None,
        "uem_filepath": None,
    }
    with open(os.path.join(data_dir, "input_manifest.json"), "w") as fp:
        json.dump(meta, fp)
        fp.write("\n")

    pretrained_vad = "vad_multilingual_marblenet"
    pretrained_speaker_model = "titanet_large"
    config.num_workers = 0  # Workaround for multiprocessing hanging with ipython issue
    config.diarizer.manifest_filepath = os.path.join(data_dir, "input_manifest.json")
    config.diarizer.out_dir = (
        output_dir  # Directory to store intermediate files and prediction outputs
    )

    config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
    config.diarizer.oracle_vad = (
        False  # compute VAD provided with model_path to vad config
    )
    config.diarizer.clustering.parameters.oracle_num_speakers = False

    # Here, we use our in-house pretrained NeMo VAD model
    config.diarizer.vad.model_path = pretrained_vad
    config.diarizer.vad.parameters.onset = 0.8
    config.diarizer.vad.parameters.offset = 0.6
    config.diarizer.vad.parameters.pad_offset = -0.05
    config.diarizer.msdd_model.model_path = (
        "diar_msdd_telephonic"  # Telephonic speaker diarization model
    )

    return config


def get_word_ts_anchor(s, e, option="start"):
    if option == "end":
        return e
    elif option == "mid":
        return (s + e) / 2
    return s


def get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option="start"):
    s, e, sp = spk_ts[0]
    wrd_pos, turn_idx = 0, 0
    wrd_spk_mapping = []
    for wrd_dict in wrd_ts:
        ws, we, wrd = (
            int(wrd_dict["start"] * 1000),
            int(wrd_dict["end"] * 1000),
            wrd_dict["word"],
        )
        wrd_pos = get_word_ts_anchor(ws, we, word_anchor_option)
        while wrd_pos > float(e):
            turn_idx += 1
            turn_idx = min(turn_idx, len(spk_ts) - 1)
            s, e, sp = spk_ts[turn_idx]
            if turn_idx == len(spk_ts) - 1:
                e = get_word_ts_anchor(ws, we, option="end")
        wrd_spk_mapping.append(
            {"word": wrd, "start_time": ws, "end_time": we, "speaker": sp}
        )
    return wrd_spk_mapping


sentence_ending_punctuations = ".?!"


def get_first_word_idx_of_sentence(word_idx, word_list, speaker_list, max_words):
    is_word_sentence_end = (
        lambda x: x >= 0 and word_list[x][-1] in sentence_ending_punctuations
    )
    left_idx = word_idx
    while (
        left_idx > 0
        and word_idx - left_idx < max_words
        and speaker_list[left_idx - 1] == speaker_list[left_idx]
        and not is_word_sentence_end(left_idx - 1)
    ):
        left_idx -= 1

    return left_idx if left_idx == 0 or is_word_sentence_end(left_idx - 1) else -1


def get_last_word_idx_of_sentence(word_idx, word_list, max_words):
    is_word_sentence_end = (
        lambda x: x >= 0 and word_list[x][-1] in sentence_ending_punctuations
    )
    right_idx = word_idx
    while (
        # Stop at the last valid index so the lookup after the loop stays in bounds
        right_idx < len(word_list) - 1
        and right_idx - word_idx < max_words
        and not is_word_sentence_end(right_idx)
    ):
        right_idx += 1

    return (
        right_idx
        if right_idx == len(word_list) - 1 or is_word_sentence_end(right_idx)
        else -1
    )


def get_realigned_ws_mapping_with_punctuation(
    word_speaker_mapping, max_words_in_sentence=50
):
    is_word_sentence_end = (
        lambda x: x >= 0
        and word_speaker_mapping[x]["word"][-1] in sentence_ending_punctuations
    )
    wsp_len = len(word_speaker_mapping)

    words_list, speaker_list = [], []
    for k, line_dict in enumerate(word_speaker_mapping):
        word, speaker = line_dict["word"], line_dict["speaker"]
        words_list.append(word)
        speaker_list.append(speaker)

    k = 0
    while k < len(word_speaker_mapping):
        line_dict = word_speaker_mapping[k]
        if (
            k < wsp_len - 1
            and speaker_list[k] != speaker_list[k + 1]
            and not is_word_sentence_end(k)
        ):
            left_idx = get_first_word_idx_of_sentence(
                k, words_list, speaker_list, max_words_in_sentence
            )
            right_idx = (
                get_last_word_idx_of_sentence(
                    k, words_list, max_words_in_sentence - k + left_idx - 1
                )
                if left_idx > -1
                else -1
            )
            if min(left_idx, right_idx) == -1:
                k += 1
                continue

            spk_labels = speaker_list[left_idx : right_idx + 1]
            mod_speaker = max(set(spk_labels), key=spk_labels.count)
            if spk_labels.count(mod_speaker) < len(spk_labels) // 2:
                k += 1
                continue

            speaker_list[left_idx : right_idx + 1] = [mod_speaker] * (
                right_idx - left_idx + 1
            )
            k = right_idx

        k += 1

    k, realigned_list = 0, []
    while k < len(word_speaker_mapping):
        line_dict = word_speaker_mapping[k].copy()
        line_dict["speaker"] = speaker_list[k]
        realigned_list.append(line_dict)
        k += 1

    return realigned_list


def get_sentences_speaker_mapping(word_speaker_mapping, spk_ts):
    sentence_checker = nltk.tokenize.PunktSentenceTokenizer().text_contains_sentbreak
    s, e, spk = spk_ts[0]
    prev_spk = spk

    snts = []
    snt = {"speaker": f"Speaker {spk}", "start_time": s, "end_time": e, "text": ""}

    for wrd_dict in word_speaker_mapping:
        wrd, spk = wrd_dict["word"], wrd_dict["speaker"]
        s, e = wrd_dict["start_time"], wrd_dict["end_time"]
        if spk != prev_spk or sentence_checker(snt["text"] + " " + wrd):
            snts.append(snt)
            snt = {
                "speaker": f"Speaker {spk}",
                "start_time": s,
                "end_time": e,
                "text": "",
            }
        else:
            snt["end_time"] = e
        snt["text"] += wrd + " "
        prev_spk = spk

    snts.append(snt)
    return snts


def get_speaker_aware_transcript(sentences_speaker_mapping, f):
    previous_speaker = sentences_speaker_mapping[0]["speaker"]
    f.write(f"{previous_speaker}: ")

    for sentence_dict in sentences_speaker_mapping:
        speaker = sentence_dict["speaker"]
        sentence = sentence_dict["text"]

        # If this speaker doesn't match the previous one, start a new paragraph
        if speaker != previous_speaker:
            f.write(f"\n\n{speaker}: ")
            previous_speaker = speaker

        # No matter what, write the current sentence
        f.write(sentence + " ")


def format_timestamp(
    milliseconds: float, always_include_hours: bool = False, decimal_marker: str = "."
):
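    assert milliseconds >= 0, "non-negative timestamp expected"
    # The ":02d"/":03d" format specs below only accept integers
    milliseconds = int(round(milliseconds))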

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return (
        f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
    )


def write_srt(transcript, file):
    """
    Write a transcript to a file in SRT format.

    """
    for i, segment in enumerate(transcript, start=1):
        # write srt lines
        print(
            f"{i}\n"
            f"{format_timestamp(segment['start_time'], always_include_hours=True, decimal_marker=',')} --> "
            f"{format_timestamp(segment['end_time'], always_include_hours=True, decimal_marker=',')}\n"
            f"{segment['speaker']}: {segment['text'].strip().replace('-->', '->')}\n",
            file=file,
            flush=True,
        )


def find_numeral_symbol_tokens(tokenizer):
    numeral_symbol_tokens = [
        -1,
    ]
    for token, token_id in tokenizer.get_vocab().items():
        has_numeral_symbol = any(c in "0123456789%$£" for c in token)
        if has_numeral_symbol:
            numeral_symbol_tokens.append(token_id)
    return numeral_symbol_tokens


def _get_next_start_timestamp(word_timestamps, current_word_index, final_timestamp):
    # if current word is the last word
    if current_word_index == len(word_timestamps) - 1:
        return word_timestamps[current_word_index]["start"]

    next_word_index = current_word_index + 1
    while current_word_index < len(word_timestamps) - 1:
        if word_timestamps[next_word_index].get("start") is None:
            # if next word doesn't have a start timestamp
            # merge it with the current word and delete it
            word_timestamps[current_word_index]["word"] += (
                " " + word_timestamps[next_word_index]["word"]
            )

            word_timestamps[next_word_index]["word"] = None
            next_word_index += 1
            if next_word_index == len(word_timestamps):
                return final_timestamp

        else:
            return word_timestamps[next_word_index]["start"]


def filter_missing_timestamps(
    word_timestamps, initial_timestamp=0, final_timestamp=None
):
    # handle the first and last word
    if word_timestamps[0].get("start") is None:
        word_timestamps[0]["start"] = (
            initial_timestamp if initial_timestamp is not None else 0
        )
        word_timestamps[0]["end"] = _get_next_start_timestamp(
            word_timestamps, 0, final_timestamp
        )

    result = [
        word_timestamps[0],
    ]

    for i, ws in enumerate(word_timestamps[1:], start=1):
        # if ws doesn't have a start and end
        # use the previous end as start and next start as end
        if ws.get("start") is None and ws.get("word") is not None:
            ws["start"] = word_timestamps[i - 1]["end"]
            ws["end"] = _get_next_start_timestamp(word_timestamps, i, final_timestamp)

        if ws["word"] is not None:
            result.append(ws)
    return result


def cleanup(path: str):
    """path could either be relative or absolute."""
    # check if file or directory exists
    if os.path.isfile(path) or os.path.islink(path):
        # remove file
        os.remove(path)
    elif os.path.isdir(path):
        # remove directory and all its content
        shutil.rmtree(path)
    else:
        raise ValueError("Path {} is not a file or dir.".format(path))


def process_language_arg(language: str, model_name: str):
    """
    Process the language argument to make sure it's valid and convert language names to language codes.
    """
    if language is not None:
        language = language.lower()
        # Only validate when a language was given; None means "autodetect"
        if language not in LANGUAGES:
            if language in TO_LANGUAGE_CODE:
                language = TO_LANGUAGE_CODE[language]
            else:
                raise ValueError(f"Unsupported language: {language}")

    if model_name.endswith(".en") and language != "en":
        if language is not None:
            logging.warning(
                f"{model_name} is an English-only model but received '{language}'; using English instead."
            )
        language = "en"
    return language


def transcribe(
    audio_file: str,
    language: str,
    model_name: str,
    compute_dtype: str,
    suppress_numerals: bool,
    device: str,
):
    from faster_whisper import WhisperModel

    # find_numeral_symbol_tokens and wav2vec2_langs are defined earlier in this
    # cell, so no separate helpers import is needed.

    # Faster Whisper non-batched; device and compute_dtype select the precision:
    #   GPU with FP16: device="cuda", compute_dtype="float16"
    #   GPU with INT8: device="cuda", compute_dtype="int8_float16"
    #   CPU with INT8: device="cpu", compute_dtype="int8"
    whisper_model = WhisperModel(model_name, device=device, compute_type=compute_dtype)

    if suppress_numerals:
        numeral_symbol_tokens = find_numeral_symbol_tokens(whisper_model.hf_tokenizer)
    else:
        numeral_symbol_tokens = None

    if language is not None and language in wav2vec2_langs:
        word_timestamps = False
    else:
        word_timestamps = True

    segments, info = whisper_model.transcribe(
        audio_file,
        language=language,
        beam_size=5,
        word_timestamps=word_timestamps,  # TODO: disable this if the language is supported by wav2vec2
        suppress_tokens=numeral_symbol_tokens,
        vad_filter=True,
    )
    whisper_results = []
    for segment in segments:
        whisper_results.append(segment._asdict())
    # clear gpu vram
    del whisper_model
    torch.cuda.empty_cache()
    return whisper_results, language


def transcribe_batched(
    audio_file: str,
    language: str,
    batch_size: int,
    model_name: str,
    compute_dtype: str,
    suppress_numerals: bool,
    device: str,
):
    import whisperx

    # Faster Whisper batched
    whisper_model = whisperx.load_model(
        model_name,
        device,
        compute_type=compute_dtype,
        asr_options={"suppress_numerals": suppress_numerals},
    )
    audio = whisperx.load_audio(audio_file)
    result = whisper_model.transcribe(audio, language=language, batch_size=batch_size)
    del whisper_model
    torch.cuda.empty_cache()
    return result["segments"], result["language"]


# Options

In [8]:
video_id = cfg.ASSESSMENT_VIDEO_IDS[1]
# Name of the audio file
audio_path = f"{cfg.AUDIO_RAW_PATH}/{video_id}.mp3"

# Whether to remove music from the speech first; this helps diarization quality but uses a lot of RAM
enable_stemming = True

# (choose from 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large')
whisper_model_name = "large-v3"

# Replaces numerical digits with their spelled-out pronunciation, which increases diarization accuracy
suppress_numerals = True

batch_size = 8

language = None  # autodetect language

device = "cuda" if torch.cuda.is_available() else "cpu"

# Processing

## Separating music from speech using Demucs
---
By isolating the vocals from the rest of the audio, it becomes easier to identify and track individual speakers based on the spectral and temporal characteristics of their speech signals. Source separation is just one of many techniques that can be used as a preprocessing step to help improve the accuracy and reliability of the overall diarization process.

In [9]:
if enable_stemming:
    # Isolate vocals from the rest of the audio

    return_code = os.system(
        f'python3 -m demucs.separate -n htdemucs --two-stems=vocals "{audio_path}" -o "temp_outputs"'
    )

    if return_code != 0:
        logging.warning("Source splitting failed, using original audio file.")
        vocal_target = audio_path
    else:
        vocal_target = os.path.join(
            "temp_outputs",
            "htdemucs",
            os.path.splitext(os.path.basename(audio_path))[0],
            "vocals.wav",
        )
else:
    vocal_target = audio_path

Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/htdemucs
Separating track /space/hotel/phit/personal/asr/data/raw/audios/Y8tlFLIjyMU.mp3


  0%|                                                                                | 0.0/1585.35 [00:00<?, ?seconds/s]

100%|████████████████████████████████████████████████████████████████████| 1585.35/1585.35 [00:39<00:00, 40.04seconds/s]


In [10]:
vocal_target

'temp_outputs/htdemucs/Y8tlFLIjyMU/vocals.wav'

## Transcribing audio using Whisper and realigning timestamps using Wav2Vec2
---
This code uses two different open-source models to transcribe speech and perform forced alignment on the resulting transcription.

The first model is OpenAI Whisper, a speech recognition model that transcribes speech with high accuracy. The code loads the Whisper model and uses it to transcribe the `vocal_target` file.

The output of the transcription process is a set of text segments with corresponding timestamps indicating when each segment was spoken.

In [11]:
# !pip install ctranslate2==3.24.0

In [12]:
compute_type = "float16"
# or run on GPU with INT8
# compute_type = "int8_float16"
# or run on CPU with INT8
# compute_type = "int8"

if batch_size != 0:
    whisper_results, language = transcribe_batched(
        vocal_target,
        language,
        batch_size,
        whisper_model_name,
        compute_type,
        suppress_numerals,
        device,
    )
else:
    whisper_results, language = transcribe(
        vocal_target,
        language,
        whisper_model_name,
        compute_type,
        suppress_numerals,
        device,
    )

No language specified, language will be first be detected for each audio file (increases inference time).


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../../../../../space/hotel/phit/.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Detected language: en (0.98) in first 30s of audio...
Suppressing numeral and symbol tokens


In [13]:
whisper_results

[{'text': " Hello, everyone. Welcome back to another round of Turing Mock Interviews. I am José and I'm a tech leader at Turing. I'm from Montreal, Canada, and at Turing, I work on the hiring team, helping them to hire the best engineers by helping with the vetting process. I have more than seventeen years of experience and my expertise lies in JavaScript.",
  'start': 10.128,
  'end': 33.797},
 {'text': " Today, I will be interviewing Victor for the role of an experienced Node.js developer. Let's hear from Victor. Hello, Victor. How are you doing and how is your day doing so far? Hi, Guilci. I'm doing good. How are you, too?",
  'start': 33.797,
  'end': 49.991},
 {'text': " I'm doing great. So thanks for asking. So could you please introduce yourself and tell me a little bit about your experience, your past project and what kind of language framework have you been working with? And then I can take from there. Yes, I am Victor. I'm from Nigeria.",
  'start': 51.186,
  'end': 70.367},


### Save result for assessment 01 (transcription)

In [14]:
from pprint import pprint
import json

# Copy each segment dict so that whisper_results itself is not modified
res_ = [dict(r) for r in whisper_results]
for i, r in enumerate(res_):
    r["id"] = i
    r["speaker"] = "A"

final_result = {
    "data": res_
}
pprint(final_result)

with open(f"results/assessment_01/{video_id}-asr.json", "w") as f:
    json.dump(final_result, f, indent=2)

{'data': [{'end': 33.797,
           'id': 0,
           'speaker': 'A',
           'start': 10.128,
           'text': ' Hello, everyone. Welcome back to another round of Turing '
                   "Mock Interviews. I am José and I'm a tech leader at "
                   "Turing. I'm from Montreal, Canada, and at Turing, I work "
                   'on the hiring team, helping them to hire the best '
                   'engineers by helping with the vetting process. I have more '
                   'than seventeen years of experience and my expertise lies '
                   'in JavaScript.'},
          {'end': 49.991,
           'id': 1,
           'speaker': 'A',
           'start': 33.797,
           'text': ' Today, I will be interviewing Victor for the role of an '
                   "experienced Node.js developer. Let's hear from Victor. "
                   'Hello, Victor. How are you doing and how is your day doing '
                   "so far? Hi, Guilci. I'm doing good. Ho

,
          {'end': 601.613,
           'id': 25,
           'speaker': 'A',
           'start': 571.715,
           'text': ' the engine with Node.js, you know, yeah. Okay, this one '
                   'is a classic question, okay, classic Node.js question that '
                   "you'll be asking a lot during the interview like this. My "
                   'question is, what do you understand about callbacks? All '
                   'right, so callbacks are a way of, in Node.js, a way of '
                   'performing an operation asynchronous with the callbacks '
                   'essential functions that are passed into'},
          {'end': 623.66,
           'id': 26,
           'speaker': 'A',
           'start': 601.971,
           'text': ' um another function or any other asynchronous process to '
                   'be called at the later time and that time is usually '
                   'indeterminate um but um so the callback those functions '
                   '

## Aligning the transcription with the original audio using Wav2Vec2
---
The second model is wav2vec2, a large-scale neural network that learns speech representations useful for a variety of speech processing tasks, including speech recognition and alignment.

The code loads the wav2vec2 alignment model and uses it to align the transcription segments with the audio signal contained in the `vocal_target` file. This process finds the exact timestamps in the audio where each word was spoken and aligns the text accordingly.

By combining the outputs of the two models, the code produces a fully aligned transcription of the speech contained in the `vocal_target` file. This aligned transcription can be useful for a variety of speech processing tasks, such as speaker diarization, sentiment analysis, and language identification.

If there is no Wav2Vec2 model available for your language, the word timestamps generated by Whisper are used instead. This fallback requires `batch_size = 0`, because only the non-batched transcription path produces word timestamps.

In [15]:
if language in wav2vec2_langs:
    device = "cuda"
    alignment_model, metadata = whisperx.load_align_model(
        language_code=language, device=device
    )
    result_aligned = whisperx.align(
        whisper_results, alignment_model, metadata, vocal_target, device
    )
    word_timestamps = filter_missing_timestamps(
        result_aligned["word_segments"],
        initial_timestamp=whisper_results[0].get("start"),
        final_timestamp=whisper_results[-1].get("end"),
    )

    # clear gpu vram
    del alignment_model
    torch.cuda.empty_cache()
else:
    assert batch_size == 0, (  # TODO: add a better check for word timestamps existence
        f"Unsupported language: {language}; set batch_size to 0"
        " to generate word timestamps using Whisper directly and fix this error."
    )
    word_timestamps = []
    for segment in whisper_results:
        for word in segment["words"]:
            word_timestamps.append({"word": word[2], "start": word[0], "end": word[1]})

In [16]:
result_aligned["word_segments"]

[{'word': 'Hello,', 'start': 10.148, 'end': 10.348, 'score': 0.57},
 {'word': 'everyone.', 'start': 10.388, 'end': 10.748, 'score': 0.836},
 {'word': 'Welcome', 'start': 10.868, 'end': 11.228, 'score': 0.93},
 {'word': 'back', 'start': 11.288, 'end': 11.469, 'score': 0.893},
 {'word': 'to', 'start': 11.509, 'end': 11.609, 'score': 0.96},
 {'word': 'another', 'start': 11.669, 'end': 12.249, 'score': 0.84},
 {'word': 'round', 'start': 12.469, 'end': 12.829, 'score': 0.905},
 {'word': 'of', 'start': 12.889, 'end': 12.989, 'score': 0.898},
 {'word': 'Turing', 'start': 13.029, 'end': 13.269, 'score': 0.389},
 {'word': 'Mock', 'start': 13.349, 'end': 13.609, 'score': 0.604},
 {'word': 'Interviews.', 'start': 13.649, 'end': 14.19, 'score': 0.788},
 {'word': 'I', 'start': 14.55, 'end': 14.61, 'score': 0.988},
 {'word': 'am', 'start': 14.67, 'end': 14.75, 'score': 0.867},
 {'word': 'José', 'start': 14.79, 'end': 15.27, 'score': 0.899},
 {'word': 'and', 'start': 15.57, 'end': 15.71, 'score': 0.7

In [17]:
word_timestamps

[{'word': 'Hello,', 'start': 10.148, 'end': 10.348, 'score': 0.57},
 {'word': 'everyone.', 'start': 10.388, 'end': 10.748, 'score': 0.836},
 {'word': 'Welcome', 'start': 10.868, 'end': 11.228, 'score': 0.93},
 {'word': 'back', 'start': 11.288, 'end': 11.469, 'score': 0.893},
 {'word': 'to', 'start': 11.509, 'end': 11.609, 'score': 0.96},
 {'word': 'another', 'start': 11.669, 'end': 12.249, 'score': 0.84},
 {'word': 'round', 'start': 12.469, 'end': 12.829, 'score': 0.905},
 {'word': 'of', 'start': 12.889, 'end': 12.989, 'score': 0.898},
 {'word': 'Turing', 'start': 13.029, 'end': 13.269, 'score': 0.389},
 {'word': 'Mock', 'start': 13.349, 'end': 13.609, 'score': 0.604},
 {'word': 'Interviews.', 'start': 13.649, 'end': 14.19, 'score': 0.788},
 {'word': 'I', 'start': 14.55, 'end': 14.61, 'score': 0.988},
 {'word': 'am', 'start': 14.67, 'end': 14.75, 'score': 0.867},
 {'word': 'José', 'start': 14.79, 'end': 15.27, 'score': 0.899},
 {'word': 'and', 'start': 15.57, 'end': 15.71, 'score': 0.7

## Convert audio to mono for NeMo compatibility

In [18]:
sound = AudioSegment.from_file(vocal_target).set_channels(1)
ROOT = os.getcwd()
temp_path = os.path.join(ROOT, "temp_outputs")
os.makedirs(temp_path, exist_ok=True)
sound.export(os.path.join(temp_path, "mono_file.wav"), format="wav")

<_io.BufferedRandom name='/mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/mono_file.wav'>

## Speaker Diarization using NeMo MSDD Model
---
This cell uses NVIDIA NeMo's MSDD (Multi-scale Diarization Decoder) model to perform speaker diarization on the mono audio file. Speaker diarization splits an audio signal into segments according to who is speaking at any given time.

In [20]:
# Initialize NeMo MSDD diarization model
msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cpu")  # switch to "cuda" if a GPU is available
msdd_model.diarize()

del msdd_model
torch.cuda.empty_cache()

[NeMo I 2024-05-12 21:20:36 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-05-12 21:20:36 cloud:58] Found existing object /home/phit/.cache/torch/NeMo/NeMo_1.23.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-05-12 21:20:36 cloud:64] Re-using file from: /home/phit/.cache/torch/NeMo/NeMo_1.23.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-05-12 21:20:36 common:924] Instantiating model from pre-trained checkpoint


[NeMo W 2024-05-12 21:20:38 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-05-12 21:20:38 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-05-12 21:20:38 modelPT:178] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-05-12 21:20:38 features:289] PADDING: 16
[NeMo I 2024-05-12 21:20:38 features:289] PADDING: 16
[NeMo I 2024-05-12 21:20:38 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-05-12 21:20:39 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/phit/.cache/torch/NeMo/NeMo_1.23.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-05-12 21:20:39 features:289] PADDING: 16
[NeMo I 2024-05-12 21:20:39 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-05-12 21:20:39 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-05-12 21:20:39 cloud:58] Found existing object /home/phit/.cache/torch/NeMo/NeMo_1.23.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-05-12 21:20:39 cloud:64] Re-using file from: /home/phit/.cache/torch/NeMo/NeMo_1.23.0/v

[NeMo W 2024-05-12 21:20:39 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-05-12 21:20:39 features:289] PADDING: 16
[NeMo I 2024-05-12 21:20:40 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/phit/.cache/torch/NeMo/NeMo_1.23.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-05-12 21:20:40 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-05-12 21:20:40 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-05-12 21:20:40 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-05-12 21:20:40 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-05-12 21:20:40 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:01<00:00,  1.42s/it]


[NeMo I 2024-05-12 21:20:41 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-05-12 21:20:41 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-05-12 21:20:41 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-05-12 21:20:41 collections:446] Dataset loaded with 32 items, total duration of  0.44 hours.
[NeMo I 2024-05-12 21:20:41 collections:448] # 32 files loaded accounting to # 1 labels


vad: 100%|██████████| 32/32 [00:43<00:00,  1.37s/it]


[NeMo I 2024-05-12 21:21:25 clustering_diarizer:250] Generating predictions with overlapping input segments


                                                               

[NeMo I 2024-05-12 21:21:41 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]


[NeMo I 2024-05-12 21:21:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-05-12 21:21:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-05-12 21:21:42 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-05-12 21:21:42 collections:446] Dataset loaded with 1596 items, total duration of  0.59 hours.
[NeMo I 2024-05-12 21:21:42 collections:448] # 1596 files loaded accounting to # 1 labels


[1/5] extract embeddings: 100%|██████████| 25/25 [01:37<00:00,  3.92s/it]


[NeMo I 2024-05-12 21:23:20 clustering_diarizer:389] Saved embedding files to /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-05-12 21:23:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-05-12 21:23:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-05-12 21:23:21 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-05-12 21:23:21 collections:446] Dataset loaded with 1931 items, total duration of  0.61 hours.
[NeMo I 2024-05-12 21:23:21 collections:448] # 1931 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 31/31 [01:35<00:00,  3.07s/it]


[NeMo I 2024-05-12 21:24:56 clustering_diarizer:389] Saved embedding files to /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-05-12 21:24:56 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-05-12 21:24:56 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-05-12 21:24:56 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-05-12 21:24:56 collections:446] Dataset loaded with 2410 items, total duration of  0.62 hours.
[NeMo I 2024-05-12 21:24:56 collections:448] # 2410 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 38/38 [01:37<00:00,  2.58s/it]


[NeMo I 2024-05-12 21:26:35 clustering_diarizer:389] Saved embedding files to /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-05-12 21:26:35 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-05-12 21:26:35 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-05-12 21:26:35 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-05-12 21:26:35 collections:446] Dataset loaded with 3257 items, total duration of  0.65 hours.
[NeMo I 2024-05-12 21:26:35 collections:448] # 3257 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 51/51 [01:35<00:00,  1.86s/it]


[NeMo I 2024-05-12 21:28:11 clustering_diarizer:389] Saved embedding files to /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-05-12 21:28:11 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-05-12 21:28:11 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-05-12 21:28:11 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-05-12 21:28:11 collections:446] Dataset loaded with 4967 items, total duration of  0.67 hours.
[NeMo I 2024-05-12 21:28:11 collections:448] # 4967 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 78/78 [02:00<00:00,  1.55s/it]


[NeMo I 2024-05-12 21:30:14 clustering_diarizer:389] Saved embedding files to /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings


[NeMo W 2024-05-12 21:30:14 speaker_utils:464] cuda=False, using CPU for eigen decomposition. This might slow down the clustering process.
clustering: 100%|██████████| 1/1 [00:10<00:00, 10.18s/it]


[NeMo I 2024-05-12 21:30:24 clustering_diarizer:464] Outputs are saved in /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs directory


[NeMo W 2024-05-12 21:30:24 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-05-12 21:30:24 msdd_models:960] Loading embedding pickle file of scale:0 at /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-05-12 21:30:24 msdd_models:960] Loading embedding pickle file of scale:1 at /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-05-12 21:30:24 msdd_models:960] Loading embedding pickle file of scale:2 at /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-05-12 21:30:24 msdd_models:960] Loading embedding pickle file of scale:3 at /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-05-12 21:30:24 msdd_models:960] Loading embedding pickle file of scale:4 at /mnt/net/i2x256-ai03/hotel/phit/personal/asr/temp_outputs/speaker_outputs/embed

100%|██████████| 1/1 [00:00<00:00,  1.12it/s]


[NeMo I 2024-05-12 21:30:26 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-05-12 21:30:26 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-05-12 21:30:26 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-05-12 21:30:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-05-12 21:30:26 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-05-12 21:30:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-05-12 21:30:26 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-05-12 21:30:27 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-05-12 21:30:27 msdd_models:1431]   
    


## Mapping Speakers to Sentences According to Timestamps


In [21]:
# Reading timestamps <> Speaker Labels mapping

speaker_ts = []
with open(os.path.join(temp_path, "pred_rttms", "mono_file.rttm"), "r") as f:
    lines = f.readlines()
    for line in lines:
        line_list = line.split(" ")
        s = int(float(line_list[5]) * 1000)
        e = s + int(float(line_list[8]) * 1000)
        speaker_ts.append([s, e, int(line_list[11].split("_")[-1])])

wsm = get_words_speaker_mapping(word_timestamps, speaker_ts, "start")
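
The fixed indices 5, 8 and 11 in the cell above only work with `split(" ")` when the RTTM file pads its columns with a specific number of spaces. As a hedged alternative, the sketch below splits on any run of whitespace and reads the standard RTTM columns instead (turn onset at index 3, duration at index 4, speaker name at index 7). The function name `parse_rttm_line` and the sample line are illustrative assumptions, not part of the notebook, and `round` is used here in place of `int` to avoid off-by-one-millisecond truncation.

```python
def parse_rttm_line(line):
    """Convert one RTTM "SPEAKER" line to [start_ms, end_ms, speaker_id].

    Assumes the standard RTTM column order and speaker names of the
    form "speaker_<n>", as seen in the notebook's output.
    """
    fields = line.split()  # split on any amount of whitespace
    onset_ms = round(float(fields[3]) * 1000)      # turn onset, seconds -> ms
    duration_ms = round(float(fields[4]) * 1000)   # turn duration, seconds -> ms
    speaker_id = int(fields[7].split("_")[-1])     # "speaker_1" -> 1
    return [onset_ms, onset_ms + duration_ms, speaker_id]


# Hypothetical line, built to correspond to the first entry of speaker_ts
sample = "SPEAKER mono_file 1   10.060   11.020 <NA> <NA> speaker_1 <NA> <NA>"
print(parse_rttm_line(sample))
```

Because the parser does not depend on column padding, the same function should also handle RTTM files written by other tools, provided they follow the standard column order.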

In [22]:
speaker_ts

[[10060, 21080, 1],
 [21340, 22840, 1],
 [23100, 26760, 1],
 [27020, 28200, 1],
 [28540, 33480, 1],
 [33980, 44200, 1],
 [44540, 45960, 1],
 [47020, 48680, 0],
 [49100, 49800, 0],
 [51260, 53960, 1],
 [54300, 54520, 1],
 [55100, 59960, 1],
 [60220, 64040, 1],
 [64700, 65880, 1],
 [67100, 67480, 0],
 [67900, 68680, 0],
 [68940, 70360, 0],
 [71180, 73160, 0],
 [73820, 76680, 0],
 [77420, 83240, 0],
 [83500, 83640, 0],
 [84060, 86280, 0],
 [86940, 87640, 0],
 [87980, 88440, 0],
 [88700, 90680, 0],
 [90940, 91160, 0],
 [91420, 91880, 0],
 [92140, 92280, 0],
 [92780, 94520, 0],
 [94940, 95400, 0],
 [95820, 96600, 0],
 [96860, 100200, 0],
 [100540, 104520, 0],
 [104860, 109160, 0],
 [109580, 110040, 0],
 [110380, 113720, 0],
 [114380, 120040, 0],
 [120380, 123720, 0],
 [124460, 125085, 0],
 [125085, 125560, 1],
 [125085, 125335, 0],
 [125900, 126120, 1],
 [126380, 129080, 1],
 [129340, 130680, 1],
 [131100, 131480, 1],
 [131740, 132280, 1],
 [132700, 133400, 1],
 [133900, 135480, 1],
 [13574

In [23]:
wsm

[{'word': 'Hello,', 'start_time': 10148, 'end_time': 10348, 'speaker': 1},
 {'word': 'everyone.', 'start_time': 10388, 'end_time': 10748, 'speaker': 1},
 {'word': 'Welcome', 'start_time': 10868, 'end_time': 11228, 'speaker': 1},
 {'word': 'back', 'start_time': 11288, 'end_time': 11469, 'speaker': 1},
 {'word': 'to', 'start_time': 11509, 'end_time': 11609, 'speaker': 1},
 {'word': 'another', 'start_time': 11669, 'end_time': 12249, 'speaker': 1},
 {'word': 'round', 'start_time': 12469, 'end_time': 12829, 'speaker': 1},
 {'word': 'of', 'start_time': 12889, 'end_time': 12989, 'speaker': 1},
 {'word': 'Turing', 'start_time': 13029, 'end_time': 13269, 'speaker': 1},
 {'word': 'Mock', 'start_time': 13349, 'end_time': 13609, 'speaker': 1},
 {'word': 'Interviews.', 'start_time': 13649, 'end_time': 14190, 'speaker': 1},
 {'word': 'I', 'start_time': 14550, 'end_time': 14610, 'speaker': 1},
 {'word': 'am', 'start_time': 14670, 'end_time': 14750, 'speaker': 1},
 {'word': 'José', 'start_time': 14790
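
To make the mapping above concrete, here is a minimal sketch of how a word can be assigned to a speaker turn by its start time. It is a simplified illustration, not the implementation of `get_words_speaker_mapping`; the function name `map_words_to_speakers` is an assumption, and the sketch expects `speaker_ts` to be ordered by start time.

```python
def map_words_to_speakers(word_timestamps, speaker_ts):
    """Label each word with the speaker turn that covers its start time.

    word_timestamps: dicts with "word", "start", "end" (seconds)
    speaker_ts: [start_ms, end_ms, speaker_id] lists ordered by start_ms
    """
    mapping = []
    turn_idx = 0
    for w in word_timestamps:
        start_ms = round(w["start"] * 1000)
        end_ms = round(w["end"] * 1000)
        # Move to the next turn while the word starts after the current turn ends
        while turn_idx < len(speaker_ts) - 1 and start_ms > speaker_ts[turn_idx][1]:
            turn_idx += 1
        mapping.append({
            "word": w["word"],
            "start_time": start_ms,
            "end_time": end_ms,
            "speaker": speaker_ts[turn_idx][2],
        })
    return mapping


words = [{"word": "Hello,", "start": 10.148, "end": 10.348}]
turns = [[10060, 21080, 1], [21340, 22840, 1]]
print(map_words_to_speakers(words, turns))
```

A word that starts in a gap between two turns is attributed to the following turn in this sketch.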

## Realigning Speech Segments using Punctuation
---
This step disambiguates speaker labels when a sentence is split between two speakers. It uses punctuation marks to find sentence boundaries and then determines the dominant speaker for each sentence in the transcription. Consider a transcript in which one sentence is split across two speaker turns:
```
Speaker A: It's got to come from somewhere else. Yeah, that one's also fun because you know the lows are
Speaker B: going to suck, right? So it's actually it hits you on both sides.
```
When a sentence is split between two speakers like this, the code takes the mode of the speaker labels of the words in the sentence and assigns that label to the whole sentence. This can improve diarization accuracy, especially when the Whisper model omits short utterances such as "hmm" and "yeah" from the transcript while the diarization model (NeMo) still detects them, which leads to inconsistent results.

The code also handles the case where one speaker gives a monologue while other speakers make occasional comments in the background: it ignores the comments and assigns the entire monologue to the speaker who talks for the majority of the time. The result is a simple, punctuation-based way of realigning speech segments with their speakers.
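
The majority-vote idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation of `get_realigned_ws_mapping_with_punctuation`: the names `realign_by_sentence` and `_assign_majority` are hypothetical, sentences are assumed to end at a word whose last character is `.`, `?` or `!`, and ties go to the speaker seen first in the sentence.

```python
from collections import Counter

SENTENCE_END = ".?!"


def _assign_majority(sentence):
    """Give every word in the sentence the most frequent speaker label."""
    majority = Counter(w["speaker"] for w in sentence).most_common(1)[0][0]
    return [{**w, "speaker": majority} for w in sentence]


def realign_by_sentence(words):
    """words: list of dicts with "word" and "speaker" keys, in spoken order."""
    realigned, sentence = [], []
    for w in words:
        sentence.append(w)
        # A word ending in ., ? or ! closes the current sentence
        if w["word"] and w["word"][-1] in SENTENCE_END:
            realigned.extend(_assign_majority(sentence))
            sentence = []
    if sentence:  # trailing words without closing punctuation
        realigned.extend(_assign_majority(sentence))
    return realigned


raw = [("the", 0), ("lows", 0), ("are", 0), ("going", 1), ("to", 1),
       ("suck,", 1), ("right?", 1), ("So", 1), ("it", 1), ("hits.", 1)]
words = [{"word": w, "speaker": s} for w, s in raw]
print([w["speaker"] for w in realign_by_sentence(words)])
```

In the sample, the first three words were labeled speaker 0, but four of the seven words in that sentence belong to speaker 1, so the whole sentence is reassigned to speaker 1.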

In [24]:
if language in punct_model_langs:
    # restoring punctuation in the transcript to help realign the sentences
    punct_model = PunctuationModel(model="kredor/punctuate-all")

    words_list = list(map(lambda x: x["word"], wsm))

    labled_words = punct_model.predict(words_list)

    ending_puncts = ".?!"
    model_puncts = ".,;:!?"

    # We don't want to punctuate U.S.A. with a period. Right?
    is_acronym = lambda x: re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", x)

    for word_dict, labeled_tuple in zip(wsm, labled_words):
        word = word_dict["word"]
        if (
            word
            and labeled_tuple[1] in ending_puncts
            and (word[-1] not in model_puncts or is_acronym(word))
        ):
            word += labeled_tuple[1]
            if word.endswith(".."):
                word = word.rstrip(".")
            word_dict["word"] = word

else:
    logging.warning(
        f"Punctuation restoration is not available for {language} language. Using the original punctuation."
    )

wsm = get_realigned_ws_mapping_with_punctuation(wsm)
ssm = get_sentences_speaker_mapping(wsm, speaker_ts)

--- Logging error ---
Traceback (most recent call last):
  File "/space/hotel/phit/miniconda3/envs/speech/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/space/hotel/phit/miniconda3/envs/speech/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/space/hotel/phit/miniconda3/envs/speech/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/space/hotel/phit/miniconda3/envs/speech/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/space/hotel/phit/miniconda3/envs/speech/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/space/hotel/phit/miniconda3/envs/speech/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/space/hotel/phit/minicon

In [24]:
"." in words_list

True

In [25]:
labled_words

[['Hello,', '0', 0.9999343],
 ['everyone.', '.', 0.74264365],
 ['Welcome', '0', 0.9999838],
 ['back', '0', 0.99997807],
 ['to', '0', 0.9999871],
 ['another', '0', 0.9999887],
 ['round', '0', 0.9999865],
 ['of', '0', 0.9999815],
 ['Turing', '0', 0.99839514],
 ['Mock', '0', 0.9994703],
 ['Interviews.', '.', 0.89526683],
 ['I', '0', 0.9999889],
 ['am', '0', 0.998486],
 ['José', '0', 0.82939523],
 ['and', '0', 0.9999503],
 ["I'm", '0', 0.9999881],
 ['a', '0', 0.99998987],
 ['tech', '0', 0.99998856],
 ['leader', '0', 0.9997371],
 ['at', '0', 0.9999881],
 ['Turing.', '0', 0.59339064],
 ["I'm", '0', 0.999984],
 ['from', '0', 0.9999862],
 ['Montreal,', '0', 0.9998981],
 ['Canada,', '0', 0.9116947],
 ['and', '0', 0.9420801],
 ['at', '0', 0.9999864],
 ['Turing,', '0', 0.88648695],
 ['I', '0', 0.99997663],
 ['work', '0', 0.9999832],
 ['on', '0', 0.9999697],
 ['the', '0', 0.99999],
 ['hiring', '0', 0.9999895],
 ['team,', '0', 0.99545336],
 ['helping', '0', 0.9999894],
 ['them', '0', 0.99997616],
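
Each prediction above is a `[word, label, score]` triple, where the label `'0'` means that no punctuation is predicted after the word. As a worked illustration of the merging rule in the restoration cell, the sketch below wraps the same condition in a function and applies it to a few words. The function name `merge_punctuation` is an assumption made for this example; the constants and the acronym pattern are copied from the cell.

```python
import re

ENDING_PUNCTS = ".?!"
MODEL_PUNCTS = ".,;:!?"


def is_acronym(word):
    """True for dotted acronyms such as "U.S.A." (same pattern as the cell)."""
    return re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", word) is not None


def merge_punctuation(word, label):
    """Append a predicted sentence-ending mark unless the word is already punctuated."""
    if word and label in ENDING_PUNCTS and (
        word[-1] not in MODEL_PUNCTS or is_acronym(word)
    ):
        word += label
        if word.endswith(".."):
            word = word.rstrip(".")  # strips every trailing period
    return word


for word, label in [("Welcome", "0"), ("everyone.", "."), ("leader", "."), ("U.S.A.", ".")]:
    print(repr(word), label, "->", repr(merge_punctuation(word, label)))
```

Words that already end in punctuation are left untouched, and unpunctuated words receive the predicted mark. Note that for a dotted acronym followed by a predicted period, `rstrip(".")` removes all trailing periods, so `"U.S.A."` becomes `"U.S.A"`; whether that is the intended result is worth checking against the final transcript.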
 

In [26]:
os.path.splitext(audio_path)

('/space/hotel/phit/personal/asr/data/raw/audios/Y8tlFLIjyMU', '.mp3')

In [27]:
import string
chars = string.ascii_uppercase

speaker_map = {str(ord(ch) - 65): ch for ch in chars}

In [28]:
speaker_map

{'0': 'A',
 '1': 'B',
 '2': 'C',
 '3': 'D',
 '4': 'E',
 '5': 'F',
 '6': 'G',
 '7': 'H',
 '8': 'I',
 '9': 'J',
 '10': 'K',
 '11': 'L',
 '12': 'M',
 '13': 'N',
 '14': 'O',
 '15': 'P',
 '16': 'Q',
 '17': 'R',
 '18': 'S',
 '19': 'T',
 '20': 'U',
 '21': 'V',
 '22': 'W',
 '23': 'X',
 '24': 'Y',
 '25': 'Z'}

### Save speaker diarization result

In [29]:
num_of_speaker = len(set(s["speaker"].split(" ")[-1] for s in ssm))
sd_results = {
    "data": []
}

for segment in ssm:
    speaker_id = segment["speaker"].split(" ")[-1]
    sd_results["data"].append({
        "id": speaker_id,
        "speaker": speaker_map[speaker_id],
        "text": segment["text"],
        "start": segment["start_time"],
        "end": segment["end_time"]
    })

    
with open(f"{cfg.RESULT_PATH}/assessment_01/{video_id}-speaker-diarization.json", "w") as f:
    json.dump(sd_results, f, indent=2)

In [31]:
import config as cfg
with open(f"{cfg.RESULT_PATH}/assessment_01/{video_id}.txt", "w", encoding="utf-8-sig") as f:
    get_speaker_aware_transcript(ssm, f)

with open(f"{cfg.RESULT_PATH}/assessment_01/{video_id}.srt", "w", encoding="utf-8-sig") as srt:
    write_srt(ssm, srt)

try:
    cleanup(temp_path)
except Exception:
    # Best-effort cleanup: failing to remove temporary files should not abort the run
    pass
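
The `.srt` file written above follows the SubRip convention of `HH:MM:SS,mmm` timestamps. As a rough sketch of that conversion (not the implementation of `write_srt`; the names `format_srt_timestamp` and `srt_block` are hypothetical, and the "speaker: text" layout of the cue is an assumption), millisecond values such as the `start_time` and `end_time` fields used in this notebook can be formatted like this:

```python
def format_srt_timestamp(milliseconds):
    """Format a non-negative millisecond offset as an SRT timestamp."""
    hours, rem = divmod(milliseconds, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_block(index, start_ms, end_ms, speaker, text):
    """Build one numbered SRT cue with the speaker label in front of the text."""
    return (
        f"{index}\n"
        f"{format_srt_timestamp(start_ms)} --> {format_srt_timestamp(end_ms)}\n"
        f"{speaker}: {text}\n"
    )


print(srt_block(1, 10148, 14190, "Speaker 1", "Hello, everyone."))
```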