<a href="https://colab.research.google.com/github/kenigandrey/hexlet-git/blob/main/Whisper_Transcription_%2B_NeMo_Diarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Dependencies

In [1]:
!pip install git+https://github.com/SYSTRAN/faster-whisper.git ctranslate2==4.4.0
!pip install "nemo-toolkit[asr]>=2.dev"
!pip install git+https://github.com/MahmoudAshraf97/demucs.git
!pip install git+https://github.com/oliverguhr/deepmultilingualpunctuation.git
!pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

Collecting git+https://github.com/SYSTRAN/faster-whisper.git
  Cloning https://github.com/SYSTRAN/faster-whisper.git to /tmp/pip-req-build-o01ra7dt
  Running command git clone --filter=blob:none --quiet https://github.com/SYSTRAN/faster-whisper.git /tmp/pip-req-build-o01ra7dt
  Resolved https://github.com/SYSTRAN/faster-whisper.git to commit fb65cd387f4941e1bf2381b88b0f6b9957e56e03
  Preparing metadata (setup.py) ... done
Collecting git+https://github.com/MahmoudAshraf97/demucs.git
  Cloning https://github.com/MahmoudAshraf97/demucs.git to /tmp/pip-req-build-69_g6dwl
  Running command git clone --filter=blob:none --quiet https://github.com/MahmoudAshraf97/demucs.git /tmp/pip-req-build-69_g6dwl
  Resolved https://github.com/MahmoudAshraf97/demucs.git to commit 4273070a70ded308ddfd0879d267bbd06f89a1b7
  Preparing metadata (setup.py) ... done
Collecting git+https://github.com/oliverguhr/deepmultilingualpunctuation.git
  Cloning https://github.com/oliverguhr/deepmul

In [2]:
import os
import wget
from omegaconf import OmegaConf
import json
import shutil
import torch
import torchaudio
from nemo.collections.asr.models.msdd_models import NeuralDiarizer
from deepmultilingualpunctuation import PunctuationModel
import re
import logging
import nltk
import faster_whisper
from ctc_forced_aligner import (
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

# Helper Functions

In [4]:
punct_model_langs = [
    "en",
    "fr",
    "de",
    "es",
    "it",
    "nl",
    "pt",
    "bg",
    "pl",
    "cs",
    "sk",
    "sl",
]

LANGUAGES = {
    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",
    "yue": "cantonese",
}

# language code lookup by name, with a few language aliases
TO_LANGUAGE_CODE = {
    **{language: code for code, language in LANGUAGES.items()},
    "burmese": "my",
    "valencian": "ca",
    "flemish": "nl",
    "haitian": "ht",
    "letzeburgesch": "lb",
    "pushto": "ps",
    "panjabi": "pa",
    "moldavian": "ro",
    "moldovan": "ro",
    "sinhalese": "si",
    "castilian": "es",
}


langs_to_iso = {
    "af": "afr",
    "am": "amh",
    "ar": "ara",
    "as": "asm",
    "az": "aze",
    "ba": "bak",
    "be": "bel",
    "bg": "bul",
    "bn": "ben",
    "bo": "tib",
    "br": "bre",
    "bs": "bos",
    "ca": "cat",
    "cs": "cze",
    "cy": "wel",
    "da": "dan",
    "de": "ger",
    "el": "gre",
    "en": "eng",
    "es": "spa",
    "et": "est",
    "eu": "baq",
    "fa": "per",
    "fi": "fin",
    "fo": "fao",
    "fr": "fre",
    "gl": "glg",
    "gu": "guj",
    "ha": "hau",
    "haw": "haw",
    "he": "heb",
    "hi": "hin",
    "hr": "hrv",
    "ht": "hat",
    "hu": "hun",
    "hy": "arm",
    "id": "ind",
    "is": "ice",
    "it": "ita",
    "ja": "jpn",
    "jw": "jav",
    "ka": "geo",
    "kk": "kaz",
    "km": "khm",
    "kn": "kan",
    "ko": "kor",
    "la": "lat",
    "lb": "ltz",
    "ln": "lin",
    "lo": "lao",
    "lt": "lit",
    "lv": "lav",
    "mg": "mlg",
    "mi": "mao",
    "mk": "mac",
    "ml": "mal",
    "mn": "mon",
    "mr": "mar",
    "ms": "may",
    "mt": "mlt",
    "my": "bur",
    "ne": "nep",
    "nl": "dut",
    "nn": "nno",
    "no": "nor",
    "oc": "oci",
    "pa": "pan",
    "pl": "pol",
    "ps": "pus",
    "pt": "por",
    "ro": "rum",
    "ru": "rus",
    "sa": "san",
    "sd": "snd",
    "si": "sin",
    "sk": "slo",
    "sl": "slv",
    "sn": "sna",
    "so": "som",
    "sq": "alb",
    "sr": "srp",
    "su": "sun",
    "sv": "swe",
    "sw": "swa",
    "ta": "tam",
    "te": "tel",
    "tg": "tgk",
    "th": "tha",
    "tk": "tuk",
    "tl": "tgl",
    "tr": "tur",
    "tt": "tat",
    "uk": "ukr",
    "ur": "urd",
    "uz": "uzb",
    "vi": "vie",
    "yi": "yid",
    "yo": "yor",
    "yue": "yue",
    "zh": "chi",
}


whisper_langs = sorted(LANGUAGES.keys()) + sorted(
    [k.title() for k in TO_LANGUAGE_CODE.keys()]
)


def create_config(output_dir):
    DOMAIN_TYPE = "telephonic"  # Can be meeting, telephonic, or general based on domain type of the audio file
    CONFIG_FILE_NAME = f"diar_infer_{DOMAIN_TYPE}.yaml"
    CONFIG_URL = f"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/{CONFIG_FILE_NAME}"
    MODEL_CONFIG = os.path.join(output_dir, CONFIG_FILE_NAME)
    if not os.path.exists(MODEL_CONFIG):
        MODEL_CONFIG = wget.download(CONFIG_URL, output_dir)

    config = OmegaConf.load(MODEL_CONFIG)

    data_dir = os.path.join(output_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    meta = {
        "audio_filepath": os.path.join(output_dir, "mono_file.wav"),
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "rttm_filepath": None,
        "uem_filepath": None,
    }
    with open(os.path.join(data_dir, "input_manifest.json"), "w") as fp:
        json.dump(meta, fp)
        fp.write("\n")

    pretrained_vad = "vad_multilingual_marblenet"
    pretrained_speaker_model = "titanet_large"
    config.num_workers = 0  # Workaround for multiprocessing hanging with ipython issue
    config.diarizer.manifest_filepath = os.path.join(data_dir, "input_manifest.json")
    config.diarizer.out_dir = (
        output_dir  # Directory to store intermediate files and prediction outputs
    )

    config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
    # False: run the VAD model set in config.diarizer.vad.model_path
    # instead of relying on oracle (ground-truth) speech segments
    config.diarizer.oracle_vad = False
    config.diarizer.clustering.parameters.oracle_num_speakers = False

    # Use NeMo's pretrained multilingual MarbleNet VAD model
    config.diarizer.vad.model_path = pretrained_vad
    config.diarizer.vad.parameters.onset = 0.8
    config.diarizer.vad.parameters.offset = 0.6
    config.diarizer.vad.parameters.pad_offset = -0.05
    config.diarizer.msdd_model.model_path = (
        "diar_msdd_telephonic"  # Telephonic speaker diarization model
    )

    return config


def get_word_ts_anchor(s, e, option="start"):
    if option == "end":
        return e
    elif option == "mid":
        return (s + e) / 2
    return s


def get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option="start"):
    s, e, sp = spk_ts[0]
    wrd_pos, turn_idx = 0, 0
    wrd_spk_mapping = []
    for wrd_dict in wrd_ts:
        ws, we, wrd = (
            int(wrd_dict["start"] * 1000),
            int(wrd_dict["end"] * 1000),
            wrd_dict["text"],
        )
        wrd_pos = get_word_ts_anchor(ws, we, word_anchor_option)
        while wrd_pos > float(e):
            turn_idx += 1
            turn_idx = min(turn_idx, len(spk_ts) - 1)
            s, e, sp = spk_ts[turn_idx]
            if turn_idx == len(spk_ts) - 1:
                e = get_word_ts_anchor(ws, we, option="end")
        wrd_spk_mapping.append(
            {"word": wrd, "start_time": ws, "end_time": we, "speaker": sp}
        )
    return wrd_spk_mapping


sentence_ending_punctuations = ".?!"


def get_first_word_idx_of_sentence(word_idx, word_list, speaker_list, max_words):
    is_word_sentence_end = (
        lambda x: x >= 0 and word_list[x][-1] in sentence_ending_punctuations
    )
    left_idx = word_idx
    while (
        left_idx > 0
        and word_idx - left_idx < max_words
        and speaker_list[left_idx - 1] == speaker_list[left_idx]
        and not is_word_sentence_end(left_idx - 1)
    ):
        left_idx -= 1

    return left_idx if left_idx == 0 or is_word_sentence_end(left_idx - 1) else -1


def get_last_word_idx_of_sentence(word_idx, word_list, max_words):
    is_word_sentence_end = (
        lambda x: x >= 0 and word_list[x][-1] in sentence_ending_punctuations
    )
    right_idx = word_idx
    while (
        right_idx < len(word_list) - 1
        and right_idx - word_idx < max_words
        and not is_word_sentence_end(right_idx)
    ):
        right_idx += 1

    return (
        right_idx
        if right_idx == len(word_list) - 1 or is_word_sentence_end(right_idx)
        else -1
    )


def get_realigned_ws_mapping_with_punctuation(
    word_speaker_mapping, max_words_in_sentence=50
):
    is_word_sentence_end = (
        lambda x: x >= 0
        and word_speaker_mapping[x]["word"][-1] in sentence_ending_punctuations
    )
    wsp_len = len(word_speaker_mapping)

    words_list, speaker_list = [], []
    for k, line_dict in enumerate(word_speaker_mapping):
        word, speaker = line_dict["word"], line_dict["speaker"]
        words_list.append(word)
        speaker_list.append(speaker)

    k = 0
    while k < len(word_speaker_mapping):
        line_dict = word_speaker_mapping[k]
        if (
            k < wsp_len - 1
            and speaker_list[k] != speaker_list[k + 1]
            and not is_word_sentence_end(k)
        ):
            left_idx = get_first_word_idx_of_sentence(
                k, words_list, speaker_list, max_words_in_sentence
            )
            right_idx = (
                get_last_word_idx_of_sentence(
                    k, words_list, max_words_in_sentence - k + left_idx - 1
                )
                if left_idx > -1
                else -1
            )
            if min(left_idx, right_idx) == -1:
                k += 1
                continue

            spk_labels = speaker_list[left_idx : right_idx + 1]
            mod_speaker = max(set(spk_labels), key=spk_labels.count)
            if spk_labels.count(mod_speaker) < len(spk_labels) // 2:
                k += 1
                continue

            speaker_list[left_idx : right_idx + 1] = [mod_speaker] * (
                right_idx - left_idx + 1
            )
            k = right_idx

        k += 1

    k, realigned_list = 0, []
    while k < len(word_speaker_mapping):
        line_dict = word_speaker_mapping[k].copy()
        line_dict["speaker"] = speaker_list[k]
        realigned_list.append(line_dict)
        k += 1

    return realigned_list


def get_sentences_speaker_mapping(word_speaker_mapping, spk_ts):
    sentence_checker = nltk.tokenize.PunktSentenceTokenizer().text_contains_sentbreak
    s, e, spk = spk_ts[0]
    prev_spk = spk

    snts = []
    snt = {"speaker": f"Speaker {spk}", "start_time": s, "end_time": e, "text": ""}

    for wrd_dict in word_speaker_mapping:
        wrd, spk = wrd_dict["word"], wrd_dict["speaker"]
        s, e = wrd_dict["start_time"], wrd_dict["end_time"]
        if spk != prev_spk or sentence_checker(snt["text"] + " " + wrd):
            snts.append(snt)
            snt = {
                "speaker": f"Speaker {spk}",
                "start_time": s,
                "end_time": e,
                "text": "",
            }
        else:
            snt["end_time"] = e
        snt["text"] += wrd + " "
        prev_spk = spk

    snts.append(snt)
    return snts


def get_speaker_aware_transcript(sentences_speaker_mapping, f):
    previous_speaker = sentences_speaker_mapping[0]["speaker"]
    f.write(f"{previous_speaker}: ")

    for sentence_dict in sentences_speaker_mapping:
        speaker = sentence_dict["speaker"]
        sentence = sentence_dict["text"]

        # If this speaker doesn't match the previous one, start a new paragraph
        if speaker != previous_speaker:
            f.write(f"\n\n{speaker}: ")
            previous_speaker = speaker

        # No matter what, write the current sentence
        f.write(sentence + " ")


def format_timestamp(
    milliseconds: float, always_include_hours: bool = False, decimal_marker: str = "."
):
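    assert milliseconds >= 0, "non-negative timestamp expected"
    # The ":02d" / ":03d" format specs below require an int, so drop any fractional part
    milliseconds = int(milliseconds)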

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return (
        f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
    )


def write_srt(transcript, file):
    """
    Write a transcript to a file in SRT format.

    """
    for i, segment in enumerate(transcript, start=1):
        # write srt lines
        print(
            f"{i}\n"
            f"{format_timestamp(segment['start_time'], always_include_hours=True, decimal_marker=',')} --> "
            f"{format_timestamp(segment['end_time'], always_include_hours=True, decimal_marker=',')}\n"
            f"{segment['speaker']}: {segment['text'].strip().replace('-->', '->')}\n",
            file=file,
            flush=True,
        )


def find_numeral_symbol_tokens(tokenizer):
    numeral_symbol_tokens = [
        -1,
    ]
    for token, token_id in tokenizer.get_vocab().items():
        has_numeral_symbol = any(c in "0123456789%$£" for c in token)
        if has_numeral_symbol:
            numeral_symbol_tokens.append(token_id)
    return numeral_symbol_tokens


def _get_next_start_timestamp(word_timestamps, current_word_index, final_timestamp):
    # if current word is the last word
    if current_word_index == len(word_timestamps) - 1:
        return word_timestamps[current_word_index]["start"]

    next_word_index = current_word_index + 1
    while current_word_index < len(word_timestamps) - 1:
        if word_timestamps[next_word_index].get("start") is None:
            # if next word doesn't have a start timestamp
            # merge it with the current word and delete it
            word_timestamps[current_word_index]["word"] += (
                " " + word_timestamps[next_word_index]["word"]
            )

            word_timestamps[next_word_index]["word"] = None
            next_word_index += 1
            if next_word_index == len(word_timestamps):
                return final_timestamp

        else:
            return word_timestamps[next_word_index]["start"]


def filter_missing_timestamps(
    word_timestamps, initial_timestamp=0, final_timestamp=None
):
    # handle the first and last word
    if word_timestamps[0].get("start") is None:
        word_timestamps[0]["start"] = (
            initial_timestamp if initial_timestamp is not None else 0
        )
        word_timestamps[0]["end"] = _get_next_start_timestamp(
            word_timestamps, 0, final_timestamp
        )

    result = [
        word_timestamps[0],
    ]

    for i, ws in enumerate(word_timestamps[1:], start=1):
        # if ws doesn't have a start and end
        # use the previous end as start and next start as end
        if ws.get("start") is None and ws.get("word") is not None:
            ws["start"] = word_timestamps[i - 1]["end"]
            ws["end"] = _get_next_start_timestamp(word_timestamps, i, final_timestamp)

        if ws["word"] is not None:
            result.append(ws)
    return result


def cleanup(path: str):
    """path could either be relative or absolute."""
    # check if file or directory exists
    if os.path.isfile(path) or os.path.islink(path):
        # remove file
        os.remove(path)
    elif os.path.isdir(path):
        # remove directory and all its content
        shutil.rmtree(path)
    else:
        raise ValueError("Path {} is not a file or dir.".format(path))


def process_language_arg(language: str, model_name: str):
    """
    Process the language argument to make sure it's valid and convert language names to language codes.
    """
    # None means "autodetect", so only validate when a language was actually given
    if language is not None:
        language = language.lower()
        if language not in LANGUAGES:
            if language in TO_LANGUAGE_CODE:
                language = TO_LANGUAGE_CODE[language]
            else:
                raise ValueError(f"Unsupported language: {language}")

    if model_name.endswith(".en") and language != "en":
        if language is not None:
            logging.warning(
                f"{model_name} is an English-only model but received '{language}'; using English instead."
            )
        language = "en"
    return language
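
As a rough illustration of what the word-to-speaker helpers above do, the sketch below assigns each word to a speaker turn by comparing the word's start time (in milliseconds) with the end of the current turn. It is a simplified, self-contained restatement of the idea behind `get_words_speaker_mapping`, not a replacement for it; the function name `map_words_to_speakers` and the sample words and turns are made up for the example.

```python
def map_words_to_speakers(words, turns):
    """Assign each word to the first turn whose end it does not start after.

    words: list of dicts with "text", "start", "end" (times in seconds)
    turns: list of [start_ms, end_ms, speaker_id], sorted by time
    """
    mapping = []
    turn_idx = 0
    for word in words:
        start_ms = int(word["start"] * 1000)
        # Move to the next turn while the word starts after the current turn ends.
        while start_ms > turns[turn_idx][1] and turn_idx < len(turns) - 1:
            turn_idx += 1
        mapping.append({"word": word["text"], "speaker": turns[turn_idx][2]})
    return mapping


# Made-up sample data, for illustration only.
words = [
    {"text": "Hello.", "start": 0.10, "end": 0.40},
    {"text": "Hi", "start": 1.20, "end": 1.35},
    {"text": "there.", "start": 1.40, "end": 1.80},
]
turns = [[0, 1000, 0], [1000, 2000, 1]]
result = map_words_to_speakers(words, turns)
print(result)
```

Anchoring on the word's start time is only one option; the notebook's helper also supports the word's end or midpoint through `word_anchor_option`.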

# Options

In [5]:
# Whether to enable music removal from speech; helps increase diarization quality but uses a lot of RAM
enable_stemming = True

# (choose from 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large')
whisper_model_name = "large-v2"

# Replace numerical digits with their spelled-out pronunciation; increases diarization accuracy
suppress_numerals = True

batch_size = 8

language = None  # autodetect language

device = "cuda" if torch.cuda.is_available() else "cpu"
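
The `suppress_numerals` option relies on `find_numeral_symbol_tokens`, which collects the ID of every vocabulary token containing a digit or one of `%`, `$`, `£` so that those tokens can be suppressed during decoding. A toy sketch of that filtering step, using a made-up vocabulary dict in place of the Whisper tokenizer (the name `numeral_token_ids` and the vocabulary are hypothetical):

```python
NUMERAL_SYMBOLS = "0123456789%$£"


def numeral_token_ids(vocab):
    """Return the sorted IDs of tokens that contain a digit or one of % $ £."""
    return sorted(
        token_id
        for token, token_id in vocab.items()
        if any(ch in NUMERAL_SYMBOLS for ch in token)
    )


# Made-up vocabulary: token text -> token ID.
toy_vocab = {"hello": 0, "20": 1, "twenty": 2, "$": 3, "50%": 4}
ids = numeral_token_ids(toy_vocab)
print(ids)
```

With these tokens suppressed, the model can only emit spelled-out forms such as "twenty", which gives the forced aligner actual words to match against the audio.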

# Processing

## Separating music from speech using Demucs

---

By isolating the vocals from the rest of the audio, it becomes easier to identify and track individual speakers based on the spectral and temporal characteristics of their speech signals. Source separation is just one of many techniques that can be used as a preprocessing step to help improve the accuracy and reliability of the overall diarization process.
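
A minimal sketch of how the separation step can be set up, assuming the same Demucs flags the processing cell uses (`-n htdemucs`, `--two-stems=vocals`, `-o temp_outputs`). The helper names `build_demucs_command` and `expected_vocals_path` and the file name `interview.mp3` are hypothetical. Building the command as an argument list rather than a shell string avoids quoting problems when a file name contains spaces or quotes.

```python
import os
import sys


def build_demucs_command(audio_path, out_dir="temp_outputs", model="htdemucs"):
    """Return the argument list for a two-stem (vocals / no_vocals) separation."""
    return [
        sys.executable, "-m", "demucs.separate",
        "-n", model,
        "--two-stems=vocals",
        audio_path,
        "-o", out_dir,
    ]


def expected_vocals_path(audio_path, out_dir="temp_outputs", model="htdemucs"):
    """Path where the vocals stem is expected once separation has finished."""
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    return os.path.join(out_dir, model, stem, "vocals.wav")


cmd = build_demucs_command("interview.mp3")
vocals = expected_vocals_path("interview.mp3")
print(cmd)
print(vocals)

# To actually run the separation (requires demucs to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```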

In [10]:
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir('.') if isfile(f)]
len(onlyfiles)

43
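
The listing above picks up every regular file in the working directory, which can include non-audio files and, on a rerun, the `.txt` and `.srt` transcripts written by an earlier pass. A hedged sketch of restricting the list to audio files; the extension set and the helper name `list_audio_files` are assumptions to adjust as needed:

```python
import tempfile
from pathlib import Path

# Assumed set of audio extensions; extend it to match your data.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def list_audio_files(directory="."):
    """Return sorted names of regular files whose extension looks like audio."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.mp3", "b.WAV", "a.txt", "a.srt"]:
        (Path(tmp) / name).touch()
    found = list_audio_files(tmp)
print(found)
```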

## Transcribing audio using Whisper and realigning timestamps using Forced Alignment
---
This code uses two open-source models: one to transcribe the speech and one to force-align the resulting transcript with the audio.

The first model is OpenAI Whisper, a speech recognition model, run here through `faster-whisper`. The code loads the Whisper model and uses it to transcribe the `vocal_target` file.

The second model is a CTC forced-alignment model, loaded with `load_alignment_model`. It aligns the full transcript against the audio, so the output of this step is the transcript text together with timestamps indicating when each word was spoken.
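
The timestamps from this step are later matched against speaker turns, which the diarizer writes as RTTM lines. A minimal sketch of reading such lines, assuming the standard whitespace-separated RTTM field order (type, file ID, channel, onset, duration, two placeholders, speaker name, ...) and speaker labels of the form `speaker_N`; the function name `parse_rttm_lines` and the sample lines are made up:

```python
def parse_rttm_lines(lines):
    """Convert RTTM lines to [start_ms, end_ms, speaker_index] triples."""
    turns = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        start_ms = int(float(fields[3]) * 1000)
        end_ms = start_ms + int(float(fields[4]) * 1000)
        speaker = int(fields[7].split("_")[-1])
        turns.append([start_ms, end_ms, speaker])
    return turns


# Made-up sample lines, for illustration only.
sample = [
    "SPEAKER mono_file 1 0.500 2.250 <NA> <NA> speaker_0 <NA> <NA>",
    "SPEAKER mono_file 1 3.000 1.500 <NA> <NA> speaker_1 <NA> <NA>",
]
turns = parse_rttm_lines(sample)
print(turns)
```

Splitting on arbitrary whitespace keeps the field positions stable even when a writer pads the columns with several spaces.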


In [None]:
# Name of the audio file
#audio_path = "123.mp3"
for audio_path in onlyfiles:
  if enable_stemming:
      # Isolate vocals from the rest of the audio

      return_code = os.system(
          f'python3 -m demucs.separate -n htdemucs --two-stems=vocals "{audio_path}" -o "temp_outputs"'
      )

      if return_code != 0:
          logging.warning("Source splitting failed, using original audio file.")
          vocal_target = audio_path
      else:
          vocal_target = os.path.join(
              "temp_outputs",
              "htdemucs",
              os.path.splitext(os.path.basename(audio_path))[0],
              "vocals.wav",
          )
  else:
      vocal_target = audio_path

  # float16 requires a GPU; fall back to INT8 when running on CPU
  compute_type = "float16" if device == "cuda" else "int8"
  # or run on GPU with INT8
  # compute_type = "int8_float16"

  whisper_model = faster_whisper.WhisperModel(
      whisper_model_name, device=device, compute_type=compute_type
  )
  whisper_pipeline = faster_whisper.BatchedInferencePipeline(whisper_model)
  audio_waveform = faster_whisper.decode_audio(vocal_target)
  suppress_tokens = (
      find_numeral_symbol_tokens(whisper_model.hf_tokenizer)
      if suppress_numerals
      else [-1]
  )

  if batch_size > 0:
      transcript_segments, info = whisper_pipeline.transcribe(
          audio_waveform,
          language,
          suppress_tokens=suppress_tokens,
          batch_size=batch_size,
          without_timestamps=True,
      )
  else:
      transcript_segments, info = whisper_model.transcribe(
          audio_waveform,
          language,
          suppress_tokens=suppress_tokens,
          without_timestamps=True,
          vad_filter=True,
      )

  full_transcript = "".join(segment.text for segment in transcript_segments)

  # clear gpu vram
  del whisper_model, whisper_pipeline
  torch.cuda.empty_cache()

  alignment_model, alignment_tokenizer = load_alignment_model(
      device,
      dtype=torch.float16 if device == "cuda" else torch.float32,
  )

  audio_waveform = audio_waveform.to(alignment_model.dtype).to(alignment_model.device)

  emissions, stride = generate_emissions(
      alignment_model, audio_waveform, batch_size=batch_size
  )

  del alignment_model
  torch.cuda.empty_cache()

  tokens_starred, text_starred = preprocess_text(
      full_transcript,
      romanize=True,
      language=langs_to_iso[info.language],
  )

  segments, scores, blank_token = get_alignments(
      emissions,
      tokens_starred,
      alignment_tokenizer,
  )

  spans = get_spans(tokens_starred, segments, blank_token)

  word_timestamps = postprocess_results(text_starred, spans, stride, scores)

  ROOT = os.getcwd()
  temp_path = os.path.join(ROOT, "temp_outputs")
  os.makedirs(temp_path, exist_ok=True)
  torchaudio.save(
      os.path.join(temp_path, "mono_file.wav"),
      audio_waveform.cpu().unsqueeze(0).float(),
      16000,
      channels_first=True,
  )

  # Initialize NeMo MSDD diarization model on the selected device
  msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to(device)
  msdd_model.diarize()

  del msdd_model
  torch.cuda.empty_cache()

  # Reading timestamps <> Speaker Labels mapping

  speaker_ts = []
  with open(os.path.join(temp_path, "pred_rttms", "mono_file.rttm"), "r") as f:
      lines = f.readlines()
      for line in lines:
          line_list = line.split(" ")
          s = int(float(line_list[5]) * 1000)
          e = s + int(float(line_list[8]) * 1000)
          speaker_ts.append([s, e, int(line_list[11].split("_")[-1])])

  wsm = get_words_speaker_mapping(word_timestamps, speaker_ts, "start")

  if info.language in punct_model_langs:
      # restoring punctuation in the transcript to help realign the sentences
      punct_model = PunctuationModel(model="kredor/punctuate-all")

      words_list = list(map(lambda x: x["word"], wsm))

      labeled_words = punct_model.predict(words_list, chunk_size=230)

      ending_puncts = ".?!"
      model_puncts = ".,;:!?"

      # We don't want to punctuate U.S.A. with a period. Right?
      is_acronym = lambda x: re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", x)

      for word_dict, labeled_tuple in zip(wsm, labeled_words):
          word = word_dict["word"]
          if (
              word
              and labeled_tuple[1] in ending_puncts
              and (word[-1] not in model_puncts or is_acronym(word))
          ):
              word += labeled_tuple[1]
              if word.endswith(".."):
                  word = word.rstrip(".")
              word_dict["word"] = word

  else:
      logging.warning(
          f"Punctuation restoration is not available for {info.language} language. Using the original punctuation."
      )

  wsm = get_realigned_ws_mapping_with_punctuation(wsm)
  ssm = get_sentences_speaker_mapping(wsm, speaker_ts)

  with open(f"{os.path.splitext(audio_path)[0]}.txt", "w", encoding="utf-8-sig") as f:
      get_speaker_aware_transcript(ssm, f)

  with open(f"{os.path.splitext(audio_path)[0]}.srt", "w", encoding="utf-8-sig") as srt:
      write_srt(ssm, srt)

  cleanup(temp_path)


[NeMo I 2024-11-13 13:00:15 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:00:15 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/diar_msdd_telephonic/versions/1.0.1/files/diar_msdd_telephonic.nemo to /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:00:16 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:00:18 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:00:18 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:00:18 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:00:18 features:305] PADDING: 16
[NeMo I 2024-11-13 13:00:18 features:305] PADDING: 16




[NeMo I 2024-11-13 13:00:19 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:00:19 features:305] PADDING: 16
[NeMo I 2024-11-13 13:00:19 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:00:20 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/vad_multilingual_marblenet/versions/1.10.0/files/vad_multilingual_marblenet.nemo to /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:00:20 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:00:20 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:00:20 features:305] PADDING: 16
[NeMo I 2024-11-13 13:00:20 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:00:20 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:00:20 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:00:20 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:00:20 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:11<00:00, 11.41s/it]

[NeMo I 2024-11-13 13:00:32 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:00:32 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:00:32 collections:741] Dataset successfully loaded with 25 items and total duration provided from manifest is  0.34 hours.
[NeMo I 2024-11-13 13:00:32 collections:746] # 25 files loaded accounting to # 1 labels



vad: 100%|██████████| 25/25 [00:07<00:00,  3.54it/s]

[NeMo I 2024-11-13 13:00:39 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 13:00:53 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:02<00:00,  2.04s/it]

[NeMo I 2024-11-13 13:00:56 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:00:56 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:00:56 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:00:56 collections:741] Dataset successfully loaded with 863 items and total duration provided from manifest is  0.13 hours.
[NeMo I 2024-11-13 13:00:56 collections:746] # 863 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 14/14 [00:01<00:00,  9.48it/s]

[NeMo I 2024-11-13 13:00:57 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:00:57 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json





[NeMo I 2024-11-13 13:00:57 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:00:57 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:00:57 collections:741] Dataset successfully loaded with 921 items and total duration provided from manifest is  0.14 hours.
[NeMo I 2024-11-13 13:00:57 collections:746] # 921 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00, 13.06it/s]

[NeMo I 2024-11-13 13:00:58 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:00:58 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:00:58 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:00:58 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:00:58 collections:741] Dataset successfully loaded with 994 items and total duration provided from manifest is  0.14 hours.
[NeMo I 2024-11-13 13:00:58 collections:746] # 994 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 16/16 [00:01<00:00, 13.86it/s]


[NeMo I 2024-11-13 13:01:00 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:01:00 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:01:00 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:01:00 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:01:00 collections:741] Dataset successfully loaded with 1168 items and total duration provided from manifest is  0.16 hours.
[NeMo I 2024-11-13 13:01:00 collections:746] # 1168 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00, 15.63it/s]


[NeMo I 2024-11-13 13:01:01 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:01:01 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:01:01 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:01:01 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:01:01 collections:741] Dataset successfully loaded with 1586 items and total duration provided from manifest is  0.17 hours.
[NeMo I 2024-11-13 13:01:01 collections:746] # 1586 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 25/25 [00:01<00:00, 15.83it/s]


[NeMo I 2024-11-13 13:01:03 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.43s/it]

[NeMo I 2024-11-13 13:01:04 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:01:04 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:01:04 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:01:04 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:01:04 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:01:04 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:01:04 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:01:04 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  7.75it/s]

[NeMo I 2024-11-13 13:01:05 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:01:05 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:01:05 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:01:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:01:05 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:01:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:01:05 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:01:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:01:05 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:04:10 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:04:10 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:04:10 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:04:10 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:04:12 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:04:12 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:04:12 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:04:12 features:305] PADDING: 16
[NeMo I 2024-11-13 13:04:13 features:305] PADDING: 16




[NeMo I 2024-11-13 13:04:13 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:04:13 features:305] PADDING: 16
[NeMo I 2024-11-13 13:04:14 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:04:14 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:04:14 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:04:14 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:04:14 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:04:14 features:305] PADDING: 16
[NeMo I 2024-11-13 13:04:14 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:04:14 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:04:14 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:04:14 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:04:14 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 10.59it/s]

[NeMo I 2024-11-13 13:04:14 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:04:14 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:04:14 collections:741] Dataset successfully loaded with 23 items and total duration provided from manifest is  0.32 hours.
[NeMo I 2024-11-13 13:04:14 collections:746] # 23 files loaded accounting to # 1 labels



vad: 100%|██████████| 23/23 [00:04<00:00,  5.23it/s]

[NeMo I 2024-11-13 13:04:18 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 13:04:29 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.25it/s]

[NeMo I 2024-11-13 13:04:30 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:04:30 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:04:30 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:04:30 collections:741] Dataset successfully loaded with 807 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 13:04:30 collections:746] # 807 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00, 10.29it/s]

[NeMo I 2024-11-13 13:04:31 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:04:31 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:04:31 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:04:31 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:04:31 collections:741] Dataset successfully loaded with 830 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 13:04:31 collections:746] # 830 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00, 12.57it/s]

[NeMo I 2024-11-13 13:04:33 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:04:33 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:04:33 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:04:33 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:04:33 collections:741] Dataset successfully loaded with 866 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 13:04:33 collections:746] # 866 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 14/14 [00:01<00:00, 13.91it/s]

[NeMo I 2024-11-13 13:04:34 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:04:34 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json





[NeMo I 2024-11-13 13:04:34 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:04:34 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:04:34 collections:741] Dataset successfully loaded with 959 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 13:04:34 collections:746] # 959 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 15/15 [00:00<00:00, 16.34it/s]

[NeMo I 2024-11-13 13:04:35 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:04:35 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:04:35 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:04:35 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:04:35 collections:741] Dataset successfully loaded with 1216 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 13:04:35 collections:746] # 1216 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00, 15.52it/s]


[NeMo I 2024-11-13 13:04:36 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.20it/s]

[NeMo I 2024-11-13 13:04:37 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:04:37 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:04:38 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:04:38 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:04:38 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:04:38 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:04:38 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:04:38 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  1.77it/s]

[NeMo I 2024-11-13 13:04:39 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:04:39 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:04:39 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:04:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:04:39 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:04:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:04:39 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:04:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:04:39 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:06:26 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:06:26 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:06:26 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:06:26 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:06:29 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:06:29 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:06:29 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:06:29 features:305] PADDING: 16
[NeMo I 2024-11-13 13:06:29 features:305] PADDING: 16




[NeMo I 2024-11-13 13:06:29 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:06:29 features:305] PADDING: 16
[NeMo I 2024-11-13 13:06:30 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:06:30 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:06:30 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:06:30 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:06:30 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:06:30 features:305] PADDING: 16
[NeMo I 2024-11-13 13:06:30 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:06:30 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:06:30 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:06:30 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:06:30 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 25.05it/s]

[NeMo I 2024-11-13 13:06:30 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:06:30 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:06:30 collections:741] Dataset successfully loaded with 11 items and total duration provided from manifest is  0.15 hours.
[NeMo I 2024-11-13 13:06:30 collections:746] # 11 files loaded accounting to # 1 labels



vad: 100%|██████████| 11/11 [00:02<00:00,  4.81it/s]

[NeMo I 2024-11-13 13:06:33 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 13:06:37 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.71it/s]

[NeMo I 2024-11-13 13:06:37 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:06:37 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:06:37 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:06:37 collections:741] Dataset successfully loaded with 297 items and total duration provided from manifest is  0.03 hours.
[NeMo I 2024-11-13 13:06:37 collections:746] # 297 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  9.72it/s]

[NeMo I 2024-11-13 13:06:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:06:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:06:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:06:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:06:38 collections:741] Dataset successfully loaded with 302 items and total duration provided from manifest is  0.03 hours.
[NeMo I 2024-11-13 13:06:38 collections:746] # 302 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 13.28it/s]

[NeMo I 2024-11-13 13:06:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:06:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:06:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:06:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:06:38 collections:741] Dataset successfully loaded with 313 items and total duration provided from manifest is  0.03 hours.
[NeMo I 2024-11-13 13:06:38 collections:746] # 313 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 12.36it/s]

[NeMo I 2024-11-13 13:06:39 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:06:39 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:06:39 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:06:39 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:06:39 collections:741] Dataset successfully loaded with 337 items and total duration provided from manifest is  0.03 hours.
[NeMo I 2024-11-13 13:06:39 collections:746] # 337 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00, 14.27it/s]


[NeMo I 2024-11-13 13:06:39 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:06:39 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:06:39 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:06:39 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:06:39 collections:741] Dataset successfully loaded with 404 items and total duration provided from manifest is  0.04 hours.
[NeMo I 2024-11-13 13:06:39 collections:746] # 404 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00, 14.57it/s]

[NeMo I 2024-11-13 13:06:40 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.67it/s]

[NeMo I 2024-11-13 13:06:41 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:06:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:06:41 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:06:41 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:06:41 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:06:41 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:06:41 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:06:41 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  1.76it/s]

[NeMo I 2024-11-13 13:06:41 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:06:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:06:41 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:06:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:06:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:06:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:06:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:06:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:06:41 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:08:14 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:08:14 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:08:14 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:08:14 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:08:16 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:08:16 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:08:16 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:08:16 features:305] PADDING: 16
[NeMo I 2024-11-13 13:08:17 features:305] PADDING: 16




[NeMo I 2024-11-13 13:08:17 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:08:17 features:305] PADDING: 16
[NeMo I 2024-11-13 13:08:18 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:08:18 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:08:18 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:08:18 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:08:18 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:08:18 features:305] PADDING: 16
[NeMo I 2024-11-13 13:08:18 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:08:18 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:08:18 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:08:18 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:08:18 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 30.10it/s]

[NeMo I 2024-11-13 13:08:18 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:08:18 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:08:18 collections:741] Dataset successfully loaded with 9 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 13:08:18 collections:746] # 9 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 9/9 [00:02<00:00,  3.96it/s]

[NeMo I 2024-11-13 13:08:20 clustering_diarizer:254] Generating predictions with overlapping input segments




[NeMo I 2024-11-13 13:08:24 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.49it/s]

[NeMo I 2024-11-13 13:08:24 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:08:24 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:08:24 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:08:24 collections:741] Dataset successfully loaded with 339 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 13:08:24 collections:746] # 339 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  9.94it/s]

[NeMo I 2024-11-13 13:08:25 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:08:25 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:08:25 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:08:25 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:08:25 collections:741] Dataset successfully loaded with 354 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 13:08:25 collections:746] # 354 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00, 13.35it/s]

[NeMo I 2024-11-13 13:08:25 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:08:25 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:08:25 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:08:25 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:08:25 collections:741] Dataset successfully loaded with 386 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 13:08:25 collections:746] # 386 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00, 11.36it/s]

[NeMo I 2024-11-13 13:08:26 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:08:26 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:08:26 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:08:26 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:08:26 collections:741] Dataset successfully loaded with 453 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 13:08:26 collections:746] # 453 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 12.98it/s]


[NeMo I 2024-11-13 13:08:27 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:08:27 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:08:27 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:08:27 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:08:27 collections:741] Dataset successfully loaded with 617 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 13:08:27 collections:746] # 617 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 14.60it/s]


[NeMo I 2024-11-13 13:08:28 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.67it/s]

[NeMo I 2024-11-13 13:08:28 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:08:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:08:28 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:08:28 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:08:28 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:08:28 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:08:28 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:08:28 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00, 35.52it/s]

[NeMo I 2024-11-13 13:08:28 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:08:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:08:28 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:08:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:08:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:08:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:08:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:08:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:08:28 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:12:27 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:12:27 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:12:27 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:12:27 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:12:29 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:12:29 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:12:29 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:12:29 features:305] PADDING: 16
[NeMo I 2024-11-13 13:12:29 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 13:12:29 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:12:29 features:305] PADDING: 16
[NeMo I 2024-11-13 13:12:30 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:12:30 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:12:30 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:12:30 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:12:30 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:12:30 features:305] PADDING: 16
[NeMo I 2024-11-13 13:12:30 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:12:30 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:12:30 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:12:30 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:12:30 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  6.95it/s]

[NeMo I 2024-11-13 13:12:30 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:12:30 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:12:30 collections:741] Dataset successfully loaded with 30 items and total duration provided from manifest is  0.40 hours.
[NeMo I 2024-11-13 13:12:30 collections:746] # 30 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 30/30 [00:07<00:00,  4.12it/s]

[NeMo I 2024-11-13 13:12:38 clustering_diarizer:254] Generating predictions with overlapping input segments




[NeMo I 2024-11-13 13:12:52 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:01<00:00,  1.01s/it]

[NeMo I 2024-11-13 13:12:53 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:12:53 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:12:53 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:12:53 collections:741] Dataset successfully loaded with 1154 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 13:12:53 collections:746] # 1154 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00,  9.92it/s]


[NeMo I 2024-11-13 13:12:55 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:12:55 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:12:55 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:12:55 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:12:55 collections:741] Dataset successfully loaded with 1240 items and total duration provided from manifest is  0.21 hours.
[NeMo I 2024-11-13 13:12:55 collections:746] # 1240 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 20/20 [00:01<00:00, 12.80it/s]


[NeMo I 2024-11-13 13:12:57 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:12:57 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:12:57 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:12:57 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:12:57 collections:741] Dataset successfully loaded with 1378 items and total duration provided from manifest is  0.23 hours.
[NeMo I 2024-11-13 13:12:57 collections:746] # 1378 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 22/22 [00:01<00:00, 13.46it/s]


[NeMo I 2024-11-13 13:12:58 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:12:58 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:12:58 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:12:59 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:12:59 collections:741] Dataset successfully loaded with 1644 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 13:12:59 collections:746] # 1644 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 26/26 [00:01<00:00, 14.79it/s]


[NeMo I 2024-11-13 13:13:01 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:13:01 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:13:01 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:13:01 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:13:01 collections:741] Dataset successfully loaded with 2322 items and total duration provided from manifest is  0.27 hours.
[NeMo I 2024-11-13 13:13:01 collections:746] # 2322 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 37/37 [00:02<00:00, 12.47it/s]


[NeMo I 2024-11-13 13:13:04 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.39it/s]

[NeMo I 2024-11-13 13:13:05 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:13:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:13:05 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:13:05 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:13:05 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:13:05 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:13:05 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:13:05 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00,  2.21it/s]

[NeMo I 2024-11-13 13:13:06 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:13:06 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:13:06 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:13:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:13:06 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:13:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:13:06 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:13:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:13:06 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:19:06 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:19:06 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:19:06 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:19:06 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:19:10 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:19:10 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:19:10 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:19:10 features:305] PADDING: 16
[NeMo I 2024-11-13 13:19:11 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 13:19:11 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:19:11 features:305] PADDING: 16
[NeMo I 2024-11-13 13:19:12 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:19:12 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:19:12 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:19:12 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:19:12 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:19:12 features:305] PADDING: 16
[NeMo I 2024-11-13 13:19:12 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:19:12 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:19:12 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:19:12 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:19:12 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  5.52it/s]


[NeMo I 2024-11-13 13:19:12 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:19:12 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:19:12 collections:741] Dataset successfully loaded with 43 items and total duration provided from manifest is  0.60 hours.
[NeMo I 2024-11-13 13:19:12 collections:746] # 43 files loaded accounting to # 1 labels


      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 43/43 [00:09<00:00,  4.30it/s]

[NeMo I 2024-11-13 13:19:22 clustering_diarizer:254] Generating predictions with overlapping input segments




[NeMo I 2024-11-13 13:19:41 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:03<00:00,  3.51s/it]

[NeMo I 2024-11-13 13:19:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:19:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:19:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:19:45 collections:741] Dataset successfully loaded with 1630 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 13:19:45 collections:746] # 1630 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 26/26 [00:02<00:00, 10.13it/s]


[NeMo I 2024-11-13 13:19:48 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:19:48 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:19:48 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:19:48 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:19:48 collections:741] Dataset successfully loaded with 1711 items and total duration provided from manifest is  0.25 hours.
[NeMo I 2024-11-13 13:19:48 collections:746] # 1711 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 27/27 [00:02<00:00, 12.44it/s]


[NeMo I 2024-11-13 13:19:50 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:19:50 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:19:50 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:19:50 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:19:50 collections:741] Dataset successfully loaded with 1849 items and total duration provided from manifest is  0.26 hours.
[NeMo I 2024-11-13 13:19:50 collections:746] # 1849 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 29/29 [00:02<00:00, 13.16it/s]


[NeMo I 2024-11-13 13:19:53 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:19:53 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:19:53 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:19:53 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:19:53 collections:741] Dataset successfully loaded with 2125 items and total duration provided from manifest is  0.28 hours.
[NeMo I 2024-11-13 13:19:53 collections:746] # 2125 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 34/34 [00:02<00:00, 16.22it/s]


[NeMo I 2024-11-13 13:19:56 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:19:56 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:19:56 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:19:56 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:19:56 collections:741] Dataset successfully loaded with 2888 items and total duration provided from manifest is  0.31 hours.
[NeMo I 2024-11-13 13:19:56 collections:746] # 2888 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 46/46 [00:03<00:00, 13.85it/s]


[NeMo I 2024-11-13 13:20:00 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.30s/it]

[NeMo I 2024-11-13 13:20:01 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:20:01 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:20:01 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:20:01 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:20:01 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:20:01 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:20:01 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:20:01 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  7.15it/s]

[NeMo I 2024-11-13 13:20:02 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:20:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:20:02 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:20:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:20:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:20:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:20:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:20:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:20:02 msdd_models:1435]   

[NeMo I 2024-11-13 13:23:32 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:23:32 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:23:32 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:23:32 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:23:33 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:23:33 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:23:33 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:23:36 features:305] PADDING: 16
[NeMo I 2024-11-13 13:23:36 features:305] PADDING: 16




[NeMo I 2024-11-13 13:23:37 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:23:37 features:305] PADDING: 16
[NeMo I 2024-11-13 13:23:38 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:23:38 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:23:38 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:23:38 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:23:38 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:23:38 features:305] PADDING: 16
[NeMo I 2024-11-13 13:23:38 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:23:38 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:23:38 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:23:38 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:23:38 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  9.42it/s]

[NeMo I 2024-11-13 13:23:38 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:23:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:23:38 collections:741] Dataset successfully loaded with 25 items and total duration provided from manifest is  0.34 hours.
[NeMo I 2024-11-13 13:23:38 collections:746] # 25 files loaded accounting to # 1 labels



vad: 100%|██████████| 25/25 [00:05<00:00,  4.86it/s]

[NeMo I 2024-11-13 13:23:43 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 13:23:55 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.19it/s]

[NeMo I 2024-11-13 13:23:56 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:23:56 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:23:56 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:23:56 collections:741] Dataset successfully loaded with 1010 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 13:23:56 collections:746] # 1010 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 16/16 [00:01<00:00,  9.35it/s]


[NeMo I 2024-11-13 13:23:58 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:23:58 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:23:58 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:23:58 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:23:58 collections:741] Dataset successfully loaded with 1106 items and total duration provided from manifest is  0.21 hours.
[NeMo I 2024-11-13 13:23:58 collections:746] # 1106 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 18/18 [00:01<00:00, 12.70it/s]


[NeMo I 2024-11-13 13:23:59 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:23:59 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:23:59 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:23:59 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:23:59 collections:741] Dataset successfully loaded with 1254 items and total duration provided from manifest is  0.22 hours.
[NeMo I 2024-11-13 13:23:59 collections:746] # 1254 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 20/20 [00:01<00:00, 11.09it/s]


[NeMo I 2024-11-13 13:24:03 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:24:03 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:24:03 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:24:03 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:24:03 collections:741] Dataset successfully loaded with 1536 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 13:24:03 collections:746] # 1536 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 24/24 [00:01<00:00, 15.52it/s]


[NeMo I 2024-11-13 13:24:04 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:24:04 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:24:04 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:24:04 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:24:04 collections:741] Dataset successfully loaded with 2203 items and total duration provided from manifest is  0.26 hours.
[NeMo I 2024-11-13 13:24:04 collections:746] # 2203 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 35/35 [00:01<00:00, 17.68it/s]


[NeMo I 2024-11-13 13:24:07 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.39it/s]

[NeMo I 2024-11-13 13:24:07 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:24:07 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:24:08 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:24:08 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:24:08 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:24:08 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:24:08 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:24:08 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  1.48it/s]

[NeMo I 2024-11-13 13:24:09 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:24:09 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:24:09 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:24:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:24:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:24:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:24:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:24:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:24:09 msdd_models:1435]   

[NeMo I 2024-11-13 13:27:15 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:27:15 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:27:15 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:27:15 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:27:17 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:27:17 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:27:17 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:27:17 features:305] PADDING: 16
[NeMo I 2024-11-13 13:27:17 features:305] PADDING: 16




[NeMo I 2024-11-13 13:27:19 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:27:19 features:305] PADDING: 16
[NeMo I 2024-11-13 13:27:20 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:27:20 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:27:20 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:27:20 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:27:20 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:27:20 features:305] PADDING: 16
[NeMo I 2024-11-13 13:27:20 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:27:20 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:27:20 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:27:20 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:27:20 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 11.41it/s]

[NeMo I 2024-11-13 13:27:20 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:27:20 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:27:20 collections:741] Dataset successfully loaded with 22 items and total duration provided from manifest is  0.29 hours.
[NeMo I 2024-11-13 13:27:20 collections:746] # 22 files loaded accounting to # 1 labels



vad: 100%|██████████| 22/22 [00:05<00:00,  4.03it/s]

[NeMo I 2024-11-13 13:27:26 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 13:27:34 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.37it/s]

[NeMo I 2024-11-13 13:27:35 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:27:35 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:27:35 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:27:35 collections:741] Dataset successfully loaded with 852 items and total duration provided from manifest is  0.19 hours.
[NeMo I 2024-11-13 13:27:35 collections:746] # 852 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 14/14 [00:01<00:00,  8.90it/s]


[NeMo I 2024-11-13 13:27:37 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:27:37 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:27:37 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:27:37 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:27:37 collections:741] Dataset successfully loaded with 949 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 13:27:37 collections:746] # 949 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00,  9.52it/s]


[NeMo I 2024-11-13 13:27:39 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:27:39 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:27:39 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:27:39 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:27:39 collections:741] Dataset successfully loaded with 1096 items and total duration provided from manifest is  0.21 hours.
[NeMo I 2024-11-13 13:27:39 collections:746] # 1096 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 18/18 [00:01<00:00, 13.05it/s]

[NeMo I 2024-11-13 13:27:40 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings

[NeMo I 2024-11-13 13:27:40 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:27:40 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:27:40 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:27:40 collections:741] Dataset successfully loaded with 1390 items and total duration provided from manifest is  0.23 hours.
[NeMo I 2024-11-13 13:27:40 collections:746] # 1390 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 22/22 [00:01<00:00, 16.29it/s]


[NeMo I 2024-11-13 13:27:42 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:27:42 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:27:42 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:27:42 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:27:42 collections:741] Dataset successfully loaded with 2055 items and total duration provided from manifest is  0.25 hours.
[NeMo I 2024-11-13 13:27:42 collections:746] # 2055 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 33/33 [00:01<00:00, 17.97it/s]


[NeMo I 2024-11-13 13:27:44 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.73it/s]

[NeMo I 2024-11-13 13:27:45 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:27:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:27:45 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:27:45 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:27:45 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:27:45 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:27:45 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:27:45 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 14.01it/s]

[NeMo I 2024-11-13 13:27:45 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:27:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:27:45 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:27:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:27:45 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:27:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:27:45 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:27:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:27:45 msdd_models:1435]   

[NeMo I 2024-11-13 13:29:54 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:29:54 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:29:54 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:29:54 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:29:56 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:29:56 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:29:56 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:29:56 features:305] PADDING: 16
[NeMo I 2024-11-13 13:29:56 features:305] PADDING: 16




[NeMo I 2024-11-13 13:29:57 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:29:57 features:305] PADDING: 16
[NeMo I 2024-11-13 13:29:57 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:29:57 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:29:57 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:29:57 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:29:57 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:29:58 features:305] PADDING: 16
[NeMo I 2024-11-13 13:29:58 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:29:58 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:29:58 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:29:58 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:29:58 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 16.14it/s]

[NeMo I 2024-11-13 13:29:58 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:29:58 collections:740] Filtered duration for loading collection is  0.00 hours.

[NeMo I 2024-11-13 13:29:58 collections:741] Dataset successfully loaded with 14 items and total duration provided from manifest is  0.19 hours.
[NeMo I 2024-11-13 13:29:58 collections:746] # 14 files loaded accounting to # 1 labels


vad: 100%|██████████| 14/14 [00:02<00:00,  4.75it/s]

[NeMo I 2024-11-13 13:30:01 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 13:30:08 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.12it/s]

[NeMo I 2024-11-13 13:30:08 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:30:08 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:30:08 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:30:08 collections:741] Dataset successfully loaded with 503 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 13:30:08 collections:746] # 503 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  9.68it/s]

[NeMo I 2024-11-13 13:30:09 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:30:09 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:30:09 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:30:09 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:30:09 collections:741] Dataset successfully loaded with 526 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 13:30:09 collections:746] # 526 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 13.27it/s]

[NeMo I 2024-11-13 13:30:10 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:30:10 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:30:10 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:30:10 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:30:10 collections:741] Dataset successfully loaded with 571 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 13:30:10 collections:746] # 571 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 12.94it/s]

[NeMo I 2024-11-13 13:30:11 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:30:11 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:30:11 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:30:11 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:30:11 collections:741] Dataset successfully loaded with 657 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 13:30:11 collections:746] # 657 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 17.24it/s]

[NeMo I 2024-11-13 13:30:11 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:30:11 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:30:11 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:30:11 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:30:11 collections:741] Dataset successfully loaded with 880 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 13:30:11 collections:746] # 880 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 14/14 [00:00<00:00, 17.52it/s]

[NeMo I 2024-11-13 13:30:12 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.60it/s]

[NeMo I 2024-11-13 13:30:13 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:30:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:30:13 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:30:13 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:30:13 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:30:13 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:30:13 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:30:13 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 23.25it/s]

[NeMo I 2024-11-13 13:30:13 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:30:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:30:13 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:30:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:30:13 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:30:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:30:13 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:30:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:30:13 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:31:35 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:31:35 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:31:35 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:31:35 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:31:37 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:31:37 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:31:37 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:31:37 features:305] PADDING: 16
[NeMo I 2024-11-13 13:31:37 features:305] PADDING: 16




[NeMo I 2024-11-13 13:31:38 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:31:39 features:305] PADDING: 16
[NeMo I 2024-11-13 13:31:39 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:31:39 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:31:39 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:31:39 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:31:40 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:31:40 features:305] PADDING: 16
[NeMo I 2024-11-13 13:31:40 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:31:40 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:31:40 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:31:40 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:31:40 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 25.29it/s]


[NeMo I 2024-11-13 13:31:40 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:31:40 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:31:40 collections:741] Dataset successfully loaded with 8 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 13:31:40 collections:746] # 8 files loaded accounting to # 1 labels


vad: 100%|██████████| 8/8 [00:01<00:00,  4.99it/s]

[NeMo I 2024-11-13 13:31:41 clustering_diarizer:254] Generating predictions with overlapping input segments




[NeMo I 2024-11-13 13:31:44 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.93it/s]

[NeMo I 2024-11-13 13:31:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:31:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:31:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:31:45 collections:741] Dataset successfully loaded with 297 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 13:31:45 collections:746] # 297 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.81it/s]

[NeMo I 2024-11-13 13:31:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:31:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:31:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:31:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:31:45 collections:741] Dataset successfully loaded with 326 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 13:31:45 collections:746] # 326 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00, 13.04it/s]

[NeMo I 2024-11-13 13:31:46 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:31:46 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:31:46 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:31:46 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:31:46 collections:741] Dataset successfully loaded with 382 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 13:31:46 collections:746] # 382 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00, 12.63it/s]

[NeMo I 2024-11-13 13:31:46 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:31:46 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:31:46 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:31:46 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:31:46 collections:741] Dataset successfully loaded with 470 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 13:31:46 collections:746] # 470 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 16.42it/s]

[NeMo I 2024-11-13 13:31:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:31:47 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:31:47 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:31:47 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:31:47 collections:741] Dataset successfully loaded with 692 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 13:31:47 collections:746] # 692 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 17.82it/s]


[NeMo I 2024-11-13 13:31:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.15it/s]

[NeMo I 2024-11-13 13:31:48 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:31:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:31:48 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:31:48 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:31:48 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:31:48 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:31:48 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:31:48 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 23.42it/s]

[NeMo I 2024-11-13 13:31:48 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:31:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:31:48 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:31:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:31:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:31:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:31:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:31:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:31:48 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:33:49 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:33:49 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:33:49 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:33:49 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:33:51 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:33:51 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:33:51 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:33:51 features:305] PADDING: 16
[NeMo I 2024-11-13 13:33:51 features:305] PADDING: 16




[NeMo I 2024-11-13 13:33:52 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:33:52 features:305] PADDING: 16
[NeMo I 2024-11-13 13:33:53 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:33:53 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:33:53 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:33:53 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:33:53 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:33:53 features:305] PADDING: 16
[NeMo I 2024-11-13 13:33:54 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:33:54 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:33:54 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:33:54 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:33:54 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 15.96it/s]

[NeMo I 2024-11-13 13:33:54 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:33:54 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:33:54 collections:741] Dataset successfully loaded with 13 items and total duration provided from manifest is  0.17 hours.
[NeMo I 2024-11-13 13:33:54 collections:746] # 13 files loaded accounting to # 1 labels



vad: 100%|██████████| 13/13 [00:02<00:00,  4.56it/s]

[NeMo I 2024-11-13 13:33:57 clustering_diarizer:254] Generating predictions with overlapping input segments




[NeMo I 2024-11-13 13:34:01 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.30it/s]

[NeMo I 2024-11-13 13:34:02 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:34:02 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:34:02 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:34:02 collections:741] Dataset successfully loaded with 487 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 13:34:02 collections:746] # 487 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  9.10it/s]


[NeMo I 2024-11-13 13:34:03 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:34:03 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:34:03 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:34:03 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:34:03 collections:741] Dataset successfully loaded with 528 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 13:34:03 collections:746] # 528 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 12.65it/s]

[NeMo I 2024-11-13 13:34:04 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:34:04 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:34:04 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:34:04 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:34:04 collections:741] Dataset successfully loaded with 589 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 13:34:04 collections:746] # 589 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 12.54it/s]


[NeMo I 2024-11-13 13:34:05 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:34:05 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:34:05 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:34:05 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:34:05 collections:741] Dataset successfully loaded with 712 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 13:34:05 collections:746] # 712 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 12/12 [00:00<00:00, 14.76it/s]


[NeMo I 2024-11-13 13:34:06 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:34:06 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:34:06 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:34:06 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:34:06 collections:741] Dataset successfully loaded with 1015 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 13:34:06 collections:746] # 1015 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 16/16 [00:01<00:00, 12.46it/s]


[NeMo I 2024-11-13 13:34:08 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.26it/s]

[NeMo I 2024-11-13 13:34:09 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:34:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:34:09 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:34:09 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:34:09 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:34:09 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:34:09 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:34:09 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 16.57it/s]

[NeMo I 2024-11-13 13:34:09 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:34:09 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:34:09 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:34:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:34:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:34:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:34:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:34:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:34:09 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:40:08 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:40:08 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:40:08 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:40:08 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:40:13 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:40:13 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:40:13 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:40:13 features:305] PADDING: 16
[NeMo I 2024-11-13 13:40:13 features:305] PADDING: 16




[NeMo I 2024-11-13 13:40:14 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:40:14 features:305] PADDING: 16
[NeMo I 2024-11-13 13:40:14 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:40:14 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:40:14 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:40:14 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:40:15 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:40:15 features:305] PADDING: 16
[NeMo I 2024-11-13 13:40:15 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:40:15 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:40:15 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:40:15 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:40:15 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  4.98it/s]

[NeMo I 2024-11-13 13:40:15 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:40:15 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:40:15 collections:741] Dataset successfully loaded with 38 items and total duration provided from manifest is  0.52 hours.
[NeMo I 2024-11-13 13:40:15 collections:746] # 38 files loaded accounting to # 1 labels



vad: 100%|██████████| 38/38 [00:08<00:00,  4.24it/s]

[NeMo I 2024-11-13 13:40:24 clustering_diarizer:254] Generating predictions with overlapping input segments
[NeMo I 2024-11-13 13:40:40 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:01<00:00,  1.31s/it]

[NeMo I 2024-11-13 13:40:42 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:40:42 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:40:42 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:40:42 collections:741] Dataset successfully loaded with 1407 items and total duration provided from manifest is  0.39 hours.
[NeMo I 2024-11-13 13:40:42 collections:746] # 1407 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 22/22 [00:02<00:00,  9.16it/s]


[NeMo I 2024-11-13 13:40:44 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:40:44 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:40:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:40:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:40:45 collections:741] Dataset successfully loaded with 1619 items and total duration provided from manifest is  0.41 hours.
[NeMo I 2024-11-13 13:40:45 collections:746] # 1619 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 26/26 [00:02<00:00, 11.81it/s]


[NeMo I 2024-11-13 13:40:48 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:40:48 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:40:48 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:40:48 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:40:48 collections:741] Dataset successfully loaded with 1949 items and total duration provided from manifest is  0.43 hours.
[NeMo I 2024-11-13 13:40:48 collections:746] # 1949 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 31/31 [00:02<00:00, 10.40it/s]


[NeMo I 2024-11-13 13:40:51 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:40:51 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:40:51 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:40:51 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:40:51 collections:741] Dataset successfully loaded with 2558 items and total duration provided from manifest is  0.46 hours.
[NeMo I 2024-11-13 13:40:51 collections:746] # 2558 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 40/40 [00:02<00:00, 15.75it/s]


[NeMo I 2024-11-13 13:40:54 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:40:54 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:40:54 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:40:54 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:40:54 collections:741] Dataset successfully loaded with 3856 items and total duration provided from manifest is  0.49 hours.
[NeMo I 2024-11-13 13:40:54 collections:746] # 3856 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 61/61 [00:03<00:00, 17.64it/s]


[NeMo I 2024-11-13 13:40:58 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.79s/it]

[NeMo I 2024-11-13 13:41:00 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:41:00 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:41:01 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:41:01 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:41:01 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:41:01 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:41:01 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:41:01 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  2.47it/s]

[NeMo I 2024-11-13 13:41:02 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:41:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:41:02 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:41:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:41:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:41:03 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:41:03 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:41:03 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:41:03 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:47:40 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:47:40 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:47:40 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:47:40 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:47:46 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:47:46 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:47:46 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:47:46 features:305] PADDING: 16
[NeMo I 2024-11-13 13:47:46 features:305] PADDING: 16




[NeMo I 2024-11-13 13:47:47 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:47:47 features:305] PADDING: 16
[NeMo I 2024-11-13 13:47:47 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:47:47 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:47:47 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:47:47 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:47:48 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:47:48 features:305] PADDING: 16
[NeMo I 2024-11-13 13:47:48 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:47:48 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:47:48 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:47:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:47:48 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  4.06it/s]

[NeMo I 2024-11-13 13:47:48 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:47:48 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:47:48 collections:741] Dataset successfully loaded with 53 items and total duration provided from manifest is  0.73 hours.
[NeMo I 2024-11-13 13:47:48 collections:746] # 53 files loaded accounting to # 1 labels



vad: 100%|██████████| 53/53 [00:12<00:00,  4.16it/s]

[NeMo I 2024-11-13 13:48:01 clustering_diarizer:254] Generating predictions with overlapping input segments
[NeMo I 2024-11-13 13:48:25 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:02<00:00,  2.75s/it]

[NeMo I 2024-11-13 13:48:28 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:48:28 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:48:28 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:48:28 collections:741] Dataset successfully loaded with 1457 items and total duration provided from manifest is  0.13 hours.
[NeMo I 2024-11-13 13:48:28 collections:746] # 1457 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 23/23 [00:02<00:00, 10.76it/s]


[NeMo I 2024-11-13 13:48:30 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:48:30 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:48:31 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:48:31 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:48:31 collections:741] Dataset successfully loaded with 1478 items and total duration provided from manifest is  0.14 hours.
[NeMo I 2024-11-13 13:48:31 collections:746] # 1478 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 24/24 [00:01<00:00, 13.10it/s]


[NeMo I 2024-11-13 13:48:33 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:48:33 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:48:33 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:48:33 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:48:33 collections:741] Dataset successfully loaded with 1512 items and total duration provided from manifest is  0.14 hours.
[NeMo I 2024-11-13 13:48:33 collections:746] # 1512 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 24/24 [00:01<00:00, 13.63it/s]


[NeMo I 2024-11-13 13:48:35 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:48:35 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:48:35 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:48:35 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:48:35 collections:741] Dataset successfully loaded with 1595 items and total duration provided from manifest is  0.15 hours.
[NeMo I 2024-11-13 13:48:35 collections:746] # 1595 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 25/25 [00:01<00:00, 16.06it/s]


[NeMo I 2024-11-13 13:48:36 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:48:36 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:48:36 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:48:36 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:48:36 collections:741] Dataset successfully loaded with 1904 items and total duration provided from manifest is  0.16 hours.
[NeMo I 2024-11-13 13:48:36 collections:746] # 1904 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 30/30 [00:01<00:00, 16.37it/s]


[NeMo I 2024-11-13 13:48:39 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.24s/it]

[NeMo I 2024-11-13 13:48:40 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:48:40 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:48:40 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:48:40 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:48:40 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:48:40 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:48:40 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:48:40 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  7.51it/s]

[NeMo I 2024-11-13 13:48:41 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:48:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:48:41 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:48:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:48:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:48:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:48:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:48:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:48:41 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:52:17 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:52:17 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:52:17 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:52:17 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:52:18 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:52:18 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:52:18 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:52:22 features:305] PADDING: 16
[NeMo I 2024-11-13 13:52:22 features:305] PADDING: 16




[NeMo I 2024-11-13 13:52:23 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:52:23 features:305] PADDING: 16
[NeMo I 2024-11-13 13:52:24 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:52:24 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:52:24 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:52:24 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:52:24 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:52:24 features:305] PADDING: 16
[NeMo I 2024-11-13 13:52:24 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:52:25 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:52:25 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:52:25 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:52:25 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  6.83it/s]

[NeMo I 2024-11-13 13:52:25 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:52:25 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:52:25 collections:741] Dataset successfully loaded with 20 items and total duration provided from manifest is  0.27 hours.
[NeMo I 2024-11-13 13:52:25 collections:746] # 20 files loaded accounting to # 1 labels



vad: 100%|██████████| 20/20 [00:04<00:00,  4.59it/s]

[NeMo I 2024-11-13 13:52:29 clustering_diarizer:254] Generating predictions with overlapping input segments
[NeMo I 2024-11-13 13:52:38 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.16it/s]

[NeMo I 2024-11-13 13:52:39 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:52:39 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:52:39 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:52:39 collections:741] Dataset successfully loaded with 809 items and total duration provided from manifest is  0.18 hours.
[NeMo I 2024-11-13 13:52:39 collections:746] # 809 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00,  9.18it/s]


[NeMo I 2024-11-13 13:52:41 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:52:41 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:52:41 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:52:41 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:52:41 collections:741] Dataset successfully loaded with 901 items and total duration provided from manifest is  0.19 hours.
[NeMo I 2024-11-13 13:52:41 collections:746] # 901 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00, 12.68it/s]


[NeMo I 2024-11-13 13:52:42 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:52:42 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:52:42 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:52:42 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:52:42 collections:741] Dataset successfully loaded with 1040 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 13:52:42 collections:746] # 1040 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 17/17 [00:01<00:00, 13.24it/s]

[NeMo I 2024-11-13 13:52:43 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:52:43 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:52:43 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:52:43 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:52:43 collections:741] Dataset successfully loaded with 1314 items and total duration provided from manifest is  0.22 hours.
[NeMo I 2024-11-13 13:52:43 collections:746] # 1314 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 21/21 [00:01<00:00, 15.77it/s]


[NeMo I 2024-11-13 13:52:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:52:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:52:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:52:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:52:45 collections:741] Dataset successfully loaded with 1941 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 13:52:45 collections:746] # 1941 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 31/31 [00:01<00:00, 17.48it/s]


[NeMo I 2024-11-13 13:52:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.52it/s]

[NeMo I 2024-11-13 13:52:48 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:52:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:52:48 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:52:48 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:52:48 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:52:48 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:52:48 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:52:48 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00, 13.06it/s]

[NeMo I 2024-11-13 13:52:48 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:52:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:52:48 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:52:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:52:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:52:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:52:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:52:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:52:48 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:57:44 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:57:44 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:57:44 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:57:44 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:57:46 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:57:46 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:57:46 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:57:46 features:305] PADDING: 16
[NeMo I 2024-11-13 13:57:46 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 13:57:47 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:57:47 features:305] PADDING: 16
[NeMo I 2024-11-13 13:57:48 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:57:48 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:57:48 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:57:48 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:57:48 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:57:48 features:305] PADDING: 16
[NeMo I 2024-11-13 13:57:48 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:57:48 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:57:48 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:57:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:57:48 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  6.76it/s]

[NeMo I 2024-11-13 13:57:48 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:57:48 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:57:48 collections:741] Dataset successfully loaded with 28 items and total duration provided from manifest is  0.39 hours.
[NeMo I 2024-11-13 13:57:48 collections:746] # 28 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 28/28 [00:07<00:00,  3.88it/s]

[NeMo I 2024-11-13 13:57:55 clustering_diarizer:254] Generating predictions with overlapping input segments
[NeMo I 2024-11-13 13:58:09 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.02it/s]

[NeMo I 2024-11-13 13:58:10 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:58:10 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:58:10 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:58:10 collections:741] Dataset successfully loaded with 1099 items and total duration provided from manifest is  0.17 hours.
[NeMo I 2024-11-13 13:58:10 collections:746] # 1099 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 18/18 [00:01<00:00,  9.79it/s]


[NeMo I 2024-11-13 13:58:12 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:58:12 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:58:12 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:58:12 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:58:12 collections:741] Dataset successfully loaded with 1160 items and total duration provided from manifest is  0.18 hours.
[NeMo I 2024-11-13 13:58:12 collections:746] # 1160 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00, 12.69it/s]


[NeMo I 2024-11-13 13:58:13 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:58:13 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:58:13 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:58:13 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:58:13 collections:741] Dataset successfully loaded with 1262 items and total duration provided from manifest is  0.19 hours.
[NeMo I 2024-11-13 13:58:13 collections:746] # 1262 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 20/20 [00:01<00:00, 12.83it/s]


[NeMo I 2024-11-13 13:58:15 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:58:15 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:58:15 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:58:15 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:58:15 collections:741] Dataset successfully loaded with 1477 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 13:58:15 collections:746] # 1477 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 24/24 [00:01<00:00, 16.22it/s]


[NeMo I 2024-11-13 13:58:17 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:58:17 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:58:17 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:58:17 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:58:17 collections:741] Dataset successfully loaded with 2049 items and total duration provided from manifest is  0.23 hours.
[NeMo I 2024-11-13 13:58:17 collections:746] # 2049 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 33/33 [00:02<00:00, 14.04it/s]


[NeMo I 2024-11-13 13:58:21 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.34it/s]

[NeMo I 2024-11-13 13:58:22 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:58:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:58:22 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:58:22 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:58:22 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:58:22 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:58:22 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:58:22 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00,  9.01it/s]

[NeMo I 2024-11-13 13:58:23 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:58:23 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:58:23 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 13:58:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:58:23 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:58:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:58:23 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:58:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:58:23 msdd_models:1435]   
    




[NeMo I 2024-11-13 13:59:43 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 13:59:43 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:59:43 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 13:59:43 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:59:45 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 13:59:45 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 13:59:45 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 13:59:45 features:305] PADDING: 16
[NeMo I 2024-11-13 13:59:45 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 13:59:45 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 13:59:45 features:305] PADDING: 16
[NeMo I 2024-11-13 13:59:46 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 13:59:46 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:59:46 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 13:59:46 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 13:59:46 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 13:59:46 features:305] PADDING: 16
[NeMo I 2024-11-13 13:59:46 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 13:59:46 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 13:59:46 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 13:59:46 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:59:46 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 35.18it/s]

[NeMo I 2024-11-13 13:59:46 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 13:59:46 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:59:46 collections:741] Dataset successfully loaded with 6 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 13:59:46 collections:746] # 6 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 6/6 [00:01<00:00,  4.19it/s]

[NeMo I 2024-11-13 13:59:48 clustering_diarizer:254] Generating predictions with overlapping input segments
[NeMo I 2024-11-13 13:59:51 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.04it/s]

[NeMo I 2024-11-13 13:59:51 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 13:59:51 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:59:51 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:59:51 collections:741] Dataset successfully loaded with 231 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 13:59:51 collections:746] # 231 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  8.51it/s]


[NeMo I 2024-11-13 13:59:52 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:59:52 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 13:59:52 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:59:52 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:59:52 collections:741] Dataset successfully loaded with 256 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 13:59:52 collections:746] # 256 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00, 10.32it/s]

[NeMo I 2024-11-13 13:59:53 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:59:53 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 13:59:53 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:59:53 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:59:53 collections:741] Dataset successfully loaded with 300 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 13:59:53 collections:746] # 300 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 13.52it/s]

[NeMo I 2024-11-13 13:59:53 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:59:53 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 13:59:53 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:59:53 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:59:53 collections:741] Dataset successfully loaded with 370 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 13:59:53 collections:746] # 370 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00, 16.03it/s]

[NeMo I 2024-11-13 13:59:54 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 13:59:54 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 13:59:54 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 13:59:54 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 13:59:54 collections:741] Dataset successfully loaded with 541 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 13:59:54 collections:746] # 541 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 18.24it/s]

[NeMo I 2024-11-13 13:59:54 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.99it/s]

[NeMo I 2024-11-13 13:59:54 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 13:59:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:59:55 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 13:59:55 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 13:59:55 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 13:59:55 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 13:59:55 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 13:59:55 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00, 23.98it/s]


[NeMo I 2024-11-13 13:59:55 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 13:59:55 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 13:59:55 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:59:55 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:59:55 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:59:55 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:59:55 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 13:59:55 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 13:59:55 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:04:23 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:04:23 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:04:23 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:04:23 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:04:25 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:04:25 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:04:25 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:04:25 features:305] PADDING: 16
[NeMo I 2024-11-13 14:04:25 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 14:04:27 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:04:28 features:305] PADDING: 16
[NeMo I 2024-11-13 14:04:28 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:04:28 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:04:28 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:04:28 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:04:28 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:04:28 features:305] PADDING: 16
[NeMo I 2024-11-13 14:04:28 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:04:28 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:04:28 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:04:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:04:28 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  4.43it/s]

[NeMo I 2024-11-13 14:04:29 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:04:29 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:04:29 collections:741] Dataset successfully loaded with 27 items and total duration provided from manifest is  0.37 hours.
[NeMo I 2024-11-13 14:04:29 collections:746] # 27 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 27/27 [00:06<00:00,  4.25it/s]

[NeMo I 2024-11-13 14:04:35 clustering_diarizer:254] Generating predictions with overlapping input segments
[NeMo I 2024-11-13 14:04:48 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.06it/s]

[NeMo I 2024-11-13 14:04:49 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:04:49 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:04:49 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:04:49 collections:741] Dataset successfully loaded with 871 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 14:04:49 collections:746] # 871 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 14/14 [00:01<00:00, 10.77it/s]

[NeMo I 2024-11-13 14:04:50 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:04:50 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:04:51 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:04:51 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:04:51 collections:741] Dataset successfully loaded with 890 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:04:51 collections:746] # 890 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 14/14 [00:01<00:00, 12.32it/s]


[NeMo I 2024-11-13 14:04:52 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:04:52 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:04:52 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:04:52 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:04:52 collections:741] Dataset successfully loaded with 922 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:04:52 collections:746] # 922 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00, 13.39it/s]

[NeMo I 2024-11-13 14:04:53 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:04:53 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:04:53 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:04:53 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:04:53 collections:741] Dataset successfully loaded with 1003 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 14:04:53 collections:746] # 1003 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 16/16 [00:00<00:00, 16.07it/s]

[NeMo I 2024-11-13 14:04:54 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:04:54 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json





[NeMo I 2024-11-13 14:04:54 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:04:54 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:04:54 collections:741] Dataset successfully loaded with 1241 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 14:04:54 collections:746] # 1241 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 20/20 [00:01<00:00, 17.86it/s]

[NeMo I 2024-11-13 14:04:55 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.09it/s]

[NeMo I 2024-11-13 14:04:56 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:04:56 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:04:56 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:04:56 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:04:56 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:04:56 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:04:56 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:04:56 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 11.37it/s]

[NeMo I 2024-11-13 14:04:56 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:04:56 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:04:56 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:04:57 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:04:57 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:04:57 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:04:57 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:04:57 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:04:57 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:10:04 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:10:04 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:10:04 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:10:04 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:10:10 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:10:10 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:10:10 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:10:10 features:305] PADDING: 16
[NeMo I 2024-11-13 14:10:10 features:305] PADDING: 16




[NeMo I 2024-11-13 14:10:11 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:10:11 features:305] PADDING: 16
[NeMo I 2024-11-13 14:10:12 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:10:12 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:10:12 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:10:12 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:10:12 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:10:12 features:305] PADDING: 16
[NeMo I 2024-11-13 14:10:12 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:10:12 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:10:12 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:10:12 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:10:12 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  6.74it/s]

[NeMo I 2024-11-13 14:10:12 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:10:12 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:10:12 collections:741] Dataset successfully loaded with 33 items and total duration provided from manifest is  0.45 hours.
[NeMo I 2024-11-13 14:10:12 collections:746] # 33 files loaded accounting to # 1 labels



vad: 100%|██████████| 33/33 [00:07<00:00,  4.37it/s]

[NeMo I 2024-11-13 14:10:20 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:10:35 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:01<00:00,  1.14s/it]

[NeMo I 2024-11-13 14:10:36 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:10:36 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:10:36 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:10:36 collections:741] Dataset successfully loaded with 1241 items and total duration provided from manifest is  0.22 hours.
[NeMo I 2024-11-13 14:10:36 collections:746] # 1241 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 20/20 [00:02<00:00,  9.45it/s]


[NeMo I 2024-11-13 14:10:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:10:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:10:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:10:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:10:38 collections:741] Dataset successfully loaded with 1329 items and total duration provided from manifest is  0.23 hours.
[NeMo I 2024-11-13 14:10:38 collections:746] # 1329 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 21/21 [00:01<00:00, 12.04it/s]

[NeMo I 2024-11-13 14:10:40 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:10:40 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json





[NeMo I 2024-11-13 14:10:40 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:10:40 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:10:40 collections:741] Dataset successfully loaded with 1485 items and total duration provided from manifest is  0.25 hours.
[NeMo I 2024-11-13 14:10:40 collections:746] # 1485 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 24/24 [00:01<00:00, 13.18it/s]


[NeMo I 2024-11-13 14:10:42 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:10:42 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:10:42 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:10:42 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:10:42 collections:741] Dataset successfully loaded with 1778 items and total duration provided from manifest is  0.26 hours.
[NeMo I 2024-11-13 14:10:42 collections:746] # 1778 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 28/28 [00:01<00:00, 15.81it/s]


[NeMo I 2024-11-13 14:10:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:10:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:10:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:10:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:10:45 collections:741] Dataset successfully loaded with 2525 items and total duration provided from manifest is  0.29 hours.
[NeMo I 2024-11-13 14:10:45 collections:746] # 2525 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 40/40 [00:02<00:00, 14.56it/s]


[NeMo I 2024-11-13 14:10:48 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.13it/s]

[NeMo I 2024-11-13 14:10:49 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:10:49 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:10:49 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:10:49 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:10:49 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:10:49 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:10:49 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:10:49 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  9.46it/s]

[NeMo I 2024-11-13 14:10:49 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:10:49 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:10:49 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:10:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:10:50 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:10:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:10:50 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:10:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:10:50 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:12:07 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:12:07 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:12:07 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:12:07 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:12:09 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:12:09 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:12:09 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:12:09 features:305] PADDING: 16
[NeMo I 2024-11-13 14:12:09 features:305] PADDING: 16




[NeMo I 2024-11-13 14:12:10 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:12:10 features:305] PADDING: 16
[NeMo I 2024-11-13 14:12:10 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:12:10 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:12:10 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:12:10 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:12:11 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:12:11 features:305] PADDING: 16
[NeMo I 2024-11-13 14:12:11 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:12:11 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:12:11 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:12:11 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:12:11 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 36.01it/s]

[NeMo I 2024-11-13 14:12:11 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:12:11 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:12:11 collections:741] Dataset successfully loaded with 6 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:12:11 collections:746] # 6 files loaded accounting to # 1 labels



vad: 100%|██████████| 6/6 [00:01<00:00,  4.75it/s]

[NeMo I 2024-11-13 14:12:12 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:12:15 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.95it/s]

[NeMo I 2024-11-13 14:12:15 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:12:15 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:12:15 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:12:15 collections:741] Dataset successfully loaded with 216 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:12:15 collections:746] # 216 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  8.56it/s]

[NeMo I 2024-11-13 14:12:16 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:12:16 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:12:16 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:12:16 collections:740] Filtered duration for loading collection is  0.00 hours.





[NeMo I 2024-11-13 14:12:16 collections:741] Dataset successfully loaded with 241 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:12:16 collections:746] # 241 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00, 11.20it/s]

[NeMo I 2024-11-13 14:12:16 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:12:16 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:12:16 clustering_diarizer:347] Extracting embeddings for Diarization





[NeMo I 2024-11-13 14:12:16 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:12:16 collections:741] Dataset successfully loaded with 276 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:12:16 collections:746] # 276 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 11.58it/s]

[NeMo I 2024-11-13 14:12:17 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:12:17 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:12:17 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:12:17 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:12:17 collections:741] Dataset successfully loaded with 338 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:12:17 collections:746] # 338 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00, 14.87it/s]

[NeMo I 2024-11-13 14:12:17 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:12:17 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:12:17 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:12:17 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:12:17 collections:741] Dataset successfully loaded with 502 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:12:17 collections:746] # 502 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 17.41it/s]

[NeMo I 2024-11-13 14:12:18 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.90it/s]

[NeMo I 2024-11-13 14:12:18 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:12:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:12:18 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:12:18 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:12:18 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:12:18 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:12:18 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:12:18 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 26.77it/s]

[NeMo I 2024-11-13 14:12:18 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:12:18 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:12:18 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:12:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:12:18 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:12:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:12:19 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:12:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:12:19 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:14:54 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:14:54 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:14:54 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:14:54 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:14:55 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:14:55 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:14:55 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:14:55 features:305] PADDING: 16
[NeMo I 2024-11-13 14:14:56 features:305] PADDING: 16




[NeMo I 2024-11-13 14:14:58 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:14:58 features:305] PADDING: 16
[NeMo I 2024-11-13 14:14:59 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:14:59 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:14:59 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:14:59 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:15:00 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:15:00 features:305] PADDING: 16
[NeMo I 2024-11-13 14:15:00 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:15:00 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:15:00 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:15:00 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:15:00 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  9.60it/s]

[NeMo I 2024-11-13 14:15:00 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:15:00 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:15:00 collections:741] Dataset successfully loaded with 19 items and total duration provided from manifest is  0.26 hours.
[NeMo I 2024-11-13 14:15:00 collections:746] # 19 files loaded accounting to # 1 labels



vad: 100%|██████████| 19/19 [00:04<00:00,  4.29it/s]

[NeMo I 2024-11-13 14:15:04 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:15:13 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:01<00:00,  1.03s/it]

[NeMo I 2024-11-13 14:15:14 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:15:14 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:15:14 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:15:14 collections:741] Dataset successfully loaded with 614 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:15:14 collections:746] # 614 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 10/10 [00:01<00:00,  9.10it/s]


[NeMo I 2024-11-13 14:15:15 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:15:15 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:15:15 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:15:15 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:15:15 collections:741] Dataset successfully loaded with 641 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:15:15 collections:746] # 641 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 12.89it/s]

[NeMo I 2024-11-13 14:15:16 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:15:16 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:15:16 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:15:16 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:15:16 collections:741] Dataset successfully loaded with 677 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:15:16 collections:746] # 677 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 13.23it/s]

[NeMo I 2024-11-13 14:15:17 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:15:17 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:15:17 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:15:17 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:15:17 collections:741] Dataset successfully loaded with 756 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 14:15:17 collections:746] # 756 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 12/12 [00:00<00:00, 16.20it/s]


[NeMo I 2024-11-13 14:15:18 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:15:18 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:15:18 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:15:18 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:15:18 collections:741] Dataset successfully loaded with 986 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:15:18 collections:746] # 986 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 16/16 [00:00<00:00, 17.84it/s]

[NeMo I 2024-11-13 14:15:19 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.37it/s]

[NeMo I 2024-11-13 14:15:19 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:15:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:15:19 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:15:19 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:15:19 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:15:19 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:15:19 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:15:19 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 19.43it/s]

[NeMo I 2024-11-13 14:15:20 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:15:20 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:15:20 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:15:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:15:20 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:15:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:15:20 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:15:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:15:20 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:19:48 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:19:48 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:19:48 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:19:48 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:19:50 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:19:50 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:19:50 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:19:50 features:305] PADDING: 16
[NeMo I 2024-11-13 14:19:50 features:305] PADDING: 16




[NeMo I 2024-11-13 14:19:53 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:19:53 features:305] PADDING: 16
[NeMo I 2024-11-13 14:19:53 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:19:53 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:19:53 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:19:53 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:19:53 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:19:53 features:305] PADDING: 16
[NeMo I 2024-11-13 14:19:53 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:19:54 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:19:54 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:19:54 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:19:54 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  7.64it/s]

[NeMo I 2024-11-13 14:19:54 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:19:54 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:19:54 collections:741] Dataset successfully loaded with 25 items and total duration provided from manifest is  0.34 hours.
[NeMo I 2024-11-13 14:19:54 collections:746] # 25 files loaded accounting to # 1 labels



vad: 100%|██████████| 25/25 [00:05<00:00,  4.55it/s]

[NeMo I 2024-11-13 14:19:59 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:20:11 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.16it/s]

[NeMo I 2024-11-13 14:20:12 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:20:12 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:20:12 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:20:12 collections:741] Dataset successfully loaded with 983 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 14:20:12 collections:746] # 983 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 16/16 [00:01<00:00,  8.92it/s]


[NeMo I 2024-11-13 14:20:14 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:20:14 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:20:14 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:20:14 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:20:14 collections:741] Dataset successfully loaded with 1076 items and total duration provided from manifest is  0.21 hours.
[NeMo I 2024-11-13 14:20:14 collections:746] # 1076 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 17/17 [00:01<00:00,  9.32it/s]


[NeMo I 2024-11-13 14:20:17 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:20:17 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:20:17 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:20:17 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:20:17 collections:741] Dataset successfully loaded with 1233 items and total duration provided from manifest is  0.23 hours.
[NeMo I 2024-11-13 14:20:17 collections:746] # 1233 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 20/20 [00:01<00:00, 13.26it/s]


[NeMo I 2024-11-13 14:20:19 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:20:19 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:20:19 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:20:19 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:20:19 collections:741] Dataset successfully loaded with 1529 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 14:20:19 collections:746] # 1529 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 24/24 [00:01<00:00, 15.92it/s]


[NeMo I 2024-11-13 14:20:20 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:20:20 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:20:20 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:20:20 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:20:20 collections:741] Dataset successfully loaded with 2235 items and total duration provided from manifest is  0.27 hours.
[NeMo I 2024-11-13 14:20:20 collections:746] # 2235 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 35/35 [00:02<00:00, 17.36it/s]


[NeMo I 2024-11-13 14:20:23 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]

[NeMo I 2024-11-13 14:20:24 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:20:24 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:20:24 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:20:24 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:20:24 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:20:24 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:20:24 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:20:24 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  7.90it/s]

[NeMo I 2024-11-13 14:20:25 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:20:25 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:20:25 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:20:25 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:20:25 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:20:25 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:20:25 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:20:25 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:20:25 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:22:33 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:22:33 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:22:33 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:22:33 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:22:35 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:22:35 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:22:35 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:22:35 features:305] PADDING: 16
[NeMo I 2024-11-13 14:22:35 features:305] PADDING: 16




[NeMo I 2024-11-13 14:22:36 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:22:36 features:305] PADDING: 16
[NeMo I 2024-11-13 14:22:36 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:22:36 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:22:36 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:22:36 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:22:36 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:22:36 features:305] PADDING: 16
[NeMo I 2024-11-13 14:22:36 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:22:36 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:22:36 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:22:36 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:22:36 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 12.94it/s]

[NeMo I 2024-11-13 14:22:37 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:22:37 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:22:37 collections:741] Dataset successfully loaded with 13 items and total duration provided from manifest is  0.18 hours.
[NeMo I 2024-11-13 14:22:37 collections:746] # 13 files loaded accounting to # 1 labels



vad: 100%|██████████| 13/13 [00:03<00:00,  4.28it/s]

[NeMo I 2024-11-13 14:22:40 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:22:46 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.21it/s]

[NeMo I 2024-11-13 14:22:46 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:22:46 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:22:46 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:22:46 collections:741] Dataset successfully loaded with 457 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:22:46 collections:746] # 457 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  9.69it/s]

[NeMo I 2024-11-13 14:22:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:22:47 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json





[NeMo I 2024-11-13 14:22:47 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:22:47 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:22:47 collections:741] Dataset successfully loaded with 470 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:22:47 collections:746] # 470 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 12.96it/s]

[NeMo I 2024-11-13 14:22:48 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:22:48 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:22:48 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:22:48 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:22:48 collections:741] Dataset successfully loaded with 500 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:22:48 collections:746] # 500 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 13.07it/s]

[NeMo I 2024-11-13 14:22:49 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:22:49 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json





[NeMo I 2024-11-13 14:22:49 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:22:49 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:22:49 collections:741] Dataset successfully loaded with 577 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 14:22:49 collections:746] # 577 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 16.74it/s]

[NeMo I 2024-11-13 14:22:49 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:22:49 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:22:49 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:22:49 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:22:49 collections:741] Dataset successfully loaded with 751 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:22:49 collections:746] # 751 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 12/12 [00:00<00:00, 16.84it/s]

[NeMo I 2024-11-13 14:22:50 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.91it/s]

[NeMo I 2024-11-13 14:22:51 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:22:51 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:22:51 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:22:51 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:22:51 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:22:51 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:22:51 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 19.86it/s]

[NeMo I 2024-11-13 14:22:51 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:22:51 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:22:51 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:22:51 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:22:51 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:22:51 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:26:05 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:26:05 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:26:05 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:26:05 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:26:07 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:26:07 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:26:07 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:26:07 features:305] PADDING: 16
[NeMo I 2024-11-13 14:26:07 features:305] PADDING: 16




[NeMo I 2024-11-13 14:26:07 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:26:08 features:305] PADDING: 16
[NeMo I 2024-11-13 14:26:08 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:26:08 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:26:08 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:26:08 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:26:08 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:26:08 features:305] PADDING: 16
[NeMo I 2024-11-13 14:26:08 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:26:08 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:26:08 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:26:08 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:26:08 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  8.44it/s]

[NeMo I 2024-11-13 14:26:08 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:26:08 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:26:08 collections:741] Dataset successfully loaded with 19 items and total duration provided from manifest is  0.26 hours.
[NeMo I 2024-11-13 14:26:08 collections:746] # 19 files loaded accounting to # 1 labels



vad: 100%|██████████| 19/19 [00:05<00:00,  3.68it/s]


[NeMo I 2024-11-13 14:26:14 clustering_diarizer:254] Generating predictions with overlapping input segments


                                                               

[NeMo I 2024-11-13 14:26:21 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.52it/s]

[NeMo I 2024-11-13 14:26:22 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:26:22 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:26:22 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:26:22 collections:741] Dataset successfully loaded with 700 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 14:26:22 collections:746] # 700 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  8.07it/s]


[NeMo I 2024-11-13 14:26:24 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:26:24 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:26:24 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:26:24 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:26:24 collections:741] Dataset successfully loaded with 738 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 14:26:24 collections:746] # 738 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 12/12 [00:01<00:00,  9.90it/s]


[NeMo I 2024-11-13 14:26:26 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:26:26 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:26:26 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:26:26 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:26:26 collections:741] Dataset successfully loaded with 812 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 14:26:26 collections:746] # 812 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00, 12.95it/s]

[NeMo I 2024-11-13 14:26:27 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:26:27 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:26:27 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:26:27 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:26:27 collections:741] Dataset successfully loaded with 968 items and total duration provided from manifest is  0.13 hours.
[NeMo I 2024-11-13 14:26:27 collections:746] # 968 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 16/16 [00:00<00:00, 16.72it/s]

[NeMo I 2024-11-13 14:26:28 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:26:28 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:26:28 clustering_diarizer:347] Extracting embeddings for Diarization





[NeMo I 2024-11-13 14:26:28 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:26:28 collections:741] Dataset successfully loaded with 1346 items and total duration provided from manifest is  0.15 hours.
[NeMo I 2024-11-13 14:26:28 collections:746] # 1346 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 22/22 [00:01<00:00, 17.61it/s]

[NeMo I 2024-11-13 14:26:30 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.90it/s]

[NeMo I 2024-11-13 14:26:30 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:26:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:26:30 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:26:30 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:26:30 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:26:30 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:26:30 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:26:30 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 18.61it/s]

[NeMo I 2024-11-13 14:26:31 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:26:31 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:26:31 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:26:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:26:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:26:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:26:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:26:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:26:31 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:29:04 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:29:04 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:29:04 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:29:04 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:29:05 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:29:05 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:29:05 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:29:05 features:305] PADDING: 16
[NeMo I 2024-11-13 14:29:06 features:305] PADDING: 16




[NeMo I 2024-11-13 14:29:06 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:29:06 features:305] PADDING: 16
[NeMo I 2024-11-13 14:29:07 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:29:07 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:29:07 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:29:07 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:29:07 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:29:07 features:305] PADDING: 16
[NeMo I 2024-11-13 14:29:07 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:29:07 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:29:07 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:29:07 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:29:07 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  9.15it/s]

[NeMo I 2024-11-13 14:29:07 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:29:07 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:29:07 collections:741] Dataset successfully loaded with 18 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 14:29:07 collections:746] # 18 files loaded accounting to # 1 labels



vad: 100%|██████████| 18/18 [00:04<00:00,  3.82it/s]

[NeMo I 2024-11-13 14:29:12 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:29:19 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.65it/s]

[NeMo I 2024-11-13 14:29:20 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:29:20 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:29:20 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:29:20 collections:741] Dataset successfully loaded with 531 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:29:20 collections:746] # 531 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 11.21it/s]

[NeMo I 2024-11-13 14:29:21 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:29:21 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json





[NeMo I 2024-11-13 14:29:21 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:29:21 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:29:21 collections:741] Dataset successfully loaded with 538 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:29:21 collections:746] # 538 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 12.26it/s]


[NeMo I 2024-11-13 14:29:22 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:29:22 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:29:22 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:29:22 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:29:22 collections:741] Dataset successfully loaded with 548 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:29:22 collections:746] # 548 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 11.49it/s]

[NeMo I 2024-11-13 14:29:23 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:29:23 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:29:23 clustering_diarizer:347] Extracting embeddings for Diarization





[NeMo I 2024-11-13 14:29:23 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:29:23 collections:741] Dataset successfully loaded with 581 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:29:23 collections:746] # 581 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 13.60it/s]


[NeMo I 2024-11-13 14:29:25 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:29:25 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:29:25 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:29:25 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:29:25 collections:741] Dataset successfully loaded with 687 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:29:25 collections:746] # 687 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 16.02it/s]


[NeMo I 2024-11-13 14:29:26 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.19it/s]

[NeMo I 2024-11-13 14:29:26 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:29:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:29:26 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:29:26 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:29:26 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:29:26 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:29:26 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:29:26 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 25.13it/s]

[NeMo I 2024-11-13 14:29:26 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:29:26 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:29:26 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:29:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:29:26 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:29:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:29:26 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:29:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:29:26 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:31:28 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:31:28 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:31:28 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:31:28 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:31:29 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:31:29 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:31:29 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:31:32 features:305] PADDING: 16
[NeMo I 2024-11-13 14:31:33 features:305] PADDING: 16




[NeMo I 2024-11-13 14:31:33 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:31:33 features:305] PADDING: 16
[NeMo I 2024-11-13 14:31:34 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:31:34 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:31:34 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:31:34 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:31:34 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:31:34 features:305] PADDING: 16
[NeMo I 2024-11-13 14:31:34 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:31:34 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:31:34 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:31:34 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:31:34 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 13.96it/s]

[NeMo I 2024-11-13 14:31:34 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:31:34 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:31:34 collections:741] Dataset successfully loaded with 13 items and total duration provided from manifest is  0.17 hours.
[NeMo I 2024-11-13 14:31:34 collections:746] # 13 files loaded accounting to # 1 labels



vad: 100%|██████████| 13/13 [00:03<00:00,  4.01it/s]

[NeMo I 2024-11-13 14:31:37 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:31:43 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.21it/s]

[NeMo I 2024-11-13 14:31:44 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:31:44 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:31:44 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:31:44 collections:741] Dataset successfully loaded with 481 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:31:44 collections:746] # 481 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  9.28it/s]


[NeMo I 2024-11-13 14:31:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:31:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:31:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:31:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:31:45 collections:741] Dataset successfully loaded with 510 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:31:45 collections:746] # 510 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 11.97it/s]


[NeMo I 2024-11-13 14:31:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:31:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:31:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:31:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:31:45 collections:741] Dataset successfully loaded with 554 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:31:45 collections:746] # 554 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 12.85it/s]

[NeMo I 2024-11-13 14:31:46 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:31:46 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:31:46 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:31:46 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:31:46 collections:741] Dataset successfully loaded with 656 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 14:31:46 collections:746] # 656 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 16.36it/s]

[NeMo I 2024-11-13 14:31:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:31:47 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json





[NeMo I 2024-11-13 14:31:47 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:31:47 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:31:47 collections:741] Dataset successfully loaded with 920 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:31:47 collections:746] # 920 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 15/15 [00:00<00:00, 17.53it/s]


[NeMo I 2024-11-13 14:31:48 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.54it/s]

[NeMo I 2024-11-13 14:31:48 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:31:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:31:48 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:31:49 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:31:49 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:31:49 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:31:49 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:31:49 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 16.14it/s]

[NeMo I 2024-11-13 14:31:49 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:31:49 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:31:49 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:31:49 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:31:49 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:31:49 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:31:49 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:31:49 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:31:49 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:35:19 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:35:19 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:35:19 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:35:19 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:35:21 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:35:21 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:35:21 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:35:21 features:305] PADDING: 16
[NeMo I 2024-11-13 14:35:22 features:305] PADDING: 16




[NeMo I 2024-11-13 14:35:23 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:35:23 features:305] PADDING: 16
[NeMo I 2024-11-13 14:35:24 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:35:24 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:35:24 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:35:24 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:35:24 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:35:24 features:305] PADDING: 16
[NeMo I 2024-11-13 14:35:24 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:35:24 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:35:24 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:35:24 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:35:24 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 10.37it/s]

[NeMo I 2024-11-13 14:35:24 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:35:24 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:35:24 collections:741] Dataset successfully loaded with 20 items and total duration provided from manifest is  0.27 hours.
[NeMo I 2024-11-13 14:35:24 collections:746] # 20 files loaded accounting to # 1 labels



vad: 100%|██████████| 20/20 [00:04<00:00,  4.56it/s]

[NeMo I 2024-11-13 14:35:29 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:35:38 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.51it/s]

[NeMo I 2024-11-13 14:35:39 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:35:39 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:35:39 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:35:39 collections:741] Dataset successfully loaded with 701 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 14:35:39 collections:746] # 701 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  9.55it/s]

[NeMo I 2024-11-13 14:35:40 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:35:40 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json





[NeMo I 2024-11-13 14:35:40 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:35:40 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:35:40 collections:741] Dataset successfully loaded with 731 items and total duration provided from manifest is  0.09 hours.
[NeMo I 2024-11-13 14:35:40 collections:746] # 731 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 12/12 [00:00<00:00, 12.50it/s]


[NeMo I 2024-11-13 14:35:41 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:35:41 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:35:41 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:35:41 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:35:41 collections:741] Dataset successfully loaded with 769 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:35:41 collections:746] # 769 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 13/13 [00:00<00:00, 13.55it/s]


[NeMo I 2024-11-13 14:35:42 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:35:42 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:35:42 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:35:42 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:35:42 collections:741] Dataset successfully loaded with 857 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:35:42 collections:746] # 857 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 14/14 [00:00<00:00, 16.14it/s]

[NeMo I 2024-11-13 14:35:43 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings





[NeMo I 2024-11-13 14:35:43 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:35:43 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:35:43 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:35:43 collections:741] Dataset successfully loaded with 1125 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 14:35:43 collections:746] # 1125 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 18/18 [00:01<00:00, 17.19it/s]


[NeMo I 2024-11-13 14:35:44 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.07it/s]

[NeMo I 2024-11-13 14:35:45 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:35:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:35:45 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:35:45 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:35:45 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:35:45 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:35:45 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:35:45 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 10.69it/s]

[NeMo I 2024-11-13 14:35:45 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:35:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:35:45 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:35:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:35:46 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:35:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:35:46 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:35:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:35:46 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:38:19 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:38:19 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:38:19 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:38:19 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:38:21 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:38:21 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:38:21 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:38:21 features:305] PADDING: 16
[NeMo I 2024-11-13 14:38:21 features:305] PADDING: 16




[NeMo I 2024-11-13 14:38:21 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:38:21 features:305] PADDING: 16
[NeMo I 2024-11-13 14:38:22 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:38:22 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:38:22 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:38:22 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:38:22 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:38:22 features:305] PADDING: 16
[NeMo I 2024-11-13 14:38:22 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:38:22 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:38:22 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:38:22 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:38:22 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 10.59it/s]

[NeMo I 2024-11-13 14:38:22 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:38:22 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:38:22 collections:741] Dataset successfully loaded with 18 items and total duration provided from manifest is  0.25 hours.
[NeMo I 2024-11-13 14:38:22 collections:746] # 18 files loaded accounting to # 1 labels



vad: 100%|██████████| 18/18 [00:04<00:00,  4.39it/s]

[NeMo I 2024-11-13 14:38:26 clustering_diarizer:254] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-11-13 14:38:35 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.60it/s]

[NeMo I 2024-11-13 14:38:36 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:38:36 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:38:36 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:38:36 collections:741] Dataset successfully loaded with 689 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:38:36 collections:746] # 689 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  9.22it/s]


[NeMo I 2024-11-13 14:38:37 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:38:37 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:38:37 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:38:37 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:38:37 collections:741] Dataset successfully loaded with 723 items and total duration provided from manifest is  0.10 hours.
[NeMo I 2024-11-13 14:38:37 collections:746] # 723 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 12/12 [00:00<00:00, 12.81it/s]

[NeMo I 2024-11-13 14:38:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:38:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json





[NeMo I 2024-11-13 14:38:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:38:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:38:38 collections:741] Dataset successfully loaded with 780 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 14:38:38 collections:746] # 780 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 13/13 [00:00<00:00, 13.22it/s]

[NeMo I 2024-11-13 14:38:39 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:38:39 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:38:39 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:38:39 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:38:39 collections:741] Dataset successfully loaded with 894 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 14:38:39 collections:746] # 894 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 14/14 [00:00<00:00, 15.84it/s]


[NeMo I 2024-11-13 14:38:40 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:38:40 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:38:40 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:38:40 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:38:40 collections:741] Dataset successfully loaded with 1200 items and total duration provided from manifest is  0.13 hours.
[NeMo I 2024-11-13 14:38:40 collections:746] # 1200 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00, 12.86it/s]


[NeMo I 2024-11-13 14:38:42 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.05it/s]

[NeMo I 2024-11-13 14:38:43 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:38:43 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:38:43 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:38:43 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:38:43 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:38:43 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:38:43 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:38:43 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 14.45it/s]

[NeMo I 2024-11-13 14:38:43 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:38:43 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:38:43 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:38:43 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:38:43 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:38:44 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:38:44 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:38:44 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:38:44 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:42:58 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:42:58 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:42:58 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:42:58 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:43:01 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:43:01 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:43:01 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:43:01 features:305] PADDING: 16
[NeMo I 2024-11-13 14:43:03 features:305] PADDING: 16




[NeMo I 2024-11-13 14:43:04 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:43:04 features:305] PADDING: 16
[NeMo I 2024-11-13 14:43:05 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:43:05 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:43:05 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:43:05 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:43:05 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:43:05 features:305] PADDING: 16
[NeMo I 2024-11-13 14:43:05 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:43:05 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:43:05 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:43:05 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:43:05 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  7.07it/s]

[NeMo I 2024-11-13 14:43:05 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:43:05 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:43:05 collections:741] Dataset successfully loaded with 25 items and total duration provided from manifest is  0.35 hours.
[NeMo I 2024-11-13 14:43:05 collections:746] # 25 files loaded accounting to # 1 labels



vad: 100%|██████████| 25/25 [00:05<00:00,  4.31it/s]

[NeMo I 2024-11-13 14:43:11 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 14:43:23 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.14it/s]

[NeMo I 2024-11-13 14:43:24 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:43:24 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:43:24 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:43:24 collections:741] Dataset successfully loaded with 657 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:43:24 collections:746] # 657 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00, 10.62it/s]

[NeMo I 2024-11-13 14:43:25 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:43:25 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:43:25 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:43:25 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:43:25 collections:741] Dataset successfully loaded with 666 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 14:43:25 collections:746] # 666 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 12.75it/s]

[NeMo I 2024-11-13 14:43:26 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:43:26 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:43:26 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:43:26 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:43:26 collections:741] Dataset successfully loaded with 685 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 14:43:26 collections:746] # 685 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 13.15it/s]

[NeMo I 2024-11-13 14:43:27 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:43:27 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:43:27 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:43:27 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:43:27 collections:741] Dataset successfully loaded with 731 items and total duration provided from manifest is  0.07 hours.
[NeMo I 2024-11-13 14:43:27 collections:746] # 731 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 12/12 [00:00<00:00, 16.42it/s]

[NeMo I 2024-11-13 14:43:28 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:43:28 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:43:28 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:43:28 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:43:28 collections:741] Dataset successfully loaded with 889 items and total duration provided from manifest is  0.08 hours.
[NeMo I 2024-11-13 14:43:28 collections:746] # 889 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 14/14 [00:00<00:00, 17.12it/s]

[NeMo I 2024-11-13 14:43:29 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.23it/s]

[NeMo I 2024-11-13 14:43:30 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:43:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:43:30 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:43:30 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:43:30 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:43:30 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:43:30 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:43:30 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 15.30it/s]

[NeMo I 2024-11-13 14:43:30 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:43:30 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:43:30 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:43:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:43:30 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:43:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:43:30 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:43:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:43:30 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:47:48 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:47:48 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:47:48 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:47:48 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:47:50 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:47:50 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:47:50 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:47:50 features:305] PADDING: 16
[NeMo I 2024-11-13 14:47:50 features:305] PADDING: 16




[NeMo I 2024-11-13 14:47:51 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:47:51 features:305] PADDING: 16
[NeMo I 2024-11-13 14:47:52 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:47:52 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:47:52 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:47:52 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:47:52 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:47:52 features:305] PADDING: 16
[NeMo I 2024-11-13 14:47:52 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:47:52 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:47:52 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:47:52 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:47:52 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  6.75it/s]

[NeMo I 2024-11-13 14:47:52 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:47:52 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:47:52 collections:741] Dataset successfully loaded with 28 items and total duration provided from manifest is  0.38 hours.
[NeMo I 2024-11-13 14:47:52 collections:746] # 28 files loaded accounting to # 1 labels



vad: 100%|██████████| 28/28 [00:06<00:00,  4.17it/s]

[NeMo I 2024-11-13 14:47:59 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 14:48:12 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.05it/s]

[NeMo I 2024-11-13 14:48:13 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:48:13 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:48:13 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:48:13 collections:741] Dataset successfully loaded with 930 items and total duration provided from manifest is  0.15 hours.
[NeMo I 2024-11-13 14:48:13 collections:746] # 930 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00,  9.28it/s]


[NeMo I 2024-11-13 14:48:14 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:48:14 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:48:14 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:48:15 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:48:15 collections:741] Dataset successfully loaded with 988 items and total duration provided from manifest is  0.15 hours.
[NeMo I 2024-11-13 14:48:15 collections:746] # 988 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 16/16 [00:01<00:00, 12.25it/s]


[NeMo I 2024-11-13 14:48:16 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:48:16 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:48:16 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:48:16 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:48:16 collections:741] Dataset successfully loaded with 1082 items and total duration provided from manifest is  0.16 hours.
[NeMo I 2024-11-13 14:48:16 collections:746] # 1082 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 17/17 [00:01<00:00, 12.82it/s]

[NeMo I 2024-11-13 14:48:17 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:48:17 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:48:17 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:48:17 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:48:17 collections:741] Dataset successfully loaded with 1268 items and total duration provided from manifest is  0.17 hours.
[NeMo I 2024-11-13 14:48:17 collections:746] # 1268 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 20/20 [00:01<00:00, 12.39it/s]


[NeMo I 2024-11-13 14:48:19 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:48:19 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:48:20 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:48:20 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:48:20 collections:741] Dataset successfully loaded with 1757 items and total duration provided from manifest is  0.19 hours.
[NeMo I 2024-11-13 14:48:20 collections:746] # 1757 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 28/28 [00:01<00:00, 15.91it/s]


[NeMo I 2024-11-13 14:48:22 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.83it/s]

[NeMo I 2024-11-13 14:48:23 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:48:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:48:23 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:48:23 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:48:23 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:48:23 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:48:23 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:48:23 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 10.62it/s]

[NeMo I 2024-11-13 14:48:23 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:48:23 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:48:23 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:48:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:48:23 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:48:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:48:23 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:48:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:48:23 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:51:03 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:51:03 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:51:03 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:51:03 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:51:05 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:51:05 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:51:05 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:51:05 features:305] PADDING: 16
[NeMo I 2024-11-13 14:51:05 features:305] PADDING: 16




[NeMo I 2024-11-13 14:51:09 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:51:09 features:305] PADDING: 16
[NeMo I 2024-11-13 14:51:09 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:51:09 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:51:09 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:51:09 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:51:09 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:51:09 features:305] PADDING: 16
[NeMo I 2024-11-13 14:51:09 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:51:09 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:51:09 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:51:09 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:51:09 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  8.95it/s]

[NeMo I 2024-11-13 14:51:10 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:51:10 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:51:10 collections:741] Dataset successfully loaded with 18 items and total duration provided from manifest is  0.24 hours.
[NeMo I 2024-11-13 14:51:10 collections:746] # 18 files loaded accounting to # 1 labels



vad: 100%|██████████| 18/18 [00:04<00:00,  4.17it/s]

[NeMo I 2024-11-13 14:51:14 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 14:51:22 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.60it/s]

[NeMo I 2024-11-13 14:51:23 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:51:23 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:51:23 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:51:23 collections:741] Dataset successfully loaded with 701 items and total duration provided from manifest is  0.15 hours.
[NeMo I 2024-11-13 14:51:23 collections:746] # 701 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  8.66it/s]

[NeMo I 2024-11-13 14:51:24 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:51:24 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:51:24 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:51:24 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:51:24 collections:741] Dataset successfully loaded with 791 items and total duration provided from manifest is  0.16 hours.
[NeMo I 2024-11-13 14:51:24 collections:746] # 791 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00, 12.36it/s]

[NeMo I 2024-11-13 14:51:25 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:51:25 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:51:25 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:51:25 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:51:25 collections:741] Dataset successfully loaded with 915 items and total duration provided from manifest is  0.17 hours.
[NeMo I 2024-11-13 14:51:25 collections:746] # 915 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00, 12.69it/s]

[NeMo I 2024-11-13 14:51:26 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:51:26 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:51:26 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:51:26 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:51:26 collections:741] Dataset successfully loaded with 1138 items and total duration provided from manifest is  0.19 hours.
[NeMo I 2024-11-13 14:51:26 collections:746] # 1138 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 18/18 [00:01<00:00, 14.03it/s]


[NeMo I 2024-11-13 14:51:28 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:51:28 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:51:28 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:51:28 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:51:28 collections:741] Dataset successfully loaded with 1666 items and total duration provided from manifest is  0.20 hours.
[NeMo I 2024-11-13 14:51:28 collections:746] # 1666 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 27/27 [00:01<00:00, 13.91it/s]


[NeMo I 2024-11-13 14:51:30 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.52it/s]

[NeMo I 2024-11-13 14:51:31 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:51:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:51:31 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:51:31 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:51:31 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:51:31 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:51:31 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:51:31 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00,  9.50it/s]

[NeMo I 2024-11-13 14:51:31 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:51:31 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:51:31 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:51:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:51:32 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:51:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:51:32 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:51:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:51:32 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:52:30 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:52:30 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:52:30 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:52:30 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:52:32 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:52:32 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:52:32 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:52:32 features:305] PADDING: 16
[NeMo I 2024-11-13 14:52:33 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 14:52:34 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:52:34 features:305] PADDING: 16
[NeMo I 2024-11-13 14:52:35 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:52:35 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:52:35 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:52:35 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:52:35 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:52:35 features:305] PADDING: 16
[NeMo I 2024-11-13 14:52:35 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:52:35 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:52:35 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:52:35 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:52:35 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 45.55it/s]


[NeMo I 2024-11-13 14:52:35 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:52:35 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:52:35 collections:741] Dataset successfully loaded with 4 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:52:35 collections:746] # 4 files loaded accounting to # 1 labels


      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 4/4 [00:00<00:00,  5.09it/s]

[NeMo I 2024-11-13 14:52:36 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 14:52:38 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  7.10it/s]

[NeMo I 2024-11-13 14:52:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:52:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:52:38 collections:741] Dataset successfully loaded with 107 items and total duration provided from manifest is  0.01 hours.
[NeMo I 2024-11-13 14:52:38 collections:746] # 107 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.69it/s]

[NeMo I 2024-11-13 14:52:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:52:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:52:38 collections:741] Dataset successfully loaded with 111 items and total duration provided from manifest is  0.02 hours.
[NeMo I 2024-11-13 14:52:38 collections:746] # 111 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 11.96it/s]

[NeMo I 2024-11-13 14:52:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:52:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:52:38 collections:741] Dataset successfully loaded with 119 items and total duration provided from manifest is  0.02 hours.
[NeMo I 2024-11-13 14:52:38 collections:746] # 119 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 12.60it/s]

[NeMo I 2024-11-13 14:52:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:52:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:52:38 collections:741] Dataset successfully loaded with 136 items and total duration provided from manifest is  0.02 hours.
[NeMo I 2024-11-13 14:52:38 collections:746] # 136 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 18.53it/s]

[NeMo I 2024-11-13 14:52:38 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:52:38 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:52:38 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:52:38 collections:741] Dataset successfully loaded with 183 items and total duration provided from manifest is  0.02 hours.
[NeMo I 2024-11-13 14:52:38 collections:746] # 183 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 14.30it/s]

[NeMo I 2024-11-13 14:52:39 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.63it/s]

[NeMo I 2024-11-13 14:52:39 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:52:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:52:39 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:52:39 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:52:39 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:52:39 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:52:39 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:52:39 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00, 30.33it/s]

[NeMo I 2024-11-13 14:52:39 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:52:39 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:52:39 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:52:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:52:39 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:52:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:52:39 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:52:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:52:39 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:55:24 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:55:24 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:55:24 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:55:24 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:55:26 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:55:26 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:55:26 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:55:26 features:305] PADDING: 16
[NeMo I 2024-11-13 14:55:27 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 14:55:28 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:55:28 features:305] PADDING: 16
[NeMo I 2024-11-13 14:55:29 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:55:29 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:55:29 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:55:29 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:55:29 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:55:29 features:305] PADDING: 16
[NeMo I 2024-11-13 14:55:29 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:55:29 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:55:29 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:55:29 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:55:29 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  9.51it/s]

[NeMo I 2024-11-13 14:55:29 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:55:29 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:55:29 collections:741] Dataset successfully loaded with 20 items and total duration provided from manifest is  0.28 hours.
[NeMo I 2024-11-13 14:55:29 collections:746] # 20 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 20/20 [00:04<00:00,  4.42it/s]

[NeMo I 2024-11-13 14:55:34 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 14:55:43 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]

[NeMo I 2024-11-13 14:55:44 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:55:44 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:55:44 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:55:44 collections:741] Dataset successfully loaded with 506 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:55:44 collections:746] # 506 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  9.60it/s]

[NeMo I 2024-11-13 14:55:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:55:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:55:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:55:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:55:45 collections:741] Dataset successfully loaded with 511 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:55:45 collections:746] # 511 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00, 12.92it/s]


[NeMo I 2024-11-13 14:55:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:55:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:55:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:55:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:55:45 collections:741] Dataset successfully loaded with 522 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:55:45 collections:746] # 522 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 13.75it/s]

[NeMo I 2024-11-13 14:55:46 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:55:46 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:55:46 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:55:46 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:55:46 collections:741] Dataset successfully loaded with 555 items and total duration provided from manifest is  0.05 hours.
[NeMo I 2024-11-13 14:55:46 collections:746] # 555 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 16.24it/s]


[NeMo I 2024-11-13 14:55:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:55:47 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:55:47 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:55:47 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:55:47 collections:741] Dataset successfully loaded with 659 items and total duration provided from manifest is  0.06 hours.
[NeMo I 2024-11-13 14:55:47 collections:746] # 659 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 17.79it/s]


[NeMo I 2024-11-13 14:55:48 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.29it/s]

[NeMo I 2024-11-13 14:55:48 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:55:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:55:48 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:55:48 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:55:48 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:55:48 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:55:48 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:55:48 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

      with autocast():
    
100%|██████████| 1/1 [00:00<00:00, 23.39it/s]

[NeMo I 2024-11-13 14:55:48 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:55:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:55:48 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:55:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:55:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:55:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:55:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:55:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:55:48 msdd_models:1435]   
    




[NeMo I 2024-11-13 14:58:23 msdd_models:1097] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-11-13 14:58:23 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:58:23 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-11-13 14:58:23 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:58:25 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-11-13 14:58:25 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-11-13 14:58:25 modelPT:189] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-11-13 14:58:25 features:305] PADDING: 16
[NeMo I 2024-11-13 14:58:25 features:305] PADDING: 16


      return torch.load(model_weights, map_location='cpu')
    


[NeMo I 2024-11-13 14:58:29 save_restore_connector:272] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-11-13 14:58:29 features:305] PADDING: 16
[NeMo I 2024-11-13 14:58:29 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-11-13 14:58:29 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:58:29 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-11-13 14:58:29 common:826] Instantiating model from pre-trained checkpoint


[NeMo W 2024-11-13 14:58:30 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-11-13 14:58:30 features:305] PADDING: 16
[NeMo I 2024-11-13 14:58:30 save_restore_connector:272] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_2.0.0rc1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-11-13 14:58:30 msdd_models:870] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-11-13 14:58:30 msdd_models:871] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-11-13 14:58:30 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:58:30 clustering_diarizer:313] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 12.74it/s]

[NeMo I 2024-11-13 14:58:30 classification_models:293] Perform streaming frame-level VAD
[NeMo I 2024-11-13 14:58:30 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:58:30 collections:741] Dataset successfully loaded with 17 items and total duration provided from manifest is  0.23 hours.
[NeMo I 2024-11-13 14:58:30 collections:746] # 17 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
vad: 100%|██████████| 17/17 [00:03<00:00,  4.34it/s]

[NeMo I 2024-11-13 14:58:34 clustering_diarizer:254] Generating predictions with overlapping input segments

[NeMo I 2024-11-13 14:58:41 clustering_diarizer:266] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  1.68it/s]

[NeMo I 2024-11-13 14:58:42 clustering_diarizer:291] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-11-13 14:58:42 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:58:42 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:58:42 collections:741] Dataset successfully loaded with 661 items and total duration provided from manifest is  0.11 hours.
[NeMo I 2024-11-13 14:58:42 collections:746] # 661 files loaded accounting to # 1 labels



      with autocast():
    
      with torch.cuda.amp.autocast(enabled=False):
    
[1/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  9.38it/s]

[NeMo I 2024-11-13 14:58:43 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings





[NeMo I 2024-11-13 14:58:43 clustering_diarizer:291] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-11-13 14:58:43 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:58:43 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:58:43 collections:741] Dataset successfully loaded with 704 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 14:58:43 collections:746] # 704 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 11/11 [00:00<00:00, 11.89it/s]

[NeMo I 2024-11-13 14:58:44 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:58:44 clustering_diarizer:291] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-11-13 14:58:44 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:58:44 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:58:44 collections:741] Dataset successfully loaded with 780 items and total duration provided from manifest is  0.12 hours.
[NeMo I 2024-11-13 14:58:44 collections:746] # 780 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 13/13 [00:00<00:00, 13.50it/s]


[NeMo I 2024-11-13 14:58:45 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:58:45 clustering_diarizer:291] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-11-13 14:58:45 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:58:45 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:58:45 collections:741] Dataset successfully loaded with 926 items and total duration provided from manifest is  0.13 hours.
[NeMo I 2024-11-13 14:58:45 collections:746] # 926 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00, 14.63it/s]


[NeMo I 2024-11-13 14:58:47 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-11-13 14:58:47 clustering_diarizer:291] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-11-13 14:58:47 clustering_diarizer:347] Extracting embeddings for Diarization
[NeMo I 2024-11-13 14:58:47 collections:740] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-11-13 14:58:47 collections:741] Dataset successfully loaded with 1281 items and total duration provided from manifest is  0.14 hours.
[NeMo I 2024-11-13 14:58:47 collections:746] # 1281 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 21/21 [00:01<00:00, 13.37it/s]


[NeMo I 2024-11-13 14:58:49 clustering_diarizer:393] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]

[NeMo I 2024-11-13 14:58:50 clustering_diarizer:461] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-11-13 14:58:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:58:50 msdd_models:966] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-11-13 14:58:50 msdd_models:966] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-11-13 14:58:50 msdd_models:966] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-11-13 14:58:50 msdd_models:966] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-11-13 14:58:50 msdd_models:966] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-11-13 14:58:50 msdd_models:944] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 11.80it/s]

[NeMo I 2024-11-13 14:58:50 msdd_models:1407]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-11-13 14:58:50 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-11-13 14:58:50 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-11-13 14:58:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:58:51 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:58:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:58:51 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-11-13 14:58:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-11-13 14:58:51 msdd_models:1435]   
    




## Aligning the transcription with the original audio using Forced Alignment
---
Forced alignment aligns the transcription segments with the original audio signal contained in the vocal_target file. It finds the timestamps in the audio at which each segment was spoken and aligns the text accordingly.

By combining the outputs of the transcription model and the alignment model, the code produces a fully aligned transcription of the speech contained in the vocal_target file. This aligned transcription is useful for downstream speech processing tasks such as speaker diarization, sentiment analysis, and language identification.
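
To make the idea concrete, here is a toy sketch of forced alignment as a dynamic program. It is not the implementation used by the alignment library in this notebook: it assumes a plain matrix of per-frame log-probabilities, ignores CTC blank tokens, and the name `force_align` is hypothetical. Each token is assigned a contiguous run of frames so that the total log-probability is maximal.

```python
import math


def force_align(log_probs, tokens):
    """Toy monotonic forced alignment (no CTC blanks; illustrative only).

    log_probs[t][k] is the log-probability of token id k at frame t.
    tokens is the known transcript as a list of token ids, in order.
    Returns (token_id, start_frame, end_frame) spans, end exclusive.
    """
    n_frames, n_tokens = len(log_probs), len(tokens)
    if n_tokens == 0 or n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # score[t][j]: best log-probability of a path that is on token j at frame t
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    # advanced[t][j]: True if that best path moved from token j-1 to j at frame t
    advanced = [[False] * n_tokens for _ in range(n_frames)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, n_frames):
        for j in range(min(t + 1, n_tokens)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best, advanced[t][j] = move, True
            else:
                best = stay
            score[t][j] = best + log_probs[t][tokens[j]]

    # Backtrack from the last token at the last frame.
    frame_to_token = [0] * n_frames
    j = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        frame_to_token[t] = j
        if advanced[t][j]:
            j -= 1

    # Collapse consecutive frames with the same token position into spans.
    spans, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or frame_to_token[t] != frame_to_token[start]:
            spans.append((tokens[frame_to_token[start]], start, t))
            start = t
    return spans


# Two tokens over five frames: frames 0-1 favour token 0, frames 2-4 token 1.
probs = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
example_log_probs = [[math.log(p) for p in row] for row in probs]
print(force_align(example_log_probs, [0, 1]))  # [(0, 0, 2), (1, 2, 5)]
```

Multiplying the frame indices by the frame duration of the acoustic model (a model-specific value, not given here) turns each span into start and end timestamps.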

## Convert audio to mono for NeMo compatibility
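
As an illustration of what the down-mix involves, the sketch below averages the interleaved channels of a 16-bit PCM WAV file using only the standard library. It is a minimal stand-in, not the notebook's own cell: the function name `wav_to_mono` and the file paths are hypothetical, and only 16-bit PCM input is handled.

```python
import array
import sys
import wave


def wav_to_mono(src_path, dst_path):
    """Average all channels of a 16-bit PCM WAV file into one mono channel.

    Returns the number of mono samples written. Illustrative sketch only.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved (L, R, L, R, ...); average each frame.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
    return len(mono)
```

With a tensor of shape `(channels, samples)`, the same down-mix is a single call: `waveform.mean(dim=0, keepdim=True)`.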

## Speaker Diarization using NeMo MSDD Model
---
This code uses the NVIDIA NeMo MSDD (Multi-scale Diarization Decoder) model to perform speaker diarization on an audio signal. Speaker diarization is the process of splitting an audio signal into segments according to who is speaking at any given time.
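
Diarization results are commonly exchanged as RTTM files, in which each `SPEAKER` line carries a start time, a duration, and a speaker label. The sketch below parses such text into speaker turns in milliseconds. It assumes the standard whitespace-separated RTTM field order; the function name `parse_rttm` and the sample lines are made up for illustration and are not output from this notebook's run.

```python
def parse_rttm(text):
    """Parse RTTM text into sorted (start_ms, end_ms, speaker) turns.

    Assumes the standard field order:
    SPEAKER <file> <channel> <start> <duration> <NA> <NA> <speaker> <NA> <NA>
    """
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and other record types
        start_ms = round(float(fields[3]) * 1000)
        end_ms = start_ms + round(float(fields[4]) * 1000)
        turns.append((start_ms, end_ms, fields[7]))
    return sorted(turns)


# Hypothetical RTTM content, for illustration only.
sample = """\
SPEAKER mono_file 1 2.750 1.500 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER mono_file 1 0.500 2.250 <NA> <NA> speaker_0 <NA> <NA>
"""
for turn in parse_rttm(sample):
    print(turn)
```

Converting to integer milliseconds up front avoids floating-point comparisons when the turns are later matched against word timestamps.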

## Mapping Speakers to Sentences According to Timestamps
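
One simple way to do this mapping is to anchor each word at its start time and pick the speaker turn that covers that instant. The sketch below follows that approach under stated assumptions: words and turns are sorted by start time, times are in milliseconds, and the name `assign_speakers` and the data layout are hypothetical rather than taken from the notebook.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn covers the word's start.

    words: list of dicts with 'word', 'start', 'end' (ms), sorted by start.
    turns: list of (start_ms, end_ms, speaker), sorted by start.
    A word that starts in a gap between turns goes to the next turn;
    a word after the final turn goes to the final turn.
    """
    if not turns:
        raise ValueError("at least one speaker turn is required")

    labeled = []
    turn_idx = 0
    for word in words:
        anchor = word["start"]
        # Move forward while the anchor lies past the end of the current turn.
        while turn_idx < len(turns) - 1 and anchor >= turns[turn_idx][1]:
            turn_idx += 1
        labeled.append({**word, "speaker": turns[turn_idx][2]})
    return labeled


turns = [(0, 1000, "speaker_0"), (1000, 2500, "speaker_1")]
words = [
    {"word": "hello", "start": 100, "end": 400},
    {"word": "there", "start": 500, "end": 900},
    {"word": "hi", "start": 1100, "end": 1300},
]
for item in assign_speakers(words, turns):
    print(item["speaker"], item["word"])
```

Because both lists are sorted, the turn index only moves forward, so the mapping runs in a single pass. Anchoring at the word midpoint or end instead of the start is an equally plausible choice.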

## Realigning Speech Segments using Punctuation
---

This code disambiguates speaker labels when a sentence is split between two different speakers. It uses punctuation marks to find sentence boundaries and then determines the dominant speaker for each sentence in the transcription.

```
Speaker A: It's got to come from somewhere else. Yeah, that one's also fun because you know the lows are
Speaker B: going to suck, right? So it's actually it hits you on both sides.
```

For example, if a sentence is split between two speakers, the code takes the mode of the per-word speaker labels in that sentence and uses that label for the whole sentence. This can improve the accuracy of speaker diarization, especially because the Whisper model may omit brief utterances such as "hmm" and "yeah" that the diarization model (NeMo) picks up, which leads to inconsistent results.

The code also handles cases where one speaker is giving a monologue while other speakers make occasional comments in the background. It ignores the comments and assigns the entire monologue to the speaker who is speaking the majority of the time. Realigning on punctuation in this way keeps each sentence attributed to a single speaker.
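
The majority-vote idea described above can be sketched in a few lines. This is a simplified illustration, not the notebook's realignment routine: it assumes words arrive as `(text, speaker)` pairs, treats `.`, `?`, and `!` as the only sentence boundaries, and uses the hypothetical name `realign_by_sentence`.

```python
from collections import Counter

SENTENCE_END = ".?!"


def _relabel(sentence):
    """Give every word in the sentence the most frequent speaker label."""
    dominant = Counter(speaker for _, speaker in sentence).most_common(1)[0][0]
    return [(text, dominant) for text, _ in sentence]


def realign_by_sentence(words):
    """words: list of (text, speaker) pairs in spoken order.

    Splits the stream at sentence-ending punctuation and assigns each
    sentence to the speaker who owns most of its words.
    """
    result, sentence = [], []
    for text, speaker in words:
        sentence.append((text, speaker))
        if text and text[-1] in SENTENCE_END:
            result.extend(_relabel(sentence))
            sentence = []
    if sentence:  # trailing words without closing punctuation
        result.extend(_relabel(sentence))
    return result


words = [
    ("Yeah,", "A"), ("the", "A"), ("lows", "A"),
    ("suck,", "B"), ("right?", "B"),
    ("So", "B"), ("it", "B"), ("hits.", "B"),
]
for text, speaker in realign_by_sentence(words):
    print(speaker, text)
```

In this example the first sentence has three words labeled A and two labeled B, so all five words are assigned to A, while the second sentence stays with B.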

## Cleanup and Exporting the Results
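
As one possible export format, speaker-labeled sentences can be written as SubRip (SRT) subtitles. The sketch below is an assumption about how such an export might look, not a copy of the notebook's export cell: the names `format_srt_timestamp` and `to_srt` and the sentence dictionary layout are hypothetical.

```python
def format_srt_timestamp(ms):
    """Format integer milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def to_srt(sentences):
    """Render sentences as SRT text.

    sentences: list of dicts with 'start' and 'end' (ms), 'speaker', 'text'.
    """
    blocks = []
    for index, s in enumerate(sentences, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(s['start'])} --> "
            f"{format_srt_timestamp(s['end'])}\n"
            f"{s['speaker']}: {s['text']}\n"
        )
    return "\n".join(blocks)  # blank line between consecutive entries


sentences = [
    {"start": 0, "end": 1500, "speaker": "Speaker 0", "text": "Hello there."},
    {"start": 1500, "end": 3200, "speaker": "Speaker 1", "text": "Hi!"},
]
print(to_srt(sentences))
```

The resulting string can be written to a file with `open(path, "w", encoding="utf-8")`. Temporary files such as the `temp_outputs` directory seen in the logs can then be removed with `shutil.rmtree`.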