<a href="https://colab.research.google.com/github/kamilakesbi/notebooks/blob/main/synthetic_pipeline_diarizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤗 Generate synthetic speaker diarization data with Diarizers


Speaker diarization systems often require large amounts of annotated multi-speaker data to train, but such data is limited in practice.

To overcome the lack of speaker diarization datasets, recent research has considered training speaker diarization systems on simulated datasets, where audio segments of individual speakers, taken from ASR datasets, are concatenated to form artificial multi-speaker meetings.

We release a synthetic data generation pipeline for speaker diarization that is compatible with Diarizers, our library for fine-tuning speaker diarization models.

This pipeline involves several steps:

- First, choose an initial ASR dataset to start from. In this notebook, we use the Japanese (`ja`) subset of the `mozilla-foundation/common_voice_17_0` dataset. In practice, any other ASR dataset from the Hub (with single-speaker audio segments and associated speaker ids) can be used.

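- Second, define the properties of the synthetic meetings to generate (number of speakers and segments per meeting, overlap, silence, denoising).

- Third, generate a large synthetic dataset and push it to the Hub.

- Finally, train a model on the synthetic dataset.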

## Installation

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed May 29 11:45:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   49C    P8              12W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install --upgrade --quiet git+https://github.com/kamilakesbi/diarizers.git


In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Start from any ASR dataset:

The starting point of the pipeline is an [Automatic Speech Recognition dataset](https://huggingface.co/datasets?task_categories=task_categories:automatic-speech-recognition&sort=trending) chosen from the Hugging Face Hub.

The dataset needs to contain audio segments from single speakers, together with the corresponding speaker ids.

Several datasets meet these requirements. Here, we start from the Japanese (`ja`) subset of the `mozilla-foundation/common_voice_17_0` dataset.

Let's first load an example from this dataset:


In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('mozilla-foundation/common_voice_17_0', 'ja', split='validated', streaming=True)

In [None]:
dataset

IterableDataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
    n_shards: 3
})

In [None]:
from IPython.display import Audio, display

example = next(iter(dataset))

print('Speaker id: ', example['client_id'])

print('Audio: ')
display(Audio(example['audio']['array'], rate=example['audio']['sampling_rate']))

Reading metadata...: 93022it [00:03, 29835.64it/s]


Speaker id:  004a974ec0e77c3c846ac5d7dbef70cdd6329682f98c2ee7c14fd9a333a683d5f433d9beda70f415c5adabb80b220254b74b4eb6ab058b7b067c51ca8cb96c8a
Audio: 


In [None]:
from diarizers import SyntheticDataset, SyntheticDatasetConfig

In [None]:
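synthetic_dataset_config = SyntheticDatasetConfig(
    dataset_name='mozilla-foundation/common_voice_17_0',
    # Compared with the `load_dataset` call above, the names are swapped here:
    # `subset` takes the split ('validated') and `split` the language config ('ja').
    subset='validated',
    split='ja',
    min_samples_per_speaker=10,
    audio_column_name="audio",
    speaker_column_name="client_id",   # column holding the speaker ids
    nb_speakers_from_dataset=200,      # number of speakers kept for sampling
    sample_rate=16000,
)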

In [None]:
synthetic_dataset = SyntheticDataset(synthetic_dataset_config)

nb speakers in dataset to keep: 200


Filter (num_proc=2):   0%|          | 0/93022 [00:00<?, ? examples/s]

100%|██████████| 200/200 [00:00<00:00, 181965.47it/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip


In [None]:
print('Speakers that will be used for sampling: ', len(synthetic_dataset.speakers_to_sample_from))

Speakers that will be used for sampling:  200
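To build intuition for `min_samples_per_speaker` and `nb_speakers_from_dataset`, here is a minimal sketch of one way such a speaker selection could work. It is not the Diarizers implementation: the helper `select_speakers`, the toy speaker ids, and the choice to keep speakers in order of first appearance are all assumptions made for illustration.

```python
from collections import Counter


def select_speakers(speaker_ids, min_samples_per_speaker, nb_speakers):
    """Return up to `nb_speakers` ids that have at least `min_samples_per_speaker` utterances.

    Hypothetical helper: speakers are kept in order of first appearance.
    """
    counts = Counter(speaker_ids)
    eligible = [spk for spk, n in counts.items() if n >= min_samples_per_speaker]
    return eligible[:nb_speakers]


# Toy speaker column: one id per utterance (values invented for illustration).
toy_ids = ["a"] * 12 + ["b"] * 3 + ["c"] * 10 + ["d"] * 25

selected = select_speakers(toy_ids, min_samples_per_speaker=10, nb_speakers=2)
print(selected)  # ['a', 'c']
```

Speaker `b` is dropped because it has fewer than 10 utterances, and `d` is dropped only because the cap of 2 speakers is already reached.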


## Define the properties of the synthetic meetings to generate:

- Generate meetings with the base settings (no overlap, no silence):

In [58]:
synthetic_dataset_config = SyntheticDatasetConfig(
    dataset_name='mozilla-foundation/common_voice_17_0',
    subset='validated',
    split='ja',
    min_samples_per_speaker=10,
    audio_column_name = "audio",
    speaker_column_name = "client_id",
    nb_speakers_from_dataset=5,
    num_meetings=2,
    nb_speakers_per_meeting=3,
    segments_per_meeting=16,
    overlap_proba=0,
    silence_proba=0,
)

In [59]:
synthetic_dataset = SyntheticDataset(synthetic_dataset_config).generate()

nb speakers in dataset to keep: 5


100%|██████████| 5/5 [00:00<00:00, 11902.11it/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
100%|██████████| 2/2 [00:00<00:00, 76.75it/s]


Map (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]

In [49]:
synthetic_dataset

Dataset({
    features: ['audio', 'speakers', 'timestamps_start', 'timestamps_end'],
    num_rows: 2
})
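Each row of the generated dataset carries `speakers`, `timestamps_start` and `timestamps_end` columns. The sketch below shows one way to read them as a list of segments. It runs on a hand-written row so that it is self-contained; the values are invented, and treating the three columns as aligned lists with times in seconds is an assumption, not something the output above confirms.

```python
def list_segments(example):
    """Pair each speaker label with its start/end time, sorted by start time."""
    segments = zip(
        example["speakers"],
        example["timestamps_start"],
        example["timestamps_end"],
    )
    return sorted(segments, key=lambda seg: seg[1])


# Hypothetical row using the same column names as the generated dataset.
toy_example = {
    "speakers": ["spk_b", "spk_a", "spk_a"],
    "timestamps_start": [2.5, 0.0, 6.0],
    "timestamps_end": [5.0, 2.5, 8.0],
}

for speaker, start, end in list_segments(toy_example):
    print(f"{speaker}: {start:.2f} -> {end:.2f}")
```

If the assumption holds, calling `list_segments(synthetic_dataset[0])` would give the same view for a generated meeting.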

In [50]:
from IPython.display import Audio, display

example = synthetic_dataset[0]

display(Audio(example['audio']['array'], rate=example['audio']['sampling_rate']))

In [52]:
print('Number of speakers in generated meeting: ', len(set(example['speakers'])))

Number of speakers in generated meeting:  3
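With `overlap_proba=0` and `silence_proba=0`, a meeting is conceptually a plain concatenation of single-speaker segments. The following toy sketch illustrates that idea with NumPy and random noise in place of real speech. The function `concatenate_segments` is hypothetical and only mirrors the column names of the generated dataset; it does not reproduce what Diarizers does internally (for example, the voice activity detection step hinted at by the Silero VAD download in the logs).

```python
import numpy as np


def concatenate_segments(segments, sample_rate):
    """Concatenate (speaker, waveform) pairs and record each span in seconds."""
    audio_parts, speakers, starts, ends = [], [], [], []
    cursor = 0  # current position in samples
    for speaker, waveform in segments:
        audio_parts.append(waveform)
        speakers.append(speaker)
        starts.append(cursor / sample_rate)
        cursor += len(waveform)
        ends.append(cursor / sample_rate)
    return {
        "audio": np.concatenate(audio_parts),
        "speakers": speakers,
        "timestamps_start": starts,
        "timestamps_end": ends,
    }


sample_rate = 16000
rng = np.random.default_rng(0)
# Noise stands in for speech: 1.0 s, 0.5 s and 2.0 s segments.
toy_segments = [
    ("spk_a", rng.standard_normal(sample_rate).astype(np.float32)),
    ("spk_b", rng.standard_normal(sample_rate // 2).astype(np.float32)),
    ("spk_a", rng.standard_normal(2 * sample_rate).astype(np.float32)),
]

meeting = concatenate_segments(toy_segments, sample_rate)
print(meeting["timestamps_start"])  # [0.0, 1.0, 1.5]
print(meeting["timestamps_end"])    # [1.0, 1.5, 3.5]
```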


- Add silence and overlap:

In [60]:
synthetic_dataset_config.overlap_proba = 0.3
synthetic_dataset_config.overlap_length = 3

synthetic_dataset_config.silence_proba = 1
synthetic_dataset_config.silence_duration = 1

In [61]:
synthetic_dataset = SyntheticDataset(synthetic_dataset_config).generate()

nb speakers in dataset to keep: 5


100%|██████████| 5/5 [00:00<00:00, 9808.94it/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
100%|██████████| 2/2 [00:00<00:00, 89.20it/s]


Map (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]

In [62]:
from IPython.display import Audio, display

example = synthetic_dataset[0]

display(Audio(example['audio']['array'], rate=example['audio']['sampling_rate']))
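To see how overlap and silence settings can change segment boundaries, here is a small sketch that places segments on a timeline. It assumes one plausible reading of the four parameters: with probability `overlap_proba` a segment starts up to `overlap_length` seconds before the latest end time so far, and otherwise with probability `silence_proba` it starts after a gap of up to `silence_duration` seconds. The function `place_segments` is hypothetical, and the actual Diarizers behaviour may differ.

```python
import random


def place_segments(durations, overlap_proba, overlap_length,
                   silence_proba, silence_duration, seed=0):
    """Assign start/end times (seconds) to segments with random overlaps or gaps."""
    rng = random.Random(seed)
    starts, ends = [], []
    cursor = 0.0  # latest end time seen so far
    for duration in durations:
        start = cursor
        if starts and rng.random() < overlap_proba:
            # Start before the latest end, but never before the previous start.
            start = max(starts[-1], cursor - rng.uniform(0, overlap_length))
        elif starts and rng.random() < silence_proba:
            # Leave a silent gap before this segment.
            start = cursor + rng.uniform(0, silence_duration)
        starts.append(start)
        ends.append(start + duration)
        cursor = max(cursor, start + duration)
    return starts, ends


# Same values as in the cell above; segment durations are invented.
starts, ends = place_segments(
    durations=[2.0, 1.0, 3.0, 1.5],
    overlap_proba=0.3, overlap_length=3,
    silence_proba=1, silence_duration=1,
)
for start, end in zip(starts, ends):
    print(f"{start:.2f} -> {end:.2f}")
```

Setting both probabilities to 0 in this sketch recovers plain back-to-back concatenation.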

- Denoise:

In [63]:
synthetic_dataset_config.denoise = True

In [65]:
synthetic_dataset = SyntheticDataset(synthetic_dataset_config).generate()
example = synthetic_dataset[0]
display(Audio(example['audio']['array'], rate=example['audio']['sampling_rate']))

nb speakers in dataset to keep: 5


100%|██████████| 5/5 [00:00<00:00, 10832.40it/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
100%|██████████| 2/2 [00:00<00:00, 86.02it/s]


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

You can experiment with the other hyperparameters as well:

- ...

## Generate a large synthetic dataset and push it to the Hub:

In [None]:
synthetic_dataset_config = SyntheticDatasetConfig(
    dataset_name =  "mozilla-foundation/common_voice_17_0",
    subset = "validated",
    split = "ja",
    speaker_column_name = "client_id",
    audio_column_name = "audio",
    min_samples_per_speaker = 10,
    nb_speakers_from_dataset = 200,
    sample_rate  = 16000,
    num_meetings = 800,
    nb_speakers_per_meeting = 3,
    segments_per_meeting = 16,
    normalize = True,
    overlap_proba = 0.3,
    overlap_length = 3,
    random_gain = False,
    add_silence = True,
    silence_duration = 3,
    silence_proba = 1,  # probability of inserting silence (expected in [0, 1])
    denoise = False,
    num_proc = 2
)

synthetic_dataset = SyntheticDataset(synthetic_dataset_config).generate()

nb speakers in dataset to keep: 200


Filter (num_proc=2):   0%|          | 0/93022 [00:00<?, ? examples/s]

100%|██████████| 200/200 [00:00<00:00, 167470.71it/s]
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /root/.cache/torch/hub/master.zip
 90%|████████▉ | 717/800 [07:21<01:38,  1.19s/it]
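In the small run above, the `Map` progress bar counted 32 items, which equals `num_meetings * segments_per_meeting` (2 × 16). Assuming the same relationship holds at scale, the arithmetic below gives a rough idea of how many source segments the larger configuration draws. The helper name `total_segments` is made up for this illustration.

```python
def total_segments(num_meetings, segments_per_meeting):
    """Number of single-speaker segments sampled across all meetings."""
    return num_meetings * segments_per_meeting


print(total_segments(2, 16))    # 32, as shown by the `Map` bar of the small run
print(total_segments(800, 16))  # 12800
```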

In [None]:
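# Descriptive metadata for the dataset. `Dataset.push_to_hub` does not accept
# these as keyword arguments, so keep them for the dataset card instead.
card_metadata = {
    "dataset_tags": 'speaker-diarization-synthetic-dataset',
    "language": "jpn",
    "tasks": "speaker-diarization",
    "tags": ['speaker-diarization', 'synthetic-dataset', 'speaker-segmentation']
}

# `push_to_hub` needs a target repository id. The value below is a placeholder:
# replace it with your own "<username>/<dataset-name>".
repo_id = "<your-username>/<dataset-name>"
synthetic_dataset.push_to_hub(repo_id)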

## Train on a synthetic dataset