<a href="https://colab.research.google.com/github/nianlonggu/WhisperSeg/blob/master/docs/WhisperSeg_Voice_Activity_Detection_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Voice Activity Detection Demo

## Clone GitHub repository and install environment

In [None]:
!git clone https://github.com/nianlonggu/WhisperSeg.git
!cd WhisperSeg; pip install -r requirements.txt --quiet

Cloning into 'WhisperSeg'...
remote: Enumerating objects: 398, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 398 (delta 16), reused 9 (delta 2), pack-reused 366
Receiving objects: 100% (398/398), 178.10 MiB | 35.87 MiB/s, done.
Resolving deltas: 100% (122/122), done.

In [None]:
import os
os.chdir("WhisperSeg")

Here we use a WhisperSeg model trained on multi-species data to segment audio files from different species.



## Load the pretrained multi-species WhisperSeg

### CTranslate2 version for faster inference
We provide a CTranslate2-converted version of the model, which enables 4x faster inference. To use this converted model, import the `WhisperSegmenterFast` class.

In [None]:
from model import WhisperSegmenterFast

segmenter = WhisperSegmenterFast( "nccratliri/whisperseg-large-ms-ct2", device="cuda" )


### Illustration of segmentation parameters

The following parameters need to be configured for each species:
* **sr**: the sampling rate $f_s$ used when loading the audio.
* **min_frequency**: the minimum frequency used when computing the log-Mel spectrogram. Frequency components below `min_frequency` are excluded from the input spectrogram.
* **spec_time_step**: the spectrogram time resolution. By default, a single WhisperSeg input spectrogram contains 1000 columns; `spec_time_step` is the time difference between two adjacent columns. It equals the FFT hop size divided by the sampling rate: $\frac{L_\text{hop}}{f_s}$.
* **min_segment_length**: the minimum allowed length of a predicted segment. Predicted segments shorter than `min_segment_length` are discarded.
* **eps**: the threshold $\epsilon_\text{vote}$ used in the multi-trial majority voting when processing long audio files.
* **num_trials**: the number of segmentation variants produced during the multi-trial majority voting. Set `num_trials` to 1 for noisy data with long segment durations, such as the human AVA-Speech dataset, and to 3 when segmenting animal vocalizations.
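
As a quick check of the `spec_time_step` definition, the sketch below converts a time step into an FFT hop size in samples and computes how much audio one 1000-column spectrogram covers. The helper names are illustrative and not part of WhisperSeg; the numeric settings are the ones used in the example cells of this notebook.

```python
def hop_size_in_samples(sr: int, spec_time_step: float) -> int:
    """Hop size L_hop = spec_time_step * f_s, rounded to whole samples."""
    return round(sr * spec_time_step)


def window_duration(spec_time_step: float, num_columns: int = 1000) -> float:
    """Seconds of audio covered by one spectrogram of num_columns columns."""
    return num_columns * spec_time_step


# (name, sr, spec_time_step) as used in the example cells of this notebook
settings = [
    ("zebra finch", 32000, 0.0025),
    ("marmoset", 48000, 0.0025),
    ("mouse", 300000, 0.0005),
    ("human", 16000, 0.01),
]

for name, sr, step in settings:
    hop = hop_size_in_samples(sr, step)
    seconds = window_duration(step)
    print(f"{name}: hop = {hop} samples, one spectrogram covers {seconds:.1f} s")
```

A smaller `spec_time_step` gives finer time resolution but also means each spectrogram covers less audio, so a long recording is split into more windows.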

The recommended settings of these parameters are listed in Table 1 in the paper:
![Specific Segmentation Parameters](https://github.com/nianlonggu/WhisperSeg/blob/master/assets/species_specific_parameters.png?raw=true)
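
One way to avoid retyping these settings in every cell is to keep them in a small lookup table. The sketch below is only a convenience pattern: `SEGMENT_PRESETS` and `get_preset` are hypothetical names, and the values are copied from the example cells of this notebook, so Table 1 in the paper remains the authoritative source.

```python
# Hypothetical preset table; values copied from the example cells in this notebook.
_BIRD = dict(sr=32000, min_frequency=0, spec_time_step=0.0025,
             min_segment_length=0.01, eps=0.02, num_trials=3)

SEGMENT_PRESETS = {
    "zebra_finch": dict(_BIRD),
    "bengalese_finch": dict(_BIRD),
    "marmoset": dict(_BIRD, sr=48000),
    "mouse": dict(sr=300000, min_frequency=35000, spec_time_step=0.0005,
                  min_segment_length=0.01, eps=0.02, num_trials=3),
    "human": dict(sr=16000, min_frequency=0, spec_time_step=0.01,
                  min_segment_length=0.1, eps=0.2, num_trials=1),
}


def get_preset(species: str) -> dict:
    """Return a copy of the parameter dict for a species, or raise ValueError."""
    try:
        return dict(SEGMENT_PRESETS[species])
    except KeyError:
        known = ", ".join(sorted(SEGMENT_PRESETS))
        raise ValueError(f"unknown species {species!r}; known: {known}") from None


params = get_preset("mouse")
print(params)
# Possible usage with the segmenter loaded earlier in this notebook:
#   audio, _ = librosa.load(audio_file, sr=params["sr"])
#   prediction = segmenter.segment(audio, **params)
```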

### Segmentation Examples

In [None]:
import librosa
import json
from audio_utils import SpecViewer
# SpecViewer is a customized class for interactive spectrogram viewing
spec_viewer = SpecViewer()

#### Zebra finch (adults)

In [None]:
sr = 32000
min_frequency = 0
spec_time_step = 0.0025
min_segment_length = 0.01
eps = 0.02
num_trials = 3

audio, _ = librosa.load( "data/example_subset/Zebra_finch/test_adults/zebra_finch_g17y2U-f00007.wav",
                         sr = sr )
prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step,
                       min_segment_length = min_segment_length, eps = eps,num_trials = num_trials )
print(prediction)

{'onset': [0.01, 0.38, 0.603, 0.758, 0.912, 1.813, 1.967, 2.073, 2.838, 2.982, 3.112, 3.668, 3.828, 3.953, 5.158, 5.323, 5.467], 'offset': [0.073, 0.447, 0.673, 0.83, 1.483, 1.882, 2.037, 2.643, 2.893, 3.063, 3.283, 3.742, 3.898, 4.523, 5.223, 5.393, 6.043], 'cluster': ['zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0']}
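
The printed prediction is a dictionary of three parallel lists: `onset`, `offset`, and `cluster`. The sketch below shows one way to turn it into per-segment records with durations. The helper `to_segments` is an illustrative name, and the sample dictionary contains only the first three segments of the output above.

```python
def to_segments(prediction: dict) -> list:
    """Zip the parallel lists into (onset, offset, cluster) tuples."""
    return list(zip(prediction["onset"], prediction["offset"], prediction["cluster"]))


# First three segments of the prediction printed above
prediction = {
    "onset": [0.01, 0.38, 0.603],
    "offset": [0.073, 0.447, 0.673],
    "cluster": ["zebra_finch_0", "zebra_finch_0", "zebra_finch_0"],
}

for onset, offset, cluster in to_segments(prediction):
    print(f"{cluster}: {onset:.3f} s to {offset:.3f} s (duration {offset - onset:.3f} s)")
```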


In [None]:
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction,
                       window_size=8, precision_bits=1
                     )


Let's load the human-annotated segments and compare them with WhisperSeg's prediction.

In [None]:
label = json.load( open("data/example_subset/Zebra_finch/test_adults/zebra_finch_g17y2U-f00007.json") )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label,
                       window_size=8, precision_bits=1
                     )

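
Besides the visual comparison, prediction and annotation can be compared numerically. The sketch below counts labeled onsets that have a predicted onset within a small tolerance. It is an illustrative metric, not the evaluation procedure used for WhisperSeg: the function name, the 10 ms tolerance, and the assumption that the label JSON stores an `onset` list in seconds like the prediction are all assumptions, and the label values are made up for the demonstration.

```python
def count_matched_onsets(pred_onsets, label_onsets, tolerance=0.01):
    """Greedily match each labeled onset to one unused predicted onset
    that lies within `tolerance` seconds; return the number of matches."""
    preds = sorted(pred_onsets)
    used = [False] * len(preds)
    matched = 0
    for target in sorted(label_onsets):
        for i, p in enumerate(preds):
            if not used[i] and abs(p - target) <= tolerance:
                used[i] = True
                matched += 1
                break
    return matched


pred_onsets = [0.01, 0.38, 0.603]     # first predicted onsets from above
label_onsets = [0.012, 0.40, 0.600]   # made-up values, not the real annotations

matched = count_matched_onsets(pred_onsets, label_onsets)
precision = matched / len(pred_onsets)
recall = matched / len(label_onsets)
print(f"matched={matched}, precision={precision:.2f}, recall={recall:.2f}")
```

With real data, `label_onsets` would come from the loaded label (for example `label["onset"]`, if the file follows that layout).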

#### Zebra finch (juveniles)

In [None]:
sr = 32000
min_frequency = 0
spec_time_step = 0.0025
min_segment_length = 0.01
eps = 0.02
num_trials = 3

audio_file = "data/example_subset/Zebra_finch/test_juveniles/zebra_finch_R3428_40932.29996086_1_24_8_19_56.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step,
                       min_segment_length = min_segment_length, eps = eps,num_trials = num_trials )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label,
                       window_size=15, precision_bits=1 )

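
The cells in this notebook derive the label path by slicing off the last four characters of the audio path. A slightly more robust alternative, sketched below with the standard library only, swaps the extension with `pathlib` and closes the file after reading. The helper names `label_path_for` and `load_label` are illustrative, not part of WhisperSeg.

```python
import json
from pathlib import Path


def label_path_for(audio_file) -> Path:
    """Replace the final extension (e.g. .wav) with .json."""
    return Path(audio_file).with_suffix(".json")


def load_label(audio_file) -> dict:
    """Load the JSON annotation that sits next to an audio file."""
    with open(label_path_for(audio_file)) as f:
        return json.load(f)


audio_file = ("data/example_subset/Zebra_finch/test_juveniles/"
              "zebra_finch_R3428_40932.29996086_1_24_8_19_56.wav")
print(label_path_for(audio_file))
```

Only the final `.wav` suffix is replaced, so the dot inside the file name (`40932.29996086`) is left untouched.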

#### Bengalese finch

In [None]:
sr = 32000
min_frequency = 0
spec_time_step = 0.0025
min_segment_length = 0.01
eps = 0.02
num_trials = 3

audio_file = "data/example_subset/Bengalese_finch/test/bengalese_finch_bl26lb16_190412_0721.20144_0.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step,
                       min_segment_length = min_segment_length, eps = eps,num_trials = num_trials )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label,
                       window_size=3 )


#### Marmoset

In [None]:
sr = 48000
min_frequency = 0
spec_time_step = 0.0025
min_segment_length = 0.01
eps = 0.02
num_trials = 3

audio_file = "data/example_subset/Marmoset/test/marmoset_pair4_animal1_together_A_0.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step,
                       min_segment_length = min_segment_length, eps = eps,num_trials = num_trials )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label )


#### Mouse

In [None]:
sr = 300000
min_frequency = 35000
spec_time_step = 0.0005
min_segment_length = 0.01
eps = 0.02
num_trials = 3

audio_file = "data/example_subset/Mouse/test/mouse_Rfem_Afem01_0.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step,
                       min_segment_length = min_segment_length, eps = eps,num_trials = num_trials )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label )

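
The mouse example is the only one with a non-zero `min_frequency`. A sampled signal can only represent frequencies up to half the sampling rate (the Nyquist frequency), so `min_frequency` has to stay below `sr / 2`. The sketch below performs that sanity check. The function name is illustrative, and whether WhisperSeg uses the full range up to the Nyquist frequency as its upper limit is an assumption not confirmed here.

```python
def check_frequency_band(sr: int, min_frequency: float) -> tuple:
    """Return (min_frequency, nyquist) in Hz, or raise if the band is empty."""
    nyquist = sr / 2
    if not 0 <= min_frequency < nyquist:
        raise ValueError(
            f"min_frequency={min_frequency} Hz must be in [0, {nyquist}) Hz for sr={sr}"
        )
    return (min_frequency, nyquist)


low, high = check_frequency_band(sr=300000, min_frequency=35000)
print(f"representable band: {low} Hz to {high} Hz")
```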

#### Human (AVA-Speech)

In [None]:
sr = 16000
min_frequency = 0
spec_time_step = 0.01
min_segment_length = 0.1
eps = 0.2
num_trials = 1

audio_file = "data/example_subset/Human_AVA_Speech/test/human_xO4ABy2iOQA_clip.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step,
                       min_segment_length = min_segment_length, eps = eps,num_trials = num_trials )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label,
                       window_size=20, precision_bits=0, xticks_step_size = 2 )

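
For voice activity detection, a common summary is the fraction of the recording that was detected as speech. The sketch below computes it from a prediction dictionary, assuming the predicted segments do not overlap. The function name and the sample values are made up for illustration; with real data, `total_duration` could be taken as `len(audio) / sr`.

```python
def voiced_fraction(prediction: dict, total_duration: float) -> float:
    """Fraction of total_duration covered by predicted segments.
    Assumes segments do not overlap."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    voiced = sum(
        offset - onset
        for onset, offset in zip(prediction["onset"], prediction["offset"])
    )
    return voiced / total_duration


# Made-up example: two segments (2 s and 1 s) in a 10 s recording
example = {"onset": [0.0, 5.0], "offset": [2.0, 6.0], "cluster": ["human", "human"]}
print(f"voiced fraction: {voiced_fraction(example, total_duration=10.0):.2f}")
```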

## Contact
GitHub: https://github.com/nianlonggu/WhisperSeg

Nianlong Gu
nianlong.gu@uzh.ch
