VAD workflow with Silero #1160

Merged: 34 commits, Sep 28, 2023

rilshok
Contributor

@rilshok rilshok commented Sep 24, 2023

Description of Changes

We have added integration of the Silero VAD into the Lhotse project. This integration enables the use of Silero VAD for voice activity detection in audio recordings within Lhotse RecordingSets. The solution analyzes each track of every recording and stores the results in the SupervisionSet.

In the current change, we have introduced the following:

  • A base class called ActivityDetector to allow for the addition of new activity detectors in the future.
  • A runner class named ActivityDetectionProcessor for parallel execution of activity detection on the RecordingSet.
  • Two classes, SileroVAD8k and SileroVAD16k, for the integration of Silero VAD.
  • A workflow named activity-detection for running activity detection on the RecordingSet.

You can find the Silero VAD model for this integration on the Silero VAD GitHub project.
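The split between a detector base class and a parallel runner described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Detector`, `EnergyDetector`, `Segment`, and `run_parallel` are hypothetical names, the toy detector is a simple amplitude threshold rather than Silero VAD, and threads are used only to keep the example short.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    """Hypothetical stand-in for a supervision segment."""
    recording_id: str
    channel: int
    start: float
    duration: float


class Detector(ABC):
    """Hypothetical base class: subclasses implement detection for one track."""

    @abstractmethod
    def detect_track(
        self, samples: Sequence[float], sampling_rate: int
    ) -> List[Tuple[float, float]]:
        """Return (start, duration) pairs in seconds for a single track."""

    def __call__(self, recording_id, tracks, sampling_rate) -> List[Segment]:
        # Analyze every track (channel) of the recording separately.
        segments = []
        for channel, samples in enumerate(tracks):
            for start, duration in self.detect_track(samples, sampling_rate):
                segments.append(Segment(recording_id, channel, start, duration))
        return segments


class EnergyDetector(Detector):
    """Toy detector: a run of samples above the threshold counts as activity."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def detect_track(self, samples, sampling_rate):
        spans, run_start = [], None
        for i, x in enumerate(samples):
            active = abs(x) > self.threshold
            if active and run_start is None:
                run_start = i
            elif not active and run_start is not None:
                spans.append((run_start / sampling_rate, (i - run_start) / sampling_rate))
                run_start = None
        if run_start is not None:  # activity lasts until the end of the track
            spans.append(
                (run_start / sampling_rate, (len(samples) - run_start) / sampling_rate)
            )
        return spans


def run_parallel(detector, recordings, sampling_rate, jobs=2) -> List[Segment]:
    """Run the detector over (recording_id, tracks) pairs with a worker pool."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(lambda rec: detector(rec[0], rec[1], sampling_rate), recordings)
        return [segment for segments in results for segment in segments]
```

A new detector in this sketch only has to implement `detect_track`; the per-channel loop and the parallel runner stay unchanged.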

Example usage in CLI

Prepare the model for work:

lhotse workflows activity-detection \
  --model-name silero-vad-16k \
  --chore

Run activity detection by Silero VAD:

lhotse workflows activity-detection \
  --model-name silero-vad-16k \
  --recordings-manifest data/librispeech_recordings_train-clean-5.jsonl.gz \
  --output-supervisions-manifest librispeech_recordings_train-clean-5.jsonl.gz \
  --jobs 2 \
  --device cpu
Loading recordings from data/librispeech_recordings_train-clean-5.jsonl.gz...
Making activity detection processor for 'silero-vad-16k'...
Running activity detection using 'silero-vad-16k'...
Using cache found in ~/.cache/torch/hub/snakers4_silero-vad_master
...
Detecting activities: 100%|████████████████| 1519/1519 [04:50<00:00,  5.22rec/s]
Saving 'silero-vad-16k' results ...
Results saved to:
.../librispeech_recordings_train-clean-5.jsonl.gz

Example usage in code

from lhotse.workflows.activity_detection.silero_vad import SileroVAD16k
from lhotse.audio import RecordingSet

vad = SileroVAD16k(device="cuda")

recordings = RecordingSet.from_file("data/librispeech_recordings_train-clean-5.jsonl.gz")
record = recordings[25]

vad(record)
[SupervisionSegment(id='6272-70171-0025-SileroVAD_16kHz-0-00000', recording_id='6272-70171-0025', start=0.194, duration=2.396, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None),
 SupervisionSegment(id='6272-70171-0025-SileroVAD_16kHz-0-00001', recording_id='6272-70171-0025', start=3.682, duration=1.02, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None),
 SupervisionSegment(id='6272-70171-0025-SileroVAD_16kHz-0-00002', recording_id='6272-70171-0025', start=4.994, duration=0.956, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None),
 SupervisionSegment(id='6272-70171-0025-SileroVAD_16kHz-0-00003', recording_id='6272-70171-0025', start=6.146, duration=2.652, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None),
 SupervisionSegment(id='6272-70171-0025-SileroVAD_16kHz-0-00004', recording_id='6272-70171-0025', start=9.122, duration=4.316, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None),
 SupervisionSegment(id='6272-70171-0025-SileroVAD_16kHz-0-00005', recording_id='6272-70171-0025', start=13.634, duration=3.006, channel=0, text=None, language=None, speaker=None, gender=None, custom=None, alignment=None)]
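
As a small follow-up to the output above, the detected segments can be summarized by total speech time and by the pauses between them. This sketch copies the `start` and `duration` values from the listing into plain tuples so that it runs without Lhotse; the helper names `total_speech` and `pauses` are hypothetical.

```python
from typing import List, Tuple

# (start, duration) pairs in seconds, copied from the segments listed above.
segments: List[Tuple[float, float]] = [
    (0.194, 2.396),
    (3.682, 1.02),
    (4.994, 0.956),
    (6.146, 2.652),
    (9.122, 4.316),
    (13.634, 3.006),
]


def total_speech(segs: List[Tuple[float, float]]) -> float:
    """Sum of the segment durations in seconds."""
    return sum(duration for _, duration in segs)


def pauses(segs: List[Tuple[float, float]]) -> List[float]:
    """Silence between the end of each segment and the start of the next one."""
    return [
        next_start - (start + duration)
        for (start, duration), (next_start, _) in zip(segs, segs[1:])
    ]


print(f"speech: {total_speech(segments):.3f} s")
print("pauses:", [round(p, 3) for p in pauses(segments)])
```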

Related Issues

Related tasks and links to discussions related to this integration:

  • #1041 - Add Silero VAD integration

Collaborator

@desh2608 desh2608 left a comment

Thanks for this contribution! Left some minor comments.

Out of curiosity, do you have any speed benchmarks? For example, if I were to run the VAD on a 1 hour recording with 1 GPU, how long would it take?

lhotse/bin/modes/workflows.py (outdated, resolved)
lhotse/workflows/activity_detection/README.md (resolved)
lhotse/workflows/activity_detection/base.py (resolved)
@csukuangfj
Contributor

> Thanks for this contribution! Left some minor comments.
>
> Out of curiosity, do you have any speed benchmarks? For example, if I were to run the VAD on a 1 hour recording with 1 GPU, how long would it take?

I think it is faster on CPU for silero VAD.

Silero VAD uses LSTM and you need to process the file sequentially.
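
The point about sequential processing can be illustrated with a toy recurrence. This is not an LSTM and not Silero VAD: `step` is a hypothetical one-line recurrent update, used only to show that each output depends on the state left by the previous chunk, so chunks of one file cannot simply be processed independently.

```python
from typing import List, Sequence, Tuple


def step(state: float, chunk_energy: float, alpha: float = 0.5) -> Tuple[float, float]:
    """One recurrent update: the new state mixes the old state with the input."""
    new_state = alpha * state + (1 - alpha) * chunk_energy
    return new_state, new_state  # (state to carry forward, output for this chunk)


def run_sequential(energies: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Process chunks in order, carrying the state from one chunk to the next."""
    state, outputs = 0.0, []
    for energy in energies:
        state, out = step(state, energy, alpha)
        outputs.append(out)
    return outputs


def run_independent(energies: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Process every chunk from a fresh state, as a naive parallel split would."""
    return [step(0.0, energy, alpha)[1] for energy in energies]


chunks = [1.0, 1.0, 0.0]
print(run_sequential(chunks))   # outputs that depend on the history of the file
print(run_independent(chunks))  # different outputs: the history is lost
```

Parallelism is therefore easier to apply across recordings, each with its own model state, than within a single file.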

@rilshok
Contributor Author

rilshok commented Sep 25, 2023

I haven't made precise performance measurements. The Silero VAD authors claim that processing one audio fragment (30+ ms) takes less than 1 ms on a single CPU thread, and my own experience is consistent with this. However, this figure may not reflect strictly single-threaded execution: when I run two processes in parallel on the CPU (each with its own model state), Torch fully loads my CPU, and adding more workers does not bring a significant performance gain. On my inexpensive GPU, however, running multiple processes in parallel gave about a 3x speedup. I would be grateful for contributions that document the performance of this model.
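
The figures quoted above allow a rough back-of-the-envelope estimate for the 1 hour recording asked about earlier. This is arithmetic, not a benchmark: it assumes fragments of exactly 30 ms, the claimed upper bound of 1 ms per fragment, and strictly sequential processing with no other overhead. The function name `estimate_seconds` is hypothetical.

```python
def estimate_seconds(
    audio_seconds: float, chunk_ms: float = 30.0, per_chunk_ms: float = 1.0
) -> float:
    """Upper-bound processing time if every chunk costs `per_chunk_ms`."""
    n_chunks = audio_seconds * 1000.0 / chunk_ms  # number of fragments in the audio
    return n_chunks * per_chunk_ms / 1000.0       # total time in seconds


one_hour = 3600.0
estimate = estimate_seconds(one_hour)
print(f"chunks: {one_hour * 1000.0 / 30.0:.0f}")
print(f"estimated time: {estimate:.0f} s")
print(f"real-time factor: {estimate / one_hour:.3f}")
```

Under these assumptions, one hour of audio is 120,000 fragments and takes at most about two minutes on one CPU thread.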

pzelasko
pzelasko previously approved these changes Sep 26, 2023
Collaborator

@pzelasko pzelasko left a comment

Thanks for your contribution, it looks great! Can you fix that doc comment before we merge?

lhotse/workflows/activity_detection/README.md (outdated, resolved)
lhotse/workflows/activity_detection/base.py (resolved)
lhotse/workflows/activity_detection/base.py (resolved)
@rilshok
Contributor Author

rilshok commented Sep 26, 2023

The tests fail on an older version of torch because of the trust_repo argument passed to torch.hub. I'm working on it.
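
One way to handle this kind of incompatibility is to pass `trust_repo` only when the installed torch is new enough. The sketch below is an assumption about how such a guard could look, not the fix that was merged: the helper names are hypothetical, and the 1.12 threshold is taken from the commit messages of this PR ("trust the repository since torch>=1.12").

```python
import re
from typing import Any, Dict, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '1.11.0' or '2.0.1+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def hub_load_kwargs(torch_version: str) -> Dict[str, Any]:
    """Keyword arguments for torch.hub.load that are safe for this torch version."""
    if parse_major_minor(torch_version) >= (1, 12):
        return {"trust_repo": True}  # argument accepted by torch >= 1.12
    return {}  # older torch: omit the argument entirely


# Intended use (not executed here; the model entry-point name is an assumption):
#   import torch
#   model = torch.hub.load(
#       "snakers4/silero-vad", "silero_vad", **hub_load_kwargs(torch.__version__)
#   )
print(hub_load_kwargs("1.11.0"))
print(hub_load_kwargs("1.12.1+cu116"))
```

The same version check could also be used as a skip condition for tests that need the newer behaviour.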

pzelasko
pzelasko previously approved these changes Sep 26, 2023
@pzelasko pzelasko enabled auto-merge (squash) September 26, 2023 19:38
@desh2608 desh2608 added this to the v1.17 milestone Sep 26, 2023
auto-merge was automatically disabled September 26, 2023 19:56

Head branch was pushed to by a user without write access

@pzelasko pzelasko enabled auto-merge (squash) September 27, 2023 01:19
pzelasko
pzelasko previously approved these changes Sep 27, 2023
Collaborator

@pzelasko pzelasko left a comment

LGTM

auto-merge was automatically disabled September 27, 2023 05:47

Head branch was pushed to by a user without write access

@rilshok
Contributor Author

rilshok commented Sep 27, 2023

Do we have a plan for fixing the failing test in the Python 3.11 environment? The failure seems to have an external cause and is not related to my changes.

@rilshok
Contributor Author

rilshok commented Sep 27, 2023

I tried to reproduce the failure in this test, but I could not reproduce it on my side.

lhotse/bin/modes/workflows.py (outdated, resolved)
lhotse/workflows/activity_detection/README.md (outdated, resolved)
@pzelasko pzelasko enabled auto-merge (squash) September 28, 2023 18:11
@pzelasko pzelasko merged commit b138baf into lhotse-speech:master Sep 28, 2023
8 of 10 checks passed
flyingleafe pushed a commit to flyingleafe/lhotse that referenced this pull request Oct 11, 2023
* initialise the script for activity detection

* init lhotse.workflows.activity_distillation module

* add the Silero VAD model wrapper

* inherit SileroVAD from ActivityDetector

* pass parameters to the model explicitly

* process each channel and return the supervision

* parallel processing by activity detector

* number the segments found

* make abstract processing of an individual track

* rename module and workflow to activity_detection

* standardise detectors by sampling rate

* rename silero vad models

* implement a script for supervisory with silero-vad

* handle exceptions and user input

* allow the path to the output dir

* reset the cached state of the model if necessary

* add docs for activity_detection module

* fix if dir does not exist

* fix cuda issue

* add RecordingSet python example

* add base test for silero vad workflow

* add test for silero vad in parallel

* change detector name

* replace the chore option with force_download

* improve user experience

* add simple test for activity_detection workflow

* clarify the need to use --force_download

* rm slash

* trust the repository since torch>=1.12

* skip tests if torch version <1.12

* exclude non-coverage eligible code

* change the behaviour of the force_download option

---------

Co-authored-by: Piotr Żelasko <petezor@gmail.com>
4 participants