How to fetch audio segments in real time from diarization pipeline #26

Closed
Yagna24 opened this issue Feb 8, 2022 · 4 comments
Labels: question (Further information is requested)

Comments

@Yagna24 (Contributor)
Yagna24 commented Feb 8, 2022

Hi, first of all, thank you for this repo.

I was wondering if it's possible to use the diarization pipeline to fetch the audio segments of discrete speakers.
It would help immensely if there's a way to do this.

Yagna24 changed the title from "Fetch audio segments in real time from diarization pipeline" to "How to fetch audio segments in real time from diarization pipeline" on Feb 8, 2022
@juanmc2005 (Owner)

Hi @Yagna24, thank you for your question.

I'm not quite sure I understand what you mean by "audio segments of discrete speakers", but I think you may be looking for the timestamps where a speaker is active or the waveform.

Timestamps

Basically, the diarization pipeline outputs a pair of an Annotation (the diarization prediction) and a SlidingWindowFeature (the waveform) for each short (500ms) chunk. You can always use Annotation.label_timeline("speaker0") to get the list of segments in which speaker0 is active, in the form of a Timeline object. Keep in mind that a Timeline is essentially an ordered list of Segment objects, each of which just contains a start and an end time.
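
As a tiny illustration (this snippet is hypothetical, not taken from diart itself), iterating such a timeline gives you the individual segments:

# Hypothetical sketch: `prediction` stands for one Annotation produced by the pipeline
for segment in prediction.label_timeline("speaker0"):
    print(f"speaker0 active from {segment.start:.2f}s to {segment.end:.2f}s")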

If you want to use the accumulated output to filter by speaker in this way, you can always use the accumulate_output() operator. For example:

import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.pipelines import OnlineSpeakerDiarization

pipeline = OnlineSpeakerDiarization()
audio_source = MicrophoneAudioSource(sample_rate=16000)

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # each x is a tuple: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(on_next=lambda x: print(x[0].label_timeline("speaker0")))

Waveform

If you want the waveform given a speaker timeline (or segment), then you can use SlidingWindowFeature.crop(timeline). For example:

import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.pipelines import OnlineSpeakerDiarization
import diart.utils as utils
from pyannote.core import SlidingWindowFeature

pipeline = OnlineSpeakerDiarization()
audio_source = MicrophoneAudioSource(sample_rate=16000)
visualize = utils.visualize_feature(duration=5.0)

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(
    on_next=lambda x: visualize(
        SlidingWindowFeature(
            # crop() returns a numpy array, so wrap it back into a SlidingWindowFeature
            x[1].crop(x[0].label_timeline("speaker0")),
            x[1].sliding_window,
        )
    )
)

This should show you the concatenated waveforms where speaker0 is active, thanks to the function utils.visualize_feature(). Notice that crop() returns a numpy array and not a SlidingWindowFeature, which is why it's wrapped back into one before visualizing.

Of course, I'm assuming that you want the accumulated output, and these code snippets are the quick and dirty way. I'm sure you can find a way to make it look better :)

For more info on pyannote.core data structures, I invite you to take a look at the docs.

Please let me know if this answers your question.

@Yagna24 (Contributor, Author)

Yagna24 commented Feb 9, 2022

Hi @juanmc2005, thank you for your response.

For this part,

You can always use Annotation.label_timeline("speaker0") to get the list of segments in which speaker0 is active, in the form of a Timeline object. Keep in mind that a Timeline is essentially an ordered list of Segment objects, each of which just contains a start and an end time.

I could use the start and end timestamps to crop the audio for every speaker.

I don't understand what the output of accumulate_output() would be

If you want to use the accumulated output to filter by speaker in this way, you can always use the accumulate_output() operator.

One last question about the main pipeline:

https://github.com/juanmc2005/StreamingSpeakerDiarization/blob/main/src/diart/pipelines.py

Is this the main pipeline I should be concerned with? I am planning to add the timestamp-fetching and audio-cropping code in diart/pipelines.py so that it becomes part of the main output as well.

Please correct me if I am wrong.
Thanks for your time.

@juanmc2005 (Owner)

I could use the start and end timestamps to crop the audio for every speaker.

Absolutely, you can get the individual segments that compose the timeline for a given speaker and crop the audio with them.
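
For example, something like this rough sketch (here annotation and waveform are just placeholders for the accumulated Annotation and SlidingWindowFeature, not diart variables):

# Sketch: crop the accumulated waveform for each segment of a given speaker
for segment in annotation.label_timeline("speaker0"):
    chunk = waveform.crop(segment)  # numpy array with the samples of this segment
    # ... save or process `chunk` here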

I don't understand what the output of accumulate_output() would be

It depends on what level you look at it.
In reality, accumulate_output() is a function that builds and returns an RxPY operator, which is basically a function that can be added to a stream via pipe (i.e. stream.pipe(accumulate_output(...))).
The arguments of accumulate_output() are needed to build such an operator.

This explains how it "connects" to RxPY, but it's easier to think about it in terms of what the returned operator does.
That operator takes a pair (Annotation, SlidingWindowFeature) as input (i.e. the prediction and the waveform) and returns a tuple (Annotation, SlidingWindowFeature, float) corresponding to the accumulated predictions from the beginning, the accumulated waveform from the beginning, and the current real time of the stream.
In order to access this output you need to work on the on_next function of the following operator (or the observer in the subscribe method), as I showed in this example:

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(on_next=lambda x: print(x[0].label_timeline("speaker0")))

Is this the main pipeline I should be concerned with? I am planning to add the timestamp-fetching and audio-cropping code in diart/pipelines.py so that it becomes part of the main output as well.

Yes, this pipeline implements the entire system and it will stream the diarization predictions from the audio.
You can of course modify the internals of OnlineSpeakerDiarization, but that's not necessary.
Thanks to RxPY you can "connect" more operators to the output of from_source() (which is an RxPY observable) using the pipe method, as I showed in my code snippets.
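
For instance, something along these lines (a quick sketch reusing pipeline and audio_source from the snippets above, with an extra rx.operators.map step to transform the output):

import rx.operators as ops

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # keep only speaker0's timeline from the accumulated Annotation
    ops.map(lambda x: x[0].label_timeline("speaker0")),
).subscribe(on_next=print)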

You may also not want to use RxPY and that's fine too. In that case you'll need to use the building blocks in functional to implement the system with your streaming library of choice.
In that case I would be happy to hear suggestions for another popular streaming library and even merge a PR adding compatibility :)

@Yagna24 (Contributor, Author)

Yagna24 commented Feb 11, 2022

Hi @juanmc2005, thank you for the response. I have an update about getting the timestamps. Since I want to work on a similar type of problem, I started with a simpler one: I used pyannote's diarization pipeline to get the contents:

diarized = diarization(audio_file)

diarized.for_json('contents')

Next, I filtered out the timestamps, stripping the extra characters to keep only the start and end times.

I stored the start and end timestamps as arrays in T1 and T2.
Then I simply use pydub to crop the audio between each T1 and T2 time interval.
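
Roughly something like this (just a sketch of that cropping step; the file name is illustrative and T1/T2 hold times in seconds):

from pydub import AudioSegment

audio = AudioSegment.from_wav("audio_file.wav")
for start, end in zip(T1, T2):
    clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds
    clip.export(f"segment_{start:.2f}_{end:.2f}.wav", format="wav")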

I am also interested in doing the same in real time, so I look forward to working with the main pipeline now.

juanmc2005 added the question label on Feb 14, 2022