How to fetch audio segments in real time from diarization pipeline #26

Closed
Yagna24 opened this issue Feb 8, 2022 · 4 comments
Labels: question (Further information is requested)

Comments

@Yagna24 (Contributor)
Yagna24 commented Feb 8, 2022

Hi, first of all, thank you for this repo.

I was wondering if it's possible to use the diarization pipeline to fetch the audio segments of discrete speakers.
It would help immensely if there's a way to do this.

Yagna24 changed the title from "Fetch audio segments in real time from diarization pipeline" to "How to fetch audio segments in real time from diarization pipeline" on Feb 8, 2022
@juanmc2005 (Owner)

Hi @Yagna24, thank you for your question.

I'm not quite sure I understand what you mean by "audio segments of discrete speakers", but I think you may be looking for the timestamps where a speaker is active or the waveform.

Timestamps

Basically, the diarization pipeline outputs a pair of an Annotation (the diarization prediction) and a SlidingWindowFeature (the waveform) for each short (500ms) chunk. You can always use Annotation.label_timeline("speaker0") to get the list of segments in which speaker0 is active, in the form of a Timeline object. Keep in mind that a Timeline is essentially an ordered list of Segment objects, each of which just contains a start and an end time.
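
As a tiny illustration (this snippet is hypothetical, not taken from diart itself), iterating such a timeline gives you the individual segments:

# Hypothetical sketch: `prediction` stands for one Annotation produced by the pipeline
for segment in prediction.label_timeline("speaker0"):
    print(f"speaker0 active from {segment.start:.2f}s to {segment.end:.2f}s")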

If you want to use the accumulated output to filter by speaker in this way, you can always use the accumulate_output() operator. For example:

import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.pipelines import OnlineSpeakerDiarization

pipeline = OnlineSpeakerDiarization()
audio_source = MicrophoneAudioSource(sample_rate=16000)

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # each x is a tuple: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(on_next=lambda x: print(x[0].label_timeline("speaker0")))

Waveform

If you want the waveform given a speaker timeline (or segment), then you can use SlidingWindowFeature.crop(timeline). For example:

import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.pipelines import OnlineSpeakerDiarization
import diart.utils as utils
from pyannote.core import SlidingWindowFeature

pipeline = OnlineSpeakerDiarization()
audio_source = MicrophoneAudioSource(sample_rate=16000)
visualize = utils.visualize_feature(duration=5.0)

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(
    on_next=lambda x: visualize(
        SlidingWindowFeature(
            # crop() returns a numpy array, so wrap it back into a SlidingWindowFeature
            x[1].crop(x[0].label_timeline("speaker0")),
            x[1].sliding_window,
        )
    )
)

This should show you the concatenated waveforms where speaker0 is active, thanks to the function utils.visualize_feature(). Notice that crop() returns a numpy array and not a SlidingWindowFeature, which is why it's wrapped back into one before visualizing.

Of course, I'm assuming that you want the accumulated output, and these code snippets are the quick and dirty way. I'm sure you can find a way to make it look better :)

For more info on pyannote.core data structures, I invite you to take a look at the docs.

Please let me know if this answers your question.

@Yagna24 (Contributor, Author)

Yagna24 commented Feb 9, 2022

Hi @juanmc2005, thank you for your response.

For this part,

You can always use Annotation.label_timeline("speaker0") to get the list of segments in which speaker0 is active, in the form of a Timeline object. Keep in mind that a Timeline is essentially an ordered list of Segment objects, each of which just contains a start and an end time.

I could use the start and end timestamps to crop the audio for every speaker.

I don't understand what the output of accumulate_output() would be

If you want to use the accumulated output to filter by speaker in this way, you can always use the accumulate_output() operator.

One last question about the main pipeline:

https://github.com/juanmc2005/StreamingSpeakerDiarization/blob/main/src/diart/pipelines.py

Is this the main pipeline I should be concerned with? I am planning to add the timestamp-fetching and audio-cropping code in diart/pipelines.py so that it becomes part of the main output as well.

Please correct me if I am wrong.
Thanks for your time.

@juanmc2005 (Owner)

I could use the start and end timestamps to crop the audio for every speaker.

Absolutely, you can get the individual segments that compose the timeline for a given speaker and crop the audio with them.
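
For example, something like this rough sketch (here annotation and waveform are just placeholders for the accumulated Annotation and SlidingWindowFeature, not diart variables):

# Sketch: crop the accumulated waveform for each segment of a given speaker
for segment in annotation.label_timeline("speaker0"):
    chunk = waveform.crop(segment)  # numpy array with the samples of this segment
    # ... save or process `chunk` here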

I don't understand what the output of accumulate_output() would be

It depends on what level you look at it.
In reality, accumulate_output() is a function that builds and returns an RxPY operator, which is basically a function that can be added to a stream via pipe (i.e. stream.pipe(accumulate_output(...))).
The arguments of accumulate_output() are needed to build such an operator.

This explains how it "connects" to RxPY, but it's easier to think about it in terms of what the returned operator does.
That operator takes a pair (Annotation, SlidingWindowFeature) as input (i.e. the prediction and the waveform) and returns a tuple (Annotation, SlidingWindowFeature, float) corresponding to the accumulated predictions from the beginning, the accumulated waveform from the beginning, and the current real time of the stream.
In order to access this output you need to work on the on_next function of the following operator (or the observer in the subscribe method), as I showed in this example:

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(on_next=lambda x: print(x[0].label_timeline("speaker0")))

Is this the main pipeline I should be concerned with? I am planning to add the timestamp-fetching and audio-cropping code in diart/pipelines.py so that it becomes part of the main output as well.

Yes, this pipeline implements the entire system and it will stream the diarization predictions from the audio.
You can of course modify the internals of OnlineSpeakerDiarization, but that's not necessary.
Thanks to RxPY you can "connect" more operators to the output of from_source() (which is an RxPY observable) using the pipe method, as I showed in my code snippets.
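
For instance, something along these lines (a quick sketch reusing pipeline and audio_source from the snippets above, with an extra rx.operators.map step to transform the output):

import rx.operators as ops

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # keep only speaker0's timeline from the accumulated Annotation
    ops.map(lambda x: x[0].label_timeline("speaker0")),
).subscribe(on_next=print)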

You may also not want to use RxPY and that's fine too. In that case you'll need to use the building blocks in functional to implement the system with your streaming library of choice.
In that case I would be happy to hear suggestions for another popular streaming library and even merge a PR adding compatibility :)

@Yagna24 (Contributor, Author)

Yagna24 commented Feb 11, 2022

Hi @juanmc2005, thank you for the response. I have an update about getting the timestamps. Since I want to work on a similar type of problem, I started with a simpler one: I used pyannote's diarization pipeline to get the contents:

diarized = diarization(audio_file)

diarized.for_json('contents')

Next, I filtered out the timestamps, stripping the extra characters to keep only the start and end times.

I stored the start and end timestamps as arrays in T1 and T2.
Then I simply use pydub to crop the audio between each T1 and T2 time interval.
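
Roughly something like this (just a sketch of that cropping step; the file name is illustrative and T1/T2 hold times in seconds):

from pydub import AudioSegment

audio = AudioSegment.from_wav("audio_file.wav")
for start, end in zip(T1, T2):
    clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds
    clip.export(f"segment_{start:.2f}_{end:.2f}.wav", format="wav")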

I am also interested in doing the same in real time, so I look forward to working with the main pipeline now.

juanmc2005 added the question label on Feb 14, 2022