How to fetch audio segments in real time from diarization pipeline #26
Hi @Yagna24, thank you for your question. I'm not quite sure I understand what you mean by "audio segments of discrete speakers", but I think you may be looking for either the timestamps where a speaker is active, or the waveform itself.

**Timestamps**

Basically, the diarization pipeline outputs pairs of short (500ms) predictions. If you want to use the accumulated output to filter by speaker in this way, you can always use the `accumulate_output` operator:

```python
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
import rx.operators as ops
from diart.pipelines import OnlineSpeakerDiarization

pipeline = OnlineSpeakerDiarization()
audio_source = MicrophoneAudioSource(sample_rate=16000)

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(on_next=lambda x: print(x[0].label_timeline("speaker0")))
```

**Waveform**

If you want the waveform given a speaker timeline (or segment), then you can use `SlidingWindowFeature.crop`:

```python
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
import rx.operators as ops
from diart.pipelines import OnlineSpeakerDiarization
import diart.utils as utils
from pyannote.core import SlidingWindowFeature

pipeline = OnlineSpeakerDiarization()
audio_source = MicrophoneAudioSource(sample_rate=16000)
visualize = utils.visualize_feature(duration=5.0)

pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(
    on_next=lambda x: visualize(
        SlidingWindowFeature(
            x[1].crop(x[0].label_timeline("speaker0")),
            x[1].sliding_window,
        )
    )
)
```

This should show you the concatenated waveforms where `speaker0` is active.

Of course, I'm assuming that you want the accumulated output, and these code snippets are the quick and dirty way; I'm sure you can find a way to make it look better :) For more info on `pyannote.core` data structures, I invite you to take a look at the docs.

Please let me know if this answers your question.
Hi @juanmc2005, thank you for your response. For this part, I could use the start and end timestamps to crop the audio for every speaker. However, I don't understand what the output of `accumulate_output()` would be.

Last question about the main pipeline: is this the main pipeline that should concern me? I am planning to add the timestamp fetching part and the audio cropping code there. Please correct me if I am wrong.
Absolutely, you can get the individual segments that compose the timeline for a given speaker and crop the audio with them.
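For illustration, that cropping step can be sketched in plain Python. Note this is a hypothetical stand-in, not diart's API: the sample rate and the `(start, end)` segment list are made up here, where in practice they would come from the audio source and from `label_timeline(...)`:

```python
# Hypothetical sketch (not diart's API): crop a mono waveform into
# per-speaker clips from (start, end) timestamps given in seconds.
SAMPLE_RATE = 16000

def crop_segments(waveform, segments, sample_rate=SAMPLE_RATE):
    """Return one clip (a slice of samples) per (start, end) segment."""
    clips = []
    for start, end in segments:
        lo = int(start * sample_rate)  # convert seconds to sample indices
        hi = int(end * sample_rate)
        clips.append(waveform[lo:hi])
    return clips

audio = [0.0] * (2 * SAMPLE_RATE)       # 2 seconds of dummy audio
speaker0 = [(0.0, 0.5), (1.0, 1.75)]    # made-up segments for "speaker0"
clip_lengths = [len(c) for c in crop_segments(audio, speaker0)]
print(clip_lengths)  # [8000, 12000]
```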
It depends at what level you look at it. This is the explanation for how it "connects" to RxPY, but it's easier to think about it in terms of what the output operator does:

```python
pipeline.from_source(audio_source).pipe(
    dops.accumulate_output(duration=pipeline.duration, step=pipeline.step),
    # x: (Annotation, SlidingWindowFeature, real_time: float)
).subscribe(on_next=lambda x: print(x[0].label_timeline("speaker0")))
```
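To give an intuition for what accumulation means here, the idea can be sketched as a plain function. This is a conceptual illustration only, not diart's implementation (`accumulate_output` is an RxPY operator): it stitches overlapping windows that advance by `step` seconds into one growing buffer.

```python
# Conceptual sketch only, not diart's implementation: stitch overlapping
# windows (advancing by `step` seconds) into one continuously growing buffer.
SAMPLE_RATE = 16000

def accumulate(chunks, step, sample_rate=SAMPLE_RATE):
    step_samples = int(step * sample_rate)
    buffer = list(chunks[0])            # first window is kept whole
    for chunk in chunks[1:]:
        buffer.extend(chunk[-step_samples:])  # keep only the new tail
    return buffer

chunks = [[0.0] * (5 * SAMPLE_RATE) for _ in range(3)]  # three 5s windows
full = accumulate(chunks, step=0.5)
print(len(full) / SAMPLE_RATE)  # 6.0 seconds: 5 + 0.5 + 0.5
```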
Yes, this pipeline implements the entire system and it will stream the diarization predictions from the audio. You may also not want to use RxPY, and that's fine too. In that case you'll need to use the building blocks yourself.
Hi @juanmc2005, thank you for the response. I have an update about getting the timestamps. Since I want to work on a similar type of problem, I found one that is simpler: I used pyannote's diarization pipeline to get the contents.

Next, I filtered the output, removing the extra characters to keep the main timestamps, and stored the start and end timestamps as arrays T1 and T2. I am also interested in performing the same in real time. I look forward to working with the main pipeline now.
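That filtering step can be sketched as follows. The segment line format below is an assumption about what the pyannote output looks like when printed (`[ HH:MM:SS.mmm -->  HH:MM:SS.mmm] track SPEAKER_XX`); adjust the regex if your output differs.

```python
import re

# Assumed line format (hypothetical example of printed pyannote segments):
# "[ 00:00:00.497 -->  00:00:05.000] A SPEAKER_00"
LINE = re.compile(r"\[\s*(\d+):(\d+):([\d.]+)\s*-->\s*(\d+):(\d+):([\d.]+)\s*\]")

def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + float(s)

lines = [
    "[ 00:00:00.497 -->  00:00:05.000] A SPEAKER_00",
    "[ 00:00:05.500 -->  00:00:07.250] B SPEAKER_01",
]

T1, T2 = [], []  # start and end timestamps, in seconds
for line in lines:
    m = LINE.search(line)
    if m:
        T1.append(to_seconds(*m.groups()[:3]))
        T2.append(to_seconds(*m.groups()[3:]))
print(T1, T2)  # [0.497, 5.5] [5.0, 7.25]
```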
Hi, first of all, thank you for this repo.
I was wondering if it's possible to use the diarization pipeline to fetch the audio segments of discrete speakers. It would immensely help if there's a way to do this.