
Small improvements to word boundary events #25

Merged

Conversation


@tuzz tuzz commented Sep 24, 2024

After digging into this issue I found that text_offset is set to -1 (u32::MAX) when the text doesn't exactly match a substring in the SSML. This also means we can't reliably extract the text by character indexes, so we call the C API to do that instead.

We can’t rely on slicing the text from text_offset and word_length since
text_offset might be None. Therefore, extract the text explicitly using
the C API and set it on the event. This is consistent with the Python SDK.
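The sentinel handling described above can be sketched in isolation. This is a minimal illustration, not the SDK's code; offset_to_option is a hypothetical helper showing the mapping the PR applies to text_offset:

```rust
fn main() {
    // The C API reports -1 when the offset is unreliable; cast to u32 it
    // becomes u32::MAX (two's complement reinterpretation).
    let raw: i32 = -1;
    assert_eq!(raw as u32, u32::MAX);

    // Hypothetical helper: map the sentinel to Option, as the PR does.
    fn offset_to_option(raw: u32) -> Option<u32> {
        if raw == u32::MAX { None } else { Some(raw) }
    }

    assert_eq!(offset_to_option(u32::MAX), None);
    assert_eq!(offset_to_option(3), Some(3));
}
```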
@adambezecny
Contributor

hi Chris,

thank you for the pull request. Give me some time to review. I'll come back to you as soon as possible.

regards,

Adam

@tuzz
Contributor Author

tuzz commented Sep 24, 2024

No problem. Happy to answer any questions. If it's useful, I also tried to handle the case when text_offset=None in my application. Here's my parsing code, although it probably doesn't belong in the SDK.

timestamp.rs
use serde::Serialize;
use cognitive_services_speech_sdk_rs as azure;

#[derive(Serialize, Debug)]
pub struct Timestamp {
    start_index: u32,
    end_index: u32,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    start_time: u64,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    duration: u64,
    text: String,
    ssml: String,
}

impl Timestamp {
    pub fn parse_azure_word_boundary_events(events: &[azure::speech::SpeechSynthesisWordBoundaryEvent], ssml: &str) -> Vec<Self> {
        let mut timestamps: Vec<Self> = vec![];

        for (i, event) in events.iter().enumerate() {
            let azure::speech::SpeechSynthesisWordBoundaryEvent { text_offset, word_length, audio_offset, duration_ms, boundary_type, text, .. } = event;

            // If the event's text is "cat's" but the SSML contained "cat&apos;s" then the event's text_offset will be None.
            let (start_index, end_index) = match text_offset {
                Some(offset) => (*offset, offset + word_length),
                None => Self::next_word_indexes(&timestamps, ssml, events[i + 1..].iter().find_map(|e| e.text_offset)),
            };

            let timestamp = Timestamp {
                start_index,
                end_index,
                start_time: *audio_offset,
                duration: *duration_ms,
                text: text.to_string(),
                ssml: ssml.chars().skip(start_index as usize).take((end_index - start_index) as usize).collect(),
            };

            match boundary_type {
                azure::common::SpeechSynthesisBoundaryType::PunctuationBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::SentenceBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::WordBoundary => timestamps.push(timestamp),
            }
        }

        timestamps
    }

    // Scan the SSML for the trimmed sequence after the end of the previous word and before start of the next word.
    fn next_word_indexes(timestamps: &[Timestamp], ssml: &str, start_of_next_word: Option<u32>) -> (u32, u32) {
        let end_of_prev_word = timestamps.last().map_or(0, |previous| previous.end_index);
        // Note: str::find returns a byte index; convert it to a char index so it
        // lines up with the char-based offsets used everywhere else.
        let start_of_next_word = start_of_next_word
            .or_else(|| ssml.find("</voice>").map(|i| ssml[..i].chars().count() as u32))
            .unwrap_or_else(|| ssml.chars().count() as u32);

        let in_between_len = start_of_next_word - end_of_prev_word;
        let mut in_between = ssml.chars().enumerate().skip(end_of_prev_word as usize).take(in_between_len as usize);

        let start_of_this_word = in_between.find(|(_, c)| !c.is_whitespace()).map_or(end_of_prev_word, |(i, _)| i as u32);
        let end_of_this_word = in_between.filter(|(_, c)| !c.is_whitespace()).last().map_or(start_of_next_word, |(i, _)| i as u32 + 1);

        (start_of_this_word, end_of_this_word)
    }

    fn amend(timestamp: Timestamp, timestamps: &mut Vec<Self>) {
        if let Some(previous) = timestamps.last_mut() {
            previous.end_index = timestamp.end_index;
            previous.duration += timestamp.duration;
            previous.text.push_str(&timestamp.text);
            previous.ssml.push_str(&timestamp.ssml);
        } else {
            timestamps.push(timestamp);
        }
    }

    // Keep start_time and duration as u64 ticks (100-nanosecond units) to avoid
    // floating point addition. Serialize to seconds at the end.
    fn to_seconds_f64<S>(ticks: &u64, serializer: S) -> Result<S::Ok, S::Error> where S: serde::Serializer {
        serializer.serialize_f64(*ticks as f64 / 10_000_000.0)
    }
}

@adambezecny
Contributor

hi again,

sorry for the delayed responses, I am pretty busy with my current projects right now :( It would really help if you could prepare an example demonstrating this feature. Have a look at the examples/synthesizer folder of this project; it would be great if we had a meaningful example for this.

I must admit it has been a long time since I actively worked with this library. I really need to see a concrete example demonstrating this to understand what we achieve by merging this feature.

thanks!

@adambezecny
Contributor

hi Chris,

weekend is here and I finally got some time to have a look at this. Please note this lib is a port of the Go library.

I made it work today, and with some minor tweaks of this example I was able to synthesize this string into a wav file: my cat's tail is rather long

result below:

Enter some text that you want to speak, or enter empty text to exit.
> my cat's tail is rather long
Synthesis started.

{handle:0x7fa3bc000bd0 AudioOffset:500000 Duration:200ms TextOffset:0 WordLength:2 Text:my BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:2625000 Duration:387.5ms TextOffset:3 WordLength:5 Text:cat's BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:6500000 Duration:425ms TextOffset:9 WordLength:4 Text:tail BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:11500000 Duration:225ms TextOffset:14 WordLength:2 Text:is BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:13750000 Duration:325ms TextOffset:17 WordLength:6 Text:rather BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:17125000 Duration:437.5ms TextOffset:24 WordLength:4 Text:long BoundaryType:0}
Synthesizing, audio chunk size 37134.
Synthesizing, audio chunk size 32804.
Synthesizing, audio chunk size 5506.
Synthesizing, audio chunk size 2740.
Read [78000] bytes from audio data stream.
Enter some text that you want to speak, or enter empty text to exit.
> Synthesized, audio length 78046.
^Csignal: interrupt

now back to the Rust stuff. I have extended the current examples, see here: 4802be0

the current version does not have Text and produces something like this (when running audio_data_stream::run_example().await;):

C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>cargo run --example synthesizer                                                                                                                       
   Compiling cognitive-services-speech-sdk-rs v1.0.4 (C:\Users\adamb\dev\cognitive-services-speech-sdk-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 1.17s                                                                                                                                            
     Running `target\debug\examples\synthesizer.exe`
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: 0, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: 3, word_length: 5, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8395b80, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: 9, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c83962d0, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: 14, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: 17, word_length: 6, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: 24, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] example finished!

C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>

Your version produces this:

[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a2474776b0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: Some(0), word_length: 2, boundary_type: WordBoundary, text: "my" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: Some(3), word_length: 5, boundary_type: WordBoundary, text: "cat's" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: Some(9), word_length: 4, boundary_type: WordBoundary, text: "tail" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d90820, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: Some(14), word_length: 2, boundary_type: WordBoundary, text: "is" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d904e0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: Some(17), word_length: 6, boundary_type: WordBoundary, text: "rather" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d908f0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: Some(24), word_length: 4, boundary_type: WordBoundary, text: "long" }
[2024-10-05T19:32:10Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:32:10Z INFO  synthesizer::audio_data_stream] example finished!

having Text in the event is definitely beneficial (and consistent with the latest Go version, I like that) but somehow I cannot succeed with an SSML string. When I do something like this (i.e. use an SSML string and replace speak_text_async with speak_ssml_async):

use super::helpers;
use log::*;

/// demonstrates how to store synthesized data easily via the Audio Data Stream abstraction
#[allow(dead_code)]
pub async fn run_example() {
    info!("---------------------------------------------------");
    info!("running audio_data_stream example...");
    info!("---------------------------------------------------");

    //let text = "my cat's tail is rather long";
    let text = "<speak xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' version='1.0' xml:lang='en-US'><voice name='en-GB-George'>my cat&apos;s tail is rather long</voice></speak>";

    let (mut speech_synthesize, _) = helpers::speech_synthesizer();

    helpers::set_callbacks(&mut speech_synthesize);

    match speech_synthesize.speak_ssml_async(text).await {
        Err(err) => error!("speak_ssml_async error {:?}", err),
        Ok(result) => {
            info!("got result!");
            helpers::save_wav("c:/tmp/output2.wav", result).await;
        }
    }

    info!("example finished!");
}

I get an empty file and no events:

[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] example finished!

which brings me to my original question: could you prepare a simple example (ideally an analogy of synthesizer/speak_text_async.rs) that demonstrates the problem you are describing in your PR? thanks

@tuzz
Contributor Author

tuzz commented Oct 5, 2024

Hi @adambezecny, thanks for looking into this. Sorry I haven't been responsive as I'm on annual leave at the moment - I'll take a look properly as soon as I'm back.

To quickly summarise the problem: it happens when the SSML contains escape sequences like &apos;. The text associated with the event is correctly decoded (the entity comes back as an apostrophe), which is great, but the text_offset field is incorrect. I think the Azure SDK is trying to convey that the timestamp doesn't relate to an exact substring of the SSML, and it signals that by returning -1 (which is cast to an unsigned int, hence text_offset is set to u32::MAX).

I attempted to recover the correct offset (rather than u32::MAX) in the code snippet in my comment above. I wasn't sure whether to add this to the cognitive-services-speech-sdk-rs repository because it's custom code that I wrote that probably isn't implemented in the other SDK wrappers (I haven't checked the Go one, but Python just returns -1).

I think my code is correctly figuring out the right text_offset and word_length for this edge case, but I haven't tested it extensively so it might not be ready for inclusion yet if we did want to go down that route.
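The mismatch described above can be reproduced in isolation. A minimal sketch using the "cat's" example from earlier in the thread (the string literals are illustrative, trimmed down from the full SSML document):

```rust
fn main() {
    // SSML containing an XML entity; the synthesizer reports the decoded word "cat's".
    let ssml = "my cat&apos;s tail";
    let text = "cat's";

    // The event's decoded text does not appear verbatim in the SSML...
    assert!(!ssml.contains(text));

    // ...so slicing the SSML by (text_offset = 3, word_length = 5) yields the
    // wrong substring, which is why the service reports -1 instead of an offset.
    let sliced: String = ssml.chars().skip(3).take(5).collect();
    assert_eq!(sliced, "cat&a");
}
```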

@adambezecny
Contributor

OK, please provide a working example where this issue is demonstrated. As stated above, I was not able to replicate it; probably I am just doing something wrong. Ideally add a new example into the existing examples. I would like to test it with the code in main, then with your branch, and see the difference.

in general:

  • adding text into the event is great and I will definitely merge it
  • the other change, I'm not sure yet, but probably not; I want to keep it consistent with the other SDKs
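If text_offset stays a plain u32 (as the second bullet suggests), a consumer can still detect the unreliable case by checking for the sentinel. A minimal sketch; has_reliable_offset is a hypothetical helper, not part of the SDK:

```rust
// Hypothetical consumer-side check when text_offset remains a plain u32.
fn has_reliable_offset(text_offset: u32) -> bool {
    // The service returns -1 (u32::MAX after the cast) when the event's text
    // does not match an exact substring of the SSML.
    text_offset != u32::MAX
}

fn main() {
    assert!(has_reliable_offset(3));       // a normal offset
    assert!(!has_reliable_offset(u32::MAX)); // the sentinel: fall back to the text field
}
```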


/// Event passed into speech synthetizer's callback set_synthesizer_word_boundary_cb.
#[derive(Debug)]
pub struct SpeechSynthesisWordBoundaryEvent {
    pub handle: SmartHandle<SPXEVENTHANDLE>,
    pub audio_offset: u64,
    pub duration_ms: u64,
-   pub text_offset: u32,
+   pub text_offset: Option<u32>,
Contributor

I would stick with u32 and -1 to keep it simple and consistent with the other SDKs

@@ -36,11 +38,22 @@ impl SpeechSynthesisWordBoundaryEvent {
);
convert_err(ret, "SpeechSynthesisWordBoundaryEvent::from_handle error")?;

// The text_offset is set to -1 (u32::MAX) if the event's text
Contributor

this comment does not really belong here. Also, I would not do the logic below that wraps the value into an Option

#[cfg(target_os = "windows")]
let boundary_type = SpeechSynthesisBoundaryType::from_i32(boundary_type);
#[cfg(not(target_os = "windows"))]
let boundary_type = SpeechSynthesisBoundaryType::from_u32(boundary_type);

let c_text = synthesizer_event_get_text(handle);
let text = CStr::from_ptr(c_text).to_str()?.to_owned();
let ret = property_bag_free_string(c_text);
Contributor

adding text is ok, I will definitely merge it; the Go SDK already has it as well

@adambezecny adambezecny merged commit dbdac3c into jabber-tools:main Oct 20, 2024
@adambezecny
Contributor

@tuzz

hi, I have just released version 1.0.5, which contains the added text property. In the end I did not include the Option stuff; it seemed too specific. Anyway, thanks for the contribution!

regards,

Adam

@tuzz
Contributor Author

tuzz commented Oct 20, 2024

Great, thank you!
