
Small improvements to word boundary events #25

Merged

Conversation


@tuzz tuzz commented Sep 24, 2024

After digging into this issue I found that text_offset is set to -1 (u32::MAX) when the text doesn't exactly match a substring in the SSML. This also means we can't reliably extract the text by character indexes, so we call the C API to do that instead.

We can’t rely on slicing the text from text_offset and word_length since
text_offset might be None. Therefore, extract the text explicitly using
the C API and set it on the event. This is consistent with the Python SDK.
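The sentinel handling described above can be sketched in isolation. This is a minimal illustration, not the SDK's code; offset_to_option is a hypothetical helper showing the mapping the PR applies to text_offset:

```rust
fn main() {
    // The C API reports -1 when the offset is unreliable; cast to u32 it
    // becomes u32::MAX (two's complement reinterpretation).
    let raw: i32 = -1;
    assert_eq!(raw as u32, u32::MAX);

    // Hypothetical helper: map the sentinel to Option, as the PR does.
    fn offset_to_option(raw: u32) -> Option<u32> {
        if raw == u32::MAX { None } else { Some(raw) }
    }

    assert_eq!(offset_to_option(u32::MAX), None);
    assert_eq!(offset_to_option(3), Some(3));
}
```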
@adambezecny
Contributor

hi Chris,

thank you for the pull request. Give me some time to review. I'll come back to you as soon as possible.

regards,

Adam

@tuzz
Contributor Author

tuzz commented Sep 24, 2024

No problem. Happy to answer any questions. If it's useful, I also tried to handle the case when text_offset=None in my application. Here's my parsing code, although it probably doesn't belong in the SDK.

timestamp.rs
use serde::Serialize;
use cognitive_services_speech_sdk_rs as azure;

#[derive(Serialize, Debug)]
pub struct Timestamp {
    start_index: u32,
    end_index: u32,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    start_time: u64,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    duration: u64,
    text: String,
    ssml: String,
}

impl Timestamp {
    pub fn parse_azure_word_boundary_events(events: &[azure::speech::SpeechSynthesisWordBoundaryEvent], ssml: &str) -> Vec<Self> {
        let mut timestamps: Vec<Self> = vec![];

        for (i, event) in events.iter().enumerate() {
            let azure::speech::SpeechSynthesisWordBoundaryEvent { text_offset, word_length, audio_offset, duration_ms, boundary_type, text, .. } = event;

            // If the event's text is "cat's" but the SSML contained "cat&apos;s" then the event's text_offset will be None.
            let (start_index, end_index) = match text_offset {
                Some(offset) => (*offset, offset + word_length),
                None => Self::next_word_indexes(&timestamps, ssml, events[i + 1..].iter().find_map(|e| e.text_offset)),
            };

            let timestamp = Timestamp {
                start_index,
                end_index,
                start_time: *audio_offset,
                duration: *duration_ms,
                text: text.to_string(),
                ssml: ssml.chars().skip(start_index as usize).take((end_index - start_index) as usize).collect(),
            };

            match boundary_type {
                azure::common::SpeechSynthesisBoundaryType::PunctuationBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::SentenceBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::WordBoundary => timestamps.push(timestamp),
            }
        }

        timestamps
    }

    // Scan the SSML for the trimmed sequence after the end of the previous word and before start of the next word.
    fn next_word_indexes(timestamps: &[Timestamp], ssml: &str, start_of_next_word: Option<u32>) -> (u32, u32) {
        let end_of_prev_word = timestamps.last().map_or(0, |previous| previous.end_index);
        // Note: str::find returns a byte index; convert it to a char index so it
        // lines up with the char-based offsets used everywhere else.
        let start_of_next_word = start_of_next_word
            .or_else(|| ssml.find("</voice>").map(|i| ssml[..i].chars().count() as u32))
            .unwrap_or_else(|| ssml.chars().count() as u32);

        let in_between_len = start_of_next_word - end_of_prev_word;
        let mut in_between = ssml.chars().enumerate().skip(end_of_prev_word as usize).take(in_between_len as usize);

        let start_of_this_word = in_between.find(|(_, c)| !c.is_whitespace()).map_or(end_of_prev_word, |(i, _)| i as u32);
        let end_of_this_word = in_between.filter(|(_, c)| !c.is_whitespace()).last().map_or(start_of_next_word, |(i, _)| i as u32 + 1);

        (start_of_this_word, end_of_this_word)
    }

    fn amend(timestamp: Timestamp, timestamps: &mut Vec<Self>) {
        if let Some(previous) = timestamps.last_mut() {
            previous.end_index = timestamp.end_index;
            previous.duration += timestamp.duration;
            previous.text.push_str(&timestamp.text);
            previous.ssml.push_str(&timestamp.ssml);
        } else {
            timestamps.push(timestamp);
        }
    }

    // Keep start_time and duration as u64 ticks (100-nanosecond units) to avoid
    // floating point addition. Serialize to seconds at the end.
    fn to_seconds_f64<S>(ticks: &u64, serializer: S) -> Result<S::Ok, S::Error> where S: serde::Serializer {
        serializer.serialize_f64(*ticks as f64 / 10_000_000.0)
    }
}

@adambezecny
Contributor

hi again,

sorry for the delayed responses, I am pretty busy with my current projects right now :( It would really help if you could prepare an example demonstrating this feature. Have a look at the examples/synthesizer folder of this project; it would be great if we had a meaningful example for this.

I must admit it has been a long time since I actively worked with this library. I really need to see a concrete example demonstrating this to understand what we achieve by merging this feature.

thanks!

@adambezecny
Contributor

hi Chris,

weekend is here and I finally got some time to have a look at this. Please note this lib is a port of the Go library.

I made it work today, and with some minor tweaks of this example I was able to synthesize this string into a wav file: my cat's tail is rather long

result below:

Enter some text that you want to speak, or enter empty text to exit.
> my cat's tail is rather long
Synthesis started.

{handle:0x7fa3bc000bd0 AudioOffset:500000 Duration:200ms TextOffset:0 WordLength:2 Text:my BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:2625000 Duration:387.5ms TextOffset:3 WordLength:5 Text:cat's BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:6500000 Duration:425ms TextOffset:9 WordLength:4 Text:tail BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:11500000 Duration:225ms TextOffset:14 WordLength:2 Text:is BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:13750000 Duration:325ms TextOffset:17 WordLength:6 Text:rather BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:17125000 Duration:437.5ms TextOffset:24 WordLength:4 Text:long BoundaryType:0}
Synthesizing, audio chunk size 37134.
Synthesizing, audio chunk size 32804.
Synthesizing, audio chunk size 5506.
Synthesizing, audio chunk size 2740.
Read [78000] bytes from audio data stream.
Enter some text that you want to speak, or enter empty text to exit.
> Synthesized, audio length 78046.
^Csignal: interrupt

now back to the Rust stuff. I have extended the current examples, see here: 4802be0

the current version does not have Text and produces something like this (when running audio_data_stream::run_example().await;):

C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>cargo run --example synthesizer                                                                                                                       
   Compiling cognitive-services-speech-sdk-rs v1.0.4 (C:\Users\adamb\dev\cognitive-services-speech-sdk-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 1.17s                                                                                                                                            
     Running `target\debug\examples\synthesizer.exe`
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: 0, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: 3, word_length: 5, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8395b80, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: 9, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c83962d0, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: 14, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: 17, word_length: 6, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: 24, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] example finished!

C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>

Your version produces this:

[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a2474776b0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: Some(0), word_length: 2, boundary_type: WordBoundary, text: "my" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: Some(3), word_length: 5, boundary_type: WordBoundary, text: "cat's" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: Some(9), word_length: 4, boundary_type: WordBoundary, text: "tail" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d90820, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: Some(14), word_length: 2, boundary_type: WordBoundary, text: "is" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d904e0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: Some(17), word_length: 6, boundary_type: WordBoundary, text: "rather" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d908f0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: Some(24), word_length: 4, boundary_type: WordBoundary, text: "long" }
[2024-10-05T19:32:10Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:32:10Z INFO  synthesizer::audio_data_stream] example finished!

having Text in the event is definitely beneficial (and consistent with the latest Go version, I like that) but somehow I cannot succeed with an SSML string. When I do something like this (i.e. use an SSML string and replace speak_text_async with speak_ssml_async):

use super::helpers;
use log::*;

/// demonstrates how to store synthesized data easily via the Audio Data Stream abstraction
#[allow(dead_code)]
pub async fn run_example() {
    info!("---------------------------------------------------");
    info!("running audio_data_stream example...");
    info!("---------------------------------------------------");

    //let text = "my cat's tail is rather long";
    let text = "<speak xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' version='1.0' xml:lang='en-US'><voice name='en-GB-George'>my cat&apos;s tail is rather long</voice></speak>";

    let (mut speech_synthesize, _) = helpers::speech_synthesizer();

    helpers::set_callbacks(&mut speech_synthesize);

    match speech_synthesize.speak_ssml_async(text).await {
        Err(err) => error!("speak_ssml_async error {:?}", err),
        Ok(result) => {
            info!("got result!");
            helpers::save_wav("c:/tmp/output2.wav", result).await;
        }
    }

    info!("example finished!");
}

I get an empty file and no events:

[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] example finished!

which brings me to my original question: could you prepare a simple example (ideally an analogy of synthesizer/speak_text_async.rs) that demonstrates the problem you are describing in your PR? thanks

@tuzz
Contributor Author

tuzz commented Oct 5, 2024

Hi @adambezecny, thanks for looking into this. Sorry I haven't been responsive as I'm on annual leave at the moment - I'll take a look properly as soon as I'm back.

To quickly summarise the problem: it happens when the SSML contains escape sequences like &apos;. The text associated with the event is correctly decoded (the entity comes back as an apostrophe), which is great, but the text_offset field is incorrect. I think the Azure SDK is trying to convey that the timestamp doesn't relate to an exact substring of the SSML, and it signals that by returning -1 (which is cast to an unsigned int, hence text_offset is set to u32::MAX).

I attempted to recover the correct offset (rather than u32::MAX) in the code snippet in my comment above. I wasn't sure whether to add this to the cognitive-services-speech-sdk-rs repository because it's custom code that I wrote that probably isn't implemented in the other SDK wrappers (I haven't checked the Go one, but Python just returns -1).

I think my code is correctly figuring out the right text_offset and word_length for this edge case, but I haven't tested it extensively so it might not be ready for inclusion yet if we did want to go down that route.
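The mismatch described above can be reproduced in isolation. A minimal sketch using the "cat's" example from earlier in the thread (the string literals are illustrative, trimmed down from the full SSML document):

```rust
fn main() {
    // SSML containing an XML entity; the synthesizer reports the decoded word "cat's".
    let ssml = "my cat&apos;s tail";
    let text = "cat's";

    // The event's decoded text does not appear verbatim in the SSML...
    assert!(!ssml.contains(text));

    // ...so slicing the SSML by (text_offset = 3, word_length = 5) yields the
    // wrong substring, which is why the service reports -1 instead of an offset.
    let sliced: String = ssml.chars().skip(3).take(5).collect();
    assert_eq!(sliced, "cat&a");
}
```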

@adambezecny
Contributor

OK, please provide a working example where this issue is demonstrated. As stated above, I was not able to replicate it; probably I am just doing something wrong. Ideally add a new example into the existing examples. I would like to test it with the code in main, then with your branch, and see the difference.

in general:

  • adding text into the event is great and I will definitely merge it
  • the other change, I'm not sure yet, but probably not; I want to keep it consistent with the other SDKs
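If text_offset stays a plain u32 (as the second bullet suggests), a consumer can still detect the unreliable case by checking for the sentinel. A minimal sketch; has_reliable_offset is a hypothetical helper, not part of the SDK:

```rust
// Hypothetical consumer-side check when text_offset remains a plain u32.
fn has_reliable_offset(text_offset: u32) -> bool {
    // The service returns -1 (u32::MAX after the cast) when the event's text
    // does not match an exact substring of the SSML.
    text_offset != u32::MAX
}

fn main() {
    assert!(has_reliable_offset(3));       // a normal offset
    assert!(!has_reliable_offset(u32::MAX)); // the sentinel: fall back to the text field
}
```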


/// Event passed into speech synthetizer's callback set_synthesizer_word_boundary_cb.
#[derive(Debug)]
pub struct SpeechSynthesisWordBoundaryEvent {
    pub handle: SmartHandle<SPXEVENTHANDLE>,
    pub audio_offset: u64,
    pub duration_ms: u64,
-   pub text_offset: u32,
+   pub text_offset: Option<u32>,
Contributor

I would stick with u32 and -1 to keep it simple and consistent with the other SDKs

@@ -36,11 +38,22 @@ impl SpeechSynthesisWordBoundaryEvent {
);
convert_err(ret, "SpeechSynthesisWordBoundaryEvent::from_handle error")?;

// The text_offset is set to -1 (u32::MAX) if the event's text
Contributor

this comment does not really belong here. Also, I would not do the logic below that wraps the value into an Option

#[cfg(target_os = "windows")]
let boundary_type = SpeechSynthesisBoundaryType::from_i32(boundary_type);
#[cfg(not(target_os = "windows"))]
let boundary_type = SpeechSynthesisBoundaryType::from_u32(boundary_type);

let c_text = synthesizer_event_get_text(handle);
let text = CStr::from_ptr(c_text).to_str()?.to_owned();
let ret = property_bag_free_string(c_text);
Contributor

adding text is ok, I will definitely merge it; the Go SDK already has it as well

@adambezecny adambezecny merged commit dbdac3c into jabber-tools:main Oct 20, 2024
@adambezecny
Contributor

@tuzz

hi, I have just released version 1.0.5, which contains the added text property. In the end I did not include the Option stuff; it seemed too specific. Anyway, thanks for the contribution!

regards,

Adam

@tuzz
Contributor Author

tuzz commented Oct 20, 2024

Great, thank you!
