Small improvements to word boundary events #25
Conversation
We can’t rely on slicing the text from text_offset and word_length since text_offset might be None. Therefore, extract the text explicitly using the C API and set it on the event. This is consistent with the Python SDK.
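To make the failure mode concrete, here is a minimal, self-contained sketch of why slicing by offset breaks. The helper names are mine, not the SDK's; the only assumption taken from this PR is that the C API reports `text_offset` as a `u32` whose `-1` sentinel (`u32::MAX`) means "no usable offset":

```rust
// Hypothetical helpers mirroring the PR's approach (names are mine).
// The C API reports text_offset as a u32, where u32::MAX (i.e. -1)
// means the word has no usable offset into the SSML.
fn text_offset_opt(raw: u32) -> Option<u32> {
    if raw == u32::MAX { None } else { Some(raw) }
}

// Slicing only works when a real offset is available.
fn slice_word(ssml: &str, offset: u32, len: u32) -> Option<&str> {
    ssml.get(offset as usize..(offset + len) as usize)
}

fn main() {
    let ssml = "my cat&apos;s tail";
    // A normal event: the offset points straight into the SSML.
    assert_eq!(slice_word(ssml, 0, 2), Some("my"));
    // An escaped word: the SDK reports -1, so there is nothing to slice.
    assert_eq!(text_offset_opt(u32::MAX), None);
}
```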
hi Chris, thank you for the pull request. Give me some time to review; I'll come back to you as soon as possible. regards, Adam
No problem. Happy to answer any questions. If it's useful, I also tried to handle the case when `text_offset` is `None`; here's my `timestamp.rs`:

```rust
use serde::Serialize;
use cognitive_services_speech_sdk_rs as azure;

#[derive(Serialize, Debug)]
pub struct Timestamp {
    start_index: u32,
    end_index: u32,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    start_time: u64,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    duration: u64,
    text: String,
    ssml: String,
}

impl Timestamp {
    pub fn parse_azure_word_boundary_events(
        events: &[azure::speech::SpeechSynthesisWordBoundaryEvent],
        ssml: &str,
    ) -> Vec<Self> {
        let mut timestamps: Vec<Self> = vec![];
        for (i, event) in events.iter().enumerate() {
            let azure::speech::SpeechSynthesisWordBoundaryEvent {
                text_offset, word_length, audio_offset, duration_ms, boundary_type, text, ..
            } = event;
            // If the event's text is "cat's" but the SSML contained "cat&apos;s"
            // then the event's text_offset will be None.
            let (start_index, end_index) = match text_offset {
                Some(offset) => (*offset, offset + word_length),
                None => Self::next_word_indexes(
                    &timestamps,
                    ssml,
                    events[i + 1..].iter().find_map(|e| e.text_offset),
                ),
            };
            let timestamp = Timestamp {
                start_index,
                end_index,
                start_time: *audio_offset,
                duration: *duration_ms,
                text: text.to_string(),
                ssml: ssml
                    .chars()
                    .skip(start_index as usize)
                    .take((end_index - start_index) as usize)
                    .collect(),
            };
            match boundary_type {
                azure::common::SpeechSynthesisBoundaryType::PunctuationBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::SentenceBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::WordBoundary => timestamps.push(timestamp),
            }
        }
        timestamps
    }

    // Scan the SSML for the trimmed sequence after the end of the previous
    // word and before the start of the next word.
    fn next_word_indexes(timestamps: &[Timestamp], ssml: &str, start_of_next_word: Option<u32>) -> (u32, u32) {
        let end_of_prev_word = timestamps.last().map_or(0, |previous| previous.end_index);
        let start_of_next_word = start_of_next_word
            .or_else(|| ssml.find("</voice>").map(|i| i as u32))
            .unwrap_or_else(|| ssml.chars().count() as u32);
        let in_between_len = start_of_next_word - end_of_prev_word;
        let mut in_between = ssml
            .chars()
            .enumerate()
            .skip(end_of_prev_word as usize)
            .take(in_between_len as usize);
        let start_of_this_word = in_between
            .find(|(_, c)| !c.is_whitespace())
            .map_or(end_of_prev_word, |(i, _)| i as u32);
        let end_of_this_word = in_between
            .filter(|(_, c)| !c.is_whitespace())
            .last()
            .map_or(start_of_next_word, |(i, _)| i as u32 + 1);
        (start_of_this_word, end_of_this_word)
    }

    fn amend(timestamp: Timestamp, timestamps: &mut Vec<Self>) {
        if let Some(previous) = timestamps.last_mut() {
            previous.end_index = timestamp.end_index;
            previous.duration += timestamp.duration;
            previous.text.push_str(&timestamp.text);
            previous.ssml.push_str(&timestamp.ssml);
        } else {
            timestamps.push(timestamp);
        }
    }

    // Keep start_time and duration as u64 to avoid floating point addition.
    // Serialize to seconds at the end.
    fn to_seconds_f64<S>(seconds: &u64, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        serializer.serialize_f64(*seconds as f64 / 10_000_000.0)
    }
}
```
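As a side note on the division by `10_000_000.0` in `to_seconds_f64`: the Speech SDK reports audio offsets and durations in 100-nanosecond ticks, so 10,000,000 ticks make one second. A stdlib-only sketch of just the conversion (the helper name is mine):

```rust
// The Speech SDK reports audio_offset/duration in 100-ns ticks;
// 10_000_000 ticks equal one second.
fn ticks_to_seconds(ticks: u64) -> f64 {
    ticks as f64 / 10_000_000.0
}

fn main() {
    assert_eq!(ticks_to_seconds(10_000_000), 1.0);
    assert_eq!(ticks_to_seconds(2_500_000), 0.25);
}
```

Summing in `u64` ticks and converting once at serialization time avoids accumulating floating-point error across merged word boundaries.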
hi again, sorry for the delayed responses, I am pretty busy with my current projects right now :( It would really help if you could prepare some example demonstrating this feature; have a look into the examples/synthesizer folder of this project. It would be really great if we had some meaningful example for this. I must admit it has been a long time since I actively worked with this library, so I really need to see a concrete example demonstrating this to understand what we achieve by merging this feature. thanks!
hi Chris, the weekend is here and I finally got some time to have a look at this. Please note this lib is a port of the Go library. I made it work today, and with some minor tweaks of this example I was able to synthesize this string into a wav file: my cat's tail is rather long. Result below:
now back to rust stuff. I have extended the current examples, see here: 4802be0. The current version does not have Text and produces something like this (when running `audio_data_stream::run_example().await;`):
Your version produces this:
having Text in the event is definitely beneficial (and consistent with the latest Go version, I like that), but I somehow cannot succeed with an SSML string. When I do something like this (i.e. use an SSML string and replace `speak_text_async` with `speak_ssml_async`):

```rust
use super::helpers;
use log::*;

/// demonstrates how to store synthesized data easily via the Audio Data Stream abstraction
#[allow(dead_code)]
pub async fn run_example() {
    info!("---------------------------------------------------");
    info!("running audio_data_stream example...");
    info!("---------------------------------------------------");
    //let text = "my cat's tail is rather long";
    let text = "<speak xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' version='1.0' xml:lang='en-US'><voice name='en-GB-George'>my cat's tail is rather long</voice></speak>";
    let (mut speech_synthesize, _) = helpers::speech_synthesizer();
    helpers::set_callbacks(&mut speech_synthesize);
    match speech_synthesize.speak_ssml_async(text).await {
        Err(err) => error!("speak_ssml_async error {:?}", err),
        Ok(result) => {
            info!("got result!");
            helpers::save_wav("c:/tmp/output2.wav", result).await;
        }
    }
    info!("example finished!");
}
```

I get an empty file and no events:
which brings me to my original question: could you prepare a simple example (ideally an analogue of synthesizer/speak_text_async.rs) that demonstrates the problem you are describing in your PR? thanks
Hi @adambezecny, thanks for looking into this. Sorry I haven't been responsive as I'm on annual leave at the moment - I'll take a look properly as soon as I'm back. To quickly summarise the problem: it happens when the SSML contains escape characters like `&apos;`. I attempted to recover the correct offset (rather than leaving it unset), and I think my code is correctly figuring out the right indexes.
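For illustration, a tiny self-contained check (the SSML string is adapted from the example above, the values are otherwise assumed) showing why an escaped word can never be found verbatim in the SSML:

```rust
fn main() {
    // The SSML carries the XML-escaped form of the apostrophe...
    let ssml = "<voice name='en-GB-George'>my cat&apos;s tail is rather long</voice>";
    // ...while the word boundary event reports the unescaped text.
    let event_text = "cat's";
    // So the event text has no matching substring in the SSML, which is
    // the case where the SDK reports text_offset as -1.
    assert!(!ssml.contains(event_text));
}
```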
ok, please provide a working example where this issue is demonstrated. As stated above, I was not able to replicate it; I am probably just doing something wrong. Ideally add a new example alongside the existing examples. I would like to test it with the code in main and then in your branch and see the difference. In general:
```diff
 /// Event passed into speech synthesizer's callback set_synthesizer_word_boundary_cb.
 #[derive(Debug)]
 pub struct SpeechSynthesisWordBoundaryEvent {
     pub handle: SmartHandle<SPXEVENTHANDLE>,
     pub audio_offset: u64,
     pub duration_ms: u64,
-    pub text_offset: u32,
+    pub text_offset: Option<u32>,
```
I would stick with u32 and -1 to keep it simple and consistent with the other SDKs
```diff
@@ -36,11 +38,22 @@ impl SpeechSynthesisWordBoundaryEvent {
     );
     convert_err(ret, "SpeechSynthesisWordBoundaryEvent::from_handle error")?;

+    // The text_offset is set to -1 (u32::MAX) if the event's text
+    // doesn't exactly match a substring in the SSML.
```
this comment does not really belong here. Also, I would not wrap the value in an Option the way the logic below does
```diff
     #[cfg(target_os = "windows")]
     let boundary_type = SpeechSynthesisBoundaryType::from_i32(boundary_type);
     #[cfg(not(target_os = "windows"))]
     let boundary_type = SpeechSynthesisBoundaryType::from_u32(boundary_type);

+    let c_text = synthesizer_event_get_text(handle);
+    let text = CStr::from_ptr(c_text).to_str()?.to_owned();
+    let ret = property_bag_free_string(c_text);
```
adding text is ok, I will definitely merge it; the Go SDK already has it as well
hi, I have just released version 1.0.5, which contains the added text property. In the end I did not include the Option stuff; it seems too specific. Anyway, thanks for the contribution! regards, Adam
Great, thank you!
After digging into this issue I found that text_offset is set to -1 (u32::MAX) when the text doesn't exactly match a substring in the SSML. This also means we can't reliably extract the text by character indexes, so we call the C API to retrieve it instead.
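The extraction itself follows the `CStr::from_ptr(c_text).to_str()?.to_owned()` pattern shown in the diff earlier in the thread. A self-contained sketch using a stand-in C string (in the real code the pointer comes from `synthesizer_event_get_text` and is freed with `property_bag_free_string`):

```rust
use std::ffi::{CStr, CString};

fn main() {
    // Stand-in for the char* the C API would hand back.
    let c_buf = CString::new("cat's").unwrap();
    let ptr = c_buf.as_ptr();
    // Copy the C string into an owned Rust String, as the event
    // constructor does; the real code then frees the C buffer.
    let text = unsafe { CStr::from_ptr(ptr) }.to_str().unwrap().to_owned();
    assert_eq!(text, "cat's");
}
```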