Runtime: words containing non-ASCII characters are concatenated without space #583
Comments
Also, automatic lowercasing is done silently: wenet/runtime/core/utils/string.cc Line 155 in 6042313
|
@xingchensong, please fix this. |
Thank you for your feedback. Here are two solutions:
Any suggestions are welcome :) |
IMHO such formatting issue is out of wenet's scope, therefore I would much prefer option 2. Option 1 (automatically detecting language) has so many corner cases that it's difficult to do it well across all locales. |
The second solution is preferred. Now is the right time to add the PostProcessorInterface, and the PostProcessor here can be named BlankPostProcessor. Is there a way to provide a no-code implementation? What if we do it with regex? |
For the BlankPostProcessor, what about:
detokenized = ''.join(pieces).replace('▁', ' ')
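A C++ counterpart of the one-liner above might look like the following. It is only a minimal sketch: `Detokenize` is a hypothetical helper, not an existing wenet function, and it assumes the pieces are UTF-8 strings that use U+2581 ('▁') as the word-boundary marker. It uses plain substring search rather than regex.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of wenet): concatenates word pieces and
// replaces every word-boundary marker U+2581 with an ASCII space.
std::string Detokenize(const std::vector<std::string>& pieces) {
  static const std::string kMarker = "\xE2\x96\x81";  // UTF-8 bytes of U+2581

  std::string joined;
  for (const auto& piece : pieces) joined += piece;

  std::string out;
  std::size_t pos = 0;
  while (true) {
    const std::size_t hit = joined.find(kMarker, pos);
    if (hit == std::string::npos) {
      out.append(joined, pos, std::string::npos);  // copy the remaining tail
      break;
    }
    out.append(joined, pos, hit - pos);  // copy text before the marker
    out += ' ';                          // marker becomes a space
    pos = hit + kMarker.size();
  }
  return out;
}
```

Like the Python expression, this keeps the leading space produced by the first marker, so a caller may want to trim the result.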
|
All replacement operations can be implemented with regex (via the C++ regex API). A rough outline:
void BlankPostProcessor::Process(const std::string& str) {
  if (wp_start_with_space_symbol_) {
    // first replace all ' ' with '', then replace all '_' with ' '
    // example3: "_i _lo ve _wenet" ==> "_i_love_wenet" ==> " i love wenet"
    // example5: "我 爱 _wenet _very _much" ==> "我爱_wenet_very_much" ==> "我爱 wenet very much"
  } else {
    CheckXXXWord();  // split the string on spaces and check the type of each char
    JoinString();    // join chars according to their types
    // example1: "我 爱 你" ==> "我", "爱", "你" ==> "我爱你"
    // example2: "i love wenet" ==> "i", "love", "wenet" ==> "i love wenet"
    // example4: "我 爱 wenet very much" ==> "我", "爱", "wenet", "very", "much" ==> "我爱wenet very much"
  }
  // lowercase the str according to the configuration
  if (config.return_lower_str) {
    tolower(str);
  }
} |
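The else branch of the outline above (split on spaces, check each token's type, join by type) could be sketched as follows. This is an illustration under stated assumptions, not wenet code: `StartsWithCjk` and `JoinByScript` are hypothetical names, input is assumed to be valid UTF-8, and a token counts as Chinese only if its first code point lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```cpp
#include <sstream>
#include <string>

// Hypothetical check: true if the token's first code point is in
// U+4E00..U+9FFF. Assumes valid UTF-8; those code points take 3 bytes.
bool StartsWithCjk(const std::string& token) {
  if (token.size() < 3) return false;
  const unsigned char b0 = token[0], b1 = token[1], b2 = token[2];
  if ((b0 & 0xF0) != 0xE0) return false;  // not a 3-byte UTF-8 sequence
  const unsigned cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
  return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Hypothetical join: splits on whitespace and inserts a space only between
// two neighbouring tokens that are both non-CJK.
std::string JoinByScript(const std::string& text) {
  std::istringstream in(text);
  std::string token, out;
  bool prev_cjk = true;  // true so that no space precedes the first token
  while (in >> token) {
    const bool cur_cjk = StartsWithCjk(token);
    if (!prev_cjk && !cur_cjk) out += ' ';
    out += token;
    prev_cjk = cur_cjk;
  }
  return out;
}
```

Because the check looks for CJK ideographs rather than for ASCII, a token such as "ää" is treated like any other space-separated word, so "aa ää xx yy" comes back unchanged. Other scripts written without spaces would need their own ranges, which is the locale-dependent part raised earlier in the thread.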
Why is the type of Process's first argument a string rather than a vector? |
it is ConnectionHandler handler(std::move(socket), feature_config_, decode_config_, decode_resource_, postprocess_config_, postprocess_resource_); and we should do post processing in function |
@alumae, is there a blank symbol '▁' before ää in "aa ää xx yy" in your training? |
There is no '▁' symbol in WFST based decoding when LM is integrated. |
Here is our solution.
And we can add a special BlankProcessor if we want to further remove white space between words, which depends on the language.
|
No, as I am using WFST based decoding. |
we move all post-processing related functions to |
The runtime concatenates decoded words that contain non-ASCII characters with their neighbouring words: e.g. "aa ää xx yy" is transformed to "aaääxx yy".
This is caused by the code block starting at wenet/runtime/core/decoder/torch_asr_decoder.cc, line 217 in 6042313.
I understand that this is done in order to output Chinese "words" correctly (i.e., without spaces). However, this should at least be configurable, as currently it breaks wenet runtime for most other languages (i.e. those that have words with non-ASCII characters and where words are separated by spaces in the orthography).
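To make the reported behaviour concrete, here is a minimal sketch of a join rule that would produce exactly this output, together with a switch that makes it configurable. It is a reconstruction from the example above, not a copy of the wenet source: `IsAsciiWord`, `JoinWords` and the `ascii_only_spacing` flag are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if every byte of the word is 7-bit ASCII.
bool IsAsciiWord(const std::string& word) {
  for (unsigned char c : word) {
    if (c > 0x7F) return false;
  }
  return true;
}

// Hypothetical join rule. With ascii_only_spacing == true, a space is
// inserted only between two consecutive ASCII words (the assumed current
// behaviour). With false, every word boundary gets a space.
std::string JoinWords(const std::vector<std::string>& words,
                      bool ascii_only_spacing) {
  std::string sentence;
  bool prev_ascii = false;
  for (std::size_t i = 0; i < words.size(); ++i) {
    const bool cur_ascii = IsAsciiWord(words[i]);
    const bool need_space =
        ascii_only_spacing ? (prev_ascii && cur_ascii) : (i > 0);
    if (need_space) sentence += ' ';
    sentence += words[i];
    prev_ascii = cur_ascii;
  }
  return sentence;
}
```

With the flag on, the words aa, ää, xx, yy come out as "aaääxx yy", matching the report; with it off they come out as "aa ää xx yy".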