Runtime: words containing non-ASCII characters are concatenated without space #583

alumae · 2021-08-27T12:26:08Z

The runtime outputs decoded words containing non-ASCII characters as concatenated with neighbouring words: e.g. "aa ää xx yy" is transformed to "aaääxx yy".

This is caused by the code block starting at

wenet/runtime/core/decoder/torch_asr_decoder.cc

Line 217 in 6042313

bool is_englishword_prev = false;

I understand that this is done in order to output Chinese "words" correctly (i.e., without spaces). However, this should at least be configurable, as currently it breaks wenet runtime for most other languages (i.e. those that have words with non-ASCII characters and where words are separated by spaces in the orthography).

alumae · 2021-08-27T12:54:26Z

Also, automatic lowercasing is done silently:

wenet/runtime/core/utils/string.cc

Line 155 in 6042313

result[i] = tolower(result[i]);

robin1001 · 2021-08-29T05:17:52Z

@xingchensong, please fix this.

xingchensong · 2021-08-29T15:12:19Z

Thank you for your feedback. Here are two possible solutions:

1. Automatically detect the language and adjust the final output accordingly.
- pros: can handle all situations (multilingual, code-switching, etc.)
- cons: detecting the language is costly. To do this, we may need to:
  1. include a 3rd-party library (i.e. link). Unfortunately, most language detection projects are too heavy to port into wenet because they rely on n-grams to reach high detection accuracy.
  2. or traverse the UTF-8 character set and build a map<char, languageID>, which would take a lot of time if we want to support all languages.

2. Always insert a "<space>" after each token and let users decide whether spaces need to be removed.

- pros: greatly reduces the complexity of the code; see the comparison below:

// previous logic of TorchAsrDecoder::UpdateResult
// example1: ['我', '爱', '你'] ==> “我爱你”
// example2: ['i', 'love', 'wenet'] ==> “i love wenet”
// example3: ['_i', '_lo', 've', '_wenet'] ==> “i love wenet”
// example4: ['我', '爱', 'wenet', 'very', 'much'] ==> “我爱wenet very much”
// example5: ['我', '爱', '_wenet', '_very', '_much'] ==> “我爱 wenet very much”
// example6: ['aa', 'ää', 'xx', 'yy'] ==> “aaääxx yy”
for (size_t j = 0; j < hypothesis.size(); j++) {
   std::string word = symbol_table_->Find(hypothesis[j]);
   if (wp_start_with_space_symbol_) {
     path.sentence += word;
     continue;
   }
   bool is_englishword_now = CheckEnglishWord(word);
   if (is_englishword_prev && is_englishword_now) {
     path.sentence += (' ' + word);
   } else {
     path.sentence += (word);
   }
   is_englishword_prev = is_englishword_now;
}
path.sentence = ProcessBlank(path.sentence);

// current logic of TorchAsrDecoder::UpdateResult
// example1: ['我', '爱', '你'] ==> “我 爱 你”
// example2: ['i', 'love', 'wenet'] ==> “i love wenet”
// example3: ['_i', '_lo', 've', '_wenet'] ==> “_i _lo ve _wenet”
// example4: ['我', '爱', 'wenet', 'very', 'much'] ==> “我 爱 wenet very much”
// example5: ['我', '爱', '_wenet', '_very', '_much'] ==> “我 爱 _wenet _very _much”
// example6: ['aa', 'ää', 'xx', 'yy'] ==> “aa ää xx yy”
for (size_t j = 0; j < hypothesis.size(); j++) {
   std::string word = symbol_table_->Find(hypothesis[j]);
   path.sentence += (' ' + word);
}
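The difference between the two loops can be reproduced outside the decoder. The following is a minimal sketch, not the runtime code: `IsAsciiWord`, `JoinAsciiAware` and `JoinWithSpaces` are hypothetical names, and `IsAsciiWord` only assumes what `CheckEnglishWord` does (its body is not shown in this thread). Unlike the loop above, `JoinWithSpaces` does not emit a leading space.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Assumption: a word is "English" only if every byte is an ASCII letter or an
// apostrophe. Any byte above 127 (part of a multi-byte UTF-8 char) disqualifies it.
bool IsAsciiWord(const std::string& word) {
  if (word.empty()) return false;
  for (unsigned char c : word) {
    if (c > 127) return false;
    if (c != '\'' && !std::isalpha(c)) return false;
  }
  return true;
}

// Previous behaviour: a space is inserted only between two ASCII words.
std::string JoinAsciiAware(const std::vector<std::string>& words) {
  std::string sentence;
  bool prev_is_ascii = false;
  for (const std::string& word : words) {
    const bool now_is_ascii = IsAsciiWord(word);
    if (prev_is_ascii && now_is_ascii) sentence += ' ';
    sentence += word;
    prev_is_ascii = now_is_ascii;
  }
  return sentence;
}

// Proposed behaviour: every token is separated by a single space.
std::string JoinWithSpaces(const std::vector<std::string>& words) {
  std::string sentence;
  for (const std::string& word : words) {
    if (!sentence.empty()) sentence += ' ';
    sentence += word;
  }
  return sentence;
}
```

Under these assumptions, `JoinAsciiAware({"aa", "ää", "xx", "yy"})` loses the spaces around "ää" because that word is not pure ASCII, while `JoinWithSpaces` keeps every word boundary and leaves space removal to a later step.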

We then move CheckXXXWord() and ProcessBlank() to a newly defined class, let's say class PostProcessor:

// Interface class
class PostProcessorInterface {
  ...
  // call other functions to do post processing
  virtual void Process(const std::string& str) = 0;
  // function `TorchAsrDecoder::InitPostProcessing()`  will be moved here,
  // this will also simplify the code of `TorchAsrDecoder` and decouple post-processing from the decoder.
  virtual void InitPostProcessing() = 0;
};

// mandarin-english bi-lingual cases, this class will be provided by wenet and used as default postprocessor
class PostProcessorCnEn : public PostProcessorInterface {
  void Process(const std::string& str) override;
  ...
  bool wp_start_with_space_symbol_;
  ItnModel itn_;  // for future-support of inverse text normalizer
};

void PostProcessorCnEn::Process(const std::string& str) {
    if (wp_start_with_space_symbol_) {
      // first replace all ' ' with '', then replace all '_' with ' '
      // example3: “_i _lo ve _wenet” ==> "_i_love_wenet" ==> " i love wenet"
      // example5: “我 爱 _wenet _very _much” ==> "我爱_wenet_very_much" ==> "我爱 wenet very much"
    } else {
      CheckXXXWord();  // split the string on spaces and check the type of each char
      JoinString();  // join chars according to their types
      // example1: “我 爱 你” ==> "我", "爱", "你" ==> "我爱你"
      // example2: “i love wenet” ==> “i", "love", "wenet” ==> "i love wenet"
      // example4: “我 爱 wenet very much” ==> "我", "爱", "wenet", "very", "much" ==> "我爱wenet very much"
    }
    // lower the str according to configurations
    if (config.return_lower_str) {
      tolower(str);
    }
}

// Indo-European languages, multi-lingual cases, this class will be provided by wenet
// and can be selected as actual postprocessor according to configurations.
class PostProcessorIndoEuro : public PostProcessorInterface {
  void Process(const std::string& str) override;
};

void PostProcessorIndoEuro::Process(const std::string& str) {
    // spaces are needed in most Indo-European languages, because spaces are already
    // included in str, we will do nothing here except for lowering the str.
    if (config.return_lower_str) {
      tolower(str);
    }
    // example6:  “aa ää xx yy” ==> "aa ää xx yy"
}

// other cases, this class should be created by users if they meet special cases
// which cannot be handled by `PostProcessorCnEn` and `PostProcessorIndoEuro`
class PostProcessorXXYYZZ : public PostProcessorInterface {
  void Process(const std::string& str) override;
};

void PostProcessorXXYYZZ::Process(const std::string& str) {
    // do anything they want
}

- cons: users need to create a new class for their special cases.

Any suggestions are welcome :)

alumae · 2021-08-29T19:30:50Z

IMHO such a formatting issue is out of wenet's scope, so I would much prefer option 2. Option 1 (automatically detecting the language) has so many corner cases that it's difficult to do well across all locales.

robin1001 · 2021-08-30T01:30:46Z

The second solution is preferred. It's the right time to add the PostProcessorInterface now, and the PostProcessor here can be named BlankPostProcessor. Is there any way to give a no-code implementation? What if we do it with regex?

Mddct · 2021-08-30T01:56:43Z

For the blank post-processing, what about: detokenized = ''.join(pieces).replace('▁', ' ')

xingchensong · 2021-08-30T02:04:06Z

All replacement operations can be implemented with regex (C++ API: std::regex_replace(std::string str, std::regex reg, std::string replace)). But for the outer control logic (i.e., the if-else branch), I think it is difficult to do in regex. Could you help with that?

void BlankPostProcessor::Process(const std::string& str) {
    if (wp_start_with_space_symbol_) {
      // first replace all ' ' with '', then replace all '_' with ' '
      // example3: “_i _lo ve _wenet” ==> "_i_love_wenet" ==> " i love wenet"
      // example5: “我 爱 _wenet _very _much” ==> "我爱_wenet_very_much" ==> "我爱 wenet very much"
    } else {
      CheckXXXWord();  // split the string on spaces and check the type of each char
      JoinString();  // join chars according to their types
      // example1: “我 爱 你” ==> "我", "爱", "你" ==> "我爱你"
      // example2: “i love wenet” ==> “i", "love", "wenet” ==> "i love wenet"
      // example4: “我 爱 wenet very much” ==> "我", "爱", "wenet", "very", "much" ==> "我爱wenet very much"
    }
    // lower the str according to configurations
    if (config.return_lower_str) {
      tolower(str);
    }
}
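The two replacements in the `wp_start_with_space_symbol_` branch can be written with `std::regex_replace`, while the branch itself stays ordinary C++. This is only a sketch: `MergeWordPieces` is a hypothetical helper, and '_' stands in for the space symbol as in the examples above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for the word-piece branch:
// step 1 removes every ' ' between tokens,
// step 2 turns every '_' (word-piece start marker in this thread's examples)
// into a real space.
std::string MergeWordPieces(const std::string& spaced) {
  static const std::regex kSpace(" ");
  static const std::regex kPieceStart("_");
  const std::string merged = std::regex_replace(spaced, kSpace, "");
  return std::regex_replace(merged, kPieceStart, " ");
}
```

For "_i _lo ve _wenet" the intermediate string is "_i_love_wenet" and the result is " i love wenet", which matches example3. The leading space is still present and would need a separate trim.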

Mddct · 2021-08-30T02:20:14Z

Why is the type of Process's first argument a string and not a vector?

xingchensong · 2021-08-30T02:38:13Z

It is path.sentence. PostProcessor should be maintained by ConnectionHandler:

ConnectionHandler handler(std::move(socket), feature_config_, decode_config_, decode_resource_, postprocess_config_, postprocess_resource_);

and we should do post processing in function ConnectionHandler::SerializeResult:
https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L102-L123

robin1001 · 2021-08-30T02:45:47Z

@alumae , is there a blank symbol '▁' before ää in "aa ää xx yy" in your training?

robin1001 · 2021-08-30T02:55:40Z

There is no '▁' symbol in WFST based decoding when LM is integrated.

robin1001 · 2021-08-30T03:54:37Z

Here is our solution.

For decoding without an LM: since '▁' is used as the blank (space) symbol in training, we can directly join all the output pieces and replace '▁' with a white space, just like: detokenized = ''.join(pieces).replace('▁', ' ').

// example1: ['我', '爱', '你'] ==> “我爱你”
// example2: ['▁i', '▁lo', 've', '▁wenet'] ==> “ i love wenet”
// example3: ['我', '爱', '▁wenet', '▁very', '▁much'] ==> “我爱 wenet very much”
// example4: ['▁aa', '▁ää', '▁xx', '▁yy'] ==> “ aa ää xx yy”
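A C++ counterpart of the Python one-liner could look like the sketch below. `Detokenize` is a hypothetical name, not an existing wenet function; it only assumes that the pieces are UTF-8 strings and that '▁' is U+2581.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenate all pieces, then replace each '▁' with ' '.
std::string Detokenize(const std::vector<std::string>& pieces) {
  static const std::string kSpaceSymbol = "\xE2\x96\x81";  // UTF-8 bytes of U+2581
  std::string joined;
  for (const std::string& piece : pieces) joined += piece;

  std::string result;
  std::size_t pos = 0;
  while (true) {
    const std::size_t hit = joined.find(kSpaceSymbol, pos);
    if (hit == std::string::npos) {
      result.append(joined, pos, std::string::npos);  // copy the tail
      break;
    }
    result.append(joined, pos, hit - pos);  // copy the text before the symbol
    result += ' ';
    pos = hit + kSpaceSymbol.size();
  }
  return result;
}
```

As in the examples above, a sentence that begins with a '▁' piece comes out with a leading space.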

For decoding with an LM: there is no '▁' in the output, and words are the output unit. We can simply join all the outputs with white space for all languages.

// example1: ['我', '爱', '你'] ==> “我 爱 你”
// example2: ['i', 'love', 'wenet'] ==> “ i love wenet”
// example3: ['我', '爱', 'wenet', 'very', 'much'] ==> “我 爱 wenet very much”
// example4: ['aa', 'ää', 'xx', 'yy'] ==> “ aa ää xx yy”

And we can add a special BlankProcessor if we want to further remove white space between words, which depends on the language.

// example1: “我 爱 你” ==> “我爱你”
// example2: “ i love wenet” ==> “i love wenet”
// example3: “我 爱 wenet very much” ==> “我爱wenet very much”
// example4: “aa ää xx yy” ==> “aa ää xx yy”
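One way such a BlankProcessor could behave is sketched below. This is an illustration under stated assumptions, not the implementation from the PR: `StartsWithCjk` and `RemoveCjkSpaces` are hypothetical names, and "Chinese" is approximated as a first code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF). A space is kept only when neither neighbouring token starts with such a character.

```cpp
#include <sstream>
#include <string>

// Assumption: a token is treated as Chinese if its first code point is in
// U+4E00..U+9FFF. All of these are 3-byte UTF-8 sequences.
bool StartsWithCjk(const std::string& token) {
  if (token.size() < 3) return false;
  const unsigned char b0 = token[0], b1 = token[1], b2 = token[2];
  if ((b0 & 0xF0) != 0xE0) return false;  // not a 3-byte lead byte
  const unsigned int cp = ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
  return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Split on white space, then re-join; drop the space whenever either side
// of the boundary is a Chinese token.
std::string RemoveCjkSpaces(const std::string& sentence) {
  std::istringstream in(sentence);
  std::string token, result;
  bool prev_is_cjk = false;
  while (in >> token) {
    const bool now_is_cjk = StartsWithCjk(token);
    if (!result.empty() && !prev_is_cjk && !now_is_cjk) result += ' ';
    result += token;
    prev_is_cjk = now_is_cjk;
  }
  return result;
}
```

Because "ää" starts with a 2-byte UTF-8 sequence, it is not treated as Chinese here, so "aa ää xx yy" keeps its spaces. Splitting on white space also drops the leading space seen in example2.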

alumae · 2021-08-30T06:31:43Z

@alumae , is there a blank symbol '▁' before ää in "aa ää xx yy" in your training?

No, as I am using WFST based decoding.

xingchensong · 2021-09-01T02:32:34Z

We moved all post-processing related functions to class PostProcessor; please see PR #597.

xingchensong mentioned this issue Aug 31, 2021

[runtime] post_processor #597

Merged

robin1001 closed this as completed Sep 4, 2021

xingchensong mentioned this issue Sep 26, 2022

The space problem in English ASR model ？ #1460

Closed
