Runtime: words containing non-ASCII characters are concatenated without space #583

Closed
alumae opened this issue Aug 27, 2021 · 14 comments

@alumae

alumae commented Aug 27, 2021

The runtime outputs decoded words containing non-ASCII characters as concatenated with neighbouring words: e.g. "aa ää xx yy" is transformed to "aaääxx yy".

This is caused by the code block starting at

bool is_englishword_prev = false;

I understand that this is done in order to output Chinese "words" correctly (i.e., without spaces). However, this should at least be configurable, as currently it breaks wenet runtime for most other languages (i.e. those that have words with non-ASCII characters and where words are separated by spaces in the orthography).
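To make the failure mode concrete, here is a minimal sketch of that kind of joining logic. It is not the runtime's actual code: `IsAsciiWordSketch` is a hypothetical stand-in for the real English-word check, assumed here to accept only ASCII letters and apostrophes, and `JoinLikeOldRuntimeSketch` is an illustrative name.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the runtime's English-word check (assumption):
// true only if every byte is an ASCII letter or an apostrophe.
bool IsAsciiWordSketch(const std::string& word) {
  if (word.empty()) return false;
  for (unsigned char c : word) {
    bool is_letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    if (!is_letter && c != '\'') return false;
  }
  return true;
}

// Inserts a space only when both the previous and the current word pass
// the ASCII check, mirroring the is_englishword_prev pattern.
std::string JoinLikeOldRuntimeSketch(const std::vector<std::string>& words) {
  std::string sentence;
  bool prev_is_ascii = false;
  for (const std::string& word : words) {
    bool now_is_ascii = IsAsciiWordSketch(word);
    if (prev_is_ascii && now_is_ascii) sentence += ' ';
    sentence += word;
    prev_is_ascii = now_is_ascii;
  }
  return sentence;
}
```

With the words "aa", "ää", "xx", "yy", the word "ää" fails the ASCII check, so no space is emitted on either side of it and the result is "aaääxx yy".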

@alumae
Author

alumae commented Aug 27, 2021

Also, automatic lowercasing is done silently:

result[i] = tolower(result[i]);

@robin1001
Collaborator

@xingchensong, please fix this.

@xingchensong
Member

Thank you for your feedback. Here are two solutions:

  1. Automatically detect language types and adjust the final output according to it

    • pros: can handle all situations (multilingual, code-switch, etc)
    • cons: detecting the language is costly. To do this, we would need to either:
      1. include a 3rd-party library (i.e. link). Unfortunately, most language detection projects are too heavy to port into wenet because they rely on n-grams to reach high detection accuracy.
      2. or traverse the UTF-8 character set and build a map<char, languageID>, which would take a lot of time if we want to support all languages.
  2. Always insert a "<space>" after each token and let users decide whether spaces need to be removed.

    • pros: greatly reduces the complexity of the code; see the comparison below:

      // previous logic of TorchAsrDecoder::UpdateResult
      // example1: ['我', '爱', '你'] ==> “我爱你”
      // example2: ['i', 'love', 'wenet'] ==> “i love wenet”
      // example3: ['_i', '_lo', 've', '_wenet'] ==> “i love wenet”
      // example4: ['我', '爱', 'wenet', 'very', 'much'] ==> “我爱wenet very much”
      // example5: ['我', '爱', '_wenet', '_very', '_much'] ==> “我爱 wenet very much”
      // example6: ['aa', 'ää', 'xx', 'yy'] ==> “aaääxx yy”
      for (size_t j = 0; j < hypothesis.size(); j++) {
         std::string word = symbol_table_->Find(hypothesis[j]);
         if (wp_start_with_space_symbol_) {
           path.sentence += word;
           continue;
         }
         bool is_englishword_now = CheckEnglishWord(word);
         if (is_englishword_prev && is_englishword_now) {
           path.sentence += (' ' + word);
         } else {
           path.sentence += (word);
         }
         is_englishword_prev = is_englishword_now;
      }
      path.sentence = ProcessBlank(path.sentence);
      
      // current logic of TorchAsrDecoder::UpdateResult
      // example1: ['我', '爱', '你'] ==> “我 爱 你”
      // example2: ['i', 'love', 'wenet'] ==> “i love wenet”
      // example3: ['_i', '_lo', 've', '_wenet'] ==> “_i _lo ve _wenet”
      // example4: ['我', '爱', 'wenet', 'very', 'much'] ==> “我 爱 wenet very much”
      // example5: ['我', '爱', '_wenet', '_very', '_much'] ==> “我 爱 _wenet _very _much”
      // example6: ['aa', 'ää', 'xx', 'yy'] ==> “aa ää xx yy”
      for (size_t j = 0; j < hypothesis.size(); j++) {
         std::string word = symbol_table_->Find(hypothesis[j]);
         path.sentence += (' ' + word);
      }

      We then move CheckXXXWord() and ProcessBlank() to a newly defined class, let's say class PostProcessor:

      // Interface class
      class PostProcessorInterface {
        ...
        // call other functions to do post processing
        virtual void Process(const std::string& str) = 0;
        // function `TorchAsrDecoder::InitPostProcessing()`  will be moved here,
        // this will also simplify the code of `TorchAsrDecoder` and decouple post-processing from the decoder.
        virtual void InitPostProcessing() = 0;
      };
      
      // mandarin-english bi-lingual cases, this class will be provided by wenet and used as default postprocessor
      class PostProcessorCnEn : public PostProcessorInterface {
        void Process(const std::string& str) override;
        ...
        bool wp_start_with_space_symbol_;
        ItnModel itn_;  // for future-support of inverse text normalizer
      };
      
      void PostProcessorCnEn::Process(const std::string& str) {
          if (wp_start_with_space_symbol_) {
            // first replace all ' ' with '', then replace all '_' with ' '
            // example3: “_i _lo ve _wenet” ==> "_i_love_wenet" ==> " i love wenet"
            // example5: “我 爱 _wenet _very _much” ==> "我爱_wenet_very_much" ==> "我爱 wenet very much"
          } else {
            CheckXXXWord();  // split the string on spaces and check the type of each char
            JoinString();  // join chars according to their types
            // example1: “我 爱 你” ==> "我", "爱", "你" ==> "我爱你"
            // example2: “i love wenet” ==> “i", "love", "wenet” ==> "i love wenet"
            // example4: “我 爱 wenet very much” ==> "我", "爱", "wenet", "very", "much" ==> "我爱wenet very much"
          }
          // lower the str according to configurations
          if (config.return_lower_str) {
            tolower(str);
          }
      }
      
      // Indo-European languages, multi-lingual cases, this class will be provided by wenet
      // and can be selected as actual postprocessor according to configurations.
      class PostProcessorIndoEuro : public PostProcessorInterface {
        void Process(const std::string& str) override;
      };
      
      void PostProcessorIndoEuro::Process(const std::string& str) {
          // spaces are needed in most Indo-European languages, because spaces are already
          // included in str, we will do nothing here except for lowering the str.
          if (config.return_lower_str) {
            tolower(str);
          }
          // example6:  “aa ää xx yy” ==> "aa ää xx yy"
      }
      
      // other cases, this class should be created by users if they meet special cases
      // which cannot be handled by `PostProcessorCnEn` and `PostProcessorIndoEuro`
      class PostProcessorXXYYZZ : public PostProcessorInterface {
        void Process(const std::string& str) override;
      };
      
      void PostProcessorXXYYZZ::Process(const std::string& str) {
          // do anything they want
      }
    • cons: users need to create a new class for their special cases.
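As a rough illustration of option 2, the sketch below shows how such post-processors might look once made compilable. All class names are hypothetical, and two things differ from the pseudocode above by assumption: `Process` returns the processed string instead of `void` (a `const std::string&` argument cannot be modified in place), and the word-piece branch, ITN, and lowercasing are left out.

```cpp
#include <sstream>
#include <string>

// Hypothetical interface: takes the space-joined decoder output and
// returns the post-processed sentence.
class PostProcessorSketch {
 public:
  virtual ~PostProcessorSketch() = default;
  virtual std::string Process(const std::string& str) const = 0;
};

// For languages whose orthography separates words with spaces:
// the decoder output already contains the spaces, so keep it as-is.
class KeepSpacesSketch : public PostProcessorSketch {
 public:
  std::string Process(const std::string& str) const override { return str; }
};

// Mandarin-English style: keep a space only between two ASCII-only tokens.
class MergeNonAsciiSketch : public PostProcessorSketch {
 public:
  std::string Process(const std::string& str) const override {
    std::istringstream tokens(str);
    std::string token;
    std::string out;
    bool prev_is_ascii = false;
    while (tokens >> token) {  // splits on whitespace
      bool now_is_ascii = IsAscii(token);
      if (prev_is_ascii && now_is_ascii) out += ' ';
      out += token;
      prev_is_ascii = now_is_ascii;
    }
    return out;
  }

 private:
  // True if no byte has the high bit set, i.e. the token is pure ASCII.
  static bool IsAscii(const std::string& s) {
    for (unsigned char c : s) {
      if (c >= 0x80) return false;
    }
    return true;
  }
};
```

Under these assumptions, `MergeNonAsciiSketch` turns "我 爱 wenet very much" into "我爱wenet very much", while `KeepSpacesSketch` leaves "aa ää xx yy" untouched. Which one is used would be selected by configuration.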

Any suggestions are welcome :)

@alumae
Author

alumae commented Aug 29, 2021

IMHO such a formatting issue is out of wenet's scope, so I would much prefer option 2. Option 1 (automatically detecting the language) has so many corner cases that it is difficult to do well across all locales.

@robin1001
Collaborator

The second solution is preferred. It's the right time to add the PostProcessorInterface now, and the PostProcessor here can be named BlankPostProcessor. Is there any way to give a no-code implementation? What if we do it in regex?

@Mddct
Collaborator

Mddct commented Aug 30, 2021 via email

@xingchensong
Member

All replacement operations can be implemented with regex (C++ API: std::regex_replace(std::string str, std::regex reg, std::string replace)). But the outer control logic (i.e., the if-else branch) seems difficult to express in regex. Could you do me a favor?

void BlankPostProcessor::Process(const std::string& str) {
    if (wp_start_with_space_symbol_) {
      // first replace all ' ' with '', then replace all '_' with ' '
      // example3: “_i _lo ve _wenet” ==> "_i_love_wenet" ==> " i love wenet"
      // example5: “我 爱 _wenet _very _much” ==> "我爱_wenet_very_much" ==> "我爱 wenet very much"
    } else {
      CheckXXXWord();  // split the string on spaces and check the type of each char
      JoinString();  // join chars according to their types
      // example1: “我 爱 你” ==> "我", "爱", "你" ==> "我爱你"
      // example2: “i love wenet” ==> “i", "love", "wenet” ==> "i love wenet"
      // example4: “我 爱 wenet very much” ==> "我", "爱", "wenet", "very", "much" ==> "我爱wenet very much"
    }
    // lower the str according to configurations
    if (config.return_lower_str) {
      tolower(str);
    }
}
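For the word-piece branch, the two replacements described in the comments could be written with `std::regex_replace` roughly as follows. This is only a sketch: the function name `MergeWordPiecesSketch` is hypothetical, and it assumes the word-piece marker is the ASCII underscore used in the examples above.

```cpp
#include <regex>
#include <string>

// Step 1: drop every space inserted between tokens.
// Step 2: turn each word-piece marker '_' into a space.
std::string MergeWordPiecesSketch(const std::string& str) {
  static const std::regex kSpace(" ");
  static const std::regex kMarker("_");
  std::string joined = std::regex_replace(str, kSpace, "");
  return std::regex_replace(joined, kMarker, " ");
}
```

For example3, "_i _lo ve _wenet" becomes "_i_love_wenet" and then " i love wenet", with the leading space still present as in the comment above. Both patterns are single ASCII bytes, so the multi-byte UTF-8 characters in example5 pass through unchanged.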

@Mddct
Collaborator

Mddct commented Aug 30, 2021

Why is the type of Process's first argument a string rather than a vector?

@xingchensong
Member

The argument is path.sentence. The PostProcessor should be maintained by ConnectionHandler:

ConnectionHandler handler(std::move(socket), feature_config_, decode_config_, decode_resource_, postprocess_config_, postprocess_resource_);

and we should do post processing in function ConnectionHandler::SerializeResult:
https://github.com/wenet-e2e/wenet/blob/main/runtime/core/websocket/websocket_server.cc#L102-L123

@robin1001
Collaborator

@alumae, is there a blank symbol '▁' before ää in "aa ää xx yy" in your training?

@robin1001
Collaborator

robin1001 commented Aug 30, 2021

For BlankPostProcessor, what if we use: detokenized = ''.join(pieces).replace('▁', ' ')


There is no '▁' symbol in WFST based decoding when LM is integrated.

@robin1001
Collaborator

robin1001 commented Aug 30, 2021

Here is our solution.

  1. For decoding without LM: since '▁' is used as blank in training, we can directly join all the outputs and replace '▁' with white space, just like: detokenized = ''.join(pieces).replace('▁', ' ').
// example1: ['我', '爱', '你'] ==> “我爱你”
// example2: ['▁i', '▁lo', 've', '▁wenet'] ==> “ i love wenet”
// example3: ['我', '爱', '▁wenet', '▁very', '▁much'] ==> “我爱 wenet very much”
// example4: ['▁aa', '▁ää', '▁xx', '▁yy'] ==> “ aa ää xx yy”
  2. For decoding with LM: there is no '▁' in the output, and words are the output unit. We can simply join all the outputs with white space for all languages.
// example1: ['我', '爱', '你'] ==> “我 爱 你”
// example2: ['i', 'love', 'wenet'] ==> “ i love wenet”
// example3: ['我', '爱', 'wenet', 'very', 'much'] ==> “我 爱 wenet very much”
// example4: ['aa', 'ää', 'xx', 'yy'] ==> “ aa ää xx yy”
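
A C++ sketch of these two joining rules might look like the following. The function names are hypothetical, and one detail is an assumption: `JoinWordsSketch` puts spaces only between words, so it does not reproduce the leading space shown in some of the examples above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Without LM: concatenate the pieces, then replace each '▁' (U+2581,
// encoded in UTF-8 as the bytes E2 96 81) with a single space.
std::string JoinPiecesSketch(const std::vector<std::string>& pieces) {
  static const std::string kMarker = "\xE2\x96\x81";
  std::string joined;
  for (const std::string& piece : pieces) joined += piece;

  std::string out;
  std::size_t pos = 0;
  while (true) {
    std::size_t hit = joined.find(kMarker, pos);
    if (hit == std::string::npos) {
      out.append(joined, pos, std::string::npos);  // copy the remainder
      break;
    }
    out.append(joined, pos, hit - pos);  // copy text before the marker
    out += ' ';
    pos = hit + kMarker.size();
  }
  return out;
}

// With LM: the outputs are whole words, so join them with single spaces.
std::string JoinWordsSketch(const std::vector<std::string>& words) {
  std::string out;
  for (std::size_t i = 0; i < words.size(); ++i) {
    if (i > 0) out += ' ';
    out += words[i];
  }
  return out;
}
```

Neither function needs to know the language of the text; any language-specific removal of spaces is left to a later step.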

And we can add a special BlankProcessor if we want to further remove white space between words, which depends on the language.

// example1:  “我 爱 你”==> “我爱你”
// example2: “ i love wenet” ==> “i love wenet”
// example3: “我 爱 wenet very much” ==> “我爱wenet very much”
// example4: “aa ää xx yy” ==> “aa ää xx yy”

@alumae
Author

alumae commented Aug 30, 2021

@alumae , is there a blank symbol '▁' before ää in "aa ää xx yy" in your training?

No, as I am using WFST based decoding.

@xingchensong
Member

We moved all post-processing related functions to the class PostProcessor; please see PR #597.
