[ZH Email] Fix email address not being recognized when the sentence contains Chinese characters. #3185
{
  "Input": "Both a..bc@outlook.com and .abc@hotmail.com are not valid e-mail addresses.",
  "Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",
What is this "Relaxed" option? I am wondering whether we should enable this behavior under the Relaxed option itself.
When the "Relaxed" option is disabled (the default), the emails that were already extracted are finally filtered with a stricter regex that references RFC 5322.
When it is enabled, the already-extracted emails are returned directly, which would break 2 existing test cases.
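As a rough illustration of the strict/relaxed behavior described above, here is a minimal Python sketch. The `STRICT` pattern and the `filter_candidates` helper are hypothetical names made up for this example, and the pattern is only a simplified dot-atom form loosely modeled on RFC 5322; it is not the regex the recognizer actually uses.

```python
import re

# Hypothetical, simplified "dot-atom" local part loosely based on RFC 5322.
# This is an assumption for illustration, not the recognizer's real pattern.
ATOM = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
STRICT = re.compile(rf"{ATOM}(?:\.{ATOM})*@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def filter_candidates(candidates, relaxed=False):
    """Return candidates unchanged when relaxed, else keep only strict matches."""
    if relaxed:
        # Relaxed mode: trust whatever the extraction step produced.
        return list(candidates)
    # Strict mode (default): drop anything the stricter pattern rejects.
    return [c for c in candidates if STRICT.fullmatch(c)]


candidates = ["a..bc@outlook.com", ".abc@hotmail.com", "abc@hotmail.com"]
print(filter_candidates(candidates))                # ['abc@hotmail.com']
print(filter_candidates(candidates, relaxed=True))  # all three candidates
```

In this sketch the two addresses from the test case above are rejected in strict mode because of the consecutive dots and the leading dot, and they only survive when `relaxed=True`.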
These are runtime configuration options. Users can specify which configuration they prefer, depending on their needs.
I don't believe the default value should be changed, as that is a breaking change for users of the default package and this was the default setting agreed with teams using the recognizers at the time.
My proposal was not to enable these settings. Rather, it was to enable detection of email addresses in sentences that contain Chinese characters under the Relaxed option instead of by default.
> My proposal was not to enable these settings. Rather, it was to enable detection of email addresses in sentences that contain Chinese characters under the Relaxed option instead of by default.

The Relaxed option will not solve the issue; the issue is caused by the regex that I updated in this PR.
Two regexes are used to extract emails. For the Chinese sentence "1邮件地址test@test.com", the first regex matches "test@test.com", while the second regex (the one fixed in this PR) matches "1邮件地址test@test.com", so the merged result is "1邮件地址test@test.com". If the Relaxed option is enabled here, it directly returns "1邮件地址test@test.com", which is incorrect; the correct email is "test@test.com".
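The interaction described above can be sketched in Python. Both patterns and the helpers `spans`, `merge_overlapping`, and `extract_merged` are hypothetical stand-ins invented for this example; the real regexes and merge logic in the recognizer may differ. The sketch relies only on the fact that `\w` in Python 3 `str` patterns matches Unicode word characters, including Chinese characters.

```python
import re

# Hypothetical patterns, not the recognizer's actual regexes.
# \w is Unicode-aware here, so the local part can swallow Chinese characters.
UNICODE_AWARE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# ASCII-only character classes stop the match at the first non-ASCII character.
ASCII_ONLY = re.compile(r"[A-Za-z0-9.+_-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def spans(pattern, text):
    """Return (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


def merge_overlapping(all_spans):
    """Union overlapping spans, so the longest covering span wins."""
    merged = []
    for start, end in sorted(all_spans):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract_merged(patterns, text):
    """Run every pattern, merge the spans, and return the matched substrings."""
    found = [s for p in patterns for s in spans(p, text)]
    return [text[a:b] for a, b in merge_overlapping(found)]


text = "1邮件地址test@test.com"
# One over-eager pattern is enough to pollute the merged result.
print(extract_merged([ASCII_ONLY, UNICODE_AWARE], text))  # ['1邮件地址test@test.com']
# With both patterns restricted to ASCII, only the address remains.
print(extract_merged([ASCII_ONLY, ASCII_ONLY], text))     # ['test@test.com']
```

Under these assumptions, skipping the strict filter would not help: the merged span already includes the Chinese prefix, so the fix has to be made in the over-matching pattern itself.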
Note that we currently don't support RFC 6530, which allows non-ASCII characters (such as Chinese or Arabic) in email addresses.
I am not very familiar with this codebase, but my idea was this:
Under the Relaxed option, run the regex without the word separator so that it returns the correct detection for 邮件地址test@test.com, along with all the other existing anomalous behaviors (such as ending with a . or having multiple dots (..) in the email address), and keep the existing default email detection intact so we do not break existing customers.
That said, if this is indeed a bug fix, it should be ok to fix in the regular regex.
The more I think about it, the more I am OK with your approach.
Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
* Fix chinese characters be recognized as email - first commit (#3185)
  Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
* Support Chinese next next week day - first commit (#3184)
  Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
---------
Co-authored-by: Michael <xiaofeng200485@163.com>
Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
Fix email address not being recognized when the sentence contains Chinese characters.
Originally, the recognizer could not recognize the email address "test@test.com" in the sentence "1邮件地址test@test.com"; this PR fixes that.