[ZH Email] Fix email address not be recognized when there are some Chinese characters in the sentence. by MichaelMWW · Pull Request #3185 · microsoft/Recognizers-Text

MichaelMWW · 2024-11-21T10:32:34Z

Fix email addresses not being recognized when the sentence contains Chinese characters.
Previously, the recognizer could not recognize the email address "test@test.com" in the sentence "1邮件地址test@test.com". This PR fixes that.

bidisha-c · 2024-11-25T19:45:08Z

  {
    "Input": "Both a..bc@outlook.com and .abc@hotmail.com are not valid e-mail addresses.",
-	"Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",
+    "Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",


What is this "Relaxed" option? I am wondering whether we should enable this behavior only under the Relaxed option.

When the "Relaxed" option is disabled (the default), the emails already extracted are filtered in a final step by a stricter regex based on RFC 5322.
When it is enabled, the extracted emails are returned directly without that filtering, which breaks 2 existing test cases.
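
The strict/relaxed behavior described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `CANDIDATE`, `STRICT`, and `extract_emails` are hypothetical names, and both patterns are simplified stand-ins rather than the RFC 5322-based regex the recognizer actually uses.

```python
import re

# Hypothetical broad pattern that collects email-like candidates.
CANDIDATE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical, simplified strict check: the local part must not start or
# end with a dot, and the address must not contain consecutive dots.
STRICT = re.compile(
    r"^(?!\.)(?!.*\.\.)[A-Za-z0-9._%+-]+(?<!\.)@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$"
)


def extract_emails(text, relaxed=False):
    """Return candidates as-is when relaxed, otherwise only those passing STRICT."""
    candidates = [m.group(0) for m in CANDIDATE.finditer(text)]
    if relaxed:
        return candidates
    return [c for c in candidates if STRICT.match(c)]


sentence = "Both a..bc@outlook.com and .abc@hotmail.com are not valid e-mail addresses."
strict_result = extract_emails(sentence)
relaxed_result = extract_emails(sentence, relaxed=True)
```

With the test sentence from the diff above, the strict path drops both addresses while the relaxed path returns them, which is the behavior the "Comment" field in the spec describes.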

These are runtime configuration options. Users can specify which configuration they prefer, depending on their needs.

I don't believe the default value should be changed, as that is a breaking change for users of the default package and this was the default setting agreed with teams using the recognizers at the time.

My proposal was not to enable these settings. Instead, it was to make the detection of email addresses in sentences containing Chinese characters available under the Relaxed option rather than by default.

The Relaxed option will not solve the issue. The issue is caused by the regex that I updated in this PR.
Two regexes are used to extract emails. For the Chinese sentence "1邮件地址test@test.com", the first regex matches "test@test.com" and the second regex (the one fixed in this PR) matches "1邮件地址test@test.com", so the merged result is "1邮件地址test@test.com". If the Relaxed option is enabled here, it directly returns "1邮件地址test@test.com", which is incorrect. The correct email is "test@test.com".
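
A rough sketch of how two overlapping matches can merge into the wrong span. All names and patterns here are hypothetical stand-ins, not the regexes in the repository. It assumes the second pattern over-matches because it uses a Unicode-aware `\w`, and that restricting it to ASCII is one possible fix; the PR's actual change may differ.

```python
import re

TEXT = "1邮件地址test@test.com"

# Hypothetical first pattern: ASCII-only email match.
ASCII_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical second pattern: in Python 3, \w on str is Unicode-aware,
# so it also consumes the CJK characters in front of the address.
UNICODE_EMAIL = re.compile(r"[\w.%+-]+@[\w.-]+\.\w{2,}")
# Assumed fix for illustration: the same pattern restricted to ASCII.
FIXED_EMAIL = re.compile(r"[\w.%+-]+@[\w.-]+\.\w{2,}", re.ASCII)


def merge_spans(spans):
    """Merge overlapping (start, end) spans into their union."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract(text, patterns):
    """Collect matches from every pattern and return the merged substrings."""
    spans = [m.span() for p in patterns for m in p.finditer(text)]
    return [text[s:e] for s, e in merge_spans(spans)]


before = extract(TEXT, [ASCII_EMAIL, UNICODE_EMAIL])
after = extract(TEXT, [ASCII_EMAIL, FIXED_EMAIL])
```

In this sketch, `before` contains the whole sentence because the longer span swallows the shorter one during merging, while `after` contains only the address. A relaxed mode that skips the final filter would return the merged span unchanged, which is why it cannot fix the problem.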

Note that we currently don't support RFC 6530, which allows non-ASCII characters (such as Chinese or Arabic) in email addresses.

I am not very familiar with this codebase, but my idea was:

Under the Relaxed option, run the regex without the word separator so that it returns the correct detection for 邮件地址test@test.com, along with all the other existing anomalous behaviors such as an address ending with a . or containing multiple dots (..), and keep the existing default email detection intact so that we do not break existing customers.

That said, if this is indeed a bug fix, it should be ok to fix in the regular regex.

The more I think about it, the more I am OK with your approach.

bidisha-c · 2024-11-26T05:57:59Z

The more I think about it, the more I am OK with your approach.

Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>

* Fix chinese characters be recognized as email - first commit (#3185)
  Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
* Support Chinese next next week day - first commit (#3184)
  Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
---------
Co-authored-by: Michael <xiaofeng200485@163.com>
Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>

Fix chinese characters be recognized as email - first commit

420b6a7

MichaelMWW changed the title ~~Fix email address not be recognized when there are some Chinese characters in the sentence.~~ [ZH Email]Fix email address not be recognized when there are some Chinese characters in the sentence. Nov 21, 2024

bidisha-c reviewed Nov 25, 2024

tellarin changed the title ~~[ZH Email]Fix email address not be recognized when there are some Chinese characters in the sentence.~~ [ZH Email] Fix email address not be recognized when there are some Chinese characters in the sentence. Nov 26, 2024

bidisha-c approved these changes Nov 26, 2024

MichaelMWW merged commit 25afa66 into master Nov 26, 2024

aurghob pushed a commit that referenced this pull request Jan 3, 2025

Fix chinese characters be recognized as email - first commit (#3185)

7379d8b

Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>

MichaelMWW merged 1 commit into master from v-michwang/FixEmailNotRecognizedIssue
