Skip to content

[ZH Email] Fix email address not be recognized when there are some Chinese characters in the sentence.#3185

Merged
MichaelMWW merged 1 commit intomasterfrom
v-michwang/FixEmailNotRecognizedIssue
Nov 26, 2024
Merged

[ZH Email] Fix email address not be recognized when there are some Chinese characters in the sentence.#3185
MichaelMWW merged 1 commit intomasterfrom
v-michwang/FixEmailNotRecognizedIssue

Conversation

@MichaelMWW
Copy link
Copy Markdown
Collaborator

Fix email address not be recognized when there are some Chinese characters in the sentence.
Originally, the recognizer cannot recognize the email address "test@test.com" in sentence "1邮件地址test@test.com", now fix it.

@MichaelMWW MichaelMWW changed the title Fix email address not be recognized when there are some Chinese characters in the sentence. [ZH Email]Fix email address not be recognized when there are some Chinese characters in the sentence. Nov 21, 2024
{
"Input": "Both a..bc@outlook.com and .abc@hotmail.com are not valid e-mail addresses.",
"Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",
"Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this "Relaxed" option? I am thinking if we should enable this with the relaxed option itself?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When "Relaxed" option is disabled (by default), it will finally use a more strict regex that reference RFC5322 to filter emails that already extracted.
When it is enabled, it will directly return the emails that already extracted, it will breaking 2 existing test cases.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are runtime configuration options. Users can specify which configuration they prefer, depending on their needs.

I don't believe the default value should be changed, as that is a breaking change for users of the default package and this was the default setting agreed with teams using the recognizers at the time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal was not to enable these settings. Instead to enable the detection of email address not be recognized when there are some Chinese characters in the sentence as a relaxed option instead of by default

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal was not to enable these settings. Instead to enable the detection of email address not be recognized when there are some Chinese characters in the sentence as a relaxed option instead of by default

Relaxed option will not solve the issue, the issue is caused by the regex that I am updated in this PR.
There are 2 regex used to extract email, in case there is a Chinese sentence "1邮件地址test@test.com", the first regex matches "test@test.com", the 2nd regex (the regex I have fixed in the PR) matches "1邮件地址test@test.com", so the results merged as "1邮件地址test@test.com", if relaxed option is enabled here, it will directly return "1邮件地址test@test.com" which is incorrect, the correct email should be "test@test.com"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to mention that currently we don't support RFC 6530 which allow non-ASCII characters in email address, such as Chinese, Arabic.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with this codebase but my idea was

Using the relaxed option - run the regex without the word separator to return the correct detection for 邮件地址test@test.com along with all other existing anomalous behaviors like ending with a . or having multiple dots (..) in the email address and keep existing email detection intact so we do not break existing customers.

That said, if this is indeed a bug fix, it should be ok to fix in the regular regex.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more that I think, I am ok with your approach.

@tellarin tellarin changed the title [ZH Email]Fix email address not be recognized when there are some Chinese characters in the sentence. [ZH Email] Fix email address not be recognized when there are some Chinese characters in the sentence. Nov 26, 2024
{
"Input": "Both a..bc@outlook.com and .abc@hotmail.com are not valid e-mail addresses.",
"Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",
"Comment": "By default the current system is strict. If a relaxed match is needed (to catch these), enable the Relaxed option.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more that I think, I am ok with your approach.

@MichaelMWW MichaelMWW merged commit 25afa66 into master Nov 26, 2024
aurghob pushed a commit that referenced this pull request Jan 3, 2025
Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
aurghob added a commit that referenced this pull request Jan 6, 2025
* Fix chinese characters be recognized as email - first commit (#3185)

Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>

* Support Chinese next next week day - first commit (#3184)

Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>

---------

Co-authored-by: Michael <xiaofeng200485@163.com>
Co-authored-by: Michael Wang (Centific Technologies Inc) <v-michwang@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants