Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance email regex in keyring #257

Closed
wants to merge 1 commit into from
Closed

Enhance email regex in keyring #257

wants to merge 1 commit into from

Conversation

bratkartoffel
Copy link
Contributor

@bratkartoffel bratkartoffel commented Feb 28, 2022

This new regex better matches the string format for identities as
specified by the RFC. The email address is surrounded by <>.
The new regex correctly parses identities like this:

foo@bar.com <baz@foo.com>

Resulting in baz@foo.com being returned. Without this fix, foo@bar.com
was returned. Furthermore, this new regex should better match
international characters.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 89.775% when pulling f8e3a20 on bratkartoffel:bugfix/enhance-extract-email into 1088b6c on pgpainless:master.

@vanitasvitae
Copy link
Member

vanitasvitae commented Mar 1, 2022

Good finding. Maybe we might want to do something like

        for (String userId : userIds) {
            Matcher matcher = PATTERN_EMAIL_USERID.matcher(userId); // matches "Foo Bar <foo@bar.baz>"
            if (matcher.find()) {
                emails.add(matcher.group(1));
            } else {
            matcher = PATTERN_EMAIL_STANDALONE.matcher(userId); // matches "foo@bar.baz"
            if (matcher.find()) {
                emails.add(matcher.group(1));
            } 
        }

to also match user-ids that look like "foo@bar.baz".
What do you think?

@vanitasvitae
Copy link
Member

vanitasvitae commented Mar 1, 2022

I just noticed, that the new regex would match some illegal email addresses, such as "<john doe@mail.tld>" (contains a space).
Its probably a good idea to adopt a regex like this one which specifically checks for email address with accordance to rfc5322.

^[a-zA-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[a-zA-Z0-9.-]+$

@bratkartoffel
Copy link
Contributor Author

bratkartoffel commented Mar 1, 2022

Hi Paul, thanks for looking into this.

Yes, my regex is pretty non-validating which aims to have as few false-positives as possible. This means that I think it's better to extract some adresses which are not valid instead of excluding exotic correct adresses. My regex should not match spaces (\s is excluded).

Some tests of my regex: https://regex101.com/r/E9wBzS/1

If you play the game through and really like 99.99% of correctness, you end up with a damn hard readable regex: https://www.emailregex.com/

So lets get back a step, what do we really want to do with this method? As far as I can see, this method should extract an address from an identity. If it's valid or not should imho be checked by a specialized library / tool by the calling application (stick to the one-tool-one-task philosophy). This way it's up to the calling application to handle invalid adresses, even with the possiblilty to ignore that. Maybe a comment in javadoc for this method to explain the non-validating behaviour would be good?
So I suggest to keep it as simple as possible and use an regex which extracts almost everything between the < and >, which is a definitve must for this field.

This new regex better matches the string format for identities as
specified by the RFC. The email address is surrounded by <>.
The new regex correctly parses identities like this:

   foo@bar.com <baz@foo.com>

Resulting in baz@foo.com being returned. Without this fix, foo@bar.com
was returned.
vanitasvitae added a commit that referenced this pull request Mar 1, 2022
Based on #257
Thanks @bratkartoffel for the initial proposed changes
@vanitasvitae
Copy link
Member

vanitasvitae commented Mar 1, 2022

I just pushed d55d6a1 which should fix this.

Edit: Sorry, I missed your comment with the explanation for your reasoning. Let me reply to that here.

I tend to disagree to allow matching almost everything between angle brackets. Users are lazy and if there is a method named "getEMailAddresses()" users will likely assume that this method only returns valid email addresses.
Therefore I guess it is better to make sure that only valid email addresses are returned.

If an email application dev notices that this method misses some exotic mail addresses, they can still write a custom function to extract mail addresses themselves.

Anyway, I greatly appreciate your contribution. Thanks for pointing out that the original method was improperly matching some addresses!

@bratkartoffel
Copy link
Contributor Author

Thank you, looking forward for the next release to test this with my application.

@vanitasvitae
Copy link
Member

Thank you, looking forward for the next release to test this with my application.

I just released 1.1.2 to maven central. It should be available in a few hours :)

@bratkartoffel bratkartoffel deleted the bugfix/enhance-extract-email branch March 21, 2022 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants