Issue with Email Addresses #53

Closed
JimmyGalar opened this issue Aug 16, 2021 · 7 comments

Comments

@JimmyGalar

I am using the xurls code to pull out possible urls from a message body string. The urls can be in either strict or relaxed format so I need to use the relaxed method of xurls to find the possible urls in the string. The issue is that email addresses can also be in the string and the relaxed method of xurls is pulling those out too.

For example my string might be:
"Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email John.Smith@test.com or Testing@test.com"

What I would like xurls to do is pull out only http://www.google.com and www.test.com.

Instead it pulls out the two URLs plus John.Sm, test.com, and test.com. Is there anything that can be done so that only URLs are pulled?

@mvdan
Owner

mvdan commented Aug 21, 2021

You can fairly easily filter out emails from the result. For example, if a result contains @ but does not contain ://, it's an email.
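A minimal sketch of that filter, using only the standard library. The function name dropEmails is made up for illustration, and the matches are hard-coded here; in real code they would come from something like xurls.Relaxed().FindAllString(text, -1). It assumes each email address is returned as one whole match.

```go
package main

import (
	"fmt"
	"strings"
)

// dropEmails (hypothetical name) returns the matches that do not look like
// email addresses. Heuristic from the comment above: a match that contains
// "@" but no "://" is treated as an email.
func dropEmails(matches []string) []string {
	var urls []string
	for _, m := range matches {
		if strings.Contains(m, "@") && !strings.Contains(m, "://") {
			continue // looks like an email address, skip it
		}
		urls = append(urls, m)
	}
	return urls
}

func main() {
	// Assumed matches for the example message; hard-coded so the sketch
	// runs without any third-party dependency.
	matches := []string{
		"http://www.google.com",
		"www.test.com",
		"John.Smith@test.com",
		"Testing@test.com",
	}
	fmt.Println(dropEmails(matches)) // prints [http://www.google.com www.test.com]
}
```

A URL with userinfo such as http://user@host.example/path is kept, because it contains "://".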

If we just returned URLs, the quality of the results would be worse. The input john.smith@test.com would give you test.com, for example.

@JimmyGalar
Author

I did that as a workaround, but I was hoping there was a pattern or systematic way to exclude emails, rather than having to remove them from the string before it is interpreted.

@mvdan
Owner

mvdan commented Aug 23, 2021

I'm happy to discuss API ideas if you have any, but remember that it's unlikely we can "disable" matching emails in the relaxed regexp.

@mvdan
Owner

mvdan commented Aug 31, 2021

What would you think of adding a new top-level API like:

func IsEmail(string) bool

Then, you could iterate over your xurls.Relaxed results and use xurls.IsEmail to filter emails as needed. In the future we could write other similar helper funcs, like HasScheme.
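To make the proposal concrete, here is a rough sketch of what calling code might look like. Neither helper exists in xurls as far as this thread shows; isEmail and hasScheme below are local, hypothetical stand-ins with guessed behaviour, and the matches are hard-coded rather than taken from xurls.Relaxed.

```go
package main

import (
	"fmt"
	"strings"
)

// hasScheme is a hypothetical helper: it reports whether a match carries an
// explicit "scheme://" prefix somewhere in it.
func hasScheme(match string) bool {
	return strings.Contains(match, "://")
}

// isEmail is a hypothetical stand-in for the proposed IsEmail. It accepts a
// match with no scheme and a non-empty part on each side of the first "@".
func isEmail(match string) bool {
	if hasScheme(match) {
		return false
	}
	local, domain, found := strings.Cut(match, "@")
	return found && local != "" && domain != ""
}

func main() {
	// Assumed results of a relaxed match over some message body.
	matches := []string{"http://www.google.com", "www.test.com", "Testing@test.com"}
	for _, m := range matches {
		switch {
		case isEmail(m):
			fmt.Println("email:      ", m)
		case hasScheme(m):
			fmt.Println("strict URL: ", m)
		default:
			fmt.Println("relaxed URL:", m)
		}
	}
}
```

Keeping the checks as small predicates leaves the caller free to decide what to do with each kind of match.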

@mvdan
Owner

mvdan commented Feb 24, 2022

Friendly ping @JimmyGalar :)

mvdan added a commit that referenced this issue Jan 1, 2024
This doesn't require new API nor does it require us to do a second
regular expression match on each string, so I think it's the right way.
We can also add more subexpression names later which might be useful,
such as "web URL without scheme" or "URL scheme".

Updates #53.
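
The general technique the commit message describes, telling matches apart by which named subexpression took part in the match, can be sketched with the standard regexp package alone. The pattern and the group name "email" below are toy assumptions made up for illustration; they are not the xurls regular expression, and the subexpression names xurls actually exposes may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// toyRx is a deliberately simplified pattern, NOT the one xurls uses.
// The first alternative captures email-like text in a group named "email"
// (hypothetical name); the second matches URL-like text with or without
// an http(s) scheme.
var toyRx = regexp.MustCompile(
	`(?P<email>[\w.]+@[\w.]+\.\w+)|(?:https?://)?[\w.]+\.\w+(?:/\S*)?`)

// splitMatches runs the regexp once over text and sorts each match into
// urls or emails depending on whether the "email" group participated.
func splitMatches(text string) (urls, emails []string) {
	emailIdx := toyRx.SubexpIndex("email")
	for _, m := range toyRx.FindAllStringSubmatch(text, -1) {
		if m[emailIdx] != "" {
			emails = append(emails, m[0])
		} else {
			urls = append(urls, m[0])
		}
	}
	return urls, emails
}

func main() {
	urls, emails := splitMatches(
		"see http://www.google.com and www.test.com or mail John.Smith@test.com")
	fmt.Println("urls:  ", urls)
	fmt.Println("emails:", emails)
}
```

Because the group index is looked up once and checked per match, no second regular expression pass over each result is needed.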
@mvdan
Owner

mvdan commented Jan 1, 2024

I thought about this briefly and pushed 09d66fb to master. What do you all think?

@mvdan
Owner

mvdan commented Jan 28, 2024

I will assume that the fix in master is enough. Feel free to leave a comment or file a new issue if you disagree.

@mvdan mvdan closed this as completed Jan 28, 2024