Issue with Email Addresses #53
Comments
You can fairly easily filter out emails from the result. For example, if a result contains If we just returned URLs, the quality of the results would be worse. The input
I did that as a workaround, but I was hoping there was a pattern or systematic way to exclude emails, rather than removing them from the string before it is parsed.
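For anyone landing here later, a minimal sketch of the kind of post-filtering discussed above. Everything in it is an assumption for illustration: `relaxedURL` is a deliberately simplified stand-in pattern, not the regexp xurls uses, and `urlsWithoutEmails` is a hypothetical helper. Since it accepts any `*regexp.Regexp`, it should work the same way with whatever relaxed matcher you already have. The heuristic is to widen each match to its surrounding whitespace-delimited token and drop it if that token contains `@` but no `://`.

```go
package main

import (
	"regexp"
	"strings"
)

// relaxedURL is a simplified stand-in for a relaxed URL matcher.
// It is NOT the pattern xurls uses; it only reproduces the problem.
var relaxedURL = regexp.MustCompile(
	`(?:https?://)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.(?:com|org|net)\b(?:/[^\s,]*)?`)

// urlsWithoutEmails (hypothetical helper) returns the matches of re in text,
// skipping any match that sits inside a token that looks like an email address.
func urlsWithoutEmails(re *regexp.Regexp, text string) []string {
	var urls []string
	for _, loc := range re.FindAllStringIndex(text, -1) {
		start, end := loc[0], loc[1]

		// Widen the match to the whitespace-delimited token containing it.
		tokStart := strings.LastIndexAny(text[:start], " \t\r\n") + 1
		tokEnd := len(text)
		if i := strings.IndexAny(text[end:], " \t\r\n"); i >= 0 {
			tokEnd = end + i
		}
		token := text[tokStart:tokEnd]

		// Heuristic: an "@" without a scheme means an email address.
		// URLs with userinfo (http://user@host) keep their "://" and survive.
		if strings.Contains(token, "@") && !strings.Contains(token, "://") {
			continue
		}
		urls = append(urls, text[start:end])
	}
	return urls
}
```

Called on the example string from the original report, this should keep `http://www.google.com` and `www.test.com` and drop the fragments that come from the two email addresses. It is only a heuristic, so unusual input (for example an address glued to other text without whitespace) may still slip through.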
I'm happy to discuss API ideas if you have any, but remember that it's unlikely we can "disable" matching emails in the relaxed regexp.
What would you think of adding a new top-level API like:
Then, you could iterate over your
Friendly ping @JimmyGalar :)
This doesn't require new API nor does it require us to do a second regular expression match on each string, so I think it's the right way. We can also add more subexpression names later which might be useful, such as "web URL without scheme" or "URL scheme". Updates #53.
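To illustrate the general idea of named subexpressions with the standard `regexp` package, here is a rough sketch. The pattern `urlOrEmail` and the group names `email` and `url` are hypothetical and chosen for this example only; they are not taken from xurls, so check the package documentation for the names it actually exposes. The point is that a single match pass can report which alternative fired, so the caller can skip emails without running a second regexp.

```go
package main

import "regexp"

// urlOrEmail is a simplified, hypothetical pattern, not the one xurls ships.
// The email alternative comes first so an address is consumed whole instead
// of leaving its domain behind to match as a bare URL.
var urlOrEmail = regexp.MustCompile(
	`(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})` +
		`|(?P<url>(?:https?://)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.(?:com|org|net)\b)`)

// urlsOnly (hypothetical helper) returns only the matches captured by the
// "url" group, ignoring matches that came from the "email" group.
func urlsOnly(text string) []string {
	// SubexpIndex is available in Go 1.15 and later.
	urlIdx := urlOrEmail.SubexpIndex("url")

	var urls []string
	for _, m := range urlOrEmail.FindAllStringSubmatch(text, -1) {
		// A group that did not take part in the match is the empty string.
		if m[urlIdx] != "" {
			urls = append(urls, m[urlIdx])
		}
	}
	return urls
}
```

With the string from the original report, `urlsOnly` should return `http://www.google.com` and `www.test.com`, because both email addresses are captured by the `email` group and skipped.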
I thought about this briefly and pushed 09d66fb to master. What do you all think?
I will assume that the fix in master is enough. Feel free to leave a comment or file a new issue if you disagree.
I am using the xurls code to pull out possible URLs from a message body string. The URLs can be in either strict or relaxed format, so I need to use the relaxed method of xurls to find them. The issue is that email addresses can also be in the string, and the relaxed method of xurls is pulling those out too.
For example my string might be:
"Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email John.Smith@test.com or Testing@test.com"
What I would like xurls to do is just pull the http://www.google.com or www.test.com.
Instead, it pulls the 2 URLs plus John.Sm, test.com, and test.com. Is there anything that can be done so that only URLs are pulled?