Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect URL and Email in unspecified strings #4503

Merged
merged 3 commits into from Apr 21, 2017
Merged

Detect URL and Email in unspecified strings #4503

merged 3 commits into from Apr 21, 2017

Conversation

wvengen
Copy link
Contributor

@wvengen wvengen commented Mar 8, 2017

implements #4493

(if you like this feature, please add a thumbs-up reaction here and in the issue)

@@ -109,7 +109,7 @@ export function formatEmail(value, { jsx } = {}) {
if (jsx && EMAIL_WHITELIST_REGEX.test(value)) {
return <ExternalLink href={"mailto:" + value}>{value}</ExternalLink>;
} else {
value;
return value;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this fixes an existing bug with email addresses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tlrobinson
Copy link
Contributor

tlrobinson commented Mar 8, 2017

Thanks for doing this.

I think the URL and email regexes we use when we know the column should be a URL or email are way too lax for when we don't know the type. e.x. the email regex will match any string with a @ in it.

If we were to do this I think we'd want much more strict regexes for the fallback version.

Also, in the not so distant future you'll also be able to mark columns in SQL query with the Email/URL (and other) types.

@salsakran What are your thoughts on whether we should do this?

@tlrobinson
Copy link
Contributor

Also, I'm wondering if we should just run the same or similar metadata detection on native/SQL query results that we run on the db schema sync?

@salsakran
Copy link
Contributor

long term we'll be allowing the use of a query to be used as the source table for another query (#1822), and we should try to guess subsequent type info if we can. In the interim, I'm not opposed to treating result urls as urls in situations where we can guess them.

Are there any security issues opened up by this? The only thing I can think of is someone crafting a weird sql generated url that someone elses clicks. This would be a GET, and to the best of my recollection we don't have side effecting GET api endpoints do we? @tlrobinson @camsaul ?

@tlrobinson
Copy link
Contributor

I can't think of any additional security implications. We're already URL-ifying user data when the column is marked as a URL or email, and if Metabase is vulnerable to CSRF that should be fixed regardless.

Our ExternalLink React component opens links in new tabs, and marks the link as rel="noopener" to prevent phishing attacks.

If anyone can think of any other issues let me know.

@wvengen
Copy link
Contributor Author

wvengen commented Mar 8, 2017

Thanks for the feedback! Good point regarding the lax regexes. I'd be happy to dig up some stricter (and hopefully RFC-compliant) regexes for URLs and email addresses.

@camsaul
Copy link
Member

camsaul commented Mar 8, 2017

@salsakran Our GET endpoints don't have any side effects besides adding records to the ViewLog and QueryExecution tables

@salsakran
Copy link
Contributor

cool, like I said, I'm 👍 on the approach (for now, until we figure out a table analysis story for #1822), haven't inspected the code yet.

@wvengen
Copy link
Contributor Author

wvengen commented Mar 9, 2017

To check performance impact of stricter regexes, I ran a simple benchmark (using node):

# old regexes
formatEmail(email) 1970
formatEmail(url)   1040
formatEmail(text)  8620
formatUrl(email)   1990
formatUrl(url)     1282
formatUrl(text)    8738

# new regexes
formatEmail(email) 2149
formatEmail(url)   1230
formatEmail(text)  8776
formatUrl(email)   2067
formatUrl(url)     1364
formatUrl(text)    8778

I think this is acceptable.

@tlrobinson
Copy link
Contributor

@wvengen Thanks for finding some better regexes. I think we should keep the same limited set of protocols for the URL regex, though. Just replace (bitcoin|ftps?|geo|git|gopher|https?|ircs?|magnet|mailto|mms|news|nntp|redis|sftp|sips?|sms|ssh|svn|tel|telnet|urn|worldwind|xmpp) with (https?|mailto) and I think this will be good to go.

@wvengen
Copy link
Contributor Author

wvengen commented Mar 15, 2017

@tlrobinson thanks for your reply. I'm still for the more extended protocol list, but will put that in a different PR.

@tlrobinson
Copy link
Contributor

@wvengen I'm open to it, we just wanted to be start out pretty conservative.

@metabase/core-developers What are your thoughts on the expanded list of protocols?

@camsaul
Copy link
Member

camsaul commented Mar 20, 2017

@tlrobinson I'm with you, I don't think a lot of people are using gopher:// or telnet:// URLs these days, I think the usual clickable protocols (https? and mailto) are a good starting point

@wvengen
Copy link
Contributor Author

wvengen commented Mar 20, 2017

Since I've amended the commit: this protocol list comes from MediaWiki.

@wvengen
Copy link
Contributor Author

wvengen commented Apr 10, 2017

Can we merge the current PR, without the longer list of protocols? It looks like we agree on that.

@camsaul
Copy link
Member

camsaul commented Apr 10, 2017

FWIW I think the regexes, e.g.

const URL_WHITELIST_REGEX = /^(https?|mailto):\/*(?:[^:@]+(?::[^@]+)?@)?(?:[^\s:/?#]+|\[[a-f\d:]+])(?::\d+)?(?:\/[^?#]*)?(?:\?[^#]*)?(?:#.*)?$/i;

would be a good thing to have some tests around 😺

It's not really feasible to verify that it's correct just by looking at it

@wvengen
Copy link
Contributor Author

wvengen commented Apr 19, 2017

Sounds good. There were some tests already, I've expanded them a bit more.

@tlrobinson tlrobinson merged commit 810afe1 into metabase:master Apr 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants