Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support For Extracting Based on Hostname? #14

Closed
sshaw opened this issue Nov 21, 2015 · 5 comments
Closed

Support For Extracting Based on Hostname? #14

sshaw opened this issue Nov 21, 2015 · 5 comments

Comments

@sshaw
Copy link

sshaw commented Nov 21, 2015

What do you think? It could work like StringMatchingScheme but accept a hostname (or second level+ domains). The prevents one from having building an additional regexp to check URLs returned by xurls.FindAllString.

@mvdan
Copy link
Owner

mvdan commented Nov 21, 2015

StringMatchingScheme was added because someone said using grep or a golang regexp on top of xurls would defeat the purpose. This request falls under the same category.

I'm really not sure, to be honest. Scheme and hostname seem fine. But shouldn't we do tld, path, etcetera too then? I think this would bloat the package.

Have you thought about parsing the urls instead, and then filtering based on the parts that you obtain? I could add this to cmd/xurls so that it could be used in the command line.

Adding that to the xurls go package should be avoided though, since that can be easily done in go via the net/url package just like I would be doing in cmd/xurls.

@sshaw
Copy link
Author

sshaw commented Dec 4, 2015

I'm really not sure, to be honest. Scheme and hostname seem fine. But shouldn't we do tld, path, etcetera too then?

Maybe add them too, why not?

I think this would bloat the package.

Feature bloat? I'm a fan of keeping things simple, but what's the guiding philosophy here? If it's a command line tool then sure, the current behavior is in line with UNIX pipeline philosophy. But it's also a Go package for extracting URLs, so what's wrong with adding features that make extracting URLs easier?

Have you thought about parsing the urls instead, and then filtering based on the parts that you obtain?

Yes, this is what I am doing, something like:

var urls []string = xurls.Strict.FindAllString(text, -1)
for _, u := range(urls) {
  u, e := url.Parse(u)
  // ...
  if u.Host != "foo.com" { 
    continue
  }

But then I thought: xurls just ran this very complete regex to extract/parse the URL, and now I'm parsing it again. I saw that you allow for matching scheme, and was surprised that there was not an option for the possibly more common target of hostname.

I could add this to cmd/xurls so that it could be used in the command line.

I don't think it's needed here.

Adding that to the xurls go package should be avoided though, since that can be easily done in go via the net/url package just like I would be doing in cmd/xurls.

Indeed. Though if some/most/all (?) of the people using your package were making calls to url.Parse() and/or re.MatchString() on the results of xurls.XXX.FindAllString(), isn't that a sign of a missing feature and/or room for improvement :neckbeard:?

@sshaw
Copy link
Author

sshaw commented Dec 4, 2015

Maybe it could offer up a builder:

re, _ := xurls.Builder.MatchScheme("https://")
re, _ = re.MatchUsername("sshaw")
re, _ = re.MatchHostname("foo.com")
re.MatchString(str)

@mvdan
Copy link
Owner

mvdan commented Dec 25, 2015

I'm not ignoring this, neither have I forgotten about it - just thinking about it, still.

@mvdan
Copy link
Owner

mvdan commented Apr 13, 2016

The strict regexp is much simpler than you think - it doesn't understand of usernames, hosts, paths or anything. That is the relaxed regexp, which is much bigger (and is about 5x slower).

I'm closing this since implementing this would complicate the strict regexp quite a lot, and probably make it slower too. Moreover, your current method of parsing the url is saner in the long run - this library is not an URL parser nor should it be.

@mvdan mvdan closed this as completed Apr 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants