Support For Extracting Based on Hostname? #14

sshaw · 2015-11-21T17:55:42Z

What do you think? It could work like StringMatchingScheme but accept a hostname (or second level+ domains). The prevents one from having building an additional regexp to check URLs returned by xurls.FindAllString.

The text was updated successfully, but these errors were encountered:

mvdan · 2015-11-21T22:05:38Z

StringMatchingScheme was added because someone said using grep or a golang regexp on top of xurls would defeat the purpose. This request falls under the same category.

I'm really not sure, to be honest. Scheme and hostname seem fine. But shouldn't we do tld, path, etcetera too then? I think this would bloat the package.

Have you thought about parsing the urls instead, and then filtering based on the parts that you obtain? I could add this to cmd/xurls so that it could be used in the command line.

Adding that to the xurls go package should be avoided though, since that can be easily done in go via the net/url package just like I would be doing in cmd/xurls.

sshaw · 2015-12-04T05:16:15Z

I'm really not sure, to be honest. Scheme and hostname seem fine. But shouldn't we do tld, path, etcetera too then?

Maybe add them too, why not?

I think this would bloat the package.

Feature bloat? I'm a fan of keeping things simple, but what's the guiding philosophy here? If it's a command line tool then sure, the current behavior is in line with UNIX pipeline philosophy. But it's also a Go package for extracting URLs, so what's wrong with adding features that make extracting URLs easier?

Have you thought about parsing the urls instead, and then filtering based on the parts that you obtain?

Yes, this is what I am doing, something like:

var urls []string = xurls.Strict.FindAllString(text, -1)
for _, u := range(urls) {
  u, e := url.Parse(u)
  // ...
  if u.Host != "foo.com" { 
    continue
  }

But then I thought: xurls just ran this very complete regex to extract/parse the URL, and now I'm parsing it again. I saw that you allow for matching scheme, and was surprised that there was not an option for the possibly more common target of hostname.

I could add this to cmd/xurls so that it could be used in the command line.

I don't think it's needed here.

Adding that to the xurls go package should be avoided though, since that can be easily done in go via the net/url package just like I would be doing in cmd/xurls.

Indeed. Though if some/most/all (?) of the people using your package were making calls to url.Parse() and/or re.MatchString() on the results of xurls.XXX.FindAllString(), isn't that a sign of a missing feature and/or room for improvement ?

sshaw · 2015-12-04T05:23:52Z

Maybe it could offer up a builder:

re, _ := xurls.Builder.MatchScheme("https://")
re, _ = re.MatchUsername("sshaw")
re, _ = re.MatchHostname("foo.com")
re.MatchString(str)

mvdan · 2015-12-25T20:04:49Z

I'm not ignoring this, neither have I forgotten about it - just thinking about it, still.

mvdan · 2016-04-13T13:38:56Z

The strict regexp is much simpler than you think - it doesn't understand of usernames, hosts, paths or anything. That is the relaxed regexp, which is much bigger (and is about 5x slower).

I'm closing this since implementing this would complicate the strict regexp quite a lot, and probably make it slower too. Moreover, your current method of parsing the url is saner in the long run - this library is not an URL parser nor should it be.

mvdan closed this as completed Apr 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support For Extracting Based on Hostname? #14

Support For Extracting Based on Hostname? #14

sshaw commented Nov 21, 2015

mvdan commented Nov 21, 2015

sshaw commented Dec 4, 2015

sshaw commented Dec 4, 2015

mvdan commented Dec 25, 2015

mvdan commented Apr 13, 2016

Navigation Menu

Support For Extracting Based on Hostname? #14

Support For Extracting Based on Hostname? #14

Comments

sshaw commented Nov 21, 2015

mvdan commented Nov 21, 2015

sshaw commented Dec 4, 2015

sshaw commented Dec 4, 2015

mvdan commented Dec 25, 2015

mvdan commented Apr 13, 2016