Support For Extracting Based on Hostname? #14
Comments
I'm really not sure, to be honest. Scheme and hostname seem fine. But shouldn't we then do TLD, path, et cetera too? I think this would bloat the package. Have you thought about parsing the URLs instead, and then filtering based on the parts that you obtain? I could add this to Adding that to the
Maybe add them too, why not?
Feature bloat? I'm a fan of keeping things simple, but what's the guiding philosophy here? If it's a command line tool then sure, the current behavior is in line with UNIX pipeline philosophy. But it's also a Go package for extracting URLs, so what's wrong with adding features that make extracting URLs easier?
Yes, this is what I am doing, something like:

    urls := xurls.Strict.FindAllString(text, -1)
    for _, raw := range urls {
        u, err := url.Parse(raw)
        if err != nil {
            continue
        }
        // ...
        if u.Host != "foo.com" {
            continue
        }
    }

But then I thought: xurls just ran this very complete regex to extract/parse the URL, and now I'm parsing it again. I saw that you allow for matching on scheme, and was surprised that there was no option for the possibly more common target of hostname.
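For reference, the parse-then-filter approach described above can be wrapped in a small helper. This is only a sketch using the standard library: `filterByHost` is a hypothetical name, not part of xurls, and the input slice is hard-coded where the result of the xurls call would normally go.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// filterByHost is a hypothetical helper (not part of xurls). It returns the
// entries of rawURLs whose hostname equals host, compared case-insensitively.
// Entries that fail to parse are skipped.
func filterByHost(rawURLs []string, host string) []string {
	var matched []string
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			continue
		}
		// Hostname() drops any port, unlike the Host field.
		if strings.EqualFold(u.Hostname(), host) {
			matched = append(matched, raw)
		}
	}
	return matched
}

func main() {
	// Assumption: in real use this slice would be the URLs extracted by xurls.
	found := []string{
		"https://foo.com/a",
		"http://bar.org/b",
		"https://FOO.com:8443/c",
	}
	fmt.Println(filterByHost(found, "foo.com"))
}
```

Comparing on `Hostname()` rather than `Host` means a URL with an explicit port still matches, which may or may not be what a given caller wants.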
I don't think it's needed here.
Indeed. Though if some/most/all (?) of the people using your package were making calls to |
Maybe it could offer up a builder:

    re, _ := xurls.Builder.MatchScheme("https://")
    re, _ = re.MatchUsername("sshaw")
    re, _ = re.MatchHostname("foo.com")
    re.MatchString(str)
I'm not ignoring this, nor have I forgotten about it. I'm just still thinking about it.
The strict regexp is much simpler than you think: it doesn't understand usernames, hosts, paths or anything. That is the relaxed regexp, which is much bigger (and is about 5x slower). I'm closing this since implementing it would complicate the strict regexp quite a lot, and probably make it slower too. Moreover, your current method of parsing the URL is saner in the long run: this library is not a URL parser, nor should it be.
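The "second level+ domains" case from the original request can also be handled after parsing, without touching the regexp. The following is a minimal sketch under that assumption; `inDomain` is a hypothetical helper and not an xurls function.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// inDomain is a hypothetical helper (not part of xurls). It reports whether
// rawURL's hostname is domain itself or a subdomain of it.
func inDomain(rawURL, domain string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	domain = strings.ToLower(domain)
	// The leading dot keeps "notfoo.com" from matching "foo.com".
	return host == domain || strings.HasSuffix(host, "."+domain)
}

func main() {
	candidates := []string{
		"https://foo.com/",
		"https://api.foo.com/v1",
		"https://notfoo.com/",
	}
	for _, raw := range candidates {
		fmt.Println(raw, inDomain(raw, "foo.com"))
	}
}
```

Note that a URL without a scheme, such as `foo.com/x`, parses with an empty hostname in `net/url`, so this check only suits matches that include a scheme.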
What do you think? It could work like StringMatchingScheme but accept a hostname (or second level+ domains). This prevents one from having to build an additional regexp to check URLs returned by xurls.FindAllString.