Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement suggestion with multiple domains in one single URL. #30

Closed
uggyuggy opened this issue Jul 31, 2019 · 6 comments
Closed

Improvement suggestion with multiple domains in one single URL. #30

uggyuggy opened this issue Jul 31, 2019 · 6 comments

Comments

@uggyuggy
Copy link

Hi,
Thank's for providing us xurls.

I came across the following case:

$ echo "http://www.fakedomain.com/account/legitdomain.com" | bin/xurls -r
http://www.fakedomain.com/account/legitdomain.com

I wonder if there is a easy (still fast) way for xurls to identify there are 2 "URLs" inside ?
So this could possibly report something like:

$ echo "http://www.fakedomain.com/account/legitdomain.com/folder" | bin/xurls -r
http://www.fakedomain.com/account/legitdomain.com/folder
legitdomain.com/folder
$

Possibly by adding an additional option to support it on demand only.

If there is a space in the string, both are found fine (expected and fine)

echo "http://www.fakedomain.com/        account/legitdomain.com/folder" | bin/xurls -r
http://www.fakedomain.com/
legitdomain.com/folder

This is only suggestion. If this impact performances badly, this is probably better to not implement.

@mvdan
Copy link
Owner

mvdan commented Aug 4, 2019

This is an interesting feature request. I think it shouldn't be on by default; the default behavior to not show overlapping results is I think the most common need, and, like you say, the fastest.

We can't simply try all possible substrings of a found match. For example, that would mean that foobar.com would also give bar.com, oobar.com, and so on. So we need something a bit smarter.

Perhaps we can limit this feature to at most one nested match within a URL's "path" element (including ?queries etc). That might work, but I wonder how common of a need that is.

Finally, I'm not sure how to expose this in the API or in the command-line tool. Any ideas?

@uggyuggy
Copy link
Author

uggyuggy commented Aug 4, 2019

I fully agree this should not be the default.
You are right about the "overlapping" problem to consider.

By "default" now, the requested behaviour is using the space etc as "delimiter".
It may be possible to define a option with an argument defining an additional "new" separator.
-s "/" could mean "use / also as a separator"

$ echo "http://www.fakedomain.com/account/legitdomain.com/folder" | bin/xurls -r -s "/"
http://www.fakedomain.com/account/legitdomain.com/folder
legitdomain.com/folder
$

I fully agree too this requirement is not very common.. Was just an idea as I came across this specific case.. ;)

Thank's

@mvdan
Copy link
Owner

mvdan commented Aug 5, 2019

This has nothing to do with separators, though. Your suggestion above would mean we'd support overlapping matches, which we don't now. If anything, adding / as an optional separator would just break all paths.

@uggyuggy
Copy link
Author

uggyuggy commented Aug 6, 2019

Yes, suggestion is optional overlapping match.

Now:

$ echo "foo/domainB.com/bar" | bin/xurls -r
domainB.com/bar
$
$ echo "domainA.com/foo/domainB.com/bar" | bin/xurls -r
domainA.com/foo/domainB.com/bar
$

In this example, domainB.com is identified only when there is no "overlap".

Suggestion about / is like "replace all / with space/ and check again if something is found:

"domainA.com /foo/domainB.com/bar"
"domainA.com/foo /domainB.com/bar"
"domainA.com/foo/domainB.com /bar"

then "reassemble" removing the extra space (to have the first full path with dir)

@mvdan
Copy link
Owner

mvdan commented Sep 22, 2019

What you're suggesting there would be a recursive search; at best, that would be quadratic complexity, and quite a lot of added code for a niche use case.

I've decided to not implement this for now. It should be fairly easy to implement this outside xurls. If you find that the API is nice and the code is generally useful, we can look at adding that later, but for now I don't think it's worth the effort.

@mvdan mvdan closed this as completed Sep 22, 2019
@uggyuggy
Copy link
Author

That's OK. Thank's

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants