Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linkify is eager about top level domain URLs #35

Closed
kumar303 opened this issue Jul 18, 2011 · 12 comments
Closed

linkify is eager about top level domain URLs #35

kumar303 opened this issue Jul 18, 2011 · 12 comments
Labels

Comments

@kumar303
Copy link

It seems that linkify will turn anything that looks like a domain into a URL. There should be a way to turn this off. Maybe eager_domain_linking=False.

For example, the filenames in amo-validator output are getting linked. Like /path/to/somescript.sh gets linked as http://somescript.sh . Screenshot: https://bug670047.bugzilla.mozilla.org/attachment.cgi?id=544675 More details: https://bugzilla.mozilla.org/show_bug.cgi?id=670047

(I have the AMO issue assigned to me but it keeps getting bumped due to priority. I should get to it soon!)

@jsocol
Copy link
Contributor

jsocol commented Jul 18, 2011

I don't consider linkifying something that looks like a domain a bug: the bug here is that path/to/file.sh shouldn't be considered to "look like a domain". Back into the URL regex to fix this.

@kumar303
Copy link
Author

One thing to consider is that top level xpi files might exist without any prefixed path, something like "somescript.sh" -- this would be linked by linkify, which is not desired for amo-validator output.

@jsocol
Copy link
Contributor

jsocol commented Jul 18, 2011

.sh is a valid TLD, this behavior is correct. Turning it off would also stop linkifying example.com in the text. If you don't want this behavior, why are you running linkify at all?

@kumar303
Copy link
Author

Right, losing example.com would be fine. We just use it for URLs that are already prefixed with http(s). We want those links to go through the outgoing server so using bleach for this seemed like a good idea.

@jsocol
Copy link
Contributor

jsocol commented Jul 18, 2011

The default behavior can't change, because I don't think this is expected behavior. I'd nominate aggressive or eager for the name of the kwarg to linkify().

@SmileyChris
Copy link
Contributor

An aside, these false positives are why django's urlize filter only linkifies urls prefixed with a schema or 'www.', or domain-only urls ending in .com, .org and .net.

@jsocol
Copy link
Contributor

jsocol commented Jul 21, 2011

@SmileyChris: That difference is actually one of the reasons this method exists in the first place, which is why I'm not keen on changing it.

@SmileyChris
Copy link
Contributor

Fair enough then! :)

@kmike
Copy link

kmike commented Aug 8, 2011

Could linkify be more configurable? Sometimes example.com should be linked, sometimes not. Now the only possiblity to configure the linkify is to copy-paste the whole linkify method. I think it'll be good to do one of the following:

  • acccept callbacks for common tasks (e.g. post-process a node, find links, etc);
  • make utility methods module-level functions so one can reuse them and build custom linkify;
  • make linkify class with overridable methods + call for backward compatibility

My use case: linkify IDN domains (e.g. сайт.рф) + add target='_blank' to links. I can also imagine somebody wanting to add css classes to external links, to stop linking localhost url or to use custom domain list (maybe even different domain lists for different parts of site).

Would you accept patch with such changes or this is all unreasonable or you want to implement something yourselves?

@jsocol
Copy link
Contributor

jsocol commented Aug 9, 2011

The cleanest of those options feels like the class, but I really struggle to see how any of them could be completely backwards compatible (except maybe callbacks for "common tasks" but I don't want a 40-kwarg method). I'm open to any of them as long as import bleach; bleach.linkify(text) still works, none of the current defaults change, it's not a massive performance hit (it's slow enough as-is) and it's clean.

Any of these changes, I think, necessitates moving the linkify code into a new module. Probably clean, too, for consistency.

@kmike: I'm certainly open to a big rewrite but it's not my highest priority right now, especially not given how much bigger that scope is than this issue.

@msabramo
Copy link
Contributor

msabramo commented May 9, 2012

Another consequence of this eager linking is that linkify links the word "settings.py" on my crate.io page. Is there any way to tell it to suppress the link?

@jsocol
Copy link
Contributor

jsocol commented May 9, 2012

.py is the ccTLD for Paraguay. It's legit.

That said, #56 would make linkify customizable enough that anyone could refuse to linkify whatever TLDs they want. The solution will be to write callbacks that test it, e.g.:

def no_python(text, attrs):
    if attrs['href'].endswith('.py'):
        return None

Or more broadly:

def only_with_protocol(text, attrs):
    if not attrs['href'].startswith('http://'):
        return None

The latter seems like a logical thing to include, maybe not as a default, but as an option.

Closing this in favor of #56.

@jsocol jsocol closed this as completed May 9, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants