Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The urlize function does not compose parentheses correctly at the end #827

Closed
colidyre opened this issue Mar 20, 2018 · 9 comments · Fixed by #1195
Closed

The urlize function does not compose parentheses correctly at the end #827

colidyre opened this issue Mar 20, 2018 · 9 comments · Fixed by #1195
Milestone

Comments

@colidyre
Copy link

Expected Behavior

Final parentheses should be within the tag and link should be exactly as given before filtering.

Actual Behavior

Final parentheses is outside the tag and final parentheses is lost in url construction

Template Code

from jinja2.utils import urlize
example_url = u'https://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)'
urlize(example_url, 50)
# u'<a href="https://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management">https://de.wikipedia.org/wiki/Agilit%C3%A4t_(Manag...</a>)'

Your Environment

  • Python version: 2.7.12
  • Jinja version: 2.8
@norela
Copy link

norela commented Apr 12, 2018

In the utils.py, this issue is solved if you remove ')' from the 24th line:

_punctuation_re = re.compile(
    '^(?P<lead>(?:%s)*)(?P<middle>.*?)(?P<trail>(?:%s)*)$' % (
        '|'.join(map(re.escape, ('(', '<', '&lt;'))),
        '|'.join(map(re.escape, ('.', ',', ')', '>', '\n', '&gt;')))
    )
)

to:

_punctuation_re = re.compile(
    '^(?P<lead>(?:%s)*)(?P<middle>.*?)(?P<trail>(?:%s)*)$' % (
        '|'.join(map(re.escape, ('(', '<', '&lt;'))),
        '|'.join(map(re.escape, ('.', ',', '>', '\n', '&gt;')))
    )
)

The ')' is added to the 'trail'-group instead of the 'middle'-group at line 213.
I don't know what impact/consequences this has on other inputs for the urlize function.

@mrecachinas
Copy link

mrecachinas commented Apr 15, 2018

@norela The consequence will be on inputs like the following that wrap an entire URL in parentheses (i.e., the parentheses are not part of the URL).

>>> url_text = "Hello world (http://www.foo.org)."
>>> urlize(url_text) # with regex suggestion above by @norela
'Hello world (<a href="http://www.foo.org)">http://www.foo.org)</a>.'

This issue is trickier than it seems.

If we assume that if a URL contains parentheses, they must be balanced and cannot contain nested parentheses, we could add a quick fix after line 213 that looks something like:

...
lead, middle, trail = match.groups()

# naive check for balanced parens 
if len(middle) > 0 and '(' in middle and len(trail) > 0 and trail[0] == ")":
    middle = middle + trail[0]
    trail = trail[1:]
if middle.startswith('www.') or (
...

Unfortunately, the URL standard doesn't require parentheses be balanced and does not disallow nested parentheses, so this solution would be temporary and not robust.

If we instead assume that non-URL text should contain balanced parentheses, then an alternative potential solution may be to check lead and trail for balanced parentheses. If they are balanced, we keep the right-parenthesis in trail; if they are unbalanced, we append the right-parenthesis onto middle. This has the benefit of making no assumption about whether the URL's parentheses are balanced.

@norela
Copy link

norela commented Apr 15, 2018

What about something like this.

import re

regex = r'((?:(https?|s?ftp):\/\/)?(?:www\.)?((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})\
|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))(?::(\d{1,5}))?(?:(\/\S+)*))'

find_url_in_string = re.compile(regex, re.IGNORECASE)

def urlize(string):
    url = find_url_in_string.search(string)
    if url is not None and url.group(0) is not None:
        return("{} <a href='{}'>{}</a> {}".format(string.split(url.group(0).strip())[0], 
                                                url.group(0).strip(), url.group(0).strip(),
                                                string.split(url.group(0).strip())[1]))
    
string = "This is a link http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management) awesome"
urlize(string)

returns:
"This is a link <a href='http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)'>http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)</a> awesome"

found a regex to extract the url from a string on stackoverflow:
https://stackoverflow.com/questions/9760588/how-do-you-extract-a-url-from-a-string-using-python#9760660

@mrecachinas
Copy link

mrecachinas commented Apr 15, 2018

How does it perform with the following input?

string = "Foo (See http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)) Bar"

@mrecachinas
Copy link

Also, check out this Article by Jeff Atwood on this very issue

@norela
Copy link

norela commented Apr 15, 2018

I see, thanks.
Maybe it helps to check if there have been more occurrences of '(' as there are ')' in the part before the url, like so:

def urlize(string):
    url = find_url_in_string.search(string)
    if url is not None and url.group(0) is not None:
        if string.split(url.group(0).strip())[0].count('(') > string.split(url.group(0).strip())[0].count(')') and \
                        url.group(0).strip().endswith(')'):
            return("{} <a href='{}'>{}</a>) {}".format(string.split(url.group(0).strip())[0], 
                                    url.group(0).strip().rstrip(')'), url.group(0).strip().rstrip(')'),
                                    string.split(url.group(0).strip())[1]))
        else:
            return("{} <a href='{}'>{}</a> {}".format(string.split(url.group(0).strip())[0], 
                                                url.group(0).strip(), url.group(0).strip(),
                                                string.split(url.group(0).strip())[1]))

@mrecachinas
Copy link

mrecachinas commented Apr 16, 2018

I think you're on the right track. Still seeing it fail on this case:

>>> string = "Foo (See http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)) Bar"
>>> urlize(string) # includes your proposed fix
Foo (See  <a href='http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management'>http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management</a>)  Bar

You'll notice it omitted the closing parenthesis in the URL -- i.e.,

http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management

instead of

http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)

@norela
Copy link

norela commented Apr 17, 2018

Thanks for the feedback. Furthermore, I think the check should be recursive too.
For inputs like: "This is a link (http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)), awesome"
What do you think of the following:

import re

regex = r'((?:(https?|s?ftp):\/\/)?(?:www\.)?((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})\
|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))(?::(\d{1,5}))?(?:(\/\S+)*))'

find_url_in_string = re.compile(regex, re.IGNORECASE)


def urlize(string):
    url = find_url_in_string.search(string)
    if url is not None and url.group(0) is not None:
        url = url.group(0).strip()
        pre_url = string.split(url)[0]
        post_url = string.split(url)[1]
        result = "{} <a href='{}'>{}</a> {}"
        escapechar1, escapechar2 = ['(', '<', '&lt;']*2, [')', '>','&gt;']*2
        
        for _ in range(len(escapechar1)):
            if (url[-1] in ['.', ',', '\n']*2) or (pre_url.count(escapechar1[_]) > pre_url.count(escapechar2[_]) \
            and url.endswith(escapechar2[_])):
                post_url = url[-1]+post_url
                url = url[:-1]

        return result.format(pre_url, url, url, post_url)

@mrecachinas
Copy link

Sorry for the delay. It seems reasonable. What do @mitsuhiko and @davidism think about this and the underlying assumptions?

@davidism davidism added this to the 3.0.0 milestone Apr 19, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants