The `urlize` function does not compose parentheses correctly at the end #827

colidyre · 2018-03-20T13:53:39Z

Expected Behavior

Final parentheses should be within the tag and link should be exactly as given before filtering.

Actual Behavior

Final parentheses is outside the tag and final parentheses is lost in url construction

Template Code

from jinja2.utils import urlize
example_url = u'https://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)'
urlize(example_url, 50)
# u'<a href="https://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management">https://de.wikipedia.org/wiki/Agilit%C3%A4t_(Manag...</a>)'

Your Environment

Python version: 2.7.12
Jinja version: 2.8

norela · 2018-04-12T10:28:43Z

In the utils.py, this issue is solved if you remove ')' from the 24th line:

_punctuation_re = re.compile(
    '^(?P<lead>(?:%s)*)(?P<middle>.*?)(?P<trail>(?:%s)*)$' % (
        '|'.join(map(re.escape, ('(', '<', '&lt;'))),
        '|'.join(map(re.escape, ('.', ',', ')', '>', '\n', '&gt;')))
    )
)

to:

_punctuation_re = re.compile(
    '^(?P<lead>(?:%s)*)(?P<middle>.*?)(?P<trail>(?:%s)*)$' % (
        '|'.join(map(re.escape, ('(', '<', '&lt;'))),
        '|'.join(map(re.escape, ('.', ',', '>', '\n', '&gt;')))
    )
)

The ')' is added to the 'trail'-group instead of the 'middle'-group at line 213.
I don't know what impact/consequences this has on other inputs for the urlize function.

mrecachinas · 2018-04-15T11:37:58Z

@norela The consequence will be on inputs like the following that wrap an entire URL in parentheses (i.e., the parentheses are not part of the URL).

>>> url_text = "Hello world (http://www.foo.org)."
>>> urlize(url_text) # with regex suggestion above by @norela
'Hello world (<a href="http://www.foo.org)">http://www.foo.org)</a>.'

This issue is trickier than it seems.

If we assume that if a URL contains parentheses, they must be balanced and cannot contain nested parentheses, we could add a quick fix after line 213 that looks something like:

...
lead, middle, trail = match.groups()

# naive check for balanced parens 
if len(middle) > 0 and '(' in middle and len(trail) > 0 and trail[0] == ")":
    middle = middle + trail[0]
    trail = trail[1:]
if middle.startswith('www.') or (
...

Unfortunately, the URL standard doesn't require parentheses be balanced and does not disallow nested parentheses, so this solution would be temporary and not robust.

If we instead assume that non-URL text should contain balanced parentheses, then an alternative potential solution may be to check lead and trail for balanced parentheses. If they are balanced, we keep the right-parenthesis in trail; if they are unbalanced, we append the right-parenthesis onto middle. This has the benefit of making no assumption about whether the URL's parentheses are balanced.

norela · 2018-04-15T19:59:44Z

What about something like this.

import re

regex = r'((?:(https?|s?ftp):\/\/)?(?:www\.)?((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})\
|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))(?::(\d{1,5}))?(?:(\/\S+)*))'

find_url_in_string = re.compile(regex, re.IGNORECASE)

def urlize(string):
    url = find_url_in_string.search(string)
    if url is not None and url.group(0) is not None:
        return("{} <a href='{}'>{}</a> {}".format(string.split(url.group(0).strip())[0], 
                                                url.group(0).strip(), url.group(0).strip(),
                                                string.split(url.group(0).strip())[1]))
    
string = "This is a link http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management) awesome"
urlize(string)

returns:
"This is a link <a href='http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)'>http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)</a> awesome"

found a regex to extract the url from a string on stackoverflow:
https://stackoverflow.com/questions/9760588/how-do-you-extract-a-url-from-a-string-using-python#9760660

mrecachinas · 2018-04-15T20:23:34Z

How does it perform with the following input?

string = "Foo (See http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)) Bar"

mrecachinas · 2018-04-15T20:41:32Z

Also, check out this Article by Jeff Atwood on this very issue

norela · 2018-04-15T21:00:02Z

I see, thanks.
Maybe it helps to check if there have been more occurrences of '(' as there are ')' in the part before the url, like so:

def urlize(string):
    url = find_url_in_string.search(string)
    if url is not None and url.group(0) is not None:
        if string.split(url.group(0).strip())[0].count('(') > string.split(url.group(0).strip())[0].count(')') and \
                        url.group(0).strip().endswith(')'):
            return("{} <a href='{}'>{}</a>) {}".format(string.split(url.group(0).strip())[0], 
                                    url.group(0).strip().rstrip(')'), url.group(0).strip().rstrip(')'),
                                    string.split(url.group(0).strip())[1]))
        else:
            return("{} <a href='{}'>{}</a> {}".format(string.split(url.group(0).strip())[0], 
                                                url.group(0).strip(), url.group(0).strip(),
                                                string.split(url.group(0).strip())[1]))

mrecachinas · 2018-04-16T20:48:28Z

I think you're on the right track. Still seeing it fail on this case:

>>> string = "Foo (See http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)) Bar"
>>> urlize(string) # includes your proposed fix
Foo (See  <a href='http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management'>http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management</a>)  Bar

You'll notice it omitted the closing parenthesis in the URL -- i.e.,

http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management

instead of

http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)

norela · 2018-04-17T09:34:26Z

Thanks for the feedback. Furthermore, I think the check should be recursive too.
For inputs like: "This is a link (http://de.wikipedia.org/wiki/Agilit%C3%A4t_(Management)), awesome"
What do you think of the following:

import re

regex = r'((?:(https?|s?ftp):\/\/)?(?:www\.)?((?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)([A-Z]{2,6})\
|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))(?::(\d{1,5}))?(?:(\/\S+)*))'

find_url_in_string = re.compile(regex, re.IGNORECASE)


def urlize(string):
    url = find_url_in_string.search(string)
    if url is not None and url.group(0) is not None:
        url = url.group(0).strip()
        pre_url = string.split(url)[0]
        post_url = string.split(url)[1]
        result = "{} <a href='{}'>{}</a> {}"
        escapechar1, escapechar2 = ['(', '<', '&lt;']*2, [')', '>','&gt;']*2
        
        for _ in range(len(escapechar1)):
            if (url[-1] in ['.', ',', '\n']*2) or (pre_url.count(escapechar1[_]) > pre_url.count(escapechar2[_]) \
            and url.endswith(escapechar2[_])):
                post_url = url[-1]+post_url
                url = url[:-1]

        return result.format(pre_url, url, url, post_url)

mrecachinas · 2018-04-29T16:14:54Z

Sorry for the delay. It seems reasonable. What do @mitsuhiko and @davidism think about this and the underlying assumptions?

mrecachinas mentioned this issue Apr 15, 2018

Doesn't ignore surrounding parentheses lipoja/URLExtract#16

Closed

bebleo mentioned this issue Apr 19, 2020

Improve and extend urlize #1195

Merged

davidism added this to the 3.0.0 milestone Apr 19, 2020

davidism closed this as completed in #1195 Jan 31, 2021

github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The `urlize` function does not compose parentheses correctly at the end #827

The `urlize` function does not compose parentheses correctly at the end #827

colidyre commented Mar 20, 2018

norela commented Apr 12, 2018 •

edited

Loading

mrecachinas commented Apr 15, 2018 •

edited

Loading

norela commented Apr 15, 2018 •

edited

Loading

mrecachinas commented Apr 15, 2018 •

edited

Loading

mrecachinas commented Apr 15, 2018

norela commented Apr 15, 2018

mrecachinas commented Apr 16, 2018 •

edited

Loading

norela commented Apr 17, 2018

mrecachinas commented Apr 29, 2018

The urlize function does not compose parentheses correctly at the end #827

The urlize function does not compose parentheses correctly at the end #827

Comments

colidyre commented Mar 20, 2018

Expected Behavior

Actual Behavior

Template Code

Your Environment

norela commented Apr 12, 2018 • edited Loading

mrecachinas commented Apr 15, 2018 • edited Loading

norela commented Apr 15, 2018 • edited Loading

mrecachinas commented Apr 15, 2018 • edited Loading

mrecachinas commented Apr 15, 2018

norela commented Apr 15, 2018

mrecachinas commented Apr 16, 2018 • edited Loading

norela commented Apr 17, 2018

mrecachinas commented Apr 29, 2018

The `urlize` function does not compose parentheses correctly at the end #827

The `urlize` function does not compose parentheses correctly at the end #827

norela commented Apr 12, 2018 •

edited

Loading

mrecachinas commented Apr 15, 2018 •

edited

Loading

norela commented Apr 15, 2018 •

edited

Loading

mrecachinas commented Apr 15, 2018 •

edited

Loading

mrecachinas commented Apr 16, 2018 •

edited

Loading