Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve and extend urlize #1195

Merged
merged 3 commits into from Jan 31, 2021
Merged

Conversation

bebleo
Copy link
Contributor

@bebleo bebleo commented Apr 19, 2020

Fixes #522
Fixes #827
Fixes #1172

The 'urlize' function does not as expected both in terms of some missing support. Among a few a few of the behaviors were a lack of support for 5/8 of the original TLDs, emails prefixed with mailto: would not be returned as links, and no support for ftp://.

This PR does not "fix" the latter point, but does add a policy that can be set so that the function will create the links. This can also be used to add support for any arbitrary URI scheme.

Specifically:

@davidism davidism added this to the 3.0.0 milestone Apr 19, 2020
@davidism
Copy link
Member

  • updated docstrings
  • moved regexes from the top of the module to above the function
  • turned the http regex into a verbose one with comments since it's complex
  • renamed extra_uri_schemes to extra_schemes

move regexes near implementation
commented verbose regex for http pattern
renamed extra_uri_schemes to extra_schemes
don't try other url types if one already matched
no-op function if trim is not enabled
avoid backtracking when matching trailing punctuation
match head and tail punctuation separately
don't scan for unbalanced parentheses more than necessary
ensure email domain starts and ends with a word character
@davidism
Copy link
Member

After looking over the code more, I identified a number of optimizations, so I committed another refactor.

rel_attr = f' rel="{escape(rel)}"' if rel else ""
target_attr = f' target="{escape(target)}"' if target else ""

for i, word in enumerate(words):
match = _punctuation_re.match(word)
head, middle, tail = "", word, ""
match = re.match(r"^([(<]|&lt;)+", middle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This calls re.match() with the same pattern string for each iteration of the loop (and similar is done below). I don’t think the re module caches recently- or frequently-compiled patterns, in which case these calls would be needlessly recompiling the same pattern in every iteration, is that right? If so, is it worth compiling these patterns once, and reusing the resulting pattern objects here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re caches the last 512 patterns used. Compiling every single pattern at import time increases startup time for projects, I'd seen some more (general) discussions about this recently so decided these could be inlined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, good to know, thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants