Improve and extend urlize #1195
move regexes near implementation
commented verbose regex for http pattern
renamed extra_uri_schemes to extra_schemes

don't try other url types if one already matched
no-op function if trim is not enabled
avoid backtracking when matching trailing punctuation
match head and tail punctuation separately
don't scan for unbalanced parentheses more than necessary
ensure email domain starts and ends with a word character
After looking over the code more, I identified a number of optimizations, so I committed another refactor.
    rel_attr = f' rel="{escape(rel)}"' if rel else ""
    target_attr = f' target="{escape(target)}"' if target else ""

    for i, word in enumerate(words):
        match = _punctuation_re.match(word)
        head, middle, tail = "", word, ""
        match = re.match(r"^([(<]|&lt;)+", middle)
This calls re.match() with the same pattern string on each iteration of the loop (and something similar is done below). I don't think the re module caches recently or frequently compiled patterns, in which case these calls would needlessly recompile the same pattern on every iteration. Is that right? If so, is it worth compiling these patterns once and reusing the resulting pattern objects here?
re caches the last 512 patterns used. Compiling every single pattern at import time increases startup time for projects. I'd seen some more (general) discussions about this recently, so I decided these could be inlined.
Oh interesting, good to know, thanks!
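For anyone reading this thread later, here is a minimal sketch of the caching behavior discussed above. The helper name split_head and the constant HEAD_PATTERN are hypothetical, the pattern is modeled on the one in the diff, and the identity check relies on the CPython implementation of the re cache rather than on a documented guarantee.

```python
import re

# Hypothetical constant: a pattern string modeled on the one in the diff.
HEAD_PATTERN = r"^([(<]|&lt;)+"


def split_head(word):
    """Split leading '(', '<' or '&lt;' markers off a word.

    re.match() is called with a plain pattern string each time; the re
    module looks the compiled pattern up in its internal cache, so the
    pattern is not recompiled on every call.
    """
    match = re.match(HEAD_PATTERN, word)
    if match:
        return match.group(), word[match.end():]
    return "", word


# Assumption (CPython behavior): compiling the same pattern string and
# flags twice hands back the cached pattern object.
first = re.compile(HEAD_PATTERN)
second = re.compile(HEAD_PATTERN)
same_object = first is second

for word in ["((example.com", "&lt;example.com", "example.com"]:
    print(split_head(word))
```

The trade-off is the one described above: inlining the pattern string avoids compiling at import time, at the cost of a cache lookup per call.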
Fixes #522
Fixes #827
Fixes #1172
The urlize function does not behave as expected and is missing some support. Among the problems: it lacked support for 5 of the 8 original TLDs, emails prefixed with mailto: were not returned as links, and there was no support for ftp://.

This PR does not "fix" the last point, but it does add a policy that can be set so that the function will create those links. The same policy can be used to add support for any arbitrary URI scheme.

Specifically:
- extra_uri_schemes policy and parameter to address the lack of ftp: support and allow other schemes (for example tel:) to be added as needed. (issue "urlize filter ignores ftp:// URLs" #522 and PR "Added ftp urls to be catched in urlize" #587)
- … urlize function does not compose parentheses correctly at the end #827)
- … https// and https:// links are well-formed.
- … https instead of http
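
The extra-schemes idea in the first item can be illustrated with a small standalone sketch. This is not the code in this PR: the function name looks_like_link and its matching rules are hypothetical and only show how a caller-supplied list of scheme prefixes such as ftp: or tel: might widen what counts as a link.

```python
import re


def looks_like_link(word, extra_schemes=()):
    """Return True if word looks like a link (hypothetical helper).

    http:// and https:// are always accepted. Any prefix listed in
    extra_schemes (for example "ftp:" or "tel:") is accepted too, as
    long as something follows the prefix.
    """
    if re.match(r"^https?://\S+$", word, re.IGNORECASE):
        return True
    lowered = word.lower()
    return any(
        lowered.startswith(scheme.lower()) and len(word) > len(scheme)
        for scheme in extra_schemes
    )


# Without extra schemes, only http(s) links are recognized.
print(looks_like_link("https://example.com"))
print(looks_like_link("ftp://example.com/file"))

# Opting in to more schemes, in the spirit of the policy described above.
print(looks_like_link("ftp://example.com/file", extra_schemes=["ftp:"]))
print(looks_like_link("tel:+15551234", extra_schemes=["ftp:", "tel:"]))
```

Keeping the extra schemes opt-in means the default output is unchanged for existing users, which matches the stated intent of not "fixing" ftp:// by default.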