Avoid pathological regex performance when linkifying large ivy output.
Our build contains an 800 MB jar. Ivy outputs 110k dots while downloading it. Pants then tries to linkify that output, and `re.findall` takes 10+ minutes.

I first tried making all the groups non-capturing and all the `?` and `+` quantifiers non-greedy. That helped a bit, but not enough to solve the problem.

The negative lookahead solution in this commit feels hacky. I'm open to other suggestions. It might be worth trying the regex or re2 libraries for non-backtracking regexes, if it's worth adding those deps to pants, though I'm not sure those will actually help, since we'll still be quadratic.

Testing Done:
https://travis-ci.org/pantsbuild/pants/builds/118159590

Also turned linkify.py into a standalone script to repro the issue on our ivy output, and verified that execution time went from awful to good.
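The standalone repro script is not part of this commit. A rough sketch of that kind of timing check might look like the following. The `PATH` pattern here is a hypothetical simplification, not the real pants regex, and `time_findall` and `compare` are illustrative names. It uses `time.perf_counter`, so it assumes Python 3.

```python
import re
import time

# Hypothetical, simplified stand-in for the path regex in linkify.py:
# two or more '/'-separated components, ending in a word character.
PATH = r'[\w.\-]+(?:/[\w.\-]+)+\w'

# The guard added in this commit, prepended to the simplified pattern.
GUARDED_PATH = r'(?!\.{5})' + PATH


def time_findall(pattern, text):
    """Return (matches, elapsed_seconds) for re.findall over text."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    matches = compiled.findall(text)
    return matches, time.perf_counter() - start


def compare(n_dots):
    """Time the unguarded and guarded patterns on a run of n_dots dots."""
    text = '.' * n_dots
    _, unguarded = time_findall(PATH, text)
    _, guarded = time_findall(GUARDED_PATH, text)
    return unguarded, guarded


if __name__ == '__main__':
    for n in (1000, 2000, 4000):
        unguarded, guarded = compare(n)
        print('{} dots: unguarded {:.4f}s, guarded {:.4f}s'.format(n, unguarded, guarded))
```

With a backtracking engine, doubling the dot count should roughly quadruple the unguarded time, while the guarded time should stay close to linear. Exact numbers depend on the machine and on how closely the simplified pattern resembles the real one.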

Bugs closed: 3085

Reviewed at https://rbcommons.com/s/twitter/r/3603/
landism authored and kwlzn committed Mar 24, 2016
1 parent 5503041 commit b50df7c
Showing 2 changed files with 16 additions and 2 deletions.
9 changes: 7 additions & 2 deletions src/python/pants/reporting/linkify.py
@@ -20,12 +20,17 @@
_ABS_PATH_COMPONENTS = r'({})+'.format(_ABS_PATH_COMPONENT)
_OPTIONAL_TARGET_SUFFIX = r'(:{})?'.format(_REL_PATH_COMPONENT) # For /foo/bar:target.

+# Ivy can print out many, many .'s in a row when downloading large jars. Evaluating _PATH in this
+# case can be ridiculously slow (e.g., 10+ minutes when there are 100k dots in a row).
+# Technically a path could start with 5+ dots. If this happens, it won't be linked.
+_IGNORE_LONG_DOT_CHAINS = r'(?!\.{5})'
+
# Note that we require at least two path components.
# We require the last character to be alphanumeric or underscore, because some tools print an
# ellipsis after file names (I'm looking at you, zinc). None of our files end in a dot in practice,
# so this is fine.
-_PATH = _PREFIX + _REL_PATH_COMPONENT + _OPTIONAL_PORT + _ABS_PATH_COMPONENTS + \
-_OPTIONAL_TARGET_SUFFIX + '\w'
+_PATH = _IGNORE_LONG_DOT_CHAINS + _PREFIX + _REL_PATH_COMPONENT + _OPTIONAL_PORT + \
+_ABS_PATH_COMPONENTS + _OPTIONAL_TARGET_SUFFIX + '\w'
_PATH_RE = re.compile(_PATH)

_NO_URL = "no url" # Sentinel value for non-existent files in linkify's memo
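To illustrate what the leading negative lookahead does, here is a minimal sketch. The component pattern is a hypothetical simplification, since `_PREFIX`, `_REL_PATH_COMPONENT`, and the other building blocks are defined outside this hunk, so it shows the technique rather than the exact pants regex.

```python
import re

# Hypothetical, simplified path component; the real linkify.py
# definitions are not visible in this diff.
COMPONENT = r'[\w.\-]+'

# At least two components, ending in a word character.
UNGUARDED = re.compile(COMPONENT + r'(?:/' + COMPONENT + r')+\w')

# Same pattern, but no match may start at an offset followed by 5+ dots.
# The lookahead fails immediately there, so the engine never enters the
# greedy component and never has to backtrack out of it.
GUARDED = re.compile(r'(?!\.{5})' + UNGUARDED.pattern)

sample = 'compiled src/python/pants/reporting/linkify.py ok'
dots = '.' * 2000

# Ordinary paths are matched the same way by both patterns.
print(UNGUARDED.findall(sample))
print(GUARDED.findall(sample))

# Neither pattern links a bare run of dots; the guarded one just gets
# to that answer without scanning the rest of the run at every offset.
print(UNGUARDED.findall(dots))
print(GUARDED.findall(dots))
```

The guard changes only where a match is allowed to begin. Inside a long run of dots, every start position except the last four is rejected after looking at five characters, instead of consuming the remainder of the run and backing out one character at a time.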
9 changes: 9 additions & 0 deletions tests/python/pants_test/reporting/test_linkify.py
@@ -92,3 +92,12 @@ def test_linkify_stores_values_in_memo(self):
memo = {}
self._do_test_linkify(url, url, memo)
self.assertEqual(url, memo[url])

+# Technically, if there's a file named ....., we should linkify it.
+# This is thus not actually verifying desired behavior. However,
+# this seems the most reasonable way to verify that linkify does
+# not go crazy on dots, as described in linkify.py.
+def test_linkify_ignore_many_dots(self):
+url = '.....'
+self._do_test_not_linkified(url)
