Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compilaton flag to avoid storing compiled regexp in internal cache #387

Closed
mrabarnett opened this issue Oct 22, 2020 · 8 comments
Closed

Comments

@mrabarnett
Copy link
Owner

Original report by DenisB (Bitbucket: DenisB, GitHub: DenisB).


Hi,

I use pretty big patterns for compilation with regex (about 200 kilobytes). Then I use it against some string and throw away. Suddenly I’ve experienced out of memory condition because of regex' internal cache.

There is a way to disable cache with DEBUG flag but I got a lot of messages to console :)

So I’ve made my own size-limited cache and call regex.purge() after each regex.compile() call.

It will be nice to avoid inserting compiled regexp into module’s internal cache with some flag:

compiled_and_not_on_cache = regex.compile(pattern, flags=regex.MULTILINE | regex.NO_CACHE)

Thank you.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


It couild be a reasonably argued that any pattern that’s explicitly compiled doesn’t need to go into the cache. That’s not the case in the ‘re’ module, but in that module the shortcut functions are more limited in their parameters. For example:

>>> import re
>>> re.search('c', 'abc', pos=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: search() got an unexpected keyword argument 'pos'
>>> re.compile('c').search('abc', pos=1)
<re.Match object; span=(2, 3), match='c'>
>>> 
>>> import regex
>>> regex.search('c', 'abc', pos=1)
<regex.Match object; span=(2, 3), match='c'>
>>> regex.compile('c').search('abc', pos=1)
<regex.Match object; span=(2, 3), match='c'>
>>>

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


From regex 2020.10.22, it doesn't cache patterns that are compiled explicitly.

@mrabarnett
Copy link
Owner Author

Original comment by Makyen (Bitbucket: Makyen, GitHub: Makyen).


Never caching explicitly compiled patterns seems like it has the potential to be a significant performance hit for a substantial portion of the userbase. It quite likely will be for us. What the actual impact will be is unknown. The change will require us to go through our entire codebase and rewrite any portions which use explicit compiles (which is quite inconvenient if the regex is used more than once in a function).

Having a cache which is always used is better, IMO.

Ideally, there would be a way to optionally exclude an explicitly compiled regex from the cache. If that was available, we would definitely use it. Currently, we use our own storage of compiled patterns for a portion of the patterns we use. For those which we don’t want cached by regex, we issue a regex.purge() after they have been compiled, but we definitely rely on everything else being stored in the regex cache. If any of the patterns which we're not wanting cached change, then we accept the temporary performance hit of having to recompile every pattern which is using the regex cache, but that’s only a hit for the next time each portion of the code is executed.

My concern with changing to not caching explicitly compiled patterns, beyond the substantial impact on our project, is that it's not the operation which most users intuitively expect and may have a noticeable performance impact on a substantial number of other projects. As an ongoing issue, users who are not aware of the caching difference between explicit and implicit compilation are likely to unknowingly experience lower performance than they would have otherwise seen. I don’t know what the impact is on other projects, but judging from a very quick search of our codebase, it appears that using an explicit compile is quite common, even in cases where an implicit compile would have been sufficient. From that, I expect that using explicit compiles is probably widespread.

An example use case where not caching an explicitly compiled pattern is: a function which uses the same pattern more than once, so it is explicitly compiled and reused within the function. With implicit caching, that compilation automatically happens only once throughout the entire run of the program. Without implicit caching,, that compilation happens every time the function is called. That has the potential to be a substantial performance impact, if the function is called a significant number of times.

From our project’s point of view, we will be locking the regex version to < 2020.10.22 until A] the impact can be investigated and all portions of our code reviewed and rewritten where necessary, and B] our contributors and code reviewers can be educated about the issue. The likely effective result will be that we remain on < 2020.10.22 until a bugfix or new feature forces us to do the above work.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


OK, slight reversion/revision. From regex 2020.10.23, you can prevent explicitly-compiled patterns from being cached by using regex.cache_all(False).

@mrabarnett
Copy link
Owner Author

Original comment by Makyen (Bitbucket: Makyen, GitHub: Makyen).


Thank you. That will work much better for us and allow us to not have to regex.purge(), as we know exactly what we desire to exclude from the cache (i.e. which compiled patterns are already persistently stored).

Just to double-check: regex.cache_all(False) is a toggle, so if we do:

import regex

foo_regex = regex.compile('foo')
regex.cache_all(False)
bar_regex = regex.compile('bar')
regex.cache_all(True)
baz_regex = regex.compile('baz')

Then, foo_regex and baz_regex will be cached, but bar_regex will not be cached. Correct?

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Correct. It controls whether new explictly-compiled patterns will be cached.

@mrabarnett
Copy link
Owner Author

Original comment by Makyen (Bitbucket: Makyen, GitHub: Makyen).


Great! Thank you very much for making the change. It will be quite helpful.

@mrabarnett
Copy link
Owner Author

Original comment by DenisB (Bitbucket: DenisB, GitHub: DenisB).


Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant