Memory usage issues with large Apache httpd configuration files #569

Closed
joohoi opened this issue Jun 27, 2018 · 5 comments

Comments

@joohoi

joohoi commented Jun 27, 2018

A user reported a memory usage issue in Certbot, and when investigating the cause I found out that our (very suboptimal) regexes caused the memory usage of the Augeas match to balloon out of proportion.

This only happens with really large configurations and complex regexes, however. The reporting user's configuration had roughly 300 configuration files with 1500 lines each. I was able to create a much simplified file with a similar effect. The effects start to be really visible after 40k lines, and at 80k lines augtool with the stock httpd lens segfaults on startup on Debian stretch.

Please note that the regex is really suboptimal; it is used for backwards compatibility with environments whose Augeas version is too old to support the case-insensitive regex flag 'i'. Using a recent version and the case-insensitive flag reduces the memory footprint by roughly two thirds, but it is still roughly 400-fold compared to the actual configuration size.
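
For comparison, the case-insensitive variant of the match would look roughly like this (a sketch only: it assumes the two-argument regexp(PATTERN, 'i') form that newer Augeas versions accept for case-insensitive matching, and the exact expression Certbot generates may differ):

match /files/etc/apache2/sites-available/*[label()=~regexp('.*\\.conf')]//*[self::directive=~regexp('include|includeoptional', 'i')]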

How to reproduce

  1. Modify the following template, repeating the AddType huge/memory usage line 50k times (a shell sketch for generating this file follows these steps).
<VirtualHost *:80>
	ServerName memory.issue
	ServerAdmin webmaster@localhost
	DocumentRoot /var/www/html
	<Directory /var/www/html>
		<IfModule mod_mime.c>
			# Repeat the following line 50k times
			AddType huge/memory       usage	
		</IfModule>
	</Directory>
</VirtualHost>
  2. Copy the modified template to /etc/apache2/sites-available/memorybomb.conf

  3. Start augtool -I httpd.aug

  4. Use the suboptimal regex:

match /files/etc/apache2/sites-available/*[label()=~regexp('.*\\.conf')]//*[self::directive=~regexp('([Ii][Nn][Cc][Ll][Uu][Dd][Ee])|([Ii][Nn][Cc][Ll][Uu][Dd][Ee])|([Ii][Nn][Cc][Ll][Uu][Dd][Ee][Oo][Pp][Tt][Ii][Oo][Nn][Aa][Ll])')]
  5. Observe the memory usage; it should use roughly 2 GB of memory during the regex match, roughly 1000 times the on-disk size of the configuration.
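
For convenience, here is one way to generate the test file from the template above (a rough shell sketch; the path and the 50k repeat count come from the steps, and writing under /etc/apache2 typically needs root):

# Sketch: build memorybomb.conf with the AddType line repeated 50k times
{
  printf '<VirtualHost *:80>\n'
  printf '\tServerName memory.issue\n\tServerAdmin webmaster@localhost\n\tDocumentRoot /var/www/html\n'
  printf '\t<Directory /var/www/html>\n\t\t<IfModule mod_mime.c>\n'
  for i in $(seq 1 50000); do printf '\t\t\tAddType huge/memory       usage\n'; done
  printf '\t\t</IfModule>\n\t</Directory>\n</VirtualHost>\n'
} > /etc/apache2/sites-available/memorybomb.conf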
@lutter
Member

lutter commented Jun 27, 2018

Thanks for the detailed description. I am having trouble reproducing this behavior. I followed your instructions, and with a file where the line is repeated 50k times, memory usage only goes up to ~ 120MB.

It also doesn't seem to make a difference whether I use that match or not in terms of memory usage.

I tried this with augeas 1.8.0 (the version in stretch, AFAIK), augeas 1.10.0, and the latest from git HEAD with roughly the same results. I did all my experiments on Fedora 28; I'll try and get my hands on Debian Stretch, too.

Can you confirm that you see this crazy amount of memory used if you run augtool -At 'Httpd.lns incl /etc/apache2/sites-available/memorybomb.conf' quit ? I just want to make sure there isn't something else that factors into the memory use.

BTW, the -I httpd.aug in your example doesn't have an effect - you have to pass the directory that contains httpd.aug to -I, not the file itself.

@joohoi
Author

joohoi commented Jun 27, 2018

Can you confirm that you see this crazy amount of memory used if you run augtool -At 'Httpd.lns incl /etc/apache2/sites-available/memorybomb.conf' quit ? I just want to make sure there isn't something else that factors into the memory use.

I don't. The memory usage balloons while running the match. If I make memorybomb.conf ~80k lines, I do see the segfault with that command, though.

I'm running augeas version: 1.8.0-1+deb9u1

@lutter
Member

lutter commented Jun 27, 2018

Thanks .. after trying some more (and realizing where I made an early morning mistake) I can reproduce this now. It looks like the amount of memory taken is related to the size of the regexp. I'll look into it.

@lutter
Member

lutter commented Jun 29, 2018

I looked into this some more and have a good understanding of what's causing the issue. The basic problem is that the interpreter for path expressions does some very naive things (which are fine when you search over a few hundred nodes, but really hurt when you are dealing with 50k nodes). In particular, the interpreter recompiles the regex every time it needs to check a node (there's no lifting of constant expressions out of loops), and it simplifies its memory management by not releasing memory until it's done evaluating a path expression. Those two together lead to the memory blowup.

Addressing that will take a bit of time; I have some POC code for lifting constant expressions which brings memory usage down from > 2GB to ~ 180MB. But since this is a fairly intrusive change, it needs more work and testing.

One thing I realized is that you can change your expression slightly: instead of //*[self::directive =~ regexp(...)], use //directive[. =~ regexp(...)]. That cuts down on the number of nodes that have to be compared against the regexp, and therefore produces less garbage during evaluation. In my testing, it brings memory usage down from ~ 2.2 GB to ~ 860MB - not a solution, but a decent bandaid, and the right thing to do anyway as it eliminates unnecessary computation.
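
Applied to the reproduction case above, the rewritten match would look something like this (a sketch; the duplicated Include alternative from the original command is dropped, otherwise the regexp is unchanged):

match /files/etc/apache2/sites-available/*[label()=~regexp('.*\\.conf')]//directive[. =~ regexp('([Ii][Nn][Cc][Ll][Uu][Dd][Ee])|([Ii][Nn][Cc][Ll][Uu][Dd][Ee][Oo][Pp][Tt][Ii][Oo][Nn][Aa][Ll])')]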

I saw the segfault with 80k entries, too. I haven't looked into it, but it looks like it's caused by an integer overflow in a part of the code that's unrelated to this problem. I'll look at that in more detail once I've got a handle on the memory usage.

lutter added a commit to lutter/augeas that referenced this issue Aug 9, 2018
In path expressions, we generally need to evaluate functions against every
node that we consider for the result set. For example, in the path
expression /files/etc/hosts/*[ipaddr =~ regexp('127\\.')], the regexp
function was evaluated against every entry in /etc/hosts. Evaluating that
function requires the construction and compilation of a new regexp. Because
of how memory is managed during evaluation of path expressions, the memory
used by all these copies of the same regexp is only freed after we are done
evaluating the path expression. This causes unacceptable memory usage in
large files (see hercules-team#569)

To avoid these issues, we now distinguish between pure and impure functions
in the path expression interpreter. When we encounter a pure function, we
change the AST for the path expression so that the function invocation is
replaced with the result of invoking the function. With the example above,
that means we only construct and compile the regexp '127\\.' once,
regardless of how many nodes it gets checked against. That leads to a
dramatic reduction in the memory required to evaluate path expressions with
such constructs against large files.

Fixes hercules-team#569
@lutter
Member

lutter commented Aug 9, 2018

At long last, PR #578 has a patch that reduces memory consumption for this issue a lot. In my testing, for a 50000 line httpd.conf, memory usage goes from ~ 2.2GB down to ~ 107MB. That's still a lot for a 1.7MB file, but it now seems that that memory is not used by the match statement but by something else.

If you have a chance to give this a spin, I would very much appreciate confirmation of my testing.
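
One way to check the peak memory for the whole run (a sketch, assuming GNU time is installed as /usr/bin/time; the match uses the shorter //directive form suggested earlier):

# Sketch: feed the match to augtool and report peak RSS via GNU time
echo "match /files/etc/apache2/sites-available/memorybomb.conf//directive[. =~ regexp('[Ii]nclude|[Ii]nclude[Oo]ptional')]" \
  | /usr/bin/time -v augtool -At 'Httpd.lns incl /etc/apache2/sites-available/memorybomb.conf'

The "Maximum resident set size" line in the output gives the peak usage.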

lutter added a commit to lutter/augeas that referenced this issue Aug 22, 2018
clrpackages pushed a commit to clearlinux-pkgs/augeas that referenced this issue Nov 19, 2018
….11.0

1.11.0 - 2018-08-24
  - General changes/additions
    * augmatch: add a --quiet option; make the exit status useful to tell
                whether there was a match or not
    * Drastically reduce the amount of memory needed to evaluate complex
      path expressions against large files (Issue #569)
    * Fix a segfault on OSX when 'augmatch' is run without any
      arguments (Issue #556)
  - API changes
    * aug_source did not in fact return the source; and always returned
      NULL for that. That has been fixed.
  - Lens changes/additions
    * Chrony: add new options supported in chrony 3.2 and 3.3 (Miroslav Lichvar)
    * Dhclient: fix parsing of append/prepend and similar directives
                (John Morrissey)

(NEWS truncated at 15 lines)

	Version 1.11.0

2018-08-22  David Lutterkort  <lutter@watzmann.net>

	Replace pure function invocations in path expressions with their result
	In path expressions, we generally need to evaluate functions against every
	node that we consider for the result set. For example, in the path
	expression /files/etc/hosts/*[ipaddr =~ regexp('127\\.')], the regexp
	function was evaluated against every entry in /etc/hosts. Evaluating that
	function requires the construction and compilation of a new regexp. Because
	of how memory is managed during evaluation of path expressions, the memory
	used by all these copies of the same regexp is only freed after we are done
	evaluating the path expression. This causes unacceptable memory usage in
	large files (see hercules-team/augeas#569)

(NEWS truncated at 15 lines)