
Optimize browscap.ini loading and get_browser() #2242

Closed
nikic wants to merge 10 commits into PHP-7.0

Conversation

@nikic (Member) commented Dec 16, 2016

No description provided.

@nikic nikic changed the base branch from master to PHP-7.0 December 16, 2016 18:40
@nikic nikic closed this Dec 16, 2016
@nikic nikic reopened this Dec 16, 2016
@nikic (Member, Author) commented Dec 16, 2016

For the 8MB browscap.ini file at http://access.thelounge.net/harry/browscap.ini.txt:

  • According to massif, the peak memory usage of PHP drops from 85MB to 23MB.
  • The startup time of PHP drops from 0.19s to 0.10s.
  • The time of running the get_browser_basic.phpt test with this ini drops from 19s to 0.23s. (!!!)

@nikic (Member, Author) commented Dec 16, 2016

The main memory usage optimizations implemented are:

  • Fix stack management in ini parser. This is unrelated, just included it here to get a CI run. This drops the stack usage for parsing this file from 1MB to virtually zero. Landed as fa92a16.
  • Use interning for keys and values occurring in the browscap file. By "interning" I mean something local to browscap, not the global interning mechanism.
  • The key value pairs for the different patterns are no longer represented as a hash table. Instead, there is a common array containing all key value pairs and each pattern only stores offsets into that array. This avoids lots of space wastage due to partially unused hashtables. We only reconstruct a hashtable for the actual get_browser() return value.
  • There is now a dedicated browscap entry structure, which stores these key/value offsets as well as the pattern and the parent. A few extra pieces of information are also stored to speed up matching.
  • The regex is now computed from the pattern lazily during get_browser() matching. It is no longer stored. (This is possible due to the following optimizations, which reduce the number of matched regexes.)
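The interning and offset-based storage described above can be sketched as follows. This is an illustrative Python model, not the actual C code in the patch; all class and method names here are invented for the sketch:

```python
class BrowscapStorage:
    """Simplified model of browscap-local string interning plus the
    single shared key/value array that entries index into by offset."""

    def __init__(self):
        self.interned = {}   # string -> the one shared copy of that string
        self.kv_pairs = []   # one flat list of (key, value) for ALL patterns
        self.entries = {}    # pattern -> (kv_start, kv_count, parent)

    def intern(self, s):
        # Return the shared copy of s, storing it on first sight.
        return self.interned.setdefault(s, s)

    def add_entry(self, pattern, parent, pairs):
        start = len(self.kv_pairs)
        for key, value in pairs:
            # Keys like "Browser" and "Version" repeat across thousands
            # of entries; interning collapses them to one copy each.
            self.kv_pairs.append((self.intern(key), self.intern(value)))
        self.entries[pattern] = (start, len(pairs), parent)

    def get_pairs(self, pattern):
        # Reconstruct a real mapping only on demand, mirroring how a
        # hashtable is rebuilt just for the get_browser() return value.
        start, count, _parent = self.entries[pattern]
        return dict(self.kv_pairs[start:start + count])

storage = BrowscapStorage()
storage.add_entry("Mozilla/5.0*", None,
                  [("Browser", "Mozilla"), ("Version", "5.0")])
storage.add_entry("Opera*", None,
                  [("Browser", "Opera"), ("Version", "0")])
print(storage.get_pairs("Opera*"))  # {'Browser': 'Opera', 'Version': '0'}
```

Since each entry stores only a start offset and a count, no per-pattern hashtable (with its partially unused buckets) has to be kept alive.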

The optimizations to matching are:

  • The key problem is that there are a lot of patterns, and compiling a regular expression for each pattern is prohibitively slow. As such, we want to cheaply identify patterns that certainly cannot match.
  • For this purpose we now store the length of the placeholder-free prefix of each pattern. If the prefix doesn't match, the pattern certainly doesn't match.
  • Additionally, we store the start/length of 3 other placeholder-free strings in the pattern. We then check whether the agent name contains these strings using zend_memnstr. If it doesn't, we can also skip the pattern.
  • For my testcase, this reduces the number of regexes actually matched from >300k to about 5k, making for a large speedup.

To make the storage of all these offsets/lengths more compact, I've limited the pattern size to 16 bits. I don't think this is problematic, as the patterns are much shorter than that.
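The cheap pre-checks can be sketched like this. This is again an illustrative Python model: the real code works on precomputed offsets with zend_memnstr rather than Python string operations, and the helper names are invented:

```python
import re

def placeholder_free_prefix(pattern):
    # The leading run of literal characters, before the first * or ?.
    for i, c in enumerate(pattern):
        if c in "*?":
            return pattern[:i]
    return pattern

def contains_segments(pattern, max_segments=3):
    # Literal substrings between placeholders. The real code stores
    # start/length offsets into the pattern instead of copies.
    parts = [p for p in re.split(r"[*?]+", pattern) if p]
    return parts[1:1 + max_segments]  # the prefix is checked separately

def might_match(agent, pattern):
    # Conservative prefilter: only necessary conditions are checked,
    # so a False here means the regex certainly cannot match, while a
    # True just lets the pattern through to the expensive regex stage.
    if not agent.startswith(placeholder_free_prefix(pattern)):
        return False
    for segment in contains_segments(pattern):
        if segment not in agent:
            return False
    return True

def pattern_to_regex(pattern):
    # Computed lazily, and only for patterns surviving the pre-checks.
    return re.compile(
        re.escape(pattern).replace(r"\*", ".*").replace(r"\?", "."))

agent = "Mozilla/5.0 (Windows NT 10.0) Chrome/55.0"
print(might_match(agent, "Mozilla/5.0*Chrome*"))  # True
print(might_match(agent, "Opera*"))               # False, skipped cheaply
```

Because the prefilter rejects the vast majority of the >300k patterns with plain string comparisons, only the few thousand survivors ever pay for regex compilation and matching.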

@jerrygrey commented

This is quite needed, thank you! 🎉

@nikic (Member, Author) commented Dec 17, 2016

@weltling @krakjoe I originally targeted this change at PHP-7.0, because among other things, it fixes a performance regression in PHP 7.0 (see https://bugs.php.net/bug.php?id=73265). However, the change turned out to be quite large, so probably too intrusive at this point. Can this at least go into 7.1?

@weltling (Contributor) commented

@nikic, IMO the patch size is not a blocker for 7.0 in this case. Reasons -

  • no API/ABI break
  • the affected area is not critical
  • the prospective improvement is significant

I'd be more concerned about a possible behavior change, as there are currently not that many tests there and the amount of test data is small. As far as I'm concerned, if we could test a large data set of user agent strings to ensure there are no regressions, the change could be declared good for 7.0. Maybe we could ask the reporter for some real-world data; it could be compressed and prepared for the test?

Thanks.

@nikic (Member, Author) commented Dec 18, 2016

Good point about the tests. I couldn't find any source that provides a large corpus of agent strings in machine-readable form. I ended up crawling http://useragentstring.com/pages/useragentstring.php instead, which provides about 20k distinct user agent strings. I checked that the output of get_browser() before and after this change is identical.

I have two concerns about including this in the test suite: First, I'm not sure it's okay to include this data for copyright reasons (as they don't make the raw data available themselves). Second, while checking all user agents (against an 8M browscap.ini) previously took more than an hour, it still takes one minute now, so this would be a very slow test.
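A before/after regression check of this kind can be sketched as follows. This is a hedged sketch, not code from the PR; it assumes each build's get_browser() output for the corpus was captured to a text file:

```python
import itertools

def first_difference(path_before, path_after):
    """Compare two captured get_browser() output dumps line by line.

    Returns the first (line_no, before_line, after_line) triple where
    the dumps disagree, or None if the outputs are identical.
    """
    with open(path_before) as before, open(path_after) as after:
        pairs = itertools.zip_longest(before, after)
        for line_no, (a, b) in enumerate(pairs, start=1):
            if a != b:
                return (line_no, a, b)
    return None
```

A run over the whole corpus then reduces to asserting that `first_difference(dump_old, dump_new) is None` for the patched and unpatched builds.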

@weltling (Contributor) commented Dec 18, 2016

As additional info: I've extracted some data from a private site covering the last couple of weeks, about 1k strings extracted with a quick bash one-liner. Some are invalid or were clearly composed by some robot/phishing software, but most are correct. Then I ran this code:

$lst = file("user_agents.txt", FILE_IGNORE_NEW_LINES);
foreach ($lst as $b) {
    var_dump(get_browser($b, true));
}

Here's the data

http://belski.net/uploads/my/pr_2242/user_agents.txt.gz
http://belski.net/uploads/my/pr_2242/browscap_no_patch.out.gz

There is no parsing difference between the patched and unpatched versions, but a huge timing difference: 2 hours vs. 25 seconds. Magic :)

I really don't see any reason to defer the patch; 7.0 is fine. Though, I'd be in favor of adding a test; it could be marked as slow. It would be skipped on Travis but would otherwise be fine for local runs; for example, I don't exclude any tests in my own test runs, and many others don't either. Of course I don't claim the data to be a reference, just a quick extract, but it could be used. Anyway, no license issues on my side :)

Thanks.

@weltling (Contributor) commented Dec 18, 2016

Or maybe use a partial data set; that could reduce the test time even more. For example, similar strings like AppleWebKit/537.36 vs. AppleWebKit/537.37 don't make much difference.

Thanks.

@php-pulls commented on behalf of krakjoe at php.net:

labelling

Commit messages from the pull request:

  • Testing corpus provided by Anatol against current browscap.ini lite. About 30% of all agents are not recognized by this browscap.ini, but this should give us decent coverage.
  • Avoid many string duplications, use interning (browscap-local, of course), reduce pattern size.
  • Try to avoid expensive regex compilations if the prefix already doesn't match.
  • Also compute regex lazily. We're matching so few regexes now, it doesn't make sense to store precomputed regex patterns.
  • Pattern length restricted to 16 bits to reduce structure size. In particular, make the number of "contains" segments configurable. Also limit lengths to 256 bytes. We simply clamp the length -- it's not a problem if we match an incomplete prefix/contains segment.
  • Lowercasing is now the dominant time component. Alternatively, a lowercased version of the pattern could be precomputed, but this would require more memory again -- worthwhile?
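The clamping mentioned in the commit messages can be sketched like this. The field widths (16-bit pattern length, 8-bit clamped prefix/"contains" lengths) come from the messages above, but the exact layout is illustrative, not the actual C struct:

```python
import struct

# Illustrative compact entry layout: a 16-bit pattern length followed
# by four 8-bit length fields (prefix + up to 3 "contains" segments).
ENTRY = struct.Struct("<H B 3B")  # 6 bytes total, little-endian

def pack_entry(pattern_len, prefix_len, contains_lens):
    if pattern_len > 0xFFFF:
        # Pattern size is a hard 16-bit limit; real patterns are far shorter.
        raise ValueError("pattern too long for 16-bit length field")
    # Segment lengths are merely clamped, not rejected: matching an
    # incomplete prefix/contains segment only weakens the prefilter,
    # it can never cause a correct match to be skipped.
    clamp = lambda n: min(n, 255)
    lens = [clamp(l) for l in contains_lens[:3]]
    lens += [0] * (3 - len(lens))  # unused segment slots stay zero
    return ENTRY.pack(pattern_len, clamp(prefix_len), *lens)

packed = pack_entry(pattern_len=42, prefix_len=300, contains_lens=[10, 20])
print(len(packed), ENTRY.unpack(packed))  # 6 (42, 255, 10, 20, 0)
```

Clamping rather than erroring keeps the entry structure at a fixed small size regardless of how long the literal segments in a pattern happen to be.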
@nikic (Member, Author) commented Jan 1, 2017

In 8043451 I've added a test that checks the current browscap_lite.ini against your corpus. browscap_lite recognizes about 70% of the agents -- I think this is good enough to get decent coverage while keeping the test run time small and avoiding embedding a large browscap.ini in the repo.

@weltling (Contributor) commented Jan 2, 2017

ACK, maybe it's time to merge it then? It could slide into the RC tomorrow, or land at a later point anyway.

thanks.

@nikic (Member, Author) commented Jan 2, 2017

Merged into 7.0.

@nikic nikic closed this Jan 2, 2017