Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analyze performance of user_agent_parser gem #5

Closed
jsvd opened this issue Jun 26, 2015 · 18 comments
Closed

analyze performance of user_agent_parser gem #5

jsvd opened this issue Jun 26, 2015 · 18 comments

Comments

@jsvd
Copy link
Contributor

jsvd commented Jun 26, 2015

Currently this plugin can be a major resource of CPU usage during data ingestion.

In my MBP 13" core i5, 16gb and SSD, adding this plugin to a stdin -> grok -> date -> geoip -> elasticsearch pipeline slows the ingestion of 300k events by 30-40%

This is due to the high number of Regexp#match operations it's required to do for each single event.
Possible improvements: carefully introducing a LRU cache or reorganizing the yml file without losing the "specific to general" regexp pattern matching

@breml
Copy link

breml commented Aug 25, 2015

Interesting option to improve performance for slow filters:
https://github.com/coolacid/logstash-filter-cache-memcached

@ebuildy
Copy link

ebuildy commented Mar 8, 2016

Do you think the Java version could be faster?

Cheers,

@original-brownbear
Copy link
Contributor

original-brownbear commented May 1, 2017

@jsvd looked into this a little today.

I guess step 1 here would be answering:

Do you think the Java version could be faster?

As far as I can see the linked PRs only attempted using some other Java lib and not specifically the Java version of the lib currently used (https://github.com/ua-parser/uap-java). I'd be very surprised if that wouldn't yield a serious speedup. And even if it doesn't it shouldn't be so hard to fix whatever is holding the Java library back (still seeing some nasty things in the Java version too, so there's room <= lots of redundant parsing of part of the UA String for example).

If you want I can take a stab at integrating the Java version + setting up a realistic benchmark to judge it. That shouldn't take much time :)

@jsvd
Copy link
Contributor Author

jsvd commented May 2, 2017

@original-brownbear ++ on experimenting with uap-java, specially since user-agent-utils was eol'd.
as Suyog mentioned in the email it's worth investigating both the performance gain and to estimate how different/better/worse the results are with some test dataset of user agent strings.

@original-brownbear
Copy link
Contributor

original-brownbear commented May 2, 2017

@jsvd should I go ahead on this one? :)
Edit: on it as discussed :)

@jsvd
Copy link
Contributor Author

jsvd commented May 2, 2017

yep. regardless of the outcome it will be an interesting exercise and will help us understand the performance nuances of this kind of problem

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 3, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 3, 2017
@original-brownbear
Copy link
Contributor

@jsvd @suyograo so I set up the java version and a benchmark in https://github.com/original-brownbear/logstash-filter-useragent/tree/5.

Used the test datasets the uap data has here (https://github.com/ua-parser/uap-core/tree/master/test_resources) and just ran over it (43k samples) in two runs:

Good news:

  • Results between Ruby and Java match 100% it seems

Not so exciting news:

  • Performance only goes up 2x

Parse all sequentially:
Ruby version: ~80s
Java version: ~40s

Parse all sequentially and repeat each String 10 times:
... same as without repeating. 9 x 43k cache lookups aren't even visible relative to 43k parses, which was to be expected I guess :)

=> I'll see if I can make the Java version faster with reasonable effort
=> It's probably most promising to implement a stronger (i.e. lowest possible memory footprint) cache. It should be possible to be really efficient here by simply caching serialized parse results instead of actual objects. ... let me try speeding up the actual parser first though :)

@original-brownbear
Copy link
Contributor

profile of the Java version:

perf

@original-brownbear
Copy link
Contributor

Unfortunately for the overall runtime we have:

allfind

=> it's all about the find calls that don't leave that much room for improvement, but let's see

@ebuildy
Copy link

ebuildy commented May 3, 2017

Did you try this one https://rubygems.org/gems/logstash-filter-useragent2/versions/3.0.0-java ?

Using a different Java UA parser.

@original-brownbear
Copy link
Contributor

@ebuildy no I tried https://github.com/ua-parser/uap-java here.

Is that the version you used to get the 2.5x speedup in #23 (comment) ?

@ebuildy
Copy link

ebuildy commented May 3, 2017

Ya, but fields are different, in my use case (many different browsers), this helped a lot.

I didnt catch up latest news about this plugin, do you plan to do an official Java version? thanks you

@original-brownbear
Copy link
Contributor

@ebuildy

Ya, but fields are different, in my use case (many different browsers), this helped a lot.

I think for now (step 1) we need to keep the fields (and unfortunately the underlying regular expressions) exactly the same for compatibility reasons.

I didnt catch up latest news about this plugin, do you plan to do an official Java version? thanks you

I think so if it actually does improve performance it is my understanding that we will move to a Java version.

@ebuildy
Copy link

ebuildy commented May 3, 2017

Very nice !

Keep me posted for a test it on a real env. if you want (10m hits per hour) ebuildy at gmail dot com

Many thanks

@original-brownbear
Copy link
Contributor

@jsvd @suyograo so this is what I found out/created:

  • I was able to way optimize the memory use of the original Java version in my branch https://github.com/original-brownbear/logstash-filter-useragent/tree/5
    • Even at 100k elements cached it doesn't show in the GC noise relative 1k elements cached (at -Xmx500m! and without G1 String dedup)
  • I was not able to speed things up beyond a certain point, it's about 2.5x to 3x faster than the JRuby regex calls
    • re2j is not fully compatible with java.util.regex for the regex in the yml => not an option for maintainability reasons (imo)
    • 98%+ of the runtime is java.util.regex.Matcher#find() now => not really all that much room at lowering CPU use left here (imo)
    • With a 100k cache-size, the cache can do 500k+ lookups/s easily

=> I think we may be good (enough) here with the above. Realistically speaking, I feel like we could simply advise users to set cache size ~= number_of_daily_uniques and this thing won't contribute much to the overall pipeline run-time, right? (100k didn't really show if you have orders of magnitude more uniques daily you probably also have enough ram to crank up the setting)

@jsvd
Copy link
Contributor Author

jsvd commented May 3, 2017

This is great, @original-brownbear! the speed up + lower cache footprint are definitely enough gains to move to creating a PR.
Pairing this with metrics on cache hits/misses may also be interesting, depending on the cost of the observer effect. Either way, this can be done in a separate PR.

@suyograo
Copy link
Contributor

suyograo commented May 3, 2017

@original-brownbear nice work! Bummer about re2j -- and incompatible regex matches are a no go. I'm assuming there is a good amount of user data indexed out there resulting from this filter.

Given your analysis, aggressive caching + UAP in java seems like a good step, so +1 to turn this into a PR.

@original-brownbear
Copy link
Contributor

@suyograo @jsvd PR incoming then, moving the build to Gradle and cleaning up the packaging a little is all that's left I think :)

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 6, 2017
@suyograo suyograo mentioned this issue May 6, 2017
5 tasks
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 6, 2017
original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 6, 2017
* Speedups in UAP-Java code
* Output format adjustments to UAP-Java code
* Refactored Ruby code to work with UAP-Java code

Fixes logstash-plugins#5
elasticsearch-bot pushed a commit that referenced this issue May 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants