analyze performance of user_agent_parser gem #5

jsvd · 2015-06-26T11:09:01Z

Currently this plugin can be a major source of CPU usage during data ingestion.

On my 13" MBP (Core i5, 16 GB RAM, SSD), adding this plugin to a stdin -> grok -> date -> geoip -> elasticsearch pipeline slows the ingestion of 300k events by 30-40%.

This is due to the high number of Regexp#match operations required for every single event.
Possible improvements: carefully introducing an LRU cache, or reorganizing the yml file without losing the "specific to general" order of regexp pattern matching.
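
To make the two ideas above concrete, here is a minimal Ruby sketch, not the plugin's actual code. The three patterns are a hypothetical, heavily shortened stand-in for the real yml file, and `LruCache`, `parse_uncached` and `parse` are names made up for illustration. It shows why pattern order matters (first match wins) and how an LRU cache in front of the parser skips the regexp scan for repeated strings.

```ruby
# Hypothetical, heavily shortened pattern list (the real yml file is far longer).
# Order is "specific to general": Chrome UAs also contain "Safari",
# so the Chrome pattern has to be tried before the Safari one.
PATTERNS = [
  [/(Firefox)\/(\d+)\.(\d+)/, "Firefox"],
  [/(Chrome)\/(\d+)\.(\d+)/,  "Chrome"],
  [/(Safari)\/(\d+)/,         "Safari"],
].freeze

# Worst case this runs one Regexp#match per pattern for every event.
def parse_uncached(ua)
  PATTERNS.each do |regex, family|
    m = regex.match(ua)
    return { "name" => family, "major" => m[2] } if m
  end
  { "name" => "Other", "major" => nil }
end

# Minimal LRU cache relying on Ruby hashes keeping insertion order.
class LruCache
  def initialize(max_size)
    @max_size = max_size
    @store = {}
  end

  def fetch(key)
    if @store.key?(key)
      value = @store.delete(key) # hit: re-insert to mark as most recently used
      @store[key] = value
    else
      value = yield(key)         # miss: run the expensive parse
      @store[key] = value
      @store.shift if @store.size > @max_size # drop least recently used entry
      value
    end
  end
end

CACHE = LruCache.new(1000) # size is an arbitrary assumption

def parse(ua)
  CACHE.fetch(ua) { |key| parse_uncached(key) }
end
```

The cache only helps when the same user agent strings repeat, so its benefit depends on the traffic mix.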

breml · 2015-08-25T14:34:01Z

Interesting option to improve performance for slow filters:
https://github.com/coolacid/logstash-filter-cache-memcached

ebuildy · 2016-03-08T07:37:53Z

Do you think the Java version could be faster?

Cheers,

original-brownbear · 2017-05-01T18:07:17Z

@jsvd I looked into this a little today.

I guess step 1 here would be answering:

> Do you think the Java version could be faster?

As far as I can see, the linked PRs only tried a different Java lib, not the Java version of the lib currently used (https://github.com/ua-parser/uap-java). I'd be very surprised if that didn't yield a serious speedup. And even if it doesn't, it shouldn't be hard to fix whatever is holding the Java library back (I'm still seeing some nasty things in the Java version too, for example lots of redundant parsing of parts of the UA string, so there's room for improvement).

If you want I can take a stab at integrating the Java version + setting up a realistic benchmark to judge it. That shouldn't take much time :)

jsvd · 2017-05-02T10:06:52Z

@original-brownbear ++ on experimenting with uap-java, especially since user-agent-utils was eol'd.
As Suyog mentioned in the email, it's worth investigating both the performance gain and how different/better/worse the results are on some test dataset of user agent strings.

original-brownbear · 2017-05-02T10:22:41Z

@jsvd should I go ahead on this one? :)
Edit: on it as discussed :)

jsvd · 2017-05-02T10:59:02Z

yep. regardless of the outcome it will be an interesting exercise and will help us understand the performance nuances of this kind of problem

original-brownbear · 2017-05-03T07:26:03Z

@jsvd @suyograo so I set up the java version and a benchmark in https://github.com/original-brownbear/logstash-filter-useragent/tree/5.

I used the test datasets that uap-core ships (https://github.com/ua-parser/uap-core/tree/master/test_resources) and ran over them (43k samples) in two runs:

Good news:

Results between Ruby and Java match 100% it seems

Not so exciting news:

Performance only goes up 2x

Parse all sequentially:
Ruby version: ~80s
Java version: ~40s

Parse all sequentially and repeat each String 10 times:
... same as without repeating. 9 x 43k cache lookups aren't even visible relative to 43k parses, which was to be expected I guess :)

=> I'll see if I can make the Java version faster with reasonable effort
=> It's probably most promising to implement a stronger (i.e. lowest possible memory footprint) cache. It should be possible to be really efficient here by simply caching serialized parse results instead of actual objects. ... let me try speeding up the actual parser first though :)
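
As a rough illustration of "caching serialized parse results instead of actual objects", here is a small Ruby sketch. It is not the code from the branch: the field list, the separator and the `SerializedCache` class are assumptions made up for this example. Each cached entry is one flat string rather than a hash with several string objects, and the hash is rebuilt on every hit.

```ruby
# Hypothetical field set; the real plugin's fields may differ.
FIELDS = %w[name major minor os device].freeze
# Assumed never to occur inside a parsed field.
SEP = "\u0000"

# Flatten a parse result into one string (nil becomes an empty string).
def serialize(result)
  FIELDS.map { |f| result[f].to_s }.join(SEP)
end

# Rebuild the hash; the -1 limit keeps trailing empty fields.
def deserialize(blob)
  FIELDS.zip(blob.split(SEP, -1)).to_h
end

# Unbounded cache for brevity; a real one would also need eviction.
class SerializedCache
  def initialize
    @blobs = {}
  end

  def fetch(ua)
    blob = (@blobs[ua] ||= serialize(yield(ua)))
    deserialize(blob)
  end
end
```

The trade-off is a little CPU on each hit (splitting the string) in exchange for fewer long-lived objects per cached entry. Note that this simple encoding cannot tell a nil field from an empty one.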

original-brownbear · 2017-05-03T07:31:33Z

profile of the Java version:

original-brownbear · 2017-05-03T07:51:08Z

Unfortunately for the overall runtime we have:

=> it's all about the find calls that don't leave that much room for improvement, but let's see

ebuildy · 2017-05-03T07:54:41Z

Did you try this one https://rubygems.org/gems/logstash-filter-useragent2/versions/3.0.0-java ?

Using a different Java UA parser.

original-brownbear · 2017-05-03T07:58:28Z

@ebuildy no I tried https://github.com/ua-parser/uap-java here.

Is that the version you used to get the 2.5x speedup in #23 (comment) ?

ebuildy · 2017-05-03T09:41:04Z

Ya, but the fields are different. In my use case (many different browsers), this helped a lot.

I haven't caught up on the latest news about this plugin. Do you plan to do an official Java version? Thank you.

original-brownbear · 2017-05-03T10:05:24Z

@ebuildy

> Ya, but the fields are different. In my use case (many different browsers), this helped a lot.

I think for now (step 1) we need to keep the fields (and unfortunately the underlying regular expressions) exactly the same for compatibility reasons.

> I haven't caught up on the latest news about this plugin. Do you plan to do an official Java version? Thank you.

I think so. If it actually improves performance, my understanding is that we will move to a Java version.

ebuildy · 2017-05-03T10:08:07Z

Very nice !

Keep me posted if you want me to test it on a real environment (10M hits per hour): ebuildy at gmail dot com

Many thanks

original-brownbear · 2017-05-03T13:09:47Z

@jsvd @suyograo so this is what I found out/created:

I was able to greatly reduce the memory use of the original Java version in my branch https://github.com/original-brownbear/logstash-filter-useragent/tree/5
- Even at 100k elements cached it doesn't show in the GC noise relative to 1k elements cached (at -Xmx500m! and without G1 String dedup)
I was not able to speed things up beyond a certain point; it's about 2.5x to 3x faster than the JRuby regex calls
- re2j is not fully compatible with java.util.regex for the regexes in the yml => not an option for maintainability reasons (imo)
- 98%+ of the runtime is java.util.regex.Matcher#find() now => not much room left for lowering CPU use here (imo)
- With a 100k cache size, the cache can easily do 500k+ lookups/s

=> I think we may be good (enough) here with the above. Realistically speaking, I feel like we could simply advise users to set cache size ~= number_of_daily_uniques, and this thing won't contribute much to the overall pipeline run-time, right? (A 100k cache didn't really show up in the GC noise; if you have orders of magnitude more uniques daily, you probably also have enough RAM to crank up the setting.)

jsvd · 2017-05-03T14:05:27Z

This is great, @original-brownbear! the speed up + lower cache footprint are definitely enough gains to move to creating a PR.
Pairing this with metrics on cache hits/misses may also be interesting, depending on the cost of the observer effect. Either way, this can be done in a separate PR.
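
A hit/miss metric could look roughly like the Ruby sketch below. It is only an illustration under stated assumptions: `CountingCache` and `simulate` are hypothetical names, and the cyclic access pattern is deliberately the worst case for an LRU cache. It also shows why sizing the cache near the number of unique user agents matters.

```ruby
# LRU cache with plain integer counters for hits and misses.
class CountingCache
  attr_reader :hits, :misses

  def initialize(max_size)
    @max_size = max_size
    @store = {}
    @hits = 0
    @misses = 0
  end

  def fetch(key)
    if @store.key?(key)
      @hits += 1
      @store[key] = @store.delete(key) # move to most recently used position
    else
      @misses += 1
      @store.shift if @store.size >= @max_size # evict least recently used
      @store[key] = yield(key)
    end
  end

  def hit_ratio
    total = @hits + @misses
    total.zero? ? 0.0 : @hits.to_f / total
  end
end

# Replays `uniques` distinct keys in the same order, `rounds` times.
# The block stands in for the real (expensive) parse.
def simulate(cache_size, uniques, rounds)
  cache = CountingCache.new(cache_size)
  rounds.times do
    uniques.times { |i| cache.fetch("ua-#{i}") { |key| key.upcase } }
  end
  cache
end

full  = simulate(100, 100, 10) # cache fits all uniques: only the first round misses
small = simulate(50, 100, 10)  # cache too small for a cyclic pattern: every lookup misses
puts "size 100: #{full.hits} hits, #{full.misses} misses"
puts "size 50:  #{small.hits} hits, #{small.misses} misses"
```

Real traffic is usually skewed towards a few popular user agents, so an undersized cache would do much better in practice than in this cyclic replay. The two counters are cheap, which keeps the observer effect small in a single-threaded setting; shared use across threads would need synchronization.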

suyograo · 2017-05-03T14:22:24Z

@original-brownbear nice work! Bummer about re2j -- and incompatible regex matches are a no go. I'm assuming there is a good amount of user data indexed out there resulting from this filter.

Given your analysis, aggressive caching + UAP in java seems like a good step, so +1 to turn this into a PR.

original-brownbear · 2017-05-03T14:45:10Z

@suyograo @jsvd PR incoming then, moving the build to Gradle and cleaning up the packaging a little is all that's left I think :)

* Speedups in UAP-Java code
* Output format adjustments to UAP-Java code
* Refactored Ruby code to work with UAP-Java code

Fixes logstash-plugins#5

Fixes #38

jsvd added the performance-improvements label Jun 26, 2015

suyograo added the enhancement label Jun 26, 2015

suyograo mentioned this issue Jun 26, 2015

Performance improvements for plugins elastic/logstash#3477

Closed

6 tasks

andrewvc mentioned this issue Sep 8, 2015

Add LRU cache to speedup UA lookups (3.7x speedup) #12

Merged

ebuildy mentioned this issue Mar 8, 2016

Move to Java lib bitwalker user-agent-utils #23

Closed

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017

logstash-plugins#5 java parser

1ccc303

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017

logstash-plugins#5 java parser

e113d87

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017

logstash-plugins#5 java parser

1455338

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017

logstash-plugins#5 works

3e57089

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017

logstash-plugins#5 works

9f5b841

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 2, 2017

logstash-plugins#5 works

6a33081

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 3, 2017

logstash-plugins#5 works

d832fe7

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 3, 2017

logstash-plugins#5 benchmark

3ca0e98

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017

logstash-plugins#5 all passes

a99607e

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017

logstash-plugins#5 all passes

a0d5ff1

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017

logstash-plugins#5 all passes

c6bcf8c

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 5, 2017

logstash-plugins#5 update readme

b7a63c0

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 6, 2017

logstash-plugins#5 fix license in java files

7558d3a

suyograo mentioned this issue May 6, 2017

#5 Java Version #38

Closed

5 tasks

original-brownbear added a commit to original-brownbear/logstash-filter-useragent that referenced this issue May 6, 2017

logstash-plugins#5 added uap java code

50609f7

elasticsearch-bot pushed a commit that referenced this issue May 6, 2017

#5 added uap java code

b9aa054

Fixes #38

elasticsearch-bot closed this as completed in e06ef7b May 6, 2017
