Publicsuffixes2 by kngenie · Pull Request #9 · internetarchive/heritrix3

kngenie · 2012-04-18T17:57:19Z

Reimplementation of PublicSuffixes with a radix tree.

…icant performance gain)

deleted debug println.

gojomo · 2012-04-18T22:48:15Z

A bit confused: from a quick look at PublicSuffixes2, it seems it's still building a big regex string and then a Pattern in order to do the key operations. Is that the case? I would have thought those were the main memory consumers.

Where does this get its memory savings, and what's the magnitude of the savings?

Separate comments:

- There's a similar function in the Google Guava libraries we're now including; we could move to that, though one downside is we'd be subject to their (sometimes slow) schedule of updating from the public suffixes list.
- You should add your own handle as '@author' or '@contributor' in the Javadoc.

kngenie · 2012-04-19T18:17:50Z

Sorry, the pull request description is probably misleading, and the Javadoc comment in PublicSuffixes may be too short.

Yes, it still uses a regular expression, as the old PublicSuffixes did. It was the fastest path to addressing the problem I found (described in https://webarchive.jira.com/browse/HER-1965).

I added a comment to HER-1965 comparing the regular expressions generated by the old and new PublicSuffixes. In short, the old regular expression has 14,197 (?: ) groups and the new one has 1,386. This results in a ~90% smaller Matcher object and apparently faster matching (not a rigorous benchmark, but I saw a ~4x improvement). Pattern generation presumably also takes less time and memory, but such a one-time saving is not a big deal.
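As a rough illustration of how a radix-tree layout can reduce the number of (?: ) groups, here is a minimal sketch. It is not the code from this pull request: the class `PrefixFactoredRegex`, its methods, and the sample suffixes are all hypothetical, and it assumes a character-level trie over plain strings (no wildcard or exception rules). With `compress` off, every trie node that has children gets its own group; with `compress` on, chains of single-child nodes are merged into one literal run, as a radix tree would store them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical sketch: all names here are illustrative, not from Heritrix.
class PrefixFactoredRegex {
    /** Trie node: children keyed by next character, plus an end-of-word flag. */
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    /** Builds a character-level trie from the given words. */
    static Node buildTrie(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
        return root;
    }

    /** Backslash-escapes regex metacharacters; other characters pass through. */
    static String escape(char c) {
        return "\\.[]{}()*+?^$|".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    /**
     * Emits a regex fragment matching exactly the word tails stored below node.
     * When compress is true, a non-terminal node with a single child emits no
     * group of its own, so single-child chains collapse into a literal run.
     */
    static String toRegex(Node node, boolean compress) {
        if (node.children.isEmpty()) {
            return "";
        }
        StringJoiner alts = new StringJoiner("|");
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            alts.add(escape(e.getKey()) + toRegex(e.getValue(), compress));
        }
        boolean singleChain = node.children.size() == 1 && !node.terminal;
        if (compress && singleChain) {
            return alts.toString();
        }
        // A terminal node may stop here, so everything below it is optional.
        return "(?:" + alts + ")" + (node.terminal ? "?" : "");
    }

    /** Counts occurrences of the non-capturing group opener "(?:". */
    static int countGroups(String regex) {
        int count = 0;
        for (int i = regex.indexOf("(?:"); i >= 0; i = regex.indexOf("(?:", i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("co.uk", "co.jp", "com");
        Node root = buildTrie(words);
        String perNode = toRegex(root, false);
        String compressed = toRegex(root, true);
        System.out.println(perNode + "  groups=" + countGroups(perNode));
        System.out.println(compressed + "  groups=" + countGroups(compressed));
        System.out.println(Pattern.matches(compressed, "co.uk"));
    }
}
```

Both patterns accept the same strings; only the nesting differs. Whether the old PublicSuffixes regex was structured like the uncompressed variant is an assumption made for illustration.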

It may be possible to implement an even more efficient PublicSuffixes by leveraging this radix tree approach, but I'm wondering how much effort it would take to beat Java's (supposedly) well-optimized regular expression implementation.
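For context, a regex-free lookup along those lines might look like the sketch below. This is only an assumption about what such an approach could be, not anything proposed in this pull request: the class `SuffixLabelTrie` and its methods are hypothetical, the trie is keyed by whole domain labels in right-to-left order, and wildcard (`*.`) and exception (`!`) rules from the public suffix list are deliberately ignored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not from Heritrix.
class SuffixLabelTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean isSuffix;
    }

    private final Node root = new Node();

    /** Stores a suffix such as "co.uk", walking its labels right to left. */
    void add(String suffix) {
        String[] labels = suffix.split("\\.");
        Node n = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            n = n.children.computeIfAbsent(labels[i], k -> new Node());
        }
        n.isSuffix = true;
    }

    /**
     * Returns the longest stored suffix that the host ends with on a label
     * boundary, or null if no stored suffix applies.
     */
    String longestSuffix(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        Node n = root;
        int bestStart = -1;
        for (int i = labels.length - 1; i >= 0; i--) {
            n = n.children.get(labels[i]);
            if (n == null) {
                break; // no stored suffix continues with this label
            }
            if (n.isSuffix) {
                bestStart = i; // remember the longest match so far
            }
        }
        if (bestStart < 0) {
            return null;
        }
        return String.join(".", Arrays.copyOfRange(labels, bestStart, labels.length));
    }

    public static void main(String[] args) {
        SuffixLabelTrie trie = new SuffixLabelTrie();
        trie.add("uk");
        trie.add("co.uk");
        trie.add("com");
        System.out.println(trie.longestSuffix("www.example.co.uk"));
        System.out.println(trie.longestSuffix("example.org"));
    }
}
```

Each lookup costs one map access per label of the host, independent of how many suffixes are stored. Whether that would beat the compiled Pattern in practice would need measuring.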

As for using the Google Guava library, we recently found a case against it: https://webarchive.jira.com/browse/HER-2004

The new PublicSuffixes has my name at the bottom of the class-level Javadoc comment. Should it be in a different format ("handle"?)

Publicsuffixes2

kngenie added 3 commits April 18, 2012 10:14

trial rewrite of PublicSuffixes for small memory footprint (no significant performance gain)

b2d62a8

PublicSuffixes2Test still had tests for PublicSuffix. fixed.

a0d52e6

renamed PublicSuffixes2 to PublicSuffixes.

27d621f

deleted debug println.

nlevitt added a commit that referenced this pull request Apr 20, 2012

Merge pull request #9 from kngenie/publicsuffixes2

3fed078

Publicsuffixes2

nlevitt merged commit 3fed078 into internetarchive:master Apr 20, 2012

nlevitt merged 3 commits into internetarchive:master from kngenie:publicsuffixes2