Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to Google Guava for public suffix API #4

Open
anjackson opened this issue Nov 8, 2013 · 6 comments
Open

Switch to Google Guava for public suffix API #4

anjackson opened this issue Nov 8, 2013 · 6 comments
Assignees
Milestone

Comments

@anjackson
Copy link
Member

While porting for #1, this happened:

One issue I noticed was that the archive-access code brings in entire heritrix-commons just for one class, which appears to be quite general purpose:

import org.archive.net.PublicSuffixes;

(indeed, there is a Google Guava class that does pretty much the same thing). This seems a little over the top, so I copied the PublicSuffixes to iipc-web-commons under the org.archive.url package, along with the corresponding unit tests and effective_tld data file.

This is rather clumsy, and given this is provided by Google Guava, there seems little point maintaining our own code (assuming theirs is kept up to date). The task is then to check that the Google one is well maintained and switch over to that instead of copying in code from elsewhere.

nlevitt pushed a commit to nlevitt/webarchive-commons that referenced this issue Jan 17, 2014
changes needed for move to httpcomponents
@nlevitt
Copy link
Member

nlevitt commented Sep 27, 2014

I'm not seeing what #12 has to do with this?

@kris-sigur
Copy link
Member

My mistake

@johnerikhalse
Copy link
Contributor

I had a look at Guava. I assume that it is com.google.common.net.InternetDomainName @anjackson has in mind. As far as I can see this class is not a real replacement for org.archive.net.PublicSuffixes. The latter is looking up real world suffixes from https://publicsuffix.org/ while the former is just evaluating patterns and have no knowledge if the domain names are real.

@anjackson
Copy link
Member Author

Ah, so, I was going on the basis of this documentation that indicated that the Guava classes also just used the publicsuffix.org list. Maybe that documentation is out of date?

@johnerikhalse
Copy link
Contributor

My fault. You are right I didn't read the code well enough.

Regarding which implementation is best maintained, both implementations uses a local copy of the list from publicsuffix.org (PSL). Webarchive-commons' list was last updated sometime late in 2013, while Guava master branch was updated August 20, 2014. In both cases freshness is dependent on the release frequency of the library and that we always depend on the latest version.

For freshness, I think moving to Guava is good. But if we are using other part of Guava and that part gets API-changes, then we must update our code just to get the updated PSL. This is probably not a big problem assuming Guava is concerned about backward compatibility.

@johnerikhalse
Copy link
Contributor

I've found that Heritrix is already using com.google.common.net.InternetDomainName. I think we should do the move as well.

@johnerikhalse johnerikhalse added this to the 1.2.0 Release milestone Sep 30, 2014
@johnerikhalse johnerikhalse self-assigned this Nov 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants