IDN support #27

Closed
PsypherPunk opened this issue Dec 18, 2013 · 7 comments

@PsypherPunk
Contributor

commented Dec 18, 2013

S'sheet line: 10
For whom? BNF, DN
Notes: CDX/indexing consequences? Need a test case. Heritrix issues, maybe just H1, so need H1 and H3 test cases.
Est. Milestone: 2.x.x

@anjackson

Member

commented Feb 12, 2014

There are issues with internationalized domain names (IDNs) under H1, so the idea was to have a test case for this covering H1/H3 and OpenWayback, I think.

@saraaubry


commented Feb 20, 2014

It has to do with cdx-indexer in the first place. A while ago, we couldn't index these domains or pages:
http://www.àchatperché.net
http://www.é-moi.com
http://editions.bnf.fr/astérix-de-à-z

@csrster

Contributor

commented Oct 1, 2014

In Denmark, all our domains are harvested and indexed with punycode, but users also want to search using the accented domains, like øx.dk etc.

We've been experimenting with converting non-ASCII domain names in the Wayback search box to punycode client-side in JavaScript, and it seems to solve our problems. But maybe the conversion should really happen server-side as part of the canonicalization.
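For illustration, here is a minimal server-side sketch of that conversion in Java, using the standard `java.net.IDN` class. The class and method names (`SearchUrlPunycoder`, `hostToAscii`) are hypothetical and not part of OpenWayback. It assumes the URL has no userinfo part and no IPv6 literal, and it leaves the path and query untouched.

```java
import java.net.IDN;

// Hypothetical helper, not OpenWayback code: rewrites only the host part
// of a submitted URL to its ASCII (punycode) form.
final class SearchUrlPunycoder {

    static String hostToAscii(String url) {
        int schemeEnd = url.indexOf("://");
        int hostStart = schemeEnd < 0 ? 0 : schemeEnd + 3;

        // The host ends at the first port, path, query or fragment delimiter.
        int hostEnd = url.length();
        for (int i = hostStart; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == ':' || c == '/' || c == '?' || c == '#') {
                hostEnd = i;
                break;
            }
        }

        String host = url.substring(hostStart, hostEnd);
        if (host.isEmpty()) {
            return url; // nothing to convert
        }
        // IDN.toASCII leaves plain ASCII hosts unchanged.
        return url.substring(0, hostStart) + IDN.toASCII(host) + url.substring(hostEnd);
    }

    public static void main(String[] args) {
        // "\u00f8" is the letter o with stroke, as in the Danish example above.
        System.out.println(hostToAscii("http://\u00f8x.dk/foo")); // prints http://xn--x-4ga.dk/foo
    }
}
```

Doing this once on the server would mean every front-end gets the same behaviour, at the cost of the canonicalizer no longer seeing the URL exactly as the user typed it.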

@kris-sigur

Member

commented Oct 1, 2014

I think Wayback can mostly work with punycode internally (in the CDXs, for example). However, the web front-end needs to translate to and from punycode as appropriate.

@johnerikhalse

Member

commented Oct 2, 2014

I think this should be done server-side since different frontends shouldn't be implementing the same thing.

I had a look at how we could achieve this.
First I tried to put it into the canonicalizer. That works if the resource is found in the archive. Otherwise the UI gives you the option to search under the parent of the submitted URL. The problem is that the code finding the parent works from the original URL and fails for IDN hostnames.
To resolve this I also converted to punycode in FormRequestParser. Then it works, but there is one drawback: if you submit http://øx.dk/foo and that's not in the archive, the suggested parent to search is the punycoded version, http://xn--x-4ga.dk/.
To resolve this I found it easiest to add a toUnicodeHostString() method to the UsableURI class in webarchive-commons, which is then used by UrlOperations.getUrlParentDir(String url).
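The idea behind that last step can be sketched roughly as follows. This is not the actual webarchive-commons or OpenWayback code: `ParentDirSketch` and `parentForDisplay` are hypothetical names, and the sketch assumes an absolute URL with a scheme, no port, and no query string. It computes the parent directory of a punycoded URL and converts the host back to Unicode with `java.net.IDN.toUnicode` so that the suggestion shown to the user is readable.

```java
import java.net.IDN;

// Hypothetical illustration only; the real change lives in UsableURI and
// UrlOperations.getUrlParentDir and may behave differently.
final class ParentDirSketch {

    static String parentForDisplay(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("expected an absolute URL: " + url);
        }
        int hostStart = schemeEnd + 3;
        int pathStart = url.indexOf('/', hostStart);

        String host = pathStart < 0 ? url.substring(hostStart) : url.substring(hostStart, pathStart);
        String path = pathStart < 0 ? "/" : url.substring(pathStart);

        // Drop a trailing slash (except for the root) before cutting the last segment.
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String parentPath = path.substring(0, path.lastIndexOf('/') + 1);

        // Show the host in its Unicode form rather than as xn--... labels.
        return url.substring(0, hostStart) + IDN.toUnicode(host) + parentPath;
    }

    public static void main(String[] args) {
        // Yields http:// followed by the Unicode host and a trailing slash.
        System.out.println(parentForDisplay("http://xn--x-4ga.dk/foo"));
    }
}
```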

There are probably more places to modify in other configurations like Proxy mode, but that should be easy enough to do.

The catch with this solution is that I convert to the punycode version before canonicalization. I don't know if that is a problem, but it means the canonicalizer can no longer assume that it is processing the unmodified URL.

@johnerikhalse johnerikhalse added this to the 2.1.0 Release milestone Oct 16, 2014

@johnerikhalse

Member

commented Mar 20, 2015

I've made a possible solution here: https://github.com/iipc/openwayback/tree/issue27_IDN
I am not sure if every aspect of IDN support is solved or if I have broken anything. It would be nice if somebody could build and test it. Please feel free to make any enhancements you like to it.

@kris-sigur

Member

commented May 13, 2015

PR #246 merged. Closing issue.

@kris-sigur kris-sigur closed this May 13, 2015
