Jsoup can't connect to URL with Unicode path #1914

Closed
johannroux opened this issue Mar 9, 2023 · 3 comments

@johannroux

Hello,

With Jsoup 1.15.4 (the latest release as of this date), Jsoup appears unable to fetch pages whose URL path contains non-ASCII (Unicode) characters.

Description of the problem

val url = "https://example.com/unicode/שלום"

val doc: Document = Jsoup.connect(url)
    .followRedirects(true)
    .get()

throws:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=400, URL=[https://example.com/unicode/שלום]

Expected behaviour

Jsoup should percent-encode the path itself and connect without error.

Already-tested ideas

I tried the following:

Encode the whole URL with URI.toASCIIString()

val url = "https://example.com/unicode/שלום"
val encodedUrl = URI(url).toASCIIString()

val doc: Document = Jsoup.connect(encodedUrl)
    .followRedirects(true)
    .get()

throws:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=[https://example.com/unicode/%25D7%25A9%25D7%259C%25D7%2595%25D7%259D]

Encode the whole URL with URLEncoder

val url = "https://example.com/unicode/שלום"
val encodedUrl = URLEncoder.encode(url, StandardCharsets.UTF_8.toString())

val doc: Document = Jsoup.connect(encodedUrl)
    .followRedirects(true)
    .get()

throws:

java.lang.IllegalArgumentException: The supplied URL, 'https%3A%2F%2Fexample.com%2Funicode%2F%D7%A9%D7%9C%D7%95%D7%9D', is malformed. Make sure it is an absolute URL, and starts with 'http://' or 'https://'. See https://jsoup.org/cookbook/extracting-data/working-with-urls

Encode just the path with URLEncoder

val url = "https://example.com/unicode/" + URLEncoder.encode("שלום", StandardCharsets.UTF_8.toString()) // yields https://example.com/unicode/%D7%A9%D7%9C%D7%95%D7%9D

val doc: Document = Jsoup.connect(url)
    .followRedirects(true)
    .get()

throws:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=[https://example.com/unicode/%25D7%25A9%25D7%259C%25D7%2595%25D7%259D]

It seems that Jsoup is doing a double encoding on its end, but I might be wrong.
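
To illustrate the suspicion, here is a minimal sketch using only the JDK (no Jsoup involved). The helper names `encodeSegment` and `encodeTwice` are hypothetical and exist only for this illustration. It shows that percent-encoding an already-encoded segment a second time turns each `%` into `%25`, which produces the same shape of URL as in the 404 errors above. It does not show where in Jsoup the second encoding happens.

```kotlin
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper: percent-encodes one path segment as UTF-8.
// URLEncoder implements form encoding (a space would become '+'),
// which is acceptable here only because the sample input has no spaces.
fun encodeSegment(segment: String): String =
    URLEncoder.encode(segment, StandardCharsets.UTF_8.toString())

// Hypothetical helper: encodes the result again, mimicking the suspected
// double encoding. Every '%' from the first pass becomes "%25".
fun encodeTwice(segment: String): String =
    encodeSegment(encodeSegment(segment))
```

Calling `encodeSegment("שלום")` gives the singly encoded form, and `encodeTwice("שלום")` gives the `%25D7...` form seen in the error messages.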

Thank you in advance for your help!

@jhy jhy self-assigned this Mar 26, 2023
@jhy jhy added bug Confirmed bug that we should fix fixed labels Mar 26, 2023
@jhy jhy added this to the 1.16.1 milestone Mar 26, 2023
@jhy jhy closed this as completed in 6e71f35 Mar 26, 2023
@jhy
Owner

jhy commented Mar 26, 2023

Thanks, fixed! The first form now works correctly by normalizing the input path.

The double escaping issue was fixed in #1902.

Specifically, the first form emits:

HTTP error fetching URL. Status=404, URL=[https://example.com/unicode/%D7%A9%D7%9C%D7%95%D7%9D]

That is the correctly encoded URL, and 404 is the status that example.com actually returns for it.

jhy added a commit that referenced this issue Mar 26, 2023
@jhy
Owner

jhy commented Mar 26, 2023

Also, as of 0121311, if the query string contains non-ASCII characters, we normalize it to ASCII. Any existing escapes are preserved, which is why this implementation is more complicated than just decoding the URL components, constructing a URI, and letting that encode: existing escapes would get incorrectly smooshed.
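
As a rough sketch of the idea (an assumption-laden illustration, not the code from the commit), the hypothetical function `escapeNonAscii` below percent-encodes only non-ASCII code points as UTF-8 bytes and copies every ASCII character through unchanged. Because `%` is ASCII, existing escapes such as `%20` survive intact instead of being re-encoded. It deliberately does not handle spaces or other reserved ASCII characters.

```kotlin
// Hypothetical helper, not Jsoup's implementation: escapes only non-ASCII
// code points, leaving all ASCII (including existing %XX escapes) untouched.
fun escapeNonAscii(input: String): String {
    val out = StringBuilder()
    var i = 0
    while (i < input.length) {
        val cp = input.codePointAt(i)
        if (cp < 0x80) {
            // ASCII is copied as-is, so an existing "%20" stays "%20".
            out.append(input[i])
        } else {
            // Encode the code point as UTF-8 and emit each byte as %XX.
            val bytes = String(Character.toChars(cp)).toByteArray(Charsets.UTF_8)
            for (b in bytes) {
                out.append("%%%02X".format(b.toInt() and 0xFF))
            }
        }
        // Advance by one or two chars, depending on surrogate pairs.
        i += Character.charCount(cp)
    }
    return out.toString()
}
```

For example, `escapeNonAscii("https://example.com/unicode/שלום")` yields the singly encoded URL from the fix description, and an input that already contains `%20` keeps that escape unchanged.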

@johannroux
Author

Thank you very much @jhy! When is the next release planned? 🙂
