Response.buffer gives misinterpreted text instead of raw network buffer #6478
Comments
It looks like the response is decoded in Take a look at http://string-functions.com/encodedecode.aspx and enter your text. The page defaults to an input encoding of See if you can explicitly return a response encoding header in your example.
That's probably correct. The browser uses the default content encoding to interpret the text.
Yes, that's a solution. But that's not the point of this issue. The problem is that Puppeteer provides a
@Deathspike is correct. See this alternative example:
const puppeteer = require('puppeteer')
async function test () {
const browser = await puppeteer.launch()
const context = await browser.createIncognitoBrowserContext()
const page = await context.newPage()
const response = await page.goto('https://unequivocal.eu/latin1.html')
const body = await response.buffer()
console.log(body.slice(body.indexOf('<body>') + 6))
await page.close()
await context.close()
await browser.close()
}
test().catch(err => console.log(err))
<!DOCTYPE html>
<html lang=en><meta charset="iso-8859-1"><title>Latin-1</title>
<body>©2021</body>
The browser correctly interprets this as a copyright symbol, so it is clearly understanding the character encoding. However the above script outputs
Ideally this should be fixed (I guess by adding a new function, given that fixing the existing one would be backwards-incompatible), but at the very least this should be documented.
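To make the mismatch concrete without launching a browser, here is a minimal sketch. It assumes that the served Latin-1 bytes are decoded to a string by the browser and then re-encoded as UTF-8 before they reach `response.buffer()`. The helper name `reencodeAsLatin1` is hypothetical and not part of Puppeteer.

```javascript
// Bytes as served: "©2021" in ISO-8859-1, where © is the single byte 0xA9.
const served = Buffer.from([0xa9, 0x32, 0x30, 0x32, 0x31])

// Assumed behaviour: the browser decodes the bytes into a string, and that
// string is re-encoded as UTF-8 on its way to the script.
const decoded = served.toString('latin1')
const received = Buffer.from(decoded, 'utf8')

// Hypothetical workaround: undo the UTF-8 step and write the characters
// back out as single Latin-1 bytes.
function reencodeAsLatin1 (buf) {
  return Buffer.from(buf.toString('utf8'), 'latin1')
}

console.log(served)   // what the server sent
console.log(received) // what the script would see under the assumption above
console.log(reencodeAsLatin1(received).equals(served))
```

The round trip only holds when every character fits in Latin-1. Browsers generally treat the `iso-8859-1` label as windows-1252, so bytes in the 0x80 to 0x9F range would not survive it, and this is not a substitute for access to the raw network bytes.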
We're marking this issue as unconfirmed because it has not had recent activity and we weren't able to confirm it yet. It will be closed if no further activity occurs within the next 30 days.
At first I thought that finally somebody is looking into this. But then it just was the stale bot.
Well, there are working examples in this thread…
Not stale
We are observing something similar: response.buffer returns a buffer encoded in "utf-8". We were under the assumption that response.buffer would return the raw bytes that the server sent (which should have been encoded in "iso-8859-1"). So under certain conditions response.buffer does not keep the encoding specified in the response headers, but changes it to utf-8 instead. The documentation is not clear on this. These might be related
Based on the discussion in the above two links using Fetch.getResponseBody might help to get the raw buffer.
So Puppeteer interprets the data as utf-8 if the browser returns the body as a string with base64Encoded=false. In the original example, the data is already encoded wrongly when Puppeteer receives it. In general, we rely on the browser to send the data correctly, and according to https://bugs.chromium.org/p/chromium/issues/detail?id=771825 it won't be fixed since the original body is not kept around.
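The conversion described above can be sketched roughly as follows. This is an illustration of the described behaviour, not Puppeteer's actual source. The function name `bodyToBuffer` is hypothetical, and the `{ body, base64Encoded }` shape is assumed to match what the DevTools Protocol returns for a response body.

```javascript
// Hypothetical helper mirroring the behaviour described in this thread:
// base64 bodies are decoded to their original bytes, while string bodies
// are encoded as UTF-8 regardless of the page's declared charset.
function bodyToBuffer (payload) {
  return Buffer.from(payload.body, payload.base64Encoded ? 'base64' : 'utf8')
}

// Binary case: the byte 0xA9 travels base64-encoded and comes back intact.
const binary = bodyToBuffer({
  body: Buffer.from([0xa9]).toString('base64'),
  base64Encoded: true
})

// Text case: the browser has already decoded 0xA9 to the character "©",
// so the resulting buffer holds its two-byte UTF-8 form instead.
const text = bodyToBuffer({ body: '©', base64Encoded: false })

console.log(binary)
console.log(text)
```

In the text case the original byte is already gone by the time the string arrives, which is consistent with the explanation that the problem cannot be fixed on the Puppeteer side alone.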
Steps to reproduce
Tell us about your environment:
What steps will reproduce the problem?
response.buffer()
.buffer
tostring
specifying the correct text encoding.
Code tells more than words:
What is the expected result?
I expect response.buffer() to yield the raw buffer which I can use to interpret the text as utf-8 myself.
What happens instead?
I receive a buffer containing the misinterpreted text.