
Response.buffer gives misinterpreted text instead of raw network buffer #6478

Open
Deathspike opened this issue Oct 7, 2020 · 9 comments

@Deathspike

Steps to reproduce

Tell us about your environment:

  • Puppeteer version: 5.3.1
  • Platform / OS version: Windows 10 (Build 2004) 64-Bit
  • URLs (if applicable):
  • Node.js version: 12.19.0

What steps will reproduce the problem?

  1. Open a page that does not declare its text encoding.
  2. Retrieve response.buffer().
  3. Convert the buffer to a string using the correct text encoding.
  4. Observe that the browser did not provide the raw buffer, but rather its (wrong) interpretation of it.

Code tells more than words:

const http = require('http');
const puppeteer = require('puppeteer');

const server = http.createServer((_, res) => {
  const buffer = Buffer.from('Flächendeckender Blitzzauber wird ausgelöst.', 'utf8');
  res.write(buffer);
  res.end();
  server.close();
}).listen(4567);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const response = await page.goto('http://localhost:4567');

  // Expected: Text is interpreted as the default encoding, not UTF-8.
  const text = await response.text();
  console.log(text); // -> FlÃ¤chendeckender Blitzzauber wird ausgelÃ¶st.

  // Unexpected: I should receive the buffer without text interpretation.
  const buffer = await response.buffer();
  const bufferText = buffer.toString('utf8');
  console.log(bufferText); // -> FlÃ¤chendeckender Blitzzauber wird ausgelÃ¶st.

  await browser.close();
})();

What is the expected result?

I expect response.buffer() to yield the raw network buffer, which I can then decode as UTF-8 myself.

What happens instead?

I receive a buffer containing the misinterpreted text.

@dpmott

dpmott commented Mar 11, 2021

It looks like the response is decoded in iso-8859-1, which I suspect is either the default encoding specified by your http server, or is the implicit default when no response encoding is specified.

Take a look at http://string-functions.com/encodedecode.aspx and enter your text. The page defaults to an input encoding of utf-8, and an output decoding of iso-8859-1, and it yields the same results as you're seeing.

See if you can explicitly return a response encoding header in your example.

@Deathspike
Author

> It looks like the response is decoded in iso-8859-1, which I suspect is either the default encoding specified by your http server, or is the implicit default when no response encoding is specified.

That's probably correct. The browser uses the default content encoding to interpret the text.

> See if you can explicitly return a response encoding header in your example.

Yes, that's a workaround, but it's not the point of this issue. The problem is that Puppeteer provides a buffer() function that does not return the network buffer. Instead, it takes the wrongly interpreted text and creates a buffer from that, which is incorrect: it should yield the raw network buffer, i.e. the server response before the browser's text interpretation touched it. As it stands, it can be impossible to get the correct bytes of a binary response, because the browser's text interpretation may mangle them.
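The decode/re-encode round trip described above can be demonstrated in plain Node, without a browser ('ä' stands in for any non-ASCII character):

```javascript
// UTF-8 encodes 'ä' as the two bytes c3 a4.
const original = Buffer.from('ä', 'utf8');

// A browser with no charset information decodes those bytes as ISO-8859-1,
// turning the two bytes into the two characters 'Ã' (0xc3) and '¤' (0xa4).
const misread = original.toString('latin1');
console.log(misread); // Ã¤

// Re-encoding that misread text as UTF-8 — which is effectively what
// response.buffer() hands back — yields four bytes instead of the original two.
const reencoded = Buffer.from(misread, 'utf8');
console.log(reencoded); // <Buffer c3 83 c2 a4>
```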

@jribbens

@Deathspike is correct. See this alternative example:

const puppeteer = require('puppeteer')

async function test () {
  const browser = await puppeteer.launch()
  const context = await browser.createIncognitoBrowserContext()
  const page = await context.newPage()
  const response = await page.goto('https://unequivocal.eu/latin1.html')
  const body = await response.buffer()
  console.log(body.slice(body.indexOf('<body>') + 6))
  await page.close()
  await context.close()
  await browser.close()
}

test().catch(err => console.log(err))

latin1.html contains the following, encoded in iso-8859-1:

<!DOCTYPE html>
<html lang=en><meta charset="iso-8859-1"><title>Latin-1</title>
<body>©2021</body>

The browser correctly interprets this as a copyright symbol, so it clearly understands the character encoding. However, the above script outputs <Buffer c2 a9 32 30 32 31 3c 2f 62 6f 64 79 3e 0a>. You can see from the c2 a9 that the binary data was decoded as iso-8859-1 and then re-encoded as utf-8. So response.buffer() does not, as the documentation would imply, return the binary data the browser received; when the browser thinks the body is text, it returns that text re-encoded as utf-8.

Ideally this should be fixed (probably by adding a new function, since fixing the existing one would be backwards-incompatible), but at the very least the behaviour should be documented.
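When the browser is known to have decoded the body as ISO-8859-1, the original bytes can be recovered from what response.buffer() returns, because Latin-1 maps every byte 0x00–0xff to a code point, so the decode step loses no information. A workaround sketch (not a general fix — it breaks if the browser used any other decoding; recoverLatin1Bytes is a hypothetical helper name):

```javascript
// responseBuffer is what response.buffer() returned: the body decoded as
// ISO-8859-1 by the browser and re-encoded as UTF-8 by Puppeteer.
// Reversing the two steps recovers the raw bytes the server sent.
function recoverLatin1Bytes(responseBuffer) {
  return Buffer.from(responseBuffer.toString('utf8'), 'latin1');
}

// Example: the server sent the single ISO-8859-1 byte a9 ('©'),
// which response.buffer() hands back as the UTF-8 pair c2 a9.
const fromPuppeteer = Buffer.from([0xc2, 0xa9]);
console.log(recoverLatin1Bytes(fromPuppeteer)); // <Buffer a9>
```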

@stale

stale bot commented Jun 24, 2022

We're marking this issue as unconfirmed because it has not had recent activity and we weren't able to confirm it yet. It will be closed if no further activity occurs within the next 30 days.

@Zauberbutter

At first I thought that somebody was finally looking into this. But it was just the stale bot.

> and we weren't able to confirm it yet.

Well there are working examples in this thread…

@stale stale bot added the unconfirmed label Jun 24, 2022
@jribbens

Not stale

@stale stale bot removed the unconfirmed label Jun 25, 2022
@vijay-koppala

vijay-koppala commented Aug 19, 2022

We are observing something similar, with this response header:
Content-Type: text/javascript;charset=iso-8859-1

response.buffer() returns a buffer encoded in utf-8. We were under the assumption that response.buffer() would contain the raw bytes the server sent (which should have been encoded in iso-8859-1).

So under certain conditions response.buffer() does not keep the original encoding specified in the response headers, but switches to utf-8 instead. The documentation is not clear on this.

These might be related:
https://bugs.chromium.org/p/chromium/issues/detail?id=771825
https://bugs.chromium.org/p/chromium/issues/detail?id=1311395

@vijay-koppala

vijay-koppala commented Aug 19, 2022

Based on the discussion in the two links above, using Fetch.getResponseBody might help to get the raw buffer.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await client.send('Fetch.enable', {
    patterns: [{ requestStage: 'Response' }]
  });
  client.on('Fetch.requestPaused', async event => {
    const { requestId } = event;
    const responseCdp = await client.send('Fetch.getResponseBody', { requestId });
    // Check base64Encoded to see whether the body is base64-encoded or plain text.
    const buff = responseCdp.base64Encoded
      ? Buffer.from(responseCdp.body, 'base64')
      : Buffer.from(responseCdp.body, 'utf8');
    // The buffer contains the raw iso-8859-1 bytes (NOT utf-8 bytes).
    console.log(`Response body is ${buff.length} bytes`);
    await client.send('Fetch.continueRequest', { requestId });
  });
  await page.goto('{some-url-that-can-return-for-example-iso-8859-1-response}');
  await browser.close();
})();

@OrKoN
Copy link
Collaborator

OrKoN commented Mar 23, 2024

So Puppeteer interprets the data as utf-8 if the browser returns the body as a string with base64Encoded=false. In the original example, the data is already wrongly encoded by the time Puppeteer receives it. In general, we rely on the browser to send the data correctly, and according to https://bugs.chromium.org/p/chromium/issues/detail?id=771825 it won't be fixed, since the original body is not kept around.
