
Response.buffer gives misinterpreted text instead of raw network buffer #6478

Open
Deathspike opened this issue Oct 7, 2020 · 9 comments

@Deathspike

Steps to reproduce

Tell us about your environment:

  • Puppeteer version: 5.3.1
  • Platform / OS version: Windows 10 (Build 2004) 64-Bit
  • URLs (if applicable):
  • Node.js version: 12.19.0

What steps will reproduce the problem?

  1. Open a page that does not declare its text encoding.
  2. Retrieve response.buffer().
  3. Convert the buffer to a string using the correct text encoding.
  4. Observe that the browser did not provide the raw buffer, but rather its (wrong) interpretation of it.

Code tells more than words:

const http = require('http');
const puppeteer = require('puppeteer');

const server = http.createServer((_, res) => {
  const buffer = Buffer.from('Flächendeckender Blitzzauber wird ausgelöst.', 'utf8');
  res.write(buffer);
  res.end();
  server.close();
}).listen(4567);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const response = await page.goto('http://localhost:4567');

  // Expected: Text is interpreted as the default encoding, not UTF-8.
  const text = await response.text();
  console.log(text); // -> FlÃ¤chendeckender Blitzzauber wird ausgelÃ¶st.

  // Unexpected: I should receive the buffer without text interpretation.
  const buffer = await response.buffer();
  const bufferText = buffer.toString('utf8');
  console.log(bufferText); // -> FlÃ¤chendeckender Blitzzauber wird ausgelÃ¶st.

  await browser.close();
})();

What is the expected result?

I expect response.buffer() to yield the raw network buffer, which I can then decode as UTF-8 myself.

What happens instead?

I receive a buffer containing the misinterpreted text.

@dpmott

dpmott commented Mar 11, 2021

It looks like the response is decoded in iso-8859-1, which I suspect is either the default encoding specified by your http server, or is the implicit default when no response encoding is specified.

Take a look at http://string-functions.com/encodedecode.aspx and enter your text. The page defaults to an input encoding of utf-8, and an output decoding of iso-8859-1, and it yields the same results as you're seeing.

See if you can explicitly return a response encoding header in your example.

@Deathspike
Author

> It looks like the response is decoded in iso-8859-1, which I suspect is either the default encoding specified by your http server, or is the implicit default when no response encoding is specified.

That's probably correct. The browser uses the default content encoding to interpret the text.

> See if you can explicitly return a response encoding header in your example.

Yes, that's a workaround, but it's not the point of this issue. The problem is that Puppeteer provides a buffer() function that does not return the network buffer. Instead, it takes the wrongly interpreted text and creates a buffer from that, which is incorrect: it should yield the raw network buffer, i.e. the server response before the browser's text interpretation touched it. As it stands, it can be impossible to get the correct bytes of a binary response, because the browser's text interpretation may mangle them.
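The decode/re-encode round trip described above can be demonstrated in plain Node, without a browser ('ä' stands in for any non-ASCII character):

```javascript
// UTF-8 encodes 'ä' as the two bytes c3 a4.
const original = Buffer.from('ä', 'utf8');

// A browser with no charset information decodes those bytes as ISO-8859-1,
// turning the two bytes into the two characters 'Ã' (0xc3) and '¤' (0xa4).
const misread = original.toString('latin1');
console.log(misread); // Ã¤

// Re-encoding that misread text as UTF-8 — which is effectively what
// response.buffer() hands back — yields four bytes instead of the original two.
const reencoded = Buffer.from(misread, 'utf8');
console.log(reencoded); // <Buffer c3 83 c2 a4>
```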

@jribbens

@Deathspike is correct. See this alternative example:

const puppeteer = require('puppeteer')

async function test () {
  const browser = await puppeteer.launch()
  const context = await browser.createIncognitoBrowserContext()
  const page = await context.newPage()
  const response = await page.goto('https://unequivocal.eu/latin1.html')
  const body = await response.buffer()
  console.log(body.slice(body.indexOf('<body>') + 6))
  await page.close()
  await context.close()
  await browser.close()
}

test().catch(err => console.log(err))

latin1.html contains the following, encoded in iso-8859-1:

<!DOCTYPE html>
<html lang=en><meta charset="iso-8859-1"><title>Latin-1</title>
<body>©2021</body>

The browser correctly interprets this as a copyright symbol, so it clearly understands the character encoding. However, the above script outputs <Buffer c2 a9 32 30 32 31 3c 2f 62 6f 64 79 3e 0a>. You can see from the c2 a9 that the binary data was decoded as iso-8859-1 and then re-encoded as utf-8. So response.buffer() does not, as the documentation would imply, return the binary data the browser received; when the browser thinks the body is text, it returns that text re-encoded as utf-8.

Ideally this should be fixed (probably by adding a new function, since fixing the existing one would be backwards-incompatible), but at the very least the behaviour should be documented.
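When the browser is known to have decoded the body as ISO-8859-1, the original bytes can be recovered from what response.buffer() returns, because Latin-1 maps every byte 0x00–0xff to a code point, so the decode step loses no information. A workaround sketch (not a general fix — it breaks if the browser used any other decoding; recoverLatin1Bytes is a hypothetical helper name):

```javascript
// responseBuffer is what response.buffer() returned: the body decoded as
// ISO-8859-1 by the browser and re-encoded as UTF-8 by Puppeteer.
// Reversing the two steps recovers the raw bytes the server sent.
function recoverLatin1Bytes(responseBuffer) {
  return Buffer.from(responseBuffer.toString('utf8'), 'latin1');
}

// Example: the server sent the single ISO-8859-1 byte a9 ('©'),
// which response.buffer() hands back as the UTF-8 pair c2 a9.
const fromPuppeteer = Buffer.from([0xc2, 0xa9]);
console.log(recoverLatin1Bytes(fromPuppeteer)); // <Buffer a9>
```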

@stale

stale bot commented Jun 24, 2022

We're marking this issue as unconfirmed because it has not had recent activity and we weren't able to confirm it yet. It will be closed if no further activity occurs within the next 30 days.

@Zauberbutter

At first I thought that somebody was finally looking into this. But it was just the stale bot.

> and we weren't able to confirm it yet.

Well there are working examples in this thread…

@stale stale bot added the unconfirmed label Jun 24, 2022
@jribbens

Not stale

@stale stale bot removed the unconfirmed label Jun 25, 2022
@vijay-koppala

vijay-koppala commented Aug 19, 2022

We are observing something similar, with this response header:
Content-Type: text/javascript;charset=iso-8859-1

response.buffer() returns a buffer encoded in utf-8. We were under the assumption that response.buffer() would contain the raw bytes the server sent (which should have been encoded in iso-8859-1).

So under certain conditions response.buffer() does not keep the original encoding specified in the response headers, but switches to utf-8 instead. The documentation is not clear on this.

These might be related:
https://bugs.chromium.org/p/chromium/issues/detail?id=771825
https://bugs.chromium.org/p/chromium/issues/detail?id=1311395

@vijay-koppala

vijay-koppala commented Aug 19, 2022

Based on the discussion in the two links above, using Fetch.getResponseBody might help to get the raw buffer.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await client.send('Fetch.enable', {
    patterns: [{ requestStage: 'Response' }]
  });
  client.on('Fetch.requestPaused', async event => {
    const { requestId } = event;
    const responseCdp = await client.send('Fetch.getResponseBody', { requestId });
    // Check base64Encoded to see whether the body is base64-encoded or plain text.
    const buff = responseCdp.base64Encoded
      ? Buffer.from(responseCdp.body, 'base64')
      : Buffer.from(responseCdp.body, 'utf8');
    // The buffer contains the raw iso-8859-1 bytes (NOT utf-8 bytes).
    console.log(`Response body is ${buff.length} bytes`);
    await client.send('Fetch.continueRequest', { requestId });
  });
  await page.goto('{some-url-that-can-return-for-example-iso-8859-1-response}');
  await browser.close();
})();

@OrKoN
Copy link
Collaborator

OrKoN commented Mar 23, 2024

So Puppeteer interprets the data as utf-8 if the browser returns the body as a string with base64Encoded=false. In the original example, the data is already wrongly encoded by the time Puppeteer receives it. In general, we rely on the browser to send the data correctly, and according to https://bugs.chromium.org/p/chromium/issues/detail?id=771825 it won't be fixed, since the original body is not kept around.
