Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Headers are unnecessary encoded to ByteString if there is no Transfer-Encoding: chunked header #42579

Closed
AlttiRi opened this issue Apr 2, 2022 · 7 comments
Labels
http Issues or PRs related to the http subsystem.

Comments

@AlttiRi
Copy link

AlttiRi commented Apr 2, 2022

Version

v17.5.0

Platform

Windows 10

Subsystem

http

What steps will reproduce the bug?

Run this http server:

import http from "http";

const name = `Rock & roll 音楽 («🎵🎶»).txt`; // Also is used as the response body

const defaultCdHeader = getContentDisposition(name);
// console.log(defaultCdHeader === Buffer.from(`inline; filename="${name}"`).toString("binary")); // true

const host = "localhost";
const port = 8000;
const origin = `http://${host}:${port}`;
const server = http.createServer(requestListener);
server.listen(port, host, () => {
    console.log("Server is running on:     " + origin + "       (open the web page)");
    console.log("Downloads the file:       " + origin + "/?dl=1");
    console.log("Remove Transfer-Encoding: " + origin + "/?te=0" + " (open the web page)");
    console.log("Downloads w/o T-E header: " + origin + "/?dl=1&te=0");
});

function requestListener(req, res) {
    let cdHeader = defaultCdHeader;
    const {dl, te} = Object.fromEntries(new URL(origin + req.url).searchParams.entries());
    if (dl === "1") {
        cdHeader = getContentDisposition(name, {type: "attachment"});
    }

    res.setHeader("Content-Disposition", cdHeader);
    // The header must be a `ByteString`. For example, I can't do that:
    // res.setHeader("Header-X", name); // TypeError [ERR_INVALID_CHAR]: Invalid character in header content ["Header-X"]

    if (te === "0") { // Note: "Transfer-Encoding: chunked" is set by default.
        res.removeHeader("Transfer-Encoding"); // Any other TE header's value has the same effect as the header removing
        const byteCount = new TextEncoder().encode(name).length;
        res.setHeader("Content-Length", byteCount);
    }
    res.setHeader("Content-Type", "text/html; charset=utf-8");
    res.writeHead(200);
    res.end(name);
}

// Note: In case of this issue you can ignore the follow function code. It just returns C-D header as `ByteString`.
/**
 * Simple implementation for getting "Content-Disposition" header from filename
 * @example
 * // By default, it produces the same result as it (with replacing all double-quotes by underscore):
 * Buffer.from(`inline; filename="${name}"`).toString("binary")
 *
 * @param {string} name
 * @param {Object} opts
 * @param {"inline"|"attachment"} [opts.type="inline"]
 * @param {Boolean} [opts.encoded=false]
 * @param {Boolean} [opts.filename=true]
 * @param {String} [opts.replacer="_"]
 * @return {string}
 */
function getContentDisposition(name, opts = {}) {
    const {
        type, encoded, filename, replacer
    } = Object.assign({type: "inline", encoded: false, filename: true, replacer: "_"}, opts);
    const fixName = (name) => name.replaceAll(`"`, replacer); // The most trivial fix, since it uses the quoted filename. Prevents from the incorrect header parsing. For, example: `";"` (3 chars in ` quotes).
    const encodeMap = new Map([["'", "%27"], ["(", "%28"], [")", "%29"], ["*", "%30"]]); // Required to escape it for old browsers (For example, a Chromium of 2013)
    const getEncodedName = (name) => encodeURIComponent(name).replaceAll(/['()*]/g, (val) => encodeMap.get(val));
    const encodedStr = encoded ? `filename*=UTF-8''${getEncodedName(name)}` : "";
    const filenameStr = filename ? `filename="${fixName(name)}"` : "";
    const header = [type, filenameStr, encodedStr].filter(x => x).join("; ");
    return Buffer.from(header).toString("binary");
}
  1. Click on the link http://localhost:8000/?dl=1 to download a file with Rock & roll 音楽 («🎵🎶»).txt name.
  2. Click on the link http://localhost:8000/?dl=1&te=0 to download a file with Rock & roll 音楽 («🎵🎶»).txt name (in this case "Transfer-Encoding" header will be removed)

How often does it reproduce? Is there a required condition?

Always.

What is the expected behavior?

Both files are downloaded with Rock & roll 音楽 («🎵🎶»).txt name.

What do you see instead?

The first file has the correct name — Rock & roll 音楽 («🎵🎶»).txt.
The second one has wrong name — Rock & roll é_³æ¥½ («ð__µð__¶Â»).txt

Additional information

TL;DR

If there is "Transfer-Encoding: chunked" header (exactly chunked) setHeader works properly, it sets the input header (ByteString) as is.
(Note: "Transfer-Encoding: chunked" is set by default.)

In any other case it additionally (unnecessary) encodes the header to ByteString.
So, the header is encoded twice, that is wrong.

Additional info

The most of HTTP headers are contains only ASCII characters. But when you need to put in a header (For example, "Content-Disposition", or any custom header) a string that contains non-ASCII* character(s), you can't just put it in as issetHeader.
For example:

// TypeError [ERR_INVALID_CHAR]: Invalid character in header content ["Header-X"]
res.setHeader("Content-Disposition", `attachment; filename="Rock & roll 音楽 («🎵🎶»).txt"`);

A HTTP header is a Binary String (ByteString) — UTF-8 bytes within String object.*

There is no problem with the headers which contain only ASCII characters, since ASCII charset is subset of UTF-8 and Latin 1 encodings, so toByteString(ASCIIString) === ASCIIString.

To get a ByteString from USVString you just need to take UTF-8 bytes from an input string then represent them in Latin 1 (ISO-8859-1) encoding.

For example, in Node.js:

function toByteString(string) {
    return Buffer.from(string).toString("binary"); // or "latin1"
}

*To be honest, the entire quote of [ByteString](https://webidl.spec.whatwg.org/#idl-ByteString:

Such sequences might be interpreted as UTF-8 encoded strings [RFC3629] or strings in some other 8-bit-per-code-unit encoding, although this is not required.

As I can see, a browser also can detect if the string is "just" 8859-1, not UTF-8 bytes encoded in 8859-1.

So, both "Content-Disposition" headers are valid "valid"**:

res.setHeader("Content-Disposition", `attachment; filename="¡«£»÷ÿ.png"`); // Correct ONLY for the some certain browser's languages
res.setHeader("Content-Disposition", `attachment; filename="${toByteString("¡«£»÷ÿ.png")}"`); // Always is correct

The result in both "both"** cases is a file with "¡«£»÷ÿ.png"** name, even while "¡«£»÷ÿ.png" !== toByteString("¡«£»÷ÿ.png").

UPDATE:
**Using non-UTF-8 bytes ("some other 8-bit-per-code-unit encoding") in ByteString is browser/OS language dependent!

For example, in Firefox with non-EN language using of "¡«£»÷ÿ.png" as is (without toByteString()) results to ������.png filename, instead of ¡«£»÷ÿ.png
In Chrome it will be Ў«Ј»чя.png for Cyrillic.

So, I think it (using of 8859-1 in "usual way") should be highly unrecommended.
Headers should always be a ByteString with only UTF-8 bytes represented as 8859-1 (Latin 1).

Problem

The problem is that I can't correctly set a header that is a ByteString (UTF-8 bytes in Latin 1) if the original string contains non-ASCII characters.
Like the other servers do it.

The problem appears only when the Transfer-Encoding: chunked header (which is present by default) is removed (or changed).
In this case setHeader encodes the header to Binary String.

That is unnecessary, since it's already a ByteString.

It's not possible to put in setHeader a USVString, since in this case it will throw TypeError [ERR_INVALID_CHAR]: Invalid character in header content error.

So, the header is encoded to "binary" twice, and browsers download the file with the wrong filenames:
Rock & roll é_³æ¥½ («ð__µð__¶Â»).txt instead of Rock & roll 音楽 («🎵🎶»).txt.

You can open the demo server with disabled Transfer-Encoding: chunked header (http://localhost:8000/?te=0 ) and check it:

function bSrt2Str(bString) {
    return new TextDecoder().decode(binaryStringToArrayBuffer(bString));
}
function binaryStringToArrayBuffer(binaryString) {
    const u8Array = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        u8Array[i] = binaryString.charCodeAt(i);
    }
    return u8Array;
}
let cd = (await fetch("")).headers.get("Content-Disposition");
console.log(cd);
console.log(bSrt2Str(cd));
console.log(bSrt2Str(bSrt2Str(cd)));

image

The header is encoded twice!


Examples

A lot of forums encodes headers such way for the attached files (XenForo, vBulletin, for example).

The real life examples:

Oh, wait, it requires an account, if you don't have/(want to create an account), just use my demo server.
Anyway, just look at the screenshots below.

In the browser console you can verify that header are ByteString:

function bSrt2Str(bString) {
    return new TextDecoder().decode(binaryStringToArrayBuffer(bString));
}
function binaryStringToArrayBuffer(binaryString) {
    const u8Array = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        u8Array[i] = binaryString.charCodeAt(i);
    }
    return u8Array;
}
let cd = (await fetch("")).headers.get("Content-Disposition");
console.log(cd);
console.log(bSrt2Str(cd));

image

image


As a bonus, here is an example of Java server made with ServerSocket which also works properly:

Main.java
package com.company;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        try (ServerSocket serverSocket = new ServerSocket(8000)) {
            System.out.println("Sever run on http://127.0.0.1:8000");
            while (true) {
                Socket socket = serverSocket.accept();
                try (BufferedReader input = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter output = new PrintWriter(socket.getOutputStream())) {
                    while (input.ready()) { // Print headers
                        System.out.println(input.readLine());
                    }
                    String name = "Rock & roll 音楽 («\uD83C\uDFB5\uD83C\uDFB6»).txt";
                    System.out.println(name);

                    output.println("HTTP/1.1 200 OK");
                    output.println("Content-Type: text/html; charset=utf-8");
                    output.println("Content-Disposition: attachment; filename=\"" + name + "\"");
                    output.println();
                    output.println(name);
                    output.flush();
                }
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
@VoltrexKeyva VoltrexKeyva added the http Issues or PRs related to the http subsystem. label Apr 2, 2022
@benjamingr
Copy link
Member

@nodejs/http

@ShogunPanda
Copy link
Contributor

ShogunPanda commented Apr 6, 2022

Hello.
I have triaged this issue.

You are correct in saying the that Transfer-Encoding header make node behave differently: when TE is present, the headers are sent with the first chunk and the chunk encoding is binary.

When TE is not present, the headers are sent with the data but there is no encoding specified and Node defaults to UTF-8.
Your file name, which you binary encoded to latin1, is then re-interpreted as utf-8 and that's why the double encoding happened.

However, the problem is not in how Node processes the headers but in the value it self. No spec mandates (at least for what I could find) that non UTF-8 should be converted as latin1/binary and then the client should/might reinterpret them. My suggestion is to encode them using RFC8187 (which you already have in your code when encoded=true but somehow you don't do by default).

I have filed a PR that explicit this in the docs.
Thanks for your report!

@xtx1130
Copy link
Contributor

xtx1130 commented Apr 6, 2022

For RFC8187:

Producers MUST use the "UTF-8" ([RFC3629]) character encoding.

@ShogunPanda we should use encodeURIComponent & decodeURIComponent to deal with these Special characters, right?

@AlttiRi
Copy link
Author

AlttiRi commented Apr 6, 2022

I see that PHP-based servers work this way — they encode headers (it fact, it's only needed for non-ASCII characters) as UTF-8 bytes within Latin1.

It's not something new.
It works fine with a 2013 browser (The older browsers that I found.)

Why I can't do the same with node.js?
To produce the same header, as it do PHP forum servers, like I do in my Java example above.


encoded=true is just a workaround.

However, it should work even without this.
And it will work if Node.js will not encode the headers itself if TE is missed.
It already expects Latin1 as a input string. A valid header value. But it additionally encodes it.

@AlttiRi
Copy link
Author

AlttiRi commented Apr 6, 2022

When TE is not present, the headers are sent with the data but there is no encoding specified and Node defaults to UTF-8.

No spec mandates (at least for what I could find) that non UTF-8 should be converted as latin1/binary and then the client should/might reinterpret them.

So, it feels like a bug.
It encodes headers to UTF-8, while it's not needed.

Clients expect headers as is — as Latin1/binary (which may contain either UTF-8 bytes, or some 8-bit encoding), but Node.js (w/o TE header) encodes them in UTF-8.

Is it not a bug?

@ShogunPanda
Copy link
Contributor

For RFC8187:

Producers MUST use the "UTF-8" ([RFC3629]) character encoding.

@ShogunPanda we should use encodeURIComponent & decodeURIComponent to deal with these Special characters, right?

Yes, encodeURIComponent on server->client, and in theory decodeURIComponent when receiving a client request.

@ShogunPanda
Copy link
Contributor

ShogunPanda commented Apr 6, 2022

So, it feels like a bug. It encodes headers to UTF-8, while it's not needed.

Clients expect headers as is — as Latin1/binary (which may contain either UTF-8 bytes, or some 8-bit encoding), but Node.js (w/o TE header) encodes them in UTF-8.

Here's the thing. The client expect ASCII only. Not binary. The big difference is that technically the RFC7230 only allows US-ASCII (so from 0 to 127) while we're using latin1 (0 to 255). The client might not understand the extension at all.
Putting latin1 encoded UTF-8 is arbitrary and needs agreement with the client.

Is it not a bug?
I see that PHP-based servers work this way — they encode headers (it fact, it's only needed for non-ASCII characters) as UTF-8 bytes within Latin1.

It's not something new. It works fine with a 2013 browser (The older browsers that I found.)

In my opinion supporting that is a deviation from the spec.

Why I can't do the same with node.js? To produce the same header, as it do PHP forum servers, like I do in my Java example above.

However, it should work even without this. And it will work if Node.js will not encode the headers itself if TE is missed. It already expects Latin1 as a input string. A valid header value. But it additionally encodes it.

That's another thing.
In node we have a optimization that sends headers along with the first part of the body (or the whole body). Only the headers are in latin1, body is not.
If we try to remove the optimization (and I tried when triaging this bug) we solve the issue but benchmarks show a nearly 100% decrease in speed, which we cannot afford for a edge case like this.

Once again, I'd like to remark that RFC8187 is the newer standard available (2018), built on top of older RFCs. Consider this to be correct case and any other client support as hackish, since it's not regulated anywhere.

xtx1130 pushed a commit to xtx1130/node that referenced this issue Apr 25, 2022
PR-URL: nodejs#42624
Fixes: nodejs#42579
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
juanarbol pushed a commit that referenced this issue May 31, 2022
PR-URL: #42624
Fixes: #42579
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
danielleadams pushed a commit that referenced this issue Jun 27, 2022
PR-URL: #42624
Fixes: #42579
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
targos pushed a commit that referenced this issue Jul 11, 2022
PR-URL: #42624
Fixes: #42579
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
targos pushed a commit that referenced this issue Jul 31, 2022
PR-URL: #42624
Fixes: #42579
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
PR-URL: nodejs/node#42624
Fixes: nodejs/node#42579
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants