Replacement Character(�) appears in multibyte text output from Google VertexAI #5285
Comments
To address the issue of Replacement Characters (�) appearing in outputs with multibyte characters from Google VertexAI, it's crucial to handle UTF-8 character encoding properly, especially when dealing with streams. The core of the problem lies in how chunks are processed and converted to strings, potentially splitting multibyte characters across chunks. Here's a more direct approach to solve this:
Here is a simplified example:
const { StringDecoder } = require('string_decoder');
const decoder = new StringDecoder('utf8');
let buffer = Buffer.alloc(0); // Initialize an empty buffer
// Example: Simulate receiving chunks of data
const chunks = [Buffer.from([0xE3, 0x81, 0x82]), Buffer.from([0xE3, 0x81, 0x84])];
chunks.forEach((chunk) => {
buffer = Buffer.concat([buffer, chunk]); // Accumulate chunks
});
// Convert buffer to string, correctly handling multibyte characters
const str = decoder.write(buffer) + decoder.end();
console.log(str); // Correctly prints characters without replacement characters
This approach ensures that you're not inadvertently introducing replacement characters by splitting multibyte characters across chunks. For contributing to the LangChain.js codebase with a fix for this issue:
Make sure to review the LangChain.js contribution guidelines in their CONTRIBUTING.md file for more detailed instructions on how to contribute effectively.
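Since this thread concerns streamed output, it may help to see StringDecoder applied chunk by chunk instead of after accumulating every chunk. The following is a minimal sketch with made-up chunk boundaries; it is an illustration only, not the fix proposed for LangChain.js.

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');

// "あ" is E3 81 82 in UTF-8; here it arrives split across two chunks.
const chunks = [Buffer.from([0xe3]), Buffer.from([0x81, 0x82])];

let text = '';
for (const chunk of chunks) {
  // write() returns only the characters that are complete so far and
  // keeps any trailing partial sequence for the next call.
  text += decoder.write(chunk);
}
// end() flushes whatever is still buffered.
text += decoder.end();

console.log(text === 'あ'); // true
```

Decoding per chunk this way avoids holding the whole response in memory before any text can be emitted.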
As @dosu suggested, I've implemented a solution in #5286, but instead of using
By using
const decoder = new TextDecoder();
decoder.decode(Buffer.from([0xf0, 0x9f]), { stream: true }); // ''
decoder.decode(Buffer.from([0x91, 0x8b]), { stream: true }); // '👋'
Additionally,
The use of
Did you manage to solve the character problem?
Try using TextDecoderStream:
const response = await fetch("https://example.com");
const reader = response.body
.pipeThrough(new TextDecoderStream("utf-8"))
.getReader();
// I don't know what readDataStream does
while (true) {
const { done, value } = await reader.read();
if (done) break;
console.log(value);
}
Checked other resources
Example Code
Make the model output long texts containing multibyte characters as a stream.
Error Message and Stack Trace (if applicable)
(No errors or stack traces occur)
Output Example: Includes Replacement Characters (�)
Description
This issue occurs when requesting outputs from the model in languages that include multibyte characters, such as Japanese, Chinese, Russian, Greek, and various other languages, or in texts that include emojis 😎.
This issue occurs due to the handling of streams containing multibyte characters and the behavior of the buffer.toString() method in Node.js.
langchainjs/libs/langchain-google-gauth/src/auth.ts
Line 15 in a1ed4fe
When receiving a stream containing multibyte characters, a chunk boundary (the point at which the readable.on('data', ...) callback is executed) may fall in the middle of a character’s byte sequence.
For instance, the emoji "👋" is represented in UTF-8 as 0xF0 0x9F 0x91 0x8B. The callback might be executed after only 0xF0 0x9F has been received.
buffer.toString() attempts to decode byte sequences assuming UTF-8 encoding. If the bytes are invalid, it does not throw an error but instead silently outputs a REPLACEMENT CHARACTER (�).
https://nodejs.org/api/buffer.html#buffers-and-character-encodings
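The failure described above can be reproduced without any network call. The following is a minimal sketch that simulates the chunk split by hand; the chunk boundary is an assumption chosen to match the "👋" example, not something observed from Vertex AI.

```javascript
// "👋" (U+1F44B) is encoded in UTF-8 as the four bytes F0 9F 91 8B.
const wave = Buffer.from([0xf0, 0x9f, 0x91, 0x8b]);

// Simulate a stream that delivers the character split across two chunks.
const chunks = [wave.subarray(0, 2), wave.subarray(2)];

// Decoding each chunk on its own loses the character.
const perChunk = chunks.map((chunk) => chunk.toString("utf8")).join("");

// Decoding the reassembled bytes in one call keeps it intact.
const whole = Buffer.concat(chunks).toString("utf8");

console.log(perChunk.includes("\uFFFD")); // true
console.log(whole === "👋"); // true
```

No exception is raised in the per-chunk case, which is why the problem only shows up as stray � characters in the output.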
To resolve this, use TextDecoder with the stream option.
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode
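As a rough illustration of that approach, the sketch below decodes an async iterable of byte chunks with a single TextDecoder. The names decodeUtf8, fakeChunks, and collect are hypothetical helpers made up for this example and are not part of LangChain.js; it also assumes the chunks are Buffer or Uint8Array values, not strings.

```javascript
// Hypothetical helper (not part of LangChain.js): decodes an async iterable
// of byte chunks, such as a Node.js Readable, into string pieces.
async function* decodeUtf8(source) {
  const decoder = new TextDecoder("utf-8");
  for await (const chunk of source) {
    // With { stream: true }, an incomplete trailing sequence is held back
    // and completed by the next call instead of becoming U+FFFD.
    const text = decoder.decode(chunk, { stream: true });
    if (text.length > 0) yield text;
  }
  // A final call without the stream option flushes any bytes still buffered.
  const tail = decoder.decode();
  if (tail.length > 0) yield tail;
}

// Stand-in for a network stream: splits "👋" (F0 9F 91 8B) across two chunks.
async function* fakeChunks() {
  yield Buffer.from([0xf0, 0x9f]);
  yield Buffer.from([0x91, 0x8b, 0xe3, 0x81, 0x82]); // rest of "👋", then "あ"
}

// Concatenates every decoded piece into one string.
async function collect(source) {
  let out = "";
  for await (const piece of decodeUtf8(source)) out += piece;
  return out;
}

const result = collect(fakeChunks());
result.then((text) => console.log(text)); // 👋あ
```

The decoder instance has to be shared across all chunks of one stream, because the held-back bytes live inside it; creating a new TextDecoder per chunk would reintroduce the problem.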
Related Issues
The issue has been reported below, but it persists even in the latest version.
The same issue occurred when using Google Cloud's client libraries instead of LangChain, but it has been fixed.
I will send a Pull Request later, but I am not familiar with this codebase, and there are many google-related packages under libs/ which I have not grasped enough. Any advice would be appreciated.
System Info