Replacement Character(�) appears in multibyte text output from Google VertexAI #5285

@pokutuna

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Make the model output long texts containing multibyte characters as a stream.

import { VertexAI } from "@langchain/google-vertexai";

// Set your project ID and pass the credentials according to the doc.
// https://js.langchain.com/docs/integrations/llms/google_vertex_ai
const project = "YOUR_PROJECT_ID";

const langchainModel = new VertexAI({
  model: "gemini-1.5-pro-preview-0409",
  location: "us-central1",
  authOptions: { projectId: project },
});

// EN: List as many Japanese proverbs as possible.
const prompt = "日本のことわざをできるだけたくさん挙げて";
for await (const chunk of await langchainModel.stream(prompt)) {
  process.stdout.write(chunk);
}

Error Message and Stack Trace (if applicable)

(No errors or stack traces occur)

Output Example: Includes Replacement Characters (�)

## ������������:知恵の宝庫

日本のことわざは、長い歴史の中で培われた知恵や教訓が詰まった、短い言葉の宝庫で������いくつかご紹介しますね。

**人生・教訓**

* **井の中の蛙大海を知らず** (I no naka no kawazu taikai wo shirazu):  狭い世界しか知らない者のたとえ。
* **石の上にも三年** (Ishi no ue ni mo san nen):  ������強く努力すれば成功する。
* **案ずるより産むが易し** (Anzuru yori umu ga yasushi):  心配するよりも行動した方が良い。
* **転�������������** (Korobanu saki no tsue):  前もって準備をすることの大切さ。
* **失敗は成功のもと** (Shippai wa seikou no moto):  失敗から学ぶことで成功�������る。

**人���関係**

* **類は友を呼ぶ** (Rui wa tomo wo yobu):  似た者同士が仲良くなる。
* **情けは人の為ならず** (Nasake wa hito no tame narazu):  人に親切にすることは巡り巡��て自分に良いことが返ってくる。
* **人の振り見て我が振り直せ** (Hito no furi mite waga furi naose):  他人の行動を見て自分の行動を反省する。
* **出る杭は打たれる** (Deru kui wa utareru):  他人より目���つ��叩かれる。
* **三人寄れば文殊の知恵** (Sannin yoreba monju no chie):  みんなで知恵を出し合えば良い考えが浮かぶ。

...

Description

This issue occurs when requesting outputs from the model in languages that include multibyte characters, such as Japanese, Chinese, Russian, Greek, and various other languages, or in texts that include emojis 😎.

The cause is the way streams containing multibyte characters are handled, combined with the behavior of the buffer.toString() method in Node.js.

data.on("data", (data) => this.appendBuffer(data.toString()));

When receiving a stream containing multibyte characters, a chunk boundary (the point at which the readable.on('data', ...) callback is executed) may fall in the middle of a character's byte sequence.
For instance, the emoji "👋" is represented in UTF-8 as 0xF0 0x9F 0x91 0x8B.
The callback might be executed after only 0xF0 0x9F has been received.

buffer.toString() attempts to decode byte sequences assuming UTF-8 encoding.
If the bytes are invalid, it does not throw an error; instead, it silently outputs a REPLACEMENT CHARACTER (�).
https://nodejs.org/api/buffer.html#buffers-and-character-encodings
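
The behavior can be reproduced without any network call. The following is a minimal sketch, not code from LangChain.js: it splits the UTF-8 bytes of "👋" into two chunks by hand (the variable names are illustrative) and decodes each chunk independently with toString(), the same way the data callback above does.

```javascript
// Illustrative reproduction: the chunk split is made by hand here,
// whereas in a real stream it depends on how the bytes happen to arrive.
const waveBytes = Buffer.from("👋", "utf8"); // 0xF0 0x9F 0x91 0x8B
const firstChunk = waveBytes.subarray(0, 2); // 0xF0 0x9F
const secondChunk = waveBytes.subarray(2); // 0x91 0x8B

// Each chunk is decoded on its own, so neither one forms a valid sequence.
const naive = firstChunk.toString() + secondChunk.toString();

console.log(naive.includes("\uFFFD")); // true: replacement characters were emitted
console.log(naive === "👋"); // false: the original character is lost
```

Once the replacement characters have been produced, the original bytes are gone, so the text cannot be repaired after the fact.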

To resolve this, use TextDecoder with the stream option.
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode
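
As a rough sketch of that approach, assuming chunks arrive as Buffer or Uint8Array values: `decodeChunks` below is a hypothetical helper written for illustration, not a function from LangChain.js. It keeps a single TextDecoder for the whole stream and passes `{ stream: true }` so that an incomplete trailing byte sequence is held back until the next chunk arrives.

```javascript
// Hypothetical helper for illustration; not part of LangChain.js.
function decodeChunks(chunks) {
  // One decoder instance must be reused for the whole stream,
  // because it carries the buffered partial bytes between calls.
  const decoder = new TextDecoder("utf-8");
  let text = "";
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence for the next call.
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call without arguments flushes the decoder.
  text += decoder.decode();
  return text;
}

const greetingBytes = Buffer.from("こんにちは👋", "utf8");
// Cut the input in the middle of the emoji's 4-byte sequence.
const cut = greetingBytes.length - 2;
const chunks = [greetingBytes.subarray(0, cut), greetingBytes.subarray(cut)];

console.log(decodeChunks(chunks)); // こんにちは👋
```

In a stream handler, the same idea would mean creating the decoder once outside the data callback and calling `decoder.decode(data, { stream: true })` inside it in place of `data.toString()`.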

Related Issues

The issue has been reported below, but it persists even in the latest version.

The same issue occurred when using Google Cloud's client libraries instead of LangChain, but it has been fixed.


I will send a Pull Request later, but I am not familiar with this codebase, and there are many Google-related packages under libs/ that I do not yet fully understand. Any advice would be appreciated.

System Info

  • macOS
  • node v20.12.2
  • langchain versions
$ npm list --depth=1 | grep langchain
├─┬ @langchain/community@0.0.54
│ ├── @langchain/core@0.1.61
│ ├── @langchain/openai@0.0.28
├─┬ @langchain/google-vertexai@0.0.12
│ ├── @langchain/core@0.1.61 deduped
│ └── @langchain/google-gauth@0.0.12
├─┬ langchain@0.1.36
│ ├── @langchain/community@0.0.54 deduped
│ ├── @langchain/core@0.1.61 deduped
│ ├── @langchain/openai@0.0.28 deduped
│ ├── @langchain/textsplitters@0.0.0
│ ├── langchainhub@0.0.8
