Intl.Segmenter has different output in Node and Chrome #51563

Open
AudunWA opened this issue Jan 25, 2024 · 6 comments
Labels
i18n-api Issues and PRs related to the i18n implementation.

@AudunWA

AudunWA commented Jan 25, 2024

Version

v18.19.0

Platform

Darwin MacBook-Pro-10.local 23.1.0 Darwin Kernel Version 23.1.0: Mon Oct 9 21:28:45 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6020 arm64

Subsystem

No response

What steps will reproduce the bug?

Run the following code in Node.js:

const segmenter = new Intl.Segmenter("en-GB", { granularity: "word" });
console.log(
    [...segmenter.segment("This is a sentence.This is another.")]
        .filter((it) => it.isWordLike)
        .map((it) => it.segment)
);

How often does it reproduce? Is there a required condition?

The results are always the same.

What is the expected behavior? Why is that the expected behavior?

Running the same code in the console of Chrome 120.0.6099.234 results in:

['This', 'is', 'a', 'sentence', 'This', 'is', 'another']

I expect the output to be the same in Node.js and Chrome.

What do you see instead?

Running this code in Node.js results in the output

[ 'This', 'is', 'a', 'sentence.This', 'is', 'another' ]
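
To see exactly where the two runtimes disagree, it may help to dump every segment, including the non-word-like ones that the filter above drops. The following is only a diagnostic sketch; `describeSegments` is a hypothetical helper name, not something from the report. Pasting it into both Node.js and the Chrome console should show whether a boundary is placed around the full stop in `sentence.This`.

```javascript
// Hypothetical diagnostic helper (not part of the original report).
// Returns every segment with its start index and isWordLike flag,
// so the output of two runtimes can be compared side by side.
function describeSegments(text, locale = "en-GB") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return [...segmenter.segment(text)].map(({ segment, index, isWordLike }) => ({
        segment,
        index,
        isWordLike,
    }));
}

console.table(describeSegments("This is a sentence.This is another."));
```

The segments always concatenate back to the input string, so any difference between runtimes shows up as a different number of rows rather than as missing text.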

Additional information

No response

@richardlau
Member

cc @nodejs/i18n-api
Node.js 18.19.0 contains ICU 73.2 -- I'm not sure what version Chrome 120 uses.
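
For anyone reproducing this on another Node.js release, the bundled ICU version can be read from `process.versions`. A minimal sketch, assuming a Node.js build with Intl support; `icuInfo` is a hypothetical helper name, and the guard is there because `process` does not exist in a browser console:

```javascript
// Hypothetical helper: report the ICU, Unicode and CLDR versions bundled with Node.js.
// Falls back to "unknown" outside Node.js or on builds without Intl support.
function icuInfo() {
    const versions =
        typeof process !== "undefined" && process.versions ? process.versions : {};
    return {
        icu: versions.icu ?? "unknown",
        unicode: versions.unicode ?? "unknown",
        cldr: versions.cldr ?? "unknown",
    };
}

console.log(icuInfo());
```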

@srl295
Member

srl295 commented Jan 25, 2024

ICU 74.1 behaves like Node.js, per https://icu4c-demos.unicode.org/icu-bin/icusegments#1/en, so I'm inclined to think Chrome is wrong here. I can confirm the Chrome behavior. My Chrome 120.0.6099.225 seems to have ICU 73.x.
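
For context, and as far as the default word-boundary rules in UAX #29 go, rules WB6 and WB7 say not to break between letters separated by a single mid-word character such as a full stop, which would be consistent with the ICU demo and Node.js output. The sketch below is a small feature check under that reading; `keepsFullStopInsideWord` is a hypothetical name, and the result depends on the ICU data of the runtime it runs in.

```javascript
// Hypothetical feature check: does this runtime keep "sentence.This" as one word segment?
// Based on the report above, Node.js 18.19.0 would be expected to return true
// and Chrome 120 false, but the result depends on the runtime's ICU data.
function keepsFullStopInsideWord(locale = "en-GB") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    const segments = [...segmenter.segment("sentence.This")];
    return segments.length === 1;
}

console.log(keepsFullStopInsideWord());
```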

srl295 self-assigned this Jan 25, 2024
@srl295
Member

srl295 commented Jan 25, 2024

Chrome uses customized ICU data. Maybe the segmenter data is scrambled.

@V-yadav18

@AudunWA you can use the unicode-segmentation library in Node.js, which provides a JavaScript implementation of Unicode segmentation algorithms.

@srl295
Member

srl295 commented Jan 25, 2024

@V-yadav18 can you link to it here? Would be good to test that also.

@srl295
Member

srl295 commented Jan 25, 2024

@V-yadav18 https://www.npmjs.com/package/unicode-segmentation is 404, can you put a link to the library you're referring to?
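
Until the difference is resolved, one dependency-free way to get the same word list in both runtimes is to split word-like segments on full stops after segmenting. This is only a workaround sketch, not a fix for the underlying data difference; `wordsSplitOnFullStop` is a hypothetical name, and the approach assumes it is acceptable to also split things like `example.com` or `3.14`.

```javascript
// Hypothetical workaround: segment as in the report, then split any word-like
// segment on "." so that "sentence.This" becomes "sentence" and "This".
// Caveat: this also splits abbreviations, domain names and decimal numbers.
function wordsSplitOnFullStop(text, locale = "en-GB") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return [...segmenter.segment(text)]
        .filter((it) => it.isWordLike)
        .flatMap((it) => it.segment.split("."))
        .filter((part) => part.length > 0);
}

console.log(wordsSplitOnFullStop("This is a sentence.This is another."));
// [ 'This', 'is', 'a', 'sentence', 'This', 'is', 'another' ]
```

The result is the same whether or not the runtime already breaks at the full stop, because a segment without a `.` passes through `split` unchanged.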

VoltrexKeyva added the i18n-api label (Issues and PRs related to the i18n implementation) Jan 28, 2024
5 participants