Counting UTF-8 characters #114

Open
lanlin opened this issue Mar 10, 2022 · 4 comments

lanlin commented Mar 10, 2022

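Background

Suppose you have a steam fryer... ahem, sorry, wrong channel.

Suppose you need to write a validation rule that limits the length of article titles and bodies, and your product serves users all over the world...

We usually pick UTF-8 as the character encoding, but a single character in UTF-8 takes a variable number of bytes, anywhere from 1 to 4.

So the byte count of a UTF-8 string is not necessarily the same as the number of characters in it, and counting bytes alone gives the wrong answer.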
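
To make the gap between bytes and characters concrete, here is a minimal JavaScript sketch. The helper names `utf8ByteLength` and `codePointCount` are made up for illustration; `TextEncoder` is the standard global available in browsers and Node.js. The sample characters are written as escapes so the exact code points are unambiguous.

```javascript
// Hypothetical helpers for illustration only.

// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Number of Unicode code points; iterating a string yields code points.
function codePointCount(text) {
  let count = 0;
  for (const _ of text) count += 1;
  return count;
}

// 'A', 'é' (U+00E9), '❤' (U+2764), '😹' (U+1F639): 1, 2, 3 and 4 bytes.
const samples = ['A', '\u00E9', '\u2764', '\u{1F639}'];
for (const s of samples) {
  console.log(s, 'bytes:', utf8ByteLength(s), 'code points:', codePointCount(s));
}
```

Each sample is one character, yet the byte counts differ, which is why a length limit based on bytes would treat users of different scripts unequally.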

Below are notes on how to count characters in different programming languages. Additions are welcome.

Two special strings for you to try:

'I❤𠀰'        // 3 characters
'😹🐶😹🐶'  // 4 characters

lanlin added the 常用 ("frequently used") label Mar 10, 2022

lanlin commented Mar 10, 2022

PHP

// 10 characters
\mb_strlen('hello 😹🐶😹🐶', 'UTF-8');


lanlin commented Mar 10, 2022

Go

// 10 characters
len([]rune("hello 😹🐶😹🐶"))


lanlin commented Mar 10, 2022

JavaScript

// 10 characters
[...'hello 😹🐶😹🐶'].length;

  1. JavaScript has a Unicode problem
  2. JavaScript 如何正确处理 Unicode 编码问题! (How to handle Unicode encoding correctly in JavaScript)
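
For context on why the spread is needed: `String.prototype.length` counts UTF-16 code units, so an emoji outside the Basic Multilingual Plane counts as 2. Spreading counts code points, which still differs from user-perceived characters when combining marks are involved. A minimal sketch comparing the three, assuming `Intl.Segmenter` is available (modern browsers, Node.js 16+); `countAll` is a hypothetical helper name:

```javascript
// Hypothetical helper; assumes Intl.Segmenter exists in the runtime.
function countAll(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf16Units: text.length,                         // what .length reports
    codePoints: [...text].length,                    // what the spread reports
    graphemes: [...segmenter.segment(text)].length,  // user-perceived characters
  };
}

console.log(countAll('\u{1F639}')); // one emoji (U+1F639)
console.log(countAll('e\u0301'));   // "e" followed by a combining acute accent
```

The emoji is 2 UTF-16 units but 1 code point and 1 grapheme; the accented "e" is 2 code points but 1 grapheme. Which count a length rule should use depends on what the limit is meant to protect (storage size versus what the user sees).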


lanlin commented Mar 10, 2022

Rust

use unicode_segmentation::UnicodeSegmentation;

// 10 characters, counted as extended grapheme clusters
// (requires the unicode-segmentation crate)
"hello 😹🐶😹🐶".graphemes(true).count();
