Commit 11ec18e

feat: support CJK languages
1 parent bc9ed15 commit 11ec18e

9 files changed, +6002 -216 lines changed

README.md

Lines changed: 10 additions & 1 deletion
@@ -1,6 +1,15 @@
 # tokenwise

-GPT token count and context size utilities when approximations are good enough.
+GPT token count and context size utilities when approximations are good enough. The following table shows the accuracy of the token count approximation for different input texts:
+
+| Description | Actual GPT Token Count | Estimated Token Count | Error Range (%) |
+| ----------- | ---------------------- | --------------------- | --------------- |
+| Short English text | 10 | 11 | 10.00% |
+| German text with umlauts | 56 | 49 | 12.50% |
+| Metamorphosis by Franz Kafka (English) | 31891 | 36557 | 14.63% |
+| Die Verwandlung by Franz Kafka (German) | 40620 | 37873 | 6.76% |
+| 道德經 by Laozi (Chinese) | 14386 | 12239 | 14.92% |
+| TypeScript ES5 Type Declarations (~ 4000 loc) | 47890 | 54829 | 14.49% |

 For advanced use cases, please use a full tokenizer like [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). This library is intended to be used for quick estimations and to avoid the overhead of a full tokenizer, e.g. when you want to limit your bundle size.
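
The "Error Range (%)" column in the table above appears to be the absolute difference between the estimated and actual counts, relative to the actual count. That formula is inferred from the figures, not taken from the commit, though every row in the table is consistent with it. A minimal sketch under that assumption follows. `errorPercent` and `formatRow` are hypothetical helper names, not part of tokenwise and not the contents of `scripts/generateTable.ts`:

```typescript
// Hypothetical helper (not part of tokenwise): relative error of an
// estimated token count against the actual count, in percent.
function errorPercent(actual: number, estimated: number): number {
  if (actual <= 0) {
    throw new RangeError("actual token count must be positive");
  }
  return (Math.abs(estimated - actual) / actual) * 100;
}

// Hypothetical helper: format one row in the same shape as the README table.
function formatRow(description: string, actual: number, estimated: number): string {
  const error = errorPercent(actual, estimated).toFixed(2);
  return `| ${description} | ${actual} | ${estimated} | ${error}% |`;
}

// Reproduces the first row of the table from its two counts.
console.log(formatRow("Short English text", 10, 11));
// prints "| Short English text | 10 | 11 | 10.00% |"
```

Dividing by the actual count rather than the estimate means the percentage reads as "how far off the approximation is from the real tokenizer", which matches how the table is introduced.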

package.json

Lines changed: 5 additions & 2 deletions
@@ -43,6 +43,7 @@
   ],
   "scripts": {
     "build": "unbuild",
+    "docs:generate": "tsx scripts/generateTable.ts",
     "dev": "unbuild --stub",
     "lint": "eslint .",
     "lint:fix": "eslint . --fix",
@@ -51,11 +52,13 @@
     "test:types": "tsc --noEmit"
   },
   "devDependencies": {
-    "@antfu/eslint-config": "^2.1.0",
+    "@antfu/eslint-config": "^2.1.1",
     "bumpp": "^9.2.0",
     "eslint": "^8.54.0",
+    "gpt-tokenizer": "^2.1.2",
+    "tsx": "^4.6.0",
     "typescript": "^5.3.2",
     "unbuild": "^2.0.0",
-    "vitest": "^0.34.6"
+    "vitest": "^1.0.0-beta.6"
   }
 }
