# tokenx
Fast and lightweight token count estimation for any LLM without requiring a full tokenizer. This library provides quick approximations that are good enough for most use cases while keeping your bundle size minimal.
For advanced use cases requiring precise token counts, please use a full tokenizer like [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer).
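For a quick impression, here is a minimal sketch (assuming the package is imported as `tokenx` and exports an `approximateTokenSize` helper):

```js
import { approximateTokenSize } from 'tokenx'

// Rough estimate of the token count — close to a real tokenizer's result, but not exact
const estimatedTokens = approximateTokenSize('Hello, world! How are you today?')
console.log(estimatedTokens)
```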
## Benchmarks
The following table shows the accuracy of the token count approximation for different inputs:

| Input | Actual tokens | Estimated tokens | Deviation |
| --- | --- | --- | --- |
| Short English text | 10 | 11 | 10.00% |
| German text with umlauts | 56 | 49 | 12.50% |
| Metamorphosis by Franz Kafka (English) | 31892 | 35705 | 11.96% |
| Die Verwandlung by Franz Kafka (German) | 40621 | 35069 | 13.67% |
| 道德經 by Laozi (Chinese) | 14387 | 12059 | 16.18% |
| TypeScript ES5 Type Declarations (~ 4000 loc) | 48553 | 52434 | 7.99% |
<!-- END GENERATED TOKEN COUNT TABLE -->
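The deviation column reads as the relative error of the estimate measured against the reference tokenizer's count (an interpretation inferred from the table values); as a sketch:

```js
// Relative error of an estimate, formatted like the table's deviation column.
// deviation(10, 11) === '10.00%' — matches the "Short English text" row.
function deviation(referenceTokens, estimatedTokens) {
  const relativeError = Math.abs(referenceTokens - estimatedTokens) / referenceTokens
  return `${(relativeError * 100).toFixed(2)}%`
}
```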
## Features
- ⚡ Fast token estimation without a full tokenizer
- 🌍 Multi-language support with configurable language rules
- 🗣️ Built-in support for accented characters (German, French, Spanish, etc.)
## Usage

The snippet below is a minimal sketch; the import and the `prompt`/`tokenLimit` setup are assumptions added to keep the example self-contained, assuming the package exports an `isWithinTokenLimit` helper:

```js
import { isWithinTokenLimit } from 'tokenx'

const prompt = 'Your prompt text here' // hypothetical input
const tokenLimit = 4096 // hypothetical context size

// Check whether the input fits the limit without running a full tokenizer
const withinLimit = isWithinTokenLimit(prompt, tokenLimit)
console.log(`Is within token limit: ${withinLimit}`)

// Use custom options for different languages or models
const customOptions = {
  defaultCharsPerToken: 4, // More conservative estimation
  languageConfigs: [
    { pattern: /[你我他]/g, averageCharsPerToken: 1.5 }, // Custom Chinese rule
  ],
}
```
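How these options are consumed depends on the library version; a hypothetical call passing them as a second argument could look like this (check the API documentation for the exact signature):

```js
import { approximateTokenSize } from 'tokenx'

// Hypothetical usage — the options parameter is an assumption, not a documented signature
const estimated = approximateTokenSize('你我他', customOptions)
```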