Count tokens

A simple tool with one purpose: counting tokens in a text file.

Requirements

This package uses the tiktoken library for tokenization.

Installation

To use it from the command line, install the package in an isolated environment with pipx:

$ pipx install count-tokens

or install it in your current environment with pip.

Usage

Open a terminal and run:

$ count-tokens document.txt

You should see something like this:

File: document.txt
Encoding: cl100k_base
Number of tokens: 67

If you want to see only the token count, run:

$ count-tokens document.txt --quiet

and the output will be only the number:

67

NOTE: tiktoken supports three encodings used by OpenAI models:

| Encoding name | OpenAI models |
| --- | --- |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |

To use count-tokens with an encoding other than the default cl100k_base, pass the additional argument -e or --encoding:

$ count-tokens document.txt -e r50k_base
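If you call the command from a script, the choice of encoding can be derived from the model name. The following is a minimal sketch, not part of the package: the helper names (`build_command`, `parse_quiet_output`, `count_tokens_cli`) are hypothetical, the model-to-encoding pairs are copied from the table above, and it assumes that --quiet can be combined with -e and prints only the integer count.

```python
import subprocess

# Model-to-encoding pairs copied from the table above.
ENCODING_FOR_MODEL = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def build_command(path: str, model: str) -> list:
    """Build the count-tokens argument list for the given file and model."""
    encoding = ENCODING_FOR_MODEL[model]  # raises KeyError for unknown models
    return ["count-tokens", path, "--quiet", "-e", encoding]


def parse_quiet_output(stdout: str) -> int:
    """Assumption: in --quiet mode the tool prints only the token count."""
    return int(stdout.strip())


def count_tokens_cli(path: str, model: str) -> int:
    """Run count-tokens (must be installed and on PATH) and return the count."""
    result = subprocess.run(
        build_command(path, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return parse_quiet_output(result.stdout)
```

Keeping the command construction separate from the subprocess call makes the mapping easy to check without the tool installed.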

Approximate number of tokens

If you need a result faster and do not need the exact number of tokens, use the --approx parameter with w for an approximation based on the number of words, or c for an approximation based on the number of characters.

$ count-tokens document.txt --approx w

The approximation assumes 4/3 (1 and 1/3) tokens per word and 4 characters per token.
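The two ratios can be illustrated with a short sketch. This is not the package's implementation: the function names are hypothetical, and both the whitespace-based word splitting and the rounding down are assumptions made here for the sake of a concrete example.

```python
def approx_tokens_by_words(text: str) -> int:
    """Estimate tokens as 4/3 tokens per whitespace-separated word."""
    n_words = len(text.split())
    return n_words * 4 // 3  # integer arithmetic, rounds down (assumption)


def approx_tokens_by_chars(text: str) -> int:
    """Estimate tokens as one token per 4 characters."""
    return len(text) // 4  # rounds down (assumption)


sample = "This is a string."            # 4 words, 17 characters
print(approx_tokens_by_words(sample))   # 5
print(approx_tokens_by_chars(sample))   # 4
```

The two estimates usually differ slightly; neither requires loading a tokenizer, which is why they are faster than an exact count.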

## Programmatic usage

```python
from count_tokens.count import count_tokens_in_file

num_tokens = count_tokens_in_file("document.txt")
```

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.")
```

For both functions, you can use the encoding parameter to specify the encoding used by the model:

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")
```

The default value for encoding is cl100k_base.

Related Projects

tiktoken - tokenization library used by this package

Credits

Thanks to the authors of the tiktoken library for open sourcing their work.
