Theoretical Barriers of Modern Tokenizer on Symbolic and Arithmetic Computation in Language Models

This project investigates how tokenization affects inductive counting, sorting, and reversing tasks performed by large language models. Tokenization, the process of converting a sequence of characters into tokens, can significantly affect how well a model performs on tasks that require counting characters or tokens. Different large language models may use different tokenizers, so the same input can be split differently from one model to another, which can affect counting performance.

The workflow below demonstrates the experiment process using counting as an example. Reversing and sorting experiments follow the same procedure.

Example: Counting

The following example shows how different tokenization approaches can affect character counting (e.g., counting the number of 'a's) in a simple string.

The String Variations

(a) Original String: abbab
(b) Spaced String: a b b a b
(c) Comma-separated String: a, b, b, a, b
(d) Quoted String: "a", "b", "b", "a", "b"

Each string produces a different number of tokens and characters, which impacts the model's ability to count accurately.
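As a minimal sketch, the four variations above can be produced from a single input string as follows. The function name `make_variants` and the dictionary keys are hypothetical and chosen for illustration; this is not necessarily how the repository builds its prompts.

```python
def make_variants(s: str) -> dict[str, str]:
    """Return the four formatting variants of s (hypothetical helper)."""
    chars = list(s)
    return {
        "original": s,                                    # (a) abbab
        "spaced": " ".join(chars),                        # (b) a b b a b
        "comma_separated": ", ".join(chars),              # (c) a, b, b, a, b
        "quoted": ", ".join(f'"{c}"' for c in chars),     # (d) "a", "b", ...
    }


variants = make_variants("abbab")
for name, text in variants.items():
    print(f"{name}: {text}")
```

Every variant carries the same five letters, so the ground-truth answer to the counting question is unchanged; only the surface form, and therefore the tokenization, differs.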

Tokenization and Character Count

(a) Original String: 5 characters, 2 tokens
(b) Spaced String: 9 characters, 5 tokens
(c) Comma-separated String: 13 characters, 9 tokens
(d) Quoted String: 23 characters, 14 tokens
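The character counts listed above can be checked with a few lines of Python. This sketch covers only the character side. Token counts depend on the tokenizer of the model being tested, so they are not recomputed here. The variable names are illustrative.

```python
# The four example strings from the list above.
examples = {
    "(a) Original String": "abbab",
    "(b) Spaced String": "a b b a b",
    "(c) Comma-separated String": "a, b, b, a, b",
    "(d) Quoted String": '"a", "b", "b", "a", "b"',
}

# Total characters per variant, including separators and quotes.
char_counts = {label: len(text) for label, text in examples.items()}

# Ground-truth count of the target character; identical across variants.
target_counts = {label: text.count("a") for label, text in examples.items()}

for label in examples:
    print(f"{label}: {char_counts[label]} characters, "
          f"{target_counts[label]} occurrences of 'a'")
```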

The figure below illustrates these examples with tokenization details.

Examples of Running Experiments

# Run an experiment with a newly generated dataset
# -n: Number of examples (e.g., 1000)
# -l: Minimum sequence length (e.g., 20)
# -u: Maximum sequence length (e.g., 30)
# -t: The letter set (e.g., "ab" for constructing a dataset with "a" and "b")
# -c: The target character to count (e.g., "a")
# -e: The experiment type (e.g., 1 for Original String)
# -a: The agent to use (e.g., "claude" for Anthropic)
python -m counting.count_experiment -n 1000 -l 20 -u 30 -t "ab" -c "a" -e 1 -a "claude"

# Run an experiment using an existing dataset with supervised chain-of-thought
# -d: Path to the existing dataset
# -c: The target character to count (e.g., "a")
# -e: The experiment type (e.g., 3 for Quoted String)
# -s: Enables supervised chain-of-thought
python -m counting.count_experiment -d "path_to_dataset" -c "a" -e 3 -s
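To make the dataset options concrete, here is a rough sketch of what generating a dataset from `-n`, `-l`, `-u`, and `-t` could look like. It is an assumption for illustration only: the function `generate_dataset`, its `seed` parameter, the uniform sampling, and the inclusive length bounds are not taken from the repository.

```python
import random


def generate_dataset(n: int, min_len: int, max_len: int,
                     letters: str, seed: int = 0) -> list[str]:
    """Generate n random strings over `letters` (hypothetical helper).

    Assumes lengths are drawn uniformly from [min_len, max_len], inclusive,
    and each character is drawn uniformly from the letter set.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    dataset = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)  # randint includes both ends
        dataset.append("".join(rng.choice(letters) for _ in range(length)))
    return dataset


# Mirrors the first command: -n 1000 -l 20 -u 30 -t "ab"
dataset = generate_dataset(1000, 20, 30, "ab")
print(dataset[0], len(dataset[0]))
```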

Experiment Number Correspondence to Tokenization

(a) Original String: -e 1
(b) Spaced String: -e 4
(c) Comma-separated String: -e 2
(d) Quoted String: -e 3

Examples of Evaluating Experiment Results

# Evaluate experiment results
# -d: Path to the dataset used for the experiment
# -p: Path to the result file to be evaluated
# -c: The target character that was counted (e.g., "a")
python -m counting.evaluate -d "path_to_dataset" -p "path_to_result_file" -c "a"
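The README does not describe the dataset or result file formats, or the metric that `counting.evaluate` reports. As a hedged illustration only, the sketch below assumes the simplest setup: a list of input strings, a parallel list of integer predictions, and exact-match accuracy. The function name `exact_match_accuracy` is hypothetical.

```python
def exact_match_accuracy(strings: list[str], predictions: list[int],
                         target: str) -> float:
    """Fraction of predictions equal to the true count of `target`.

    Assumed metric and data layout; the repository's evaluator may differ.
    """
    if len(strings) != len(predictions):
        raise ValueError("strings and predictions must have the same length")
    if not strings:
        return 0.0
    correct = sum(
        1 for s, p in zip(strings, predictions) if s.count(target) == p
    )
    return correct / len(strings)


# Toy data: the third prediction is wrong ("bbbb" contains no 'a').
strings = ["abbab", "aaab", "bbbb"]
predictions = [2, 3, 1]
score = exact_match_accuracy(strings, predictions, "a")
print(f"accuracy: {score:.3f}")
```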
