
juntaic7/Counting-Tokenization


Theoretical Barriers of Modern Tokenizer on Symbolic and Arithmetic Computation in Language Models

This project investigates the impact of tokenization on inductive counting, sorting, and reversing tasks performed by large language models. Tokenization, the process of converting a sequence of characters into tokens, can significantly affect how well models perform on tasks that require counting characters or tokens, and different models use different tokenization schemes, which in turn affects their accuracy on such tasks.

The workflow below demonstrates the experiment process using counting as an example. The reversing and sorting experiments follow an identical procedure.

(Figure: tokenization workflow overview)

Example: Counting

The following example demonstrates how various tokenization approaches can affect character counting (e.g., counting the number of 'a's) in a simple string.

The String Variations

  • (a) Original String: abbab
  • (b) Spaced String: a b b a b
  • (c) Comma-separated String: a, b, b, a, b
  • (d) Quoted String: "a", "b", "b", "a", "b"

Each string produces a different number of tokens and characters, which impacts the model's ability to count accurately.

Tokenization and Character Count

  • (a) Original String: 5 characters, 2 tokens
  • (b) Spaced String: 9 characters, 5 tokens
  • (c) Comma-separated String: 13 characters, 9 tokens
  • (d) Quoted String: 23 characters, 14 tokens
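The key point is that the ground-truth answer is the same across all four formats; only the surface form (and hence the token sequence) changes. A quick sanity check in plain Python (no project code assumed) confirms the character lengths above and the invariant count:

```python
# The four surface forms of the same underlying sequence "abbab".
variants = {
    "original": "abbab",
    "spaced": " ".join("abbab"),                       # a b b a b
    "comma": ", ".join("abbab"),                       # a, b, b, a, b
    "quoted": ", ".join(f'"{c}"' for c in "abbab"),    # "a", "b", "b", "a", "b"
}

for name, s in variants.items():
    # Character length varies by format, but the count of 'a' is always 2.
    print(f"{name:>8}: {len(s):2d} chars, {s.count('a')} a's -> {s!r}")
```

Token counts, by contrast, depend on the model's tokenizer and must be measured with its actual vocabulary.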

The figure below illustrates these examples with tokenization details.

(Figure: tokenization of the four string variations)

Examples of Running Experiments

# Run an experiment with a newly generated dataset
# -n: Number of examples (e.g., 1000)
# -l: Minimum sequence length (e.g., 20)
# -u: Maximum sequence length (e.g., 30)
# -t: The letter set (e.g., "ab" for constructing a dataset with "a" and "b")
# -c: The target character to count (e.g., "a")
# -e: The experiment version or type (e.g., 1 for original string)
# -a: The agent to use (e.g., "claude" for Anthropic)
python -m counting.count_experiment -n 1000 -l 20 -u 30 -t "ab" -c "a" -e 1 -a "claude"
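The dataset-generation flags above can be pictured with the following sketch. This is a hypothetical reimplementation that mirrors the `-n`/`-l`/`-u`/`-t`/`-c` parameters, not the repository's actual `counting` module:

```python
import random

def generate_example(letters: str, min_len: int, max_len: int) -> str:
    """Draw a random string over `letters` with length in [min_len, max_len]."""
    n = random.randint(min_len, max_len)
    return "".join(random.choice(letters) for _ in range(n))

def generate_dataset(num: int, letters: str, min_len: int, max_len: int, target: str):
    """Return (string, gold count of `target`) pairs, mirroring -n/-t/-l/-u/-c."""
    strings = [generate_example(letters, min_len, max_len) for _ in range(num)]
    return [(s, s.count(target)) for s in strings]

random.seed(0)  # reproducibility for the sketch
dataset = generate_dataset(1000, "ab", 20, 30, "a")
```

Each example carries its gold count, so evaluation later only needs the model's predicted count.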

# Run an experiment using an existing dataset with supervised chain-of-thought
# -d: Path to the existing dataset
# -c: The target character to count (e.g., "a")
# -e: The experiment type (e.g., 3 for Quoted String)
# -s: Enables supervised chain-of-thought
python -m counting.count_experiment -d "path_to_dataset" -c "a" -e 3 -s

Experiment Number Correspondence to Tokenization

  • (a) Original String: -e 1
  • (b) Spaced String: -e 4
  • (c) Comma-separated String: -e 2
  • (d) Quoted String: -e 3
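The `-e` mapping above can be sketched as a formatting function. The function name and structure are illustrative assumptions; only the number-to-format correspondence comes from the list above:

```python
def format_sequence(s: str, experiment: int) -> str:
    """Hypothetical mapping from -e values to the string variations."""
    if experiment == 1:  # (a) original string
        return s
    if experiment == 2:  # (c) comma-separated string
        return ", ".join(s)
    if experiment == 3:  # (d) quoted string
        return ", ".join(f'"{c}"' for c in s)
    if experiment == 4:  # (b) spaced string
        return " ".join(s)
    raise ValueError(f"unknown experiment type: {experiment}")
```

For example, `format_sequence("abbab", 3)` yields `"a", "b", "b", "a", "b"`, the quoted variant used by the `-e 3` run above.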

Examples of Evaluating Experiment Results

# Evaluate experiment results
# -d: Path to the dataset used for the experiment
# -p: Path to the result file to be evaluated
# -c: The target character that was counted (e.g., "a")
python -m counting.evaluate -d "path_to_dataset" -p "path_to_result_file" -c "a"
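Conceptually, evaluation compares each model answer against the gold count derived from the dataset. The sketch below is an assumption about that logic (the real `counting.evaluate` module may parse answers differently), shown here only to make the inputs of `-d`, `-p`, and `-c` concrete:

```python
import re

def extract_count(answer: str):
    """Pull the last integer from a model's free-form answer.
    Assumption: the final number in the response is its count."""
    nums = re.findall(r"\d+", answer)
    return int(nums[-1]) if nums else None

def accuracy(results, dataset, target="a"):
    """Fraction of examples where the parsed count matches the gold count."""
    gold = [s.count(target) for s in dataset]
    pred = [extract_count(r) for r in results]
    return sum(p == g for p, g in zip(pred, gold)) / len(dataset)
```

For instance, with answers `["I count 2 a's", "there are 3"]` against strings `["abbab", "aab"]`, the gold counts are `[2, 2]` and the accuracy is 0.5.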
