<a href="https://colab.research.google.com/github/leonpalafox/colab_notebooks/blob/main/TokenizerCount.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Token Counter for Google Colab

This script helps you analyze PDF documents and count tokens using different tokenizers from Hugging Face. It's specifically designed to work in Google Colab notebooks.

## Features

- Extract and analyze text from PDF documents
- Count tokens using multiple tokenizers (GPT-2, T5, BERT, RoBERTa)
- Support for custom tokenizers from Hugging Face
- Interactive file selection in Colab
- Detailed text statistics

## Requirements

The script requires these Python packages:
- PyPDF2
- transformers
- torch

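`transformers` and `torch` are typically pre-installed in Google Colab; `PyPDF2` may need to be installed. The T5 tokenizer class the script loads (`T5Tokenizer`) also depends on the `sentencepiece` package. You can install any missing packages with:

```python
!pip install PyPDF2 transformers torch sentencepiece
```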
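To see which packages are actually missing before installing anything, you can probe for them with the standard library. This is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of the script.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used by the script
required = ["PyPDF2", "transformers", "torch"]
print("Missing packages:", missing_packages(required) or "none")
```

Only the packages reported as missing need to be passed to `pip install`.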

## Basic Usage

### Simple Method

Run the script and then call the `analyze_pdf()` function without arguments:

```python
# After pasting the script in a cell and running it
analyze_pdf()
```

This will:
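1. Search for PDF files in the current directory and its subdirectories
2. List the files it finds and prompt you to pick one by number; an invalid entry opens Colab's upload dialog instead
3. Analyze the PDF and display token counts from multiple tokenizers

If no PDF files are found, the function prints upload instructions and returns without analyzing anything.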
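The search step can be previewed on its own, which is handy for checking what the prompt will list before calling `analyze_pdf()`. The sketch below uses `pathlib` rather than the script's `os.walk` loop, and `list_pdfs` is a hypothetical helper name.

```python
from pathlib import Path

def list_pdfs(root="."):
    """Return all PDF paths under `root`, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Print a numbered list in the same style as the script's prompt
for number, path in enumerate(list_pdfs(), start=1):
    print(f"{number}. {path}")
```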

### Advanced Method

You can specify parameters directly:

```python
analyze_pdf(
    pdf_path="your_document.pdf",              # Path to a specific PDF file
    tokenizers=["gpt2", "bert"],               # Only use these tokenizers
    custom_tokenizer="facebook/opt-350m"       # Add a custom tokenizer
)
```

## Function Reference

### `analyze_pdf(pdf_path=None, tokenizers=None, custom_tokenizer=None)`

Analyzes a PDF file and counts tokens.

**Parameters:**
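- `pdf_path` (str, optional): Path to the PDF file. If omitted, the function lists the PDFs it finds and prompts for a choice.
- `tokenizers` (list, optional): List of tokenizer names to use. Defaults to ["gpt2", "t5", "bert", "roberta"], which load the `gpt2`, `t5-base`, `bert-base-uncased`, and `roberta-base` checkpoints.
- `custom_tokenizer` (str, optional): Hugging Face model ID of an additional tokenizer, loaded with `AutoTokenizer.from_pretrained` and counted alongside the others.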

**Returns:**
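A dictionary containing:
- `text`: The extracted text
- `statistics`: Character and word counts (keys `characters` and `words`)
- `token_counts`: Token counts for each tokenizer; a value of `-1` means that tokenizer could not be loaded

Returns `None` if no PDF is found, the file does not exist, or no text could be extracted.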
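Because the function can return either a dictionary or `None`, calling code should check the result before indexing into it. The following is a minimal sketch; `summarize` is a hypothetical helper, and the example dictionary only mimics the documented shape with illustrative values.

```python
def summarize(results):
    """Format a one-line summary of an analyze_pdf() result, tolerating None."""
    if results is None:
        return "analysis did not complete"
    stats = results["statistics"]
    # count_tokens reports -1 when a tokenizer could not be loaded; skip those
    loaded = {name: n for name, n in results["token_counts"].items() if n >= 0}
    counts = ", ".join(f"{name}={n:,}" for name, n in loaded.items())
    return f"{stats['words']:,} words; tokens: {counts}"

# Hypothetical result in the documented shape (values are illustrative)
example = {
    "text": "example text",
    "statistics": {"characters": 15428, "words": 2547},
    "token_counts": {"gpt2": 3256, "bert": 3891, "my-org/missing": -1},
}
print(summarize(example))
print(summarize(None))
```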

## Example Output

```
Extracting text from example.pdf...
PDF has 5 pages

Text statistics:
Characters: 15,428
Words: 2,547

Counting tokens using 4 tokenizers...

Token counts:
gpt2: 3,256 tokens
t5: 4,121 tokens
bert: 3,891 tokens
roberta: 3,347 tokens
```

## Understanding Token Counts

Different tokenizers will produce different token counts for the same text because they:

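1. **Use different vocabularies** - A word that was frequent in the tokenizer's training data is often a single token, while a rare word is split into several pieces
2. **Have different tokenization strategies** - Some keep punctuation as separate tokens, others merge it with neighboring characters
3. **Handle whitespace differently** - Some tokenizers encode a leading space as part of the token, others discard whitespace
4. **Process capitalization differently** - Cased tokenizers distinguish "PDF" from "pdf", while `bert-base-uncased` (used here) lowercases the text first
5. **Add special tokens** - `tokenizer.encode` includes model-specific markers, such as BERT's `[CLS]` and `[SEP]`, in the count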

This is why the script shows results from multiple tokenizers, helping you understand the range of possible token counts for your document.
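The effect of these choices can be seen without downloading any model. The sketch below uses two toy splitters as stand-ins; they are not the subword algorithms the real tokenizers use (byte-level BPE, WordPiece, SentencePiece), and all function names are hypothetical.

```python
import re

sentence = "Tokenizers don't agree: GPT-2, BERT and T5 split text differently."

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def word_punct_tokens(text):
    """Separate every punctuation mark from the surrounding word characters."""
    return re.findall(r"\w+|[^\w\s]", text)

def distinct_tokens(tokens, lowercase=False):
    """Count distinct token strings, optionally after lowercasing (uncased behavior)."""
    return len({t.lower() if lowercase else t for t in tokens})

for name, splitter in [("whitespace", whitespace_tokens), ("word+punct", word_punct_tokens)]:
    print(f"{name}: {len(splitter(sentence))} tokens")

# Case handling changes how many different tokens a vocabulary needs
case_sample = "PDF pdf Pdf".split()
print("cased:", distinct_tokens(case_sample), "uncased:", distinct_tokens(case_sample, lowercase=True))
```

The same sentence yields 10 tokens with the first splitter and 17 with the second, purely because of how punctuation is handled.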

## Troubleshooting

- **No text extracted**: Some PDFs contain images of text rather than actual text characters. Try using OCR software first.
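- **Error loading tokenizer**: Check your internet connection and the spelling of the tokenizer name, or try a different tokenizer. A tokenizer that fails to load is reported with a count of `-1`.
- **"Token indices sequence length is longer than the specified maximum" warning**: Expected for long documents. The script only counts tokens and never runs a model, so the warning can be ignored.
- **Memory errors**: Very large PDFs may cause memory issues. Try processing the document in smaller chunks.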
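One way to process a document in smaller chunks is to split the extracted text at whitespace and sum the per-chunk counts. This is a sketch under assumptions: `iter_chunks` and `count_in_chunks` are hypothetical helpers, and a plain word counter stands in for a real tokenizer so the example runs offline.

```python
from typing import Callable, Iterator

def iter_chunks(text: str, max_chars: int = 100_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most max_chars long, cut at whitespace where possible."""
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space or newline so words are not split across chunks
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        yield text[start:end]
        start = end

def count_in_chunks(text: str, count_fn: Callable[[str], int], max_chars: int = 100_000) -> int:
    """Sum count_fn over chunks; count_fn is any str -> int token counter."""
    return sum(count_fn(chunk) for chunk in iter_chunks(text, max_chars))

# Stand-in counter so the sketch runs without downloading a tokenizer
word_counter = lambda chunk: len(chunk.split())
sample = "one two three four five six seven eight nine ten " * 1000
print(count_in_chunks(sample, word_counter, max_chars=64))
```

With a Hugging Face tokenizer, `count_fn` could be `lambda chunk: len(tokenizer.encode(chunk, add_special_tokens=False))`. The summed total may differ slightly from a single whole-document count, because tokens that would span a chunk boundary are split differently.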

## Advanced Usage

You can capture the results for further analysis:

```python
results = analyze_pdf("your_document.pdf")

# Access extracted text
text = results["text"]

# Get statistics
character_count = results["statistics"]["characters"]
word_count = results["statistics"]["words"]

# Get token counts for specific tokenizers
gpt2_count = results["token_counts"]["gpt2"]
bert_count = results["token_counts"]["bert"]
```
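Once the counts are captured, simple derived metrics are easy to compute. The sketch below reuses the numbers from the 264-page sample run shown further down in this notebook; `tokens_per_word` and `fits` are hypothetical helpers, and the 128,000-token budget is an arbitrary illustrative value.

```python
# Counts copied from the sample run on the 264-page PDF in this notebook
statistics = {"characters": 1115992, "words": 156388}
token_counts = {"gpt2": 278683, "t5": 303155, "bert": 265430, "roberta": 278685}

def tokens_per_word(token_counts, word_count):
    """Average tokens per whitespace-separated word, skipping failed tokenizers (-1)."""
    return {name: n / word_count for name, n in token_counts.items() if n >= 0}

def fits(token_counts, context_window):
    """Map each tokenizer to whether the document fits in `context_window` tokens."""
    return {name: 0 <= n <= context_window for name, n in token_counts.items()}

for name, ratio in tokens_per_word(token_counts, statistics["words"]).items():
    print(f"{name}: {ratio:.2f} tokens per word")

# 128_000 is an illustrative budget, not a property of any tokenizer above
print(fits(token_counts, 128_000))
```

In that run, `roberta` reports exactly two more tokens than `gpt2`. This is consistent with `encode` adding RoBERTa's start and end markers while adding nothing for GPT-2 by default, since both tokenizers use byte-level BPE.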

In [6]:
# PDF Token Counter for Google Colab
# This script counts tokens in PDF files using various tokenizers from Hugging Face

import os
import sys
import argparse
from typing import Dict, List, Optional
from google.colab import files

# PDF processing
import PyPDF2

# Hugging Face tokenizers
from transformers import (
    AutoTokenizer,
    GPT2Tokenizer,
    T5Tokenizer,
    BertTokenizer,
    RobertaTokenizer
)

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extract all text from a PDF file

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Extracted text as a string
    """
    text = ""

    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            num_pages = len(reader.pages)

            print(f"PDF has {num_pages} pages")

            for page in reader.pages:
                # extract_text() may yield nothing for image-only pages; treat that as empty text
                text += (page.extract_text() or "") + "\n"

        return text

    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
        return ""

def count_tokens(text: str, tokenizer_name: str) -> int:
    """
    Count tokens in text using the specified tokenizer

    Args:
        text: Text to tokenize
        tokenizer_name: Name of the tokenizer to use

    Returns:
        Number of tokens
    """
    # Map short names to (tokenizer class, checkpoint); nothing is loaded yet
    known_tokenizers = {
        "gpt2": (GPT2Tokenizer, "gpt2"),
        "t5": (T5Tokenizer, "t5-base"),
        "bert": (BertTokenizer, "bert-base-uncased"),
        "roberta": (RobertaTokenizer, "roberta-base"),
    }

    # Load only the requested tokenizer instead of all four on every call.
    # Unknown names are treated as Hugging Face model IDs and go through AutoTokenizer.
    tokenizer_class, checkpoint = known_tokenizers.get(
        tokenizer_name, (AutoTokenizer, tokenizer_name)
    )
    try:
        tokenizer = tokenizer_class.from_pretrained(checkpoint)
    except Exception as e:
        print(f"Error loading tokenizer {tokenizer_name}: {e}")
        return -1

    # encode() includes any special tokens the model adds by default
    tokens = tokenizer.encode(text)

    return len(tokens)

def count_tokens_with_multiple_tokenizers(text: str, tokenizer_names: Optional[List[str]] = None) -> Dict[str, int]:
    """
    Count tokens in text using multiple tokenizers

    Args:
        text: Text to tokenize
        tokenizer_names: List of tokenizer names to use

    Returns:
        Dictionary mapping tokenizer names to token counts
    """
    if not tokenizer_names:
        # Default tokenizers
        tokenizer_names = ["gpt2", "t5", "bert", "roberta"]

    results = {}

    for tokenizer_name in tokenizer_names:
        token_count = count_tokens(text, tokenizer_name)
        results[tokenizer_name] = token_count

    return results

def find_pdf_files():
    """Find PDF files in the current directory and subdirectories"""
    pdf_files = []
    # Use names that do not shadow the google.colab `files` module imported above
    for root, _, filenames in os.walk('.'):
        for filename in filenames:
            if filename.lower().endswith('.pdf'):
                pdf_files.append(os.path.join(root, filename))
    return pdf_files

# For use in Colab notebooks
def analyze_pdf(pdf_path=None, tokenizers=None, custom_tokenizer=None):
    """Analyze a PDF file and count tokens"""

    if pdf_path is None:
        # List available PDF files
        pdf_files = find_pdf_files()

        if not pdf_files:
            print("No PDF files found in the current directory.")
            print("Please upload a PDF file using the code below:")
            print("from google.colab import files")
            print("uploaded = files.upload()")
            return

        print("Available PDF files:")
        for i, file in enumerate(pdf_files):
            print(f"{i+1}. {file}")

        choice = input("Enter the number of the PDF to analyze (or upload a new one): ")

        try:
            choice_idx = int(choice) - 1
            if 0 <= choice_idx < len(pdf_files):
                pdf_path = pdf_files[choice_idx]
            else:
                print("Invalid choice. Please upload a PDF file.")
                uploaded = files.upload()
                pdf_path = list(uploaded.keys())[0]
        except ValueError:  # the input was not a number
            print("Invalid input. Please upload a PDF file.")
            uploaded = files.upload()
            pdf_path = list(uploaded.keys())[0]

    # Check if the PDF file exists
    if not os.path.isfile(pdf_path):
        print(f"Error: PDF file {pdf_path} not found")
        return

    # Set default tokenizers if not provided
    if tokenizers is None:
        tokenizers = ["gpt2", "t5", "bert", "roberta"]

    # Add custom tokenizer if provided
    tokenizers_to_use = tokenizers.copy()
    if custom_tokenizer:
        tokenizers_to_use.append(custom_tokenizer)

    # Extract text from PDF
    print(f"Extracting text from {pdf_path}...")
    text = extract_text_from_pdf(pdf_path)

    if not text.strip():  # pages with no extractable text still contribute newlines
        print("No text extracted from PDF. Check if the PDF contains extractable text.")
        return

    # Print text statistics
    print("\nText statistics:")
    print(f"Characters: {len(text):,}")
    print(f"Words: {len(text.split()):,}")

    # Count tokens using different tokenizers
    print(f"\nCounting tokens using {len(tokenizers_to_use)} tokenizers...")
    token_counts = count_tokens_with_multiple_tokenizers(text, tokenizers_to_use)

    # Print results
    print("\nToken counts:")
    for tokenizer_name, count in token_counts.items():
        if count >= 0:
            print(f"{tokenizer_name}: {count:,} tokens")
        else:
            print(f"{tokenizer_name}: Error loading tokenizer")

    return {
        "text": text,
        "statistics": {
            "characters": len(text),
            "words": len(text.split())
        },
        "token_counts": token_counts
    }

# For direct execution in Colab
if __name__ == "__main__":
    # Jupyter/Colab starts the kernel with "-f <connection-file>.json".
    # Stripping only "-f" leaves the JSON path behind, where it is mistaken for
    # pdf_path and fails with "EOF marker not found". Declaring -f as a hidden
    # option lets argparse consume both the flag and its value.
    parser = argparse.ArgumentParser(description="Count tokens in PDF files using various tokenizers")
    parser.add_argument("pdf_path", nargs='?', help="Path to the PDF file")
    parser.add_argument("--tokenizers", nargs="+", default=["gpt2", "t5", "bert", "roberta"],
                      help="List of tokenizers to use (default: gpt2, t5, bert, roberta)")
    parser.add_argument("--custom", help="Use a custom tokenizer from Hugging Face (e.g., 'EleutherAI/gpt-neox-20b')")
    parser.add_argument("-f", dest="kernel_connection_file", help=argparse.SUPPRESS)

    try:
        args = parser.parse_args()
        analyze_pdf(args.pdf_path, args.tokenizers, args.custom)
    except SystemExit:
        # Argument parsing failed (e.g. unexpected notebook arguments); fall back to interactive mode
        analyze_pdf()


In [7]:
analyze_pdf()

Available PDF files:
1. ./2504.01990v1.pdf
Enter the number of the PDF to analyze (or upload a new one): 1
Extracting text from ./2504.01990v1.pdf...
PDF has 264 pages

Text statistics:
Characters: 1,115,992
Words: 156,388

Counting tokens using 4 tokenizers...


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



Token indices sequence length is longer than the specified maximum sequence length for this model (278683 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (265430 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (278685 > 512). Running this sequence through the model will result in indexing errors



Token counts:
gpt2: 278,683 tokens
t5: 303,155 tokens
bert: 265,430 tokens
roberta: 278,685 tokens


 'statistics': {'characters': 1115992, 'words': 156388},
 'token_counts': {'gpt2': 278683,
  't5': 303155,
  'bert': 265430,
  'roberta': 278685}}