<a href="https://colab.research.google.com/github/jw409/modelforecast/blob/main/notebooks/colab_tpu_gpu_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TPU vs GPU Inference Benchmark

**Purpose**: Compare inference cost and performance across hardware to support an investment thesis.

**Hardware Options** (Colab, Dec 2024):
- **TPU**: v6e-1, v5e-1
- **GPU**: L4, T4, A100 (high VRAM switch available)

**Models Tested**:
- Embedding: Qwen3-Embedding-4B
- Reranking: Qwen3-Reranker-4B
- Inference: Mistral-7B, Llama-3-8B

In [None]:
# Setup - run this first
!pip install -q torch transformers accelerate sentencepiece
!pip install -q huggingface_hub

import os
import torch
import time
import json
from datetime import datetime

# Detect hardware
if torch.cuda.is_available():
    device = "cuda"
    hw_name = torch.cuda.get_device_name(0)
    hw_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {hw_name} ({hw_mem:.1f}GB)")
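elif 'TPU_NAME' in os.environ:
    # PyTorch only recognizes the "xla" device once torch_xla has been imported.
    # Assumption: torch_xla is present on the TPU runtime (it is not installed above).
    import torch_xla  # noqa: F401
    device = "xla"
    hw_name = os.environ.get('TPU_NAME', 'TPU')
    hw_mem = 'N/A'
    print(f"TPU: {hw_name}")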
else:
    device = "cpu"
    hw_name = "CPU"
    hw_mem = 'N/A'
    print("Warning: No GPU/TPU detected")

In [None]:
# Benchmark: Embedding throughput
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # 80MB, fast download

print(f"Loading {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).to(device)
model.eval()

# Test data
texts = [
    "What is the capital of France?",
    "How do neural networks learn?",
    "Explain quantum computing in simple terms.",
] * 100  # 300 texts

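def sync():
    # CUDA kernels launch asynchronously; wait for them so the timer is accurate.
    # Note: XLA devices need their own synchronization, which is not handled here.
    if device == "cuda":
        torch.cuda.synchronize()

# Warmup
with torch.no_grad():
    inputs = tokenizer(texts[:10], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    _ = model(**inputs)
sync()

# Benchmark (tokenization time is included in the measurement)
batch_size = 32
total_tokens = 0
start = time.perf_counter()

with torch.no_grad():
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
        # Count real tokens on the CPU; numel() would also count padding.
        total_tokens += int(inputs['attention_mask'].sum())
        outputs = model(**inputs.to(device))

sync()
elapsed = time.perf_counter() - start
tokens_per_sec = total_tokens / elapsed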

result = {
    "timestamp": datetime.now().isoformat(),
    "hardware": hw_name,
    "hardware_mem_gb": hw_mem,
    "model": MODEL_ID,
    "task": "embedding",
    "texts_processed": len(texts),
    "total_tokens": total_tokens,
    "elapsed_seconds": round(elapsed, 2),
    "tokens_per_second": round(tokens_per_sec, 1),
    "texts_per_second": round(len(texts) / elapsed, 1),
}

print(json.dumps(result, indent=2))

In [None]:
# Save results for aggregation
import os

results_dir = "/content/benchmark_results"
os.makedirs(results_dir, exist_ok=True)

filename = f"{results_dir}/{hw_name.replace(' ', '_')}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(filename, 'w') as f:
    json.dump(result, f, indent=2)

print(f"Saved to {filename}")
print("\nTo download: Files -> benchmark_results/")
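
The saved files can later be combined into one view per hardware tier. The following is a minimal sketch, assuming each JSON file has the `hardware` and `tokens_per_second` keys written by the cell above; the helper names `load_results` and `best_throughput_by_hardware` are hypothetical and not part of this notebook.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Read every *.json file in results_dir into a list of dicts."""
    results = []
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path) as f:
            results.append(json.load(f))
    return results


def best_throughput_by_hardware(results):
    """Keep the highest tokens_per_second seen for each hardware name."""
    best = {}
    for r in results:
        hw = r["hardware"]
        tps = r["tokens_per_second"]
        if hw not in best or tps > best[hw]:
            best[hw] = tps
    return best
```

For example, `best_throughput_by_hardware(load_results("/content/benchmark_results"))` would return one throughput figure per device name once runs from several runtimes have been collected in that directory.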

## Investment Thesis Data Points

After running on different hardware tiers, compare:

| Hardware | Type | VRAM | Cost/hr | Tokens/sec | Cost per 1M tokens |
|----------|------|------|---------|------------|--------------------|
| T4 | GPU | 16GB | $0 (free) | ? | ? |
| L4 | GPU | 24GB | ? | ? | ? |
| A100 | GPU | 40/80GB | ? | ? | ? |
| v5e-1 | TPU | - | ? | ? | ? |
| v6e-1 | TPU | - | ? | ? | ? |
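
The last column follows from the two before it: cost per 1M tokens = hourly cost / (tokens per second × 3600) × 1,000,000. A small sketch of that arithmetic is below; the function name is hypothetical, and the inputs in the usage note are placeholders rather than measured results.

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly hardware price and a measured throughput into $/1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000
```

As an illustration with made-up numbers, a device billed at $1.00/hr that sustains 5,000 tokens/sec processes 18M tokens per hour, so `cost_per_million_tokens(1.00, 5000)` comes to about $0.056 per 1M tokens.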

**Key questions**:
1. At what scale does TPU beat GPU?
2. What's the break-even vs API pricing (OpenRouter ~$0.10/1M tokens)?
3. Which workloads favor which hardware?
4. Is the L4 worth the upgrade over the T4?
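
Question 2 can be framed as a throughput threshold: self-hosting matches an API price once the hardware sustains enough tokens per second to bring its cost per token down to the API's. A minimal sketch, assuming a flat hourly hardware price and a flat per-token API price (the function name is hypothetical):

```python
def break_even_tokens_per_second(cost_per_hour: float, api_price_per_million: float) -> float:
    """Throughput at which self-hosted cost per token equals the API price."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    cost_per_second = cost_per_hour / 3600
    api_price_per_token = api_price_per_million / 1_000_000
    return cost_per_second / api_price_per_token
```

With a placeholder hardware price of $0.36/hr and the ~$0.10/1M API figure above, the threshold works out to roughly 1,000 tokens/sec. Sustained throughput above that favors self-hosting; below it, the API is cheaper. The sketch ignores idle time, so real utilization would push the threshold higher.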

## Cross-Hardware Matrix (The Complete Picture)

This notebook measures **same-hardware** inference only. A real investment thesis also needs the cross-hardware picture, where training and serving may run on different devices:

|  | **Serve GPU** | **Serve TPU** |
|--|---------------|---------------|
| **Train GPU** | Baseline (PyTorch) | Export to JAX |
| **Train TPU** | Export to ONNX | Native JAX/Flax |

**Next notebooks:**
- `cross_hardware_serving.ipynb` - conversion overhead
- `training_cost.ipynb` - train time comparison  
- `total_cost_calculator.ipynb` - optimal split for N requests
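
One way the "optimal split for N requests" question could be framed is as a one-off cost (training plus conversion) and a per-request serving cost for each cell of the matrix above, with the cheapest cell chosen for a given N. The sketch below is an assumption about how such a calculator might look, not the content of `total_cost_calculator.ipynb`; the `Pipeline` class and the `cheapest` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    name: str
    train_cost: float              # one-off training cost, USD
    conversion_cost: float         # one-off export/conversion cost, USD
    serve_cost_per_request: float  # marginal serving cost, USD

    def total_cost(self, n_requests: int) -> float:
        """Fixed costs plus serving cost for n_requests."""
        fixed = self.train_cost + self.conversion_cost
        return fixed + self.serve_cost_per_request * n_requests


def cheapest(pipelines, n_requests: int) -> Pipeline:
    """Return the pipeline with the lowest total cost at this request volume."""
    return min(pipelines, key=lambda p: p.total_cost(n_requests))
```

A pipeline with a higher conversion cost but cheaper serving only wins past a break-even volume, so the answer depends on N rather than on the hardware alone.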