# TranslateGemma - Document Translator

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jimmyliao/trans-gemma/blob/main/document-translator-colab.ipynb)

Translate PDFs, images, and websites to Traditional Chinese (zh-TW) using Google's TranslateGemma model.

**Features:**
- 📄 Download from arXiv automatically
- 🖼️ Upload images or PDFs
- 🌐 Screenshot websites and translate
- 🚀 Fast GPU inference on Colab (T4)
- 🇹🇼 Force Traditional Chinese output (configurable)

**Single Source of Truth:** Uses the same code from [trans-gemma repo](https://github.com/jimmyliao/trans-gemma)

---

## 👤 About the Author

**Jimmy Liao** - AI GDE (Google Developer Expert), CTO/Co-Founder of an AI startup

Dedicated to the smart manufacturing and finance sectors, focusing on turning the technical challenges of AI advancement into competitive advantages while enhancing client value and operational efficiency.

- 🐦 Twitter: [@jimmyliao](https://twitter.com/jimmyliao)
- 💼 LinkedIn: [jimmyliao](https://linkedin.com/in/jimmyliao)
- 📝 Blog: [memo.jimmyliao.net](https://memo.jimmyliao.net)
- 🔗 Sessionize: [jimmy-liao](https://sessionize.com/jimmy-liao/)

---

**Disclaimer:** This notebook is provided for educational and research purposes. The author is not affiliated with Google's TranslateGemma team. Use at your own discretion.

## 1️⃣ Setup: Clone Repository

In [None]:
# Clean up existing directory if it exists
!rm -rf trans-gemma

# Clone the repository (single source of truth)
!git clone https://github.com/jimmyliao/trans-gemma.git
%cd trans-gemma

## 2️⃣ Install Dependencies

In [None]:
# Install uv (fast Python package manager)
!pip install uv -q

# Install project dependencies
!uv pip install --system -e ".[examples]"

## 2.5️⃣ HuggingFace Authentication

**IMPORTANT:** TranslateGemma is a gated model. You need to:
1. Get a HuggingFace token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Accept model access at [https://huggingface.co/google/translategemma-4b-it](https://huggingface.co/google/translategemma-4b-it)

### 🔐 Configuration Methods (Choose ONE based on your environment)

#### **Option A: Web Colab** (Using browser)
1. Click the 🔑 icon in the left sidebar
2. Add secret: `HF_TOKEN` = your token
3. Run the cell below → Token loaded automatically from Colab Secrets

#### **Option B: VS Code Colab Extension** (Using VS Code locally)

**⚠️ Important:** Your local `.env` file is NOT automatically synced to the remote Colab runtime!

**Solution: Create .env in remote runtime**

The cell below will:
1. First check if `.env` exists in remote runtime
2. If not found, prompt you to enter token
3. Automatically create `.env` file in remote runtime
4. Use this token for authentication

This way, you only need to enter your token once per runtime session.

#### **Option C: Manual Input Every Time** (Not recommended)
Skip .env creation and enter token manually each time.

---

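**What happens when you run the cell below?**
1. Checks for `.env` in the current directory (remote runtime) and reads `HF_TOKEN` from it if present
2. Otherwise, checks the `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN` environment variables
3. Otherwise, tries Colab Secrets (Web Colab only)
4. If no token is found, prompts for one and saves it to `.env`
5. Authenticates with HuggingFace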
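
The `.env` lookup described above can be tried in isolation. The following is a minimal sketch that mirrors the parsing done in the cell below. `parse_hf_token` is a hypothetical helper name and `hf_example123` is a placeholder, not a real token.

```python
from typing import Optional


def parse_hf_token(env_text: str) -> Optional[str]:
    """Return the HF_TOKEN value from .env-style text, or None if absent or empty."""
    for line in env_text.splitlines():
        line = line.strip()
        if line.startswith("HF_TOKEN="):
            # Split on the first "=" only, then drop surrounding quotes
            token = line.split("=", 1)[1].strip().strip('"').strip("'")
            if token:
                return token
    return None


print(parse_hf_token('HF_TOKEN="hf_example123"'))  # hf_example123
print(parse_hf_token("OTHER=1"))                    # None
```

Returning `None` for an empty value lets the caller fall through to the next source (environment variables, Colab Secrets, or the prompt).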


In [None]:
from huggingface_hub import login
import os
from pathlib import Path

def get_hf_token():
    """Smart HF Token retrieval with .env creation for VS Code"""
    
    # Method 1: Try .env file in current directory
    env_file = Path('.env')
    
    if env_file.exists():
        try:
            with open('.env', 'r') as f:
                for line in f:
                    line = line.strip()
                    if line.startswith('HF_TOKEN='):
                        token = line.split('=', 1)[1].strip().strip('"').strip("'")
                        if token:
                            print("✅ HF_TOKEN loaded from .env file")
                            return token
            print("⚠️  .env file found but HF_TOKEN not set correctly")
        except Exception as e:
            print(f"⚠️  Error reading .env: {e}")
    
    # Method 2: Try environment variables
    token = os.getenv('HF_TOKEN') or os.getenv('HUGGING_FACE_HUB_TOKEN')
    if token:
        print("✅ HF_TOKEN loaded from environment variable")
        return token
    
    # Method 3: Try Colab Secrets (Web Colab only)
    try:
        from google.colab import userdata
        token = userdata.get('HF_TOKEN')
        if token:
            print("✅ HF_TOKEN loaded from Colab Secrets (Web Colab)")
            return token
    except Exception:
        pass
    
    # Method 4: Prompt for token and create .env (VS Code friendly)
    print("\n" + "="*80)
    print("⚠️  HF_TOKEN not found - Creating .env file")
    print("="*80)
    print("\n📝 Please enter your HuggingFace token:")
    print("   Get token: https://huggingface.co/settings/tokens")
    print("   Accept access: https://huggingface.co/google/translategemma-4b-it")
    print("\n💡 Your token will be saved to .env for this runtime session")
    print()
    
    token = input("HuggingFace Token: ").strip()
    
    if token:
        # Save to .env for future use in this session
        try:
            with open('.env', 'w') as f:
                f.write(f"HF_TOKEN={token}\n")
            print("\n✅ Token saved to .env file (runtime session)")
            print("   Next time you run this cell, it will load automatically")
        except Exception as e:
            print(f"\n⚠️  Could not save to .env: {e}")
            print("   Token will work this time but won't persist")
        
        return token
    else:
        raise ValueError("❌ HF_TOKEN is required to use TranslateGemma")

# Authenticate
try:
    HF_TOKEN = get_hf_token()
    os.environ['HF_TOKEN'] = HF_TOKEN
    login(token=HF_TOKEN)
    print("\n✅ Successfully authenticated with HuggingFace\n")
except Exception as e:
    print(f"\n❌ Authentication failed: {e}\n")
    raise

## 3️⃣ Configuration

In [None]:
import os

# Target language (default: Traditional Chinese)
TARGET_LANG = "zh-TW"  # Change to "zh-CN", "ja", "ko", etc. if needed

# Backend (transformers is best for Colab GPU)
BACKEND = "transformers"

print(f"✅ Target language: {TARGET_LANG}")
print(f"✅ Backend: {BACKEND}")

## 4️⃣ Option A: Download from arXiv

Automatically download and translate arXiv papers.
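
For orientation, here is a minimal sketch of how an arXiv ID with an optional version suffix could be validated and mapped to a PDF URL. How `examples/translate.py` actually resolves `--arxiv` is not shown in this notebook, so the helper names and the use of the public `https://arxiv.org/pdf/<id>` pattern are assumptions for illustration only.

```python
import re
from typing import Optional, Tuple

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN
_ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def split_arxiv_id(arxiv_id: str) -> Tuple[str, Optional[str]]:
    """Split an arXiv ID into (base_id, version); version is None if omitted."""
    match = _ARXIV_ID.match(arxiv_id.strip())
    if match is None:
        raise ValueError(f"Not a new-style arXiv ID: {arxiv_id!r}")
    return match.group(1), match.group(2)


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the public PDF URL (assumed pattern: https://arxiv.org/pdf/<id>)."""
    base, version = split_arxiv_id(arxiv_id)
    return f"https://arxiv.org/pdf/{base}{version or ''}"


print(arxiv_pdf_url("2601.09012v2"))  # https://arxiv.org/pdf/2601.09012v2
print(arxiv_pdf_url("2601.09012"))    # https://arxiv.org/pdf/2601.09012
```

Omitting the version suffix requests the latest version of the paper, so pin a version such as `v2` when page numbers need to stay stable.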

In [None]:
# Enter arXiv ID (e.g., "2601.09012v2" or "2601.09012")
ARXIV_ID = "2601.09012v2"  # TranslateGemma technical report

# Translate specific pages (1-indexed)
START_PAGE = 1
END_PAGE = 1  # Set to None for all pages

# Build command
cmd = f"python examples/translate.py --mode pdf --arxiv {ARXIV_ID} --backend {BACKEND} --target {TARGET_LANG}"
if START_PAGE:
    cmd += f" --start-page {START_PAGE}"
if END_PAGE:
    cmd += f" --end-page {END_PAGE}"

print(f"Running: {cmd}\n")
!{cmd}

## 4️⃣ Option B: Upload PDF

Upload your own PDF file.
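
Uploaded filenames often contain spaces or parentheses, which can break a command assembled by string interpolation. As a hedged alternative, the sketch below builds the same `translate.py` invocation as an argument list, using only the flags that appear in this notebook. `build_pdf_command` is a hypothetical helper, not part of the repository.

```python
import shlex
from typing import List, Optional


def build_pdf_command(pdf_file: str, backend: str, target: str,
                      start_page: Optional[int] = None,
                      end_page: Optional[int] = None,
                      as_image: bool = False, dpi: int = 96) -> List[str]:
    """Assemble the translate.py argument list used in this notebook."""
    cmd = ["python", "examples/translate.py", "--mode", "pdf",
           "--file", pdf_file, "--backend", backend, "--target", target]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if end_page is not None:
        cmd += ["--end-page", str(end_page)]
    if as_image:
        cmd += ["--pdf-as-image", "--dpi", str(dpi)]
    return cmd


cmd = build_pdf_command("my paper (v2).pdf", "transformers", "zh-TW", 1, 3)
print(shlex.join(cmd))  # shell-safe string; the filename is quoted
```

The list can be passed directly to `subprocess.run(cmd, check=True)`, or the `shlex.join` output can be used with the `!{cmd}` pattern shown in the cells.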

In [None]:
from google.colab import files
import os

# Upload PDF
print("📤 Please upload your PDF file:")
uploaded = files.upload()

# Get uploaded filename
pdf_file = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_file}")

# Translate settings
START_PAGE = 1
END_PAGE = 3  # Change as needed
USE_IMAGE_MODE = False  # Set to True for multimodal (slower but preserves layout)
DPI = 96  # For image mode: lower = faster (72, 96, or 150)

# Build command (quote the filename so spaces and special characters are safe)
import shlex
cmd = f"python examples/translate.py --mode pdf --file {shlex.quote(pdf_file)} --backend {BACKEND} --target {TARGET_LANG}"
if START_PAGE:
    cmd += f" --start-page {START_PAGE}"
if END_PAGE:
    cmd += f" --end-page {END_PAGE}"
if USE_IMAGE_MODE:
    cmd += f" --pdf-as-image --dpi {DPI}"

print(f"\nRunning: {cmd}\n")
!{cmd}

## 4️⃣ Option C: PDF with Image Mode (Multimodal)

Use multimodal TranslateGemma to preserve visual context (tables, charts).

In [None]:
# Translate PDF page with charts/figures (multimodal)
ARXIV_ID = "2601.09012v2"
START_PAGE = 3  # Page with Figure 1 (language distribution charts)
END_PAGE = 3
DPI = 96  # Lower DPI = faster (72, 96, or 150)

cmd = f"python examples/translate.py --mode pdf --arxiv {ARXIV_ID} --backend {BACKEND} --target {TARGET_LANG} --pdf-as-image --dpi {DPI}"
if START_PAGE:
    cmd += f" --start-page {START_PAGE}"
if END_PAGE:
    cmd += f" --end-page {END_PAGE}"

print(f"Running: {cmd}\n")
!{cmd}

## 5️⃣ Single Image Translation

Translate text from a single image using multimodal TranslateGemma.

In [None]:
from google.colab import files
from PIL import Image
import sys
import urllib.request
import os

# Add examples directory to path
sys.path.insert(0, 'examples')
sys.path.insert(0, 'examples/backends')

# Configuration: Choose image source
USE_DEFAULT_IMAGE = True  # Set to False to upload your own image
DEFAULT_IMAGE_URL = "https://cdn.odigo.net/f91b9c108a1e0cd1117e1c46ee36eeca.jpg"

# Language configuration
SOURCE_LANG = "ja"  # This is a Japanese menu image

# Get image
if USE_DEFAULT_IMAGE:
    print(f"📥 Downloading default image from:\n   {DEFAULT_IMAGE_URL}\n")
    image_file = "demo_image.jpg"
    urllib.request.urlretrieve(DEFAULT_IMAGE_URL, image_file)
    print(f"✅ Downloaded: {image_file}")
else:
    print("📤 Please upload your image:")
    uploaded = files.upload()
    image_file = list(uploaded.keys())[0]
    print(f"\n✅ Uploaded: {image_file}")

# Load backend
from transformers_multimodal_backend import TransformersMultimodalBackend

print("\n🔄 Loading multimodal backend...")
backend = TransformersMultimodalBackend()
backend.load_model()

# Translate
print(f"\n🔄 Translating {image_file}...")
print(f"Source language: {SOURCE_LANG} → Target language: {TARGET_LANG}")
result = backend.translate_image(image_file, source_lang=SOURCE_LANG, target_lang=TARGET_LANG)

# Display result
print(f"\n✅ Translation:")
print(result['translation'])
print(f"\n⏱️  Time: {result['time']:.2f}s, Speed: {result['metadata']['tokens_per_second']:.1f} tok/s")

## 6️⃣ Website Article Translation (Web Scraping)

Extract article text from a web page by scraping its HTML, then translate it. Working from the text is more accurate than translating a screenshot.
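
The paragraph filtering used in the cell below (minimum length, duplicates, menu-like strings) can be sketched on its own with plain strings. The batching helper is an addition for illustration: the cell below translates only the first 10 paragraphs, and splitting the rest into blocks of 10 is one assumed way to cover a longer article. Both function names are hypothetical.

```python
from typing import Iterable, List


def filter_paragraphs(texts: Iterable[str], min_len: int = 15) -> List[str]:
    """Drop short, duplicate, or menu-like strings, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        text = text.strip()
        if len(text) < min_len or text in seen or text.count("|") > 2:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def batch_paragraphs(paragraphs: List[str], size: int = 10) -> List[str]:
    """Join paragraphs into blocks of at most `size` paragraphs each."""
    return ["\n\n".join(paragraphs[i:i + size])
            for i in range(0, len(paragraphs), size)]


raw = ["Home | News | About | Contact", "Short",
       "A first paragraph that is long enough.",
       "A first paragraph that is long enough.",
       "A second paragraph that is long enough."]
clean = filter_paragraphs(raw)
print(clean)
print(len(batch_paragraphs(clean, size=1)))  # 2
```

Each block returned by `batch_paragraphs` could then be translated in a separate call and the results concatenated.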

In [None]:
# Install web scraping dependencies
!pip install beautifulsoup4 requests -q

import requests
from bs4 import BeautifulSoup
import sys
import time
sys.path.insert(0, 'examples')
sys.path.insert(0, 'examples/backends')

# Configuration
ARTICLE_URL = "https://aismiley.co.jp/ai_news/gemma3-rag-api-local-use/"
SOURCE_LANG = "ja"  # Japanese article

def extract_article_text(url):
    """Extract main article content from webpage"""
    print(f"🌐 Fetching webpage: {url}\n")
    
    # Fetch webpage (fail fast on timeouts and HTTP errors)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    response.encoding = response.apparent_encoding
    
    # Parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract title from h1
    title = soup.find('h1')
    title_text = title.get_text(strip=True) if title else "No title found"
    
    # Remove unwanted elements (navigation, sidebar, footer, scripts)
    for element in soup.select('nav, aside, footer, script, style, .sidebar, .navigation, .menu, .footer, .header'):
        element.decompose()
    
    # Try to find main content area with multiple strategies
    content_area = None
    
    # Strategy 1: Look for specific content containers
    content_selectors = [
        'main',
        'article',
        '.main-content',
        '.article-content',
        '.post-content',
        '.entry-content',
        '#content',
        '.content'
    ]
    
    for selector in content_selectors:
        candidate = soup.select_one(selector)
        if candidate and len(candidate.find_all('p')) > 3:
            content_area = candidate
            print(f"📍 Found content using selector: {selector}")
            break
    
    # Strategy 2: If no container had enough paragraphs, fall back to the whole body
    if content_area is None:
        print("📍 Using body and filtering paragraphs by length")
        content_area = soup.find('body') or soup
    
    # Extract paragraphs and headings
    paragraphs = []
    seen_texts = set()  # Avoid duplicates
    
    for element in content_area.find_all(['p', 'h2', 'h3', 'li']):
        text = element.get_text(strip=True)
        
        # Filter conditions
        if (len(text) < 15 or  # Too short
            text in seen_texts or  # Duplicate
            text.lower().startswith(('cookie', 'privacy', 'terms', '利用規約', 'プライバシー')) or  # Legal text
            'href' in text.lower() or  # Likely a link
            text.count('|') > 2):  # Navigation menu
            continue
        
        seen_texts.add(text)
        paragraphs.append(text)
    
    print(f"✅ Extracted {len(paragraphs)} unique paragraphs")
    
    # Show first few paragraphs for debugging
    if paragraphs:
        print(f"\n📋 First 3 paragraphs:")
        for i, p in enumerate(paragraphs[:3], 1):
            preview = p[:80] + "..." if len(p) > 80 else p
            print(f"   {i}. {preview}")
    
    # Combine text (limit to first 10 paragraphs to stay within token limits)
    # IMPORTANT: Reduced from 20 to 10 paragraphs for better translation quality
    full_text = f"{title_text}\n\n" + "\n\n".join(paragraphs[:10])
    
    return {
        'title': title_text,
        'text': full_text,
        'paragraph_count': len(paragraphs),
        'paragraphs_used': min(10, len(paragraphs))
    }

# Extract article
print("📄 Extracting article content...\n")
article = extract_article_text(ARTICLE_URL)

print(f"\n✅ Article summary:")
print(f"   Title: {article['title']}")
print(f"   Total paragraphs: {article['paragraph_count']}")
print(f"   Using paragraphs: {article['paragraphs_used']}")
print(f"   Text length: {len(article['text'])} characters\n")

# Check if we have enough content
if article['paragraph_count'] < 3:
    print("⚠️  Warning: Very few paragraphs extracted. The article might not be accessible or requires different extraction logic.")
    print("   Proceeding with available content...\n")

# Load translation backend
from transformers_backend import TransformersBackend

print("🔄 Loading translation backend...")
backend = TransformersBackend()
backend.load_model()

# Translate article
print(f"\n🔄 Translating article...")
print(f"Source language: {SOURCE_LANG} → Target language: {TARGET_LANG}\n")

start_time = time.time()

# Monkey patch to increase max_new_tokens and add debug output
import torch
original_translate = backend.translate

def translate_with_more_tokens(text, source_lang, target_lang):
    """Modified translate with more tokens and debug output"""
    # Build structured message
    messages = [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": text,
            "source_lang_code": source_lang,
            "target_lang_code": target_lang
        }]
    }]

    # Apply chat template (add_generation_prompt opens the model's turn)
    inputs = backend.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(backend.model.device)

    start = time.time()

    # Generate with MORE tokens
    with torch.no_grad():
        outputs = backend.model.generate(
            inputs,
            max_new_tokens=1024,  # Increased from 256 to 1024
            do_sample=False
        )

    duration = time.time() - start

    # Decode full output (prompt + generation) for debugging
    full_output = backend.tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Debug: Show full output
    print("\n🔍 Debug - Full model output (first 500 chars):")
    print(full_output[:500])
    print("\n" + "="*80 + "\n")

    # Extract translation: decode only the newly generated tokens, so the
    # prompt is excluded and multi-paragraph translations are kept whole
    generated_ids = outputs[0][inputs.shape[1]:]
    translation = backend.tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    # Post-processing: Convert Simplified to Traditional Chinese
    if target_lang == "zh-TW":
        try:
            from hanziconv import HanziConv
            translation = HanziConv.toTraditional(translation)
        except ImportError:
            pass

    # Calculate tokens
    input_tokens = inputs.shape[1]
    output_tokens = outputs.shape[1] - input_tokens
    total_tokens = outputs.shape[1]

    return {
        "translation": translation,
        "time": duration,
        "tokens": total_tokens,
        "metadata": {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Generation speed counts only newly generated tokens
            "tokens_per_second": output_tokens / duration if duration > 0 else 0,
            "full_output_preview": full_output[:200]
        }
    }

backend.translate = translate_with_more_tokens

result = backend.translate(article['text'], source_lang=SOURCE_LANG, target_lang=TARGET_LANG)
end_time = time.time()

# Display result with word wrap
import textwrap

print(f"\n✅ Translation Result:")
print("=" * 80)
wrapped_translation = textwrap.fill(result['translation'], width=80, break_long_words=False, break_on_hyphens=False)
print(wrapped_translation)
print("=" * 80)

print(f"\n⏱️  Time: {result['time']:.2f}s")
print(f"📊 Speed: {result['metadata']['tokens_per_second']:.1f} tok/s")
print(f"🔤 Tokens: {result['tokens']} (input: {result['metadata']['input_tokens']}, output: {result['metadata']['output_tokens']})")

## 7️⃣ Website Screenshot Translation

Capture a screenshot of a website and translate it to the target language (Traditional Chinese by default).

In [None]:
# Install system dependencies for Chromium
!apt-get update -qq
!apt-get install -y -qq libatk1.0-0 libatk-bridge2.0-0 libcups2 libxkbcommon0 libxcomposite1 libxdamage1 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 libasound2

# Install Playwright
!pip install playwright -q
!playwright install chromium --with-deps

# Screenshot website and translate
import asyncio
from playwright.async_api import async_playwright
from PIL import Image
import sys
sys.path.insert(0, 'examples')
sys.path.insert(0, 'examples/backends')

# Configuration
WEBSITE_URL = "https://www.yomiuri.co.jp/national/20260117-GYT1T00119/"
SOURCE_LANG = "ja"  # Japanese news website

async def capture_screenshot(url):
    """Capture website screenshot using Playwright async API"""
    print(f"📸 Capturing screenshot of: {url}\n")
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(viewport={'width': 1280, 'height': 1024})
        await page.goto(url, wait_until='networkidle')
        await page.screenshot(path='website_screenshot.png', full_page=False)
        await browser.close()
    
    print("✅ Screenshot saved: website_screenshot.png\n")

# Capture screenshot
await capture_screenshot(WEBSITE_URL)

# Load backend
from transformers_multimodal_backend import TransformersMultimodalBackend

print("🔄 Loading multimodal backend...")
backend = TransformersMultimodalBackend()
backend.load_model()

# Translate
print(f"\n🔄 Translating screenshot...")
print(f"Source language: {SOURCE_LANG} → Target language: {TARGET_LANG}\n")
result = backend.translate_image('website_screenshot.png', source_lang=SOURCE_LANG, target_lang=TARGET_LANG)

# Display result
print(f"\n✅ Translation:")
print(result['translation'])
print(f"\n⏱️  Time: {result['time']:.2f}s, Speed: {result['metadata']['tokens_per_second']:.1f} tok/s")

# Display screenshot
from IPython.display import Image as IPImage, display
display(IPImage('website_screenshot.png', width=800))

## 📝 Notes

- **Backend**: `transformers` is best for Colab GPU (T4)
- **Target Language**: Default is `zh-TW` (Traditional Chinese); change it in the Configuration section
- **Image Mode**: Slower but preserves visual context (tables, charts, layout)
- **DPI**: Lower DPI (72-96) is faster; higher DPI (150) gives better quality
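
To make the DPI trade-off concrete, the sketch below computes the pixel size of a rendered page at each suggested setting. The US Letter page size (8.5 x 11 inches) is an assumption for illustration, `rendered_size` is a hypothetical helper, and any resizing the backend may apply internally is not modelled.

```python
from typing import Tuple


def rendered_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches (assumed page size for illustration)
for dpi in (72, 96, 150):
    w, h = rendered_size(8.5, 11, dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({w * h / 1e6:.2f} MP)")
```

Pixel count grows with the square of the DPI, so a 150 DPI render of this page holds roughly 4.3 times as many pixels as a 72 DPI render.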

## 🔗 Links

- [GitHub Repository](https://github.com/jimmyliao/trans-gemma)
- [TranslateGemma Model](https://huggingface.co/google/translategemma-4b-it)
- [Documentation](https://github.com/jimmyliao/trans-gemma/blob/main/examples/README.md)