# Phase 2: DeepSeek-OCR Accuracy Test (FIXED)

**Objective**: Test OCR accuracy on code-heavy conversation images.

**Target**: ≥90% character accuracy to proceed to Phase 3.

**IMPORTANT**: Use Colab Pro with an A100 GPU; the free-tier T4 runs out of memory.

---

## Setup Instructions

1. **Subscribe to Colab Pro**: $10/month (required for A100 GPU)
2. **Enable A100 GPU**: Runtime → Change runtime type → A100 GPU → Save
3. **Run cells in order**: Runtime → Run all
4. **Upload test image**: When prompted in Cell 3
5. **Document results**: Copy final output to `tests/PHASE2_RESULTS.md`

---

## Cell 1: Install Dependencies (~5 min)

In [1]:
# Install required packages
!pip install -q transformers==4.46.3 tokenizers==0.20.3 einops pillow easydict addict
!pip install -q flash-attn --no-build-isolation

# Verify GPU availability
import torch
print("\n" + "="*70)
print(f"🎮 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🎮 GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"🎮 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    if torch.cuda.get_device_properties(0).total_memory / 1e9 < 30:
        print("⚠️  WARNING: GPU has <30GB RAM. Use A100 or this will crash!")
else:
    print("⚠️  WARNING: No GPU detected. This will be VERY slow.")
print("="*70)

  Preparing metadata (setup.py) ... done
  Building wheel for flash-attn (setup.py) ... done

🎮 GPU Available: True
🎮 GPU Name: NVIDIA A100-SXM4-40GB
🎮 GPU Memory: 42.47 GB


## Cell 2: Download DeepSeek-OCR Model (~10 min, ~6.7 GB)

In [2]:
from huggingface_hub import snapshot_download
import os

print("Downloading DeepSeek-OCR model (~10GB)...")
print("This will take ~10 minutes on Colab's fast network.\n")

model_path = snapshot_download(
    repo_id='deepseek-ai/DeepSeek-OCR',
    local_dir='/content/deepseek-ocr',
    local_dir_use_symlinks=False
)

print(f"\n✓ Model downloaded to: {model_path}")
# Sum the size of every file under the download directory
total_bytes = sum(
    os.path.getsize(os.path.join(dirpath, filename))
    for dirpath, _, filenames in os.walk(model_path)
    for filename in filenames
)
print(f"✓ Size: {total_bytes / 1e9:.2f} GB")

Downloading DeepSeek-OCR model (~10GB)...
This will take ~10 minutes on Colab's fast network.



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.



✓ Model downloaded to: /content/deepseek-ocr
✓ Size: 6.68 GB


## Cell 3: Upload Test Image (~5 sec)

**Action Required**: Upload `/tmp/poc_long_conversation.png` when prompted.

In [11]:
from google.colab import files
import shutil

print("📤 Please upload: /tmp/poc_long_conversation.png\n")
uploaded = files.upload()

# Move to predictable path
image_filename = list(uploaded.keys())[0]
shutil.move(image_filename, '/content/test_image.png')

# Verify
from PIL import Image
img = Image.open('/content/test_image.png')
print(f"\n✓ Image ready at: /content/test_image.png")
print(f"✓ Dimensions: {img.width}x{img.height}")
print(f"✓ Format: {img.format}")
print(f"✓ Mode: {img.mode}")

📤 Please upload: /tmp/poc_long_conversation.png



Saving poc_long_conversation.png to poc_long_conversation.png

✓ Image ready at: /content/test_image.png
✓ Dimensions: 1024x2048
✓ Format: PNG
✓ Mode: RGB


## Cell 4: Define Original Text (instant)

The ground truth below comes from `create_long_conversation()` in `tests/poc_long_conversation.py`.

In [12]:
# Original text from tests/poc_long_conversation.py (lines 22-168)
ORIGINAL_TEXT = """[USER]: I'm trying to implement a recursive function to find all files in a directory, but it's not working as expected. Here's my code:

import os

def find_files(path, pattern):
    results = []
    for item in os.listdir(path):
        if item.endswith(pattern):
            results.append(item)
    return results

files = find_files('/Users/me/Documents', '.txt')
print(files)

[ASSISTANT]: Your function is only checking the immediate directory, not recursing into subdirectories. Here's a fixed version:

import os

def find_files(path, pattern):
    results = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(pattern):
                full_path = os.path.join(root, file)
                results.append(full_path)
    return results

files = find_files('/Users/me/Documents', '.txt')
print(f'Found {len(files)} files:')
for f in files:
    print(f'  {f}')

The key changes:
1. Use os.walk() instead of os.listdir() - this recursively traverses directories
2. Store full paths using os.path.join()
3. Added better output formatting

[USER]: That works! But now I want to add a filter to exclude certain directories like __pycache__ and .git. How do I do that?

[ASSISTANT]: You can modify the dirs list in-place during os.walk() to skip directories:

import os

def find_files(path, pattern, exclude_dirs=None):
    if exclude_dirs is None:
        exclude_dirs = ['__pycache__', '.git', 'node_modules', '.venv']

    results = []
    for root, dirs, files in os.walk(path):
        # Remove excluded directories from dirs in-place
        dirs[:] = [d for d in dirs if d not in exclude_dirs]

        for file in files:
            if file.endswith(pattern):
                full_path = os.path.join(root, file)
                results.append(full_path)

    return results

# Usage
files = find_files('/Users/me/Documents', '.txt')
print(f'Found {len(files)} .txt files')

# Or with custom exclusions
files = find_files(
    '/Users/me/Projects',
    '.py',
    exclude_dirs=['build', 'dist', 'venv']
)

The key trick is `dirs[:] = [...]` which modifies the list in-place. This tells os.walk() to skip those directories.

[USER]: Perfect! One more thing - can I make it case-insensitive? I want to match both .txt and .TXT files.

[ASSISTANT]: Sure! Use str.lower() for case-insensitive matching:

import os

def find_files(path, pattern, exclude_dirs=None, case_sensitive=True):
    if exclude_dirs is None:
        exclude_dirs = ['__pycache__', '.git', 'node_modules', '.venv']

    # Normalize pattern for case-insensitive matching
    if not case_sensitive:
        pattern = pattern.lower()

    results = []
    for root, dirs, files in os.walk(path):
        dirs[:] = [d for d in dirs if d not in exclude_dirs]

        for file in files:
            # Compare lowercased filename if case-insensitive
            filename = file.lower() if not case_sensitive else file

            if filename.endswith(pattern):
                full_path = os.path.join(root, file)
                results.append(full_path)

    return results

# Case-insensitive search
files = find_files('/Users/me/Documents', '.txt', case_sensitive=False)
# Will match: document.txt, README.TXT, notes.Txt, etc.

[USER]: Excellent! This is exactly what I needed. Thanks!

[ASSISTANT]: You're welcome! A few best practices to consider:

1. **Error handling**: Add try-except for permission errors:
   ```python
   try:
       for root, dirs, files in os.walk(path):
           # ... your code ...
   except PermissionError as e:
       print(f"Permission denied: {e}")
   ```

2. **Performance**: For large directories, consider using pathlib:
   ```python
   from pathlib import Path

   def find_files_pathlib(path, pattern, case_sensitive=True):
       p = Path(path)
       if case_sensitive:
           return list(p.rglob(f'*{pattern}'))
       else:
           # pathlib doesn't support case-insensitive, use os.walk
           pass
   ```

3. **Testing**: Add unit tests:
   ```python
   def test_find_files():
       # Create temp directory structure
       import tempfile
       with tempfile.TemporaryDirectory() as tmpdir:
           # Create test files
           Path(tmpdir, 'test.txt').touch()
           Path(tmpdir, 'subdir').mkdir()
           Path(tmpdir, 'subdir', 'nested.txt').touch()

           # Test
           results = find_files(tmpdir, '.txt')
           assert len(results) == 2
   ```

Let me know if you need help with any of these!"""

print(f"Original text: {len(ORIGINAL_TEXT)} characters")
print(f"Estimated tokens: ~{len(ORIGINAL_TEXT) // 4}")

Original text: 4509 characters
Estimated tokens: ~1127


## Cell 5: Load Model & Run OCR (~10-30 sec inference)

**FIXED**: Using official DeepSeek-OCR parameters from Context7 docs

In [13]:
from transformers import AutoModel, AutoTokenizer
import torch
import time
import os

# CRITICAL: Create output directory first
os.makedirs('/content/ocr_output', exist_ok=True)

print("Loading DeepSeek-OCR model...")
tokenizer = AutoTokenizer.from_pretrained(
    '/content/deepseek-ocr',
    trust_remote_code=True
)

model = AutoModel.from_pretrained(
    '/content/deepseek-ocr',
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True  # load weights from the .safetensors checkpoint
).eval().cuda().to(torch.bfloat16)

print("✓ Model loaded to GPU\n")
print("Running OCR inference...\n")

start = time.time()

# Use exact parameters from official docs
result = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",
    image_file='/content/test_image.png',
    output_path='/content/ocr_output',  # Must be valid directory
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True
)

elapsed = time.time() - start
print(f"\n✓ OCR complete in {elapsed:.2f}s")
print(f"✓ Output: {len(result)} characters")

Loading DeepSeek-OCR model...


You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.
Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at /content/deepseek-ocr and are newly initialized: ['model.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


✓ Model loaded to GPU

Running OCR inference...

BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([8, 100, 1280])
[USER]: I'm trying to implement a recursive function to find all files in a directory, but it's not working as expected. Here's my code:

import os

def find_files(path, pattern):
    results = []
    for item in os.listdir(path):
    if item.endsWith(pattern):
    results.append(item)
    return results

files = find_files('/Users/me/Documents', '.txt')
print(files)

[ASSISTANT]: Your function is only checking the immediate directory, not recursing into subdirectories. Here's a fixed version:

import os

def find_files(path, pattern):
    results = []
    for root, dirs, files in os.walk(path):
    for file in files:
    if file.endsWith(pattern):
    full_path = os.path.join(root, file)
    results.append(full_path)
    return results

files = find_files('/Users/me/Documents', '.txt')
for f in files:
    # Compare lowercased filename if case-insensitive
    filename

image: 0it [00:00, ?it/s]
other: 0it [00:00, ?it/s]


✓ OCR complete in 42.86s





TypeError: object of type 'NoneType' has no len()
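
The traceback indicates that `model.infer(...)` returned `None` in this run, so `len(result)` failed even though the OCR text had been printed. A minimal sketch of a guard, assuming (as the later cells do) that `save_results=True` writes the text to `result.mmd` inside `output_path`. The helper name `read_ocr_text` is hypothetical and not part of DeepSeek-OCR:

```python
from pathlib import Path


def read_ocr_text(result, output_dir, filename="result.mmd"):
    """Return OCR text from infer()'s return value, else from the saved file.

    Assumption: "result.mmd" is the file name the later cells read back.
    """
    if isinstance(result, str):
        return result
    saved = Path(output_dir) / filename
    if not saved.is_file():
        raise FileNotFoundError(f"No OCR output found at {saved}")
    return saved.read_text(encoding="utf-8")
```

In this notebook it would be called as `result = read_ocr_text(result, '/content/ocr_output')` before any `len(result)`.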

In [23]:
from transformers import AutoModel, AutoTokenizer
import torch
import time
import os

# CRITICAL: Create output directory first
os.makedirs('/content/ocr_output_grounding', exist_ok=True)

print("Loading DeepSeek-OCR model...")
tokenizer = AutoTokenizer.from_pretrained(
    '/content/deepseek-ocr',
    trust_remote_code=True
)

model = AutoModel.from_pretrained(
    '/content/deepseek-ocr',
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
).eval().cuda().to(torch.bfloat16)

print("✓ Model loaded to GPU\n")
print("Running OCR inference with GROUNDING prompt...\n")

start = time.time()

# Test a prompt that asks for exact indentation (plain instruction, no <|grounding|> tag)
result = model.infer(
    tokenizer,
    prompt="<image>\nExtract all text preserving exact indentation and spacing.",
    image_file='/content/test_image.png',
    output_path='/content/ocr_output_grounding',
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True
)

# Stop the timer before reading the file so latency covers inference only
elapsed = time.time() - start

# Load result from file
with open('/content/ocr_output_grounding/result.mmd', 'r') as f:
    result = f.read()

print(f"\n✓ OCR complete in {elapsed:.2f}s")
print(f"✓ Output: {len(result)} characters")

Loading DeepSeek-OCR model...




✓ Model loaded to GPU

Running OCR inference with GROUNDING prompt...

BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([8, 100, 1280])
[USER]: I'm trying to implement a recursive function to find all files in a directory, but it's not working as expected. Here's my code:
import os

def find_files(path, pattern):
    results = []
    for item in os.listdir(path):
    if item.endsWith(pattern):
    results.append(item)
    return results

files = find_files('/Users/me/Documents', '.txt')
print(files)

[ASSISTANT]: Your function is only checking the immediate directory, not recursing into subdirectories. Here's a fixed version:
import os

def find_files(path, pattern):
    results = []
    for root, dirs, files in os.walk(path):
    for file in files:
    if file.endsWith(pattern):
    full_path = os.path.join(root, file)
    results.append(full_path)
    return results

files = find_files('/Users/me/Documents', '.txt')
for f in files:
    # Compare lowercased filename if case-inse

image: 0it [00:00, ?it/s]
other: 0it [00:00, ?it/s]


✓ OCR complete in 40.10s
✓ Output: 3605 characters





In [24]:
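# Load the saved OCR result from the first run ("Free OCR." prompt).
# This replaces the second run's text in `result`; `elapsed` still holds the second run's latency.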
with open('/content/ocr_output/result.mmd', 'r') as f:
    result = f.read()

print(f"✓ Loaded OCR output: {len(result)} characters")

✓ Loaded OCR output: 3948 characters


## Cell 6: Calculate Accuracy & Display Results

Character-level accuracy, measured as the `difflib.SequenceMatcher` similarity ratio between the ground truth and the OCR output.
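
`SequenceMatcher.ratio()` returns 2·M/T, where M is the number of matched characters and T is the combined length of both strings, so it is a similarity score rather than a strict per-character accuracy. As a cross-check, the sketch below computes an edit-distance-based accuracy, 1 − distance / len(reference). The function names are hypothetical, and the pure-Python Levenshtein loop is a swapped-in technique, not what this notebook uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - character error rate, clamped at 0; assumes a non-empty reference."""
    distance = levenshtein(reference, hypothesis)
    return max(0.0, 1.0 - distance / len(reference))
```

It could be applied as `char_accuracy(ORIGINAL_TEXT.strip(), result.strip()) * 100`. The loop is quadratic in text length, so it runs noticeably slower than `difflib` on the full conversation.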

In [25]:
import difflib

# Calculate accuracy
matcher = difflib.SequenceMatcher(None, ORIGINAL_TEXT.strip(), result.strip())
accuracy = matcher.ratio() * 100

# Display results
print("\n" + "="*70)
print("🎯 PHASE 2 RESULTS")
print("="*70)
print(f"CHARACTER ACCURACY: {accuracy:.2f}%")
print(f"LATENCY: {elapsed:.2f}s")
print(f"ORIGINAL: {len(ORIGINAL_TEXT)} chars")
print(f"OCR OUTPUT: {len(result)} chars")
print(f"DIFFERENCE: {abs(len(result) - len(ORIGINAL_TEXT))} chars")
print("="*70)

# Token compression estimate
text_tokens = len(ORIGINAL_TEXT) // 4
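# Assumption: 2 pages at 1024x1024, 256 visual tokens each.
# (The inference log reports BASE 1x256 and PATCHES 8x100 for this crop-mode run.)
visual_tokens = 256 * 2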
token_compression = text_tokens / visual_tokens
print(f"\nEst. text tokens: ~{text_tokens}")
print(f"Visual tokens (2 pages): {visual_tokens}")
print(f"Token compression: {token_compression:.2f}x")


🎯 PHASE 2 RESULTS
CHARACTER ACCURACY: 59.00%
LATENCY: 40.10s
ORIGINAL: 4509 chars
OCR OUTPUT: 3948 chars
DIFFERENCE: 561 chars

Est. text tokens: ~1127
Visual tokens (2 pages): 512
Token compression: 2.20x
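
The 512-token figure is an assumption written into the cell. The shapes logged during inference (`BASE: torch.Size([1, 256, 1280])`, `PATCHES: torch.Size([8, 100, 1280])`) suggest a different count for this crop-mode run: one 256-token global view plus eight 100-token patches. A rough recalculation under that reading, ignoring any separator or newline tokens the model may add:

```python
# Shapes copied from the inference log: (views, tokens per view, hidden size).
base_shape = (1, 256, 1280)
patches_shape = (8, 100, 1280)


def tokens_in(shape):
    """Number of visual tokens implied by a (views, tokens, hidden) shape."""
    views, tokens_per_view, _hidden = shape
    return views * tokens_per_view


visual_tokens = tokens_in(base_shape) + tokens_in(patches_shape)
text_tokens = 4509 // 4  # same chars/4 heuristic as the cell above

print(f"Visual tokens (from logged shapes): {visual_tokens}")
print(f"Token compression: {text_tokens / visual_tokens:.2f}x")
```

This gives 1056 visual tokens and a ratio of about 1.07x, so under this reading the compression for this image is close to break-even rather than 2.20x.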


## Cell 7: Show Comparisons

Side-by-side text comparison of the original and the OCR output.

In [26]:
# Show first 800 chars comparison
print("\n" + "="*70)
print("📄 ORIGINAL (first 800 chars):")
print("="*70)
print(ORIGINAL_TEXT[:800])

print("\n" + "="*70)
print("📄 OCR OUTPUT (first 800 chars):")
print("="*70)
print(result[:800])

# Show differences if accuracy < 100%
if accuracy < 100:
    print("\n" + "="*70)
    print("⚠️  SAMPLE DIFFERENCES (first 20):")
    print("="*70)

    diff_count = 0
    # ndiff over two strings compares character by character; i indexes the diff stream
    for i, s in enumerate(difflib.ndiff(ORIGINAL_TEXT[:500], result[:500])):
        if s[0] != ' ':
            print(f"Diff {i}: {repr(s)}")
            diff_count += 1
            if diff_count >= 20:
                print("... (showing first 20 of many differences)")
                break
else:
    print("\n✅ PERFECT MATCH: 100% accuracy!")


📄 ORIGINAL (first 800 chars):
[USER]: I'm trying to implement a recursive function to find all files in a directory, but it's not working as expected. Here's my code:

import os

def find_files(path, pattern):
    results = []
    for item in os.listdir(path):
        if item.endswith(pattern):
            results.append(item)
    return results

files = find_files('/Users/me/Documents', '.txt')
print(files)

[ASSISTANT]: Your function is only checking the immediate directory, not recursing into subdirectories. Here's a fixed version:

import os

def find_files(path, pattern):
    results = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(pattern):
                full_path = os.path.join(root, file)
                results.append(full_path)
    return results

file

📄 OCR OUTPUT (first 800 chars):
[USER]: I'm trying to implement a recursive function to find all files in a directory, but it's not working as expected. Here's my code
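
The OCR samples logged in Cell 5 show two distinct error types: collapsed indentation (nested lines all at four spaces) and character substitutions (`endsWith` for `endswith`). A minimal sketch, using hypothetical helper names, that scores a snippet twice, once as-is and once with leading whitespace stripped from every line, to estimate how much of the gap is layout rather than misread characters:

```python
import difflib


def strip_indentation(text: str) -> str:
    """Remove leading whitespace from every line, keeping line breaks."""
    return "\n".join(line.lstrip() for line in text.splitlines())


def similarity(a: str, b: str) -> float:
    """difflib similarity ratio as a percentage (same metric as Cell 6)."""
    return difflib.SequenceMatcher(None, a, b).ratio() * 100


# Two lines taken from the ground truth and from the OCR output shown earlier.
original = "for item in os.listdir(path):\n        if item.endswith(pattern):"
ocr = "for item in os.listdir(path):\n    if item.endsWith(pattern):"

raw = similarity(original, ocr)
layout_free = similarity(strip_indentation(original), strip_indentation(ocr))
print(f"raw: {raw:.1f}%  indentation-stripped: {layout_free:.1f}%")
```

Running the same two calls on `ORIGINAL_TEXT` and `result` would show whether fixing indentation alone could move the score meaningfully toward the 90% target.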

## Cell 8: Decision Tree

Based on accuracy, determine next steps.

In [27]:
print("\n" + "="*70)
print("🎯 DECISION FOR PHASE 3")
print("="*70)

if accuracy >= 90:
    print("\n✅ SUCCESS: Accuracy ≥90%")
    print("\nNext Steps:")
    print("1. Document results in tests/PHASE2_RESULTS.md")
    print("2. Proceed to Phase 3: Full Implementation")
    print("   - Rewrite MCP server with OCR functions")
    print("   - Update client (rosie_client.py) for archive/reconstruct")
    print("   - Integrate with garbage collector")
    print("   - Write comprehensive tests")
    print("\n💡 Visual memory compression is VIABLE for production.")

elif accuracy >= 85:
    print("\n⚠️  MARGINAL: Accuracy 85-89%")
    print("\nNext Steps:")
    print("1. Iterate on rendering:")
    print("   - Try different fonts (DejaVu Sans Mono, Courier)")
    print("   - Adjust font size (12pt, 14pt, 16pt)")
    print("   - Test grayscale vs color")
    print("   - Disable syntax highlighting")
    print("2. Try different OCR prompts:")
    print("   - '<image>\\n<|grounding|>OCR this image.'")
    print("   - '<image>\\n<|grounding|>Convert to markdown.'")
    print("3. Re-test with adjustments")
    print("\n💡 Close, but needs optimization before production.")

else:
    print("\n❌ INSUFFICIENT: Accuracy <85%")
    print("\nNext Steps:")
    print("1. Document findings in tests/PHASE2_FINDINGS.md")
    print("2. Pivot to alternative approaches:")
    print("   Option A: Hybrid (PNG + text embeddings)")
    print("   Option B: Pure summarization (no visual compression)")
    print("   Option C: Skip visual memory entirely")
    print("3. Update project roadmap")
    print("\n💡 Visual memory compression NOT viable with current approach.")

print("\n" + "="*70)
print(f"📊 Summary: {accuracy:.2f}% accuracy in {elapsed:.2f}s")
print("="*70)


🎯 DECISION FOR PHASE 3

❌ INSUFFICIENT: Accuracy <85%

Next Steps:
1. Document findings in tests/PHASE2_FINDINGS.md
2. Pivot to alternative approaches:
   Option A: Hybrid (PNG + text embeddings)
   Option B: Pure summarization (no visual compression)
   Option C: Skip visual memory entirely
3. Update project roadmap

💡 Visual memory compression NOT viable with current approach.

📊 Summary: 59.00% accuracy in 40.10s
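
The branching in this cell can be isolated as a small pure function so the thresholds (90% and 85%, taken from the cell above) are testable on their own. The function name and the returned labels are hypothetical:

```python
def phase3_decision(accuracy: float) -> str:
    """Map a character-accuracy percentage to the next step.

    Thresholds mirror the decision cell: >= 90 proceed, >= 85 iterate, else pivot.
    """
    if accuracy >= 90:
        return "proceed"
    if accuracy >= 85:
        return "iterate"
    return "pivot"


print(phase3_decision(59.00))  # the accuracy measured in this run
```

Keeping the thresholds in one place also makes it easier to re-run the decision after rendering or prompt changes.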
