# Source Verification

**Purpose:** Verify all data sources are accessible and match expected formats before implementation.

This notebook runs through the source verification checklist from the VCAT Horizon 1 Construction Plan.

## Prerequisites

- VCAT package installed (`pip install -e .`)
- Internet connection (for fetching remote sources)

In [None]:
import sys
from pathlib import Path

# Ensure the project root is on sys.path (assumes the notebook runs from a direct subdirectory of it)
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

## 1. Configuration

Define source URLs and expected formats from our `sources.yaml` configuration.

In [None]:
import yaml

# Load sources configuration
sources_file = project_root / "data_sources" / "sources.yaml"
with open(sources_file, encoding="utf-8") as f:
    sources = yaml.safe_load(f)

print("Configured sources:")
for source_id, info in sources.items():
    if isinstance(info, dict) and 'url' in info:
        print(f"  - {source_id}: {info.get('description', 'No description')}")
        print(f"    URL: {info['url']}")
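
Before any URL is fetched, it can help to sanity-check the loaded configuration. The sketch below assumes, as the loop above does, that `sources.yaml` maps source IDs to dictionaries carrying a `url` key. The helper name `find_invalid_sources` and the sample entries are hypothetical and are not part of the VCAT package.

```python
from urllib.parse import urlparse


def find_invalid_sources(sources: dict) -> dict[str, str]:
    """Return {source_id: problem} for entries without a usable http(s) URL.

    Hypothetical helper: assumes each source entry is a dict with a 'url' key.
    Non-dict top-level values (e.g. metadata) are skipped.
    """
    problems = {}
    for source_id, info in sources.items():
        if not isinstance(info, dict):
            continue
        url = info.get("url")
        if not isinstance(url, str) or not url:
            problems[source_id] = "missing 'url'"
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems[source_id] = f"not an http(s) URL: {url!r}"
    return problems


# Illustrative config only; the real entries live in sources.yaml
example_sources = {
    "version": 1,
    "zl_transcription": {"url": "https://www.voynich.nu/data/ZL3b-n.txt"},
    "broken_entry": {"description": "entry with no URL"},
    "bad_scheme": {"url": "ftp://example.org/file.txt"},
}
print(find_invalid_sources(example_sources))
```

Unlike the listing loop, this sketch reports dictionary entries that lack a `url` instead of silently skipping them, which makes typos in the configuration easier to spot.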

## 2. Primary Source Verification: ZL Transcription (voynich.nu)

The ZL transcription by René Zandbergen and Gabriel Landini is our primary source.
- Format: IVTFF 2.0 (Intermediate Voynich Transliteration File Format)
- Contains: Complete EVA transcription with page metadata

In [None]:
from scripts.verify_sources import print_result, verify_voynich_nu

# Check ZL transcription
zl_url = sources.get('zl_transcription', {}).get('url', 'https://www.voynich.nu/data/ZL3b-n.txt')
print(f"Verifying ZL transcription at: {zl_url}")
print()

zl_result = verify_voynich_nu(zl_url)
print_result("ZL Transcription (Primary)", zl_result)
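
`verify_voynich_nu` is a project helper whose internals are not shown here. As a rough sketch of one check such a function might perform, the snippet below tests whether the first line of a downloaded file looks like an IVTFF header. The header pattern (`#=IVTFF`, an alphabet code, then a version number) is an assumption based on the published IVTFF format description, and `looks_like_ivtff` is a hypothetical name.

```python
import re

# Assumed header shape: "#=IVTFF <alphabet code> <version>", e.g. "#=IVTFF Eva- 2.0"
IVTFF_HEADER = re.compile(r"^#=IVTFF\s+(?P<alphabet>\S+)\s+(?P<version>\d+\.\d+)")


def looks_like_ivtff(text: str):
    """Return (alphabet, version) if the first line is an IVTFF header, else None."""
    first_line = text.split("\n", 1)[0]
    match = IVTFF_HEADER.match(first_line)
    if match is None:
        return None
    return match.group("alphabet"), match.group("version")


sample = "#=IVTFF Eva- 2.0\n# comment line\n"
print(looks_like_ivtff(sample))
print(looks_like_ivtff("<html>404</html>"))
```

A check like this catches the common failure where a server answers with an HTML error page and a 200 status, which a pure reachability test would miss.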

In [None]:
# Import at cell level so later cells can use compute_sha256 even if this file is missing
from scripts.verify_sources import compute_sha256

# Check local cached copy
local_zl = project_root / "data_sources" / "raw_sources" / "ZL3b-n.txt"
if local_zl.exists():
    local_hash = compute_sha256(local_zl)
    print(f"✓ Local copy exists: {local_zl}")
    print(f"  SHA256: {local_hash[:16]}...")

    # Compare with remote if available
    if zl_result.is_success:
        if local_hash == zl_result.content_hash:
            print("  ✓ Hash matches remote - local copy is current")
        else:
            print("  ⚠ Hash differs from remote - consider updating local copy")
else:
    print(f"✗ Local copy not found at: {local_zl}")
    print("  Run download script to fetch sources")
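
The comparison above relies on `compute_sha256` from `scripts.verify_sources`, whose implementation is not reproduced in this notebook. A minimal sketch of how such a helper is commonly written with the standard library follows. The name `sha256_of_file` and the chunk size are illustrative choices, not the project's actual code.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a throwaway file rather than a real source
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_bytes(b"example bytes\n")
    demo_hash = sha256_of_file(demo)
    same = demo_hash == hashlib.sha256(b"example bytes\n").hexdigest()
    print(f"{demo_hash[:16]}... matches in-memory hash: {same}")
```

Because the digest is computed over raw bytes, a change in line endings alone yields a different hash, so a mismatch with the remote does not necessarily mean the transcription content changed.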

## 3. Secondary Source: IT Transcription

The IT (Interlinear Transcription) file contains readings from multiple historical transcribers, which we use for comparison.

In [None]:
# Check IT transcription
it_url = sources.get('it_transcription', {}).get('url', 'https://www.voynich.nu/data/IT2a-n.txt')
print(f"Verifying IT transcription at: {it_url}")
print()

it_result = verify_voynich_nu(it_url)
print_result("IT Transcription (Secondary)", it_result)

In [None]:
# Check local cached copy of IT
local_it = project_root / "data_sources" / "raw_sources" / "IT2a-n.txt"
if local_it.exists():
    local_hash = compute_sha256(local_it)
    print(f"✓ Local copy exists: {local_it}")
    print(f"  SHA256: {local_hash[:16]}...")
else:
    print(f"✗ Local copy not found at: {local_it}")

## 4. Stolfi Interlinear (UNICAMP)

Jorge Stolfi's interlinear file from UNICAMP provides an alternative transcription format.

In [None]:
from scripts.verify_sources import verify_stolfi

stolfi_url = sources.get('stolfi_interlinear', {}).get('url',
    'https://www.ic.unicamp.br/~stolfi/voynich/98-12-28-interln16e6/text16e6.evt')
print(f"Verifying Stolfi interlinear at: {stolfi_url}")
print()

stolfi_result = verify_stolfi(stolfi_url)
print_result("Stolfi Interlinear", stolfi_result)

In [None]:
# Check local cached copy
local_stolfi = project_root / "data_sources" / "raw_sources" / "text16e6.evt"
if local_stolfi.exists():
    local_hash = compute_sha256(local_stolfi)
    print(f"✓ Local copy exists: {local_stolfi}")
    print(f"  SHA256: {local_hash[:16]}...")
else:
    print(f"✗ Local copy not found at: {local_stolfi}")

## 5. IVTFF Format Verification

Parse a sample of the ZL file to verify our parser understands the format correctly.

In [None]:
from parsers.ivtff_parser import parse_ivtff_file

if local_zl.exists():
    print("Parsing ZL transcription...")
    pages = parse_ivtff_file(local_zl)

    print(f"\n✓ Successfully parsed {len(pages)} pages")

    # Show sample page
    if pages:
        sample = pages[0]
        print(f"\nSample page: {sample.page_id}")
        print(f"  Quire: {sample.quire}")
        print(f"  Section: {sample.section}")
        print(f"  Language: {sample.currier_language}")
        print(f"  Hand: {sample.hand}")
        print(f"  Loci (lines): {len(sample.loci)}")
        if sample.loci:
            print(f"  First line: {sample.loci[0].text[:50]}...")
else:
    print("Cannot parse - local ZL file not found")
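
`parse_ivtff_file` exposes page attributes such as quire and Currier language. As a sketch of where such values could come from, the snippet below pulls `$X=value` variables out of a page header line. The header syntax and the meaning of the keys (`Q` for quire, `L` for Currier language, `H` for hand) are assumptions drawn from the public IVTFF description, the sample line is made up for illustration, and `parse_page_header` is a hypothetical function, not the project's parser.

```python
import re

# Assumed page-header shape: "<f1r>  <! $Q=A $P=A $L=A $H=1>"
PAGE_HEADER = re.compile(r"^<(?P<page>[^.,;>\s]+)>\s*<!(?P<vars>[^>]*)>")
PAGE_VAR = re.compile(r"\$(?P<key>[A-Z])=(?P<value>\S+)")


def parse_page_header(line: str):
    """Return (page_id, {key: value}) for a page header line, else None."""
    match = PAGE_HEADER.match(line)
    if match is None:
        return None
    variables = {
        m.group("key"): m.group("value")
        for m in PAGE_VAR.finditer(match.group("vars"))
    }
    return match.group("page"), variables


sample_line = "<f1r>      <! $Q=A $P=A $L=A $H=1>"
page_id, variables = parse_page_header(sample_line)
print(page_id, variables)
```

Lines whose tag contains a locus suffix (a `.` after the page name) do not match the header pattern, so the function returns `None` for them instead of misreading text lines as page headers.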

## 6. EVA Character Set Verification

Check that the transcription uses the expected EVA character set.

In [None]:
from vcat.eva_charset import EVA_BASIC, EVA_COMPOUNDS, EVA_RARE, validate_eva_text

print("EVA Character Set:")
print(f"  Basic glyphs: {sorted(EVA_BASIC)}")
print(f"  Rare glyphs: {sorted(EVA_RARE)}")
print(f"  Compound glyphs: {sorted(EVA_COMPOUNDS)}")

In [None]:
# Validate EVA characters in transcription
if local_zl.exists() and pages:
    total_warnings = 0
    unknown_chars = set()

    for page in pages:
        for locus in page.loci:
            result = validate_eva_text(locus.text, strict=False)
            total_warnings += len(result.warnings)
            unknown_chars.update(result.unknown_characters)

    print("Character validation results:")
    print(f"  Total warnings: {total_warnings}")
    if unknown_chars:
        print(f"  Unknown characters found: {unknown_chars}")
    else:
        print("  ✓ All characters are valid EVA")
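
`validate_eva_text` comes from `vcat.eva_charset` and its implementation is not shown. The idea behind a non-strict character check can be sketched with plain sets, as below. The alphabet here is a deliberately small illustrative subset, not the contents of `EVA_BASIC`, and treating `.` as a word separator is an assumption about the transcription conventions.

```python
# Illustrative subset only; the real sets are defined in vcat.eva_charset
ALLOWED_GLYPHS = set("acdehiklnoqrsty")
SEPARATORS = set(".")  # assumed word separator


def find_unknown_characters(text: str) -> set[str]:
    """Return characters that are neither allowed glyphs nor separators."""
    return {ch for ch in text if ch not in ALLOWED_GLYPHS and ch not in SEPARATORS}


# Accumulate unknown characters across lines, as the validation cell does per locus
unknown = set()
for line in ["daiin.chedy", "qokeey.w7"]:
    unknown |= find_unknown_characters(line)

print(sorted(unknown))  # ['7', 'w']
```

Collecting the unknown characters into one set, instead of stopping at the first one, gives a complete picture of what the character tables would need to cover.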

## 7. Summary

Verification status for all sources.

In [None]:
print("=" * 60)
print("SOURCE VERIFICATION SUMMARY")
print("=" * 60)

results = [
    ("ZL Transcription (Primary)", zl_result),
    ("IT Transcription (Secondary)", it_result),
    ("Stolfi Interlinear", stolfi_result),
]

all_pass = True
for name, result in results:
    status = "✓" if result.is_success else "✗"
    print(f"{status} {name}")
    if not result.is_success:
        all_pass = False
        for err in result.errors:
            print(f"    Error: {err}")

print()
if all_pass:
    print("✓ ALL SOURCES VERIFIED - Proceed to implementation")
else:
    print("✗ SOME SOURCES FAILED - Investigate before proceeding")
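
The summary only depends on a small interface of the result objects: `is_success`, `errors`, `content_hash`, and `to_dict()`. The class below is a hypothetical stand-in that shows one way such an interface could be satisfied. The rule that success means "no recorded errors" is an assumption, and the real class in `scripts.verify_sources` may differ.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SketchResult:
    """Hypothetical stand-in for the objects returned by the verify_* helpers."""

    url: str
    content_hash: str = ""
    errors: list[str] = field(default_factory=list)

    @property
    def is_success(self) -> bool:
        # Assumption: a check succeeds exactly when it recorded no errors
        return not self.errors

    def to_dict(self) -> dict:
        # Include the derived flag so saved reports are self-explanatory
        return {**asdict(self), "is_success": self.is_success}


checks = [
    SketchResult(url="https://example.org/a.txt", content_hash="abc123"),
    SketchResult(url="https://example.org/b.txt", errors=["HTTP 404"]),
]
all_ok = all(check.is_success for check in checks)
print(all_ok)
```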

In [None]:
# Save verification results
import datetime

from scripts.verify_sources import save_results

output_file = project_root / "reports" / f"source_verification_{datetime.date.today()}.json"
# parents=True avoids a FileNotFoundError if intermediate directories are missing
output_file.parent.mkdir(parents=True, exist_ok=True)

results_dict = {
    "zl_transcription": zl_result.to_dict(),
    "it_transcription": it_result.to_dict(),
    "stolfi_interlinear": stolfi_result.to_dict(),
    "all_pass": all_pass,
    "verification_date": str(datetime.date.today()),
}

save_results(results_dict, output_file)
print(f"Results saved to: {output_file}")
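
`save_results` is imported from `scripts.verify_sources`, and how it serializes the dictionary is not shown in this notebook. A plausible minimal version using the standard `json` module is sketched below, with `write_json_report` as a hypothetical stand-in and a made-up payload.

```python
import json
import tempfile
from pathlib import Path


def write_json_report(results: dict, output_file: Path) -> None:
    """Write results as indented UTF-8 JSON, creating parent folders as needed."""
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)


# Round-trip a made-up payload through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "reports" / "example.json"
    write_json_report({"all_pass": True, "verification_date": "2000-01-01"}, report)
    loaded = json.loads(report.read_text(encoding="utf-8"))

print(loaded)
```

Storing the date as a string, as the cell above does with `str(datetime.date.today())`, keeps the payload JSON-serializable without a custom encoder.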