A single-page reading of the May 2026 Department of War "UFO" release. Twenty famous UAP cases told as short stories — claim vs. official line vs. what AARO's drop actually added — with primary-source citations down to the line number.
Live: jmdlab.github.io/warorgufo · Source: war.gov/UFO · License: MIT
Most public UFO archives are file dumps. This one is editorial: every case in the canon — Roswell, Mantell, Lakenheath, RB-47, Tehran, Trans-en-Provence, Tic Tac, the modern AARO PR-series — gets a short narrative, a verbatim primary-source quote, the ufologist claim and the official line side-by-side, and a precise file/line citation in the local corpus so anyone can verify.
Behind the page is a fully reproducible pipeline: scrape the war.gov drop, download the PDFs/videos, OCR the scans, run claim regexes, resolve DVIDS / war.gov media URLs, inline data into the page. When drop 2 lands, the same pipeline regenerates everything in roughly an hour.
Major General Twining of Wright Field, Ohio … gained the impression that the AAF instituted this investigation to wash out the disc reports since they are definitely not of AAF origin.
— FBI memo, D.M. Ladd → J. Edgar Hoover, 19 Aug 1947 ·
text/65_hs1-834228961_62-hq-83894_section_3.txt:950–952
Estimate at least 100 total sightings, AEC, AFSWP, 4th Army, local commanders perturbed by implications of phenomena. Sighting reported from El Paso, Albuquerque, Alamogordo, Roswell, Socorro, and other locations.
— USAF priority cable, Kirtland AFB → Chief of Staff, 31 Jan 1949 ·
text/65_hs1-834228961_62-hq-83894_section_4.txt:2700–2718
The chlorophyll, as well as certain amino acids of the plants, exhibited significant variations in concentration, variations which decreased with the distance from the center of the mechanical track. These effects disappeared completely two years later. … the cause … could likely be a powerful pulsed electromagnetic field in the high frequency (microwave) range.
— Pr. Michel Bounias, INRA / GEPAN-SEPRA on Trans-en-Provence, 1983 ·
text/255_413270_*.txt:1127–1140
The page has ten more like these, each with a citation.
| Section | Content |
|---|---|
| § I — 1947 founding | Arnold · Twining · Roswell · Maury Island · Mogul (5 cases) |
| § II — Cold War files | Mantell · Kirtland · Lubbock · Lakenheath · RB-47 (5) |
| § III — French traces | Valensole · Cussac · Trans-en-Provence (3) |
| § IV — Late 20c | Apollo 11 · Tehran · JAL 1628 · Belgian Wave (4) |
| § V — Modern military | Nimitz / Tic Tac · USCENTCOM · 49 PR-series (3) |
| § VI — Threads | 5 cross-cutting patterns (nuclear adjacency, EM cutoff, French traces, internal-public gap, disinformation) |
| § VII — Video archive | All 28 DVIDS clips with AARO descriptions, agency filters, click-to-play |
| § VIII — Image archive | All 14 stills with descriptions, filters, lightbox |
| § IX — Researchers | 15 ufologists named in the canon · filled-dot = in this corpus |
| § X — What's missing | 18 cases not in the drop, with sourcing routes |
Every quote has a file:line citation pointing into ./text/. Every video streams from the public DVIDS CDN. Every image renders from war.gov or the bundled ./images/. Nothing copyrighted is republished.
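Checking a citation by hand means opening the file and counting lines. A minimal sketch of doing it in code follows, assuming line numbers are 1-indexed and inclusive, that ranges use either an en dash or a hyphen, and that a `*` in the path is a shell-style glob. The helper names `parse_citation` and `read_citation` are hypothetical and not part of the repo.

```python
import glob
import re
from pathlib import Path

# "path:START–END", "path:START-END", or a single line "path:N".
CITATION_RE = re.compile(r"^(?P<path>.+):(?P<start>\d+)(?:[–-](?P<end>\d+))?$")


def parse_citation(cite: str) -> tuple[str, int, int]:
    """Split a file:line citation into (path pattern, first line, last line)."""
    m = CITATION_RE.match(cite.strip())
    if m is None:
        raise ValueError(f"not a file:line citation: {cite!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if end < start:
        raise ValueError(f"line range runs backwards: {cite!r}")
    return m["path"], start, end


def read_citation(cite: str, root: str = ".") -> list[str]:
    """Return the cited lines from the first file matching the path pattern."""
    pattern, start, end = parse_citation(cite)
    matches = sorted(glob.glob(str(Path(root) / pattern)))
    if not matches:
        raise FileNotFoundError(f"no file matches {pattern!r} under {root!r}")
    text = Path(matches[0]).read_text(encoding="utf-8", errors="replace")
    # Assumption: citations count lines from 1 and include both endpoints.
    return text.splitlines()[start - 1 : end]
```

Calling `read_citation("text/65_hs1-834228961_62-hq-83894_section_3.txt:950–952")` from the repo root should return the three lines behind the Twining quote, if the assumptions above match how the citations were generated.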
The repo ships with the lightweight artifacts (115 OCR'd .txt files, 14 stills, all JSON). The 2.3 GB of PDFs and 1.3 GB of DVIDS videos are excluded by .gitignore and reconstructed by the pipeline.
# Prereqs: Python 3.12+, Tesseract (Windows: winget install UB-Mannheim.TesseractOCR)
python -m pip install --user pymupdf pytesseract Pillow
# 1. Build the manifest from raw/uap-csv.csv → manifest.json (161 entries)
python build_manifest.py
# 2. Download all PDFs / images / videos (~3.6 GB, parallel + idempotent)
python download.py pdf img vid
# 3. Extract embedded text via PyMuPDF
python pdf_to_text.py
# 4. OCR the scanned PDFs (Tesseract @ 220 dpi)
python ocr_pdfs.py
# 5. Run the claim regex analyzer → claims/<slug>.md
python analyze_claims.py
# 6. Resolve DVIDS + war.gov media URLs → media.json
python resolve_media.py
# 7. Inline media.json into index.html (works under file:// — no fetch needed)
python inline_data.py
Each step is idempotent. Re-run any single step in isolation.
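Because the steps are independent, a small driver can run any contiguous slice of them. The sketch below is a hypothetical helper, not a script the repo ships. It takes the script names and arguments from the numbered steps above and assumes they are run from the repo root.

```python
import subprocess
import sys

# Script names and arguments as listed in steps 1-7 above.
STEPS = [
    ["build_manifest.py"],
    ["download.py", "pdf", "img", "vid"],
    ["pdf_to_text.py"],
    ["ocr_pdfs.py"],
    ["analyze_claims.py"],
    ["resolve_media.py"],
    ["inline_data.py"],
]


def build_commands(first: int = 1, last: int = len(STEPS)) -> list[list[str]]:
    """Return the argv lists for steps first..last (1-indexed, inclusive)."""
    if not 1 <= first <= last <= len(STEPS):
        raise ValueError(f"step range must lie within 1..{len(STEPS)}")
    return [[sys.executable, *step] for step in STEPS[first - 1 : last]]


def run_pipeline(first: int = 1, last: int = len(STEPS)) -> None:
    """Run the selected steps in order, stopping at the first failure."""
    for cmd in build_commands(first, last):
        print("running:", " ".join(cmd[1:]))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

For example, `run_pipeline(5, 7)` would redo only the claim analysis, media resolution, and inlining after editing a regex, leaving the downloads and OCR output untouched.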
warorgufo/
├── index.html ← single-page archive (175 KB · live on GH Pages)
├── 404.html ← styled fallback for any dead URL
│
├── README.md · LICENSE · HIGHLIGHTS.md · CROSS_REFERENCE.md
│
├── build_manifest.py ← step 1
├── download.py ← step 2 (its outputs, pdfs/ and videos/, are excluded by .gitignore)
├── pdf_to_text.py ← step 3
├── ocr_pdfs.py ← step 4
├── analyze_claims.py ← step 5
├── resolve_media.py ← step 6
├── inline_data.py ← step 7
├── scan_canon.py ← extended canon scanner
│
├── manifest.json ← parsed CSV → 161 entries (PDFs / videos / images)
├── media.json ← resolved video + image URLs (consumed by index.html)
│
├── claims/ ← 16 per-claim grep dumps with file:line context
│ └── *.md
├── text/ ← .txt extraction for every PDF (~5.7 MB)
├── images/ ← 14 stills (8 FBI plates + 6 NASA Apollo)
└── raw/
    ├── index.html ← landing-page snapshot from war.gov
    └── uap-csv.csv ← the master CSV the page renders from
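The `claims/` entry in the tree above holds grep-style dumps with file:line context. The following is a minimal sketch of how such a scan could work, not the logic of `analyze_claims.py` itself. The two patterns and the slug names are hypothetical placeholders, and the function names are illustrative.

```python
import re
from pathlib import Path

# Hypothetical patterns for illustration; the repo's real regexes are not shown here.
CLAIM_PATTERNS = {
    "nuclear-adjacency": re.compile(r"\b(AEC|atomic|nuclear)\b", re.IGNORECASE),
    "em-cutoff": re.compile(
        r"\b(engine|radio|ignition)\s+(failed|stopped|cut out)\b", re.IGNORECASE
    ),
}


def scan_text(name: str, text: str, context: int = 1) -> list[dict]:
    """Return one hit per (line, pattern) match, with a file:line cite and context."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        for slug, pattern in CLAIM_PATTERNS.items():
            if pattern.search(line):
                lo = max(0, i - context)
                hi = min(len(lines), i + context + 1)
                hits.append(
                    {"slug": slug, "cite": f"{name}:{i + 1}", "context": lines[lo:hi]}
                )
    return hits


def scan_corpus(text_dir: str = "text") -> list[dict]:
    """Scan every .txt file under text_dir and collect all hits."""
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        body = path.read_text(encoding="utf-8", errors="replace")
        hits.extend(scan_text(path.as_posix(), body))
    return hits
```

Grouping the returned hits by `slug` and writing each group to `claims/<slug>.md` would give one dump per claim, in the same spirit as the files listed above.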
| Source | Period | Count |
|---|---|---|
| FBI 62-HQ-83894 case file | 1947 – 1968 | ~10 sections + 4 redacted serials + 24 photo plates |
| Project Sign incident summaries (Box 7) | 1947 – 1949 | 3 PDFs covering #1 – 233 |
| NASA Apollo / Skylab / Gemini transcripts | 1965 – 1973 | 7 |
| DOW UAP "D" series mission reports | 2013 – 2026 | ~80 |
| DOW UAP "PR" series Unresolved UAP Reports | 2020 – 2026 | 49 (PR1–PR49) |
| COMETA Report (English translation) | 1999 | 1 |
| Slideshow stills + composite witness sketches | — | 14 |
| DVIDS videos (FLIR / IR / audio) | 1965 – 2026 | 28 |
By type: 115 unique PDFs · 14 images · 28 videos · 161 manifest entries.
- One PDF unrecoverable. `65_HS1-834228961_62-HQ-83894_Serial_153` has a malformed URL in the source CSV (`...8342289+M5+M11`, no extension). The other 118 PDF entries resolve cleanly; 3 are duplicate URLs in the CSV.
- 24 FBI photo plates yield no OCR text. They're scans of physical photographs, so text content is zero by design.
- No Sign / Grudge / Blue Book primary docs. Those live in NICAP and the National Archives. Same for Robertson Panel, Condon Report, Battelle Special Report 14, AATIP / AAWSAP / Skinwalker DIRDs, MJ-12, and the 2023 Grusch testimony. The page's § X — What's missing lists each gap with sourcing.
- The COMETA English translation is unofficial. It's a private translation distributed alongside the war.gov drop. Quoted passages in this archive are quoted under fair use, not republished.
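The first caveat above names two URL problems in the source CSV: a link with no file extension and links that appear more than once. A minimal sketch of checking for both follows, assuming the URLs have already been pulled out of the manifest into a plain list. The manifest's actual schema is not described here, and `audit_urls` is a hypothetical helper.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def audit_urls(urls: list[str]) -> dict[str, list[str]]:
    """Report URLs that occur more than once and URLs whose path has no extension."""
    counts = Counter(urls)
    duplicates = sorted(u for u, n in counts.items() if n > 1)
    no_extension = sorted(
        u for u in counts if not PurePosixPath(urlsplit(u).path).suffix
    )
    return {"duplicates": duplicates, "no_extension": no_extension}
```

Run against the PDF entries, a check like this should flag the one extension-less link and the duplicated URLs before any download is attempted.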
- Python 3.12 + `pymupdf`, `pytesseract`, `Pillow`
- Tesseract OCR (220 dpi, eng, `--psm 6`)
- Static site: no framework, no build step beyond `inline_data.py`. Vanilla HTML / CSS / JS, deployed by GitHub Pages from `main` branch root
- DVIDS CDN for video streaming (zero GitHub bandwidth)
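The only build step in the stack above is inlining JSON into the page so it works without a fetch. One way that could be done is sketched below. The `/*MEDIA:BEGIN*/` and `/*MEDIA:END*/` markers are hypothetical; the mechanism `inline_data.py` actually uses is not documented here. Replacing everything between paired markers keeps the step re-runnable.

```python
import json
import re

# Hypothetical markers wrapping the inlined payload inside a <script> block.
BEGIN, END = "/*MEDIA:BEGIN*/", "/*MEDIA:END*/"
BLOCK_RE = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)


def inline_json(html: str, data: object) -> str:
    """Replace whatever sits between the markers with the serialized data."""
    payload = json.dumps(data, ensure_ascii=False)
    # Escape "</" so a value such as "</script>" cannot close the script tag early.
    payload = payload.replace("</", "<\\/")
    # A callable replacement avoids backslash interpretation in the payload.
    new_html, n = BLOCK_RE.subn(lambda _m: f"{BEGIN}{payload}{END}", html, count=1)
    if n != 1:
        raise ValueError("marker pair not found in HTML")
    return new_html
```

With a page containing `const MEDIA = /*MEDIA:BEGIN*/null/*MEDIA:END*/;`, calling `inline_json(page, media)` twice with the same data yields the same output both times.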
Pipeline scripts: MIT (see LICENSE).
Source documents: US federal works are public domain by 17 U.S.C. § 105. The COMETA Report's English translation is a third-party private translation; quoted passages here are excerpted under fair use, not republished.
If you find a citation that's wrong, a verdict that overreaches, or a case the pipeline missed, open an issue or a PR.
Built by re-reading the war.gov/UFO drop one document at a time. Drop 2 will extend, not replace.