jmdlab/warorgufo

war.gov / UFO

A single-page reading of the May 2026 Department of War "UFO" release. Twenty famous UAP cases told as short stories — claim vs. official line vs. what AARO's drop actually added — with primary-source citations down to the line number.

Live: jmdlab.github.io/warorgufo   ·   Source: war.gov/UFO   ·   License: MIT


Why this exists

Most public UFO archives are file dumps. This one is editorial: each of the twenty cases on the page (Roswell, Mantell, Lakenheath, RB-47, Tehran, Trans-en-Provence, Tic Tac, the modern AARO PR-series, and the rest) gets a short narrative, a verbatim primary-source quote, the ufologist claim and the official line side by side, and a precise file/line citation into the local corpus so anyone can verify it.

Behind the page is a fully reproducible pipeline: scrape the war.gov drop, download the PDFs/videos, OCR the scans, run claim regexes, resolve DVIDS / war.gov media URLs, inline data into the page. When drop 2 lands, the same pipeline regenerates everything in roughly an hour.


A flavor of what's inside

Major General Twining of Wright Field, Ohio … gained the impression that the AAF instituted this investigation to wash out the disc reports since they are definitely not of AAF origin.

— FBI memo, D.M. Ladd → J. Edgar Hoover, 19 Aug 1947  ·  text/65_hs1-834228961_62-hq-83894_section_3.txt:950–952

Estimate at least 100 total sightings, AEC, AFSWP, 4th Army, local commanders perturbed by implications of phenomena. Sighting reported from El Paso, Albuquerque, Alamogordo, Roswell, Socorro, and other locations.

— USAF priority cable, Kirtland AFB → Chief of Staff, 31 Jan 1949  ·  text/65_hs1-834228961_62-hq-83894_section_4.txt:2700–2718

The chlorophyll, as well as certain amino acids of the plants, exhibited significant variations in concentration, variations which decreased with the distance from the center of the mechanical track. These effects disappeared completely two years later. … the cause … could likely be a powerful pulsed electromagnetic field in the high frequency (microwave) range.

— Pr. Michel Bounias, INRA / GEPAN-SEPRA on Trans-en-Provence, 1983  ·  text/255_413270_*.txt:1127–1140

The page has ten more like these, each with a citation.


What's on the live page

| § | Section | Content |
|---|---|---|
| I | 1947 founding | Arnold · Twining · Roswell · Maury Island · Mogul (5 cases) |
| II | Cold War files | Mantell · Kirtland · Lubbock · Lakenheath · RB-47 (5) |
| III | French traces | Valensole · Cussac · Trans-en-Provence (3) |
| IV | Late 20c | Apollo 11 · Tehran · JAL 1628 · Belgian Wave (4) |
| V | Modern military | Nimitz / Tic Tac · USCENTCOM · 49 PR-series (3) |
| VI | Threads | 5 cross-cutting patterns (nuclear adjacency, EM cutoff, French traces, internal-public gap, disinformation) |
| VII | Video archive | All 28 DVIDS clips with AARO descriptions, agency filters, click-to-play |
| VIII | Image archive | All 14 stills with descriptions, filters, lightbox |
| IX | Researchers | 15 ufologists named in the canon · filled-dot = in this corpus |
| X | What's missing | 18 cases not in the drop, with sourcing routes |

Every quote has a file:line citation pointing into ./text/. Every video streams from the public DVIDS CDN. Every image renders from war.gov or the bundled ./images/. Nothing copyrighted is republished.
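
Checking a citation by hand means opening the file and counting lines. The sketch below shows one way to automate that lookup. It is not part of the repo: `parse_citation` and `read_citation` are hypothetical helper names, and the accepted format (`path:start–end`, with an optional `*` glob in the path) is inferred from the citations quoted above.

```python
import glob
import re
from pathlib import Path

# "path:start" or "path:start–end" (en dash or hyphen between the line numbers).
CITATION_RE = re.compile(r"^(?P<path>.+):(?P<start>\d+)(?:[–-](?P<end>\d+))?$")


def parse_citation(citation: str) -> tuple[str, int, int]:
    """Split a citation into (path pattern, first line, last line), 1-indexed."""
    match = CITATION_RE.match(citation.strip())
    if match is None:
        raise ValueError(f"not a file:line citation: {citation!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"line range runs backwards: {citation!r}")
    return match["path"], start, end


def read_citation(citation: str, root: str = ".") -> list[str]:
    """Return the cited lines. The path may contain a glob such as '*'."""
    pattern, start, end = parse_citation(citation)
    matches = sorted(glob.glob(str(Path(root) / pattern)))
    if len(matches) != 1:
        raise FileNotFoundError(f"{pattern!r} matched {len(matches)} files, expected 1")
    lines = Path(matches[0]).read_text(encoding="utf-8", errors="replace").splitlines()
    return lines[start - 1 : end]


if __name__ == "__main__":
    cite = "text/65_hs1-834228961_62-hq-83894_section_3.txt:950–952"
    print("\n".join(read_citation(cite)))
```

Run from the repo root, the last lines should print the three lines behind the Ladd memo quote, assuming the file is present in ./text/.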


Quick start — regenerate the corpus

The repo ships with the lightweight artifacts (115 OCR'd .txt files, 14 stills, all JSON). The 2.3 GB of PDFs and 1.3 GB of DVIDS videos are excluded by .gitignore and reconstructed by the pipeline.

# Prereqs: Python 3.12+, Tesseract (Windows: winget install UB-Mannheim.TesseractOCR)
python -m pip install --user pymupdf pytesseract Pillow

# 1. Build the manifest from raw/uap-csv.csv → manifest.json (161 entries)
python build_manifest.py

# 2. Download all PDFs / images / videos (~3.6 GB, parallel + idempotent)
python download.py pdf img vid

# 3. Extract embedded text via PyMuPDF
python pdf_to_text.py

# 4. OCR the scanned PDFs (Tesseract @ 220 dpi)
python ocr_pdfs.py

# 5. Run the claim regex analyzer → claims/<slug>.md
python analyze_claims.py

# 6. Resolve DVIDS + war.gov media URLs → media.json
python resolve_media.py

# 7. Inline media.json into index.html (works under file:// — no fetch needed)
python inline_data.py

Each step is idempotent, so any single step can be re-run in isolation.
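
To illustrate what idempotent means for step 7, here is a minimal sketch of inlining JSON into a static page so that a second run changes nothing. It is not the actual inline_data.py: the `<script id="media-data">` data island and the function names are assumptions made for the example.

```python
import json
import re
from pathlib import Path

# Hypothetical marker: a JSON data island that the page's own script would read.
ISLAND_RE = re.compile(
    r'(<script id="media-data" type="application/json">)(.*?)(</script>)',
    re.DOTALL,
)


def inline_json(html: str, data: object) -> str:
    """Replace the data island's contents; running it twice gives the same HTML."""
    # Escape "</" so a string inside the JSON cannot close the script tag early.
    payload = json.dumps(data, ensure_ascii=False, sort_keys=True).replace("</", "<\\/")
    new_html, count = ISLAND_RE.subn(
        lambda m: m.group(1) + payload + m.group(3), html, count=1
    )
    if count != 1:
        raise ValueError("media-data script tag not found")
    return new_html


def inline_file(html_path: str = "index.html", json_path: str = "media.json") -> bool:
    """Rewrite html_path only if its content would change. Returns True if written."""
    page = Path(html_path)
    old = page.read_text(encoding="utf-8")
    new = inline_json(old, json.loads(Path(json_path).read_text(encoding="utf-8")))
    if new == old:
        return False
    page.write_text(new, encoding="utf-8")
    return True
```

Because the data sits in the HTML itself, the page needs no fetch call and therefore still works when opened from file://. Sorting the keys keeps the output byte-stable across runs.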


Repo layout

warorgufo/
├── index.html              ← single-page archive (175 KB · live on GH Pages)
├── 404.html                ← styled fallback for any dead URL
│
├── README.md  ·  LICENSE  ·  HIGHLIGHTS.md  ·  CROSS_REFERENCE.md
│
├── build_manifest.py       ← step 1
├── download.py             ← step 2  (excluded by .gitignore: pdfs/, videos/)
├── pdf_to_text.py          ← step 3
├── ocr_pdfs.py             ← step 4
├── analyze_claims.py       ← step 5
├── resolve_media.py        ← step 6
├── inline_data.py          ← step 7
├── scan_canon.py           ← extended canon scanner
│
├── manifest.json           ← parsed CSV → 161 entries (PDFs / videos / images)
├── media.json              ← resolved video + image URLs (consumed by index.html)
│
├── claims/                 ← 16 per-claim grep dumps with file:line context
│   └── *.md
├── text/                   ← .txt extraction for every PDF (~5.7 MB)
├── images/                 ← 14 stills (8 FBI plates + 6 NASA Apollo)
└── raw/
    ├── index.html          ← landing-page snapshot from war.gov
    └── uap-csv.csv         ← the master CSV the page renders from
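
The claims/ files are described as grep dumps with file:line context. A rough sketch of that idea follows. It is not analyze_claims.py: the function name, the markdown layout of the output, and the example pattern are all assumptions.

```python
import re
from pathlib import Path


def grep_claim(pattern: str, text_dir: str = "text", context: int = 1) -> list[str]:
    """Return markdown lines: a 'file:line' heading per hit, then quoted context."""
    regex = re.compile(pattern, re.IGNORECASE)
    out: list[str] = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for i, line in enumerate(lines):
            if not regex.search(line):
                continue
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            out.append(f"### {path.name}:{i + 1}")  # 1-indexed, like the citations
            out.extend(f"> {ctx}" for ctx in lines[lo:hi])
            out.append("")
    return out


if __name__ == "__main__":
    # Hypothetical claim pattern; the repo's real regexes are not shown in this README.
    print("\n".join(grep_claim(r"flying\s+dis[ck]s?")))
```

Sorting the file list keeps the output order stable, so a re-run over an unchanged ./text/ yields an identical dump.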

What's in the corpus

| Source | Period | Count |
|---|---|---|
| FBI 62-HQ-83894 case file | 1947 – 1968 | ~10 sections + 4 redacted serials + 24 photo plates |
| Project Sign incident summaries (Box 7) | 1947 – 1949 | 3 PDFs covering #1 – 233 |
| NASA Apollo / Skylab / Gemini transcripts | 1965 – 1973 | 7 |
| DOW UAP "D" series mission reports | 2013 – 2026 | ~80 |
| DOW UAP "PR" series Unresolved UAP Reports | 2020 – 2026 | 49 (PR1–PR49) |
| COMETA Report (English translation) | 1999 | 1 |
| Slideshow stills + composite witness sketches | | 14 |
| DVIDS videos (FLIR / IR / audio) | 1965 – 2026 | 28 |

By type: 115 unique PDFs · 14 images · 28 videos · 161 manifest entries.
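
The totals do not add up at first glance, since 115 + 14 + 28 is 157 rather than 161. The short check below reconciles them using the figures from Known limitations. It rests on one reading of that note, namely that the 3 duplicate URLs are counted among the 118 resolvable PDF entries.

```python
# Figures quoted in this README.
pdf_entries_resolved = 118  # PDF manifest entries whose URL resolves
pdf_entries_broken = 1      # Serial_153, malformed URL in the source CSV
duplicate_pdf_urls = 3      # assumed: entries repeating a URL already listed
images = 14
videos = 28

pdf_entries = pdf_entries_resolved + pdf_entries_broken  # all PDF rows
unique_pdfs = pdf_entries_resolved - duplicate_pdf_urls  # distinct, downloadable
manifest_entries = pdf_entries + images + videos

assert unique_pdfs == 115
assert manifest_entries == 161
print(pdf_entries, unique_pdfs, manifest_entries)
```

Under that reading the manifest holds 119 PDF rows, of which 115 are distinct and downloadable.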


Known limitations

  • One PDF unrecoverable. 65_HS1-834228961_62-HQ-83894_Serial_153 has a malformed URL in the source CSV (...8342289+M5+M11, no extension). The other 118 PDF entries resolve cleanly; 3 of them repeat a URL already in the CSV, which leaves the 115 unique PDFs counted above.
  • 24 FBI photo plates yield no OCR text. They're scans of physical photographs — text content is zero by design.
  • No Sign / Grudge / Blue Book primary docs. Those are held by NICAP and the National Archives. The same goes for the Robertson Panel, Condon Report, Battelle Special Report 14, AATIP / AAWSAP / Skinwalker DIRDs, MJ-12, and the 2023 Grusch testimony. The page's § X ("What's missing") lists each gap with a sourcing route.
  • The COMETA English translation is unofficial. It's a private translation distributed alongside the war.gov drop. Passages in this archive are excerpted under fair use, not republished.

Stack

  • Python 3.12 + pymupdf, pytesseract, Pillow
  • Tesseract OCR (220 dpi, eng, --psm 6)
  • Static site — no framework, no build step beyond inline_data.py. Vanilla HTML / CSS / JS, deployed by GitHub Pages from main branch root
  • DVIDS CDN for video streaming (zero GitHub bandwidth)
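
As an illustration of how the listed OCR settings fit together, here is a minimal sketch that rasterizes each PDF page with PyMuPDF and passes it to Tesseract through pytesseract. It is not the repo's ocr_pdfs.py: the function names, the form-feed page separator, and the lowercased output filename are assumptions (the last one guessed from the lowercase names in ./text/).

```python
from pathlib import Path

DPI = 220                # render resolution from the stack list
TESS_LANG = "eng"
TESS_CONFIG = "--psm 6"  # page segmentation mode 6: a single uniform block of text


def ocr_pdf(pdf_path: str) -> str:
    """Rasterize every page at DPI and OCR it; pages are joined with form feeds."""
    # Imported here so the module still loads where the OCR stack is missing.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages: list[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=DPI)  # RGB without alpha by default
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            pages.append(
                pytesseract.image_to_string(image, lang=TESS_LANG, config=TESS_CONFIG)
            )
    return "\f".join(pages)


def ocr_to_txt(pdf_path: str, out_dir: str = "text") -> Path:
    """Write <out_dir>/<lowercased pdf stem>.txt, skipping PDFs already processed."""
    target = Path(out_dir) / (Path(pdf_path).stem.lower() + ".txt")
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(ocr_pdf(pdf_path), encoding="utf-8")
    return target
```

Skipping existing output is what would make a step like this safe to re-run. Deleting a single .txt file forces that one PDF through OCR again.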

License & provenance

Pipeline scripts: MIT (see LICENSE).

Source documents: US federal works are in the public domain under 17 U.S.C. § 105. The COMETA Report's English translation is a third-party private translation; passages here are excerpted under fair use, not republished.

If you find a citation that's wrong, a verdict that overreaches, or a case the pipeline missed — open an issue or a PR.


Built by re-reading the war.gov / UFO drop one document at a time. Drop 2 will extend, not replace.
