A command-line pipeline for bulk acquisition and multi-pass proofreading of WJEC GCSE "Made for Wales" PDF documents. The project automates document collection, conversion to Markdown, language checks, and LLM-assisted review so large document sets can be analysed consistently and at scale. A command-line pipeline for bulk acquisition and multi-pass proofreading of WJEC GCSE "Made for Wales" PDF documents. The project automates document collection, conversion to Markdown, language checks, and LLM-assisted review so large document sets can be analysed consistently and at scale.
- Scrapes and downloads PDFs from the WJEC qualification pages.
- Normalises and organises files into subject-specific folders.
- Converts PDFs to Markdown using high-fidelity converters (default: Marker).
- Runs automated language checks with LanguageTool to identify spelling and grammar issues.
- Performs LLM-based review passes to categorise issues, de-duplicate false positives, and catch contextual errors.
- Exports structured outputs (CSV reports) for auditing and analysis.
- Scrapes and downloads PDFs from the WJEC qualification pages.
- Normalises and organises files into subject-specific folders.
- Converts PDFs to Markdown using high-fidelity converters (default: Marker).
- Runs automated language checks with LanguageTool to identify spelling and grammar issues.
- Performs LLM-based review passes to categorise issues, de-duplicate false positives, and catch contextual errors.
- Exports structured outputs (CSV reports) for auditing and analysis.
The project combines conventional scraping and text processing with LLM-assisted review:
- Python 3.12 runtime with
uvfor dependency management and execution. - Requests + BeautifulSoup-style HTML parsing for PDF discovery.
- Marker (default) or Docling for PDF-to-Markdown conversion with OCR support.
- LanguageTool via
language-tool-pythonfor rule-based spelling and grammar checks. - Gemini and Mistral APIs for LLM-assisted categorisation and proofreading passes.
- Pydantic models (notably
LanguageIssue) to validate and normalise issue payloads. - json-repair integration to recover malformed JSON responses before validation.
The workflow is intentionally staged so each pass refines the results of the previous one:
- Acquisition pass
- Scrape subject pages and download linked PDFs into
Documents/<Subject>/.
- Scrape subject pages and download linked PDFs into
- Conversion pass
- Convert PDFs to Markdown, preserving layout cues and page markers for downstream processing.
- LanguageTool pass
- Run spelling/grammar checks and create a baseline issue list.
- LLM categorisation pass
- Use LLMs to filter false positives, classify issues, and prioritise review effort.
- LLM outputs are repaired with
repair_jsonbefore being parsed and validated. - The
LanguageIssuemodel enforces required fields, normalises values, and rejects partial LLM payloads to keep the results consistent.
- LLM proofreading pass
- Review document text in small page chunks to catch contextual, consistency, and factual errors.
- Outputs are validated against the same
LanguageIssueschema to keep issue metadata aligned with earlier passes.
- Reporting pass
- Emit per-document CSV reports for analysis and aggregation.
This layered approach reduces noise from OCR artefacts while making the remaining issues easier to triage.
- CLI entry point:
main.py(delegates tosrc/cli) - Scraper:
src/scraper/__init__.py(subject list, URL discovery, file naming) - Post-processing:
src/postprocessing/__init__.py(organises PDFs, converts to Markdown) - Converters:
src/converters/converters.py(Marker and Docling backends) - Language checking:
src/language_check/ - LLM review modules:
src/llm_review/(categoriser, proofreader, batch orchestration) - Reporting utilities:
src/utils/page_utils.py,src/scripts/document_stats.py
Dependencies are managed with uv.
uv syncList available subjects: List available subjects:
uv run python main.py --list-subjectsDownload all subjects: Download all subjects:
uv run python main.pyDownload selected subjects to a custom location: Download selected subjects to a custom location:
uv run python main.py --subjects "Art and Design" French --root ./wjec-pdfs
uv run python main.py --subjects "Art and Design" French --root ./wjec-pdfsPreview downloads without writing files: Preview downloads without writing files:
uv run python main.py --subjects Geography --dry-runThe CLI can organise PDFs into pdfs/, convert them to markdown/, and extract images as needed:
The CLI can organise PDFs into pdfs/, convert them to markdown/, and extract images as needed:
--post-processruns the organiser after downloading.--post-process-onlyskips downloading and processes existing PDFs.--post-process-file <path>converts a single PDF.--post-process-workers Ncaps concurrent subject processing.--converter {marker,docling}chooses the PDF-to-Markdown backend (default:marker).--post-processruns the organiser after downloading.--post-process-onlyskips downloading and processes existing PDFs.--post-process-file <path>converts a single PDF.--post-process-workers Ncaps concurrent subject processing.--converter {marker,docling}chooses the PDF-to-Markdown backend (default:marker).
Examples:
# Download and convert
# Download and convert
uv run python main.py --subjects Geography --post-process
# Convert an existing folder
uv run python main.py --root Documents --post-process-only --converter marker
# Convert an existing folder
uv run python main.py --root Documents --post-process-only --converter marker
# Convert a single PDF
uv run python main.py --post-process-file Documents/Art-and-Design/sample.pdfThe LLM review modules build on the LanguageTool output and the Markdown page markers. They support both live and batch modes, with resumable state tracking and per-document CSV outputs. See docs/developer/LLM_REVIEW_MODULE_GUIDE.md for implementation details and docs/developer/LLM_PROVIDER_SPEC.md for provider integration.
Typical output layout under Documents/<Subject>/:
pdfs/— normalised PDF copiesmarkdown/— converted Markdown and extracted imagesdocument_reports/— per-document CSV reports from LLM review passes
Additional data artefacts are produced under data/ (state files, logs, and categoriser outputs). Notebooks under notebooks/ and docs/notebooks/ provide analysis and reporting examples.
uv run python main.py --post-process-file Documents/Art-and-Design/sample.pdf
## LLM review tooling
The LLM review modules build on the LanguageTool output and the Markdown page markers. They support both live and batch modes, with resumable state tracking and per-document CSV outputs. See `docs/developer/LLM_REVIEW_MODULE_GUIDE.md` for implementation details and `docs/developer/LLM_PROVIDER_SPEC.md` for provider integration.
## Outputs and data layout
Typical output layout under `Documents/<Subject>/`:
- `pdfs/` — normalised PDF copies
- `markdown/` — converted Markdown and extracted images
- `document_reports/` — per-document CSV reports from LLM review passes
Additional data artefacts are produced under `data/` (state files, logs, and categoriser outputs). Notebooks under `notebooks/` and `docs/notebooks/` provide analysis and reporting examples.
## Configuration and environment variables
Key environment variables used by the LLM integrations:
## Configuration and environment variables
Key environment variables used by the LLM integrations:
- `GEMINI_API_KEY`
- `MISTRAL_API_KEY`
- `LLM_PRIMARY`, `LLM_FALLBACK`
- `LLM_CATEGORISER_BATCH_SIZE`, `LLM_CATEGORISER_MAX_RETRIES`
Refer to `docs/developer/LLM_PROVIDER_SPEC.md` for the full list and behavioural details.
## Known limitations
- **PDF conversion noise:** OCR artefacts (merged words, missing hyphens) can still surface; multi-pass filtering reduces, but does not eliminate, them.
- **Images are not fully checked:** Image-based text is only analysed if the converter extracts it into Markdown.
- **LLM runs require API keys:** Provider quotas and rate limits apply.
## Further documentation
- `docs/developer/ARCHITECTURE.md` — data flow, parsing rules, invariants
- `docs/developer/DEV_WORKFLOWS.md` — testing and debugging workflows
- `docs/methodology.md` — explanation of the proofreading approach
- `SCRIPTS.md` — helper scripts and batch processing utilities
- `GEMINI_API_KEY`
- `MISTRAL_API_KEY`
- `LLM_PRIMARY`, `LLM_FALLBACK`
- `LLM_CATEGORISER_BATCH_SIZE`, `LLM_CATEGORISER_MAX_RETRIES`
Refer to `docs/developer/LLM_PROVIDER_SPEC.md` for the full list and behavioural details.
## Known limitations
- **PDF conversion noise:** OCR artefacts (merged words, missing hyphens) can still surface; multi-pass filtering reduces, but does not eliminate, them.
- **Images are not fully checked:** Image-based text is only analysed if the converter extracts it into Markdown.
- **LLM runs require API keys:** Provider quotas and rate limits apply.
## Further documentation
- `docs/developer/ARCHITECTURE.md` — data flow, parsing rules, invariants
- `docs/developer/DEV_WORKFLOWS.md` — testing and debugging workflows
- `docs/methodology.md` — explanation of the proofreading approach
- `SCRIPTS.md` — helper scripts and batch processing utilities