loic-cunningham/energy-consents

energy-consents

A fast, resumable scraper for energyconsents.scot — every case, every document, every field.

Two components:

  • scraper/: Bun + TypeScript crawler. No headless browser, no jsdom; just raw HTTP, node-html-parser, and a small ASP.NET WebForms client that replays __VIEWSTATE postbacks the way the page itself does.
  • parser/: Python text/metadata extractor that walks the downloaded PDFs and turns them into JSONL. PDF parsing is powered by kreuzberg; many thanks to its authors for the heavy lifting.

Features

  • Full corpus index — paginates through every Advanced Search page and emits a single case_refs.json.
  • Every detail field — every label/value, plus structured agent + applicant contact blocks.
  • Every document — paginated documents tab, each row captured with date received, publication date, category, description, and the postback coordinates needed to fetch the file.
  • Real PDF downloads — replays the ASP.NET __doPostBack chain to pull the actual file bytes (with the original filename from Content-Disposition).
  • Resumable — JSONL output, dedupes by caseRef on every restart.
  • Bounded concurrency — global HTTP cap via p-limit, per-case parallelism on top.
  • Pretty CLI — yargs commands, cli-progress bar that keeps log lines above it, picocolors for colour.
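
The resumable behaviour can be pictured as a filter over the index: read whatever is already in the JSONL, collect the caseRef values, and schedule only the refs that are missing. The following is a minimal sketch, assuming one JSON object per line with a caseRef field; the helper names seenCaseRefs and pendingRefs are hypothetical and not taken from scraper/src.

```typescript
// Hypothetical helpers illustrating resume-by-caseRef; not the project's actual code.

/** Collect every caseRef already present in a JSONL string. */
function seenCaseRefs(jsonl: string): Set<string> {
  const seen = new Set<string>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    try {
      const record = JSON.parse(trimmed) as { caseRef?: unknown };
      if (typeof record.caseRef === "string") seen.add(record.caseRef);
    } catch {
      // Assumption: a half-written last line (e.g. after a crash) is skipped,
      // so that case is simply scraped again on the next run.
    }
  }
  return seen;
}

/** Refs from the index that still need scraping, in index order, without duplicates. */
function pendingRefs(allRefs: string[], jsonl: string): string[] {
  const seen = seenCaseRefs(jsonl);
  const pending: string[] = [];
  for (const ref of allRefs) {
    if (seen.has(ref)) continue;
    seen.add(ref); // also drops duplicates within the index itself
    pending.push(ref);
  }
  return pending;
}
```

Keying on caseRef rather than on line count means a truncated or reordered JSONL still resumes correctly, at the cost of reading the whole file once at startup.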

Install

You need Bun (this project uses Bun.serve, bun:sqlite, and Bun.file idioms — see CLAUDE.md).

curl -fsSL https://bun.sh/install | bash   # if you don't have bun yet
git clone https://github.com/<your-fork>/energy-consents
cd energy-consents/scraper
bun install

For the Python parser, see parser/README.md (uv sync inside parser/).

Quickstart

# run from the repository root (the install step above leaves you in scraper/)
# 1) crawl the search results (~30s)
bun run scraper/src/cli.ts index

# 2) scrape every case → data/energy_consents.jsonl  (~30 min, resumable)
bun run scraper/src/cli.ts scrape

# 3) pull the PDFs for any case you care about
bun run scraper/src/cli.ts download --case ECU00004696

Or via the npm scripts (from inside scraper/):

cd scraper
bun run index
bun run scrape
bun run download -- --case ECU00004696

Then extract text + metadata from the downloaded PDFs:

cd parser
uv run energy-consents-parse   # writes out/energy_consents_extracted.jsonl

PDF parsing is handled by kreuzberg — credit to its authors.

Output layout

Everything lives under data/ as JSON / JSONL / binary files.

data/
├── case_refs.json                       # CaseRow[] — every case ref from the search
├── energy_consents.jsonl                # one CaseRecord per line (the bulk corpus)
└── <CASE_REF>/
    ├── case.json                        # CaseRecord, pretty-printed
    ├── documents.json                   # download manifest (when files were fetched)
    └── documents/
        └── <original-filename>.pdf

A CaseRecord looks like:

{
  "caseRef": "EC00002069",
  "projectName": "Loch Urr",
  "caseType": "Development",
  "projectType": "Wind Farm",
  "caseStatus": "Withdrawn",
  "detail": {
    "url": "https://www.energyconsents.scot/ApplicationDetails.aspx?cr=EC00002069",
    "fields": {
      "Project Type": "Wind Farm (Other Generating Station)",
      "Status": "Withdrawn",
      "Planning Authority": "Dumfries and Galloway Council",
      "Application Received Date": "19 Nov 2014",
      "Max Total MW Of Development Applied For": "83.20",
      "...": "..."
    },
    "contacts": [
      { "role": "Agent",     "lines": ["Axis PED", "Well House Barns, Chester, CH4 0DH", "..."] },
      { "role": "Applicant", "lines": ["Nick Taylor", "EON Climate & Renewable Developments Limited", "..."] }
    ]
  },
  "documents": [
    {
      "dateReceived": "18 May 2016",
      "publicationDate": "19 May 2016",
      "documentCategory": "ECU Correspondence (external)",
      "description": "Notification of withdrawal of application",
      "postbackTarget": "ctl00$ContentPlaceHolder1$grdResults$ctl02$AddButton",
      "documentId": "ContentPlaceHolder1_grdResults_AddButton_0",
      "pageOnDocsTab": 1,
      "rowIndexOnPage": 0,
      "downloadEndpoint": "https://www.energyconsents.scot/ApplicationDetails.aspx?cr=EC00002069"
    }
  ],
  "representations": [],
  "scrapedAt": "2026-04-29T20:59:00.000Z"
}

postbackTarget + pageOnDocsTab + rowIndexOnPage are the coordinates the download command needs. There is no static URL per document (see Why no plain URL per document? below).
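
For orientation, the record above could be typed roughly as follows. This is a sketch inferred from the sample JSON only: the names CaseRecord and DocumentEntry appear in this README, but the Contact interface, the field optionality, and the downloadArgs helper are assumptions and may not match parse.ts. The helper shows how the page and row coordinates map onto the download command's flags.

```typescript
// Sketch inferred from the sample record; optionality and the Contact name are assumptions.
interface DocumentEntry {
  dateReceived: string;
  publicationDate: string;
  documentCategory: string;
  description: string;
  postbackTarget: string;
  documentId: string;
  pageOnDocsTab: number; // 1-based page on the Documents tab
  rowIndexOnPage: number; // 0-based row within that page
  downloadEndpoint: string;
}

interface Contact {
  role: string;
  lines: string[];
}

interface CaseRecord {
  caseRef: string;
  projectName: string;
  caseType: string;
  projectType: string;
  caseStatus: string;
  detail: { url: string; fields: Record<string, string>; contacts: Contact[] };
  documents: DocumentEntry[];
  representations: unknown[]; // shape not shown in the sample
  scrapedAt: string; // ISO 8601 timestamp
}

/** Hypothetical helper: CLI arguments that re-fetch one document of a case. */
function downloadArgs(record: CaseRecord, doc: DocumentEntry): string[] {
  return [
    "download",
    "--case", record.caseRef,
    "--page", String(doc.pageOnDocsTab),
    "--row", String(doc.rowIndexOnPage),
  ];
}
```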

CLI

Global options (apply to every command)

| Flag | Default | Notes |
| --- | --- | --- |
| --base-url | https://www.energyconsents.scot | Override the site root. |
| --data-dir | data | Where every JSON/JSONL/binary goes. |
| --refs | <data-dir>/case_refs.json | Override the index file. |
| --output | <data-dir>/energy_consents.jsonl | Override the JSONL corpus path. |
| --http-concurrency | 6 | Global cap on in-flight HTTP requests. |
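
The <data-dir> defaults mean that --refs and --output follow --data-dir unless they are set explicitly. A sketch of that resolution is below; the GlobalOptions shape and the resolveOptions name are illustrative and not taken from config.ts, and stripping a trailing slash from the data directory is an added assumption.

```typescript
// Illustrative only: mirrors the defaults listed for the global options.
interface GlobalOptions {
  baseUrl?: string;
  dataDir?: string;
  refs?: string;
  output?: string;
  httpConcurrency?: number;
}

function resolveOptions(opts: GlobalOptions = {}): Required<GlobalOptions> {
  // Assumption: a trailing slash on the data dir is dropped before joining.
  const dataDir = (opts.dataDir ?? "data").replace(/\/+$/, "");
  return {
    baseUrl: opts.baseUrl ?? "https://www.energyconsents.scot",
    dataDir,
    refs: opts.refs ?? `${dataDir}/case_refs.json`, // follows --data-dir by default
    output: opts.output ?? `${dataDir}/energy_consents.jsonl`,
    httpConcurrency: opts.httpConcurrency ?? 6,
  };
}
```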

index

Walk the search-result pagination, write every case ref.

bun run scraper/src/cli.ts index

scrape

Fetch case detail + documents + representations into JSONL. Optionally pull binaries too.

# whole corpus, resumable (default)
bun run scraper/src/cli.ts scrape

# just one case
bun run scraper/src/cli.ts scrape --case EC00002069

# first 50 cases, fresh start
bun run scraper/src/cli.ts scrape --limit 50 --no-resume

# scrape + pull every document binary
bun run scraper/src/cli.ts scrape --case EC00002069 --download

| Flag | Default | Notes |
| --- | --- | --- |
| --case <ref> | | Single case ref instead of the whole list. |
| --limit <n> | | Stop after N cases. |
| --concurrency | 4 | Cases worked on in parallel. |
| --resume | true | Skip refs already in the JSONL. Use --no-resume to truncate. |
| --download | false | Also pull every document into data/<caseRef>/documents/. |

download

Replays the postback chain to fetch real files.

# every doc for one case
bun run scraper/src/cli.ts download --case ECU00004696

# one specific doc (page + row from the JSONL)
bun run scraper/src/cli.ts download --case EC00002069 --page 1 --row 0

# every doc for every case (slow!)
bun run scraper/src/cli.ts download --all --concurrency 2

| Flag | Default | Notes |
| --- | --- | --- |
| --case <ref> | | Required unless --all. |
| --page <n> | | Documents-tab page (1-based). Single-doc mode. |
| --row <n> | | 0-based row on that page. Required with --page. |
| --all | false | Loop over every case in case_refs.json. |
| --concurrency | 2 | Cases in parallel when --all. |
| --limit <n> | | Cap cases processed when --all. |
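
The constraints between these flags amount to a small validation step before any request is made. The sketch below encodes only the two rules stated for the download command (--case is required unless --all, and --row is required with --page) plus the 1-based/0-based ranges. The function name, the mode labels, and the choice to let --all take precedence over --case are assumptions.

```typescript
// Hypothetical validation of the download flags; not the project's yargs setup.
interface DownloadFlags {
  caseRef?: string;
  page?: number;
  row?: number;
  all?: boolean;
}

type DownloadMode = "all-cases" | "all-docs-for-case" | "single-doc";

function validateDownloadFlags(flags: DownloadFlags): DownloadMode {
  // Assumption: --all wins if both --all and --case are given.
  if (flags.all) return "all-cases";
  if (!flags.caseRef) throw new Error("--case is required unless --all is set");
  if (flags.page === undefined) return "all-docs-for-case";
  if (flags.row === undefined) throw new Error("--row is required with --page");
  if (!Number.isInteger(flags.page) || flags.page < 1) {
    throw new Error("--page is 1-based and must be a positive integer");
  }
  if (!Number.isInteger(flags.row) || flags.row < 0) {
    throw new Error("--row is 0-based and must be a non-negative integer");
  }
  return "single-doc";
}
```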

Inspecting the data

# pretty-print one record
head -1 data/energy_consents.jsonl | jq

# one-line summary per case (first 10)
jq -r '"\(.caseRef) \(.caseStatus) docs=\(.documents|length) – \(.projectName)"' data/energy_consents.jsonl | head

# every document in the corpus, flattened
jq -rc '. as $c | .documents[] | [$c.caseRef, .pageOnDocsTab, .rowIndexOnPage, .documentCategory, .description] | @tsv' data/energy_consents.jsonl | head

# replay download for any row found in the JSONL
jq -rc '.caseRef as $c | .documents[] | "\($c) \(.pageOnDocsTab) \(.rowIndexOnPage)"' data/energy_consents.jsonl \
  | head -5 \
  | xargs -L1 -P4 bash -c 'bun run scraper/src/cli.ts download --case "$0" --page "$1" --row "$2"'

Why no plain URL per document?

Each document is served by POST-ing the case page with __EVENTTARGET=...$AddButton and the current __VIEWSTATE. The response body is the file, with Content-Disposition giving the filename. There is no GET URL we can hand to a browser — the only way to fetch a file is to replay the postback chain (open the case → switch to Documents tab → page-forward to the right page → POST the AddButton). That is exactly what bun run scraper/src/cli.ts download does.
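
To make the postback concrete, here is a rough sketch of how the form body for one step of that chain could be assembled, and how the filename could be read back from the response. It is dependency-free on purpose: the regex-based hidden-field extraction is a stand-in for the node-html-parser code the scraper actually uses, all three function names are hypothetical, and the Content-Disposition parser handles only the plain filename= form.

```typescript
// Sketch only: regexes keep the example self-contained; the project parses HTML properly.

/** Pull every <input type="hidden"> name/value pair (e.g. __VIEWSTATE) out of a page. */
function hiddenFields(html: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const tag of html.match(/<input\b[^>]*>/gi) ?? []) {
    if (!/\btype\s*=\s*["']hidden["']/i.test(tag)) continue;
    const name = /\bname\s*=\s*"([^"]*)"/i.exec(tag)?.[1];
    const value = /\bvalue\s*=\s*"([^"]*)"/i.exec(tag)?.[1] ?? "";
    if (name) fields.set(name, value);
  }
  return fields;
}

/** Form body equivalent to __doPostBack(eventTarget, eventArgument) on the given page. */
function postbackBody(html: string, eventTarget: string, eventArgument = ""): URLSearchParams {
  const body = new URLSearchParams();
  hiddenFields(html).forEach((value, name) => body.set(name, value));
  body.set("__EVENTTARGET", eventTarget); // overrides the page's empty placeholder
  body.set("__EVENTARGUMENT", eventArgument);
  return body;
}

/** Filename from a Content-Disposition header; ignores the RFC 5987 filename* form. */
function filenameFromDisposition(header: string | null): string | undefined {
  if (!header) return undefined;
  const match = /filename\s*=\s*(?:"([^"]*)"|([^;]+))/i.exec(header);
  return (match?.[1] ?? match?.[2])?.trim();
}
```

Each step in the chain would POST such a body to the case URL with the session cookies attached, then feed the returned HTML into the next postbackBody call; only the final AddButton POST returns file bytes instead of a page.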

Source layout

scraper/src/
├── cli.ts          yargs entry point — index / scrape / download
├── config.ts       base-url + path resolution
├── aspnet.ts       CookieJar, viewstate extraction, fetch wrapper, p-limit HTTP cap
├── parse.ts        HTML → typed records (CaseDetail, DocumentEntry, Representation)
├── index-step.ts   the search-result crawl
├── scrape.ts       per-case detail + paginated documents + representations
├── download.ts     single-doc + parallel all-docs-for-case + manifest writer
└── logger.ts       cli-progress wrapper that keeps log output above the bar

parser/src/energy_consents_parser/
├── models.py       Pydantic models mirroring the TS schema
├── extract.py      walker + kreuzberg extraction
└── cli.py          Cyclopts CLI

Notes

  • The site uses __VIEWSTATE heavily, so every browse step needs a fresh round-trip. Document downloads are particularly chatty: the response to an AddButton POST is the file body itself, so any subsequent download in the same case has to re-acquire viewstate by re-clicking the Documents tab and paging forward. downloadAllDocsForCase works around this by spawning a fresh per-doc session and parallelising via p-limit (subject to --http-concurrency).
  • HTTP concurrency is global — a single p-limit instance is shared by every worker. Per-case concurrency is layered on top.
  • Be polite. The defaults (--concurrency 4 --http-concurrency 6) keep the load gentle for a small public service. Don't crank them.
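
The two-layer concurrency model from the notes above can be sketched as one shared limiter for HTTP requests with a second limiter for cases layered over it. To keep the example self-contained, createLimiter below is a small hand-written stand-in for the p-limit package, not its implementation; scrapeCase, fetchPage, and the placeholder URLs are likewise hypothetical.

```typescript
// Minimal stand-in for p-limit, written out so the example has no dependencies.
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  // Start the oldest queued task if a slot is free.
  const next = (): void => {
    if (active >= max) return;
    const start = queue.shift();
    if (start) start();
  };

  function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      queue.push(() => {
        active++;
        const done = (): void => {
          active--;
          next(); // a slot freed up
        };
        // Wrapping in a Promise also captures synchronous throws from task().
        new Promise<T>((res) => res(task())).then(
          (value) => { done(); resolve(value); },
          (error) => { done(); reject(error); },
        );
      });
      next();
    });
  }

  return { limit, activeCount: () => active, pendingCount: () => queue.length };
}

const httpLimit = createLimiter(6); // one shared instance, like --http-concurrency 6
const caseLimit = createLimiter(4); // per-case layer, like --concurrency 4

// Hypothetical wiring: every request of every case passes through httpLimit.
function scrapeCase(ref: string, fetchPage: (url: string) => Promise<string>): Promise<string[]> {
  return caseLimit.limit(() =>
    Promise.all(
      ["detail", "documents"].map((tab) => httpLimit.limit(() => fetchPage(`${ref}/${tab}`))),
    ),
  );
}
```

Because httpLimit is a single instance, raising the per-case concurrency alone never pushes the number of in-flight requests above the HTTP cap; it only changes how those slots are shared between cases.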

License

MIT.

About

Scrapes all cases and documents from the Scottish energy consent website (energyconsents.scot).
