Parse and extract structured data from UK government documents: GOV.UK, Hansard, ICO, FCA, BAILII, and ATRS. A toolkit for research and governance analysis.

    pip install gov-doc-parser

Zero external dependencies: pure Python stdlib.

    from gov_doc_parser import GovDocParser

    parser = GovDocParser()

    # Parse any UK gov source (html_text is an HTML string you have already loaded)
    doc = parser.parse(html_text, source="ico")
    print(doc.title, doc.date, doc.metadata)

    # Auto-detect the source from the URL
    doc = parser.parse(html, source="auto", url="https://ico.org.uk/...")

    # Extract AI references with sentiment
    result = parser.parse_full(html, source="govuk")
    for ref in result.ai_references:
        print(f"[{ref.sentiment}] {ref.term}: {ref.context[:100]}")
    # [regulatory] algorithm: ...automated decision-making must comply with UK GDPR...
    # [negative] artificial intelligence: ...AI found to be unlawful under Equality Act...

    # Parse an ATRS record
    doc, atrs = parser.parse_atrs(atrs_html)
    print(atrs.system_name, atrs.governance_score)  # 0-100 transparency score
    print(atrs.dpia_completed, atrs.human_review, atrs.legal_basis)

    # Batch
    results = parser.batch_parse([
        {"html": govuk_html, "source": "govuk"},
        {"html": ico_html, "source": "ico"},
    ])

| Source | Extracts |
|---|---|
| GOV.UK | Title, date, department, document type, sections |
| Hansard | Date, house, speaker, debate text, AI mentions |
| ICO | Enforcement type, penalty amount, decision text |
| FCA | Document type (Dear CEO letter, Policy Statement, Consultation Paper), FCA reference |
| BAILII | Citation, court, judge, judgment text |
| ATRS | System name, risk tier, DPIA status, governance score |
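
The `source="auto"` mode shown earlier presumably keys off the URL. The following is a minimal standard-library sketch of that idea, not the package's actual implementation. The `detect_source` helper, the `HOST_TO_SOURCE` mapping, the ATRS path prefix, and the `"hansard"`, `"fca"`, `"bailii"` and `"atrs"` source keys are assumptions made for illustration; only `"govuk"`, `"ico"` and `"auto"` appear in the examples here.

```python
from urllib.parse import urlparse

# Hypothetical hostname-to-source mapping; the real package may differ.
HOST_TO_SOURCE = {
    "gov.uk": "govuk",
    "hansard.parliament.uk": "hansard",
    "ico.org.uk": "ico",
    "fca.org.uk": "fca",
    "bailii.org": "bailii",
}

# Assumed path prefix for ATRS records published on GOV.UK.
ATRS_PATH_PREFIX = "/algorithmic-transparency-records"


def detect_source(url):
    """Guess a source key from a URL, or return None if unrecognised."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Treat "www.gov.uk" and "gov.uk" as the same host.
    if host.startswith("www."):
        host = host[4:]
    source = HOST_TO_SOURCE.get(host)
    # ATRS records share the GOV.UK host, so they are told apart by path.
    if source == "govuk" and parts.path.startswith(ATRS_PATH_PREFIX):
        return "atrs"
    return source


print(detect_source("https://ico.org.uk/some/page"))
print(detect_source("https://www.gov.uk/algorithmic-transparency-records/example"))
print(detect_source("https://example.com/"))
```

Returning `None` for an unknown host, rather than guessing, lets the caller decide whether to fall back to an explicit `source=` argument.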

    gov-doc-parser document.html --source ico
    gov-doc-parser document.html --source govuk --ai-refs
    gov-doc-parser atrs_record.html --atrs
    gov-doc-parser document.html --json
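
For scripted use, the CLI can be wrapped from Python. This is a minimal sketch under several assumptions that the text here does not confirm: that the `gov-doc-parser` executable is on `PATH`, that `--json` can be combined with the other flags, and that it prints a single JSON document to stdout. `build_command` and `run_cli` are hypothetical helper names.

```python
import json
import subprocess


def build_command(path, source=None, ai_refs=False, atrs=False):
    """Assemble an argv list from the flags shown in the CLI examples."""
    cmd = ["gov-doc-parser", path]
    if atrs:
        cmd.append("--atrs")
    elif source:
        cmd += ["--source", source]
    if ai_refs:
        cmd.append("--ai-refs")
    cmd.append("--json")  # assumption: --json combines with the other flags
    return cmd


def run_cli(path, **options):
    """Run the CLI and decode its stdout as JSON (assumed output format)."""
    proc = subprocess.run(
        build_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


# Only builds the argument list; nothing is executed here.
print(build_command("document.html", source="govuk", ai_refs=True))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.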