obielin/gov-doc-parser

gov-doc-parser

Parse and extract structured data from UK government documents: GOV.UK, Hansard, ICO, FCA, BAILII, and ATRS. A toolkit for research and governance analysis.

Install

pip install gov-doc-parser

Zero external dependencies: the package uses only the Python standard library.

Quick start

from gov_doc_parser import GovDocParser

parser = GovDocParser()

# Parse any UK gov source (html_text is the document's HTML, already loaded as a string)
doc = parser.parse(html_text, source="ico")
print(doc.title, doc.date, doc.metadata)

# Auto-detect from URL
doc = parser.parse(html, source="auto", url="https://ico.org.uk/...")

# Extract AI references with sentiment
result = parser.parse_full(html, source="govuk")
for ref in result.ai_references:
    print(f"[{ref.sentiment}] {ref.term}: {ref.context[:100]}")
# [regulatory] algorithm: ...automated decision-making must comply with UK GDPR...
# [negative] artificial intelligence: ...AI found to be unlawful under Equality Act...

# Parse ATRS record
doc, atrs = parser.parse_atrs(atrs_html)
print(atrs.system_name, atrs.governance_score)  # 0-100 transparency score
print(atrs.dpia_completed, atrs.human_review, atrs.legal_basis)

# Batch
results = parser.batch_parse([
    {"html": govuk_html, "source": "govuk"},
    {"html": ico_html, "source": "ico"},
])
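
The batch call above takes a list of dicts with `html` and `source` keys. As a minimal sketch of how such a list could be assembled from a folder of saved pages, the helper below reads every `.html` file in a directory. The name `build_batch` and the directory layout are assumptions for illustration and are not part of gov-doc-parser.

```python
from pathlib import Path


def build_batch(directory, source):
    """Return one {"html": ..., "source": ...} dict per .html file in `directory`.

    Hypothetical helper: it only prepares input in the shape the Quick start
    passes to `batch_parse`; it does not call the library itself.
    """
    items = []
    # Sort so the batch order is stable across runs.
    for path in sorted(Path(directory).glob("*.html")):
        # Assume UTF-8 and replace undecodable bytes instead of raising.
        html = path.read_text(encoding="utf-8", errors="replace")
        items.append({"html": html, "source": source})
    return items
```

The returned list could then be passed straight to `parser.batch_parse(...)`, assuming every file in the folder comes from the same source.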

Supported sources

| Source | Extracts |
| --- | --- |
| GOV.UK | Title, date, department, document type, sections |
| Hansard | Date, house, speaker, debate text, AI mentions |
| ICO | Enforcement type, penalty amount, decision text |
| FCA | Document type (Dear CEO/PS/CP), FCA reference |
| BAILII | Citation, court, judge, judgment text |
| ATRS | System name, risk tier, DPIA status, governance score |
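
The Quick start mentions `source="auto"` with a `url` argument, but the README does not describe how detection works. The sketch below shows one way the sources listed above could be told apart by hostname, using only the standard library. Everything in it is an assumption for illustration: the function name `guess_source`, the hostname table, the source keys other than `"govuk"` and `"ico"`, and the GOV.UK path prefix used for ATRS records.

```python
from urllib.parse import urlparse

# Hypothetical hostname-to-source table. Only "govuk" and "ico" appear as
# source keys in the Quick start; the other keys are assumed.
HOST_SOURCES = {
    "gov.uk": "govuk",
    "hansard.parliament.uk": "hansard",
    "ico.org.uk": "ico",
    "fca.org.uk": "fca",
    "bailii.org": "bailii",
}

# Assumed path under which ATRS records are published on GOV.UK.
ATRS_PATH_PREFIX = "/algorithmic-transparency-records"


def guess_source(url):
    """Return a source key for `url`, or None if the host is not recognised."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Treat "www.example.org" and "example.org" as the same host.
    if host.startswith("www."):
        host = host[4:]
    source = HOST_SOURCES.get(host)
    # ATRS records live on GOV.UK, so they are distinguished by path.
    if source == "govuk" and parts.path.startswith(ATRS_PATH_PREFIX):
        return "atrs"
    return source
```

This version matches hosts exactly, so other subdomains of a listed site would return `None`; a suffix match would be more permissive at the cost of false positives.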

CLI

gov-doc-parser document.html --source ico
gov-doc-parser document.html --source govuk --ai-refs
gov-doc-parser atrs_record.html --atrs
gov-doc-parser document.html --json

Linda Oraegbunam | LinkedIn | GitHub
