# Using NLP for metadata extraction from memo-style documents

### Sample text

In [6]:
sample_text = """
From: Derek Murphy – Cabinet Member for Economic Development
Simon Jones, Corporate Director for Growth, Environment and
Transport
To: Cabinet – 4 March 2025
Subject: Adoption of the Kent Minerals and Waste Local Plan 2024-2039
Classification: Unrestricted
Past Pathway of Paper: N/A
Future Pathway of Paper: County Council
Electoral Division: Countywide
Summary: The County Council has a statutory responsibility to plan for future
minerals supply and waste management within Kent. As a result, the Kent
Minerals and Waste Local Plan 2013-30 was adopted by County Council in July
2016 with some limited changes adopted in September 2020. The Kent Minerals
and Waste Local Plan contains planning policies relating to minerals supply and
waste management against which the Council assesses planning applications for
these types of development. In addition, the Kent Mineral Sites Plan (KMSP)
(adopted in September 2020) identifies three sites suitable for the quarrying of
sand and gravel.
Regulations require local plans to be reviewed every five years and a review of the
adopted Kent Mineral and Waste Local Plan in 2021 concluded a need for
updates including to its Vision, Strategic Objectives, policies and supporting text to
reflect changes in national and local policy and guidance since 2016. These
include changes to the National Planning Policy Framework, government policy on
climate change, protection and enhancement of the natural environment and
achievement of a circular economy.
In December 2023, following three rounds of public consultation between 2021
and 2023, and consideration by Environment and Transport Cabinet Committee
(ETCC), County Council agreed that the ‘Pre-Submission’ Draft of the Kent
Minerals and Waste Local Plan 2024 to 2039 should be published for
representations on soundness and legality and submitted to the Secretary of State
for independent examination. The Plan, its evidence base and the 58
representations received were submitted in May 2024. The examination was
required to ensure that the Plan is sound and prepared in accordance with
statutory requirements relating to plan-making. On 10 September 2024, Planning
Inspector Joanne Burston BSc MA MRTPI AIPROW commenced hearings
associated with the examination which ran for four days. During the examination
the Inspector identified the need for certain modifications, and these were subject
to public consultation.
On 6th February 2025, the Council received the Inspector’s Report (see Appendix
A) which concludes that, subject to modifications, the Plan is sound and legally
compliant. Following receipt of the Inspector’s Report, Council is now able to
adopt the Plan (see Appendix B) subject to the modifications being made. The
modifications clarify the wording of certain policies which includes confirming
safeguards to the environment and communities associated with waste and
minerals development. Following adoption, policies of the updated KMWLP will be
monitored to assess whether they are being effective in meeting the local plan
objectives on waste management and minerals supply.
Recommendation(s):
Cabinet is asked to
(i) NOTE the Inspector’s Report (see Appendix A) on the examination of the
Kent Minerals and Waste Local Plan 2024-2039 (KWMLP);
(ii) NOTE the recommendations of the Sustainability Appraisal of the KMWLP
(Appendix D); and,
(iii) ENDORSE the Cabinet Member’s proposal to recommend the KMWLP
(Appendix B), as modified, to County Council for Approval and Adoption.
1. Introduction and Background
1.1 As the minerals and waste planning authority for Kent, the County Council is
required to prepare and maintain planning policy concerning waste
management and minerals supply in the County. The Kent Minerals and
Waste Local Plan 2013-30 was adopted by the Council in July 2016 and sets
out the strategy and policy framework for minerals and waste development in
Kent which includes future capacity and supply requirements. The Kent
Minerals and Waste Local Plan, together with the Kent Mineral Sites Plan,
forms part of the Development Plan for Kent which is key, both for the
determination of planning applications for minerals and waste development
by the County Council, and applications relating to other development that
may affect minerals and waste development or other aspects determined by
the District and Borough Councils in Kent.
1.2 Following its adoption, the Kent Minerals and Waste Local Plan 2016 was
subject to an ‘Early Partial Review’ and changes to the Plan resulting from
this review were adopted by the Council in September 2020. Also in
September 2020, the Council adopted the Kent Mineral Sites Plan.
1.3 The National Planning Policy Framework (NPPF) and associated legislation
states policies in Local Plans should be reviewed at least once every five
years to assess whether they need updating and should then be updated, as
necessary. A review of the Vision, Strategic Objectives and policies in the
Kent Minerals and Waste Local Plan was undertaken in 2021 that concluded
a need for updates to the Plan in response to relevant Government policy
and legislation published since the Plan was adopted in 2016. The review
also identified changes to the local context requiring further updates to be
made.
1.4 The process of updating the Plan needs to follow that set out in the Planning
and Compulsory Purchase Act 2004 and the Town and Country Planning
(Local Planning) (England) Regulations 2012 (‘the plan making regulations’)
as well as the NPPF and Planning Practice Guidance. In line with the
legislation and guidance, updates to the Plan were proposed which
communities and relevant stakeholders were consulted upon in accordance
with the Council’s Statement of Community Involvement. The table below
outlines the consultation that took place. Additional consultation also took
place with stakeholders on matters of particular concern to them e.g. Natural
England.
1.5 The Environment and Transport Cabinet Committee and related Cabinet
Members considered the draft updates to the Kent Minerals and Waste Local
Plan prior to their publication for public consultation. In addition, from time to
time an informal members’ group met to oversee the work on updating the
Plan.
Consultation Dates Summary
Initial consultation
with key
stakeholders
26th March 2021 -
9th April 2021
(14 days)
Initial evidence gathering to
determine which parts of the
Plan needed updating
‘Regulation 18’’
public consultation
on Kent Minerals
and Waste Local
Plan Refresh
16th December
2021 - 9th February
2022
(8 weeks (over
Christmas period))
Consultation on proposed
changes to the KMWLP’s vision,
objectives, polices and
supporting text in light of
government policy and legislation
published since 2016.
Second
‘Regulation 18’
public consultation
on draft Kent
Minerals and
Waste Local Plan
2023-38
24th October 2022 -
5th December 2022
(6 weeks)
Consultation on a further draft
updated KMWLP with changes
including, amongst other matters,
extending the plan period to
2038 and changes to policies
CSW 8, 12 and 17 and the
removal of the strategic mineral
site at Holborough (CSM 3).
Third ‘Regulation
18’ public
consultation on
Further Proposed
Changes to the
13th June - 25th
July 2023
(6 weeks)
Consultation focused on further
proposed changes to KMWLP
including: Extending the plan
period to 2039; changes to
aggregate provision (CSM2);
Kent Minerals and
Waste Local Plan
removal of the Norwood Quarry
strategic waste site (CSW5);
and, removal of a commitment to
make specific provision for the
management of certain waste
produced in London.
1.6 Comments on the proposed changes to the Kent Minerals and Waste Local
Plan were received at each public consultation stage and taken into account
during the preparation of the updated Plan.
1.7 A final draft version of the updated Plan, known as the ‘Pre-Submission Draft
Kent Minerals and Waste Local Plan’ was considered by the Environment
and Transport Cabinet Committee, the Cabinet Members with responsibility
for the Local Plan and subsequently County Council in December 2023. This
Plan included changes to the adopted Plan which can be summarised as
follows:
• Updates to the National Planning Policy Framework in 2018,
2019 and 2021 and associated Planning Practice Guidance;
• legislation and policy concerning: The need to adapt to, and
mitigate, climate change; and, associated low carbon
growth;
• policy and legislation concerned with achieving a circular
economy where more waste is prevented or reused;
• adoption by the County Council of the Kent Environment
Strategy and Kent and Medway Energy and Low Emissions
Strategy;
• extending the plan period to 2039;
• updates to aggregate requirements in Policy CSM2 and
waste management targets in Policy CSW4;
• deletion of Policy CSM5 that allocates a strategic site for
minerals (as planning permission has been granted);
• removal of a strategic site allocation at Norwood Quarry,
Sheppey for the landfill of hazardous waste specifically
incinerator fly ash (Policy CSW5);
• a recognition within supporting text of the need for the
development of additional capacity for the management of
household waste identified by the Waste Disposal Authority;
• removal of a commitment to make specific provision for the
management of residual non-hazardous waste by landfill or
energy recovery that arises in London;
• Changes to Policy CSW17 relating to waste management at
Dungeness were made to ensure that it is consistent with
national policy.
• a change to Policy DM3 seeking the achievement of
maximum biodiversity net gain on the basis that restoration
of quarries can often easily result in much greater than the
statutory minimum of 10% and Kent Nature Partnership
preferred level of 20%;
• changes to settlement boundaries affecting the extent of
areas identified in the Kent Minerals and Waste Local Plan
where the presence of economic minerals needs to be taken
into account before surface development can take place.
These areas are known as ‘Mineral Safeguarding Areas;
• changes to the monitoring framework to ensure it properly
reflects the updated policies; and,
• further changes intended to improve the clarity of the Plan’s
wording and hence the meaning of certain objectives and
policies.
Independent Examination
1.8 Before the updated Plan can be adopted by the Council, it must be submitted
to the Secretary of State for independent Examination by a Government-
appointed inspector. The examination is to determine whether the Plan is
sound and has been prepared in accordance with statutory plan making
requirements. The National Planning Policy Framework (NPPF) defines a
‘sound’ local plan as one that is:
a) Positively prepared – provides a strategy which, as a minimum,
seeks to meet the area’s objectively assessed need; and is informed
by agreements with other authorities, so that unmet need from
neighbouring areas is accommodated where it is practical to do so
and is consistent with achieving sustainable development;
b) Justified – an appropriate strategy, taking into account the
reasonable alternatives, and based on proportionate evidence;
c) Effective – deliverable over the plan period, and based on effective
joint working on cross-boundary strategic matters that have been dealt
with rather than deferred, as evidenced by the statement of common
ground; and,
d) Consistent with national policy – enabling the delivery of
sustainable development in accordance with the policies in the NPPF
and other statements of national planning policy, where relevant.
1.9 Prior to its submission to the Secretary of State, the Plan was published for a
statutory minimum six-week period1 providing an opportunity for communities
and other stakeholders to provide views on whether they thought the Plan
1 Regulation 19 of the Town and Country Planning (Local Planning) (England) Regulations 2012 (as
amended).
was sound and legally compliant. In response to this consultation 58
representations were received.
1.10 The updated Plan, evidence base documents and the representations were
submitted in May 2024 and the Secretary of State appointed Planning
Inspector Joanne Burston BSc MA MRTPI AIPROW to examine the Plan.
The Inspector convened public hearings for four days in September 2024.
1.11 During the examination, the Inspector considered all the representations
received. At the request of the Council, the Inspector considered a small
number of changes needed to ensure soundness of the Plan (known as
‘main modifications’). In addition, a number of minor modifications (known as
additional modifications) were also proposed. These latter seek to improve
the clarity of the plan and addressed presentation and typographical or
factual changes such as revised government department names. These
additional modifications were not necessary to address soundness and legal
compliance matters. The modifications were discussed with the Council and
representors during the hearings as well as the KCC Cabinet Member for
Economic Development.
1.12 Following the hearings, the proposed main modifications were published for
representations over a six-week period from 17 October 2024 to 28
November 2024. 25. Representations were received from 25 parties during
the consultation which were considered by the Inspector, but these did not
result in any further main modifications. A number of minor (additional)
modifications were however proposed to address comments raised. A copy
of the Plan showing the modifications arising from the inspector’s
examination as tracked changes is included at Appendix C.
1.13 The Council received the report of the Inspector on 6th February 2025 and
this Cabinet report provides a summary and details of next steps.
2. The Inspector’s Report
2.1 The Inspector’s Report is included in Appendix A and this confirms that,
subject to modifications, the submitted Kent Minerals and Waste Local Plan
2024-39 is ‘sound’ and has been prepared in accordance with statutory plan
making requirements.
2.2. The modifications are set out in Appendix 1 of the Inspector’s Report with
justification included in the body of her report. The main modifications
concerned tightening up the wording of the Plan in a number of areas as
follows:
• Changes to ensure consistency with current terminology,
legislation, policies, and guidance (e.g., replacing "Areas of
Outstanding Natural Beauty (AONB)" with "National
Landscapes"; reference to the latest Kent Joint Municipal
Waste Management Strategy);
• updates to references to organisations which have an
interest in the Plan e.g. addition of reference to Ebbsfleet
Development Corporation and Kent and Medway Economic
Partnership (KMEP);
• update to footnotes to clarify references to relevant
documents;
• changes to supporting text and policies to ensure the Plan’s
intentions are clear e.g. deletion of reference to ‘Mineral
Consultation Areas’ as these are the same as ‘Mineral
Safeguarding Areas’ and use of a separate term may cause
confusion;
• Amendments to supporting text and Policies: CSW 3; CSW
4; CSW 6; CSW 15; DM 2; DM 3; DM 4; DM 7; DM 10; DM
13; DM 14; DM 17; DM19; DM 20; and DM 22 to ensure that
they are effective and/or consistent with national policy.
• Specific changes to policy included:
o A change to Policy CSW15 to reflect potential need
for upgrades to wastewater treatment works to control
releases of nitrates and phosphates to watercourses;
o Updates to Policy DM2 to ensure consistency with
statutory requirements in relation to protection of
landscapes; internationally and nationally designated
habitats;
o Updates to Policy DM10 to ensure consistency with
Environment Agency requirements concerning the
protection of the water environment;
2.3 The modifications are taken into account in the text of the Plan provided in
Appendix B to this report. Importantly, the modifications do not alter the
objectives or intentions of policy. The Plan in Appendix B will upon adoption
become the published Kent Minerals and Waste Local Plan Strategy upon
which planning decisions in the County will be determined.
2.4 The additional modifications referred to above were also published for
information alongside the main modifications.
3. Strategic environmental assessment and sustainability appraisal
3.1 During its preparation, the Kent Minerals and Waste Local Plan 2024-39 was
subject to sustainability appraisal (SA) (incorporating strategic environmental
assessment (SEA)). The SA report provides an assessments of impacts
(both beneficial and detrimental) on environmental, social and economic
objectives which are expected to arise from development consistent with the
Kent Minerals and Waste Local Plan 2024-39. The SA also considered
reasonable alternatives to the proposals in the Kent Minerals and Waste
Local Plan 2024-39. The recommendations from the SA were taken into
account as the Plan was prepared.
3.2 A non-technical summary of the SA of the Kent Minerals and Waste Local
Plan 2024-39 (with modifications recommended by the Inspector) is included
in Appendix D. The full SA report is available on the Council’s website here .
The inspector concluded that the SA was in line with the legal requirements
4. Adoption
4.1 In accordance with Section 23 (3) of the Planning and Compulsory Purchase
Act 2004, having received a report confirming the soundness and legality of
the Kent Minerals and Waste Local Plan 2024-39, provided the Council
makes the modifications recommended by the Inspector, it may now adopt
the Kent Minerals and Waste Local Plan 2024-39 as updated planning policy
for minerals supply and waste management in Kent.
4.2 The new and revised policy will be used by the County Council when
determining planning applications related to proposals for waste
management and minerals supply. It will ensure that planning decisions in
Kent will be made in accordance with national policy and have regard to local
policy considerations, such as the Kent Environment Strategy and the Kent
and Medway Energy and Low Emissions Strategy. The updated policies
concerning mineral and waste safeguarding and the circular economy will
also be used by District and Borough Councils when determining applications
for non-waste and mineral development. It should be noted that the adopted
Kent Mineral Sites Plan remains in place and information concerning a
related potential change to that Plan is included below (see paragraphs 4.3
and 6.2-6.4).
4.3 Members may recall that there is considerable objection to the Mineral Sites
Plan work and in particular the merits of the nominated hard rock site at
Oaken Wood, Aylesford, that has been submitted in response to the
Council’s ‘call for sites’ as part of the Sites Plan work. For the avoidance of
doubt, the Kent Mineral and Waste Local Plan 2024-39 before Cabinet
makes no decision in relation to the Oaken Wood site. This is a matter for
the separate Mineral Sites Plan work. Work on the review of this Plan is
ongoing and currently subject to detailed technical assessment of the
submitted hard rock site. Until that assessment is complete, no decision can
be taken on whether the site should be allocated or not. Once the
assessment is complete, a report will be considered by a future Growth,
Economic Development and Communities Cabinet Committee. The KMWLP
before Cabinet for adoption provides the strategy for minerals supply
including the quantity of minerals required, not where sites to meet this need
are to be allocated.
5. Kent Minerals and Waste Local Plan 2024-39 - Next Steps
5.1 Following consideration by Growth, Economic Development and
Communities Cabinet Committee (6th March 2025) and Cabinet, County
Council will be asked to agree that the Kent Minerals and Waste Local Plan
2024-39 be adopted as updated waste and minerals planning policy for Kent.
In accordance with Regulation 26 of the Town and Country Planning (Local
Planning) (England) Regulations 2012 (as amended) stakeholders will be
notified of the Council’s adoption of the updated planning policy.
5.2 Prior to final publication of the Plan, minor non-material changes (e.g.
changes related to format and grammar) may be needed, and it is proposed
if required that the agreement to such changes be delegated to the
Corporate Director for Growth, Environment and Transport, in consultation
with the Cabinet Member for Economic Development.
5.3 Following adoption there is a six-week period for legal challenges. To be
successful any such challenge would need to demonstrate that the Kent
Minerals and Waste Local Plan 2024-39 has not been prepared in
accordance with the relevant legislation.
5.4 Once adopted, policies in the Plan will be implemented and monitoring will be
undertaken to assess the effect of the policies. Legislation requires a review
of planning policy every five years and so the outcome of a review of Kent
Minerals and Waste Local Plan 2024-39 policies will be required by 2030.
6. Financial Implications
6.1 The costs of preparing and adopting the Kent Minerals and Waste Local Plan
2024-39 has been met from the existing Planning Applications Group
budget.
6.2 Implementation of the Plan will ensure the wider Kent economy continues to
benefit from the sustainable management of waste and supply of minerals
within its area. For example, costs of waste management and mineral supply
to businesses in Kent would be higher if a Plan was not in place which does
not clearly state how and where waste can be managed and minerals
supplied in Kent. It would also assist in measures to address fly-tipping by
providing adequate capacity and facilities to manage Kent’s waste.
7. Policy Framework
7.1 Updating minerals and waste planning policies takes account of changes since
2016 to national planning policy and guidance and the County Council’s
corporate policies which are concerned with the way in which land is
developed in Kent. These include the Kent Environment Strategy, the Kent
and Medway Energy and Low Emissions Strategy and the Kent’s Plan Bee
Pollinator Action Plan. In light of the timing, there is no requirement for this
Plan to take account of changes made to the National Planning Policy
Framework in December 2024 and in any event these changes were focussed
mainly on the provision of housing.
7.2 The adoption of the Kent Minerals and Waste Local Plan supports the County
Council’s strategy, Framing Kent’s Future 2022-2026. In particular, the
KMWLP helps facilitate the key strategic priorities of an Environmental Step
Change and Infrastructure for Communities by supporting the delivery of
sustainable growth in Kent’s economy. The Plan recognises Kent’s
environment as a core asset and seeks to adapt to, and mitigate, the impacts
of climate change and assist in the delivery of net zero objectives. The
proposed planning strategy reflects recent changes to the environmental
agenda including mitigation and adaptation to Climate Change and Kent’s
Climate Change Statement, the circular economy and biodiversity.
7.3 The Local Plan work is a statutory requirement as part of the Council’s town
planning responsibilities. The local plan work has been carried out in
accordance with Objective 3 of Securing Kent’s Future which seeks to ensure
that the Council prioritises its Best Value Statutory obligations.
8. Legal Implications
8.1 The County Council has a legal obligation under the Town and Country
Planning Acts to prepare a statutory Development Plan for planning purposes
(commonly known as the Local Plan).
8.2 The County Council is also required by national planning policy to ensure that
local plans promote sustainable minerals and waste development. The Kent
Minerals and Waste Local Plan 2024-39 plays an important role in ensuring
that minerals and waste development in Kent is in line with national planning
policy.
8.3 There is an expectation by the Secretary of State for Housing, Communities
and Local Government that all planning authorities have an up-to-date Local
Plan in place. Without an up to date adopted plan, there is a risk that central
government will step in as the plan making authority, reducing local
accountability.
8.4 During its preparation, the Kent Minerals and Waste Local Plan 2024-39 has
been the subject of Strategic Environmental Assessment in accordance with
the Environmental Assessment of Plans and Programme Regulations 2004,
and an Appropriate Assessment in accordance with the Conservation of
Habitats and Species Regulations 2017.
8.5 The resulting Sustainability Appraisal and the Habitats Regulations
Assessment were published for consultation and taken into consideration
when making decisions with regard to the Kent Minerals and Waste Local
Plan 2024-39. These reports are available as background papers.
9. Equalities implications
9.1 An equality impact assessment (EQIA) has been completed, and no equality
implications have been identified. A copy of the assessment is attached at
Appendix E. The earlier Local Plan work was accompanied by a separate
EQIA.
10. Conclusion
10.1 The Town and Country Planning Acts requires the County Council to prepare
a Development Plan (local plan) setting out how mineral and waste planning
matters will be considered in Kent. The Kent Mineral and Waste Plan adopted
in July 2016 and partially updated in 2020 set out the overarching strategy and
vision until 2030.
10.2 In accordance with statutory requirements, a full review of the Kent Mineral
and Waste Local Plan was undertaken in 2021 which revealed the need for
updates to the Plan to bring it into line with Government and local policy,
legislation, and to reflect changes to other factors affecting future waste
management and mineral supply in Kent. Changes to the Plan were proposed,
consulted upon and ultimately examined by a Planning Inspector.
10.3 Before the updated Kent Mineral and Waste Local Plan can be adopted, the
Council must receive a report from the Planning Inspectorate (on behalf of the
Secretary of State) which states that the Plan is sound and has been prepared
in accordance with relevant legislation. On 6th February 2025, the Council
received the report of the Inspector who examined the updated KMWLP and
this states that the legislation was followed and that, subject to modifications
that were promoted and considered during the examination, the Kent Minerals
and Waste Local Plan 2024-39 is sound. Having received the Inspector’s
report, subject to the Council accepting the recommended modifications it can
now adopt the Plan.
11. Recommendation(s):
Cabinet is asked to:
(i) NOTE the Inspector’s Report (see Appendix A) on the examination of the
Kent Minerals and Waste Local Plan 2024-2039 (KWMLP);
(ii) NOTE the recommendations of the Sustainability Appraisal of the KMWLP
(Appendix D); and,
(iii) ENDORSE the Cabinet Member’s proposal to recommend the KMWLP
(Appendix B), as modified, to County Council for Approval and Adoption.
12. Contact details
Lead Officer:
Sharon Thompson – Head of Planning Applications Group
Phone number: 03000 413468 E-mail: sharon.thompson@kent.gov.uk
Lead Director:
Stephanie Holt-Castle – Director of Growth and Communities
Phone number: 03000 412064 Email: stephanie.holt-castle@kent.gov.uk
Appendix A:
Planning Inspector’s Report on the Examination of the Kent Minerals and
Waste Local Plan 2024-39 (including appendix 1 Schedule of Main
Modifications)
Appendix B:
Kent Minerals and Waste Local Plan 2024-39 (as modified by the Inspector’s
recommendations) – the Plan for adoption
Appendix C:
KMWLP 2024-39 showing modifications as tracked - March 2025
Appendix D:
KMWLP 2024-39 Sustainability Appraisal Non-Technical Summary
The main document is available via this hyperlink. KMWLP 2024-39
Sustainability Appraisal
Appendix E:
Kent Minerals and Waste Local Plan 2024-39 (as modified by the Inspector’s
recommendations) – Equality Impact Assessment Equality Impact
Assessment
Background Documents
The supporting documents to the Mineral and Waste Local Plan work are
available on the Council’s website as part of the Examination library via this
link here.
    """
    

### Using non-LLM NLP methods (tested in Colab)
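
Before the full script, it may help to see the memo header on its own. The sample's header is a run of `Label: value` fields, and a value can wrap over several lines (the `From:` field spans three). The sketch below is one possible way to handle that, not the method used in the script: it treats each known label at the start of a line as a boundary and joins whatever lies between two boundaries. The `HEADER_FIELDS` list and the `parse_memo_header` helper are hypothetical names introduced here, with the labels copied from the sample memo.

```python
import re

# Hypothetical list of header labels, copied from the sample memo above
HEADER_FIELDS = [
    "From", "To", "Subject", "Classification",
    "Past Pathway of Paper", "Future Pathway of Paper",
    "Electoral Division", "Summary",
]

def parse_memo_header(text, fields=HEADER_FIELDS):
    """Return {label: value} for header fields, joining wrapped lines.

    The last label found only marks where the previous field ends,
    so its own (possibly very long) value is not captured.
    """
    label_re = re.compile(
        r"^(" + "|".join(re.escape(f) for f in fields) + r"):[ \t]*",
        re.MULTILINE,
    )
    matches = list(label_re.finditer(text))
    header = {}
    for current, following in zip(matches, matches[1:]):
        raw_value = text[current.end():following.start()]
        # Collapse line breaks and repeated spaces; keep the first occurrence of a label
        header.setdefault(current.group(1), " ".join(raw_value.split()))
    return header

# Small demo using the first lines of the sample memo
demo = """From: Derek Murphy – Cabinet Member for Economic Development
Simon Jones, Corporate Director for Growth, Environment and
Transport
To: Cabinet – 4 March 2025
Subject: Adoption of the Kent Minerals and Waste Local Plan 2024-2039
Classification: Unrestricted
Summary: The County Council has a statutory responsibility to plan
"""
header = parse_memo_header(demo)
for label, value in header.items():
    print(f"{label}: {value}")
```

Calling `parse_memo_header(sample_text)` should work the same way on the full sample, on the assumption that none of the labels reappears at the start of a later line followed by a colon. Splitting on label boundaries rather than on single newlines is what keeps the second author in the `From` value.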

In [10]:
### Extracting metadata from council documents
### and analyzing document structure
# This script extracts metadata from council documents, analyzes the document structure,
# and generates visualizations of the document statistics.
# It uses libraries such as pandas, spacy, dateparser, BeautifulSoup, textract, nltk, matplotlib, and seaborn.
# The script is designed to work with various file formats including text, PDF, Word, and HTML.


# Import necessary libraries
import os
import re
import pandas as pd
import spacy
from spacy.matcher import Matcher
import dateparser
from bs4 import BeautifulSoup
import textract
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns

# Make sure to install required packages:
# pip install pandas spacy dateparser beautifulsoup4 textract nltk matplotlib seaborn

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

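# Load spaCy model: prefer the large model, fall back to downloading the small one
try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Large model not found; installing small spaCy model...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")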

def extract_council_document_metadata(text):
    """
    Extract metadata from council document text
    """
    metadata = {}
    
    # Extract document title
    title_pattern = r"Subject:\s*(.*?)(?:\n|Classification:)"
    title_match = re.search(title_pattern, text, re.DOTALL)
    if title_match:
        metadata['document_title'] = title_match.group(1).strip()
    
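    # Extract authors/sources
    # Note: only the first line of the "From:" block is captured, so authors listed on
    # following lines are not included; splitting on the dash yields [name, role]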
    from_pattern = r"From:\s*(.*?)(?:\n|To:)"
    from_match = re.search(from_pattern, text, re.DOTALL)
    if from_match:
        metadata['authors'] = [name.strip() for name in from_match.group(1).split('–')]
    
    # Extract recipients
    to_pattern = r"To:\s*(.*?)(?:\n|Subject:)"
    to_match = re.search(to_pattern, text, re.DOTALL)
    if to_match:
        metadata['recipients'] = to_match.group(1).strip()
    
    # Extract classification
    classification_pattern = r"Classification:\s*(.*?)(?:\n|Past Pathway)"
    classification_match = re.search(classification_pattern, text, re.DOTALL)
    if classification_match:
        metadata['classification'] = classification_match.group(1).strip()
    
    # Extract contact information
    contact_section_pattern = r"Contact details(.*?)(?:\d{1,2}\.\s+\w+|\Z)"
    contact_section_match = re.search(contact_section_pattern, text, re.DOTALL | re.IGNORECASE)
    
    if contact_section_match:
        contact_section = contact_section_match.group(1)
        
        # Extract email addresses
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        metadata['emails'] = re.findall(email_pattern, contact_section)
        
        # Extract phone numbers
        phone_pattern = r'\b\d{5}\s*\d{6}\b'
        metadata['phone_numbers'] = re.findall(phone_pattern, contact_section)
        
        # Extract names and titles
        name_pattern = r'(?:Lead\s+Officer|Lead\s+Director):\s*(.*?)(?:\n|Phone)'
        metadata['contact_names'] = re.findall(name_pattern, contact_section)
    
    # Extract dates using dateparser
    date_pattern = r'\b\d{1,2}(?:st|nd|rd|th)?\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b|\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{1,2}-\d{1,2}-\d{4}\b'
    potential_dates = re.findall(date_pattern, text, re.IGNORECASE)
    
    metadata['dates'] = []
    for date_str in potential_dates:
        parsed_date = dateparser.parse(date_str)
        if parsed_date:
            metadata['dates'].append(str(parsed_date.date()))
    
    # Extract plan periods
    plan_period_pattern = r'\b\d{4}[-–]\d{2,4}\b'
    metadata['plan_periods'] = re.findall(plan_period_pattern, text)
    
    # Extract section headings
    section_pattern = r'\n\d+\.\d*\s+([A-Z][A-Za-z\s]+)'
    metadata['section_headings'] = re.findall(section_pattern, text)
    
    # Extract recommendations
    recommendation_pattern = r'Recommendation\(s\):(.*?)(?:\d+\.\s+\w+|\Z)'
    recommendation_match = re.search(recommendation_pattern, text, re.DOTALL)
    if recommendation_match:
        rec_text = recommendation_match.group(1)
        item_pattern = r'\(\s*([ivxlcdm]+)\s*\)\s*([^()]+?)(?=\(\s*[ivxlcdm]+\s*\)|\Z)'
        metadata['recommendations'] = [(m.group(1), m.group(2).strip()) for m in re.finditer(item_pattern, rec_text, re.IGNORECASE)]
    
    # Extract appendices
    appendix_pattern = r'Appendix\s+([A-Z]):\s*(.*?)(?=Appendix\s+[A-Z]:|Background Documents|\Z)'
    metadata['appendices'] = [(m.group(1), m.group(2).strip()) for m in re.finditer(appendix_pattern, text, re.DOTALL)]
    
    # Extract hyperlinks/URLs
    url_pattern = r'https?://\S+'
    metadata['urls'] = re.findall(url_pattern, text)
    
    # Extract meeting information
    committee_pattern = r'(?:Committee|Cabinet Committee|Council):\s*([^,\n]+)'
    metadata['committees'] = re.findall(committee_pattern, text)
    
    # Use spaCy for named entity recognition
    doc = nlp(text)
    
    # Extract organizations
    metadata['organizations'] = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    
    # Extract locations
    metadata['locations'] = [ent.text for ent in doc.ents if ent.label_ == "GPE" or ent.label_ == "LOC"]
    
    # Extract consultation periods
    consultation_pattern = r'(?:Consultation|consultation)\s+(?:period|dates)[^\n]*?(\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4})\s*-\s*(\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4})'
    metadata['consultation_periods'] = []
    
    for match in re.finditer(consultation_pattern, text, re.IGNORECASE):
        start_date = dateparser.parse(match.group(1))
        end_date = dateparser.parse(match.group(2))
        if start_date and end_date:
            metadata['consultation_periods'].append({
                'start': str(start_date.date()),
                'end': str(end_date.date())
            })
    
    return metadata

def analyze_document_structure(text):
    """
    Analyze document structure and return statistical information
    without relying on NLTK
    """
    results = {}
    
    # Break into sections
    sections = re.split(r'\n\d+\.\d*\s+[A-Z]', text)
    results['num_sections'] = len(sections) - 1  # First split is preamble
    
    # Count paragraphs
    paragraphs = re.split(r'\n\s*\n', text)
    results['num_paragraphs'] = len(paragraphs)
    
    # Simple sentence tokenization using regex
    sentences = re.split(r'[.!?]\s+', text)
    results['num_sentences'] = len(sentences)
    
    # Simple word tokenization
    words = re.findall(r'\b\w+\b', text)
    results['num_words'] = len(words)
    
    # Average sentence length
    results['avg_sentence_length'] = results['num_words'] / max(results['num_sentences'], 1)
    
    return results

def visualize_document_stats(stats):
    """
    Create visualizations of document statistics
    """
    # Plot section counts
    plt.figure(figsize=(12, 6))
    metrics = ['num_sections', 'num_paragraphs', 'num_sentences', 'avg_sentence_length']
    values = [stats[m] for m in metrics]
    
    plt.bar(metrics, values)
    plt.title('Document Structure Statistics')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('document_stats.png')
    
    return 'document_stats.png'

def extract_and_analyze_from_file(file_path):
    """
    Extract text from file and analyze it
    """
    # For text files
    if file_path.endswith('.txt'):
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
    # For PDF files
    elif file_path.endswith('.pdf'):
        text = textract.process(file_path).decode('utf-8')
    # For docx files
    elif file_path.endswith('.docx'):
        text = textract.process(file_path).decode('utf-8')
    # For HTML files
    elif file_path.endswith('.html'):
        with open(file_path, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            text = soup.get_text()
    else:
        raise ValueError(f"Unsupported file format: {file_path}")
        
    # Extract metadata
    metadata = extract_council_document_metadata(text)
    
    # Analyze structure
    structure_stats = analyze_document_structure(text)
    
    return {
        'metadata': metadata,
        'structure_stats': structure_stats
    }

# Example usage
if __name__ == "__main__":
    # For demonstration only - replace with actual file path
    # file_path = "council_document.txt"
    
    # Reuse sample_text as defined in the "Sample text" cell above
    
    # Extract metadata
    metadata = extract_council_document_metadata(sample_text)
    print("Extracted Metadata:")
    for key, value in metadata.items():
        print(f"{key}: {value}")
    
    # Analyze structure
    structure_stats = analyze_document_structure(sample_text)
    print("\nDocument Structure Analysis:")
    for key, value in structure_stats.items():
        print(f"{key}: {value}")
    
    # Create visualization
    # visualize_document_stats(structure_stats)
    
    # Advanced: Generate report
    print("\nExample for full processing would include:")
    print("1. Loading document from disk")
    print("2. Extracting full text")
    print("3. Analyzing document structure")
    print("4. Extracting all metadata")
    print("5. Generating visualizations")
    print("6. Exporting results to CSV/Excel")

[nltk_data] Downloading package punkt to /Users/lgfolder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lgfolder/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Installing spaCy model...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Extracted Metadata:
document_title: Adoption of the Kent Minerals and Waste Local Plan 2024-2039
authors: ['Derek Murphy', 'Cabinet Member for Economic Development']
recipients: Cabinet – 4 March 2025
classification: Unrestricted
emails: ['sharon.thompson@kent.gov.uk', 'stephanie.holt-castle@kent.gov.uk']
phone_numbers: ['03000 413468', '03000 412064']
contact_names: ['Sharon Thompson – Head of Planning Applications Group', 'Stephanie Holt-Castle – Director of Growth and Communities']
dates: ['2025-03-04', '2024-09-10', '2025-02-06', '2021-03-26', '2021-04-09', '2021-12-16', '2022-02-09', '2022-10-24', '2022-12-05', '2023-07-25', '2024-10-17', '2024-11-28', '2025-02-06', '2025-03-06', '2025-02-06']
plan_periods: ['2024-2039', '2013-30', '2024-2039', '2013-30', '2023-38', '2024-39', '2024-39', '2024-39', '2024-39', '2024-39', '2024-39', '2024-39', '2024-39', '2024-39', '2024

### Cleanup extracted organisations

In [13]:
import re
import pandas as pd
from collections import Counter
import spacy
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from thefuzz import fuzz, process

# Helper functions for cleaning and normalizing organization names
def normalize_org_name(org_name):
    """Normalize organization names by removing articles and extra whitespace"""
    # Convert to string if not already
    if not isinstance(org_name, str):
        return ""
    
    # Remove leading 'the' and other articles
    cleaned = re.sub(r'^(the|a|an)\s+', '', org_name.lower(), flags=re.IGNORECASE)
    
    # Remove trailing apostrophe-s
    cleaned = re.sub(r"'s$", '', cleaned)
    
    # Remove newlines and extra whitespace
    cleaned = re.sub(r'\s+', ' ', cleaned.replace('\n', ' ')).strip()
    
    # Remove quotes
    cleaned = cleaned.replace("'", "").replace('"', '')
    
    # Remove trailing period
    cleaned = re.sub(r'\.$', '', cleaned)
    
    return cleaned

def is_valid_org(org_name, min_length=3):
    """Check if an organization name is likely valid"""
    if not isinstance(org_name, str):
        return False
    
    # Check minimum length
    if len(org_name.strip()) < min_length:
        return False
    
    # Check if it's just a single word that's likely not an organization
    if len(org_name.split()) == 1:
        non_org_terms = ['plan', 'council', 'report', 'recommendation', 'cabinet', 
                         'committee', 'sa', 'transport', 'strategy']
        if org_name.lower() in non_org_terms:
            return False
    
    # Exclude obvious non-organizations
    if re.match(r'^\d+(\.\d+)?$', org_name):  # Just numbers
        return False
    
    # Exclude policy codes
    if re.match(r'^[A-Z]{1,4}\s*\d+$', org_name):  # Like "DM 3"
        return False
    
    return True

def get_canonical_name(org_name, known_orgs):
    """Get the canonical name for an organization from a list of known organizations"""
    # First check exact matches after normalization
    norm_name = normalize_org_name(org_name)
    norm_known = {normalize_org_name(k): k for k in known_orgs}
    
    if norm_name in norm_known:
        return norm_known[norm_name]
    
    # Try fuzzy matching
    matches = []
    for known_norm, known_original in norm_known.items():
        ratio = fuzz.token_sort_ratio(norm_name, known_norm)
        if ratio > 85:  # Threshold for considering a match
            matches.append((known_original, ratio))
    
    if matches:
        matches.sort(key=lambda x: x[1], reverse=True)
        return matches[0][0]
    
    return org_name

def extract_and_clean_organizations(text):
    """Extract and clean organization names from text using spaCy"""
    # Load spaCy model
    try:
        nlp = spacy.load("en_core_web_sm")
    except OSError:  # raised by spacy.load when the model is not installed
        print("Please install spaCy and the English model with:")
        print("pip install spacy")
        print("python -m spacy download en_core_web_sm")
        return []
    
    # Process the text
    doc = nlp(text)
    
    # Extract all organization entities
    org_entities = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    
    # Clean the list
    clean_orgs = []
    for org in org_entities:
        # Basic cleaning
        cleaned_org = org.strip()
        
        # Skip if too short or invalid
        if not is_valid_org(cleaned_org):
            continue
            
        clean_orgs.append(cleaned_org)
    
    # Remove duplicates while preserving order
    unique_orgs = list(dict.fromkeys(clean_orgs))
    
    # Build a lookup of canonical names
    org_counter = Counter(clean_orgs)
    frequent_orgs = [org for org, count in org_counter.most_common() if count >= 2]
    
    # Get canonical names
    canonical_orgs = []
    for org in unique_orgs:
        canonical = get_canonical_name(org, frequent_orgs)
        canonical_orgs.append(canonical)
    
    # Final deduplication
    final_orgs = []
    seen = set()
    for org in canonical_orgs:
        norm = normalize_org_name(org)
        if norm not in seen and norm:
            seen.add(norm)
            final_orgs.append(org)
    
    return final_orgs

# Clean the given organizations from the document
original_orgs = ['Cabinet', 'Cabinet', 'The County Council']#, 'the Kent\nMinerals and Waste Local Plan 2013-30', 'County Council', 'Council', 'the Kent Mineral Sites Plan', 'Vision', 'the National Planning Policy Framework', 'Transport Cabinet Committee', 'County Council', 'the 'Pre-Submission' Draft', 'the Kent\nMinerals and Waste Local Plan 2024 to', 'State', 'Plan', 'MA MRTPI AIPROW', 'Council', 'the Inspector's Report', 'the Inspector's Report', 'Council', 'KMWLP', 'Recommendation(s', 'Cabinet', 'Kent Minerals and Waste Local Plan', 'KMWLP', 'Cabinet', 'KMWLP', 'County Council for Approval and Adoption', 'the County Council', 'Council', 'The Kent\nMinerals and Waste Local Plan', 'the Kent Mineral Sites Plan', 'the County Council', 'Borough Councils', 'the Kent Minerals and Waste Local Plan 2016', 'an 'Early Partial Review'', 'Plan', 'Council', 'Council', 'the Kent Mineral Sites Plan', 'The National Planning Policy Framework', 'NPPF', 'Local Plans', 'Kent Minerals and Waste Local Plan', 'Plan', 'Compulsory Purchase', 'the Town and Country Planning', 'Plan', 'the Council's Statement of Community Involvement', 'Transport Cabinet Committee', 'Cabinet', 'the Kent Minerals and Waste Local', 'KMWLP', 'KMWLP', 'KMWLP', 'Norwood', 'the Kent Minerals and Waste Local', 'the 'Pre-Submission Draft\nKent Minerals and Waste Local Plan', 'Transport Cabinet Committee', 'Cabinet', 'County Council', 'the National Planning Policy Framework', 'Planning Practice Guidance', 'the County Council of the Kent Environment\nStrategy', 'Medway Energy', 'Norwood Quarry', 'the Waste Disposal Authority', 'Policy DM3', 'the Kent Minerals and Waste Local Plan', 'Independent Examination', 'Council', 'State', 'Plan', 'The National Planning Policy Framework', 'NPPF', 'NPPF', 'State', 'period1', 'State', 'MA MRTPI AIPROW', 'Plan', 'Council', 'Plan', 'Council', 'KCC Cabinet', 'Plan', 'Council', 'Cabinet', 'Kent Minerals', 'Plan', 'Ebbsfleet\nDevelopment Corporation', '• Amendments', 'CSW', 'DM 2', 
'DM 3', 'DM 4', 'DM 7', 'DM 10', 'DM', 'DM 14', 'DM 17', 'DM 20', 'DM 22', 'Environment Agency', 'Kent Minerals and Waste Local Plan Strategy', 'the Kent Minerals and Waste Local Plan', 'SA', 'SA', 'Kent Minerals and Waste Local Plan', 'SA', 'the Kent Minerals and Waste\nLocal Plan', 'SA', 'the SA of the Kent Minerals and Waste Local\nPlan 2024-39', 'SA', 'Council', 'SA', 'Council', 'the Kent Minerals and Waste Local Plan', 'the County Council', 'the Kent Environment Strategy', 'Medway Energy', 'Borough Councils', 'Plan', 'Aylesford', 'Council's', 'Cabinet', 'the Oaken Wood', 'Economic Development', 'Communities Cabinet Committee', 'KMWLP', 'Cabinet', 'Growth, Economic Development', 'Communities Cabinet Committee', 'Cabinet', 'County\nCouncil', 'the Kent Minerals and Waste Local Plan', 'the Town and Country Planning (Local\nPlanning', 'Council', 'Plan', 'Cabinet', 'the Kent\nMinerals and Waste Local Plan 2024-39', 'Plan', 'Kent\nMinerals and Waste Local Plan', 'Financial Implications', 'the Kent Minerals and Waste Local Plan', 'Planning Applications Group', 'Plan', 'the County Council's', 'the Kent Environment Strategy', 'Medway Energy', 'the Kent's Plan Bee\nPollinator Action Plan', 'the National Planning Policy\nFramework', 'the Kent Minerals and Waste Local Plan', 'Council', 'KMWLP', 'Kent', 'Climate Change', 'Council', 'Securing Kent's Future', 'Council', 'Best Value Statutory', 'The County Council', 'Development Plan', 'The County Council', 'The Kent\nMinerals and Waste Local Plan 2024-39', 'State', 'Local Government', 'the Kent Minerals and Waste Local Plan', 'Strategic Environmental Assessment', 'Sustainability Appraisal', 'the Kent Minerals and Waste Local\nPlan 2024-39', 'Local Plan', 'EQIA', 'The Town and Country Planning Acts', 'the County Council', 'Plan', 'Plan', 'the Planning Inspectorate', 'State', 'Council', 'KMWLP', 'Council', 'Plan', 'Recommendation(s', 'Cabinet', 'Kent Minerals and Waste Local Plan', 'KMWLP', 'Cabinet', 'KMWLP', 'County Council 
for Approval and Adoption', 'Planning Applications Group\nPhone', 'Stephanie Holt-Castle', 'Communities', 'Sustainability', 'Council', 'Examination']

# Normalize and filter the organizations
clean_orgs = []
for org in original_orgs:
    if is_valid_org(org):
        clean_orgs.append(org)

# Create a set of unique normalized organizations
normalized_orgs = {normalize_org_name(org): org for org in clean_orgs}

# Group similar organizations
org_groups = {}
for norm_name, orig_name in normalized_orgs.items():
    # Skip empty names
    if not norm_name:
        continue
        
    # Check if similar to any existing group
    found_group = False
    for group_key in list(org_groups.keys()):
        if fuzz.token_sort_ratio(norm_name, group_key) > 85:
            org_groups[group_key].append(orig_name)
            found_group = True
            break
    
    # If not similar to any group, create a new group
    if not found_group:
        org_groups[norm_name] = [orig_name]

# Extract the genuine organizations
genuine_orgs = []
for group, variations in org_groups.items():
    # Skip common non-organizations
    if group in ['plan', 'council', 'cabinet', 'kmwlp', 'state', 'sa']:
        continue
    
    # Skip policy codes
    if re.match(r'^dm\s*\d+$', group):
        continue
        
    # Find the most representative name (longest)
    best_name = max(variations, key=len)
    genuine_orgs.append(best_name)

# Sort alphabetically
genuine_orgs.sort()

print("Cleaned and deduplicated organizations:")
for i, org in enumerate(genuine_orgs, 1):
    print(f"{i}. {org}")

# Create a function to categorize organizations
def categorize_organization(org_name):
    """Categorize an organization name into predefined groups"""
    lower_name = org_name.lower()
    
    # Government bodies
    if any(term in lower_name for term in ['council', 'cabinet', 'committee', 'planning']):
        return "Government Body"
    
    # Policy/Plan/Framework
    if any(term in lower_name for term in ['plan', 'policy', 'framework', 'strategy']):
        return "Policy/Plan"
    
    # Agencies
    if any(term in lower_name for term in ['agency', 'authority', 'commission', 'corporation']):
        return "Agency"
    
    # Default
    return "Other"

# Categorize the organizations
categorized_orgs = {}
for org in genuine_orgs:
    category = categorize_organization(org)
    if category not in categorized_orgs:
        categorized_orgs[category] = []
    categorized_orgs[category].append(org)

print("\nOrganizations by Category:")
for category, orgs in categorized_orgs.items():
    print(f"\n{category}:")
    for i, org in enumerate(orgs, 1):
        print(f"  {i}. {org}")

# Create a function to determine if a name is an acronym that might need expansion
def is_acronym(name):
    # Check if it's all uppercase and between 2-6 letters
    if re.match(r'^[A-Z]{2,6}$', name.strip()):
        return True
    return False

# Find potential acronyms and their expansions
acronyms = {}
for org in original_orgs:
    if is_acronym(org):
        # Look for potential expansions
        potential_expansions = []
        for other_org in original_orgs:
            if len(other_org) > len(org) and org != other_org:
                # Check if first letters of words match the acronym
                words = [w for w in other_org.split() if w[0].isalpha()]
                if words:
                    first_letters = ''.join(w[0].upper() for w in words)
                    if org in first_letters:
                        potential_expansions.append(other_org)
        
        if potential_expansions:
            acronyms[org] = potential_expansions

if acronyms:
    print("\nPotential Acronym Expansions:")
    for acronym, expansions in acronyms.items():
        print(f"{acronym}: {expansions[0]}")

Cleaned and deduplicated organizations:
1. The County Council

Organizations by Category:

Government Body:
  1. The County Council
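
Only three entries precede the `#` in `original_orgs` above, which is why the output lists a single organisation. To illustrate the grouping step on a few more names taken from that list, here is a minimal sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.token_sort_ratio`, so the scores only approximate what thefuzz would return. The helper names `normalize`, `token_sort_similarity` and `group_names` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, collapse whitespace (including newlines), and drop a leading article."""
    name = re.sub(r"\s+", " ", name).strip().lower()
    return re.sub(r"^(the|a|an)\s+", "", name)

def token_sort_similarity(a, b):
    """Stand-in for fuzz.token_sort_ratio: compare alphabetically sorted tokens (0-100)."""
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return 100 * SequenceMatcher(None, a_sorted, b_sorted).ratio()

def group_names(names, threshold=85):
    """Greedily assign each name to the first group whose key is similar enough."""
    groups = {}
    for name in names:
        key = normalize(name)
        for existing in groups:
            if token_sort_similarity(key, existing) > threshold:
                groups[existing].append(name)
                break
        else:
            # No sufficiently similar group found: start a new one
            groups[key] = [name]
    return groups

sample = [
    "the Kent Minerals and Waste Local Plan",
    "The Kent\nMinerals and Waste Local Plan",
    "Kent Minerals and Waste Local Plan",
    "the National Planning Policy Framework",
    "Environment Agency",
]
groups = group_names(sample)
for key, variants in groups.items():
    print(f"{key}: {len(variants)} variant(s)")
```

The three spellings of the local plan differ only in article, case and line breaks, so they normalise to the same key and fall into one group.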


### Multiple PDF files - extracted in a loop

In [14]:
!pip install pdfplumber spacy dateparser tqdm
!python -m spacy download en_core_web_sm

Collecting pdfminer.six==20250327 (from pdfplumber)
  Using cached pdfminer_six-20250327-py3-none-any.whl.metadata (4.1 kB)
Using cached pdfminer_six-20250327-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
  Attempting uninstall: pdfminer.six
    Found existing installation: pdfminer.six 0.0.0
    Uninstalling pdfminer.six-0.0.0:
      Successfully uninstalled pdfminer.six-0.0.0
Successfully installed pdfminer.six-20250327
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core

### MoM extraction

In [16]:
!pip install spacy rake-nltk pdfplumber
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


#### 1. Preprocessing Pipeline

In [17]:
import re
import spacy
from collections import defaultdict
from rake_nltk import Rake

# Load medium English model (no need to train)
nlp = spacy.load("en_core_web_md")

def preprocess_text(text):
    # Remove line breaks and excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    return text

def extract_entities(doc):
    """Enhanced entity extraction with rules for council-specific terms"""
    entities = defaultdict(list)
    
    for ent in doc.ents:
        entities[ent.label_].append(ent.text)
    
    # Custom rules for council-specific patterns
    # (use doc.text: this function has no local variable named `text`)
    resolutions = re.findall(r"RESOLVED that (.+?)(?=\n|$)", doc.text, re.IGNORECASE)
    if resolutions:
        entities["RESOLUTION"] = resolutions
        
    # Extract attendees (e.g., "PRESENT: Mr A, Mrs B")
    attendees = re.search(r"PRESENT: (.+?)(?=\n\n|\n\w+:|$)", doc.text, re.DOTALL)
    if attendees:
        entities["ATTENDEES"] = [a.strip() for a in attendees.group(1).split(",")]
    
    return dict(entities)

def extract_keywords(text):
    """Focus on action-oriented terms"""
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:10]  # Top 10 keywords

#### 2. Execution Workflow

In [18]:
def process_minutes(file_text):
    text = preprocess_text(file_text)
    doc = nlp(text)
    
    # Extract structured data
    metadata = {
        "entities": extract_entities(doc),
        "keywords": extract_keywords(text),
        # Match on the raw text: preprocess_text() removes the newlines these patterns end on
        "resolutions": list(set(re.findall(r"RESOLVED that (.+?)(?=\n|$)", file_text, re.I))),
        "motions": list(set(re.findall(r"(?:Proposed|Seconded) that (.+?)(?=\n|$)", file_text, re.I)))
    }
    
    # Generate summary template
    summary = (
        f"Meeting chaired by {metadata['entities'].get('PERSON', ['unknown'])[0]} with "
        f"{len(metadata['entities'].get('ATTENDEES', []))} attendees. "
        f"Key resolutions: {'; '.join(metadata['resolutions'][:2])}."
    )
    
    return {"metadata": metadata, "summary": summary}
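
`preprocess_text` collapses every run of whitespace, newlines included, into a single space, while the `RESOLVED that` pattern relies on `\n` (or the end of the string) to know where a resolution stops. The sketch below, which uses a made-up two-item minutes excerpt, shows how the same pattern behaves on raw and on collapsed text.

```python
import re

RESOLVED_PATTERN = r"RESOLVED that (.+?)(?=\n|$)"

# Hypothetical excerpt, for illustration only
raw = (
    "5. Budget\n"
    "RESOLVED that the budget be approved.\n"
    "6. Any other business\n"
    "RESOLVED that the report be noted.\n"
)

# Same cleanup as preprocess_text: each whitespace run becomes one space
collapsed = re.sub(r"\s+", " ", raw)

on_raw = re.findall(RESOLVED_PATTERN, raw, re.IGNORECASE)
on_collapsed = re.findall(RESOLVED_PATTERN, collapsed, re.IGNORECASE)

print(on_raw)        # one entry per resolution line
print(on_collapsed)  # a single entry that runs to the end of the text
```

On the collapsed text the lazy group has no newline to stop at, so the first match swallows everything up to the end of the document. Running line-anchored patterns on the raw text, or ending them on a sentence boundary instead, avoids this.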

#### Deployment Script for Bulk Processing

In [None]:
INPUT_DIR = "/Users/lgfolder/Downloads/data scrape full 1 page only/full_council/"

In [20]:
import os
import re
import json
import logging
from pathlib import Path
from collections import defaultdict

import pdfplumber
import spacy
import yake
from tqdm import tqdm

# -------------------- Setup --------------------

OUTPUT_DIR = "/Users/lgfolder/Downloads/minutes_processed"
MINUTES_KEYWORD = "minutes"
FIRST_PAGE_CHECK_LIMIT = 2000  # Characters

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Suppress pdfminer noise
logging.getLogger("pdfminer").setLevel(logging.ERROR)

# Load spaCy model
print("Loading spaCy model...")
nlp = spacy.load("en_core_web_md")

# Initialize YAKE
kw_extractor = yake.KeywordExtractor(
    lan="en", n=1, dedupLim=0.9, top=10, features=None
)

# -------------------- Functions --------------------

def extract_keywords(text):
    keywords = kw_extractor.extract_keywords(text)
    return [kw for kw, _ in keywords]

def extract_attendance_info(text):
    info = {"present": [], "absent": [], "virtual": []}

    present_match = re.search(r"PRESENT:\s+(.*?)(?=\n\n|\nIN ATTENDANCE|\n\d+\.\s)", text, re.DOTALL | re.IGNORECASE)
    if present_match:
        names = re.split(r",\s*", present_match.group(1).strip())
        info["present"] = [name.strip() for name in names if name]

    apologies_match = re.search(r"apologies from (.*?)(?=\n\n|\n\(?\d+\)\s|\n\s*\w+:|\n\w+\s+declared)", text, re.DOTALL | re.IGNORECASE)
    if apologies_match:
        names = re.split(r",\s*", apologies_match.group(1).strip().rstrip('.'))
        info["absent"] = [name.strip() for name in names if name]

    virtual_match = re.findall(r"(\b(?:Mr|Mrs|Ms|Miss|Dr)\s+[A-Z][a-z]+(?:\s[A-Z][a-z]+)?)\s+.*?attending the meeting virtually", text, re.IGNORECASE)
    info["virtual"] = list(set(virtual_match))

    return info

def extract_motions(text):
    motions = []
    motion_blocks = re.findall(
        r"(?:Mr|Mrs|Ms|Dr)\s+[A-Z][a-z]+.*?\sproposed.*?\n+(.+?)(?=(?:\n\([0-9]+?\) Following the debate|Amendment lost|Motion carried|Substantive Motion Carried))",
        text, re.DOTALL | re.IGNORECASE
    )

    vote_blocks = re.findall(
        r"vote.*?as follows:\s+For\s+\((\d+)\).*?Against\s+\((\d+)\).*?Abstain\s+\((\d+)\)", text, re.DOTALL | re.IGNORECASE
    )

    for i, motion_text in enumerate(motion_blocks):
        motion_text = motion_text.strip().replace("\n", " ")
        result = vote_blocks[i] if i < len(vote_blocks) else None
        outcome = "Passed" if result and int(result[0]) > int(result[1]) else "Failed"
        motions.append({
            "text": motion_text,
            "vote_result": {
                "for": int(result[0]) if result else None,
                "against": int(result[1]) if result else None,
                "abstain": int(result[2]) if result else None
            } if result else None,
            "outcome": outcome if result else "Unknown"
        })
    return motions

def extract_entities(doc):
    entities = defaultdict(list)
    for ent in doc.ents:
        entities[ent.label_].append(ent.text)

    resolutions = re.findall(r"RESOLVED that (.+?)(?=\n|$)", doc.text, re.IGNORECASE)
    if resolutions:
        entities["RESOLUTION"] = resolutions

    attendees = re.search(r"PRESENT: (.+?)(?=\n\n|\n\w+:|$)", doc.text, re.DOTALL)
    if attendees:
        entities["ATTENDEES"] = [a.strip() for a in attendees.group(1).split(",")]

    return dict(entities)

def is_minutes_file(filepath, first_page_text):
    filename = str(filepath).lower()
    first_page = first_page_text[:FIRST_PAGE_CHECK_LIMIT].lower()
    return (MINUTES_KEYWORD in filename) or (MINUTES_KEYWORD in first_page)

def extract_text_from_pdf(filepath):
    with pdfplumber.open(filepath) as pdf:
        # Extract each page once; skip pages with no extractable text
        page_texts = (page.extract_text() for page in pdf.pages)
        return "\n".join(t for t in page_texts if t)

def process_file(filepath):
    try:
        text = extract_text_from_pdf(filepath)
        if not text:
            raise ValueError("No extractable text in PDF")

        # Skip non-minutes files before running the more expensive extraction steps
        if not is_minutes_file(filepath.name, text):
            return None

        attendance = extract_attendance_info(text)
        motions = extract_motions(text)

        doc = nlp(text)

        metadata = {
            "filepath": str(filepath),
            "entities": extract_entities(doc),
            "keywords": extract_keywords(text),
            "resolutions": list(set(re.findall(r"RESOLVED that (.+?)(?=\n|$)", text, re.I))),
            "motions": motions,
            "attendance": attendance,
        }

        chair = metadata["entities"].get("PERSON", ["unknown"])[0]
        num_attendees = len(attendance.get("present", []))
        num_absent = len(attendance.get("absent", []))
        num_virtual = len(attendance.get("virtual", []))
        resolutions = metadata["resolutions"][:2]
        num_motions = len(motions)

        summary = (
            f"Meeting chaired by {chair} with {num_attendees} attendees "
            f"({num_virtual} virtual). {num_absent} apologies. "
            f"{num_motions} motions considered. "
            f"Key resolutions: {'; '.join(resolutions) if resolutions else 'None recorded'}."
        )

        return {
            "metadata": metadata,
            "summary": summary
        }

    except Exception as e:
        print(f"❌ Error processing {filepath}:\n  {e}")
        return None

def process_all_files(input_dir, output_dir):
    pdf_files = list(Path(input_dir).rglob("*.pdf"))
    print(f"🔍 Found {len(pdf_files)} PDF files. Starting processing...")

    output_jsonl_path = Path(output_dir) / "ALL_MINUTES.jsonl"

    # Clear the .jsonl file if it exists
    if output_jsonl_path.exists():
        output_jsonl_path.unlink()

    processed_count = 0

    for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
        result = process_file(pdf_path)
        if result:
            # Save individual JSON
            individual_path = Path(output_dir) / f"{pdf_path.stem}.json"
            with open(individual_path, 'w') as f:
                json.dump(result, f, indent=2)

            # Append to .jsonl file
            with open(output_jsonl_path, 'a') as f:
                f.write(json.dumps(result) + '\n')

            processed_count += 1

    print(f"\n✅ Done. {processed_count} minutes files processed. Results saved to:\n{output_dir}")

# -------------------- Run --------------------

if __name__ == "__main__":
    process_all_files(INPUT_DIR, OUTPUT_DIR)


Loading spaCy model...
🔍 Found 407 PDF files. Starting processing...


Processing PDFs:   7%|▋         | 30/407 [02:04<26:10,  4.17s/it]


KeyboardInterrupt: 
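
The run above was interrupted after 30 of 407 files, and `process_all_files` deletes `ALL_MINUTES.jsonl` on every start, so a rerun begins from scratch. One possible way to make the loop resumable is to skip any PDF whose per-file JSON already exists. The sketch below assumes exactly that convention; `process_resumable` and the `process_one` callable (standing in for `process_file`) are hypothetical names, and the demonstration uses a fake processor and a temporary directory rather than real PDFs.

```python
import json
import tempfile
from pathlib import Path

def process_resumable(pdf_paths, output_dir, process_one):
    """Process only PDFs without a per-file JSON yet; return the number newly processed."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    jsonl_path = output_dir / "ALL_MINUTES.jsonl"
    newly_processed = 0
    for pdf_path in pdf_paths:
        individual_path = output_dir / f"{Path(pdf_path).stem}.json"
        if individual_path.exists():
            continue  # already handled in an earlier run
        result = process_one(pdf_path)
        if not result:
            continue
        # Append to the combined file, then write the per-file JSON that marks completion
        with open(jsonl_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(result) + "\n")
        with open(individual_path, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2)
        newly_processed += 1
    return newly_processed

# Demonstration with a stand-in processor
def fake_process(path):
    return {"summary": f"processed {Path(path).name}"}

with tempfile.TemporaryDirectory() as tmp:
    first_run = process_resumable(["a.pdf", "b.pdf"], tmp, fake_process)
    second_run = process_resumable(["a.pdf", "b.pdf", "c.pdf"], tmp, fake_process)
    jsonl_lines = (Path(tmp) / "ALL_MINUTES.jsonl").read_text(encoding="utf-8").splitlines()
    print(first_run, second_run, len(jsonl_lines))
```

Files for which the processor returns nothing (for example non-minutes PDFs) leave no marker and are re-examined on every run. An interruption between the two writes would also produce a duplicate JSONL line on the next run, so this is a starting point rather than a complete solution.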

In [21]:
import json
from pathlib import Path

results_path = Path("/Users/lgfolder/Downloads/minutes_processed/ALL_MINUTES.jsonl")

with open(results_path, "r") as f:
    results = [json.loads(line) for line in f]

print(f"Loaded {len(results)} records.")

Loaded 5 records.
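
Before inspecting the raw records, a quick aggregate can show what was extracted. The sketch below counts entity mentions per label and totals the motions. It assumes each record has the `metadata` → `entities` / `motions` shape written by `process_file`; `summarise_records` is a hypothetical helper and the sample records are invented for illustration.

```python
from collections import Counter

def summarise_records(records):
    """Count entity mentions per label and total motions across all records."""
    label_counts = Counter()
    motion_total = 0
    for record in records:
        metadata = record.get("metadata", {})
        for label, mentions in metadata.get("entities", {}).items():
            label_counts[label] += len(mentions)
        motion_total += len(metadata.get("motions", []))
    return label_counts, motion_total

# Invented records with the same shape as the output of process_file
sample_records = [
    {"metadata": {"entities": {"ORG": ["Cabinet", "KCC"], "PERSON": ["A N Other"]}, "motions": []}},
    {"metadata": {"entities": {"ORG": ["Committee"]}, "motions": [{"text": "example motion"}]}},
]
label_counts, motion_total = summarise_records(sample_records)
print(label_counts.most_common(), motion_total)
```

With the real data, `summarise_records(results)` could be called directly on the list loaded above.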


In [22]:
results

[{'metadata': {'filepath': '/Users/lgfolder/Downloads/data scrape full 1 page only/full_council/2024-09-12/originals/CPP Minutes 040624.pdf',
   'entities': {'ORG': ['Sessions House',
     'County\nHall',
     'Alison\nFarmer',
     'Cabinet',
     'Integrated Children’s\nServices',
     'Operational Integrated Children’s Services',
     "Children's Countywide Services",
     'Committee',
     'the Participation Team',
     'Public Service Operational Delivery',
     'Easter',
     'the Super Council',
     'Care Leavers',
     'Easter',
     'the Adoptables Council',
     'the Children WHO Care\nCouncil',
     'Easter',
     'University of Kent',
     'Christ Church University',
     'the Kent Care Leavers Summer Event',
     'Verbal Update',
     'Cabinet',
     'the National Transfer\nScheme',
     'KCC',
     'the National\nTransfer Scheme',
     'FAQ',
     'KCC',
     'Reception Centres',
     'United Nations',
     'Millbank\nChildren’s Accommodation Centre',
     'CYPE',
     '