# Exploring the Legislation Corpus with Neo4j

In [18]:
# Load dotenv
from dotenv import load_dotenv
import os

load_dotenv()

NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")

In [19]:
from neo4j_analysis import Neo4jAnalysis

# Initialize the analysis helper
analysis = Neo4jAnalysis(NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD, NEO4J_DATABASE)

In [20]:
colors = {
    "Legislation": "#1f77b4",  # Blue for Legislation
    "Part": "#ff7f0e",  # Orange for Parts
    "Chapter": "#2ca02c",  # Green for Chapters
    "Section": "#d62728",  # Red for Sections
    "Paragraph": "#9467bd",  # Purple for Paragraphs
    "Schedule": "#8c564b",  # Brown for Schedules
    "ScheduleParagraph": "#e377c2",  # Pink for Schedule Paragraphs
    "ScheduleSubparagraph": "#7f7f7f",  # Gray for Schedule Subparagraphs
    "Commentary": "#bcbd22",  # Olive for Commentaries
    "Citation": "#17becf",  # Cyan for Citations
    "CitationSubRef": "#aec7e8",  # Light Blue for Citation Sub References
    "ExplanatoryNotes": "#ffbb78",  # Light Orange for Explanatory Notes
    "ExplanatoryNotesParagraph": "#98df8a",  # Light Green for Explanatory Notes Paragraphs
}

## The complete graph schema

We will start by visualising the complete graph schema of the Legislation Corpus. This will give us an overview of the different types of nodes and relationships that exist in the dataset.

In [21]:
# Show the graph schema
from neo4j_viz.neo4j import from_neo4j, ColorSpace

query = """
CALL db.schema.visualization()
"""
results = analysis.run_query_viz(query)

VG = from_neo4j(results)
VG.color_nodes(
    field="caption",  # Using the internal labels property
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected")
await analysis.capture_graph_to_png(
    generated_html, "renderings/schema_graph.png", width=1080
)

![Graph Schema](renderings/schema_graph.png)

## The corpus

The following query retrieves every `Legislation` node in the corpus with its category, status, title, URI and enactment date. We then keep only the documents with status `final` and show the ten most recently enacted.

In [22]:
query = """
MATCH p=(l:Legislation)
RETURN l.category AS Category, l.status AS Status, l.title AS Title, l.uri AS URI, l.enactment_date AS Enactment
ORDER BY Enactment
"""

corpus_df = analysis.run_query_df(query)
# filter by Status="final" and sort by enactment date
corpus_df[corpus_df["Status"] == "final"].sort_values(
    "Enactment", ascending=False
).head(10)

| | Category | Status | Title | URI | Enactment |
|---|---|---|---|---|---|
| 1388 | primary | final | Biodiversity Beyond National Jurisdiction Act ... | http://www.legislation.gov.uk/ukpga/2026/6 | 2026-02-12 |
| 1382 | primary | final | General Cemetery Act 2025 | http://www.legislation.gov.uk/ukla/2025/2 | 2025-10-27 |
| 1314 | primary | final | Housing (Amendment) Act (Northern Ireland) 2020 | http://www.legislation.gov.uk/nia/2020/5 | 2020-08-28 |
| 1280 | primary | final | City of London Corporation (Open Spaces) Act 2018 | http://www.legislation.gov.uk/ukla/2018/1 | 2018-03-15 |
| 1173 | primary | final | Humber Bridge Act 2013 | http://www.legislation.gov.uk/ukla/2013/6 | 2013-12-18 |
| 949 | primary | final | London Local Authorities and Transport for Lon... | http://www.legislation.gov.uk/ukla/2003/3 | 2003-10-30 |
| 894 | primary | final | Colchester Borough Council Act 2001 | http://www.legislation.gov.uk/ukla/2001/2 | 2001-03-22 |
| 719 | primary | final | London Underground (Jubilee) Act 1993 | http://www.legislation.gov.uk/ukla/1993/9 | 1993-07-01 |
| 711 | primary | final | London Docklands Railway (Lewisham) Act 1993 | http://www.legislation.gov.uk/ukla/1993/7 | 1993-05-27 |
| 709 | primary | final | British Railways Act 1993 | http://www.legislation.gov.uk/ukla/1993/4 | 1993-03-29 |


## A piece of legislation down to the section level

With the graph in place, we can start exploring the legislation corpus. Let's start with a piece of legislation, the **Corporation Tax Act 2010**, and explore its structure down to the section level.

In [23]:
query = """
MATCH p=(l:Legislation)-[:HAS_PART]->(:Part)-[:HAS_CHAPTER]->(:Chapter)-[:HAS_SECTION]->(:Section)
// ENDS WITH avoids also matching longer URIs such as ukpga/2010/40
WHERE l.uri ENDS WITH "/ukpga/2010/4"
RETURN p
"""

results = analysis.run_query_viz(query)

VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected")
await analysis.capture_graph_to_png(
    generated_html, "renderings/legislation_example.png", width=1080
)

![Legislation Example](renderings/legislation_example.png)

## Focusing on a single part down to paragraphs and commentaries

Let us focus on a single part of the **Corporation Tax Act 2010**, selected with `part.order = 2`, and explore the network of paragraphs and commentaries in that part. The `order` property of the `Paragraph` nodes allows us to reconstruct the original order of the paragraphs in the legislation.

In [24]:
query = """
MATCH p=(l:Legislation)-[:HAS_PART]->(part:Part)-[:HAS_CHAPTER]->(:Chapter)-[:HAS_SECTION]->(section:Section)-[:HAS_PARAGRAPH]->(para:Paragraph)-[:HAS_COMMENTARY]->(comm:Commentary)
WHERE l.uri ENDS WITH "/ukpga/2010/4" AND part.order=2
RETURN p
"""

results = analysis.run_query_viz(query)

VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/legislation_example_detail.png", width=1080
)

![Legislation Example](renderings/legislation_example_detail.png)
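
As a rough illustration of how an `order` property can restore reading order once records have left the graph, the sketch below sorts a few made-up paragraph records on the client side. The record layout and the values are invented for the example; only the `order` property name comes from the graph.

```python
# Hypothetical paragraph records, as they might arrive unordered from a query.
paragraphs = [
    {"order": 3, "text": "Third paragraph."},
    {"order": 1, "text": "First paragraph."},
    {"order": 2, "text": "Second paragraph."},
]


def in_reading_order(records):
    """Return records sorted by their `order` property, missing values last."""
    return sorted(
        records,
        key=lambda r: (r.get("order") is None, r.get("order") or 0),
    )


ordered_texts = [p["text"] for p in in_reading_order(paragraphs)]
print(ordered_texts)
```

The same effect could be had in Cypher with an `ORDER BY` on the property; sorting client-side is only convenient when the rows are already in a DataFrame or list.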

## Commentaries

Let us now run a query that retrieves the network of commentaries which cite a specific piece of legislation, here the **Data Protection Act 2018**. We filter on the `uri` property of the `Legislation` node, which is a unique identifier for each piece of legislation in the corpus. You could also perform an exact or fuzzy match on the title of the legislation.

In [25]:
query = """
MATCH p=(:Commentary)-[:HAS_CITATION]->(:Citation)-[:CITES_ACT]->(l:Legislation)
WHERE l.uri ENDS WITH "/ukpga/2018/12"
RETURN p
"""

results = analysis.run_query_viz(query)

VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/commentary_network.png", width=1080
)

![Commentary Network](renderings/commentary_network.png)
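
The fuzzy title match mentioned above could be approximated on the client side before querying by URI. The sketch below is one possible approach using Python's standard `difflib`; the helper name `match_title` and the short candidate list are assumptions made for illustration, and in practice the candidates would come from `l.title` in the graph.

```python
from difflib import get_close_matches

# A handful of titles mentioned in this notebook, standing in for l.title values.
titles = [
    "Corporation Tax Act 2010",
    "Data Protection Act 2018",
    "Value Added Tax Act 1994",
    "Gambling Act 2005",
]


def match_title(user_input, candidates, cutoff=0.6):
    """Return candidate titles ranked by similarity to user_input (case-insensitive)."""
    lowered = {t.lower(): t for t in candidates}
    hits = get_close_matches(user_input.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[h] for h in hits]


# The misspelling is deliberate, to show that close matches are still found.
matches = match_title("data protecton act", titles)
print(matches)
```

An exact match needs none of this: a `WHERE l.title = $title` predicate with a query parameter would be enough.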

## Schedules

Schedules supplement the main body of an Act with material that does not fit there, such as details, examples, or exceptions related to specific sections. In our graph, each schedule is a `Schedule` node attached to its `Legislation` node through `HAS_SCHEDULE`, and it breaks down further into `ScheduleParagraph` and `ScheduleSubparagraph` nodes.

In [26]:
query = """
MATCH p=(l:Legislation)-[:HAS_SCHEDULE]->(sc:Schedule)-[:HAS_PARAGRAPH]->(scp:ScheduleParagraph)-[:HAS_SUBPARAGRAPH]->(scsp:ScheduleSubparagraph)
WHERE l.uri ENDS WITH "/ukpga/2010/4"
// Bind the optional commentary path so that it is actually returned and drawn
OPTIONAL MATCH q=(scp)-[:HAS_COMMENTARY]-(:Commentary)-[:HAS_CITATION]-(:Citation)-[:HAS_SUBREF]->(:CitationSubRef)
RETURN p, q
"""

results = analysis.run_query_viz(query)

VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/schedules.png", width=1080
)

![Schedule Example](renderings/schedules.png)

## Creating synthetic relationships

Let us now create synthetic `CITES_LEGISLATION` relationships between legislation nodes that cite each other, and store the number of citations as a `weight` property on each relationship.

In [27]:
query = """
// Assign the structural traversal to a path variable 'p'
MATCH p = (source:Legislation)-[:HAS_PART|HAS_CHAPTER|HAS_SECTION|HAS_PARAGRAPH|HAS_SCHEDULE|HAS_SUBPARAGRAPH|HAS_COMMENTARY|HAS_CITATION|HAS_SUBREF*1..10]->(citation_link)
// Match the final jump to the target legislation
MATCH (citation_link)-[:CITES_ACT|REFERENCES]->(target:Legislation)
// Prevent self-citations
WHERE source.uri <> target.uri
// Ensure no intermediate node is another Legislation node.
// nodes(p)[1..] skips the starting node (source) and checks everything else.
  AND ALL(n IN nodes(p)[1..] WHERE NOT n:Legislation)
// Aggregate and create the synthetic relationship
WITH source, target, count(citation_link) AS citation_count
MERGE (source)-[rel:CITES_LEGISLATION]->(target)
SET rel.weight = citation_count
"""

results = analysis.run_query(query)
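
To make the aggregation step concrete, here is a minimal in-memory sketch of the same idea: count citations per (source, target) pair, skip self-citations, and keep the count as the edge weight. The act names and pairs are invented for illustration.

```python
from collections import Counter

# Made-up (citing act, cited act) pairs, one per citation found in the text.
citations = [
    ("Act A", "Act B"),
    ("Act A", "Act B"),
    ("Act A", "Act C"),
    ("Act B", "Act B"),  # self-citation, dropped below
    ("Act C", "Act A"),
]


def weighted_edges(pairs):
    """Count citations per (source, target) pair, ignoring self-citations."""
    return Counter((s, t) for s, t in pairs if s != t)


edges = weighted_edges(citations)
for (source, target), weight in edges.most_common():
    print(f"{source} -> {target}: weight={weight}")
```

The edges stay directed, as in the Cypher query: `Act A -> Act C` and `Act C -> Act A` are counted separately.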

We can then visualise the resulting graph, which shows the network of citations between Acts in our corpus. The query is capped at 1,000 paths to keep the rendering manageable.

In [28]:
query = """
MATCH p = (l1:Legislation)-[r:CITES_LEGISLATION]->(l2:Legislation)
RETURN p
LIMIT 1000
"""

results = analysis.run_query_viz(query)

VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/citation_network.png", width=1080
)

![Citation Network](renderings/citation_network.png)

Or we can run a similar query to create a table of citations.

In [29]:
query = """
MATCH p = (l1:Legislation)-[r:CITES_LEGISLATION]->(l2:Legislation)
RETURN l1.title AS Source, l2.title AS Target, r.weight AS CitationCount
ORDER BY CitationCount DESC
"""

citations_df = analysis.run_query_df(query)
citations_df.head(10)

| | Source | Target | CitationCount |
|---|---|---|---|
| 0 | Value Added Tax Act 1994 | Taxation (Cross-border Trade) Act 2018 | 22933 |
| 1 | Value Added Tax Act 1994 | Taxation (Cross-border Trade) Act 2018 | 22933 |
| 2 | The Insolvency (Northern Ireland) Order 1989 | The Payment and Electronic Money Institution I... | 22287 |
| 3 | Value Added Tax Act 1994 | The Value Added Tax (Miscellaneous and Transit... | 15293 |
| 4 | Value Added Tax Act 1994 | The Value Added Tax (Miscellaneous and Transit... | 15293 |
| 5 | Value Added Tax Act 1994 | The Finance Act 2016, Section 126 (Appointed D... | 15206 |
| 6 | Value Added Tax Act 1994 | The Finance Act 2016, Section 126 (Appointed D... | 15206 |
| 7 | The European Qualifications (Health and Social... | European Union (Withdrawal Agreement) Act 2020 | 14221 |
| 8 | The Insolvency (Northern Ireland) Order 1989 | The Payment and Electronic Money Institution I... | 11221 |
| 9 | Insolvency Act 1986 | The Payment and Electronic Money Institution I... | 8587 |


## Superseded legislation

Over time, legislation can be superseded by newer legislation. The graph records which pieces of legislation have been superseded, and by which newer ones. We can run a query to retrieve this information and visualise the resulting network.

In [30]:
query = """
MATCH p=(:Legislation)-[:SUPERSEDED_BY|SUPERSEDES]-(:Legislation)
RETURN p
"""

results = analysis.run_query_viz(query)
VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/superseded_network.png", width=1080
)

![Superseded Legislation Network](renderings/superseded_network.png)
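
One thing such a network makes possible is following a chain of replacements to the act currently in force. The sketch below assumes, purely for illustration, that each act is superseded by at most one other act; the act names and the dictionary layout are invented and do not reflect the corpus.

```python
# Hypothetical SUPERSEDED_BY edges: old act -> act that replaced it.
superseded_by = {
    "Act 1970": "Act 1985",
    "Act 1985": "Act 2003",
}


def current_version(act, edges):
    """Follow SUPERSEDED_BY links until reaching an act that was not replaced."""
    seen = {act}
    while act in edges:
        act = edges[act]
        if act in seen:  # guard against cycles in malformed data
            raise ValueError(f"cycle detected at {act!r}")
        seen.add(act)
    return act


print(current_version("Act 1970", superseded_by))
```

If an act can be superseded by several others, the lookup becomes a graph traversal rather than a simple chain walk.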


## Rebuilding the full text of a piece of legislation

Because we hold the full text of legislation in the graph, we can run a query to retrieve all the text associated with a piece of legislation, and rebuild part or the whole of the text.

In [31]:
query = """
MATCH (l:Legislation)-[:HAS_PART|HAS_CHAPTER|HAS_SECTION|HAS_PARAGRAPH|HAS_SCHEDULE|HAS_SUBPARAGRAPH|HAS_EXPLANATORY_NOTES*0..6]->(node)
WHERE l.uri ENDS WITH '/ukpga/2010/4'
  // Only keep nodes that actually hold readable content (ignoring empty structural containers)
  AND (node.text IS NOT NULL OR node.title IS NOT NULL OR node.description IS NOT NULL)
RETURN labels(node)[0] AS node_type,
       node.id AS node_id,
       node.number AS number,
       coalesce(node.text, node.title, node.description) AS content
"""

full_text_df = analysis.run_query_df(query)
full_text = full_text_df["content"].str.cat(sep="\n\n")
print(full_text[:2000])  # Print the first 2000 characters of the full text

Corporation Tax Act 2010

Introduction

Overview of Act

Part 2 is about calculation of the corporation tax chargeable on a company's profits, in particular— the rates at which corporation tax on profits is charged (see Chapter 2), ascertaining the amount of profits to which the rates of tax are applied (see Chapter 3), and the currency in which profits are to be calculated and expressed (see Chapter 4).

the rates at which corporation tax on profits is charged (see Chapter 2),

ascertaining the amount of profits to which the rates of tax are applied (see Chapter 3), and

the currency in which profits are to be calculated and expressed (see Chapter 4).

Parts 3A to 7 make provision for the following reliefs— . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . relief for companies with small profits (other than ring-fence profits) (see Part 3A), relief for trade losses (see Chapters 2 and 3 of Part 4), relief for losses from property businesses (see Chapter 4 of Part 4), rel
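
The reconstruction step above boils down to picking the first available field per node and joining the results. The sketch below reproduces that logic on a few made-up rows, without pandas; the row contents and helper names are invented for illustration.

```python
def coalesce(*values):
    """Return the first value that is not None, like Cypher's coalesce()."""
    return next((v for v in values if v is not None), None)


def rebuild_text(rows, sep="\n\n"):
    """Join the first available of text, title, description for each row."""
    contents = (
        coalesce(r.get("text"), r.get("title"), r.get("description")) for r in rows
    )
    return sep.join(c for c in contents if c is not None)


# Made-up rows shaped like the query result.
rows = [
    {"title": "Example Act", "text": None, "description": None},
    {"title": "Overview", "text": "Body of section 1.", "description": None},
    {"title": None, "text": None, "description": None},  # empty container, skipped
    {"title": None, "text": None, "description": "A schedule description."},
]

full_text = rebuild_text(rows)
print(full_text)
```

Cypher does not guarantee row order without an `ORDER BY`, so a faithful reconstruction would need to sort the rows first, for example on the `order` property where it is available.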

## Point in time legislation

Because we store restriction dates for chapters, sections and paragraphs, we can retrieve the legislation text that applies on or from a given date. For example, let us retrieve content for the **Corporation Tax Act 2010**, keeping only paragraphs whose restriction start date falls on or after 1 January 2020, or which carry no restriction start date at all.

In [32]:
query = """
MATCH p=(l:Legislation)-[:HAS_PART]->(part:Part)-[:HAS_CHAPTER]->(:Chapter)-[:HAS_SECTION]->(section:Section)-[:HAS_PARAGRAPH]->(para:Paragraph)-[:HAS_COMMENTARY]->(comm:Commentary)
WHERE l.uri ENDS WITH "/ukpga/2010/4" AND (para.restrict_start_date >= date("2020-01-01") OR para.restrict_start_date IS NULL)
RETURN p
"""

results = analysis.run_query_viz(query)
VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/point_in_time_legislation.png", width=1080
)

![Point in Time Legislation](renderings/point_in_time_legislation.png)
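
The date predicate used above can be read in more than one way, so the sketch below spells out two variants on made-up records. `starting_on_or_after` mirrors the Cypher filter. `in_force_on` is a stricter "as it stood on that day" reading and relies on a `restrict_end_date` field, which is an assumption for illustration and may not match the graph's property names.

```python
from datetime import date

# Made-up paragraph records; restrict_end_date is an assumed companion field.
paragraphs = [
    {"id": "p1", "restrict_start_date": None, "restrict_end_date": None},
    {"id": "p2", "restrict_start_date": date(2015, 4, 1), "restrict_end_date": date(2019, 3, 31)},
    {"id": "p3", "restrict_start_date": date(2019, 4, 1), "restrict_end_date": None},
    {"id": "p4", "restrict_start_date": date(2021, 4, 1), "restrict_end_date": None},
]


def starting_on_or_after(records, cutoff):
    """Mirror of the Cypher filter: start date >= cutoff, or no start date."""
    return [
        r["id"]
        for r in records
        if r["restrict_start_date"] is None or r["restrict_start_date"] >= cutoff
    ]


def in_force_on(records, day):
    """Stricter reading: already started on `day` and not yet ended."""
    result = []
    for r in records:
        start, end = r["restrict_start_date"], r["restrict_end_date"]
        started = start is None or start <= day
        not_ended = end is None or end >= day
        if started and not_ended:
            result.append(r["id"])
    return result


cutoff = date(2020, 1, 1)
print(starting_on_or_after(paragraphs, cutoff))
print(in_force_on(paragraphs, cutoff))
```

Which variant is appropriate depends on whether the question is "what changed from this date" or "what did the law say on this date".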

## Finding paths between acts

In some cases, we might want to understand how two concepts or specific acts are "tied" together in the legislative framework. For example, a legal analyst might need to know how a specific tax regulation, like the **Value Added Tax Act 1994**, ultimately influences a completely different domain, such as the **Gambling Act 2005**, even if they never reference each other directly.

By using graph traversal algorithms, we can uncover the "hidden bridges" between them, such as a secondary piece of financial regulation or an obscure statutory instrument that cites them both.

This kind of multi-hop analysis is hard to express in standard relational databases or keyword search engines, but it is exactly what a graph database like Neo4j is designed to do. Instead of just finding documents that contain the same words, you are mapping the actual legal dependencies and structural lineage.

In [33]:
query = """
MATCH (source:Legislation)
MATCH (target:Legislation)
WHERE source.title = "Value Added Tax Act 1994" AND target.title = "Gambling Act 2005"
// Find the shortest path using the synthetic CITES_LEGISLATION relationship
// The *1..10 limits the search depth to prevent runaway queries on massive graphs
MATCH p = shortestPath((source)-[:CITES_LEGISLATION*1..10]-(target))

RETURN p
"""

results = analysis.run_query_viz(query)
VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/shortest_path.png", width=1080
)

![Shortest Path Between Acts](renderings/shortest_path.png)
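
For readers unfamiliar with what a shortest-path search does, here is a small breadth-first search over a toy undirected citation graph. It is a sketch of the general technique, not of Neo4j's internal implementation; the "Bridge" acts and all edges are invented, and only the two endpoint titles come from the query above.

```python
from collections import deque

# Toy undirected citation graph; the "Bridge" acts are invented for illustration.
graph = {
    "Value Added Tax Act 1994": {"Bridge Act X", "Bridge Act Y"},
    "Bridge Act X": {"Value Added Tax Act 1994", "Gambling Act 2005"},
    "Bridge Act Y": {"Value Added Tax Act 1994", "Bridge Act Z"},
    "Bridge Act Z": {"Bridge Act Y", "Gambling Act 2005"},
    "Gambling Act 2005": {"Bridge Act X", "Bridge Act Z"},
}


def shortest_path(graph, source, target, max_hops=10):
    """Breadth-first search returning one shortest path, or None within max_hops."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_hops:
            continue  # do not extend paths beyond the hop limit
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(graph, "Value Added Tax Act 1994", "Gambling Act 2005")
print(" -> ".join(path))
```

The `max_hops` argument plays the same role as the `*1..10` bound in the Cypher pattern: it caps how far the search may wander.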

We can also explore more complex conditional paths. For example, we might want to see **all** shortest paths between the two acts, but only those whose intermediate "bridge" acts meet a given condition, here an enactment year of 1990 or earlier.

In [34]:
query = """
MATCH (source:Legislation {title: "Value Added Tax Act 1994"})
MATCH (target:Legislation {title: "Gambling Act 2005"})
// Find the paths
MATCH p = allShortestPaths((source)-[:CITES_LEGISLATION*1..5]-(target))
// Slice the list of nodes so it only checks the intermediate "bridge" nodes
WHERE ALL(n IN nodes(p)[1..-1] WHERE n.enactment_date.year <= 1990)
RETURN p
"""

results = analysis.run_query_viz(query)
VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/filtered_shortest_path.png", width=1080
)

![Filtered Shortest Path Between Acts](renderings/filtered_shortest_path.png)
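
The idea of "all shortest paths whose bridge nodes satisfy a condition" can be sketched as a level-by-level breadth-first search that refuses to step onto disallowed intermediate nodes. This is an illustration of the concept under that assumption, not a description of how Neo4j evaluates the query; the bridge acts, years and edges are invented.

```python
# Toy data: years and edges are invented; only the endpoint titles come from the text.
enacted = {
    "Value Added Tax Act 1994": 1994,
    "Gambling Act 2005": 2005,
    "Bridge Act 1968": 1968,
    "Bridge Act 1988": 1988,
    "Bridge Act 2001": 2001,
}
edges = [
    ("Value Added Tax Act 1994", "Bridge Act 1968"),
    ("Bridge Act 1968", "Gambling Act 2005"),
    ("Value Added Tax Act 1994", "Bridge Act 1988"),
    ("Bridge Act 1988", "Gambling Act 2005"),
    ("Value Added Tax Act 1994", "Bridge Act 2001"),
    ("Bridge Act 2001", "Gambling Act 2005"),
]


def all_shortest_paths(edges, source, target, allow):
    """All shortest undirected paths whose intermediate nodes satisfy allow(node)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    frontier = [[source]]
    seen = {source}
    while frontier:
        found = [p for p in frontier if p[-1] == target]
        if found:
            return sorted(found)
        next_frontier, newly_seen = [], set()
        for path in frontier:
            for n in neighbours.get(path[-1], ()):
                if n in seen:
                    continue
                if n != target and not allow(n):
                    continue  # the condition applies to bridge nodes only
                newly_seen.add(n)
                next_frontier.append(path + [n])
        # Mark nodes as seen only after the whole level, so ties are all kept.
        seen |= newly_seen
        frontier = next_frontier
    return []


paths = all_shortest_paths(
    edges,
    "Value Added Tax Act 1994",
    "Gambling Act 2005",
    allow=lambda act: enacted[act] <= 1990,
)
for p in paths:
    print(" -> ".join(p))
```

Here the path through the invented "Bridge Act 2001" is excluded because that bridge fails the year condition, while both remaining two-hop paths are returned.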

## Finding topical networks

An interesting use case is to find topical networks of legislation. For example, we might want to find all the legislation that is related to "data protection", and then explore the network of citations between them. In addition, let us restrict the network to legislation enacted after 2000.

> Note that in practice you would use vector search to find legislation related to a specific topic, but here we are using a simple case-insensitive substring match on element titles for demonstration purposes.

In [35]:
query = """
// Find all structural elements matching "data protection"
MATCH (l:Legislation)-[:HAS_PART|HAS_CHAPTER|HAS_SECTION|HAS_PARAGRAPH|HAS_SCHEDULE*1..6]->(element)
WHERE (element:Chapter OR element:Section OR element:Paragraph)
  AND toLower(element.title) CONTAINS toLower("data protection")
// Collect the distinct parent Acts that house these matching elements
WITH collect(DISTINCT l) AS topic_acts
// Unwind the collection so we can evaluate their relationships
UNWIND topic_acts AS source_act
// Match the synthetic citation relationships, but ONLY where the target 
// is also one of our identified topic acts. This builds a closed-domain network.
MATCH p = (source_act)-[r:CITES_LEGISLATION]->(target_act:Legislation)
WHERE target_act IN topic_acts AND source_act.enactment_date.year > 2000 AND target_act.enactment_date.year > 2000
RETURN p"""

results = analysis.run_query_viz(query)
VG = from_neo4j(results)
VG.color_nodes(
    field="caption",
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)

generated_html = VG.render(layout="forcedirected", initial_zoom=1.0)
await analysis.capture_graph_to_png(
    generated_html, "renderings/data_protection_network.png", width=1080
)

![Topical Network](renderings/data_protection_network.png)
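
The two-stage logic of the query (first find the acts that match the topic, then keep only citations between those acts) can be sketched in plain Python as follows. All act names, years, element titles and citations below are invented for illustration.

```python
# Toy corpus: every name, year and title here is made up.
acts = {
    "Act A 2018": {"year": 2018, "element_titles": ["Data protection principles"]},
    "Act B 1998": {"year": 1998, "element_titles": ["Data Protection: general"]},
    "Act C 2017": {"year": 2017, "element_titles": ["DATA PROTECTION and sharing"]},
    "Act D 2004": {"year": 2004, "element_titles": ["Speed limits"]},
}
citations = [
    ("Act C 2017", "Act A 2018"),
    ("Act A 2018", "Act B 1998"),
    ("Act D 2004", "Act A 2018"),
]


def topical_network(acts, citations, phrase, min_year):
    """Citations where both acts match the phrase and were enacted after min_year."""
    needle = phrase.lower()
    # Stage 1: acts with at least one element title containing the phrase.
    topic_acts = {
        name
        for name, info in acts.items()
        if any(needle in title.lower() for title in info["element_titles"])
    }
    # Stage 2: closed-domain network, both ends must be topic acts.
    return [
        (s, t)
        for s, t in citations
        if s in topic_acts
        and t in topic_acts
        and acts[s]["year"] > min_year
        and acts[t]["year"] > min_year
    ]


network = topical_network(acts, citations, "data protection", 2000)
print(network)
```

Swapping the substring test in stage 1 for a vector similarity lookup would leave stage 2 unchanged, which is why the substring version works as a stand-in.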