# Entity Extraction from New Hampshire Case Law
*With IBM Granite Models*

The [New Hampshire Case Law Dataset](https://huggingface.co/datasets/free-law/nh) comes from the Caselaw Access Project via Hugging Face.

## In this notebook
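This notebook walks through extracting entities from case law text with an IBM Granite model and loading the resulting entity-to-case relationships into a Neo4j graph database.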

## Prerequisites

To get started, you'll need:
* A [Replicate account](https://replicate.com/) and API token.

## Setting up the environment

### Install dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [47]:
!pip install git+https://github.com/ibm-granite-community/utils \
    "langchain_community<0.3.0" \
    replicate \
    datasets \
    transformers \
    tiktoken \
    neo4j \
    stringcase



## Selecting System Components

### Choose your LLM
The LLM will be used to extract entities from the case law text.

Follow the instructions in [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/cee1513c77429d7ddbf0e5a49b29b7bc9ca0d996/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb), selecting a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org, such as the `granite-3.0-8b-instruct` model used below.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [48]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import set_env_var, get_env_var

model = Replicate(
    model="ibm-granite/granite-3.0-8b-instruct",
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
)

### Get the tokenizer

Retrieve the tokenizer used by your chosen LLM.

In [49]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)

## Acquiring the Data

We will use a New Hampshire case law dataset as the source text for entity extraction.

### Download the documents

Download the [New Hampshire CAP Caselaw](https://huggingface.co/datasets/free-law/nh) dataset from Hugging Face using LangChain's `HuggingFaceDatasetLoader`, which relies on the `datasets` library.

In [50]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Load the documents from the dataset
loader = HuggingFaceDatasetLoader("free-law/nh", page_content_column="text")
documents = loader.load()
print("Document Count: " + str(len(documents)))

Document Count: 21540


### Add metadata to the documents

Add the `source` field, which is used below, to the metadata.

In [51]:
for doc in documents:
    doc.metadata['source'] = doc.metadata['id']

### Inspect the documents

In [52]:
for doc in documents[:1]:
    print(doc.metadata, "\n")
    print(doc.page_content, "\n")

{'id': '4439812', 'name': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'name_abbreviation': 'Wyman v. Stark', 'decision_date': '1975-01-06', 'docket_number': 'No. 7112', 'first_page': 1, 'last_page': '3', 'citations': '115 N.H. 1', 'volume': '115', 'reporter': 'New Hampshire Reports', 'court': 'New Hampshire Supreme Court', 'jurisdiction': 'New Hampshire', 'last_updated': '2021-08-10T17:25:43.934256+00:00', 'provenance': 'CAP', 'judges': '', 'parties': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'head_matter': 'Hillsborough\nNo. 7112\nLouis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento\nJanuary 6, 1975\nStanley M. Brown, Dart S. Bigg, Eugene M. Van Loan III and David R. DePuy (Mr. Brown orally) for the plaintiff.\nDevine, Millimet, Stahl & Branch and Matthias J. Reynolds and William S. Gannon (Mr. Joseph A. Millimet), by brief and orally, for John A. Durkin.\nThomas D

## Building the Document Database

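The caselaw document database lets us retrieve the full text of a case by its case id. The cells in this section are commented out, so this step is optional; uncomment them to build the database.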

### Create the database file and document table

In [53]:
# # put the json objects in a sqlite database, keyed by id
# import sqlite3, os, json

# # remove database file if exists
# if os.path.isfile('data.db'):
#     os.remove('data.db')

# conn = sqlite3.connect('data.db')
# c = conn.cursor()

# # create the table if it doesn't exist. include id, text, and size
# c.execute('''CREATE TABLE IF NOT EXISTS data
#              (id INTEGER PRIMARY KEY UNIQUE,
#               metadata TEXT,
#               text TEXT,
#               char_count INTEGER)''')


### Insert the documents into the table

In [54]:
# for doc in documents:
#     id = doc.metadata["id"]
#     c.execute("INSERT INTO data (id, metadata, text, char_count) VALUES (?,?,?,?)", (id, json.dumps(doc.metadata), doc.page_content, doc.metadata["char_count"]))
#     conn.commit()

### Count the documents

In [55]:
# c.execute("SELECT count(*) FROM data")
# doc_count = c.fetchone()[0]
# print(f"Document count: {doc_count}")
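The commented-out cells above describe a SQLite table keyed by case id. As a minimal sketch of the same idea, the following uses an in-memory database and a single batched insert. The `sample_docs` list and the `get_case_text` helper are illustrative assumptions, and `char_count` is computed here with `len()` rather than read from the dataset metadata.

```python
import json
import sqlite3

# Illustrative stand-in for the loaded documents (only the id, short name,
# and opening words of the first case shown earlier are reused here).
sample_docs = [
    {"id": "4439812", "metadata": {"name_abbreviation": "Wyman v. Stark"}, "text": "Per curiam."},
]

# An in-memory database keeps the sketch free of side effects on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS data ("
    "id INTEGER PRIMARY KEY, metadata TEXT, text TEXT, char_count INTEGER)"
)

rows = [
    (int(doc["id"]), json.dumps(doc["metadata"]), doc["text"], len(doc["text"]))
    for doc in sample_docs
]
# Using the connection as a context manager commits once for the whole batch.
with conn:
    conn.executemany(
        "INSERT INTO data (id, metadata, text, char_count) VALUES (?,?,?,?)", rows
    )

def get_case_text(connection, case_id):
    """Return the full text for a case id, or None if the id is unknown."""
    row = connection.execute(
        "SELECT text FROM data WHERE id = ?", (case_id,)
    ).fetchone()
    return row[0] if row else None

print(get_case_text(conn, 4439812))
```

Committing once per batch instead of once per row is usually noticeably faster when inserting thousands of documents.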

## Extracting the entities

In this example, we take the caselaw text, split it into chunks, and extract entities from each chunk. 

### Split the document into chunks

Split the document into text segments that can fit into the model's context window.

In [56]:
from langchain.text_splitter import TokenTextSplitter

# Split the documents into chunks
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=10)
chunks = text_splitter.split_documents(documents[:1])
print("Chunk Count: " + str(len(chunks)))

Chunk Count: 1
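To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch over a list of token ids. It is an illustration of the idea only, not the `TokenTextSplitter` implementation, and the `split_tokens` name is hypothetical.

```python
def split_tokens(tokens, chunk_size=1000, chunk_overlap=10):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return windows

# 25 fake token ids, windows of 10 with an overlap of 2
demo = split_tokens(list(range(25)), chunk_size=10, chunk_overlap=2)
print([(window[0], window[-1]) for window in demo])
```

A document shorter than `chunk_size` tokens yields a single window, which is consistent with the chunk count of 1 reported for the first case.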


### Inspect the chunks

In [57]:
import json
for i in range(1):
    print(chunks[i].page_content)
    print(json.dumps(chunks[i].metadata, indent=4))

"Per curiam.\nThis transfer arises out of the same case which was the subject matter of the petition for writ of prohibition in Durkin v. Hillsborough County Superior Court, 114 N.H. 788, 330 A.2d 777 (1974). The Superior Court (Bois, J.) has transferred without ruling seven questions, the first of which is as follows: \u201cDoes the Superior Court have jurisdiction either through RSA 68:4 II; other jurisdictional statutes or through precedent, to invalidate an election for United States Senator?\u201d\nThe several States may regulate the conduct of senatorial elections and may provide procedures necessary to guard against irregularity and error in the tabulation of votes and against fraud and corrupt practices. U.S. Const. art. I, \u00a7 4; Smiley v. Holm, 285 U.S. 355 (1932). They may provide procedures for a recount so long as they do not impair or frustrate the Senate\u2019s ability to make an independent judgment. Roudebush v. Hartke, 405 U.S. 15 (1972).\nThe proceedings before th

### Provide a taxonomy of entities

An LLM can generate a taxonomy like the one used below when given a prompt such as:

```
I am building a knowledge graph from legal case law. What are the entities I should extract for this knowledge graph?
Prefix the major categories with numbers, and the minor categories with letters.
```
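A taxonomy returned in that numbered and lettered format can be turned into a data structure before it is reused in later prompts. The following is a minimal sketch, assuming major categories look like `1. Name:` and minor categories look like `A. Name (optional note)`; the `parse_taxonomy` helper is hypothetical and not part of the notebook.

```python
import re

MAJOR = re.compile(r"^(\d+)\.\s+(.*?):\s*$")
MINOR = re.compile(r"^([A-Z])\.\s+(.*?)(?:\s+\((.*)\))?\s*$")

def parse_taxonomy(text):
    """Map each numbered major category to the list of its lettered minor categories."""
    taxonomy = {}
    current = None
    for line in text.splitlines():
        major = MAJOR.match(line)
        minor = MINOR.match(line)
        if major:
            current = major.group(2)
            taxonomy[current] = []
        elif minor and current is not None:
            # Keep the category name and drop the parenthesised explanation.
            taxonomy[current].append(minor.group(2))
    return taxonomy

sample = """
1. Case Information:
A. Case Name (e.g., "Brown v. Board of Education")
B. Docket Number (unique case identifier)
2. Legal Parties:
A. Plaintiff/Petitioner (the party initiating the case)
"""
print(parse_taxonomy(sample))
```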

In [58]:
query = """
Here are the categories of entity I want you to consider:

1. Case Information:
A. Case Name (e.g., "Brown v. Board of Education")
B. Docket Number (unique case identifier)
C. Court (e.g., Supreme Court, District Court)
D. Jurisdiction (state or federal jurisdiction, e.g., California, United States)
2. Legal Parties:
A. Plaintiff/Petitioner (the party initiating the case)
B. Defendant/Respondent (the party responding to the case)
C. Appellant/Appellee (for appeals)
3. Attorneys:
A. Counsel for Plaintiff/Petitioner (name, law firm, or organization)
B. Counsel for Defendant/Respondent (name, law firm, or organization)
4. Judges:
A. Judge(s)/Justice(s) (name and role: trial judge, appellate judge, presiding justice)
B. Panel Composition (for appellate cases, listing judges involved)
5. Legal Issues:
A. Legal Questions (the issues or questions of law presented to the court)
B. Claims (the specific claims or complaints raised by the plaintiff)
6. Legal Doctrines and Principles:
A. Statutes/Acts (e.g., "Civil Rights Act of 1964")
B. Precedents Cited (previous case law referred to)
C. Constitutional Provisions (e.g., "First Amendment," "Article III")
7. Facts and Context:
A. Material Facts (key facts on which the case is based)
B. Timeline (sequence of events leading up to and during the case)
8. Case Outcome:
A. Decision/Holding (the final judgment, such as "Affirmed," "Reversed")
B. Disposition (e.g., "dismissed with prejudice," "remanded")
C. Majority Opinion (summary or reasoning of the court's majority)
D. Concurring Opinion (opinion agreeing with the majority but for different reasons)
E. Dissenting Opinion (opinion disagreeing with the majority decision)
9. Procedural History:
A. Lower Court Rulings (previous decisions that led to the current case)
B. Appeals (sequence of appeals, appellate court involvement)
10. Relationships Between Entities:
A. Case Citation Relationship (e.g., "Case A cited Case B")
B. Statute Application (e.g., "Statute X was applied in Case Y")
C. Fact Relationships (linking individuals or events to legal consequences)
11. Court Documents:
A. Pleadings (complaints, answers, motions)
B. Briefs (e.g., appellate briefs, amicus briefs)
C. Orders (e.g., summary judgment, dismissal orders)
12. Dates:
A. Date of Filing (date the case was initiated)
B. Date of Decision (date the judgment was rendered)
C. Hearing Dates (important hearings or oral argument dates)
13. Legal Outcomes:
A. Remedies (e.g., "compensatory damages," "injunctive relief")
B. Sentences (in criminal cases, e.g., "5 years imprisonment")
14. Legal Concepts:
A. Standards of Review (e.g., "de novo," "abuse of discretion")
B. Burden of Proof (e.g., "preponderance of the evidence," "beyond a reasonable doubt")
"""


In [59]:

query = """
Case Name: The official name of the case (e.g., "Brown v. Board of Education").
Docket Number: The unique case identifier assigned to the case.
Court: The court where the case was heard (e.g., Supreme Court, District Court).
Jurisdiction: The jurisdiction under which the case falls (e.g., state or federal jurisdiction, like California, United States).
Plaintiff/Petitioner: The party initiating the case.
Defendant/Respondent: The party responding to the case.
Appellant/Appellee: The party appealing the decision and the party responding to the appeal, respectively.
Counsel for Plaintiff/Petitioner: The attorney or law firm representing the plaintiff/petitioner.
Counsel for Defendant/Respondent: The attorney or law firm representing the defendant/respondent.
Judge/Justice: The name of the judge or justice involved in the case, including their role (e.g., trial judge, appellate judge, presiding justice).
Panel Composition: The list of judges who heard the case (for appellate cases).
Legal Question: The issue or question of law presented to the court.
Claim: The specific claim or complaint raised by the plaintiff.
Statute/Act: The statute or act referenced or applied in the case (e.g., "Civil Rights Act of 1964").
Precedent Cited: Previous case law referred to in the case.
Constitutional Provision: The constitutional article or amendment referenced in the case (e.g., "First Amendment," "Article III").
Material Fact: The key fact on which the case is based.
Timeline: The sequence of events leading up to and during the case.
Decision/Holding: The final judgment of the court (e.g., "Affirmed," "Reversed").
Disposition: The outcome of the case (e.g., "dismissed with prejudice," "remanded").
Majority Opinion: The reasoning and summary of the decision given by the majority of the judges.
Concurring Opinion: An opinion that agrees with the majority's decision but for different reasons.
Dissenting Opinion: An opinion that disagrees with the majority decision.
Lower Court Ruling: A previous decision made by a lower court that led to the current case.
Appeal: The process and sequence of taking the case to a higher court.
Case Citation Relationship: A relationship indicating which case was cited by another case (e.g., "Case A cited Case B").
Statute Application: How a particular statute or law was applied in the case (e.g., "Statute X was applied in Case Y").
Fact Relationship: A relationship linking an individual or event to legal consequences within the case.
Pleading: A document such as a complaint, answer, or motion filed during the case.
Brief: A written document submitted by a party, including an appellate brief or amicus brief.
Order: A court order related to the case (e.g., summary judgment, dismissal order).
Date of Filing: The date when the case was initiated.
Date of Decision: The date on which the final decision was rendered.
Hearing Date: An important date for a hearing or oral argument in the case.
Remedy: Type of compensation or relief provided (e.g., "compensatory damages," "injunctive relief").
Sentence: In a criminal case, the sentence handed down (e.g., "5 years imprisonment").
Standard of Review: The standard used by the court in evaluating the case (e.g., "de novo," "abuse of discretion").
Burden of Proof: The level of proof required in the case (e.g., "preponderance of the evidence," "beyond a reasonable doubt").

"""


## Extracting Entities

### Provide a list of entity categories

In [60]:

query = """
Case Name: The official name of the case (e.g., "Brown v. Board of Education").
Docket Number: The unique case identifier assigned to the case.
Court: The court where the case was heard (e.g., Supreme Court, District Court).
Jurisdiction: The jurisdiction under which the case falls (e.g., state or federal jurisdiction, like California, United States).
Plaintiff/Petitioner: The party initiating the case.
Defendant/Respondent: The party responding to the case.
Appellant/Appellee: The party appealing the decision and the party responding to the appeal, respectively.
Counsel for Plaintiff/Petitioner: The attorney or law firm representing the plaintiff/petitioner.
Counsel for Defendant/Respondent: The attorney or law firm representing the defendant/respondent.
Judge/Justice: The name of the judge or justice involved in the case, including their role (e.g., trial judge, appellate judge, presiding justice).
Panel Composition: The list of judges who heard the case (for appellate cases).
Legal Question: The issue or question of law presented to the court.
Claim: The specific claim or complaint raised by the plaintiff.

Given this list of entity categories, find the entities of interest in the following text. List each entity, its relationship to the case, and a short description of that relationship, semicolon-separated, like this:
1. John A. Durkin; Defendant; The defendant in the case, against whom the plaintiff is seeking to invalidate an election.
2. State of New Hampshire; Jurisdiction; The state where the case is being heard and where the election took place.
3. Louis C. Wyman; Plaintiff; The plaintiff in the case, seeking to invalidate an election for United States Senator.

Here is the text:
\n{}"""


### Extract entities from each chunk of text

In [61]:
responses = []
for chunk in chunks[:1]:
    print(f"Chunk of {chunk.metadata['id']}")
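    # Note: formatting the whole Document (rather than chunk.page_content) also puts its metadata into the prompt
    full_query = query.format(chunk)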
    print(str(len(tokenizer.tokenize(full_query))) + " tokens")
    response = model.invoke(full_query, max_tokens=1000)
    print(response)
    responses.append(response)

Chunk of 4439812
2027 tokens
1. Louis C. Wyman; Plaintiff; The plaintiff in the case, seeking to invalidate an election for United States Senator.
2. John A. Durkin; Defendant; The defendant in the case, against whom the plaintiff is seeking to invalidate an election for United States Senator.
3. Robert L. Stark; Party; The Secretary of State of New Hampshire, a party to the case.
4. Carmen Chimento; Party; The individual represented by Richard W. Leonard, a party to the case.
5. New Hampshire; Jurisdiction; The state where the case is being heard and where the election took place.
6. United States Senate; Legal Question; The legislative body whose election and qualifications are at issue in the case.
7. Ballot Law Commission; Entity; The entity that conducts proceedings related to senatorial elections in New Hampshire.
8. Superior Court; Court; The court where the case was heard and where the plaintiff sought to invalidate the election.
9. New Hampshire Supreme Court; Appellate Court;
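The token count printed above matters because the prompt and the requested completion share the model's context window. The sketch below checks that budget. The 4096-token default is an assumption about the model's context length and should be verified against the model card; both helper names are hypothetical.

```python
def fits_context(prompt_tokens, max_new_tokens, context_window=4096):
    """Return True if the prompt plus the requested completion fits in the window.

    context_window=4096 is an assumed value, not taken from the notebook.
    """
    return prompt_tokens + max_new_tokens <= context_window

def remaining_tokens(prompt_tokens, max_new_tokens, context_window=4096):
    """Return how many tokens of headroom are left (negative if over budget)."""
    return context_window - (prompt_tokens + max_new_tokens)

# Values from the run above: a 2027-token prompt and max_tokens=1000
print(fits_context(2027, 1000))
print(remaining_tokens(2027, 1000))
```

If the check fails, a smaller `chunk_size` or a shorter category list in the prompt brings the request back under the limit.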

### Construct Graph Triples

Parse each line of the model response into an (entity, role, case) triple. The role becomes the relationship between the entity and the case node.

In [62]:
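case_node = f"Case {chunks[0].metadata['id']}"  # e.g. "Case 4439812"
entities_lines = responses[0].splitlines()[:-1]  # Strip off the last line, which was likely truncated in the model response.
# Each line looks like "1. Entity; Role; Description": drop the numbering and keep (entity, role, case).
entities_triples = [(entity.split(". ", 1)[1], role, case_node) for entity, role, desc in [line.split("; ", 2) for line in entities_lines]]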
print(entities_triples)

[('Louis C. Wyman', 'Plaintiff', 'Case 4439812'), ('John A. Durkin', 'Defendant', 'Case 4439812'), ('Robert L. Stark', 'Party', 'Case 4439812'), ('Carmen Chimento', 'Party', 'Case 4439812'), ('New Hampshire', 'Jurisdiction', 'Case 4439812'), ('United States Senate', 'Legal Question', 'Case 4439812'), ('Ballot Law Commission', 'Entity', 'Case 4439812'), ('Superior Court', 'Court', 'Case 4439812'), ('New Hampshire Supreme Court', 'Appellate Court', 'Case 4439812'), ('U.S. Constitution', 'Legal Question', 'Case 4439812'), ('RSA 68:4 II', 'Jurisdictional Statute', 'Case 4439812'), ('RSA 68:11', 'Jurisdictional Statute', 'Case 4439812'), ('RSA 64:6', 'Jurisdictional Statute', 'Case 4439812'), ('Smiley v. Holm', 'Precedent', 'Case 4439812'), ('Roudebush v. Hartke', 'Precedent', 'Case 4439812'), ('Durkin v. Snow', 'Precedent', 'Case 4439812'), ('Barry v. United States', 'Precedent', 'Case 4439812'), ('Petition of Dondero', 'Precedent', 'Case 4439812'), ('Stanley M. Brown', 'Counsel for Plaint
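The list comprehension above assumes every line has all three fields, so a malformed or truncated line raises an error. A more defensive variant, sketched here with a regular expression, skips lines that do not match instead. The `parse_entity_lines` helper and its pattern are assumptions for illustration.

```python
import re

# "<number>. <entity>; <role>; <description>" with a non-empty description
LINE_PATTERN = re.compile(
    r"^\s*\d+\.\s*(?P<entity>[^;]+);\s*(?P<role>[^;]+);\s*(?P<desc>.+)$"
)

def parse_entity_lines(response_text, case_node):
    """Turn model output lines into (entity, role, case) triples, skipping bad lines."""
    triples = []
    for line in response_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # blank, malformed, or truncated line
        triples.append((match["entity"].strip(), match["role"].strip(), case_node))
    return triples

sample_response = """1. Louis C. Wyman; Plaintiff; The plaintiff in the case.
2. John A. Durkin; Defendant; The defendant in the case.
9. New Hampshire Supreme Court; Appellate Court;"""

print(parse_entity_lines(sample_response, "Case 4439812"))
```

Because the truncated final line has no description, it is dropped without needing the `[:-1]` slice.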

## Building the Graph Database

### Define methods

In [63]:
from neo4j import GraphDatabase
from stringcase import snakecase

# Define the list of (entity, relationship, entity) triples
triples = [
    ("Alice", "KNOWS", "Bob"),
    ("Bob", "WORKS_WITH", "Charlie"),
    ("Alice", "LIVES_IN", "London"),
    ("Charlie", "VISITED", "Paris"),
]
triples = entities_triples  # Use the extracted triples in place of the sample triples above

# Connect to the Neo4j database
uri = get_env_var("NEO4J_URI")
username = get_env_var("NEO4J_USERNAME")
password = get_env_var("NEO4J_PASSWORD")
driver = GraphDatabase.driver(uri, auth=(username, password))

def create_graph(tx, entity1, role, entity2):
    query = (
        "MERGE (a:Entity {name: $entity1}) "
        "MERGE (b:Entity {name: $entity2}) "
        "MERGE (a)-[r:%s]->(b)"
    ) % snakecase(role)
    tx.run(query, entity1=entity1, entity2=entity2)

def build_graph(triples):
    with driver.session() as session:
        # Empty the graph first
        session.run("MATCH (n) DETACH DELETE n")
        # Fill the graph
        for entity1, role, entity2 in triples:
            session.execute_write(create_graph, entity1, role, entity2)

def query_graph():
    with driver.session() as session:
        # Query to find all nodes
        result = session.run("MATCH (n) RETURN n.name AS name")
        print("Nodes in the graph:")
        for record in result:
            print(record["name"])

        # Query to find all relationships
        result = session.run("MATCH (a)-[r]->(b) RETURN a.name AS from, type(r) AS rel, b.name AS to")
        print("\nRelationships in the graph:")
        for record in result:
            print(f"{record['from']} -[{record['rel']}]-> {record['to']}")

# Build the graph from the triples list
build_graph(triples)

# Issue some basic queries against the graph
query_graph()

# Close the connection to the database
driver.close()

print("Graph successfully built and queried in Neo4j!")


Nodes in the graph:
Louis C. Wyman
Case 4439812
John A. Durkin
Robert L. Stark
Carmen Chimento
New Hampshire
United States Senate
Ballot Law Commission
Superior Court
New Hampshire Supreme Court
U.S. Constitution
RSA 68:4 II
RSA 68:11
RSA 64:6
Smiley v. Holm
Roudebush v. Hartke
Durkin v. Snow
Barry v. United States
Petition of Dondero
Stanley M. Brown
Dart S. Bigg
Eugene M. Van Loan III
David R. DePuy
Thomas D. Rath
Matthias J. Reynolds
William S. Gannon

Relationships in the graph:
Louis C. Wyman -[plaintiff]-> Case 4439812
John A. Durkin -[defendant]-> Case 4439812
Robert L. Stark -[party]-> Case 4439812
Carmen Chimento -[party]-> Case 4439812
New Hampshire -[jurisdiction]-> Case 4439812
United States Senate -[legal__question]-> Case 4439812
Ballot Law Commission -[entity]-> Case 4439812
Superior Court -[court]-> Case 4439812
New Hampshire Supreme Court -[appellate__court]-> Case 4439812
U.S. Constitution -[legal__question]-> Case 4439812
RSA 68:4 II -[jurisdictional__statute]-> Case
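The relationship names in the output contain doubled underscores (for example `legal__question`), and the role text is interpolated directly into the Cypher statement, so a role containing a character such as `/` would produce an invalid query. One possible way to normalize roles first is sketched below; `to_relationship_type` is a hypothetical helper, and the upper-case style is a common Neo4j naming convention rather than a requirement.

```python
import re

def to_relationship_type(role):
    """Convert free-text role names into a safe, upper-case relationship type."""
    # Collapse every run of non-alphanumeric characters into a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", role).strip("_").upper()
    if not cleaned:
        raise ValueError(f"Cannot derive a relationship type from {role!r}")
    if cleaned[0].isdigit():
        cleaned = "REL_" + cleaned  # avoid a type name that starts with a digit
    return cleaned

for role in ["Legal Question", "Jurisdictional Statute", "Counsel for Plaintiff/Petitioner"]:
    print(role, "->", to_relationship_type(role))
```

Swapping this in for `snakecase` would change the stored relationship names, so any query that matches on a name such as `jurisdictional__statute` would need to be updated to match.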

In [69]:
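# The driver was closed at the end of the previous cell, so reconnect before querying
driver = GraphDatabase.driver(uri, auth=(username, password))
with driver.session() as session:
    # Find the entities linked to a case by a jurisdictional__statute relationship
    result = session.run("MATCH (a)-[:jurisdictional__statute]->() RETURN a.name AS name")
    print("Nodes in the graph:")
    for record in result:
        print(record["name"])
driver.close()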

Nodes in the graph:
RSA 68:4 II
RSA 68:11
RSA 64:6
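The same lookup can be cross-checked against the triples list without a database connection. This is a minimal sketch, assuming triples shaped like `(entity, role, case)`; the `group_by_role` helper is hypothetical, and the sample triples are copied from the output shown earlier.

```python
from collections import defaultdict

def group_by_role(triples):
    """Group entity names by their role, ignoring which case they belong to."""
    grouped = defaultdict(list)
    for entity, role, _case in triples:
        grouped[role].append(entity)
    return dict(grouped)

sample_triples = [
    ("RSA 68:4 II", "Jurisdictional Statute", "Case 4439812"),
    ("RSA 68:11", "Jurisdictional Statute", "Case 4439812"),
    ("RSA 64:6", "Jurisdictional Statute", "Case 4439812"),
    ("Smiley v. Holm", "Precedent", "Case 4439812"),
]

by_role = group_by_role(sample_triples)
print(by_role["Jurisdictional Statute"])
```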


