# Entity Extraction from New Hampshire Case Law
*With IBM Granite Models*

The [New Hampshire Case Law Dataset](https://huggingface.co/datasets/free-law/nh) comes from the Caselaw Access Project via Hugging Face.

## In this notebook
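This notebook shows how to extract entities from case law text with an IBM Granite model, load them into a Neo4j graph, and query that graph for cases that match the entities named in a question.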

## Prerequisites

To get started, you'll need:
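* A [Replicate account](https://replicate.com/) and API token.
* Access to a Neo4j database, with its connection details available as `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`.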

## Setting up the environment

### Install dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [47]:
!pip install git+https://github.com/ibm-granite-community/utils \
    "langchain_community<0.3.0" \
    replicate \
    datasets \
    transformers \
    tiktoken \
    neo4j \
    stringcase

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /private/var/folders/nc/jrql4k0n2j73h7xktzxdr4pr0000gn/T/pip-req-build-4wi9a7ku
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /private/var/folders/nc/jrql4k0n2j73h7xktzxdr4pr0000gn/T/pip-req-build-4wi9a7ku
  Resolved https://github.com/ibm-granite-community/utils to commit a4b663310cdc11be2f3039a11d263dae98584582
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip


## Selecting System Components

### Choose your LLM
The LLM extracts entities from the case law text and, later, from the user's question.

Follow the instructions in [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/cee1513c77429d7ddbf0e5a49b29b7bc9ca0d996/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb), selecting a Granite instruct model from the [`ibm-granite`](https://replicate.com/ibm-granite) org.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [48]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import set_env_var, get_env_var

model = Replicate(
    model="ibm-granite/granite-3.0-8b-instruct",
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
)

### Get the tokenizer

Retrieve the tokenizer used by your chosen LLM.

In [49]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)

## Acquiring the Data

We will use a New Hampshire case law dataset to help the model answer questions about NH laws.

### Download the documents

Download the [New Hampshire CAP Caselaw](https://huggingface.co/datasets/free-law/nh) dataset from Hugging Face using LangChain's `HuggingFaceDatasetLoader`, which is built on the `datasets` library.

In [111]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Load the documents from the dataset
loader = HuggingFaceDatasetLoader("free-law/nh", page_content_column="text")
documents = loader.load()
print("Document Count: " + str(len(documents)))

Document Count: 21540


### Add metadata to the documents

Add the `source` field, which is used below, to the metadata.

In [51]:
for doc in documents:
    doc.metadata['source'] = doc.metadata['id']

### Inspect the documents

In [88]:
for doc in documents[:1]:
    print(doc.metadata, "\n")
    print(doc.page_content, "\n")

{'id': '4439812', 'name': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'name_abbreviation': 'Wyman v. Stark', 'decision_date': '1975-01-06', 'docket_number': 'No. 7112', 'first_page': 1, 'last_page': '3', 'citations': '115 N.H. 1', 'volume': '115', 'reporter': 'New Hampshire Reports', 'court': 'New Hampshire Supreme Court', 'jurisdiction': 'New Hampshire', 'last_updated': '2021-08-10T17:25:43.934256+00:00', 'provenance': 'CAP', 'judges': '', 'parties': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'head_matter': 'Hillsborough\nNo. 7112\nLouis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento\nJanuary 6, 1975\nStanley M. Brown, Dart S. Bigg, Eugene M. Van Loan III and David R. DePuy (Mr. Brown orally) for the plaintiff.\nDevine, Millimet, Stahl & Branch and Matthias J. Reynolds and William S. Gannon (Mr. Joseph A. Millimet), by brief and orally, for John A. Durkin.\nThomas D

## Building the Document Database

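This optional step stores the case law documents in a SQLite database so that the full text of a case can be retrieved by case id. The cells in this section are commented out; uncomment them to build the database.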

### Create the database file and document table

In [53]:
# # put the json objects in a sqlite database, keyed by id
# import sqlite3, os, json

# # remove database file if exists
# if os.path.isfile('data.db'):
#     os.remove('data.db')

# conn = sqlite3.connect('data.db')
# c = conn.cursor()

# # create the table if it doesn't exist. include id, text, and size
# c.execute('''CREATE TABLE IF NOT EXISTS data
#              (id INTEGER PRIMARY KEY UNIQUE,
#               metadata TEXT,
#               text TEXT,
#               char_count INTEGER)''')


### Insert the documents into the table

In [54]:
# for doc in documents:
#     id = doc.metadata["id"]
#     c.execute("INSERT INTO data (id, metadata, text, char_count) VALUES (?,?,?,?)", (id, json.dumps(doc.metadata), doc.page_content, doc.metadata["char_count"]))
#     conn.commit()

### Count the documents

In [55]:
# c.execute("SELECT count(*) FROM data")
# doc_count = c.fetchone()[0]
# print(f"Document count: {doc_count}")
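
As a minimal sketch of the same idea in runnable form, the following uses an in-memory SQLite database and two made-up records (hypothetical stand-ins for `documents`) to show inserting rows keyed by case id and fetching the full text back by id. The helper name `get_case_text` is an illustration and does not appear in the notebook.

```python
import json
import sqlite3

# Hypothetical stand-ins for the loaded documents: (metadata, text) pairs.
sample_docs = [
    ({"id": "1", "name_abbreviation": "Example v. Example"}, "Full text of case one."),
    ({"id": "2", "name_abbreviation": "Sample v. Sample"}, "Full text of case two."),
]

# An in-memory database keeps the sketch free of side effects on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS data
       (id INTEGER PRIMARY KEY,
        metadata TEXT,
        text TEXT,
        char_count INTEGER)"""
)

# Insert all rows in one transaction instead of committing per row.
rows = [(int(meta["id"]), json.dumps(meta), text, len(text)) for meta, text in sample_docs]
with conn:
    conn.executemany(
        "INSERT INTO data (id, metadata, text, char_count) VALUES (?, ?, ?, ?)", rows
    )

def get_case_text(connection, case_id):
    """Return the stored text for a case id, or None if the id is unknown."""
    row = connection.execute(
        "SELECT text FROM data WHERE id = ?", (int(case_id),)
    ).fetchone()
    return row[0] if row else None

doc_count = conn.execute("SELECT count(*) FROM data").fetchone()[0]
print(f"Document count: {doc_count}")
print(get_case_text(conn, "2"))
```

Using `?` placeholders lets SQLite handle quoting, which matters for case text that contains apostrophes.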

## Extracting the entities

In this example, we take the caselaw text, split it into chunks, and extract entities from each chunk. 

### Split the document into chunks

Split the document into text chunks that can fit into the model's context window.

In [112]:
from langchain.text_splitter import TokenTextSplitter

num_docs = 30
doc_chunks = {}
documents = [doc for doc in documents[:num_docs] if doc.metadata["id"] in ['4440632', '4441078']]

# Split the documents into chunks
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=50)
for doc in documents:
    id = doc.metadata["id"]
    chunks = text_splitter.split_documents([doc])
    doc_chunks[id] = chunks
    print(f"Case {id}: " + str(len(chunks)))

Case 4440632: 1
Case 4441078: 3
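
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding window over a list of token ids. It is not `TokenTextSplitter`'s implementation; the function name `sliding_window_chunks` and the 2,500-token input are assumptions made for illustration.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks

# A made-up document of 2,500 token ids.
tokens = list(range(2500))
chunks = sliding_window_chunks(tokens, chunk_size=1000, chunk_overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: tokens {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next. This reduces the chance that an entity is cut in half at a chunk boundary.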


### Inspect the chunks

In [115]:
import json
for doc in documents[1:2]:
    id = doc.metadata["id"]
    print(json.dumps(doc.metadata, indent=4))
    for chunk in doc_chunks[id]:
        print(chunk.page_content)

{
    "id": "4441078",
    "name": "Dana A. Desrochers v. Real J. Desrochers",
    "name_abbreviation": "Desrochers v. Desrochers",
    "decision_date": "1975-10-31",
    "docket_number": "No. 7135",
    "first_page": 591,
    "last_page": "595",
    "citations": "115 N.H. 591",
    "volume": "115",
    "reporter": "New Hampshire Reports",
    "court": "New Hampshire Supreme Court",
    "jurisdiction": "New Hampshire",
    "last_updated": "2021-08-10T17:25:43.934256+00:00",
    "provenance": "CAP",
    "judges": "All concurred.",
    "parties": "Dana A. Desrochers v. Real J. Desrochers",
    "head_matter": "Hillsborough\nNo. 7135\nDana A. Desrochers v. Real J. Desrochers\nOctober 31, 1975\nCraig, Wenners, Craig Si McDowell (Mr. Joseph F. McDowell III orally) for the plaintiff.\nClifford J. Ross, by brief and orally, for the defendant.",
    "word_count": "1466",
    "char_count": "8963"
}
"Kenison, C.J.\nThe parties married in September 1970. Their only child, a daughter, was born in J

We can see from this output that the `judges` field in the metadata is not reliable (here it contains "All concurred." rather than a name), so we will extract the judge from the text instead.

### Provide a taxonomy of entities

An LLM can generate a candidate taxonomy from a prompt such as:

```
I am building a knowledge graph from legal case law. What are the entities I should extract for this knowledge graph?
Prefix the major categories with numbers, and the minor categories with letters.
```

## Extracting Entities

### Provide a list of entity categories

In [176]:

query = """\
<|start_of_role|>system<|end_of_role|>
Below is a list of entity categories:

Counsel for Plaintiff/Petitioner: The attorney or law firm representing the plaintiff/petitioner.
Counsel for Defendant/Respondent: The attorney or law firm representing the defendant/respondent.
Judge/Justice: The name of the judge or justice involved in the case, including their role (e.g., trial judge, appellate judge, presiding justice).
Statute/Act: The statute or act referenced or applied in the case (e.g., "Civil Rights Act of 1964").
Precedent Cited: Previous case law referred to in the case.
Constitutional Provision: The constitutional article or amendment referenced in the case (e.g., "First Amendment," "Article III").
Decision/Holding: The final judgment of the court (e.g., "Affirmed," "Reversed").
Disposition: The outcome of the case (e.g., "dismissed with prejudice," "remanded").
Remedy: Type of compensation or relief provided (e.g., "compensatory damages," "injunctive relief").
Sentence: In a criminal case, the sentence handed down (e.g., "5 years imprisonment").

Given this list of entity categories, you will be asked to extract entities belonging to these categories from a text passage.
Consider only the list of entity categories above; do not extract any additional entities. For each entity found, list the category and the entity, separated by a semicolon. Do not use the words "Entity" or "Category".

Here are some examples:
1. Remedy: Compensatory Damages
2. Counsel for Defendant/Respondent: Jane C.
3. Precedent Cited: State vs. Tiger
<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Find the entities in the following text, and list them in the format specified above:

{}
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""


### Extract entities from each chunk of text

In [177]:
doc_extracts = {}
for doc in documents:
    id = doc.metadata['id']
    extracts = []
    for i, chunk in enumerate(doc_chunks[id]):
        print(f"Chunk {i} of {id}")
        full_query = query.format(chunk.page_content)
        print(str(len(tokenizer.tokenize(full_query))) + " tokens")
        response = model.invoke(full_query, max_tokens=1000)
        print(response)
        extracts.append(response)

    doc_extracts[id] = extracts

Chunk 0 of 4440632
841 tokens
1. Disposition: Exception Overruled
2. Judge or Justice: Kenison, C.J. (did not sit)
3. Statute or Act: Not explicitly mentioned
4. Precedent Cited: State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
5. Decision or Holding: Not explicitly mentioned
6. Counsel for Defendant/Respondent: Not explicitly mentioned
7. Counsel for Plaintiff/Petitioner: Not explicitly mentioned
8. Remedy: Not explicitly mentioned
9. Sentence: Not applicable (criminal case)
Chunk 0 of 4441078
1787 tokens
1. Counsel for Plaintiff/Petitioner: Not explicitly mentioned in the text.
2. Counsel for Defendant/Respondent: Not explicitly mentioned in the text.
3. Judge or Justice: Kenison, C.J.
4. Statute or Act: RSA 458:7-a (Supp. 1973)
5. Precedent Cited: Not explicitly mentioned in the text.
6. Constitutional Provision: Not explicitly mentioned in the text.
7. Decision or Holding: Not explicitly mentioned in the text.
8. Disposition: N
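
The token counts printed above can be used to check that a prompt plus the requested completion fits in the model's context window. The sketch below assumes a 4,096-token window; that value and the helper name `fits_context` are assumptions, so check the model card for the model you selected.

```python
def fits_context(prompt_tokens, max_new_tokens, context_window=4096):
    """Return True if the prompt and the requested completion fit in the window.

    context_window=4096 is an assumed value, not one taken from the notebook.
    """
    return prompt_tokens + max_new_tokens <= context_window

# 841 and 1787 are the prompt sizes printed above; 3500 is a made-up larger prompt.
for prompt_tokens in (841, 1787, 3500):
    ok = fits_context(prompt_tokens, max_new_tokens=1000)
    print(f"{prompt_tokens} prompt tokens + 1000 new tokens fits: {ok}")
```

If a prompt does not fit, a smaller `chunk_size` in the splitter or a lower `max_tokens` value would bring it back within the budget.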

### Construct Graph Triples

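Parse each line of the model's response into a (category, entity, case name) triple, add a triple for the court from the metadata, and drop placeholder values such as "Not explicitly mentioned".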

In [205]:
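def get_triples_from_extract(extract, case_name):
    """Parse lines like '1. Category: value' into (category, value, case_name) triples."""
    triples = []
    for line in extract.splitlines():
        if not line.strip():
            continue  # skip blank lines
        try:
            # Split on the first ": " only, so values that contain ": " stay intact.
            numbered_category, value = line.split(": ", 1)
            # Drop the leading list number ("1. ").
            category = numbered_category.split(". ", 1)[1]
            triples.append((category, value, case_name))
        except (ValueError, IndexError):
            print(f"Error parsing line for case {case_name!r}: {line}")
    return triples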

doc_triples = {}
for doc in documents:
    id = doc.metadata['id']
    name = doc.metadata['name_abbreviation']
    triples = []
    for extract in doc_extracts[id]:
        # Break the response up into entity triples.
        triples.extend(get_triples_from_extract(extract, name))
    # Add triples from metadata.
    triples.append(('Court', doc.metadata["court"], name))

    # Add to the triples for the document, extending any existing list.
    doc_triples.setdefault(id, []).extend(triples)

all_triples = []
for id, triples in doc_triples.items():
    print(f"Case {id}")
    for triple in triples:
        r = triple[1].lower()
        if "not explicitly mentioned" not in r and "not applicable" not in r:
            all_triples.append(triple)
            print(triple)


Case 4440632
('Disposition', 'Exception Overruled', 'State v. Craigue')
('Judge or Justice', 'Kenison, C.J. (did not sit)', 'State v. Craigue')
('Precedent Cited', 'State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)', 'State v. Craigue')
('Court', 'New Hampshire Supreme Court', 'State v. Craigue')
Case 4441078
('Judge or Justice', 'Kenison, C.J.', 'Desrochers v. Desrochers')
('Statute or Act', 'RSA 458:7-a (Supp. 1973)', 'Desrochers v. Desrochers')
('Statute or Act', 'RSA 458:7-a', 'Desrochers v. Desrochers')
('Precedent Cited', 'Riley v. Riley, 271 So. 2d 181, 183 (Fla. App. 1972)', 'Desrochers v. Desrochers')
('Precedent Cited', 'Ballou v. Ballou', 'Desrochers v. Desrochers')
('Decision or Holding', 'Remanded', 'Desrochers v. Desrochers')
('Disposition', 'Remanded', 'Desrochers v. Desrochers')
('Court', 'New Hampshire Supreme Court', 'Desrochers v. Desrochers')
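
In the output above, the model returned two precedents for State v. Craigue as a single semicolon-separated value, which would become one graph node. A minimal sketch of one way to split such values into separate triples follows; the helper name `split_multi_valued` is hypothetical, and splitting on `;` is an assumption that could misfire on values that legitimately contain a semicolon.

```python
def split_multi_valued(triples, separator=";"):
    """Expand triples whose value lists several entities into one triple per entity."""
    expanded = []
    for category, value, case_name in triples:
        for part in value.split(separator):
            part = part.strip()
            if part:  # ignore empty pieces, e.g. from a trailing separator
                expanded.append((category, part, case_name))
    return expanded

sample_triples = [
    ('Precedent Cited',
     'State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); '
     'State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)',
     'State v. Craigue'),
    ('Court', 'New Hampshire Supreme Court', 'State v. Craigue'),
]

expanded_triples = split_multi_valued(sample_triples)
for triple in expanded_triples:
    print(triple)
```

With one node per citation, a later lookup for a single precedent can match it directly instead of matching a combined string.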


## Building the Graph Database

### Define methods

In [228]:
from neo4j import GraphDatabase
from stringcase import snakecase, lowercase

# Use the (role, entity, case name) triples built above
triples = all_triples

# Connect to the Neo4j database
uri = get_env_var("NEO4J_URI")
username = get_env_var("NEO4J_USERNAME")
password = get_env_var("NEO4J_PASSWORD")
driver = GraphDatabase.driver(uri, auth=(username, password))

def create_graph(tx, entity1, role, case):
    query = (
        "MERGE (a:Entity {name: $entity1}) "
        "MERGE (c:Case {name: $case}) "
        "MERGE (a)-[r:%s]->(c)"
    ) % snakecase(lowercase(role.replace('/', '_')))
    tx.run(query, entity1=entity1, case=case)

def build_graph(triples):
    with driver.session() as session:
        # Empty the graph first
        session.run("MATCH (n) DETACH DELETE n")
        # Fill the graph
        for role, entity1, case in triples:
            session.execute_write(create_graph, entity1, role, case)

def query_graph():
    with driver.session() as session:
        # Query to find all nodes
        result = session.run("MATCH (n) RETURN n.name AS name")
        print("Nodes in the graph:")
        for record in result:
            print(record["name"])

        # Query to find all relationships
        result = session.run("MATCH (a)-[r]->(b) RETURN a.name AS from, type(r) AS rel, b.name AS to")
        print("\nRelationships in the graph:")
        for record in result:
            print(f"{record['from']} -[{record['rel']}]-> {record['to']}")

# Build the graph from the triples list
build_graph(triples)

# Issue some basic queries against the graph
query_graph()

# Close the connection to the database
driver.close()

print("Graph successfully built and queried in Neo4j!")



Nodes in the graph:
Exception Overruled
State v. Craigue
Kenison, C.J. (did not sit)
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
New Hampshire Supreme Court
Kenison, C.J.
RSA 458:7-a (Supp. 1973)
RSA 458:7-a
Riley v. Riley, 271 So. 2d 181, 183 (Fla. App. 1972)
Ballou v. Ballou
Remanded
Desrochers v. Desrochers

Relationships in the graph:
Kenison, C.J. -[judge_or_justice]-> Desrochers v. Desrochers
RSA 458:7-a (Supp. 1973) -[statute_or_act]-> Desrochers v. Desrochers
RSA 458:7-a -[statute_or_act]-> Desrochers v. Desrochers
Riley v. Riley, 271 So. 2d 181, 183 (Fla. App. 1972) -[precedent_cited]-> Desrochers v. Desrochers
Ballou v. Ballou -[precedent_cited]-> Desrochers v. Desrochers
Remanded -[decision_or_holding]-> Desrochers v. Desrochers
Remanded -[disposition]-> Desrochers v. Desrochers
New Hampshire Supreme Court -[court]-> Desrochers v. Desrochers
Exception Overruled -[disposition]-> State v. Craigue
Kenison, C.J. (did no

In [229]:
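# Reopen the connection, since the driver was closed at the end of the previous cell
driver = GraphDatabase.driver(uri, auth=(username, password))
with driver.session() as session:
    # Find every entity that is cited as a precedent by some case
    result = session.run("MATCH (a)-[:precedent_cited]->() RETURN a.name AS name")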
    print("Nodes in the graph:")
    for record in result:
        print(record["name"])

Nodes in the graph:
Riley v. Riley, 271 So. 2d 181, 183 (Fla. App. 1972)
Ballou v. Ballou
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)


## Populate a vector database with entities

In [230]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [193]:
from langchain_chroma import Chroma

vector_db = Chroma(embedding_function=embeddings_model)

In [227]:
# Inspect the collection metadata; None means no custom settings (such as a distance function) were supplied
print(vector_db._collection.metadata)

None


In [210]:
from langchain_core.documents import Document

names = []
with driver.session() as session:
    # Query to find all nodes
    result = session.run("MATCH (n) RETURN n.name AS name")
    print("Nodes in the graph:")
    for record in result:
        # Wrap each node name in a Document so it can be embedded
        doc = Document(page_content=record["name"])
        names.append(doc)
        print(record["name"])

ids = vector_db.add_documents(names)
print("Documents added: ", len(ids))


Nodes in the graph:
Exception Overruled
State v. Craigue
New Hampshire Supreme Court
Kenison, C.J.
Desrochers v. Desrochers
RSA 458:7-a (Supp. 1973)
RSA 458:7-a
Riley v. Riley, 271 So. 2d 181, 183 (Fla. App. 1972)
Ballou v. Ballou
Remanded
Kenison, C.J. (did not sit)
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
Documents added:  12
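
Re-running the cell above adds the same names to the vector store again, which produces duplicate search hits. One way to guard against that is to derive a stable id from each name and de-duplicate on it, as in this minimal sketch. The helper name `stable_id` and the sample names are illustrative; passing the resulting ids to `add_documents` through an `ids=` argument is an assumption to verify against the `langchain_chroma` documentation.

```python
import hashlib

def stable_id(name):
    """Derive a deterministic id from an entity name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

# Sample names, with one duplicate as would occur after a second run.
node_names = ["Kenison, C.J.", "Ballou v. Ballou", "Kenison, C.J."]

# Keying by the stable id keeps one entry per distinct name, in first-seen order.
unique_by_id = {stable_id(name): name for name in node_names}
doc_ids = list(unique_by_id)
unique_names = list(unique_by_id.values())

print(f"{len(node_names)} names, {len(unique_names)} unique")
```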


## Answer questions

### Extract entities from question

The question below names a judge and a precedent. We reuse the extraction prompt to pull those entities out of the question, then look for cases that are linked to both.

In [237]:
question = "How has Judge Kenison used Ballou v. Ballou to rule on cases?"

response = model.invoke(query.format(question))
print(response)
question_entity_triples = get_triples_from_extract(response, "")
print(question_entity_triples)


1. Judge or Justice: Judge Kenison;
2. Precedent Cited: Ballou v. Ballou.
[('Judge or Justice', 'Judge Kenison;', ''), ('Precedent Cited', 'Ballou v. Ballou.', '')]
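
The extracted values above carry trailing punctuation (`Judge Kenison;` and `Ballou v. Ballou.`), possibly because the prompt asks for a semicolon separator. A minimal sketch of a clean-up step follows; the helper name `clean_entity_value` is hypothetical. It removes only trailing semicolons and whitespace, because a trailing period can be part of an abbreviation such as "C.J." and is therefore left alone.

```python
def clean_entity_value(value):
    """Strip surrounding whitespace and any trailing semicolons from an extracted value."""
    return value.strip().rstrip(";").rstrip()

# The triples printed above, copied here so the sketch runs on its own.
question_triples = [
    ('Judge or Justice', 'Judge Kenison;', ''),
    ('Precedent Cited', 'Ballou v. Ballou.', ''),
]

cleaned_triples = [
    (category, clean_entity_value(value), case_name)
    for category, value, case_name in question_triples
]
print(cleaned_triples)
```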


### Match entities to the graph

In [242]:

def match_entity(name, threshold=1.0, verbose=False):
    """Return the closest stored entity name, or None if nothing is within the threshold.

    With Chroma's default settings the score is an l2 (Euclidean) distance, so lower is closer.
    """
    docs_with_score = vector_db.similarity_search_with_score(name, k=5)
    if verbose:
        for doc, score in docs_with_score:
            print(f"{doc.page_content} has a similarity score of {score}")
    if docs_with_score:
        doc, score = docs_with_score[0]
        if score <= threshold:
            # Return the closest match.
            return doc.page_content
    # No match within the threshold.
    return None


In [224]:
for triple in question_entity_triples:
    name = triple[1]
    print(f"Matching {name}")
    match = match_entity(name, verbose=True)
    if match is not None:
        print(match)

Matching Judge Kenison;
Kenison, C.J. has a similarity score of 0.6313690543174744
Kenison, C.J. has a similarity score of 0.6313690543174744
Kenison, C.J. has a similarity score of 0.6313690543174744
Kenison, C.J. (did not sit) has a similarity score of 0.7332189083099365
Kenison, C.J. (did not sit) has a similarity score of 0.7332189083099365
Kenison, C.J.
Matching Ballou v. Ballou.
Ballou v. Ballou has a similarity score of 0.016251880675554276
Ballou v. Ballou has a similarity score of 0.016251880675554276
Ballou v. Ballou has a similarity score of 0.016251880675554276
State v. Craigue has a similarity score of 1.601686954498291
Desrochers v. Desrochers has a similarity score of 1.622413158416748
Ballou v. Ballou
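
The thresholding logic can be illustrated without a vector database. The sketch below uses made-up two-dimensional vectors in place of real embeddings and a squared L2 distance; whether Chroma's `l2` score is squared is an assumption to check in its documentation. The names `squared_l2` and `closest_within` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_within(query_vec, candidates, threshold=1.0):
    """Return the name of the nearest candidate, or None if it is farther than threshold."""
    best_name, best_dist = None, None
    for name, vec in candidates.items():
        dist = squared_l2(query_vec, vec)
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist is not None and best_dist <= threshold:
        return best_name
    return None

# Toy "embeddings": real ones have hundreds of dimensions.
toy_embeddings = {
    "Kenison, C.J.": (1.0, 0.0),
    "Ballou v. Ballou": (0.0, 1.0),
}

near_match = closest_within((0.9, 0.1), toy_embeddings)    # close to the first vector
no_match = closest_within((-1.0, -1.0), toy_embeddings)    # far from both vectors
print(near_match, no_match)
```

A lower threshold makes matching stricter. In the output above, the correct matches scored well under 1.0 while unrelated case names scored about 1.6, so a threshold of 1.0 separates them for this small sample.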


### Query the graph for cases

Single entity:

In [247]:
def query_for_cases(entity_name, role):
    with driver.session() as session:
        relationship = snakecase(lowercase(role.replace('/', '_')))
        query = f"MATCH (e:Entity {{name: '{entity_name}'}})-[:{relationship}]->(c:Case) RETURN c.name AS name"
        print(query)
        result = session.run(query)
        print("Cases:")
        for record in result:
            print(record["name"])

for triple in question_entity_triples:
    role, entity, _ = triple
    entity_match = match_entity(entity)
    # Skip entities that have no close match in the vector database.
    if entity_match is not None:
        query_for_cases(entity_match, role)

MATCH (e:Entity {name: 'Kenison, C.J.'})-[:judge_or_justice]->(c:Case) RETURN c.name AS name
Cases:
Desrochers v. Desrochers
MATCH (e:Entity {name: 'Ballou v. Ballou'})-[:precedent_cited]->(c:Case) RETURN c.name AS name
Cases:
Desrochers v. Desrochers


Multiple entities:

In [257]:
def query_for_cases(entity_role_pairs):
    with driver.session() as session:
        query = ""
        for i, (entity, role) in enumerate(entity_role_pairs):
            relationship = snakecase(lowercase(role.replace('/', '_')))
            query += f"MATCH (e{str(i)}:Entity {{name: '{entity}'}})-[:{relationship}]->(c)\n"
        query += "RETURN c.name AS name"
        print(query)
        result = session.run(query)
        print("Cases:")
        for record in result:
            print(record["name"])

entity_role_pairs = []
for triple in question_entity_triples:
    role, entity, _ = triple
    entity_match = match_entity(entity)
    # Skip entities that have no close match in the vector database.
    if entity_match is not None:
        entity_role_pairs.append((entity_match, role))

query_for_cases(entity_role_pairs)

MATCH (e0:Entity {name: 'Kenison, C.J.'})-[:judge_or_justice]->(c)
MATCH (e1:Entity {name: 'Ballou v. Ballou'})-[:precedent_cited]->(c)
RETURN c.name AS name
Cases:
Desrochers v. Desrochers
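
The queries above interpolate entity names directly into the Cypher text, which breaks on names that contain an apostrophe. A minimal sketch of an alternative is to build the query with `$` parameters and return the values separately. The helper names `relationship_type` and `build_case_query` are hypothetical, and the regex-based normalisation stands in for the `stringcase` calls used earlier. Relationship types cannot be passed as Cypher parameters, so they are validated before being interpolated.

```python
import re

def relationship_type(role):
    """Turn a role such as 'Judge or Justice' into a safe relationship type."""
    rel = re.sub(r"[^a-z0-9]+", "_", role.lower()).strip("_")
    if not re.fullmatch(r"[a-z][a-z0-9_]*", rel):
        raise ValueError(f"Unsafe relationship type from role: {role!r}")
    return rel

def build_case_query(entity_role_pairs):
    """Return (cypher, params) matching cases linked to every (entity, role) pair."""
    lines, params = [], {}
    for i, (entity, role) in enumerate(entity_role_pairs):
        key = f"name{i}"
        params[key] = entity  # the value travels as a parameter, not as query text
        lines.append(
            f"MATCH (e{i}:Entity {{name: ${key}}})-[:{relationship_type(role)}]->(c:Case)"
        )
    lines.append("RETURN c.name AS name")
    return "\n".join(lines), params

cypher, params = build_case_query([
    ("Kenison, C.J.", "Judge or Justice"),
    ("Ballou v. Ballou", "Precedent Cited"),
])
print(cypher)
print(params)
```

The pair could then be passed to the driver as `session.run(cypher, params)`.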
