# Entity Extraction from New Hampshire Case Law
*With IBM Granite Models*

The [New Hampshire Case Law Dataset](https://huggingface.co/datasets/free-law/nh) comes from the Caselaw Access Project via Hugging Face.

## In this notebook

In this notebook, we'll explore the process of extracting meaningful information from text using entity extraction techniques, and then leverage that information to build and query a simple knowledge graph. Specifically, we'll guide you through the following steps:

- **Entity Extraction**: We'll start by processing a body of text to identify and extract key entities, such as people, organizations, dates, and locations. Entity extraction is a crucial part of natural language processing (NLP) that helps in transforming raw text into structured data.
- **Knowledge Graph Construction**: Once we've extracted entities, we'll build a basic knowledge graph. A knowledge graph represents entities as nodes and relationships as edges, providing a structured way to understand the interconnections between different entities within the text. This helps in visualizing and storing the extracted data meaningfully.
- **Querying the Knowledge Graph**: With the knowledge graph in place, we can retrieve specific information by posing questions. We'll implement methods to query the graph, including resolving entities from the question to entities in the graph. This will allow us to identify relevant graph structures that correspond to the user's query.
- **Question Answering**: Finally, we'll use the results retrieved from the knowledge graph to answer the user's question. By using the structured information from the graph, we can provide detailed answers and offer insights into the relationships and context within the body of text.

This process of transforming unstructured text into a knowledge graph and then querying it is useful for applications such as legal research, medical case studies, or business intelligence. By the end of this notebook, you'll have hands-on experience building a simple pipeline that takes raw text, extracts valuable entities, and then allows users to query the data to obtain meaningful answers. These techniques are a foundation for the more sophisticated approach of Graph RAG.
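
As a rough illustration of the node-and-edge idea, the sketch below holds a few facts as plain Python triples and lists the edges that point at one case. The triples mirror entities that appear in this notebook's outputs; the helper `describe_case` is hypothetical, and the notebook itself stores the graph in Neo4j rather than in memory.

```python
from collections import defaultdict

# Facts expressed as (entity, relationship, case) triples.
triples = [
    ("Kenison, C.J.", "judge_justice", "Desrochers v. Desrochers"),
    ("Ballou v. Ballou", "precedent_cited", "Desrochers v. Desrochers"),
    ("Dunfey, J.", "judge_justice", "State v. Craigue"),
]

# Nodes are every entity and every case; edges are grouped by the case they point to.
nodes = {name for entity, _, case in triples for name in (entity, case)}
edges_by_case = defaultdict(list)
for entity, relationship, case in triples:
    edges_by_case[case].append((relationship, entity))

def describe_case(case):
    """Hypothetical helper: list the (relationship, entity) edges pointing at a case."""
    return sorted(edges_by_case.get(case, []))

print(len(nodes))
print(describe_case("Desrochers v. Desrochers"))
```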

## Prerequisites

To get started, you'll need:
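* A [Replicate account](https://replicate.com/) and API token.
* A running [Neo4j](https://neo4j.com/) database, with its connection details available as `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`.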

## Setting up the environment

### Install dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [1]:
!pip install git+https://github.com/ibm-granite-community/utils \
    "langchain_community<0.3.0" \
    replicate \
    datasets \
    transformers \
    tiktoken \
    neo4j \
    stringcase \
    langchain_huggingface \
    sentence-transformers \
    langchain_chroma

## Selecting System Components

### Choose your LLM
The LLM is used both to extract entities from the case text and to answer the question, given the retrieved text.

Follow the instructions in [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/cee1513c77429d7ddbf0e5a49b29b7bc9ca0d996/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb), selecting a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [2]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import set_env_var, get_env_var

model = Replicate(
    model="ibm-granite/granite-3.0-8b-instruct",
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
)

### Get the tokenizer

Retrieve the tokenizer used by your chosen LLM.

In [3]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)


## Acquiring the Data

We will use a New Hampshire case law dataset to help the model answer questions about NH laws.

### Download the documents

Download the [New Hampshire CAP Caselaw](https://huggingface.co/datasets/free-law/nh) dataset from Hugging Face using the `datasets` library.

In [4]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Load the documents from the dataset
loader = HuggingFaceDatasetLoader("free-law/nh", page_content_column="text")
documents = loader.load()
print("Document Count: " + str(len(documents)))

Document Count: 21540


### Inspect the documents

In [6]:
for doc in documents[:1]:
    print(doc.metadata, "\n")
    print(doc.page_content, "\n")

{'id': '4439812', 'name': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'name_abbreviation': 'Wyman v. Stark', 'decision_date': '1975-01-06', 'docket_number': 'No. 7112', 'first_page': 1, 'last_page': '3', 'citations': '115 N.H. 1', 'volume': '115', 'reporter': 'New Hampshire Reports', 'court': 'New Hampshire Supreme Court', 'jurisdiction': 'New Hampshire', 'last_updated': '2021-08-10T17:25:43.934256+00:00', 'provenance': 'CAP', 'judges': '', 'parties': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'head_matter': 'Hillsborough\nNo. 7112\nLouis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento\nJanuary 6, 1975\nStanley M. Brown, Dart S. Bigg, Eugene M. Van Loan III and David R. DePuy (Mr. Brown orally) for the plaintiff.\nDevine, Millimet, Stahl & Branch and Matthias J. Reynolds and William S. Gannon (Mr. Joseph A. Millimet), by brief and orally, for John A. Durkin.\nThomas D

## Extracting the entities

In this example, we take the caselaw text, split it into chunks, and extract entities from each chunk. 

### Split the document into chunks

Split the document into text chunks that can fit into the model's context window.

In [7]:
from langchain.text_splitter import TokenTextSplitter

doc_chunks = {}
documents = [doc for doc in documents[:30] if doc.metadata["id"] in ['4440632', '4441078']]

# Split the documents into chunks
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=50)
for doc in documents:
    id = doc.metadata["id"]
    chunks = text_splitter.split_documents([doc])
    doc_chunks[id] = chunks
    print(f"Case {id}: " + str(len(chunks)))

Case 4440632: 1
Case 4441078: 3
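
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over a list of tokens. It is a simplified stand-in, not the `TokenTextSplitter` implementation: the function name `split_tokens` is hypothetical, and placeholder strings take the place of real tokenizer output.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Placeholder tokens standing in for real tokenizer output.
tokens = [f"t{i}" for i in range(10)]
chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks share their boundary tokens. That overlap reduces the chance that an entity straddling a chunk boundary is lost.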


### Inspect the chunks

In [8]:
import json
for doc in documents[1:2]:
    id = doc.metadata["id"]
    print(json.dumps(doc.metadata, indent=4))
    for chunk in doc_chunks[id]:
        print(chunk.page_content)

{
    "id": "4441078",
    "name": "Dana A. Desrochers v. Real J. Desrochers",
    "name_abbreviation": "Desrochers v. Desrochers",
    "decision_date": "1975-10-31",
    "docket_number": "No. 7135",
    "first_page": 591,
    "last_page": "595",
    "citations": "115 N.H. 591",
    "volume": "115",
    "reporter": "New Hampshire Reports",
    "court": "New Hampshire Supreme Court",
    "jurisdiction": "New Hampshire",
    "last_updated": "2021-08-10T17:25:43.934256+00:00",
    "provenance": "CAP",
    "judges": "All concurred.",
    "parties": "Dana A. Desrochers v. Real J. Desrochers",
    "head_matter": "Hillsborough\nNo. 7135\nDana A. Desrochers v. Real J. Desrochers\nOctober 31, 1975\nCraig, Wenners, Craig Si McDowell (Mr. Joseph F. McDowell III orally) for the plaintiff.\nClifford J. Ross, by brief and orally, for the defendant.",
    "word_count": "1466",
    "char_count": "8963",
    "source": "4441078"
}
"Kenison, C.J.\nThe parties married in September 1970. Their only child, 

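We can see from this output that the `judges` field in the metadata is not reliable: here it contains "All concurred." while the opinion text opens with the authoring judge, "Kenison, C.J." We will therefore extract the judge from the text itself.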

### Provide a taxonomy of entities

The entity categories used below can be drafted with an LLM, using a prompt such as:

```
I am building a knowledge graph from legal case law. What are the entities I should extract for this knowledge graph?
Prefix the major categories with numbers, and the minor categories with letters.
```

### Provide a list of entity categories

In [9]:

query = """\
<|start_of_role|>system<|end_of_role|>
Below is a list of entity categories:

Counsel for Plaintiff/Petitioner: The attorney or law firm representing the plaintiff/petitioner.
Counsel for Defendant/Respondent: The attorney or law firm representing the defendant/respondent.
Judge/Justice: The name of the judge or justice involved in the case, including their role (e.g., trial judge, appellate judge, presiding justice).
Statute/Act: The statute or act referenced or applied in the case (e.g., "Civil Rights Act of 1964").
Precedent Cited: Previous case law referred to in the case.
Constitutional Provision: The constitutional article or amendment referenced in the case (e.g., "First Amendment," "Article III").
Decision/Holding: The final judgment of the court (e.g., "Affirmed," "Reversed").
Disposition: The outcome of the case (e.g., "dismissed with prejudice," "remanded").
Remedy: Type of compensation or relief provided (e.g., "compensatory damages," "injunctive relief").
Sentence: In a criminal case, the sentence handed down (e.g., "5 years imprisonment").

Given this list of entity categories, you will be asked to extract entities belonging to these categories from a text passage.
Consider only the list of entity categories above; do not extract any additional entities. For each entity found, list the category and the entity, separated by a colon. Do not use the words "Entity" or "Category".

Here are some examples:
1. Remedy: Compensatory Damages
2. Counsel for Defendant/Respondent: Jane C.
3. Precedent Cited: State vs. Tiger
<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Find the entities in the following text, and list them in the format specified above:

{}
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""


### Extract entities from each chunk of text

In [10]:
doc_extracts = {}
for doc in documents:
    id = doc.metadata['id']
    extracts = []
    for i, chunk in enumerate(doc_chunks[id]):
        print(f"Chunk {i} of {id}")
        full_query = query.format(chunk.page_content)
        print(str(len(tokenizer.tokenize(full_query))) + " tokens")
        response = model.invoke(full_query, max_tokens=1000)
        print(response)
        extracts.append(response)

    doc_extracts[id] = extracts

Chunk 0 of 4440632
840 tokens
1. Disposition: Exception overruled
2. Judge/Justice: Dunfey, J.
3. Statute/Act: Not specified
4. Precedent Cited: State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
5. Counsel for Defendant/Respondent: Not specified
6. Counsel for Plaintiff/Petitioner: Not specified
7. Decision/Holding: Exception overruled
8. Remedy: Not specified
9. Sentence: Not applicable (criminal case)
Chunk 0 of 4441078
1786 tokens
1. Counsel for Plaintiff/Petitioner: Not explicitly mentioned in the text.
2. Counsel for Defendant/Respondent: Not explicitly mentioned in the text.
3. Judge/Justice: Kenison, C.J.
4. Statute/Act: RSA 458:7-a (Supp. 1973)
5. Precedent Cited: Not explicitly mentioned in the text.
6. Constitutional Provision: Not explicitly mentioned in the text.
7. Decision/Holding: The court transferred the question of whether cause exists for granting a divorce under the provisions of RSA 458:7-a to the superior court

### Construct Graph Triples

Using the extracted entities along with the text chunk, construct graph triples.

In [11]:
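def get_triples_from_extract(extract, case_name):
    """Parse numbered "Category: Entity" lines into (category, entity, case) triples."""
    triples = []
    for line in extract.splitlines():
        try:
            # Split only on the first ": " so that entities may contain colons.
            label, entity = line.split(": ", 1)
            # Drop the leading list number, e.g. "3. Judge/Justice" -> "Judge/Justice".
            category = label.split(". ", 1)[1]
            triples.append((category, entity, case_name))
        except (ValueError, IndexError):
            print(f"Error parsing case {case_name} line: {line}")
    return triples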

doc_triples = {}
for doc in documents:
    id = doc.metadata['id']
    name = doc.metadata['name_abbreviation']
    triples = []
    for extract in doc_extracts[id]:
        # Break response up into entity triples.
        triples.extend(get_triples_from_extract(extract, name))
    # Add triples from metadata.
    triples.append(('Court', doc.metadata["court"], name))

    # Add to triples for the document.
    doc_triples.setdefault(id, []).extend(triples)

all_triples = []
for id, triples in doc_triples.items():
    print(f"Case {id}")
    for triple in triples:
        r = triple[1].lower()
        if "not explicitly mentioned" not in r and "not applicable" not in r:
            all_triples.append(triple)
            print(triple)


Case 4440632
('Disposition', 'Exception overruled', 'State v. Craigue')
('Judge/Justice', 'Dunfey, J.', 'State v. Craigue')
('Statute/Act', 'Not specified', 'State v. Craigue')
('Precedent Cited', 'State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)', 'State v. Craigue')
('Counsel for Defendant/Respondent', 'Not specified', 'State v. Craigue')
('Counsel for Plaintiff/Petitioner', 'Not specified', 'State v. Craigue')
('Decision/Holding', 'Exception overruled', 'State v. Craigue')
('Remedy', 'Not specified', 'State v. Craigue')
('Court', 'New Hampshire Supreme Court', 'State v. Craigue')
Case 4441078
('Judge/Justice', 'Kenison, C.J.', 'Desrochers v. Desrochers')
('Statute/Act', 'RSA 458:7-a (Supp. 1973)', 'Desrochers v. Desrochers')
('Decision/Holding', 'The court transferred the question of whether cause exists for granting a divorce under the provisions of RSA 458:7-a to the superior court without ruling.', 'Desrochers v. Desrochers')
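
The output above still contains placeholder entities such as "Not specified", and model responses can carry stray punctuation. The sketch below is one possible stricter parser, offered as an assumption rather than the notebook's method: the helper `parse_extract`, the regular expression, and the list of placeholder phrases are all hypothetical choices.

```python
import re

# Matches lines such as "3. Judge/Justice: Kenison, C.J."
LINE_PATTERN = re.compile(r"^\s*\d+\.\s*(?P<category>[^:]+):\s*(?P<entity>.+?)\s*$")
# Assumed placeholder phrases that signal the model found nothing.
PLACEHOLDERS = ("not specified", "not explicitly mentioned", "not applicable")

def parse_extract(extract, case_name):
    """Hypothetical helper: turn numbered 'Category: Entity' lines into triples."""
    triples = []
    for line in extract.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # Skip lines that do not follow the expected format.
        category = match.group("category").strip()
        # Remove a trailing semicolon but keep periods, which abbreviations need.
        entity = match.group("entity").rstrip(";").strip()
        if any(p in entity.lower() for p in PLACEHOLDERS):
            continue  # Skip placeholder answers.
        triples.append((category, entity, case_name))
    return triples

sample = """1. Judge/Justice: Judge Kenison;
2. Statute/Act: Not specified
3. Precedent Cited: Ballou v. Ballou"""
print(parse_extract(sample, "Desrochers v. Desrochers"))
```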

## Building the Graph Database

### Define methods

In [12]:
from neo4j import GraphDatabase
from stringcase import snakecase, lowercase

# Define the list of (category, entity, case) triples
triples = all_triples

# Connect to the Neo4j database
uri = get_env_var("NEO4J_URI")
username = get_env_var("NEO4J_USERNAME")
password = get_env_var("NEO4J_PASSWORD")
driver = GraphDatabase.driver(uri, auth=(username, password))

def create_graph(tx, entity1, role, case):
    query = (
        "MERGE (a:Entity {name: $entity1}) "
        "MERGE (c:Case {name: $case}) "
        "MERGE (a)-[r:%s]->(c)"
    ) % snakecase(lowercase(role.replace('/', '_')))
    tx.run(query, entity1=entity1, case=case)

def build_graph(triples):
    with driver.session() as session:
        # Empty the graph first
        session.run("MATCH (n) DETACH DELETE n")
        # Fill the graph
        for role, entity1, case in triples:
            session.execute_write(create_graph, entity1, role, case)

def query_graph():
    with driver.session() as session:
        # Query to find all nodes
        result = session.run("MATCH (n) RETURN n.name AS name")
        print("Nodes in the graph:")
        for record in result:
            print(record["name"])

        # Query to find all relationships
        result = session.run("MATCH (a)-[r]->(b) RETURN a.name AS from, type(r) AS rel, b.name AS to")
        print("\nRelationships in the graph:")
        for record in result:
            print(f"{record['from']} -[{record['rel']}]-> {record['to']}")

# Build the graph from the triples list
build_graph(triples)

# Issue some basic queries against the graph
query_graph()

# Close the connection to the database
driver.close()

print("Graph successfully built and queried in Neo4j!")


Nodes in the graph:
Exception overruled
State v. Craigue
Dunfey, J.
Not specified
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
New Hampshire Supreme Court
Kenison, C.J.
Desrochers v. Desrochers
RSA 458:7-a (Supp. 1973)
The court transferred the question of whether cause exists for granting a divorce under the provisions of RSA 458:7-a to the superior court without ruling.
The case was transferred to the superior court without ruling.
Riley v. Riley
RSA 458:7-a
Woodruff v. Woodruff
Rodrique v. Rodrique
Ballou v. Ballou
Remanded
All concurred

Relationships in the graph:
Exception overruled -[disposition]-> State v. Craigue
Dunfey, J. -[judge_justice]-> State v. Craigue
Not specified -[statute_act]-> State v. Craigue
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974) -[precedent_cited]-> State v. Craigue
Not specified -[counsel_for_defendant_respondent]-> State v. Craigue
Not s
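
Because the relationship type is interpolated into the Cypher text rather than passed as a parameter, the category label has to be reduced to a safe identifier first. The following is a minimal sketch of that normalization using only the standard library; `to_relationship_type` is a hypothetical helper and may not match `stringcase` on every input (for example, camelCase labels).

```python
import re

def to_relationship_type(role):
    """Hypothetical helper: convert a category label into a snake_case relationship type."""
    # Replace every run of non-alphanumeric characters with a single underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", role).strip("_").lower()
    # Reject anything that is not a plain identifier before it reaches a query string.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        raise ValueError(f"Cannot derive a safe relationship type from {role!r}")
    return name

print(to_relationship_type("Counsel for Defendant/Respondent"))
print(to_relationship_type("Judge/Justice"))
```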

In [13]:
driver = GraphDatabase.driver(uri, auth=(username, password))
with driver.session() as session:
    # Query to find all entities cited as precedent
    result = session.run("MATCH (a)-[:precedent_cited]->() RETURN a.name AS name")
    print("Nodes in the graph:")
    for record in result:
        print(record["name"])

Nodes in the graph:
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
Riley v. Riley
Woodruff v. Woodruff
Rodrique v. Rodrique
Ballou v. Ballou


## Populate a vector database with entities

In [14]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [15]:
from langchain_chroma import Chroma

vector_db = Chroma(embedding_function=embeddings_model)

In [16]:
from langchain.docstore.document import Document

names = []
with driver.session() as session:
    # Query to find all nodes
    result = session.run("MATCH (n) RETURN n.name AS name")
    print("Nodes in the graph:")
    for record in result:
        doc = Document(record["name"])
        names.append(doc)
        print(record["name"])

ids = vector_db.add_documents(names)
print("Documents added: ", len(ids))


Nodes in the graph:
Exception overruled
State v. Craigue
Dunfey, J.
Not specified
State v. Costello, 110 N.H. 182, 263 A.2d 671 (1970); State v. Allen, 114 N.H. 682, 327 A.2d 715 (1974)
New Hampshire Supreme Court
Kenison, C.J.
Desrochers v. Desrochers
RSA 458:7-a (Supp. 1973)
The court transferred the question of whether cause exists for granting a divorce under the provisions of RSA 458:7-a to the superior court without ruling.
The case was transferred to the superior court without ruling.
Riley v. Riley
RSA 458:7-a
Woodruff v. Woodruff
Rodrique v. Rodrique
Ballou v. Ballou
Remanded
All concurred
Documents added:  18


## Answer questions

### Extract entities from question

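The question below names two entities: a judge and a cited precedent. We reuse the extraction prompt to pull those entities out of the question, then look for cases that have both entities in common.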

In [17]:
question = "How has Judge Kenison used Ballou v. Ballou to rule on cases?"

response = model.invoke(query.format(question))
print(response)
question_entity_triples = get_triples_from_extract(response, "")
print(question_entity_triples)


1. Judge/Justice: Judge Kenison;
2. Precedent Cited: Ballou v. Ballou.
[('Judge/Justice', 'Judge Kenison;', ''), ('Precedent Cited', 'Ballou v. Ballou.', '')]


### Match entities to the graph

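Entities are currently matched name to name: the entity text from the question is compared with the node names stored in the vector database. A possible refinement would be to match on surrounding context instead.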

In [18]:

def match_entity(name, threshold=1.0):
    """Return the closest entity name in the vector database, or None if no match is within the threshold.

    With Chroma, l2 (Euclidean) distance is used, so a lower score means a closer match.
    """
    docs_with_score = vector_db.similarity_search_with_score(name, k=5)
    if docs_with_score:
        # Results are ordered by distance, so the first one is the closest.
        doc, score = docs_with_score[0]
        if score <= threshold:
            return doc.page_content
    # No match within the threshold.
    return None


In [19]:
for triple in question_entity_triples:
    name = triple[1]
    print(f"\nMatching {name}")
    match = match_entity(name)
    if match is not None:
        print(f"Match: {match}")


Matching Judge Kenison;
Match: Kenison, C.J.

Matching Ballou v. Ballou.
Match: Ballou v. Ballou
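
The thresholding idea behind `match_entity` can be illustrated without a vector database. The sketch below uses toy two-dimensional vectors in place of sentence embeddings; the helper `closest_within` and the vectors themselves are made up for illustration only.

```python
import math

def closest_within(query_vec, candidates, threshold=1.0):
    """Hypothetical helper: return the nearest candidate name if its L2 distance is within threshold."""
    best_name, best_dist = None, math.inf
    for name, vec in candidates.items():
        dist = math.dist(query_vec, vec)  # Euclidean (L2) distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None

# Toy 2-D vectors standing in for sentence embeddings.
candidates = {
    "Kenison, C.J.": (1.0, 0.0),
    "Ballou v. Ballou": (0.0, 1.0),
}
print(closest_within((0.9, 0.1), candidates))  # near the first vector
print(closest_within((5.0, 5.0), candidates))  # far from both, so no match
```

A tighter threshold trades recall for precision: fewer question entities resolve to a node, but those that do are less likely to be wrong.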


### Query the graph for cases

Query for cases given a single entity and its relationship to the case.

In [20]:
def query_for_cases(entity_name, role):
    with driver.session() as session:
        relationship = snakecase(lowercase(role.replace('/', '_')))
        query = f"MATCH (e:Entity {{name: '{entity_name}'}})-[:{relationship}]->(c:Case) RETURN c.name AS name"
        print(query)
        result = session.run(query)
        print("Cases:")
        for record in result:
            print(record["name"])

for triple in question_entity_triples:
    role, entity, c = triple
    entity_match = match_entity(entity)
    query_for_cases(entity_match, role)

MATCH (e:Entity {name: 'Kenison, C.J.'})-[:judge_justice]->(c:Case) RETURN c.name AS name
Cases:
Desrochers v. Desrochers
MATCH (e:Entity {name: 'Ballou v. Ballou'})-[:precedent_cited]->(c:Case) RETURN c.name AS name
Cases:
Desrochers v. Desrochers
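
Each single-entity query returns its own list of cases. One way to find the cases that involve every entity from the question is to intersect those per-entity results in Python, as sketched here. The dictionary contents copy the output above, and `common_cases` is a hypothetical helper.

```python
# Cases returned for each (entity, relationship) pair, copied from the output above.
cases_by_entity = {
    ("Kenison, C.J.", "judge_justice"): {"Desrochers v. Desrochers"},
    ("Ballou v. Ballou", "precedent_cited"): {"Desrochers v. Desrochers"},
}

def common_cases(cases_by_entity):
    """Hypothetical helper: return the cases shared by every entity."""
    case_sets = list(cases_by_entity.values())
    if not case_sets:
        return set()
    return set.intersection(*case_sets)

print(common_cases(cases_by_entity))
```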


Query for cases given multiple entities and their relationships to the case.

In [21]:
def query_for_cases(entity_role_pairs):
    with driver.session() as session:
        query = ""
        for i, (entity, role) in enumerate(entity_role_pairs):
            relationship = snakecase(lowercase(role.replace('/', '_')))
            query += f"MATCH (e{str(i)}:Entity {{name: '{entity}'}})-[:{relationship}]->(c)\n"
        query += "RETURN c.name AS name"
        print(query)
        result = session.run(query)
        cases = []
        print("Cases:")
        for record in result:
            cases.append(record["name"])
            print(record["name"])
        return cases

entity_role_pairs = []
for triple in question_entity_triples:
    role, entity, c = triple
    entity_match = match_entity(entity)
    entity_role_pairs.append((entity_match, role))
    
cases = query_for_cases(entity_role_pairs)

MATCH (e0:Entity {name: 'Kenison, C.J.'})-[:judge_justice]->(c)
MATCH (e1:Entity {name: 'Ballou v. Ballou'})-[:precedent_cited]->(c)
RETURN c.name AS name
Cases:
Desrochers v. Desrochers
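
The queries above embed entity names directly in the Cypher string, which breaks on names that contain a single quote. A hedged alternative is to pass names as query parameters. The sketch below only builds the query text and parameter dictionary, so it runs without a database; `build_case_query` is a hypothetical helper. The relationship type is still interpolated, because plain Cypher parameters carry values rather than relationship types, so it is validated first.

```python
import re

def build_case_query(entity_role_pairs):
    """Hypothetical helper: build a Cypher query and a parameter dict for several entities."""
    lines, params = [], {}
    for i, (entity, relationship) in enumerate(entity_role_pairs):
        # Relationship types are interpolated, so accept only simple identifiers.
        if not re.fullmatch(r"[a-z][a-z0-9_]*", relationship):
            raise ValueError(f"Unsafe relationship type: {relationship!r}")
        lines.append(f"MATCH (e{i}:Entity {{name: $name{i}}})-[:{relationship}]->(c:Case)")
        params[f"name{i}"] = entity  # Entity names travel as parameters, not query text.
    lines.append("RETURN c.name AS name")
    return "\n".join(lines), params

query_text, params = build_case_query([
    ("Kenison, C.J.", "judge_justice"),
    ("Ballou v. Ballou", "precedent_cited"),
])
print(query_text)
print(params)
# With a live driver session, this could be run as: session.run(query_text, params)
```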


### Retrieve the case text

In [22]:
case_text = [doc.page_content for doc in documents if doc.metadata["name_abbreviation"] == cases[0]][0]
print(case_text)


Answer the question using the following text from one case: 

"Kenison, C.J.\nThe parties married in September 1970. Their only child, a daughter, was born in January 1973. The parties separated in May of that year and the wife brought this libel for divorce the following September. A month later the parties agreed to and the court approved arrangements for custody, visitation and support. The defendant did not support his wife and child from the time of separation until the temporary decree. He made the payments called for by the decree from its entry until June 1975. In July 1974, the Hillsborough County Superior Court, Loughlin, J., held a hearing and made certain findings of fact. The critical portion of these findings is: \u201c[T]he action was originally brought because the defendant did not work steadily and stated that he, when he learned that the plaintiff was pregnant, wanted a boy instead of a girl; if the plaintiff bore a girl he would like to put the child up for adoption

### Answer the question

Combine the retrieved case text with the question and ask the model for an answer.

In [None]:
q = f"""
Answer the question using the following text from one case: \n\n{case_text}

Question: {question}
"""

print(question)
response = model.invoke(q)
print(response)
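
The extraction prompt earlier in this notebook wraps its instructions in Granite role markers, while the question-answering prompt above is plain text. As a possible variation, the sketch below builds the question-answering prompt with the same markers. The helper `build_qa_prompt` and the wording of the system instruction are assumptions, not part of the original notebook.

```python
def build_qa_prompt(case_text, question):
    """Hypothetical helper: wrap the case text and question in Granite-style role markers."""
    return (
        "<|start_of_role|>system<|end_of_role|>\n"
        "Answer the question using only the case text provided by the user.\n"
        "<|end_of_text|>\n"
        "<|start_of_role|>user<|end_of_role|>\n"
        f"Case text:\n{case_text}\n\n"
        f"Question: {question}\n"
        "<|end_of_text|>\n"
        "<|start_of_role|>assistant<|end_of_role|>"
    )

prompt = build_qa_prompt(
    "Kenison, C.J.\nThe parties married in September 1970.",
    "How has Judge Kenison used Ballou v. Ballou to rule on cases?",
)
print(prompt)
# The result could then be passed to the model, e.g. model.invoke(prompt)
```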