# Entity Extraction from New Hampshire Case Law
*With IBM Granite Models*

The [New Hampshire Case Law Dataset](https://huggingface.co/datasets/free-law/nh) comes from the Caselaw Access Project via Hugging Face.

## In this notebook
This notebook shows how to extract entities, such as parties, attorneys, judges, and cited statutes, from New Hampshire court opinions using an IBM Granite model.

## Prerequisites

To get started, you'll need:
* A [Replicate account](https://replicate.com/) and API token.

## Setting up the environment

### Install dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [1]:
!pip install git+https://github.com/ibm-granite-community/utils \
    "langchain_community<0.3.0" \
    replicate \
    datasets \
    transformers \
    tiktoken

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /private/var/folders/nc/jrql4k0n2j73h7xktzxdr4pr0000gn/T/pip-req-build-2p_unsvs
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /private/var/folders/nc/jrql4k0n2j73h7xktzxdr4pr0000gn/T/pip-req-build-2p_unsvs
  Resolved https://github.com/ibm-granite-community/utils to commit a4b663310cdc11be2f3039a11d263dae98584582
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Selecting System Components

### Choose your LLM
The LLM will be used to extract entities from each chunk of case text.

Follow the instructions in [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/cee1513c77429d7ddbf0e5a49b29b7bc9ca0d996/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb), selecting a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org. The cell below uses `ibm-granite/granite-3.0-8b-instruct`.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [2]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import set_env_var

set_env_var("REPLICATE_API_TOKEN")

model = Replicate(
    model="ibm-granite/granite-3.0-8b-instruct",
)
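
`set_env_var` comes from the Granite community utilities. If you are working without that package, a rough stand-in might look like the sketch below. The name `ensure_env_var` and its `prompt` parameter are hypothetical, and this is not the helper's actual implementation; it only assumes the goal is to have the token in `os.environ` without hard-coding it in the notebook.

```python
import os
from getpass import getpass


def ensure_env_var(name, prompt=getpass):
    """Return os.environ[name], asking for it once if it is not set.

    Hypothetical stand-in for a notebook helper. `prompt` is injectable so
    the function can be exercised without an interactive terminal.
    """
    value = os.environ.get(name)
    if not value:
        value = prompt(f"Enter a value for {name}: ")
        if not value:
            raise ValueError(f"{name} was not provided")
        os.environ[name] = value
    return value
```

Calling `ensure_env_var("REPLICATE_API_TOKEN")` before constructing the model would prompt only when the variable is missing.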

### Get the tokenizer

Retrieve the tokenizer used by your chosen LLM.

In [3]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
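
One use for the tokenizer is checking whether a prompt leaves room for the reply. The sketch below is a minimal illustration, not part of the original notebook: `fits_in_context` is a hypothetical helper, the 4096-token window is an assumption to verify against the model card, and the 1000 tokens reserved for output mirror the `max_tokens=1000` used later. A whitespace splitter stands in for the real tokenizer so the example runs offline.

```python
def fits_in_context(text, encode, context_window=4096, reserved_for_output=1000):
    """Return (fits, n_tokens) for `text` under a token budget.

    `encode` is any callable that maps a string to a sequence of tokens.
    The 4096 default is an assumed context size, not a verified model limit.
    """
    n_tokens = len(encode(text))
    return n_tokens <= context_window - reserved_for_output, n_tokens


# Stand-in encoder so the sketch runs without downloading anything.
# With the notebook's tokenizer you would pass `tokenizer.encode` instead.
whitespace_encode = str.split

fits, n = fits_in_context(
    "Per curiam. This transfer arises out of the same case.",
    whitespace_encode,
)
print(fits, n)
```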



## Acquiring the Data

We will use a New Hampshire case law dataset as the source text for entity extraction.

### Download the documents

Download the [New Hampshire CAP Caselaw](https://huggingface.co/datasets/free-law/nh) dataset from Hugging Face using LangChain's `HuggingFaceDatasetLoader`, which relies on the `datasets` library.

In [4]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Load the documents from the dataset
loader = HuggingFaceDatasetLoader("free-law/nh", page_content_column="text")
documents = loader.load()
print("Document Count: " + str(len(documents)))

Document Count: 21540


### Add metadata to the documents

Add the `source` field, which is used below, to the metadata.

In [5]:
for doc in documents:
    doc.metadata['source'] = doc.metadata['id']

### Inspect the documents

In [6]:
for doc in documents[:1]:
    print(doc.metadata, "\n")
    print(doc.page_content, "\n")

{'id': '4439812', 'name': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'name_abbreviation': 'Wyman v. Stark', 'decision_date': '1975-01-06', 'docket_number': 'No. 7112', 'first_page': 1, 'last_page': '3', 'citations': '115 N.H. 1', 'volume': '115', 'reporter': 'New Hampshire Reports', 'court': 'New Hampshire Supreme Court', 'jurisdiction': 'New Hampshire', 'last_updated': '2021-08-10T17:25:43.934256+00:00', 'provenance': 'CAP', 'judges': '', 'parties': 'Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento', 'head_matter': 'Hillsborough\nNo. 7112\nLouis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento\nJanuary 6, 1975\nStanley M. Brown, Dart S. Bigg, Eugene M. Van Loan III and David R. DePuy (Mr. Brown orally) for the plaintiff.\nDevine, Millimet, Stahl & Branch and Matthias J. Reynolds and William S. Gannon (Mr. Joseph A. Millimet), by brief and orally, for John A. Durkin.\nThomas D

## Building the Document Database

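A SQLite database keyed by case ID lets us retrieve the full text of a case later. The cells in this section are commented out, so running the notebook as-is skips them; uncomment them to build `data.db`.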

### Create the database file and document table

In [7]:
# # put the json objects in a sqlite database, keyed by id
# import sqlite3, os, json

# # remove database file if exists
# if os.path.isfile('data.db'):
#     os.remove('data.db')

# conn = sqlite3.connect('data.db')
# c = conn.cursor()

# # create the table if it doesn't exist. include id, text, and size
# c.execute('''CREATE TABLE IF NOT EXISTS data
#              (id INTEGER PRIMARY KEY UNIQUE,
#               metadata TEXT,
#               text TEXT,
#               char_count INTEGER)''')


### Insert the documents into the table

In [8]:
# for doc in documents:
#     doc_id = doc.metadata["id"]
#     c.execute("INSERT INTO data (id, metadata, text, char_count) VALUES (?,?,?,?)", (doc_id, json.dumps(doc.metadata), doc.page_content, doc.metadata["char_count"]))
# # commit once after all rows are inserted
# conn.commit()

### Count the documents

In [9]:
# c.execute("SELECT count(*) FROM data")
# doc_count = c.fetchone()[0]
# print(f"Document count: {doc_count}")
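
Since the database cells above are commented out, here is a small self-contained sketch of the same idea that can be run safely. It uses an in-memory database and two made-up records instead of the real documents, computes `char_count` from the text length (an assumption for the illustration), and adds a hypothetical `get_case` helper to show lookup by case ID.

```python
import json
import sqlite3

# Made-up records standing in for the loaded documents.
records = [
    {"id": 1, "metadata": {"name": "Example v. Sample"}, "text": "Opinion text one."},
    {"id": 2, "metadata": {"name": "Doe v. Roe"}, "text": "A second, longer opinion text."},
]

conn = sqlite3.connect(":memory:")  # in-memory, so no file is created
conn.execute(
    """CREATE TABLE IF NOT EXISTS data
       (id INTEGER PRIMARY KEY,
        metadata TEXT,
        text TEXT,
        char_count INTEGER)"""
)

# Insert all rows in one batch; the `with` block commits once at the end.
with conn:
    conn.executemany(
        "INSERT INTO data (id, metadata, text, char_count) VALUES (?,?,?,?)",
        [(r["id"], json.dumps(r["metadata"]), r["text"], len(r["text"])) for r in records],
    )


def get_case(conn, case_id):
    """Return (metadata dict, text) for a case id, or None if it is absent."""
    row = conn.execute(
        "SELECT metadata, text FROM data WHERE id = ?", (case_id,)
    ).fetchone()
    return None if row is None else (json.loads(row[0]), row[1])


doc_count = conn.execute("SELECT count(*) FROM data").fetchone()[0]
print(f"Document count: {doc_count}")
print(get_case(conn, 2))
```

Storing the metadata as a JSON string keeps the table schema fixed even if the dataset's metadata fields change.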

## Extracting the entities

In this example, we take the caselaw text, split it into chunks, and extract entities from each chunk. 

### Split the document into chunks

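Split the document into text segments that fit into the model's context window. `TokenTextSplitter` counts tokens with `tiktoken` rather than with the Granite tokenizer, so treat `chunk_size` as an approximate budget.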

In [17]:
from langchain.text_splitter import TokenTextSplitter

# Split the documents into chunks
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=10)
chunks = text_splitter.split_documents(documents[:1])
print("Chunk Count: " + str(len(chunks)))

Chunk Count: 1
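
To make `chunk_size` and `chunk_overlap` concrete, the sketch below splits a list of stand-in token IDs into overlapping windows. It illustrates the general sliding-window technique and is not LangChain's implementation; `split_with_overlap` is a hypothetical name. A sequence shorter than `chunk_size` comes back as a single window, which is consistent with the count of 1 reported above for the first document.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token sequence into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return chunks


tokens = list(range(25))  # stand-in token ids
windows = split_with_overlap(tokens, chunk_size=10, chunk_overlap=2)
for w in windows:
    print(w[0], "...", w[-1], f"({len(w)} tokens)")
```

The overlap repeats a few tokens at each boundary so that an entity straddling two chunks appears whole in at least one of them, provided it is shorter than the overlap.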


### Inspect the chunks

In [18]:
import json
for i in range(1):
    print(chunks[i].page_content)
    print(json.dumps(chunks[i].metadata, indent=4))

"Per curiam.\nThis transfer arises out of the same case which was the subject matter of the petition for writ of prohibition in Durkin v. Hillsborough County Superior Court, 114 N.H. 788, 330 A.2d 777 (1974). The Superior Court (Bois, J.) has transferred without ruling seven questions, the first of which is as follows: \u201cDoes the Superior Court have jurisdiction either through RSA 68:4 II; other jurisdictional statutes or through precedent, to invalidate an election for United States Senator?\u201d\nThe several States may regulate the conduct of senatorial elections and may provide procedures necessary to guard against irregularity and error in the tabulation of votes and against fraud and corrupt practices. U.S. Const. art. I, \u00a7 4; Smiley v. Holm, 285 U.S. 355 (1932). They may provide procedures for a recount so long as they do not impair or frustrate the Senate\u2019s ability to make an independent judgment. Roudebush v. Hartke, 405 U.S. 15 (1972).\nThe proceedings before th

In [23]:
query = """
Here are the categories of entity I want you to consider:

1. Case Information:
A. Case Name (e.g., "Brown v. Board of Education")
B. Docket Number (unique case identifier)
C. Court (e.g., Supreme Court, District Court)
D. Jurisdiction (state or federal jurisdiction, e.g., California, United States)
2. Legal Parties:
A. Plaintiff/Petitioner (the party initiating the case)
B. Defendant/Respondent (the party responding to the case)
C. Appellant/Appellee (for appeals)
3. Attorneys:
A. Counsel for Plaintiff/Petitioner (name, law firm, or organization)
B. Counsel for Defendant/Respondent (name, law firm, or organization)
4. Judges:
A. Judge(s)/Justice(s) (name and role: trial judge, appellate judge, presiding justice)
B. Panel Composition (for appellate cases, listing judges involved)
5. Legal Issues:
A. Legal Questions (the issues or questions of law presented to the court)
B. Claims (the specific claims or complaints raised by the plaintiff)
6. Legal Doctrines and Principles:
A. Statutes/Acts (e.g., "Civil Rights Act of 1964")
B. Precedents Cited (previous case law referred to)
C. Constitutional Provisions (e.g., "First Amendment," "Article III")
7. Facts and Context:
A. Material Facts (key facts on which the case is based)
B. Timeline (sequence of events leading up to and during the case)
8. Case Outcome:
A. Decision/Holding (the final judgment, such as "Affirmed," "Reversed")
B. Disposition (e.g., "dismissed with prejudice," "remanded")
C. Majority Opinion (summary or reasoning of the court's majority)
D. Concurring Opinion (opinion agreeing with the majority but for different reasons)
E. Dissenting Opinion (opinion disagreeing with the majority decision)
9. Procedural History:
A. Lower Court Rulings (previous decisions that led to the current case)
B. Appeals (sequence of appeals, appellate court involvement)
10. Relationships Between Entities:
A. Case Citation Relationship (e.g., "Case A cited Case B")
B. Statute Application (e.g., "Statute X was applied in Case Y")
C. Fact Relationships (linking individuals or events to legal consequences)
11. Court Documents:
A. Pleadings (complaints, answers, motions)
B. Briefs (e.g., appellate briefs, amicus briefs)
C. Orders (e.g., summary judgment, dismissal orders)
12. Dates:
A. Date of Filing (date the case was initiated)
B. Date of Decision (date the judgment was rendered)
C. Hearing Dates (important hearings or oral argument dates)
13. Legal Outcomes:
A. Remedies (e.g., "compensatory damages," "injunctive relief")
B. Sentences (in criminal cases, e.g., "5 years imprisonment")
14. Legal Concepts:
A. Standards of Review (e.g., "de novo," "abuse of discretion")
B. Burden of Proof (e.g., "preponderance of the evidence," "beyond a reasonable doubt")


What are the entities of interest in the following text? Use the same numbering and lettering scheme as the entity categories above, using numbers for the major category and letters for the minor category. \n\n{}"""
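
Because the prompt asks the model to reuse the numbering and lettering of the category list, a reply in that layout can be parsed line by line. The sketch below assumes the reply uses lines of the form `N. Category:` followed by `A. Label: value`; real replies may deviate from this, and `parse_entities` is a hypothetical helper rather than part of the notebook.

```python
import re

MAJOR = re.compile(r"^(\d+)\.\s*(.+?):?\s*$")         # e.g. "1. Case Information:"
MINOR = re.compile(r"^([A-Z])\.\s*([^:]+):\s*(.*)$")  # e.g. "A. Case Name: ..."


def parse_entities(response):
    """Map keys such as "1A" to {"category", "label", "value"} dicts."""
    entities = {}
    number, category = None, None
    for raw_line in response.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        major = MAJOR.match(line)
        if major:
            number, category = major.group(1), major.group(2)
            continue
        minor = MINOR.match(line)
        if minor and number is not None:
            letter, label, value = minor.groups()
            entities[number + letter] = {
                "category": category,
                "label": label.strip(),
                "value": value.strip(),
            }
    return entities


sample = """1. Case Information:
   A. Case Name: Wyman v. Stark
   B. Docket Number: No. 7112

2. Legal Parties:
   A. Plaintiff/Petitioner: Louis C. Wyman
"""
parsed = parse_entities(sample)
print(parsed["1B"])
```

Lines that match neither pattern are skipped, so free-form commentary in a reply is dropped rather than raising an error.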


In [25]:

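for chunk in chunks[:1]:
    print(f"Chunk of {chunk.metadata['id']}")
    # Formatting the whole Document (not only chunk.page_content) also puts its
    # metadata, such as head_matter, into the prompt.
    response = model.invoke(query.format(chunk), max_tokens=1000)
    print(response)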

Chunk of 4439812
1. Case Information:
   A. Case Name: Louis C. Wyman v. John A. Durkin Robert L. Stark, Secretary of State Carmen Chimento
   B. Docket Number: No. 7112
   C. Court: New Hampshire Supreme Court
   D. Jurisdiction: New Hampshire

2. Legal Parties:
   A. Plaintiff/Petitioner: Louis C. Wyman
   B. Defendant/Respondent: John A. Durkin
   C. Appellant/Appellee: Not explicitly mentioned, but the case involves an appeal to the Senate of the United States.

3. Attorneys:
   A. Counsel for Plaintiff/Petitioner: Stanley M. Brown, Dart S. Bigg, Eugene M. Van Loan III, and David R. DePuy
   B. Counsel for Defendant/Respondent: John A. Durkin (represented by Devine, Millimet, Stahl & Branch and Matthias J. Reynolds and William S. Gannon)
   C. Counsel for Robert L. Stark, Secretary of State: Thomas D. Rath, assistant attorney general
   D. Counsel for Carmen Chimento: Richard W. Leonard

4. Judges:
   A. Judge(s)/Justice(s): Not explicitly mentioned, but the case was decided by the