# Semantically mapping citations between legal documents
This Isaacus Cookbook teaches you how to extract citations from legal documents, classify them by treatment, cluster them by area of law, and then visualize the resulting knowledge graph interactively in 3D.

Isaacus' [Kanon 2 Enricher](https://docs.isaacus.com/models/introduction#enrichment) document enrichment model will be used to extract and classify entities in documents, while [Kanon 2 Embedder](https://docs.isaacus.com/models/introduction#embedding) will be used to represent documents as vectors for clustering and dimensionality reduction. Isaacus' open source dataset of [Australian High Court cases](https://huggingface.co/datasets/isaacus/open-australian-legal-corpus) will serve as an example corpus, but this Cookbook can easily be adapted to work with any other corpus, including non-legal corpora. All Isaacus models are trained on a diverse range of data from around the world. In fact, our training data contains much more material from the US and UK than from Australia, simply because those jurisdictions have much older legal traditions. Although the models are primarily optimized for legal tasks, they can also be applied to a wide variety of other domains with strong performance.

## 1. Setup
Before we begin, you will need an Isaacus account, a valid Isaacus API key, and access to the Kanon 2 Enricher closed beta. You can apply for access to the beta [here](https://isaacus.com/beta). Once accepted, follow the first step of our [quickstart guide](https://docs.isaacus.com/quickstart) to set up an account and obtain an API key.

After you have your API key, set up an `ISAACUS_API_KEY` environment variable with your API key as the value. This could be done by creating a `.env` file in the same directory as this notebook with the content: `ISAACUS_API_KEY=insert_your_api_key_here`.
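
A quick check like the one below can confirm that the key is visible to Python before any requests are made. This is a minimal sketch: `require_api_key` is a hypothetical helper, not part of the Isaacus SDK, and in this notebook it would need to run after `load_dotenv()` for a `.env` file to be picked up.

```python
import os


def require_api_key(name: str = "ISAACUS_API_KEY") -> str:
    """Return the named environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()

    if not value:
        raise RuntimeError(f"{name} is not set; add it to your environment or `.env` file.")

    return value
```

Failing early with a clear message tends to be easier to debug than an authentication error raised midway through a long enrichment run.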

We will now install and import our dependencies and set up our Isaacus API client.

In [None]:
# Install dependencies.
%pip install ipykernel tqdm scikit-learn isaacus pacmap python-dotenv datasets requests

In [None]:
# Load dependencies.
import os
import re
import json
import math
import base64
import asyncio
import itertools
import unicodedata

from datetime import datetime
from collections import Counter

import numpy as np
import pacmap
import requests
import isaacus.types.ilgs.v1 as ilgs_v1

from tqdm import tqdm
from dotenv import load_dotenv
from isaacus import AsyncIsaacus
from datasets import load_dataset
from IPython.display import HTML, display, Javascript
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

In [3]:
# Initialize an Isaacus API client.
load_dotenv()  # Load environment variables from `.env` files if any are present.
client = AsyncIsaacus(api_key=os.getenv("ISAACUS_API_KEY"))

## 2. Data
As mentioned earlier, we will be using Isaacus' open source dataset of [Australian High Court cases](https://huggingface.co/datasets/isaacus/open-australian-legal-corpus) as an example corpus, but any document corpus can be used instead.

To assist with linking references to documents, we will store the citations of cases alongside their text. If your documents don't have citations or titles, you can simply set `data` to a list of texts and the rest of the code in this notebook will still work without modification. In that case, we will fall back to titles extracted by Kanon 2 Enricher, which may or may not deliver better results depending on how references are structured in your documents.

In [4]:
# Load data.
data: list[str | tuple[str, str]] = [] # NOTE If your documents don't have citations or titles, you can set this to a list of texts instead of a list of (title, text) tuples.

for decision in load_dataset("isaacus/high-court-of-australia-cases", split="corpus"):
    data.append((decision["citation"], decision["text"]))

## 3. Enrichment
We will now enrich our dataset into the [Isaacus Legal Graph Schema (ILGS)](https://docs.isaacus.com/ilgs/introduction) using Kanon 2 Enricher.

Because our dataset is relatively large (~8k documents with an average of ~12k tokens per document), we'll leverage async to speed up the enrichment process by making up to 8 concurrent enrichment requests at a time. Using a maximum concurrency above 8 may lead to throttling and is unlikely to speed up the process in our experience.
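
For readers who prefer a rolling window over fixed batches, the concurrency cap can also be expressed with a semaphore. The sketch below is an alternative to the batching used in this notebook, not the notebook's own method; `gather_bounded` is a hypothetical helper, and whether it is actually faster against the API has not been measured here.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 8,
) -> list[R]:
    """Run `worker` over `items` with at most `limit` calls in flight, preserving input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        # Each call waits for a free slot, so a slow document does not hold up a whole batch.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run(item) for item in items))
```

In a notebook cell this could be awaited directly, for example `await gather_bounded(texts, enrich, limit=8)`, where `texts` is assumed to be a list of document texts and `enrich` is the helper defined later in this section.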

Kanon 2 Enricher has a maximum input length of 16,384 tokens. To get around that, we'll also set our `overflow_strategy` argument to `auto`, which means that if a document exceeds the model's context window, it will automatically be split up into chunks that fit within the model's context window and then the enriched chunks will be intelligently stitched back together in order to form a single enriched document. This ensures that we capture all citations in our documents no matter where they appear.

In [5]:
# Define an asynchronous helper function to enrich a single document with Kanon 2 Enricher.
async def enrich(text: str) -> ilgs_v1.document.Document:
    """Enrich a document with Kanon 2 Enricher."""

    response = await client.enrichments.create(
        model="kanon-2-enricher",
        texts=[text],
        overflow_strategy="auto",  # NOTE Setting our `overflow_strategy` to `auto` ensures that if a document exceeds the model's context window, it will automatically be split up into chunks and the results will be intelligently stitched back together to form a single enriched document, ensuring we capture all citations in our documents no matter where they appear.
    )

    # Retrieve and return the enriched document from the response.
    return response.results[0].document

In [6]:
# Enrich up to `CONCURRENCY` documents at a time and store the results in `enriched`.
# NOTE This should take around 40 minutes to an hour to run with `CONCURRENCY` set to 8 (the run recorded below took roughly 63 minutes). Setting `CONCURRENCY` above 8 may lead to throttling and is unlikely to speed up the process in our experience.
CONCURRENCY = 8
enriched: list[ilgs_v1.document.Document] = []

for batch in tqdm(itertools.batched(data, CONCURRENCY), total=math.ceil(len(data) / CONCURRENCY)):
    enriched.extend(
        await asyncio.gather(
            *[
                enrich(title_and_text if isinstance(title_and_text, str) else title_and_text[1])
                for title_and_text in batch
            ]
        )
    )

100%|██████████| 1012/1012 [1:02:58<00:00,  3.73s/it]


## 4. Postprocessing
At this point, our enriched documents are stored in our `enriched` variable as a list of ILGS `Document` objects in the same order as our original dataset. `Document.external_documents` stores references to cited documents, including the sentiment towards those documents (in the `ExternalDocument.reception` field, which is one of `positive`, `negative`, `neutral`, or `mixed`), while `Document.title` stores the title of the document as extracted by Kanon 2 Enricher.
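
As a small illustration of how these fields can be read, the sketch below tallies reception labels for one document. It relies only on the two attributes named above (`external_documents` and `reception`); the `SimpleNamespace` objects are stand-ins for illustration and are not real ILGS types.

```python
from collections import Counter
from types import SimpleNamespace


def tally_receptions(document) -> Counter:
    """Count how many cited documents fall under each reception label."""
    return Counter(exd.reception for exd in document.external_documents)


# Stand-in object for illustration only; real documents come from Kanon 2 Enricher.
toy_document = SimpleNamespace(
    external_documents=[
        SimpleNamespace(reception="positive"),
        SimpleNamespace(reception="neutral"),
        SimpleNamespace(reception="positive"),
    ]
)

print(tally_receptions(toy_document))
```

Passing one of the entries in `enriched` instead of `toy_document` should work the same way, assuming the attributes behave as described.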

We will now join the enriched documents with metadata from our original dataset and then store the resulting data locally to `docs.jsonl` in case anything goes wrong later on and we need to recover the data we already enriched.

If our original dataset did not store the titles of documents, we will use the titles extracted by Kanon 2 Enricher instead. If a title is not available for a document, we will have to skip it since we won't be able to link it to other documents without a title. You could, however, customize this linking process to rely on your own structured metadata or heuristic-based extractions instead of titles.

In [7]:
# Join enriched documents with metadata from our original dataset or add extracted titles as metadata if our original dataset did not have titles.
if isinstance(data[0], tuple):
    docs = [(title, document) for (title, _), document in zip(data, enriched)]

else:
    docs = [
        (doc.title.decode(doc.text), doc) for doc in enriched if doc.title is not None
    ]  # NOTE If the model was not able to extract a title for a document, we will have to skip it since we won't be able to link it to other documents without a title. You could also customize this linking process to rely on your own structured metadata or heuristic-based extractions instead of titles.

In [8]:
# Store enriched documents locally in case anything goes wrong later on and we need to recover the data we already enriched.
with open("docs.jsonl", "w") as f:
    for title, doc in docs:
        doc_dict = {
            "name": title,
            "doc": doc.model_dump_json(), # NOTE This is how we serialize ILGS Document objects into JSON strings.
        }
        doc_json = json.dumps(doc_dict)
        f.write(doc_json + "\n")

In [9]:
# Reload enriched documents from the file we just created.
with open("docs.jsonl", "r") as f:
    docs: list[tuple[str, ilgs_v1.document.Document]] = []
    
    for line in f:
        doc_dict = json.loads(line)
        doc_dict["doc"] = ilgs_v1.document.Document.model_validate_json(doc_dict["doc"]) # NOTE This is how we convert the JSON string representation of our document back into an ILGS Document object.
        docs.append((doc_dict["name"], doc_dict["doc"]))

## 5. Citation linking
Even though Kanon 2 Enricher has correctly extracted and classified citations in our documents and identified their canonical forms within each document, we still need to link them to the documents they refer to.

For example, _Al-Kateb v Godwin_ might cite _Project Blue Sky v Australian Broadcasting Authority_ [1998] HCA 28, while _NZYQ v Minister for Immigration_ might cite _Project Blue Sky v ABA_ (1998) 194 CLR 355. We need to be able to identify that _Project Blue Sky v Australian Broadcasting Authority_ [1998] HCA 28 and _Project Blue Sky v ABA_ (1998) 194 CLR 355 are actually the same case in order to correctly link citations to the documents they refer to.

To keep things simple, we will match cases by aggressively normalizing their citations into a globally canonical form, including stripping out report numbers and stop words. We leverage simple regex patterns to achieve this.

The patterns we use to build stable case identifiers are specific to case citations and may be most effective in common law jurisdictions. Accordingly, you should customize these patterns or use your own linking strategies if you want to link other types of documents.

In [None]:
# Create a function to build stable case identifiers by normalizing citations.
STOP_WORDS = {"the", "pty", "ltd", "limited", "co", "company", "proprietary"}

def canonicalize_citation(name: str) -> tuple[str, int | None]:
    """Normalize a citation and extract its year if it has one."""
    
    # Apply NFKC Unicode normalization to ensure that visually similar characters are represented in a consistent way.
    name = unicodedata.normalize("NFKC", name)
    
    # Remove capitalization.
    name = name.casefold()
    
    # Remove leading and trailing whitespace.
    name = name.strip()
    
    # Replace consecutive whitespace characters with a single space.
    name = re.sub(r"\s+", " ", name)
    
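    # Drop "& ors" and "and others". The word boundaries stop this from matching inside longer words (eg, "holland orsini").
    name = re.sub(r"(?:\band|&) (?:ors|others)\b", "", name)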
    
    # Drop "in re".
    name = re.sub(r"(\b)(in )?re(\b)", "", name)
    
    # Drop stop words.
    for stop_word in STOP_WORDS:
        name = re.sub(rf"\b{stop_word}\b", "", name)
    
    # If the citation has a year, extract it and remove it and everything after it.
    year = re.search(r'[\{\[\(]+((?:1[89]|2[01])\d\d)[\}\]\)]+', name)
    
    if year:
        name = name[:year.start()]
        year = int(year.group(1))
        
        if year > datetime.now().year: # NOTE We only want to extract the year if it's a plausible year (ie, not in the future).
            year = None
    
    # Replace all non-alphanumeric characters with nothing.
    name = re.sub(r"[^a-z0-9]", "", name)
    
    return name, year

In [None]:
# Canonicalize the citations of all documents.
docs_by_name_and_year: dict[tuple[str, int], tuple[ilgs_v1.document.Document, str]] = {}
seen_names = set()
names_to_years = {}

for title, doc in docs:
    name, year = canonicalize_citation(title)
    
    # If a document does not have an extractable name, we will try and use the extracted title, if one exists, otherwise, we'll have to skip it.
    if not name:
        if not doc.title:
            continue
        
        name, other_year = canonicalize_citation(doc.title.decode(doc.text))
        
        if not name:
            continue
        
        if year is None:
            year = other_year
    
    # If a document does not have an extractable year, we will try to use the year of the creation date extracted by Kanon 2 Enricher if one exists, otherwise we'll just have to skip this document.
    if year is None:
        for date in doc.dates:
            if date.type == "creation":
                year = date.value.split("-")[0] # NOTE Date values are in ISO format, so we can simply split on "-" and take the first element to extract the year.
                year = int(year)
                
                # Filter for plausible years (ie, not in the future).
                if year <= datetime.now().year:
                    break
        
        else:
            continue
        
    # Store the document's name and year.
    docs_by_name_and_year[(name, year)] = (doc, title)
    
    # If the document's name has never been seen before, store it by its name alone since we can say unambiguously that this name refers to this document. If it has been seen before, then we get rid of any document stored by that name and blacklist the name.
    if name not in seen_names:
        names_to_years[name] = year
        seen_names.add(name)
    
    elif name in names_to_years:
        del names_to_years[name]
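
The name-only index is easier to follow on toy data. The sketch below mirrors the bookkeeping above with a hypothetical helper, `index_unambiguous_names`, and made-up keys: a name that occurs once can be resolved without a year, while a repeated name is dropped.

```python
def index_unambiguous_names(keys: list[tuple[str, int]]) -> dict[str, int]:
    """Map each name to its year, keeping only names that occur exactly once."""
    seen: set[str] = set()
    names_to_years: dict[str, int] = {}

    for name, year in keys:
        if name not in seen:
            seen.add(name)
            names_to_years[name] = year

        else:
            # A repeated name is ambiguous, so it can no longer be resolved without a year.
            names_to_years.pop(name, None)

    return names_to_years


toy_keys = [("rvsmith", 1990), ("colevwhitfield", 1988), ("rvsmith", 2004), ("rvsmith", 2011)]
print(index_unambiguous_names(toy_keys))
```

Here only `colevwhitfield` survives, because `rvsmith` appears under several years and could refer to any of them.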

In [None]:
# Link citations to the documents they refer to by matching their canonical forms.
docs_with_citations: dict[tuple[str, int], tuple[ilgs_v1.document.Document, list[tuple[tuple[str, int], str]], str]] = {}

for (doc_name, doc_year), (doc, title) in docs_by_name_and_year.items():
    # For each cited document, we will look at all of the different ways they are mentioned and pinpointed and will then select the most common canonicalized form among them that appears in our dataset.
    citations = {}

    for exd in doc.external_documents:
        names = []
        years = []

        for reference in (
            exd.mentions + exd.pinpoints
        ):  # NOTE Because case years can sometimes be stuck in `pinpoints` instead of `mentions`, we will look in both places.
            name, year = canonicalize_citation(reference.decode(doc.text))

            if name:
                names.append(name)

            # Filter for plausible years (ie, not in the future relative to the current document's year).
            if year and year <= doc_year:
                years.append(year)

        name_counts = Counter(names)
        year_counts = Counter(years)
        names_and_years = [(name, year) for name in name_counts.keys() for year in year_counts.keys()]
        names_and_years.sort(key=lambda x: name_counts[x[0]] + year_counts[x[1]], reverse=True)

        for name, year in names_and_years:
            if (name, year) in docs_by_name_and_year:
                citation = (name, year)
                break

            elif name in names_to_years:
                citation = (name, names_to_years[name])
                break

        else:
            for name in name_counts.keys():
                if name in names_to_years:
                    citation = (name, names_to_years[name])
                    break
                
            else:
                continue
        
        # If the same cited external document has not already been seen in this document, store its treatment. If it has been seen (ie, Kanon 2 Enricher mistakenly extracted duplicate entities, which can sometimes happen), merge the treatments (positive/negative + neutral = positive/negative; positive/mixed + negative/mixed = mixed).
        if citation not in citations:
            citations[citation] = exd.reception
        
        else:
            other_reception = citations[citation]
            
            # A neutral duplicate never overrides an existing treatment.
            if exd.reception != other_reception and exd.reception != 'neutral':
                if other_reception == 'neutral':
                    citations[citation] = exd.reception
                
                elif other_reception in {'positive', 'negative'}:
                    citations[citation] = 'mixed'
    
    citations = list(citations.items())
    
    docs_with_citations[(doc_name, doc_year)] = (doc, citations, title)
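
The merging rules described in the comment above can also be written as a small pure function, which makes them easy to test in isolation. `merge_receptions` is a hypothetical helper that follows those stated rules; it is a sketch, not code taken from the linking cell.

```python
def merge_receptions(current: str, new: str) -> str:
    """Combine two reception labels recorded for the same cited document."""
    if current == new or new == "neutral":
        return current

    if current == "neutral":
        return new

    # Every remaining case pairs two different labels drawn from positive, negative, and mixed.
    return "mixed"


print(merge_receptions("positive", "neutral"))
print(merge_receptions("positive", "negative"))
```

Under these rules the result does not depend on the order in which duplicates are encountered, which is a convenient property when the same case is extracted more than once.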

## 6. Embedding and clustering
Once we've linked citations to the documents they refer to, the next step is to represent the documents as vectors using Kanon 2 Embedder and then cluster them and reduce their dimensionality to three dimensions so they can be visualized in an interactive 3D graph.

We will use K-means for clustering, TF-IDF for labeling clusters, and [PaCMAP](https://github.com/YingfanWang/PaCMAP) for dimensionality reduction.

As we'll soon see, without any human input, the clusters we derive from Kanon 2 Embedder's representations will end up correlating quite well with areas of law thanks to Kanon 2 Embedder's unique ability to precisely model the meaning of legal texts.

In [15]:
# Define an asynchronous helper function to embed a single document with Kanon 2 Embedder.
async def embed(text: str) -> np.ndarray:
    """Embed a document with Kanon 2 Embedder."""

    response = await client.embeddings.create(model="kanon-2-embedder", texts=[text])

    # Retrieve and return the embedding.
    return np.array(response.embeddings[0].embedding)

In [16]:
# Embed the documents.
embeddings: list[np.ndarray] = []

for batch in tqdm(itertools.batched(docs_with_citations.values(), CONCURRENCY), total=math.ceil(len(docs_with_citations) / CONCURRENCY)):
    embeddings.extend(
        await asyncio.gather(
            *[
                embed(doc.text)
                for doc, _, _ in batch
            ]
        )
    )

100%|██████████| 896/896 [28:09<00:00,  1.89s/it]


In [17]:
# Save the embeddings locally in case anything goes wrong later on and we need to recover the data we already embedded.
np.save("embeddings.npy", np.array(embeddings))

# Load embeddings from the file we just created.
embeddings = np.load("embeddings.npy")

In [18]:
# Define a helper function to automatically label clusters using TF-IDF.
def label_clusters(texts: list[str], clusters: np.ndarray, stopwords: set[str] | None = None):
    stopwords = list(ENGLISH_STOP_WORDS.union(stopwords or set()))
    tfidf_vectorizer = TfidfVectorizer(
        stop_words=stopwords, ngram_range=(1, 2), min_df=2, max_df=0.8, max_features=50_000
    )
    
    M = tfidf_vectorizer.fit_transform(texts)
    keywords = np.array(tfidf_vectorizer.get_feature_names_out())
    
    k = int(np.max(clusters)) + 1
    labeled = {}
    
    for c in range(k):
        idx = np.where(clusters == c)[0]
        
        if len(idx) == 0:
            labeled[c] = f'cluster {c}'
        
        else:
            mean = M[idx].mean(axis=0).A1
            c_keywords = keywords[np.argsort(mean)[::-1][:3]].tolist()
            labeled[c] = " / ".join(c_keywords)
    
    return labeled
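
To make the labeling idea concrete, the dependency-free sketch below ranks terms by their mean TF-IDF weight within each cluster. It is a simplified stand-in for the function above: it uses whitespace tokenization, unigrams only, a smoothed IDF, and no stop words or vector normalization, so its weights will not match `TfidfVectorizer` exactly. `top_terms_per_cluster` and the toy texts are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def top_terms_per_cluster(texts: list[str], clusters: list[int], top_n: int = 2) -> dict[int, list[str]]:
    """Return the `top_n` terms with the highest mean TF-IDF weight in each cluster."""
    tokenized = [text.lower().split() for text in texts]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(texts)
    sums: dict[int, Counter] = defaultdict(Counter)

    for tokens, cluster in zip(tokenized, clusters):
        for term, count in Counter(tokens).items():
            # Smoothed inverse document frequency, so that rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            sums[cluster][term] += (count / len(tokens)) * idf

    # Within a cluster, ranking by summed weight is equivalent to ranking by mean weight.
    return {
        cluster: [term for term, _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for cluster, weights in sums.items()
    }


toy_texts = [
    "contract breach damages contract",
    "contract offer acceptance",
    "visa migration tribunal visa",
    "visa detention migration",
]
toy_labels = top_terms_per_cluster(toy_texts, clusters=[0, 0, 1, 1])
print(toy_labels)
```

On this toy input the first cluster is led by "contract" and the second by "visa", which is the behavior the real labeling step relies on at a much larger scale.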

We will now cluster our documents using K-means. The only hyperparameter to tune is `K_CLUSTERS`, which corresponds to the number of clusters we want. The right number of clusters depends on the underlying structure of your data. If you have a very heterogeneous corpus like Australian High Court cases, you may wish to set a large number of clusters in order to properly capture the diversity of your data.

In [19]:
# Apply K-means clustering to the embeddings of our documents.
K_CLUSTERS = 14

clusters = KMeans(n_clusters=K_CLUSTERS, random_state=42).fit_predict(embeddings)

# Label clusters with TF-IDF.
cluster_labels = label_clusters(
    texts=[doc.text for doc, _, _ in docs_with_citations.values()],
    clusters=clusters,
)

Now, we will reduce the dimensionality of our embeddings from 1792 dimensions to 3 dimensions using PaCMAP, an algorithm designed to preserve as much local and global information as possible when compressing dimensions. Because PaCMAP is a non-deterministic algorithm, the final map produced may look different each time you run it. Also, because PaCMAP can produce negative coordinates, we will use min-max normalization to shift coordinates into a range of [0, 1] in order to make them easier to visualize.
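
The min-max step is simple enough to check by hand. The sketch below is written in plain Python so that the arithmetic is visible, whereas the notebook's cell does the same thing with NumPy. `min_max_normalize` and the toy points are illustrative only.

```python
def min_max_normalize(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Rescale each coordinate axis independently into [0, 1]."""
    columns = list(zip(*points))
    lows = [min(column) for column in columns]
    # Guard against zero-width axes so that constant columns do not divide by zero.
    spans = [max(max(column) - low, 1e-12) for column, low in zip(columns, lows)]

    return [
        tuple((value - low) / span for value, low, span in zip(point, lows, spans))
        for point in points
    ]


toy_points = [(-2.0, 10.0, 5.0), (0.0, 20.0, 5.0), (2.0, 30.0, 5.0)]
normalized = min_max_normalize(toy_points)
print(normalized)
```

The first two axes are stretched to span 0 to 1, while the constant third axis maps to 0 for every point thanks to the small floor on the span.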

In [20]:
# Reduce the dimensionality of our embeddings.
xyz = pacmap.PaCMAP(
    n_components=3,
    n_neighbors=15,
    MN_ratio=0.5,
    FP_ratio=2.0,
    verbose=True,
).fit_transform(embeddings, init="pca")

# Min-max normalize coordinates to make them easier to visualize.
xyz_min = xyz.min(axis=0)
xyz = (xyz - xyz_min) / np.maximum(np.ptp(xyz, axis=0), 1e-12)

Note: `n_components != 2` have not been thoroughly tested.


Applied PCA, the dimensionality becomes 100
PaCMAP(n_neighbors=15, n_MN=8, n_FP=30, distance=euclidean, lr=1.0, n_iters=(100, 100, 250), apply_pca=True, opt_method='adam', verbose=True, intermediate=False, seed=None)
Finding pairs
Found nearest neighbor
Calculated sigma
Found scaled dist
Pairs sampled successfully.
((107460, 2), (57312, 2), (214920, 2))
Initial Loss: 132717.9375
Iteration:   10, Loss: 94948.265625
Iteration:   20, Loss: 86948.312500
Iteration:   30, Loss: 82736.484375
Iteration:   40, Loss: 79285.718750
Iteration:   50, Loss: 75872.265625
Iteration:   60, Loss: 72223.242188
Iteration:   70, Loss: 68130.335938
Iteration:   80, Loss: 63348.812500
Iteration:   90, Loss: 57413.015625
Iteration:  100, Loss: 48976.273438
Iteration:  110, Loss: 63104.609375
Iteration:  120, Loss: 62980.257812
Iteration:  130, Loss: 62937.398438
Iteration:  140, Loss: 62922.164062
Iteration:  150, Loss: 62919.132812
Iteration:  160, Loss: 62918.304688
Iteration:  170, Loss: 62920.140625
...

## 7. Visualization
Finally, we can visualize our documents in an interactive 3D graph. Each point represents a document, is colored by its cluster, and is sized according to the number of citations it receives. Each line between two points represents a citation and is colored by its treatment.
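
Since point size depends on how often a document is cited, it can be useful to compute those counts yourself, for example to sanity-check the graph. The sketch below assumes records that carry a `key` and a list of `citations` (each with its own `key`), as in the widget input used in this section. `count_inbound_citations` is a hypothetical helper, and how the widget itself sizes points is not shown in this notebook.

```python
from collections import Counter


def count_inbound_citations(records: list[dict]) -> Counter:
    """Count how many times each document key is cited across all records."""
    return Counter(citation["key"] for record in records for citation in record["citations"])


toy_records = [
    {"key": "a_2000", "citations": [{"key": "b_1998", "reception": "positive"}]},
    {
        "key": "c_2005",
        "citations": [
            {"key": "b_1998", "reception": "negative"},
            {"key": "a_2000", "reception": "neutral"},
        ],
    },
    {"key": "b_1998", "citations": []},
]
inbound = count_inbound_citations(toy_records)
print(inbound.most_common())
```

A `Counter` returns 0 for keys it has never seen, so documents that are never cited need no special handling.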

We will leverage d3.js for this visualization. We have already prepared an HTML widget that can easily be rendered in a Jupyter notebook by preparing our data in the below format and then injecting it into the HTML template.

```python
[
    {
        "name": "Brown v Smith [2000] HCA 1",
        "key": "brownvsmith_2000",
        "year": 2000,
        "citations": [
            {
                "canonical_citation": "Project Blue Sky v Australian Broadcasting Authority [1998] HCA 28",
                "reception": "positive",
                "key": "projectbluesky_1998"
            }
        ],
        "coords": (0.5, 0.5, 0.5),
        "cluster_name": "Contract law",
        "cluster_keywords": ["contract", "agreement", "breach", "damages"],
        "cluster_id": 0
    },
    ...
]
```

In [27]:
# Format our output into the input format required by our visualization widget.
output: list[dict] = []

for ((name, year), (doc, citations, title)), cluster, (x, y, z) in zip(docs_with_citations.items(), clusters, xyz):
    output.append({
        "name": title,
        "key": f"{name}_{year}",
        "year": year,
        "citations": [
            {
                "canonical_citation": docs_with_citations[(cname, cyear)][2],
                "reception": reception,
                "key": f"{cname}_{cyear}",
            }
            for (cname, cyear), reception in citations
        ],
        "coords": (float(x), float(y), float(z)),
        "cluster_name": cluster_labels[cluster],
        "cluster_keywords": cluster_labels[cluster].split(" / "),
        "cluster_id": int(cluster),
    })

In [28]:
# Download the widget.
html: str = requests.get(
    "https://raw.githubusercontent.com/isaacus-dev/cookbooks/refs/heads/main/enrichment/kanon-2-enricher/widget.html"
).text

# Set up an iframe to render the widget in.
display(
    HTML("""\
<div style="width:100%; height:1000px; border:1px solid #ddd; border-radius:8px; overflow:hidden;">
  <iframe id="caseGraphFrame" style="width:100%; height:1000px; border:0;" sandbox="allow-scripts allow-same-origin"></iframe>
</div>""")
)

# Inject the widget and our input data into the iFrame with JavaScript.
display(
    Javascript(f"""
(() => {{
  const iframe = document.getElementById('caseGraphFrame');
  const b64 = s => new TextDecoder().decode(Uint8Array.from(atob(s), c => c.charCodeAt(0)));

  iframe.contentWindow.document.open();
  iframe.contentWindow.document.write(b64("{base64.b64encode(html.encode()).decode("ascii")}"));
  iframe.contentWindow.document.close();

  iframe.addEventListener('load', () => {{
    iframe.contentWindow.setDocsFinal(JSON.parse(b64("{base64.b64encode(json.dumps(output).encode("utf-8")).decode("ascii")}")));
  }}, {{ once: true }});\
}})();
""")
)
