# Natural Language Processing (NLP)-Based Patent Support Analysis
*Kirk A. Sigmon, kirk.a.sigmon.th@dartmouth.edu*

This is **experimental** code for testing whether modern NLP techniques can identify possible support (or a lack thereof) in a patent specification.  Patent specifications are retrieved directly from USPTO databases.  The support search fuses two best-in-class **local-only** retrieval methods ([BAAI's BGE-M3](https://huggingface.co/BAAI/bge-m3) text embeddings + [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) lexical scoring) and then applies a **local-only** re-ranker ([BAAI's BGE-Reranker-Large](https://huggingface.co/BAAI/bge-reranker-large)).  As a result, the system is totally free, executes entirely locally, and (when run on appropriate hardware) would be secure for confidential data processing.  That said, local execution has the obvious caveat of not being best-in-class overall.  That distinction goes to third-party, API-based models, which incur financial cost and raise security concerns.

Known issues:
* **More semantic/meaning-based *per sentence* than word-based.**  This is *not* a "Ctrl+F"-type search engine: it tries to match query meaning to sentence meaning.  That's both a good thing and a bad thing: it can find similar terms/concepts at a sentence level quite easily but doesn't try to find exact word matches.  You can easily do that yourself in a web browser.
* **English-only (for now).** The present approach relies on USPTO systems and assumes English language specifications for OCR, lemmatization, and the like.  Other languages cause unexpected behavior.
* **Score thresholds.**  In short, they're arbitrary.  Good matches usually appear at a re-ranker score of about 0.4 or higher, with the highest scores often going to explicit definitional sentences.  In contrast, there's a lot of gray area between 0.2 and 0.4.
* **OCR.** Older specifications, especially those that were poorly scanned or that use unusual formatting, tend to produce odd text for processing.  Those errors carry through sentence splitting and embedding, which means worse search results.
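
The score bands described in the list above can be sketched as a tiny helper. This is only an illustration: `support_band` and its labels are hypothetical names that do not appear in the notebook, and the 0.2 and 0.4 cut-offs are the hand-picked values mentioned above, not calibrated probabilities.

```python
def support_band(score: float, *, weak: float = 0.2, strong: float = 0.4) -> str:
    """Map a score to a rough, hand-tuned band (hypothetical helper)."""
    if score >= strong:
        return "likely support"
    if score >= weak:
        return "gray area"
    return "unlikely support"

# Made-up scores, purely for illustration
for s in (0.55, 0.31, 0.08):
    print(f"{s:.2f} -> {support_band(s)}")
```

Anything in the "gray area" band probably deserves a manual read of the surrounding paragraph rather than a yes/no conclusion.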

**THIS SHOULD NOT BE USED FOR REAL LEGAL WORK.**  This is experimental only and is not intended for "live" use.  It is not legal advice and should not be construed as such.  It is largely programmed in my free time and is full of limitations and errors.  

In [1]:
# We have to define the patent to search early, as this code
# calculates embeddings early.  You can, after all cells have run,
# run queries freely by editing only the LAST cell of this notebook.
patent_number_to_search = None
app_number_to_search = "19306812"

In [2]:
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# USE YOUR OWN USPTO API KEY HERE - DON'T SHARE
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
API_KEY = "ENTER YOUR OWN"
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

In [3]:
## ----------------------------------------------------------
## IMPORTS
## ----------------------------------------------------------

# Generic imports/IO
import re
from pathlib import Path
from html import escape
import numpy as np
from IPython.display import display, HTML
import requests

# NLP Models for Tokenization, Embeddings, Etc.
import torch
import spacy
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi

# OCR Stuff
from pdf2image import convert_from_path
import pytesseract

In [4]:
## ----------------------------------------------------------
## USPTO FILE RETRIEVAL FOR RETRIEVAL OF SPEC
## ----------------------------------------------------------

# Base USPTO API URL
BASE_URL = "https://api.uspto.gov/api/v1"

# Session instantiation
def _session(API_KEY: str | None):
    s = requests.Session()
    s.headers.update({"Accept": "application/json"})
    if API_KEY:
        s.headers.update({"x-api-key": API_KEY})
    return s

# Clean up digits, just in case
def _clean_digits(s):
    return re.sub(r"\D+", "", s or "")

# Find the first key in the JSON
def _find_first_key(obj, keys):
    if isinstance(obj, dict):
        for k in keys:
            if k in obj and obj[k]:
                return obj[k]
        for v in obj.values():
            hit = _find_first_key(v, keys)
            if hit is not None:
                return hit
    elif isinstance(obj, list):
        for it in obj:
            hit = _find_first_key(it, keys)
            if hit is not None:
                return hit
    return None

# Resolve the application number.  An application number is only validated locally;
# a patent number is looked up through the USPTO search API.
def resolve_application_number(*, patent_number, application_number, API_KEY):

    # If we have an app number already, bail unless it's formatted wrong
    # Also bail if we get nothing to work with.
    if application_number:
        app_no = _clean_digits(application_number)
        if len(app_no) < 8:
            raise RuntimeError(f"Application number looks wrong: {app_no!r}")
        return app_no
    if not patent_number:
        raise ValueError("ERROR: Need patent or application number.")

    # Instantiate a session with the API key
    s = _session(API_KEY)

    # Only a patent number can reach this point (application numbers return above),
    # so query by patent number.
    params = {"limit": 1, "offset": 0, "patentNumberQ": _clean_digits(patent_number)}

    # Define the search endpoint
    url = f"{BASE_URL}/patent/applications/search"
    r = s.get(url, params=params, timeout=60)
    r.raise_for_status()
    data = r.json()

    # Try common field spellings first; fall back to best-effort scan.
    app_no = _find_first_key(data, ["applicationNumberText", "applicationNumber", "application_number"])
    if not app_no:
        raise RuntimeError("ERROR: Couldn't locate application number in search response.")

    # Otherwise, try to clean up, but complain if there's an issue.
    app_no = _clean_digits(str(app_no))
    if len(app_no) < 8:
        raise RuntimeError(f"Resolved application number looks wrong: {app_no!r}")

    return app_no

# List documents for a particular application number
def list_documents(application_number, *, API_KEY):

    # Instantiate the session and the application of interest
    s = _session(API_KEY)
    application_number = _clean_digits(application_number)

    # Establish our endpoint
    url = f"{BASE_URL}/patent/applications/{application_number}/documents"
    r = s.get(url, timeout=60)
    r.raise_for_status()
    data = r.json()

    # Try to grab what we can, essentially just hunt for the documents
    docs = (
        _find_first_key(data, ["documents", "documentBag", "document_bag"])
        or _find_first_key(data, ["results", "items"])
        or []
    )
    if not isinstance(docs, list):
        docs = []
    return docs

# Function to pick specification document amongst various documents
def pick_spec_doc(docs, which="earliest"):
    spec_docs = [d for d in docs if str(d.get("documentCode", "")).upper() == "SPEC"]
    if not spec_docs:
        raise RuntimeError("No SPEC documents found in docs list.")

    def date_key(d):
        # ISO-ish string sorts chronologically; keep as string for simplicity
        return str(d.get("officialDate") or "")

    if which == "latest":
        return max(spec_docs, key=date_key)
    elif which == "earliest":
        return min(spec_docs, key=date_key)
    else:
        raise ValueError("which must be 'earliest' or 'latest'")

# Function to identify PDF to download from USPTO records
def get_pdf_download_url(doc):
    bag = doc.get("downloadOptionBag") or []
    for opt in bag:
        if str(opt.get("mimeTypeIdentifier", "")).upper() == "PDF" and opt.get("downloadUrl"):
            return opt["downloadUrl"]
    for opt in bag:
        if opt.get("downloadUrl"):
            return opt["downloadUrl"]
    raise RuntimeError("No downloadUrl found for this document.")

# Function to download USPTO file
def download_file(url, out_path, *, api_key=None):
    s = requests.Session()
    if api_key:
        s.headers.update({"x-api-key": api_key})
    with s.get(url, stream=True, timeout=120) as r:
        r.raise_for_status()
        out_path = Path(out_path)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with out_path.open("wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 256):
                if chunk:
                    f.write(chunk)
    return out_path

# Function to download the SPECIFICATION itself - grabs pat/app number, resolves app
# if patent number provided, figures out docs in file wrapper, guesses spec, then downloads it.
def download_specification(*, patent_number = None, application_number = None, API_KEY, out_dir = "uspto_pfw_downloads"):
    
    # Resolve the application number
    app_no = resolve_application_number(
        patent_number=patent_number,
        application_number=application_number,
        API_KEY=API_KEY,
    )

    # Grab the documents, and make a best-effort guess at the specification document
    docs = list_documents(app_no, API_KEY=API_KEY)
    spec_doc = pick_spec_doc(docs, which="earliest")

    # Get the URL of the best-guess specification document
    url = get_pdf_download_url(spec_doc)

    # Go ahead and download the file into out_dir
    out_path = Path(out_dir) / f"{app_no}_SPEC_{spec_doc['documentIdentifier']}.pdf"
    out = download_file(url, out_path, api_key=API_KEY)
    
    return out

# TEST
local_specification_path = download_specification(patent_number=patent_number_to_search,application_number=app_number_to_search, API_KEY=API_KEY)
print("Saved:", local_specification_path)

Saved: uspto_pfw_downloads/19306812_SPEC_MELZJWYJX228X66.pdf
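
The "earliest SPEC" selection used above can be tried without touching the USPTO API. The following is a minimal sketch on invented records: `pick_spec` is a hypothetical stand-in for `pick_spec_doc`, the field names mirror the ones read in the cell above, and the dates and identifiers are made up. It assumes the date strings share one ISO-like format, since they are compared as plain strings.

```python
def pick_spec(docs, which="earliest"):
    """Pick the earliest or latest SPEC record by its date string."""
    spec = [d for d in docs if str(d.get("documentCode", "")).upper() == "SPEC"]
    if not spec:
        raise RuntimeError("No SPEC documents found.")
    chooser = min if which == "earliest" else max
    # Missing dates become "", which sorts before every real date string.
    return chooser(spec, key=lambda d: str(d.get("officialDate") or ""))

# Invented file-wrapper records
docs = [
    {"documentCode": "SPEC", "officialDate": "2021-03-02T00:00:00", "documentIdentifier": "B"},
    {"documentCode": "CLM", "officialDate": "2019-06-01T00:00:00", "documentIdentifier": "X"},
    {"documentCode": "SPEC", "officialDate": "2020-01-15T00:00:00", "documentIdentifier": "A"},
]
print(pick_spec(docs)["documentIdentifier"])            # A
print(pick_spec(docs, "latest")["documentIdentifier"])  # B

# A record with no date wins "earliest" because "" sorts first
undated = {"documentCode": "spec", "officialDate": None, "documentIdentifier": "C"}
print(pick_spec(docs + [undated])["documentIdentifier"])  # C
```

The last case is worth keeping in mind: if a file wrapper ever contains an undated SPEC entry, string-based sorting would treat it as the original filing.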


In [34]:
%%time
## ----------------------------------------------------------
## OCR FUNCTIONALITY FOR SPECIFICATION PLAIN TEXT GENERATION
## ----------------------------------------------------------
# Function to "unwrap" (that is, remove excessive newlines from) patent text.
# Lots of temporary hacks here to handle issues like new page breaks.
def unwrap_ocr_text(s):

    WS = re.compile(r"[ \t]+")
    PARA_NO = re.compile(r"\[\s*(\d{3,5})\s*\]")   # Bracketed paragraph numbers, e.g. [0012]
    
    if not s:
        return ""

    # Simplistic newline clean-ups: normalize line endings, then re-join
    # words hyphenated across a line break
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    s = re.sub(r"(\w)-\n(\w)", r"\1\2", s)

    # Collapse runs of spaces/tabs
    s = WS.sub(" ", s)

    # Protect page breaks so they cannot create blank lines
    s = re.sub(r"\s*<<<PAGE_BREAK>>>\s*", " __PB__ ", s)

    # Paragraph numbers become boundaries
    s = PARA_NO.sub(lambda m: f"\n\n[{m.group(1)}] ", s)

    # Now split on blank lines
    blocks = re.split(r"\n\s*\n+", s)

    # Iterate through the blocks and clean up further
    cleaned_blocks = []
    for blk in blocks:
        blk = blk.strip()
        if not blk:
            continue

        # Now unwrap, clean up page breaks, and strip addendum
        blk = re.sub(r"\s*\n\s*", " ", blk)
        blk = blk.replace("__PB__", " ")
        blk = re.sub(r"\s{2,}", " ", blk).strip()
        cleaned_blocks.append(blk)

    # Re-join blocks with a single blank line between paragraphs
    return "\n\n".join(cleaned_blocks)

# Function to OCR retrieved USPTO PDF
def ocr_specification(pdf_path, *, dpi=300, crop=(0.07, 0.08, 0.07, 0.08)):

    # Definition of OCR-identified page breaks
    PAGE_BREAK = "\n<<<PAGE_BREAK>>>\n"
    
    # Identify pages
    pages = convert_from_path(pdf_path, dpi=dpi)
    text_pages = []

    # For each page...
    for img in pages:

        # Perform a lazy crop to get rid of headers within reason
        w, h = img.size
        l = int(w * crop[0])
        t = int(h * crop[1])
        r = int(w * (1 - crop[2]))
        b = int(h * (1 - crop[3]))
        img = img.crop((l, t, r, b))
        
        # Convert to a string using pytesseract
        txt = pytesseract.image_to_string(
            img,
            lang="eng",
            config="--oem 1 --psm 6"
        )
        text_pages.append(txt)

    # Merge output
    full_text = PAGE_BREAK.join(text_pages)
    
    # Output the totality of the pages
    return unwrap_ocr_text(full_text)

patent_full_text = ocr_specification(local_specification_path)
print("Specification successfully OCRed.")

Specification successfully OCRed.
CPU times: user 4.09 s, sys: 2 s, total: 6.09 s
Wall time: 40.1 s
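
To make the unwrapping step concrete, here is a simplified sketch of the same idea on a hand-written OCR fragment. It is not the cell's exact code: `unwrap` is a hypothetical, condensed variant, and the sample text is invented.

```python
import re

# Invented OCR output: a hyphenated line break, a page break, and two paragraph numbers
raw = (
    "[0012] The appa-\n"
    "ratus includes a brace\n"
    "<<<PAGE_BREAK>>>\n"
    "coupled to a frame. [0013] In some\n"
    "embodiments, the brace is rigid."
)

def unwrap(text: str) -> str:
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)                # re-join hyphenated line breaks
    text = re.sub(r"\s*<<<PAGE_BREAK>>>\s*", " ", text)         # page breaks never split a paragraph
    text = re.sub(r"\[\s*(\d{3,5})\s*\]", r"\n\n[\1] ", text)   # paragraph numbers start new blocks
    blocks = [re.sub(r"\s+", " ", b).strip() for b in re.split(r"\n\s*\n+", text)]
    return "\n\n".join(b for b in blocks if b)

result = unwrap(raw)
print(result)
```

One side effect to be aware of: the de-hyphenation rule also fuses genuinely hyphenated terms that happen to wrap at a line end (for example, "non-" followed by "volatile" becomes "nonvolatile").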


In [35]:
## ----------------------------------------------------------
## TEXT SPLITTING FUNCTIONALITY
## ----------------------------------------------------------

# Splitting rules
PARA_SPLIT = re.compile(r"\n\s*\n+")       # Paragraphs are separated by one or more blank lines
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")     # Tokens are contiguous ASCII alphanumeric runs; everything else is a separator

# Helper abbreviations ("FIG.," "U.S.") that commonly trip up sentence-level splitting in spaCy
ABBR_END = re.compile(
    r"(?:\bFIGS?\.|\bFIG(?:URE)?S?\.|\bU\.?\s*S\.?|\bU\.?\s*K\.?|\bE\.?\s*U\.?|\bNO\.|\bNOS\.|\bPAT\.|\bPCT\.|\bAPP(?:L)?\.|\bAPPLN\.|\bPUB(?:L)?\.|\bINC\.|\bCO\.|\bCORP\.|\bLTD\.|\bLLC\.|\bDR\.|\bMR\.|\bMRS\.|\bMS\.|\bet\s+al\.|\betc\.)\s*$",
    re.IGNORECASE
)
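# Only the keyword alternation is case-insensitive, so [a-z] still means a lowercase start
# and [A-Z]\d+ still means a reference like "A1".
CONTINUATION_START = re.compile(
    r"^\s*((?i:and|or|wherein|whereby|that|which|who|whose|including|comprising)\b|[a-z]|\d|\([0-9A-Za-z]+\)|[A-Z]\d+|[,:;])"
)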

# Simplistic function for splitting paragraphs based on PARA_SPLIT
def split_paragraphs(text):
    return [p.strip() for p in PARA_SPLIT.split(text or "") if p.strip()]

# Use spaCy for sentence-level splitting.
# Note that we could load a blank spaCy instance just for sentence splitting,
# but this model does double duty for lemmatization as well, so we want
# the whole thing in-hand for later. We could arguably go through and jettison some
# features, like NER, but I keep 'em here because I secretly suspect I could use them later
# for some clever tricks.
_nlp = spacy.load("en_core_web_sm")

# Function to split sentences using spaCy
def split_sentences(paragraph):

    # Grab the paragraphs, and generally try to normalize whitespace, tabs, etc.
    paragraph = re.sub(r"[ \t\r\f\v]+", " ", (paragraph or "").strip())
    paragraph = re.sub(r"\s*\n\s*", " ", paragraph)

    # Perform sentence-level splitting with SpaCy
    doc = _nlp(paragraph)
    sents = [s.text.strip() for s in doc.sents if s.text.strip()]

    # Merge false sentence breaks after abbreviations like "FIGS." using
    # the rules/definitions defined above.  This essentially merges the corresponding
    # content into another sentence, avoiding small sentences (which tend to overmatch anyway).
    merged = []
    i = 0
    while i < len(sents):
        cur = sents[i]
        while (
            i + 1 < len(sents)
            and ABBR_END.search(cur)
            and CONTINUATION_START.search(sents[i + 1])
        ):
            cur = f"{cur} {sents[i + 1]}".strip()
            i += 1
        merged.append(cur)
        i += 1

    return merged

# BM25-level tokenization using TOKEN_RE rules
def bm25_tokenize(text):
    return TOKEN_RE.findall((text or "").lower())

# Function to return the INDICES of the top-k scores (best first) WITHOUT sorting the full array. Marginally faster.
def topk_indices(scores: np.ndarray, k: int):
    if k <= 0:
        return np.array([], dtype=int)
    else:
        k = min(k, scores.shape[0])
        idx = np.argpartition(scores, -k)[-k:]
        return idx[np.argsort(scores[idx])[::-1]]
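
The abbreviation-merging loop is easier to follow on pre-split input. Below is a minimal sketch that skips spaCy entirely and feeds in sentence fragments by hand; `ABBR`, `CONT`, and `merge_false_breaks` are simplified, hypothetical stand-ins for `ABBR_END`, `CONTINUATION_START`, and the merge logic in `split_sentences`.

```python
import re

# Simplified stand-ins: a few abbreviations, and "continuation" starts
# (lowercase letter, digit, or a reference such as "A1")
ABBR = re.compile(r"(?:\bFIGS?\.|\bNOS?\.|\bU\.S\.|\bet\s+al\.)\s*$", re.IGNORECASE)
CONT = re.compile(r"^\s*(?:[a-z]|\d|[A-Z]\d+)")

def merge_false_breaks(sents):
    merged, i = [], 0
    while i < len(sents):
        cur = sents[i]
        # Keep absorbing the next fragment while the current one ends in an
        # abbreviation and the next one looks like a continuation
        while i + 1 < len(sents) and ABBR.search(cur) and CONT.search(sents[i + 1]):
            cur = f"{cur} {sents[i + 1]}"
            i += 1
        merged.append(cur)
        i += 1
    return merged

# Invented fragments, as a sentence splitter might produce them
pieces = ["As shown in FIG.", "3, the brace is coupled to the frame.", "The frame is rigid."]
merged = merge_false_breaks(pieces)
print(merged)
```

Both conditions must hold before a merge happens, so a sentence that merely ends with an abbreviation is left alone when the next one starts like a fresh sentence.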

In [36]:
## ----------------------------------------------------------
## EMBEDDING MODELS
## ----------------------------------------------------------

# Use NVIDIA CUDA cores if available, much of this process is painfully slow even
# when they are used.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Define our embedder (BGE-M3) and BGE-based re-ranker.  Both are provided by BAAI and
# are, generally, some of the best-in-class for LOCAL performance of such tasks, at least
# in this domain and when constrained to English.
embedder = SentenceTransformer("BAAI/bge-m3", device=device)
reranker = CrossEncoder("BAAI/bge-reranker-large", device=device)

# Embedding functionality, uses the embedder (here, BGE-M3) to encode 
# while standardizing conversion parameters.
def dense_embed_texts(texts: list[str], batch_size: int = 32, normalize: bool = True) -> np.ndarray:
    return embedder.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=normalize,
    ).astype(np.float32)

# Reranking function, uses the BGE-Reranker-Large model to re-rank the scores.
def rerank(query, candidates, batch_size = 16):
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs, batch_size=batch_size)
    return np.asarray(scores, dtype=np.float32)

# BM25 retrieval function, uses the BM25 lexical scoring to get top-k results
def bm25_retrieve(query, bm25, *, k=300):
    q_tokens = bm25_tokenize(query)
    scores = np.asarray(bm25.get_scores(q_tokens), dtype=np.float32)
    idx = topk_indices(scores, k)
    return [(int(i), float(scores[i])) for i in idx]

# DENSE retrieval function, uses the DENSE embeddings to get top-k results
def dense_retrieve(query, sent_emb, *, k=300):
    q = dense_embed_texts([query], batch_size=1, normalize=True)[0]
    scores = (sent_emb @ q).astype(np.float32)
    idx = topk_indices(scores, k)
    return [(int(i), float(scores[i])) for i in idx]

# FUSION function for BM25 + DENSE: weighted reciprocal rank fusion (RRF).  Each hit
# contributes weight / (k_rrf + rank), so only ranks (not raw scores) matter.
def rrf_fuse(bm25_hits, dense_hits, *, k_rrf = 60, w_bm25 = 1.0, w_dense = 1.0, top_n = 200):

    # Grab the rankings in an easily comparable way through enumeration
    bm25_rank = {idx: r for r, (idx, _) in enumerate(bm25_hits, start=1)}
    dense_rank = {idx: r for r, (idx, _) in enumerate(dense_hits, start=1)}

    # For each, fuse based on our weight (w_bm25 for BM25, w_dense for Dense).
    fused = {}
    for idx in (set(bm25_rank) | set(dense_rank)):
        s = 0.0
        if idx in bm25_rank:
            s += w_bm25 * (1.0 / (k_rrf + bm25_rank[idx]))
        if idx in dense_rank:
            s += w_dense * (1.0 / (k_rrf + dense_rank[idx]))
        fused[idx] = s

    # Generate and output the top-n results of the fused list.
    items = list(fused.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:top_n]
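
A small worked example may help show what reciprocal rank fusion does with two rankings. This is a sketch under stated assumptions: `rrf` is a hypothetical, condensed version of `rrf_fuse` that takes plain index lists, and the two rankings are invented. Each list contributes `w / (k_rrf + rank)` for every index it contains.

```python
def rrf(rankings, k_rrf=60, weights=None):
    """Fuse best-first index lists by weighted reciprocal rank."""
    weights = weights or [1.0] * len(rankings)
    fused = {}
    for w, ranking in zip(weights, rankings):
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] = fused.get(idx, 0.0) + w / (k_rrf + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

bm25_order = [7, 2, 9]    # invented BM25 ranking (sentence indices, best first)
dense_order = [2, 5, 7]   # invented dense ranking

fused = rrf([bm25_order, dense_order])
for idx, score in fused:
    print(idx, round(score, 5))
# Sentence 2 (ranks 2 and 1) edges out sentence 7 (ranks 1 and 3):
#   1/62 + 1/61 > 1/61 + 1/63
```

Note how small these values are: with unit weights and `k_rrf=60`, no fused score can exceed 2/61 (about 0.033). The fused score is therefore only useful for ordering candidates before re-ranking, not as a match-quality number.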


In [37]:
%%time
## ----------------------------------------------------------
## BUILD PARA/SENTENCE INVENTORY
## ----------------------------------------------------------

# Grab the specification text, split the paragraphs using our rough splitting definition
# that largely relies on newlines.
paragraphs = split_paragraphs(patent_full_text)

# Set up empty holders for sentences, metadata, paragraph-to-sentence identifiers (since a sentence might
# "pop" as relevant and cause us to recommend citation of the whole paragraph).
sentences: list[str] = []
meta: list[tuple[int, int]] = []
para_to_sentence_idxs: list[list[int]] = [[] for _ in range(len(paragraphs))]

# For each paragraph...
for pi, p in enumerate(paragraphs):
    # Split the paragraph into discrete sentences
    sents = split_sentences(p)

    # For each sentence...
    for si, s in enumerate(sents):

        # Grab the sentence, append the sentence to our keeper, and log metadata appropriately
        gi = len(sentences)
        sentences.append(s)
        meta.append((pi, si))
        para_to_sentence_idxs[pi].append(gi)

# Output basic stats
print(f"{len(paragraphs)} paragraphs, {len(sentences)} sentences observed")

125 paragraphs, 458 sentences observed
CPU times: user 1.58 s, sys: 5.68 ms, total: 1.59 s
Wall time: 1.59 s


In [38]:
%%time
## ----------------------------------------------------------
## INDEXING AND FUSION FUNCTIONALITY 
## PART 1: BM25
## ----------------------------------------------------------

# Perform BM25-style tokenization
bm25_corpus = [bm25_tokenize(s) for s in sentences]
bm25 = BM25Okapi(bm25_corpus)
print("BM25 Complete.")

BM25 Complete.
CPU times: user 4.29 ms, sys: 83 μs, total: 4.37 ms
Wall time: 4.31 ms
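
Because BM25 only sees the tokens it is given, it can be useful to look at what the tokenizer actually produces. The sketch below re-declares the same pattern used by `bm25_tokenize` so that it runs on its own; `tokenize` is a hypothetical name and the sample text is invented.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    # Lowercase, then keep contiguous ASCII alphanumeric runs
    return TOKEN_RE.findall((text or "").lower())

tokens = tokenize("FIG. 3A shows a non-volatile memory (e.g., NAND).")
print(tokens)
# ['fig', '3a', 'shows', 'a', 'non', 'volatile', 'memory', 'e', 'g', 'nand']
```

Two consequences follow from this scheme: hyphenated terms are split into separate tokens, and there is no stemming, so "brace" and "braces" are unrelated words as far as the lexical index is concerned. The dense embeddings are what cover that gap.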


In [39]:
%%time
## ----------------------------------------------------------
## INDEXING AND FUSION FUNCTIONALITY 
## PART 2: DENSE
## ----------------------------------------------------------

# Compute DENSE embeddings for every sentence
sent_emb = dense_embed_texts(sentences, batch_size=256, normalize=True)
print("Dense Complete.")

Dense Complete.
CPU times: user 3.6 s, sys: 34.8 ms, total: 3.63 s
Wall time: 3.56 s
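
Since the sentence embeddings are normalized, a plain dot product against a normalized query vector is the cosine similarity, which is what `dense_retrieve` relies on. The following sketch shows that equivalence and the partial-sort top-k trick on tiny hand-written vectors; the vectors are invented and three-dimensional purely for readability.

```python
import numpy as np

# Four invented unit-length "sentence embeddings"
emb = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.6, 0.8, 0.0],
     [0.0, 0.0, 1.0]],
    dtype=np.float32,
)
q = np.array([0.8, 0.6, 0.0], dtype=np.float32)  # unit-length "query embedding"

# Dot product of unit vectors == cosine similarity
scores = emb @ q
cosine = scores / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))

# Top-k without sorting everything: partition first, then sort only the k winners
k = 2
top = np.argpartition(scores, -k)[-k:]
top = top[np.argsort(scores[top])[::-1]]
print(top.tolist(), scores[top].round(2).tolist())
```

Here row 2 wins with a similarity of 0.96, followed by row 0 at 0.8. If the embeddings were not normalized, the dot product would also reward long vectors, and the shortcut would no longer equal cosine similarity.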


In [40]:
## -----------------------------
## SEARCH FUNCTIONALITY
## -----------------------------

# Function to conduct a search in a manner that ultimately
# leverages BOTH BM25 and Dense embeddings.
def search(query, *, bm25, sent_emb, sentences, meta, bm25_k = 300, dense_k = 300, fused_top_n = 200, rerank_top_n = 50, rerank_batch_size = 16):

    # Retrieve the BM25 and Dense hits based on the query
    bm25_hits = bm25_retrieve(query, bm25, k=bm25_k)
    dense_hits = dense_retrieve(query, sent_emb, k=dense_k)

    # Fuse the results into a single set of query hits, organize 'em
    fused = rrf_fuse(bm25_hits, dense_hits, top_n=fused_top_n)
    cand_idx = [idx for idx, _ in fused]
    cand_text = [sentences[i] for i in cand_idx]

    # Use the re-ranker to re-rank the FUSED results, providing additional accuracy
    rr_scores = rerank(query, cand_text, batch_size=rerank_batch_size)
    order = topk_indices(rr_scores, rerank_top_n)

    # For each hit, build a nice hits list with detail, including the paragraph ID,
    # sentence ID, and the reranker score (note: not the BM25/dense scores).
    hits = []
    for j in order:
        sidx = cand_idx[int(j)]
        pi, si = meta[sidx]
        hits.append(
            {
                "paragraph_id": pi,
                "sentence_id": si,
                "score": float(rr_scores[int(j)]),
                "sentence": sentences[sidx],
                "sentence_global_idx": sidx,
            }
        )
    return hits
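
One detail in `search` that is easy to trip over is that two index spaces are in play: the re-ranker scores are positions within the candidate list, while `meta` and `sentences` use global sentence indices. The sketch below illustrates the mapping with made-up numbers; no model is loaded, and the scores are invented.

```python
import numpy as np

sentences = ["s0", "s1", "s2", "s3", "s4"]
cand_idx = [4, 1, 3]                       # global sentence indices that survived fusion
rr_scores = np.array([0.10, 0.72, 0.35])   # invented re-ranker scores, one per candidate

# 'order' holds positions within cand_idx (best first), NOT global indices
order = np.argsort(rr_scores)[::-1]

# Map each position back to its global sentence index before looking anything up
ranked = [(cand_idx[j], float(rr_scores[j])) for j in order]
print(ranked)
print([sentences[g] for g, _ in ranked])
```

Indexing `sentences` directly with `order` would silently return the wrong text, which is why the cell above goes through `cand_idx` first.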

In [48]:
## ----------------------------------------------------------
## DISPLAY FUNCTIONALITY
## Mostly nice beautification, highlighting, etc.
## ----------------------------------------------------------

# Words to basically ignore when grabbing query terms to highlight, these
# have virtually no probative value and, if used in a query, could cause
# wild over-highlighting
DEFAULT_STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "with",
    "on", "by", "at", "from", "as", "is", "are", "be", "being", "been",
    "using", "use", "used", "based"
}

# Extracts query terms from a query, largely just for highlighting
def extract_query_terms(query, min_len = 3):

    # If we have no query, do nothing
    if not query:
        return []

    # Tokenize (roughly) the query
    tokens = re.findall(r"\b\w+\b", query.lower())

    # Determine terms as long as they are over a sufficient length and
    # aren't a stopword
    terms = [t for t in tokens if len(t) >= min_len and t not in DEFAULT_STOPWORDS]

    # Build a list of remaining terms and output the same
    seen, out = set(), []
    for t in terms:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# Highlights HTML in a nice way for output
def highlight_html(text):
    return (
        "<span style='background-color:#fff3cd;"
        "border-bottom:2px solid #f0ad4e;padding:0.1em 0.25em;font-weight:400;'>"
        f"{text}</span>"
    )

# Underlines HTML in a nice way for output
def underline_html(text):
    return (
        "<span style='text-decoration: underline;"
        "text-decoration-thickness: 1px;text-decoration-color: #888;'>"
        f"{text}</span>"
    )


# Function to bold verbatim(-ish - uses lemmas) words
def bold_lemmas_in_sentence_html(sentence, query_lemmas, *, nlp, stopwords):

    # Use the NLP model to process the sentence anew
    tokenized_sentence = nlp(sentence)
    out = []
    last = 0

    # For each token...
    for tok in tokenized_sentence:
        
        # Include any text (e.g., spaces, punctuation) after previous token
        if tok.idx > last:
            out.append(escape(sentence[last:tok.idx]))

        # Identify the token text
        token_text = sentence[tok.idx : tok.idx + len(tok.text)]
        token_html = escape(token_text)

        # Decide whether to bold this token
        if (
            not tok.is_space
            and not tok.is_punct
            and tok.lower_ not in stopwords
            and tok.lemma_.lower() in query_lemmas
        ):
            token_html = f"<strong>{token_html}</strong>"

        out.append(token_html)
        last = tok.idx + len(tok.text)

    # Trailing text after the last token
    if last < len(sentence):
        out.append(escape(sentence[last:]))

    return "".join(out)

# Function to compute query lemmas using SpaCy, results in a small speed increase
# relative to doing so repeatedly
def compute_query_lemmas(query_terms, nlp):
    query_lemmas = set()
    for term in query_terms:
        query_lemmas.update(
            tok.lemma_.lower()
            for tok in nlp(term.lower())
            if not tok.is_punct and not tok.is_space
        )
    return query_lemmas

# Function to use display systems to identify support 
def display_sentence_support(
    sentence_hits,
    query,
    *,
    sentences,
    para_to_sentence_idxs,
    extract_query_terms_fn,
    highlight_html_fn,
    underline_html_fn,
    nlp,
    stopwords,
    min_score=-1e9,
    strong_score=0.4,
    max_results=5,
):

    # Grab the query terms (ignoring short terms and stopwords) and lemmatize
    query_terms = extract_query_terms_fn(query)
    query_lemmas = compute_query_lemmas(query_terms, nlp)
    shown = 0

    # For each of the sentence hits...
    for h in sentence_hits:

        # Hits arrive sorted by descending score, so stop at the first one below min_score
        if h["score"] < min_score:
            break

        # If the score IS high enough, then grab the paragraph identifier(s)
        pi = h["paragraph_id"]
        target_gi = h["sentence_global_idx"]
        
        # Once we've identified it, safely clean up the relevant text, determine
        # whether the sentence contains any of the terms (lemmatized), and highlight 
        # high-scoring sentences
        rendered = []
        for gi in para_to_sentence_idxs[pi]:
            if gi == target_gi:
                # HTML with lemma-based bolding. Note that in some (intentional!) circumstances
                # we could bold but later not highlight a sentence - this might be where
                # the word is present but the sentence as a whole is not particularly related to
                # defining that word, so it's there but probably discussing something orthogonal.
                safe_html = bold_lemmas_in_sentence_html(
                    sentences[gi],
                    query_lemmas,
                    nlp=nlp,
                    stopwords=stopwords,
                )

                # Now, if the score is high enough, highlight it. Otherwise, underline.
                if h["score"] >= strong_score:
                    rendered.append(highlight_html_fn(safe_html))
                else:
                    rendered.append(underline_html_fn(safe_html))
            else:
                # Non-relevant sentences remain plain (escaped) text
                rendered.append(escape(sentences[gi]))

        # Output the score (here, the re-ranker score) in a nice header, just for diagnostics
        header = f"Score {h['score']:.3f} — chunk {pi}"
        display(HTML(f"<h4>{escape(header)}</h4>"))

        # Output a pretty version of the relevant paragraph
        display(HTML(
            "<div style='margin-left:2em;padding-left:0.75em;"
            "border-left:3px solid #ddd;line-height:1.6;font-size:0.95em;'>"
            f"{' '.join(rendered)}</div><hr>"
        ))

        # Continue to count and output results until max_results
        shown += 1
        if shown >= max_results:
            break
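
The bolding routine interleaves escaped "gap" text with escaped tokens so that OCR'd characters such as `<` or `&` cannot break the rendered HTML. Here is a stripped-down sketch of that pattern. It deliberately swaps in a regular expression for spaCy and exact lowercase matching for lemma matching, so `bold_terms` is a hypothetical simplification rather than the notebook's method.

```python
import re
from html import escape

def bold_terms(sentence, terms):
    out, last = [], 0
    for m in re.finditer(r"\w+", sentence):
        # Escape whatever sits between the previous token and this one
        out.append(escape(sentence[last:m.start()]))
        word = escape(m.group())
        out.append(f"<strong>{word}</strong>" if m.group().lower() in terms else word)
        last = m.end()
    # Trailing text after the last token
    out.append(escape(sentence[last:]))
    return "".join(out)

html_out = bold_terms("The brace <12> supports A & B.", {"brace"})
print(html_out)
# The <strong>brace</strong> &lt;12&gt; supports A &amp; B.

# Exact matching misses inflected forms, which is the gap lemmatization fills
print(bold_terms("Two braces.", {"brace"}))
# Two braces.
```

The second call shows why the notebook compares lemmas instead of surface forms: without lemmatization, a query for "brace" would never bold "braces".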


In [49]:
%%time
## ----------------------------------------------------------
## EXAMPLE RUN(S)
## ----------------------------------------------------------
query = "how braces work"
hits = search(
    query,
    bm25=bm25,
    sent_emb=sent_emb,
    sentences=sentences,
    meta=meta,
    bm25_k=400,
    dense_k=400,
    fused_top_n=250,
    rerank_top_n=30,
)
display_sentence_support(
    hits,
    query,
    sentences=sentences,
    para_to_sentence_idxs=para_to_sentence_idxs,
    extract_query_terms_fn=extract_query_terms,
    highlight_html_fn=highlight_html,
    underline_html_fn=underline_html,
    nlp=_nlp,
    stopwords=DEFAULT_STOPWORDS,
    min_score=0.20,
    strong_score=0.4,
    max_results=10,
)

CPU times: user 2.4 s, sys: 144 ms, total: 2.54 s
Wall time: 1.33 s
