# NER Powered Semantic Search

This notebook shows how to use Named Entity Recognition (NER) for vector search with LanceDB. We will:

1. Extract named entities from text.
2. Store them in a LanceDB as metadata (alongside respective text vectors).
3. We extract named entities from incoming queries and use them to filter and search only through records containing these named entities.

This is particularly helpful if you want to restrict the search score to records that contain information about the named entities that are also found within the query.

In [1]:
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
from IPython.display import Markdown, display, HTML

os.chdir(os.path.dirname(os.getcwd()))

df = pd.read_parquet('./data/splade.parquet')
print(f"df shape: {df.shape}")
df.head(1)

df shape: (1987, 24)


Unnamed: 0,index,id,citation,name,name_abbreviation,decision_date,court_id,court_name,court_slug,judges,attorneys,citations,url,head,body,name_contains_lm,body_contains_lm,year,context,context_citation,context_tokens,openai_embeddings,splade_embeddings,state
0,0,411690,154 Ill. 2d 90,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",Johnson v. Halloran,2000-01-13,8837,Illinois Appellate Court,ill-app-ct,[],"['Wolter, Beeman, Lynch & McIntyre, of Springf...","[{'type': 'official', 'cite': '312 Ill. App. 3...",https://api.case.law/v1/cases/411690/,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",JUSTICE HALL\r\ndelivered the opinion of the c...,False,True,2000,The public defender of Cook County was appoint...,154 Ill. 2d 90,1317,"[-0.0017778094625100493, -0.002360282698646187...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",New Mexico


In [2]:
from src.search.auto_entity_search import MatchedEntitySearch

In [3]:
entity_search = MatchedEntitySearch(
    db_path="./.lancedb"
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[32m2024-04-14 22:39:32 - INFO - Loaded database with tables: ['context', 'ebay', 'ebay_1cb227d612ef4ad897c5739079935e91', 'ebay_8a4078e71f604855a0232d7a4807806e', 'ebay_df64db6c60e24423aba5f3990fc99521', 'ner_table2', 'ner_table3', 'ner_table4', 'test_table', 'test_table_class'][0m


In [4]:
sample_df = df.sample(5)
test_query = sample_df.sample(1)['context'].tolist()[0]

entity_search.ingest_data(
    df=sample_df,
    table_name="test_table_class",
    mode='append'
)

  0%|          | 0/1 [00:00<?, ?it/s]

In [5]:
Markdown(test_query)

On August 29, 2003, representatives of Vandalia terminated the authorized dealership agreement with Kub-ota. In September 2003, Kubota’s corporate office in Torrance, California, denied Newton’s dealership application. After learning that Newton’s dealership application had been rejected, Jacobsen contacted Newton and requested additional information to forward to Kubota. After reviewing this additional information, Kubota again denied Newton’s application. On July 27, 2004, the plaintiff filed a complaint against the defendants based, inter alia, on a claim of promissory estoppel. The complaint alleged that on or about July 25, 2003, Jacobsen, as an agent for Kubota, represented to Newton that Newton would be the local authorized dealer of Kubota products and that Vandalia had to terminate its authorized dealership agreement with Kubota before Newton could become a local authorized dealer of Kubota. In reliance on the promise of Jacobsen, Newton directed the representatives of Vandalia to terminate Vandalia’s right to sell Kubota equipment and Vandalia terminated its authorized dealership agreement with Kubota. The complaint further alleged that in breach of its July 25, 2003, promise, during the last weeks of September 2003, Kubota indicated that Newton would not be a local authorized dealer of Kubota products and removed its products from the possession of Newton. As a result, Newton suffered damages including unpaid warranty work that had been performed on behalf of Kubota. On September 29, 2005, the defendants filed a motion for a summary judgment. The defendants argued that a summary judgment should be awarded in their favor because promissory estoppel is not a proper cause of action and is available only as an affirmative defense in Illinois. On January 30, 2006, the trial court entered a written order granting a summary judgment in favor of the defendants. In its order, the trial court noted that this court has held in DeWitt v. Fleming, 357 Ill. App. 3d 571 (2005), and in ESM Development Corp. v. Dawson, 342 Ill. App. 3d 688 (2003), that promissory estoppel is not a theory under which a cause of action seeking monetary relief can be asserted and that “[i]t is the trial court’s obligation to follow existing appellate authority from the same district in which the trial court sits.” On February 14, 2006, the plaintiff filed a notice of appeal arguing that the trial court erred in granting the summary judgment in favor of the defendants. A summary judgment is appropriate where the pleadings, affidavits, depositions, admissions, and exhibits on file, when viewed in the light most favorable to the nonmoving party, reveal that there is no genuine issue on any material fact and that the moving party is entitled to a judgment as a matter of law. Busch v. Graphic Color Corp., 169 Ill. 2d 325, 333 (1996). Although a summary judgment is an efficient and useful aid in the expeditious disposition of a lawsuit, a summary judgment is a drastic remedy and should only be granted when the moving party’s right to a judgment is clear and free from doubt. Outboard Marine Corp. v. Liberty Mutual Insurance Co., 154 Ill. 2d 90, 102 (1992). A summary judgment is proper where the nonmoving party fails to establish an element of the cause of action. Pyne v. Witmer, 129 Ill. 2d 351, 358 (1989). We review de novo the trial court’s award of a summary judgment. Chatham Foot Specialists, P.C. v. Health Care Service Corp., 216 Ill. 2d 366, 376 (2005). The plaintiff contends that damages can be recovered from the defendants under a theory of promissory estoppel and moreover that allowing recovery is supported by public policy and is necessary to prevent improper business practices. In support of the plaintiff’s argument, the plaintiff urges this court to depart from its recent holding in DeWitt v. Fleming, 357 Ill. App. 3d 571 (2005), and to instead follow our dissenting colleague’s opinion in DeWitt that promissory estoppel can be used as a viable cause of action in Illinois. In response, the defendants argue that we should not depart from our recent ruling. In DeWitt, the plaintiffs filed a complaint to recover the costs of a survey performed on a parcel of land, which the defendant had orally promised to sell to the plaintiffs but subsequently refused to sell. DeWitt, 357 Ill. App. 3d 571. On appeal, the defendant argued that the trial court erred in applying the doctrine of promissory estoppel because the doctrine of promissory estoppel cannot be used to bar the application of the statute of frauds. DeWitt, 357 Ill. App. 3d 571. In DeWitt, we concluded that “in Illinois promissory estoppel is available only as a defense (i.e., as a shield), not as a cause of action (i.e., as a sword).” DeWitt, 357 Ill. App. 3d at 573. We reinforced our position by noting that we had held the same in our previous ruling in ESM Development Corp. v. Dawson, 342 Ill. App. 3d 688 (2003), where we explicitly stated that promissory estoppel “ ‘is not a proper vehicle for direct relief,’ ” “ ‘cannot properly be pled as a cause of action,’ ” “ ‘is meant to be utilized as a defensive mechanism — not as a means of attack,’ ” and “ ‘does not form the basis for a damages claim.’ ” DeWitt, 357 Ill. App. 3d at 573, quoting ESM Development Corp., 342 Ill. App. 3d at 695. Our analysis in DeWitt is worth repeating here: “Unlike our dissenting colleague, we are not convinced that the Illinois Supreme Court’s opinions in Doyle v. Holy Cross Hospital, 186 Ill. 2d 104, 110 (1999), and Quake Construction, Inc. v. American Airlines, Inc., 141 Ill. 2d 281, 287 (1990), directly contradict our holding in ESM Development Corp. In Doyle, *** the issue before the court was whether an employer could make unilateral changes to provisions in an employee handbook, in the absence of a previous reservation of the right to do so, that would operate to the disadvantage of existing employees. Doyle v. Holy Cross Hospital, 186 Ill. 2d 104, 110 (1999).

In [6]:
test_res = entity_search.search(test_query, limit=100, verbose=True)

[32m2024-04-14 22:39:39 - INFO - Found 33 matches[0m
[32m2024-04-14 22:39:39 - INFO - Extracted Named Entities: ['Vandalia, Torrance, California, Newton, Illinois, Dawson'][0m


In [7]:
test_res['named_entities'].head(10).tolist()

['Lexmark, BDT, Irvine, California',
 'Potomac, Illinois, Liberty',
 'CERCLA, State of Illinois, Commercial Union',
 'ILCS, Bokodi, Foster Wheeler Robbins, Inc, Wal - Mart Stores, Inc, Hobbs, Midwest, Rich, Life, Illinois, Light Co, American',
 'Pekin, Illinois, D & J Tarp Service, East Graham Road, Mt. Vernon',
 'ILCS, Bokodi, Foster Wheeler Robbins, Inc, Wal - Mart Stores, Inc, Hobbs, Midwest, Rich, Life, Illinois, Light Co, American',
 'Lexmark, BDT, Irvine, California',
 'Pekin, Illinois, D & J Tarp Service, East Graham Road, Mt. Vernon',
 'Lexmark, BDT, Irvine, California',
 'Infinity Insurance, Zurich, Connecticut, Michigan, U. S. Fire, New York, Illinois, United States']

In [8]:
test_res

Unnamed: 0,vector,metadata,named_entities,context,id,_distance,score,match_count
1,"[-0.02506749, -0.010197514, 0.010170661, -0.01...","{'attorneys': '['Barbara I. Michaelides, Steve...","Lexmark, BDT, Irvine, California",We agree with Essex’s second argument. While E...,1270,0.2537,0.7463,1
4,"[-0.023424773, -0.00713016, -0.008927907, -0.0...",{'attorneys': '['Anthony C. Valiulis and Wendy...,"Potomac, Illinois, Liberty","On June 23, 2000, the circuit court found that...",208,0.291183,0.708817,1
5,"[-0.021822745, -0.002486898, -0.00023750307, -...",{'attorneys': '['Philip H. Corboy and Kenneth ...,"CERCLA, State of Illinois, Commercial Union",That allegation (which is backed up by expert ...,1355,0.2924,0.7076,1
13,"[-0.035934478, 0.013343928, 0.008512983, -0.01...",{'attorneys': '['Peter C. Morse and Cynthia Ra...,"ILCS, Bokodi, Foster Wheeler Robbins, Inc, Wal...",735 ILCS 5/2— 1005 (West 2006). Although summa...,1401,0.307272,0.692728,1
16,"[-0.0070137614, 0.0002607778, 0.01342615, -0.0...","{'attorneys': '['Patricia A. Zimmer, of George...","Pekin, Illinois, D & J Tarp Service, East Grah...",", because he filed his complaint more than two...",689,0.30826,0.69174,1
17,"[-0.02141473, 0.003501525, 0.012356733, -0.019...","{'attorneys': '['John D. Chambers, of Vrdolyak...","ILCS, Bokodi, Foster Wheeler Robbins, Inc, Wal...","Plaintiff had until December 21, 2013 to file ...",1853,0.308499,0.691501,1
21,"[-0.020261064, 0.008022793, 0.004506548, -0.01...","{'attorneys': '['Leahy, Eisenberg & Fraenkel, ...","Lexmark, BDT, Irvine, California","Chatham Corp. v. Dann Insurance, 351 Ill. App....",810,0.311738,0.688262,1
22,"[-0.02187469, 0.0025084564, 0.0058898623, -0.0...",{'attorneys': '['Julie A. Boynton and Donald L...,"Pekin, Illinois, D & J Tarp Service, East Grah...",Plaintiff now appeals the trial court's granti...,2168,0.312899,0.687101,1
23,"[-0.021606533, -0.0041919625, 0.0105351955, -0...","{'attorneys': '['James M. Messineo, of Inverne...","Lexmark, BDT, Irvine, California",We consider each issue in turn. ¶ 31 I. Surviv...,2298,0.313247,0.686753,1
25,"[-0.00582983, -0.00702578, -0.0006145854, -0.0...","{'attorneys': '['David A. Shapiro, of Laser, P...","Infinity Insurance, Zurich, Connecticut, Michi...",The following facts are taken from the allegat...,510,0.315303,0.684697,1


## Initialize NER model

To extract named entities, we will use a NER model finetuned on a BERT-base model. The model can be loaded from the HuggingFace model hub as follows:

In [None]:
import torch

# set device to GPU if available
device = torch.cuda.current_device() if torch.cuda.is_available() else None

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_id = "dslim/bert-base-NER"
# model_id = "Babelscape/wikineural-multilingual-ner"

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load the NER model from huggingface
model = AutoModelForTokenClassification.from_pretrained(model_id)
# load the tokenizer and model into a NER pipeline
nlp = pipeline(
    "ner", model=model, tokenizer=tokenizer, aggregation_strategy="max", device=device
)

In [2]:
query = """
Regarding the pollution exclusion clause under the terms of comprehensive general liability (CGL) insurance, \
how has the California court defined the phrase 'sudden and accidental', in particular for polluting events? \
Also, has there been any consideration for intentional vs unintentional polluting events? What would Judge Judy say?
"""
# use the NER pipeline to extract named entities from the text
# ents = nlp(test_text)

In [None]:
def deduplicate(seq: list[str]) -> list[str]:
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


def get_entities_by_type_and_score(entities, entity_types, score_threshold):
    """
    Get entities of specific types within a certain score threshold.

    Args:
        entities (list): A list of entity dictionaries.
        entity_types (list): A list of desired entity types (e.g., ['ORG', 'PER', 'LOC']).
        score_threshold (float): The minimum score threshold for entities.

    Returns:
        list: A list of entities matching the specified types and score threshold.
    """
    entity_name_list = [
        entity['word']
        for entity in entities
        if entity['entity_group'] in entity_types and entity['score'] >= score_threshold
    ]
    
    top_entities_distinct = deduplicate(entity_name_list)
    top_entities = [item for item in top_entities_distinct if len(item) > 1]
    
    return top_entities

In [None]:
ents_list = get_entities_by_type_and_score(ents, entity_types=['ORG', 'PER', 'LOC'], score_threshold=0.99)

In [None]:
ents_list

In [None]:
import openai
import numpy as np

client = openai.OpenAI()
model_name = "text-embedding-ada-002"

def get_embedding(text: str):
    response = client.embeddings.create(input=text, model=model_name)
    return response.data[0].embedding

def encode_string(text: str) -> np.ndarray:
    embedding = get_embedding(text)
    return np.array(embedding)

## Initialize LanceDB

In [None]:
import lancedb

db = lancedb.connect("./.lancedb")

## Generate Embeddings and Insert

We generate embeddings for the title_text column we created earlier. Alongside the embeddings, we also include the named entities in the index as metadata. Later we will apply a filter based on these named entities when executing queries.

Let's first write a helper function to extract named entities from a batch of text.

In [None]:
from typing import List

def extract_named_entities(text_batch: List[str]) -> List[str]:
    # extract named entities using the NER pipeline
    extracted_batch = nlp(text_batch)
    entities = []
    # loop through the results and only select the entity names
    for text in extracted_batch:
        # ne = [entity["word"] for entity in text]
        ne = get_entities_by_type_and_score(text, entity_types=['ORG', 'LOC'], score_threshold=0.985)
        entities.append(ne)
    return entities

In [None]:
from typing import Iterable
import openai
import instructor
import nest_asyncio
from pydantic import BaseModel, Field


class ResolvedImprovedEntities(BaseModel):
    """An improved list of extracted entities."""
    
    entities: str = Field(
        ...,
        description="Accurately resolved and improved entities useful for downstream retrieval. Avoid filler words that do not add value as search input. For example, 'Liberty Mutual Insurance Company' should simply be 'Liberty Mutual'.",
    )
    
    
def resolve_entities(content) -> ResolvedImprovedEntities:
    return client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        response_model=ResolvedImprovedEntities,
        messages=[
            {
                "role": "system",
                "content": "You are a entity resolution and enhancement AI. Your task is to refine a list of extracted entities by removing, combining, or otherwise improving the text. Make every word count.",
            },
            {
                "role": "user",
                "content": content,
            },
        ],
    )  # type: ignore

In [None]:
sample_input = df.sample(20)['body'].tolist()

entity_batch = extract_named_entities(sample_input)
print(entity_batch[0])

resolved_batch = [resolve_entities(", ".join(t)) for t in entity_batch]
print(resolved_batch[0].entities)

In [None]:
processed_entities = [r.entities for r in resolved_batch]

In [None]:
processed_entities

In [3]:
from typing import List
import openai
from pydantic import BaseModel, Field
from tqdm.auto import tqdm
import warnings
import openai
import instructor
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import lancedb
from pprint import pprint


# set device to GPU if available
device = torch.cuda.current_device() if torch.cuda.is_available() else None
warnings.filterwarnings("ignore", category=UserWarning)


class ResolvedImprovedEntities(BaseModel):
    """An improved list of extracted entities."""
    
    entities: str = Field(
        ...,
        description="Accurately resolved and improved entities useful for downstream retrieval. Avoid filler words that do not add value as search input. For example, 'Liberty Mutual Insurance Company' should simply be 'Liberty Mutual'.",
    )
    
    
def resolve_entities(content) -> ResolvedImprovedEntities:
    client = instructor.patch(openai.OpenAI())
    return client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        response_model=ResolvedImprovedEntities,
        messages=[
            {
                "role": "system",
                "content": "You are a entity resolution and enhancement AI. Your task is to refine a list of extracted entities by removing, combining, or otherwise improving the text. Make every word count.",
            },
            {
                "role": "user",
                "content": content,
            },
        ],
    )  # type: ignore

# we will use batches of 64
batch_size = 10
data = []


model_id = "dslim/bert-base-NER"
# model_id = "Babelscape/wikineural-multilingual-ner"

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load the NER model from huggingface
model = AutoModelForTokenClassification.from_pretrained(model_id)
# load the tokenizer and model into a NER pipeline
nlp = pipeline(
    "ner", model=model, tokenizer=tokenizer, aggregation_strategy="max", device=device
)

def get_embedding(text: str):
    client = openai.OpenAI()
    model_name = "text-embedding-ada-002"
    response = client.embeddings.create(input=text, model=model_name)
    return response.data[0].embedding

def encode_string(text: str) -> np.ndarray:
    embedding = get_embedding(text)
    return np.array(embedding)

def deduplicate(seq: list[str]) -> list[str]:
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


def get_entities_by_type_and_score(entities, entity_types, score_threshold):
    """
    Get entities of specific types within a certain score threshold.

    Args:
        entities (list): A list of entity dictionaries.
        entity_types (list): A list of desired entity types (e.g., ['ORG', 'PER', 'LOC']).
        score_threshold (float): The minimum score threshold for entities.

    Returns:
        list: A list of entities matching the specified types and score threshold.
    """
    entity_name_list = [
        entity['word']
        for entity in entities
        if entity['entity_group'] in entity_types and entity['score'] >= score_threshold
    ]
    top_entities_distinct = deduplicate(entity_name_list)
    # A single character is never useful to filter on
    top_entities = [item for item in top_entities_distinct if len(item) > 1]
    
    return top_entities

def extract_named_entities(text_batch) -> List[str]:
    # extract named entities using the NER pipeline
    extracted_batch = nlp(text_batch)
    entities = []
    # loop through the results and only select the entity names
    for text in extracted_batch:
        ne = get_entities_by_type_and_score(text, entity_types=['ORG', 'LOC'], score_threshold=0.985)
        entities.append(ne)
    return entities


df = df.sample(50)

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end].copy()
    # generate embeddings for batch
    texts = batch["context"].tolist()
    idx = batch["index"].tolist()
    # get a list of embeddings
    emb = [get_embedding(t) for t in texts]
    # extract named entities from the batch
    entity_batch = extract_named_entities(batch["context"].tolist())
    # Use LLM to resolve and refine the original list of entities
    resolved_batch = [resolve_entities(", ".join(t)) for t in entity_batch]
    processed_entities = [r.entities for r in resolved_batch]
    batch["named_entities"] = processed_entities[0]
    # get metadata
    meta = batch.to_dict(orient="records")
    # add all to upsert list
    to_upsert = list(zip(idx, emb, meta, batch["named_entities"], batch["context"]))
    for id, emb, meta, entity, text in to_upsert:
        temp = {}
        temp["vector"] = np.array(emb)
        temp["metadata"] = meta
        temp['named_entities'] = entity
        temp["context"] = text
        temp["id"] = id
        data.append(temp)
        
        

db = lancedb.connect("./.lancedb")
# create table using above data
tbl = db.create_table("context", data, mode='overwrite')


def search_lancedb(query):
    # extract named entities from the query
    ne = extract_named_entities([query])
    # Use LLM to resolve and refine the original list of entities
    resolved_batch = [resolve_entities(", ".join(t)) for t in ne]
    processed_entities = [r.entities for r in resolved_batch]
    # create embeddings for the query
    xq = get_embedding(query)
    # query the lancedb table while applying named entity filter
    xc = tbl.search(np.array(xq)).limit(100).to_list()
    xdf = tbl.search(np.array(xq)).limit(100).to_pandas()
    xdf["score"] = 1 - xdf["_distance"]
    
    # count the number of matching entities for each search result
    res = []
    for x in xc:
        match_count = sum(1 for i in list(x["named_entities"]) if i in processed_entities[0][0])
        if match_count > 0:
            res.append((x['context'], x['id'], match_count))
    
    # sort the results by match count in descending order
    res.sort(key=lambda x: x[2], reverse=True)
    
    idx_list = [r[1] for r in res]
    df_out = xdf[xdf["id"].isin(idx_list)]
    
    # add a new column to store the match count
    df_out["match_count"] = [r[2] for r in res]
    
    # sort the DataFrame by match count and vector similarity
    df_out = df_out.sort_values(by=["match_count", "score"], ascending=[False, False])
    
    print(f"Found {len(df_out)} matches")
    pprint({"Extracted Named Entities": ne})
    return df_out

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


  0%|          | 0/5 [00:00<?, ?it/s]

# Quering

In [4]:
import lancedb
from pprint import pprint

db = lancedb.connect("./.lancedb")
# create table using above data
tbl = db.create_table("context", data, mode='overwrite')



def search_lancedb(query):
    # extract named entities from the query
    ne = extract_named_entities([query])
    # Use LLM to resolve and refine the original list of entities
    resolved_batch = [resolve_entities(", ".join(t)) for t in ne]
    processed_entities = [r.entities for r in resolved_batch]
    # create embeddings for the query
    xq = get_embedding(query)
    # query the lancedb table while applying named entity filter
    xc = tbl.search(np.array(xq)).limit(100).to_list()
    xdf = tbl.search(np.array(xq)).limit(100).to_pandas()
    xdf["score"] = 1 - xdf["_distance"]
    
    # count the number of matching entities for each search result
    res = []
    for x in xc:
        match_count = sum(1 for i in list(x["named_entities"]) if i in processed_entities[0][0])
        if match_count > 0:
            res.append((x['context'], x['id'], match_count))
    
    # sort the results by match count in descending order
    res.sort(key=lambda x: x[2], reverse=True)
    
    idx_list = [r[1] for r in res]
    df_out = xdf[xdf["id"].isin(idx_list)]
    
    # add a new column to store the match count
    df_out["match_count"] = [r[2] for r in res]
    
    # sort the DataFrame by match count and vector similarity
    df_out = df_out.sort_values(by=["match_count", "score"], ascending=[False, False])
    
    print(f"Found {len(df_out)} matches")
    pprint({"Extracted Named Entities": ne})
    return df_out

In [10]:
tst = df.sample(1)['context'].tolist()[0]

In [11]:
res = search_lancedb(tst)

Found 20 matches
{'Extracted Named Entities': [[]]}


In [12]:
res

Unnamed: 0,vector,metadata,named_entities,context,id,_distance,score,match_count
2,"[-0.017206425, 0.0019140481, 0.018233476, -0.0...",{'attorneys': '['Bill Porter and Reagan F. Goi...,"State Farm, Niles, Illinois, Liberty",The complaint sought a declaration ruling that...,1602,0.214159,0.785841,1
4,"[-0.009269857, -0.0014559502, 0.017910114, -0....","{'attorneys': '['Power Rogers & Smith, EC., of...","State Farm, Niles, Illinois, Liberty",The trial court found the policy did not cover...,158,0.290655,0.709345,1
5,"[-0.0152307, -0.0036428552, 0.02178313, -0.045...","{'attorneys': '['James A. Doherty, Jr., Scanlo...",Liberty Mutual Insurance Company,(Doc. 21 ¶ 19; Doc. 27 ¶ 19.) The 2012-2013 Se...,2520,0.290879,0.709121,1
6,"[-0.030017506, -0.0040518306, 0.021853916, -0....","{'attorneys': '['A. Daniel Vazquez, Jack Joshu...",Liberty Mutual Insurance Company,Plaintiff brings this action against Defendant...,2492,0.292349,0.707651,1
8,"[-0.012538062, 0.0042348937, 0.022674039, -0.0...","{'attorneys': '['Beermann, Swerdlove, Woloshin...","State Farm, Niles, Illinois, Liberty",Summary judgment is appropriate if there is no...,639,0.323391,0.676609,1
10,"[-0.023384955, 0.0015839229, 0.0144932335, -0....",{'attorneys': '['James K. Horstman and Douglas...,"State Farm, Niles, Illinois, Liberty",Black’s Law Dictionary 393 (7th ed. 1999). As ...,765,0.334107,0.665893,1
11,"[-0.025934834, -0.017128048, 0.011713157, -0.0...","{'attorneys': '['Joseph E. Hoefert, of Hoefert...",Liberty Mutual Insurance Company,It is understood and agreed that the company h...,50,0.341697,0.658303,1
13,"[-0.015161019, -0.0054785195, 0.016388848, -0....","{'attorneys': '['Hugh C. Griffin, Arthur J. Mc...",Liberty Mutual Insurance Company,"In addition, the circuit court certified the d...",1092,0.348776,0.651224,1
22,"[-0.016750062, -0.01498477, 0.012936492, -0.00...","{'attorneys': '['Robert Marc Chemers (argued),...","State Farm, Niles, Illinois, Liberty",Danner and Watson also ask this court to find ...,1470,0.366239,0.633761,1
26,"[-0.019881386, -0.0014845061, 0.0019319726, -0...","{'attorneys': '['Wildman, Harrold, Allen & Dix...",Liberty Mutual Insurance Company,. Tri City settled the claims brought by the m...,734,0.377001,0.622999,1
