# NER Powered Semantic Search

This notebook shows how to use Named Entity Recognition (NER) for vector search with LanceDB. We will:

1. Extract named entities from text.
2. Store them in a LanceDB as metadata (alongside respective text vectors).
3. We extract named entities from incoming queries and use them to filter and search only through records containing these named entities.

This is particularly helpful if you want to restrict the search score to records that contain information about the named entities that are also found within the query.

In [1]:
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
from IPython.display import Markdown, display, HTML

os.chdir(os.path.dirname(os.getcwd()))

df = pd.read_parquet('./data/splade.parquet')
print(f"df shape: {df.shape}")
df.head(1)

df shape: (1987, 24)


Unnamed: 0,index,id,citation,name,name_abbreviation,decision_date,court_id,court_name,court_slug,judges,attorneys,citations,url,head,body,name_contains_lm,body_contains_lm,year,context,context_citation,context_tokens,openai_embeddings,splade_embeddings,state
0,0,411690,154 Ill. 2d 90,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",Johnson v. Halloran,2000-01-13,8837,Illinois Appellate Court,ill-app-ct,[],"['Wolter, Beeman, Lynch & McIntyre, of Springf...","[{'type': 'official', 'cite': '312 Ill. App. 3...",https://api.case.law/v1/cases/411690/,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",JUSTICE HALL\r\ndelivered the opinion of the c...,False,True,2000,The public defender of Cook County was appoint...,154 Ill. 2d 90,1317,"[-0.0017778094625100493, -0.002360282698646187...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",New Mexico


In [2]:
from src.search.auto_entity_search import MatchedEntitySearch

In [3]:
entity_search = MatchedEntitySearch(
    db_path="./.lancedb"
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[32m2024-04-28 23:32:42 - INFO - Loaded database with tables: ['context', 'ebay', 'ner_table2', 'ner_table3', 'ner_table4', 'test_table', 'test_table_class'][0m


In [4]:
entity_search.open_table('test_table_class')

In [5]:
entity_search.col_info

[('vector', FixedSizeListType(fixed_size_list<item: float>[1536])),
 ('metadata',
  StructType(struct<attorneys: string, body: string, body_contains_lm: bool, citation: string, citations: string, context: string, context_citation: string, context_tokens: int64, court_id: int64, court_name: string, court_slug: string, decision_date: timestamp[us], head: string, id: int64, index: int64, judges: string, name: string, name_abbreviation: string, name_contains_lm: bool, named_entities: string, openai_embeddings: list<item: double>, splade_embeddings: list<item: float>, state: string, url: string, year: int64>)),
 ('named_entities', DataType(string)),
 ('context', DataType(string)),
 ('id', DataType(int64))]

In [6]:
sample_df = df.sample(25)
test_query = sample_df.sample(1)['context'].tolist()[0]

entity_search.ingest_data(
    df=sample_df,
    table_name="test_table_class",
    mode='append'
)

  0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
Markdown(test_query)

3. Whether Res Judicata and/or Collateral Estoppel Relieve Casualty and American States of Their Obligations to Defend and Indemnify Schal and Buck When Their Named Insureds were Dismissed From the Underlying Litigation Casualty and American States each argued that, assuming valid tenders of defense, neither was obligated to indemnify Schal and Buck for the excess portion of the Keegan judgment because their respective named insureds (Alcan and Chicago Forming) were adjudicated not to be liable for Keegan’s injuries. Therefore, they contend, Schal’s and Buck’s liabilities did not “arise out of’ or occur “with respect to” their named insureds’ conduct. They assert, essentially, that the question of coverage was either decided or should have been decided in the Keegan action and that they are protected by the doctrines of res judicata and/or collateral estoppel. The doctrine of res judicata bars a subsequent suit between the same parties involving the same cause of action where a court of competent jurisdiction has already rendered a final judgment on the merits. Rein v. David A. Noyes & Co., 172 Ill. 2d 325, 665 N.E.2d 1199 (1996). The doctrine precludes litigation of what was actually decided in the first action as well as what could have been decided in that suit. People ex rel. Burris v. Progressive Land Developers, Inc., 151 Ill. 2d 285, 602 N.E.2d 820 (1992). Our supreme court recently adopted the “transactional test” for determining whether a subsequent action involves the same cause of action as a prior suit. River Park, Inc. v. City of Highland Park, 184 Ill. 2d 290, 703 N.E.2d 883 (1998). The present action does not involve the same transaction, that is, the same group of operative facts, as the Keegan lawsuit. The Keegan action involved liability for personal injuries under the Structural Work Act, whereas Schal, Buck and Northbrook seek reimbursement for payments Northbrook made to Keegan that should have been made by Casualty and American States under the terms of certain policies of insurance. A finding that a party is not “in charge of’ work is very different than a finding that injuries did not “arise out of’ that party’s work or were not “connected with,” “incidental to,” “originating from,” “growing out of,” or “flowing from” the work. See Sportmart, Inc. v. Daisy Manufacturing Co., 268 Ill. App. 3d 974, 977 (1994), cit ing Maryland Casualty Co. v. Chicago & North Western Transportation Co., 126 Ill. App. 3d 150, 466 N.E.2d 1091 (1984); Consolidated R. Corp. v. Liberty Mutual Insurance Co., 92 Ill. App. 3d 1066, 416 N.E.2d 758 (1981). In fact, whether Schal’s and Buck’s liability arose out of the work of Alcan and Chicago Forming could not have been determined in the underlying case because that issue was germane only to coverage, rather than to liability. If Casualty and American States had wished their endorsements to extend coverage only where their named insureds were determined to be liable, they could have easily modified the endorsements to so provide. National Union Fire Insurance Co. v. Glenview Park District, 158 Ill. 2d 116, 123 (1994). Collateral estoppel applies only to issues actually litigated and decided, not to matters that might have been litigated or decided. Housing Authority v. Young Men’s Christian Ass’n, 101 Ill. 2d 246, 461 N.E.2d 959 (1984). The party asserting collateral estoppel bears the “heavy burden” of showing with clarity and certainty that the identical issue was decided in the prior case. Peregrine Financial Group, Inc. v. Martinez, 305 Ill. App. 3d 571, 581, 712 N.E.2d 861 (1999). As we have already determined, whether Schal’s and Buck’s liability arose out of the work of Alcan and Chicago Forming was neither litigated nor decided in the Keegan case. 4. Whether the Failure of the Trial Court to Apportion Fault in the Keegan Action Precludes Northbrook From Seeking Reimbursement Both Casualty and American States argue that United States Fidelity & Guaranty Co. v. Continental Casualty Co., 198 Ill. App. 3d 950, 556 N.E.2d 671 (1990) (U.S.F.&G.), stands for the proposition that Illinois courts will not apportion fault among insureds-defendants in a later declaratory judgment action and that any determination as to the degree of fault amongst insureds must be determined by the trial court hearing the underlying matter or is waived. With regard to the Schal, Buck and Northbrook complaint, however, this argument misconstrues the nature of the action. Northbrook does not seek to apportion fault at all, but simply to assert that it should not have been liable for anything other than the excess portion of the judgment. Northbrook seeks full reimbursement for payments it was required to make due to Casualty’s and American States’ refusal to indemnify Schal and Buck. Even so, the central holding of U.S.F.&G. is merely that primary insurers may not obtain equitable contribution from excess carriers. With regard to that opinion’s discussion of apportionment and the statement that “there must be a liability determination as between the joint tortfeasors *** before the obligations of the insurers can be determined” (U.S.F.&G., 198 Ill. App. 3d at 955), this language overlooks the fact that the relative liability of the insureds is not always germane to the issues raised in the underlying dispute. Where a case goes to judgment without an apportionment of fault, as in the Keegan litigation, it is inherently unfair to use that fact to prejudice an insurer that was not a party in the initial litigation and that had no right to raise the issue in the initial proceeding. Accordingly, both because the discussion of apportionment in U.S.F.&G. was not central to its decision and because it is inherently unfair to apply U.S.F.&G. against an insurer that was not a party in the prior litigation, we treat the statement's made therein relative to apportionment of fault as obiter dicta. See Ko v. Eljer Industries, Inc., 287 Ill. App. 3d 35, 41, 678 N.E.2d 641 (1997). 5. Whether Casualty and American States Are Estopped From Denying Coverage to Schal and Buck Because of a Failure to Defend the Keegan Action Under a Reservation of Rights or File a Declaratory Judgment Action to Determine Coverage Obligations

In [9]:
test_res = entity_search.search(
    test_query, 
    limit=100,
    strict=True,
    verbose=True,
    distinct_column='context',
)

[32m2024-04-28 23:33:26 - INFO - Found 92 matches[0m
[32m2024-04-28 23:33:26 - INFO - Extracted Named Entities: ['Res Judicata, Alcan, Chicago Forming, Rein, Progressive Land Developers, Inc, Inc'][0m


In [10]:
test_res['named_entities'].head(10).tolist()

['Res Judicata, Collateral Estoppel, Casualty, American States, Keegan, Alcan, Rein, Highland',
 'Res Judicata, Collateral Estoppel, Casualty, American States, Keegan, Alcan, Rein, Highland',
 'Res Judicata, Collateral Estoppel, Casualty, American States, Keegan, Alcan, Rein, Highland',
 'Res Judicata, Collateral Estoppel, Casualty, American States, Keegan, Alcan, Rein, Highland',
 'Res Judicata, Collateral Estoppel, Casualty, American States, Keegan, Alcan, Rein, Highland',
 'Industrial Commission, Plaintiff, Full Commission, Commission, Parsons, Pantry, Inc, Liberty, Mutual',
 'BFG, Royal, Insurance, Inc',
 'Crum & Forster Managers Corp, BGK, BGK Security Services, Inc, Clarendon',
 'Alliance General Insurance Co, Covington Enterprises, Inc, Evangelical Lutheran Church in America, Westchester, Employers Insurance, Ehlco Liquidating Trust, Society of Mount Carmel, United States, Texas',
 'Rollings and Bowyer, Seevers Farm Drainage, Inc, Seevers, Rollings, Bowyer, Country Mutual']

In [11]:
Markdown(test_res['context'].tolist()[0])

This type of sharp claim practice by an insurer is the very thing the Cincinnati Cos. case roundly condemned. An insurer which knows that its policy potentially covers a party to a lawsuit should not be allowed to avoid its potential obligations by remaining silent with the hope that the potentially covered insured will not formally tender its defense. Accordingly, we find that the failure of Schal and Buck to give formal written notice to American States is not a proper defense to coverage. 3. Whether Res Judicata and/or Collateral Estoppel Relieve Casualty and American States of Their Obligations to Defend and Indemnify Schal and Buck When Their Named Insureds were Dismissed From the Underlying Litigation Casualty and American States each argued that, assuming valid tenders of defense, neither was obligated to indemnify Schal and Buck for the excess portion of the Keegan judgment because their respective named insureds (Alcan and Chicago Forming) were adjudicated not to be liable for Keegan’s injuries. Therefore, they contend, Schal’s and Buck’s liabilities did not “arise out of’ or occur “with respect to” their named insureds’ conduct. They assert, essentially, that the question of coverage was either decided or should have been decided in the Keegan action and that they are protected by the doctrines of res judicata and/or collateral estoppel. The doctrine of res judicata bars a subsequent suit between the same parties involving the same cause of action where a court of competent jurisdiction has already rendered a final judgment on the merits. Rein v. David A. Noyes & Co., 172 Ill. 2d 325, 665 N.E.2d 1199 (1996). The doctrine precludes litigation of what was actually decided in the first action as well as what could have been decided in that suit. People ex rel. Burris v. Progressive Land Developers, Inc., 151 Ill. 2d 285, 602 N.E.2d 820 (1992). Our supreme court recently adopted the “transactional test” for determining whether a subsequent action involves the same cause of action as a prior suit. River Park, Inc. v. City of Highland Park, 184 Ill. 2d 290, 703 N.E.2d 883 (1998). The present action does not involve the same transaction, that is, the same group of operative facts, as the Keegan lawsuit. The Keegan action involved liability for personal injuries under the Structural Work Act, whereas Schal, Buck and Northbrook seek reimbursement for payments Northbrook made to Keegan that should have been made by Casualty and American States under the terms of certain policies of insurance. A finding that a party is not “in charge of’ work is very different than a finding that injuries did not “arise out of’ that party’s work or were not “connected with,” “incidental to,” “originating from,” “growing out of,” or “flowing from” the work. See Sportmart, Inc. v. Daisy Manufacturing Co., 268 Ill. App. 3d 974, 977 (1994), cit ing Maryland Casualty Co. v. Chicago & North Western Transportation Co., 126 Ill. App. 3d 150, 466 N.E.2d 1091 (1984); Consolidated R. Corp. v. Liberty Mutual Insurance Co., 92 Ill. App. 3d 1066, 416 N.E.2d 758 (1981). In fact, whether Schal’s and Buck’s liability arose out of the work of Alcan and Chicago Forming could not have been determined in the underlying case because that issue was germane only to coverage, rather than to liability. If Casualty and American States had wished their endorsements to extend coverage only where their named insureds were determined to be liable, they could have easily modified the endorsements to so provide. National Union Fire Insurance Co. v. Glenview Park District, 158 Ill. 2d 116, 123 (1994). Collateral estoppel applies only to issues actually litigated and decided, not to matters that might have been litigated or decided. Housing Authority v. Young Men’s Christian Ass’n, 101 Ill. 2d 246, 461 N.E.2d 959 (1984). The party asserting collateral estoppel bears the “heavy burden” of showing with clarity and certainty that the identical issue was decided in the prior case. Peregrine Financial Group, Inc. v. Martinez, 305 Ill. App. 3d 571, 581, 712 N.E.2d 861 (1999). As we have already determined, whether Schal’s and Buck’s liability arose out of the work of Alcan and Chicago Forming was neither litigated nor decided in the Keegan case. 4. Whether the Failure of the Trial Court to Apportion Fault in the Keegan Action Precludes Northbrook From Seeking Reimbursement Both Casualty and American States argue that United States Fidelity & Guaranty Co. v. Continental Casualty Co., 198 Ill. App. 3d 950, 556 N.E.2d 671 (1990) (U.S.F.&G.), stands for the proposition that Illinois courts will not apportion fault among insureds-defendants in a later declaratory judgment action and that any determination as to the degree of fault amongst insureds must be determined by the trial court hearing the underlying matter or is waived. With regard to the Schal, Buck and Northbrook complaint, however, this argument misconstrues the nature of the action. Northbrook does not seek to apportion fault at all, but simply to assert that it should not have been liable for anything other than the excess portion of the judgment. Northbrook seeks full reimbursement for payments it was required to make due to Casualty’s and American States’ refusal to indemnify Schal and Buck. Even so, the central holding of U.S.F.&G. is merely that primary insurers may not obtain equitable contribution from excess carriers. With regard to that opinion’s discussion of apportionment and the statement that “there must be a liability determination as between the joint tortfeasors *** before the obligations of the insurers can be determined” (U.S.F.&G., 198 Ill. App. 3d at 955), this language overlooks the fact that the relative liability of the insureds is not always germane to the issues raised in the underlying dispute. Where a case goes to judgment without an apportionment of fault, as in the Keegan litigation, it is inherently unfair to use that fact to prejudice an insurer that was not a party in the initial litigation and that had no right to raise the issue in the initial proceeding.

In [12]:
test_res

Unnamed: 0,context,id,named_entities,match_count,score
0,This type of sharp claim practice by an insure...,85,"Res Judicata, Collateral Estoppel, Casualty, A...",3,0.950325
1,At issue is the law of the case doctrine. The ...,628,"Res Judicata, Collateral Estoppel, Casualty, A...",3,0.746485
2,"Westchester Fire Insurance Co., 321 Ill. App. ...",371,"Res Judicata, Collateral Estoppel, Casualty, A...",3,0.722888
3,American Family responds that the exclusion pr...,1435,"Res Judicata, Collateral Estoppel, Casualty, A...",3,0.722859
4,Plaintiffs refer to the policy-and Defendants'...,2643,"Res Judicata, Collateral Estoppel, Casualty, A...",3,0.717328
...,...,...,...,...,...
113,"During oral argument, counsel for Boeing asser...",1293,Liberty Mutual Insurance Company,0,0.707254
114,"Outboard Marine, 154 Ill. 2d at 125. If the un...",1125,"Association, Illinois, State Farm",0,0.707135
115,State Farm argues that it is seeking redress f...,1083,"Barber - Colman, De Foor, Outboard Marine Corp...",0,0.706916
116,The mall defendants counterclaimed. Discovery ...,1627,"GMC Sonoma, US Bank, ILCS",0,0.706327


## Initialize NER model

To extract named entities, we will use a NER model finetuned on a BERT-base model. The model can be loaded from the HuggingFace model hub as follows:

In [21]:
import torch

# set device to GPU if available
device = torch.cuda.current_device() if torch.cuda.is_available() else None

In [22]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_id = "dslim/bert-base-NER"
# model_id = "Babelscape/wikineural-multilingual-ner"

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load the NER model from huggingface
model = AutoModelForTokenClassification.from_pretrained(model_id)
# load the tokenizer and model into a NER pipeline
nlp = pipeline(
    "ner", model=model, tokenizer=tokenizer, aggregation_strategy="max", device=device
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [25]:
query = """
Regarding the pollution exclusion clause under the terms of comprehensive general liability (CGL) insurance, \
how has the California court defined the phrase 'sudden and accidental', in particular for polluting events? \
Also, has there been any consideration for intentional vs unintentional polluting events? What would Judge Judy Smith say?
"""
# use the NER pipeline to extract named entities from the text
ents = nlp(query)

In [26]:
ents

[{'entity_group': 'LOC',
  'score': 0.9996952,
  'word': 'California',
  'start': 122,
  'end': 132},
 {'entity_group': 'PER',
  'score': 0.9827239,
  'word': 'Judy Smith',
  'start': 326,
  'end': 336}]

In [23]:
nlp_ = get_nlp()

doc = nlp_(text)
ents1 = extract_entities(doc, entity_types=["ORG", "LAW", "EVENT", "PRODUCT"])
ents2 = merge_similar_entities(ents1, threshold=90)

In [24]:
ents2

['section',
 'the Illinois Department of Corrections',
 'Liberty Mutual Insurance Co',
 'the Illinois Constitution of',
 'the Tort Immunity Act',
 'The supreme court',
 'Law Offices of Pretzel  Stouffer',
 'Manchester Insurance  Indemnity Co',
 'the State Lawsuit Immunity Act',
 'CONCLUSION',
 'Purtill',
 'the General Assembly',
 'Kerschner v Weiss  Co',
 'the Illinois Court of Claims',
 'J',
 'Smiley',
 'Ogena',
 'Outboard Marine Corp',
 'the Attorney Registration and Disciplinary Commission',
 'Ill']

In [23]:
def deduplicate(seq: list[str]) -> list[str]:
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


def get_entities_by_type_and_score(entities, entity_types, score_threshold):
    """
    Get entities of specific types within a certain score threshold.

    Args:
        entities (list): A list of entity dictionaries.
        entity_types (list): A list of desired entity types (e.g., ['ORG', 'PER', 'LOC']).
        score_threshold (float): The minimum score threshold for entities.

    Returns:
        list: A list of entities matching the specified types and score threshold.
    """
    entity_name_list = [
        entity['word']
        for entity in entities
        if entity['entity_group'] in entity_types and entity['score'] >= score_threshold
    ]
    
    top_entities_distinct = deduplicate(entity_name_list)
    top_entities = [item for item in top_entities_distinct if len(item) > 1]
    
    return top_entities

In [27]:
ents_list = get_entities_by_type_and_score(ents, entity_types=['ORG', 'PER', 'LOC'], score_threshold=0.99)

In [28]:
ents_list

['California']

In [29]:
import openai
import numpy as np

client = openai.OpenAI()
model_name = "text-embedding-ada-002"

def get_embedding(text: str):
    response = client.embeddings.create(input=text, model=model_name)
    return response.data[0].embedding

def encode_string(text: str) -> np.ndarray:
    embedding = get_embedding(text)
    return np.array(embedding)

## Initialize LanceDB

In [30]:
import lancedb

db = lancedb.connect("./.lancedb")

## Generate Embeddings and Insert

We generate embeddings for the title_text column we created earlier. Alongside the embeddings, we also include the named entities in the index as metadata. Later we will apply a filter based on these named entities when executing queries.

Let's first write a helper function to extract named entities from a batch of text.

In [31]:
from typing import List

def extract_named_entities(text_batch: List[str]) -> List[str]:
    # extract named entities using the NER pipeline
    extracted_batch = nlp(text_batch)
    entities = []
    # loop through the results and only select the entity names
    for text in extracted_batch:
        # ne = [entity["word"] for entity in text]
        ne = get_entities_by_type_and_score(text, entity_types=['ORG', 'LOC'], score_threshold=0.985)
        entities.append(ne)
    return entities

In [37]:
from typing import Iterable
import openai
import instructor
import nest_asyncio
from pydantic import BaseModel, Field


class ResolvedImprovedEntities(BaseModel):
    """An improved list of extracted entities."""
    
    entities: str = Field(
        ...,
        description="Accurately resolved and improved entities useful for downstream retrieval. Avoid filler words that do not add value as search input. For example, 'Liberty Mutual Insurance Company' should simply be 'Liberty Mutual'.",
    )
    
    
def resolve_entities(content) -> ResolvedImprovedEntities:
    client = instructor.patch(openai.OpenAI())
    return client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        response_model=ResolvedImprovedEntities,
        messages=[
            {
                "role": "system",
                "content": "You are a entity resolution and enhancement AI. Your task is to refine a list of extracted entities by removing, combining, or otherwise improving the text. Make every word count.",
            },
            {
                "role": "user",
                "content": content,
            },
        ],
    )  # type: ignore

In [33]:
sample_input = df.sample(20)['body'].tolist()

entity_batch = extract_named_entities(sample_input)
print(entity_batch[0])

['Safeway']


In [35]:
entity_batch

[['Safeway'],
 ['UPMC',
  'UPMC Benefit Management Services, Inc',
  'BMS',
  'UPMC Health Benefits, Inc',
  'MCMC',
  'WorkPartners'],
 ['Rotondo Weirich Enterprises, Inc',
  'Sundt Construction, Inc',
  'Sundt / Layton',
  'California Department of Corrections and Rehabilitation',
  'CDCR',
  'Fidelity & Deposit Company of Maryland',
  'Zurich American Insurance Company',
  'Federal Insurance Company',
  'Liberty Mutual Insurance Company',
  'CML RW Security, LLC',
  'CML',
  'California'],
 ['State Farm Fire and Casualty Company',
  'State Farm',
  'Perry County',
  'Williamson',
  'State'],
 ['Soprafina',
  'Thomas M. Tunney Enterprises, Ltd',
  'Ann Sather Restaurants',
  'Chicago City Council',
  'Journal of Proceedings'],
 ['JUSTICE KAPALA',
  'American Standard Insurance Company',
  'Bronco',
  'Basbagill',
  'Bencak',
  'American Standard'],
 ['Ni - cor', 'London Insurers', 'Nicor', 'Northern Illinois'],
 ['Baby Fold',
  'Illinois',
  'Chicago Trust Company',
  'Chicago Trust'

In [38]:
resolved_batch = [resolve_entities(", ".join(t)) for t in entity_batch]
print(resolved_batch[0].entities)

Safeway


In [40]:
processed_entities = [r.entities for r in resolved_batch]

In [41]:
processed_entities

['Safeway',
 'UPMC, UPMC Benefit Management Services, Inc, BMS, UPMC Health Benefits, Inc, MCMC, WorkPartners',
 'Rotondo Weirich Enterprises, Inc, Sundt Construction, Inc, Sundt / Layton, California Department of Corrections and Rehabilitation, CDCR, Fidelity & Deposit Company of Maryland, Zurich American Insurance Company, Federal Insurance Company, Liberty Mutual Insurance Company, CML RW Security, LLC, CML, California',
 'State Farm Fire and Casualty Company, State Farm, Perry County, Williamson, State',
 'Soprafina, Thomas M. Tunney Enterprises, Ltd, Ann Sather Restaurants, Chicago City Council, Journal of Proceedings',
 'JUSTICE KAPALA, American Standard Insurance Company, Bronco, Basbagill, Bencak, American Standard',
 'Nicor, London Insurers, Northern Illinois',
 'Baby Fold, Illinois, Chicago Trust Company, Chicago Trust, Philadelphia',
 'Ohana Control Systems, Inc, State of Hawaii Department of Education, DOE, Plaintiff Philadelphia Indemnity Insurance Company, Philadelphia, H

In [3]:
from typing import List
import openai
from pydantic import BaseModel, Field
from tqdm.auto import tqdm
import warnings
import openai
import instructor
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import lancedb
from pprint import pprint


# set device to GPU if available
device = torch.cuda.current_device() if torch.cuda.is_available() else None
warnings.filterwarnings("ignore", category=UserWarning)


class ResolvedImprovedEntities(BaseModel):
    """An improved list of extracted entities."""
    
    entities: str = Field(
        ...,
        description="Accurately resolved and improved entities useful for downstream retrieval. Avoid filler words that do not add value as search input. For example, 'Liberty Mutual Insurance Company' should simply be 'Liberty Mutual'.",
    )
    
    
def resolve_entities(content) -> ResolvedImprovedEntities:
    client = instructor.patch(openai.OpenAI())
    return client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        response_model=ResolvedImprovedEntities,
        messages=[
            {
                "role": "system",
                "content": "You are a entity resolution and enhancement AI. Your task is to refine a list of extracted entities by removing, combining, or otherwise improving the text. Make every word count.",
            },
            {
                "role": "user",
                "content": content,
            },
        ],
    )  # type: ignore

# we will use batches of 64
batch_size = 10
data = []


model_id = "dslim/bert-base-NER"
# model_id = "Babelscape/wikineural-multilingual-ner"

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load the NER model from huggingface
model = AutoModelForTokenClassification.from_pretrained(model_id)
# load the tokenizer and model into a NER pipeline
nlp = pipeline(
    "ner", model=model, tokenizer=tokenizer, aggregation_strategy="max", device=device
)

def get_embedding(text: str):
    client = openai.OpenAI()
    model_name = "text-embedding-ada-002"
    response = client.embeddings.create(input=text, model=model_name)
    return response.data[0].embedding

def encode_string(text: str) -> np.ndarray:
    embedding = get_embedding(text)
    return np.array(embedding)

def deduplicate(seq: list[str]) -> list[str]:
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


def get_entities_by_type_and_score(entities, entity_types, score_threshold):
    """
    Get entities of specific types within a certain score threshold.

    Args:
        entities (list): A list of entity dictionaries.
        entity_types (list): A list of desired entity types (e.g., ['ORG', 'PER', 'LOC']).
        score_threshold (float): The minimum score threshold for entities.

    Returns:
        list: A list of entities matching the specified types and score threshold.
    """
    entity_name_list = [
        entity['word']
        for entity in entities
        if entity['entity_group'] in entity_types and entity['score'] >= score_threshold
    ]
    top_entities_distinct = deduplicate(entity_name_list)
    # A single character is never useful to filter on
    top_entities = [item for item in top_entities_distinct if len(item) > 1]
    
    return top_entities

def extract_named_entities(text_batch) -> List[str]:
    # extract named entities using the NER pipeline
    extracted_batch = nlp(text_batch)
    entities = []
    # loop through the results and only select the entity names
    for text in extracted_batch:
        ne = get_entities_by_type_and_score(text, entity_types=['ORG', 'LOC'], score_threshold=0.985)
        entities.append(ne)
    return entities


df = df.sample(50)

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end].copy()
    # generate embeddings for batch
    texts = batch["context"].tolist()
    idx = batch["index"].tolist()
    # get a list of embeddings
    emb = [get_embedding(t) for t in texts]
    # extract named entities from the batch
    entity_batch = extract_named_entities(batch["context"].tolist())
    # Use LLM to resolve and refine the original list of entities
    resolved_batch = [resolve_entities(", ".join(t)) for t in entity_batch]
    processed_entities = [r.entities for r in resolved_batch]
    batch["named_entities"] = processed_entities[0]
    # get metadata
    meta = batch.to_dict(orient="records")
    # add all to upsert list
    to_upsert = list(zip(idx, emb, meta, batch["named_entities"], batch["context"]))
    for id, emb, meta, entity, text in to_upsert:
        temp = {}
        temp["vector"] = np.array(emb)
        temp["metadata"] = meta
        temp['named_entities'] = entity
        temp["context"] = text
        temp["id"] = id
        data.append(temp)
        
        

db = lancedb.connect("./.lancedb")
# create table using above data
tbl = db.create_table("context", data, mode='overwrite')


def search_lancedb(query):
    # extract named entities from the query
    ne = extract_named_entities([query])
    # Use LLM to resolve and refine the original list of entities
    resolved_batch = [resolve_entities(", ".join(t)) for t in ne]
    processed_entities = [r.entities for r in resolved_batch]
    # create embeddings for the query
    xq = get_embedding(query)
    # query the lancedb table while applying named entity filter
    xc = tbl.search(np.array(xq)).limit(100).to_list()
    xdf = tbl.search(np.array(xq)).limit(100).to_pandas()
    xdf["score"] = 1 - xdf["_distance"]
    
    # count the number of matching entities for each search result
    res = []
    for x in xc:
        match_count = sum(1 for i in list(x["named_entities"]) if i in processed_entities[0][0])
        if match_count > 0:
            res.append((x['context'], x['id'], match_count))
    
    # sort the results by match count in descending order
    res.sort(key=lambda x: x[2], reverse=True)
    
    idx_list = [r[1] for r in res]
    df_out = xdf[xdf["id"].isin(idx_list)]
    
    # add a new column to store the match count
    df_out["match_count"] = [r[2] for r in res]
    
    # sort the DataFrame by match count and vector similarity
    df_out = df_out.sort_values(by=["match_count", "score"], ascending=[False, False])
    
    print(f"Found {len(df_out)} matches")
    pprint({"Extracted Named Entities": ne})
    return df_out

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


  0%|          | 0/5 [00:00<?, ?it/s]

# Quering

In [4]:
import lancedb
from pprint import pprint

db = lancedb.connect("./.lancedb")
# create table using above data
tbl = db.create_table("context", data, mode='overwrite')



def search_lancedb(query):
    # extract named entities from the query
    ne = extract_named_entities([query])
    # Use LLM to resolve and refine the original list of entities
    resolved_batch = [resolve_entities(", ".join(t)) for t in ne]
    processed_entities = [r.entities for r in resolved_batch]
    # create embeddings for the query
    xq = get_embedding(query)
    # query the lancedb table while applying named entity filter
    xc = tbl.search(np.array(xq)).limit(100).to_list()
    xdf = tbl.search(np.array(xq)).limit(100).to_pandas()
    xdf["score"] = 1 - xdf["_distance"]
    
    # count the number of matching entities for each search result
    res = []
    for x in xc:
        match_count = sum(1 for i in list(x["named_entities"]) if i in processed_entities[0][0])
        if match_count > 0:
            res.append((x['context'], x['id'], match_count))
    
    # sort the results by match count in descending order
    res.sort(key=lambda x: x[2], reverse=True)
    
    idx_list = [r[1] for r in res]
    df_out = xdf[xdf["id"].isin(idx_list)]
    
    # add a new column to store the match count
    df_out["match_count"] = [r[2] for r in res]
    
    # sort the DataFrame by match count and vector similarity
    df_out = df_out.sort_values(by=["match_count", "score"], ascending=[False, False])
    
    print(f"Found {len(df_out)} matches")
    pprint({"Extracted Named Entities": ne})
    return df_out

In [10]:
tst = df.sample(1)['context'].tolist()[0]

In [11]:
res = search_lancedb(tst)

Found 20 matches
{'Extracted Named Entities': [[]]}


In [12]:
res

Unnamed: 0,vector,metadata,named_entities,context,id,_distance,score,match_count
2,"[-0.017206425, 0.0019140481, 0.018233476, -0.0...",{'attorneys': '['Bill Porter and Reagan F. Goi...,"State Farm, Niles, Illinois, Liberty",The complaint sought a declaration ruling that...,1602,0.214159,0.785841,1
4,"[-0.009269857, -0.0014559502, 0.017910114, -0....","{'attorneys': '['Power Rogers & Smith, EC., of...","State Farm, Niles, Illinois, Liberty",The trial court found the policy did not cover...,158,0.290655,0.709345,1
5,"[-0.0152307, -0.0036428552, 0.02178313, -0.045...","{'attorneys': '['James A. Doherty, Jr., Scanlo...",Liberty Mutual Insurance Company,(Doc. 21 ¶ 19; Doc. 27 ¶ 19.) The 2012-2013 Se...,2520,0.290879,0.709121,1
6,"[-0.030017506, -0.0040518306, 0.021853916, -0....","{'attorneys': '['A. Daniel Vazquez, Jack Joshu...",Liberty Mutual Insurance Company,Plaintiff brings this action against Defendant...,2492,0.292349,0.707651,1
8,"[-0.012538062, 0.0042348937, 0.022674039, -0.0...","{'attorneys': '['Beermann, Swerdlove, Woloshin...","State Farm, Niles, Illinois, Liberty",Summary judgment is appropriate if there is no...,639,0.323391,0.676609,1
10,"[-0.023384955, 0.0015839229, 0.0144932335, -0....",{'attorneys': '['James K. Horstman and Douglas...,"State Farm, Niles, Illinois, Liberty",Black’s Law Dictionary 393 (7th ed. 1999). As ...,765,0.334107,0.665893,1
11,"[-0.025934834, -0.017128048, 0.011713157, -0.0...","{'attorneys': '['Joseph E. Hoefert, of Hoefert...",Liberty Mutual Insurance Company,It is understood and agreed that the company h...,50,0.341697,0.658303,1
13,"[-0.015161019, -0.0054785195, 0.016388848, -0....","{'attorneys': '['Hugh C. Griffin, Arthur J. Mc...",Liberty Mutual Insurance Company,"In addition, the circuit court certified the d...",1092,0.348776,0.651224,1
22,"[-0.016750062, -0.01498477, 0.012936492, -0.00...","{'attorneys': '['Robert Marc Chemers (argued),...","State Farm, Niles, Illinois, Liberty",Danner and Watson also ask this court to find ...,1470,0.366239,0.633761,1
26,"[-0.019881386, -0.0014845061, 0.0019319726, -0...","{'attorneys': '['Wildman, Harrold, Allen & Dix...",Liberty Mutual Insurance Company,. Tri City settled the claims brought by the m...,734,0.377001,0.622999,1
