# Introduction to NER Powered Semantic Search

In this notebook we explore how to use Qdrant's payload filtering to narrow down a vector search. We use Named Entity Recognition (NER) to find the named entities in a query and keep only the records that share at least one of them, before cosine similarity ranks what remains.
Restricting the search space this way can speed up the search and make the results more relevant.
We need three things:


1. **Qdrant**: stores the vector embeddings of the data together with their payload.
2. **NER model**: extracts named entities from the records, which are stored in Qdrant as payload, and from the queries, where they are used to filter the search space.
3. **Retriever model**: embeds the text into numerical representations (vectors) that Qdrant can store and search efficiently.

We will use the **newspop** dataset from Hugging Face Datasets (News Popularity in Multiple Social Media Platforms).
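To make the filter-then-rank idea from the introduction concrete, here is a minimal sketch in plain Python. It does not use Qdrant at all: the three records, their 3-dimensional vectors, and the `filtered_search` helper are made up for illustration. It only shows the order of operations, which is to discard records that share no entity with the query and then rank the rest by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy records: vectors and entity lists are hypothetical stand-ins for
# real embeddings and real NER output.
records = [
    {"title": "Asia's fastest growing economy",
     "vector": [0.9, 0.1, 0.0], "named_entities": ["Asia"]},
    {"title": "Europe's economy slows",
     "vector": [0.8, 0.2, 0.1], "named_entities": ["Europe"]},
    {"title": "Obama visits Asia",
     "vector": [0.1, 0.9, 0.2], "named_entities": ["Obama", "Asia"]},
]


def filtered_search(query_vector, query_entities, records, limit=2):
    # Step 1: keep only records that share at least one entity with the query.
    wanted = set(query_entities)
    candidates = [r for r in records if wanted & set(r["named_entities"])]
    # Step 2: rank the surviving records by cosine similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return [r["title"] for r in ranked[:limit]]


results = filtered_search([1.0, 0.0, 0.0], ["Asia"], records)
print(results)
```

The "Europe" record is the second closest vector to the query, but it never reaches the ranking step because it does not mention Asia.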

## Install Dependencies

In [25]:
!pip install -qU qdrant-client==1.2.0 cohere==4.11.2 sentence-transformers==2.2.2\
    datasets==2.12.0 tqdm==4.65.0 spacy==3.5.3 spacy-transformers==1.2.5 en-core-web-sm==3.5.0

## Import libraries

In [26]:
from datasets import load_dataset
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import cohere
from qdrant_client import QdrantClient
from pathlib import Path
from qdrant_client.http import models
import en_core_web_sm
import spacy
import spacy_transformers
from tqdm.auto import tqdm
from pprint import pprint
import os

## Load Dataset
The **newspop** dataset contains about 93k news records with various columns.

In [3]:
# load the dataset and convert to pandas dataframe
df = load_dataset("newspop", split="train").to_pandas()
print(len(df))
df[:1]



93239


|  | id | title | headline | source | topic | publish_date | facebook | google_plus | linked_in |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 99248 | Obama Lays Wreath at Arlington National Cemetery | Obama Lays Wreath at Arlington National Cemete... | USA TODAY | obama | 2002-04-02 00:00:00 | -1 | -1 | -1 |


## Prepare Dataset
We select 50k random articles and remove the columns we do not need.

In [4]:
# drop empty rows and select 50k articles
df = df.dropna().sample(50000, random_state=32)
df.head()

|  | id | title | headline | source | topic | publish_date | facebook | google_plus | linked_in |
|---|---|---|---|---|---|---|---|---|---|
| 89388 | 59266 | Jumpstarting Europe's Economy | PARIS – Not so long ago, the notion of the Eur... | Project Syndicate | economy | 2016-06-26 16:51:04 | -1 | 6 | 9 |
| 44396 | 27606 | San Angelo's diverse economy shows resiliency | The stronger-than-expected economic expansion ... | San Angelo Standard Times | economy | 2016-02-28 08:51:26 | 0 | 0 | 0 |
| 29596 | 73770 | 7 Great Ways to Spend Obama's Autonomous Car Cash | Last week, President Obama announced plans to ... | Gizmodo | obama | 2016-01-21 21:05:26 | 22 | 1 | 5 |
| 33986 | 21485 | US oil drops 6% on China, slim OPEC deal hope | U.S. crude futures closed nearly 6 percent on ... | CNBC | economy | 2016-02-01 17:40:23 | 87 | 10 | 57 |
| 56884 | 87063 | Fidel Castro Rejects Obama's Advice | Former Cuban dictator Fidel Castro has rejecte... | Voice of America (blog) | obama | 2016-03-30 22:15:07 | 19 | 1 | 0 |


We will combine the title and headline for generating vector embeddings.

In [5]:
# join the article title and headline, then drop the columns we no longer need
df["title_headline"] = df["title"] + ". " + df["headline"]
df.drop(
    labels=[
        "source",
        "topic",
        "publish_date",
        "facebook",
        "google_plus",
        "id",
        "linked_in",
        "headline",
    ],
    axis=1,
    inplace=True,
)
df.head()

|  | title | title_headline |
|---|---|---|
| 89388 | Jumpstarting Europe's Economy | Jumpstarting Europe's Economy. PARIS – Not so ... |
| 44396 | San Angelo's diverse economy shows resiliency | San Angelo's diverse economy shows resiliency.... |
| 29596 | 7 Great Ways to Spend Obama's Autonomous Car Cash | 7 Great Ways to Spend Obama's Autonomous Car C... |
| 33986 | US oil drops 6% on China, slim OPEC deal hope | US oil drops 6% on China, slim OPEC deal hope.... |
| 56884 | Fidel Castro Rejects Obama's Advice | Fidel Castro Rejects Obama's Advice. Former Cu... |


## Initialize NER Model
We need an NER model to extract named entities from our `title_headline` text and from our queries. Let's compare two NER models:
1. **dslim/bert-base-NER** from Hugging Face
2. **en_core_web_sm** from spaCy

In [6]:
# set device to GPU if available
device = torch.cuda.current_device() if torch.cuda.is_available() else None

In [7]:
ner_model_id = "dslim/bert-base-NER"

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(ner_model_id)
# load the NER model from huggingface
ner_model = AutoModelForTokenClassification.from_pretrained(ner_model_id)
# load the tokenizer and ner_model into a NER pipeline
nlp_bert = pipeline(
    "ner",
    model=ner_model,
    tokenizer=tokenizer,
    aggregation_strategy="max",
    device=device,
)
text = "apple is looking at buying U.K. startup for $1 billion"
nlp_bert(text)



[{'entity_group': 'LOC',
  'score': 0.9561979,
  'word': 'U. K.',
  'start': 27,
  'end': 31}]

In [8]:
# load the small English spaCy pipeline (loading it once is enough)
nlp_spacy = spacy.load("en_core_web_sm")
text = "apple is looking at buying U.K. startup for $1 billion"

doc = nlp_spacy(text)
for ent in doc.ents:
    print(ent)

apple
U.K.
$1 billion


The BERT-based model extracted only U.K. from the text, while the spaCy model also extracted apple and $1 billion. A likely reason is the label set: dslim/bert-base-NER is fine-tuned on CoNLL-2003 and predicts four entity types (person, organization, location, and miscellaneous), so it has no label for monetary amounts, and as a cased model it is less likely to recognize the lowercase "apple" as an organization. spaCy's en_core_web_sm covers a wider set of entity types, including money. spaCy is also fast on CPU, so we will use the spaCy NER model.

## Initialize Retriever
We will use two models, **cohere** and **multi-qa-MiniLM-L6-cos-v1**, to create vector representations of our records (i.e., the `title_headline` text) and of our search queries. These vector embeddings capture the semantic meaning of the records. During retrieval, a similarity measure (here, cosine similarity) is applied in vector space to find the records most similar to a given query.
Because the two models produce embeddings of different dimensions, we will use a Qdrant collection with multiple named vectors.

In [38]:
# Initialize cohere client using your api key, you can get your api key from cohere after signing up
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
cohere_client = cohere.Client(COHERE_API_KEY)

# load the MiniLM model from huggingface model hub
minilm = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1", device=device)

## Initialize Qdrant client and create a collection

Our collection will hold two named vectors per point, `cohere` and `minilm`, with sizes 1024 and 384 respectively.

In [36]:
# Initialize Qdrant client

current_folder = Path.cwd()  # Get the current folder
qdrant_folder = current_folder / "Qdrant-db"
qdrant_folder.mkdir(exist_ok=True)  # Create qdrant folder to store collection (no error if it already exists)

client = QdrantClient(path=str(qdrant_folder.resolve()))  # path to new qdrant folder
news_collection = "ner-search"

collections = client.get_collections()
print(collections)

# get_collections() returns a response object, so compare against the collection names
existing_names = [collection.name for collection in collections.collections]

# only create collection if it doesn't exist
if news_collection not in existing_names:
    client.recreate_collection(
        collection_name=news_collection,
        vectors_config={  # dimensionality of vectors output by retriever models and metric used to check similarity
            "cohere": models.VectorParams(size=1024, distance=models.Distance.COSINE),
            "minilm": models.VectorParams(size=384, distance=models.Distance.COSINE),
        },
    )
collections = client.get_collections()
print(collections)

collections=[]
collections=[CollectionDescription(name='ner-search')]
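Since every point has to supply one vector per configured name, with exactly the configured size, a small pre-flight check can catch mismatches before they reach the database. The sketch below is plain Python and is not part of the Qdrant client; `VECTOR_SIZES` and `check_named_vectors` are hypothetical names that simply mirror the sizes configured for the collection above.

```python
# Expected dimensionality per named vector, mirroring the collection config.
VECTOR_SIZES = {"cohere": 1024, "minilm": 384}


def check_named_vectors(vectors, expected_sizes=VECTOR_SIZES):
    """Return True if `vectors` has every expected name with the right length.

    Raises ValueError describing the first problem found.
    """
    for name, size in expected_sizes.items():
        if name not in vectors:
            raise ValueError(f"missing named vector: {name!r}")
        if len(vectors[name]) != size:
            raise ValueError(
                f"vector {name!r} has length {len(vectors[name])}, expected {size}"
            )
    return True


# A point with both vectors at the right size passes the check.
point_vectors = {"cohere": [0.0] * 1024, "minilm": [0.0] * 384}
ok = check_named_vectors(point_vectors)

# Swapping the two embeddings is detected instead of silently stored.
try:
    check_named_vectors({"cohere": [0.0] * 384, "minilm": [0.0] * 1024})
    swapped_detected = False
except ValueError:
    swapped_detected = True

print(ok, swapped_detected)
```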


We create a helper function that extracts named entities from a batch of texts, since processing texts in batches is faster than processing them one by one. Along with the embeddings, we also store the named entities as payload in our collection; they are used to filter the search space at query time.

In [14]:
def extract_named_entities(text_batch):
    # run the spaCy pipeline over the whole batch; nlp.pipe processes the texts in batches
    entities = []
    for doc in nlp_spacy.pipe(text_batch):
        # only keep the text of each entity
        named_entities = [ent.text for ent in doc.ents]
        entities.append(named_entities)
    return entities
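The payload stored with each point is an ordinary dictionary, and the entity list in it is what the filter matches against later. As a small sketch, assuming we want duplicates removed while keeping the order in which entities first appear, the payload for one record could be built as follows. `unique_entities` and `build_payload` are hypothetical helpers, and the entity list is hand-written rather than real NER output.

```python
def unique_entities(entities):
    # dict keys preserve insertion order, so this drops duplicates
    # without reordering the remaining names
    return list(dict.fromkeys(entities))


def build_payload(title, title_headline, entities):
    """Assemble the payload dictionary stored next to the vectors."""
    return {
        "title": title,
        "title_headline": title_headline,
        "named_entities": unique_entities(entities),
    }


# Hand-written entity list standing in for the NER output of one record.
payload = build_payload(
    title="Fidel Castro Rejects Obama's Advice",
    title_headline="Fidel Castro Rejects Obama's Advice. (headline text)",
    entities=["Fidel Castro", "Obama", "Cuban", "Fidel Castro"],
)
print(payload["named_entities"])
```

Deduplicating with `set` works just as well for filtering; keeping the order only makes the stored payload easier to read and reproducible between runs.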

## Generate Embeddings -> Store in Qdrant collection

In [40]:
%%time
batch_size = 512  # tune to your RAM and compute; larger batches use more memory

for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i + batch_size, len(df))  # find end of batch
    # copy the slice so that adding a column below neither triggers pandas'
    # SettingWithCopyWarning nor touches the original dataframe
    batch = df.iloc[i:i_end].copy()

    # generate embeddings using Cohere and make sure every value is a plain float
    cohere_emb = cohere_client.embed(
        model="small", texts=batch["title_headline"].tolist()
    ).embeddings
    cohere_emb = [[float(value) for value in emb] for emb in cohere_emb]

    minilm_emb = minilm.encode(
        batch["title_headline"].tolist()
    ).tolist()  # generate embeddings using MiniLM

    entities = extract_named_entities(
        batch["title_headline"].tolist()
    )  # extract named entities from batch
    batch["named_entities"] = [
        list(set(entity)) for entity in entities
    ]  # remove duplicate entities

    meta = batch.to_dict(orient="records")  # get metadata
    ids = list(range(i, i_end))  # create unique IDs

    # upsert to qdrant
    client.upsert(
        collection_name=news_collection,
        points=models.Batch(
            ids=ids,
            vectors={
                "cohere": cohere_emb,
                "minilm": minilm_emb,
            },
            payloads=meta,
        ),
    )
print(
    "vector count in collection- ",
    client.get_collection(collection_name=news_collection).vectors_count,
)


vector count in collection-  100000
CPU times: total: 10min 49s
Wall time: 19min 30s
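The loop above walks the dataframe in slices of `batch_size` rows and reuses the row positions as point IDs. The slicing arithmetic can be checked on its own with a short sketch; `batch_bounds` is a hypothetical helper, and the numbers are the 50,000 records and batch size of 512 used in this notebook.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) positions covering range(total) in batch_size steps."""
    for start in range(0, total, batch_size):
        # the last batch may be shorter, so clamp the end to the total
        yield start, min(start + batch_size, total)


bounds = list(batch_bounds(50_000, 512))

print(len(bounds))   # 98 batches
print(bounds[0])     # (0, 512)
print(bounds[-1])    # (49664, 50000), a final short batch of 336 rows

# The IDs produced per batch are consecutive and never overlap.
ids = [point_id for start, end in bounds for point_id in range(start, end)]
print(len(ids) == len(set(ids)) == 50_000)
```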


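Now all the components we need are ready. The collection reports 100,000 vectors because each of the 50,000 points stores two named vectors (`cohere` and `minilm`). We can move on to querying.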

## Search Qdrant

Next we write a helper function that creates a vector embedding for the query and then searches Qdrant with the query's named entities as a filter. The filter is configured so that every result must contain at least one of the named entities from the query. The function also lets us choose which of the two embedding models is used to embed the query, and it searches the matching named vector.

In [48]:
def search_qdrant(query, embed_model):
    ne = extract_named_entities([query])[0]  # extract named entities from the query
    print(ne)
    # generate the query embedding with the chosen model
    if embed_model == "cohere":
        encoded_query = cohere_client.embed(model="small", texts=[query]).embeddings[0]
    elif embed_model == "minilm":
        encoded_query = minilm.encode(query).tolist()
    else:
        # stop here: without a valid model there is no query vector to search with
        raise ValueError("embed_model must be 'cohere' or 'minilm'")

    # query the qdrant collection while applying the named entity filter
    results = client.search(
        collection_name=news_collection,
        # the vector name equals the model name, so it selects the named vector to search
        query_vector=(embed_model, encoded_query),
        query_filter=models.Filter(  # keep only points that share at least one entity with the query
            must=[
                models.FieldCondition(
                    key="named_entities", match=models.MatchAny(any=ne)
                )
            ]
        ),
        limit=10,
    )
    # extract article titles from the search result
    r = [x.payload["title"] for x in results]
    pprint({"Extracted Named Entities": ne, "Result": r})
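It can help to see the filter as plain data. To the best of my understanding, the `Filter` / `FieldCondition` / `MatchAny` objects used in `search_qdrant` correspond to the JSON structure that Qdrant's REST API accepts, which the sketch below builds as a dictionary. `build_entity_filter` is a hypothetical helper, not a client function. What to do when a query contains no named entities is a design choice; here the helper returns `None` to stand for "search without a filter".

```python
import json


def build_entity_filter(entities, key="named_entities"):
    """Build a match-any filter as a plain dictionary.

    Returns None when there are no entities, meaning no filter is applied.
    """
    if not entities:
        return None
    return {"must": [{"key": key, "match": {"any": list(entities)}}]}


entity_filter = build_entity_filter(["US", "China"])
print(json.dumps(entity_filter, indent=2))

# A query without named entities yields no filter at all.
no_filter = build_entity_filter([])
print(no_filter)
```

A point passes this filter if its `named_entities` list contains "US" or "China"; it does not need to contain both.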

Let's start querying.

In [46]:
query = "What are the best performing economies in Asia"
search_qdrant(query, "minilm")

['Asia']
{'Extracted Named Entities': ['Asia'],
 'Result': ['Thomson Reuters/INSEAD Q4 Asian Business Sentiment Survey ...',
            'Japan leads Asia to modest gains despite China data; dollar '
            'steady',
            'PH to remain Asia’s best performing economy',
            "Philippine economy to remain 'best performer' in Asia, says HSBC",
            'Asia’s fastest: PH economy grows 6.9%',
            "Asia's Strongest Economy Is on Fire",
            'PHL shifted into one of the most improved economies in Asia '
            '\x9d\x9d\x9d NEDA',
            'India poised for one of highest growth in emerging Asia: OECD',
            "China doing 'great job' in managing economic shift: CS",
            'The Philippine economy may have overshot its largest companies']}


In [49]:
query = "What are the best performing economies in Asia"
search_qdrant(query, "cohere")

['Asia']
{'Extracted Named Entities': ['Asia'],
 'Result': ['PH to remain Asia’s best performing economy',
            'Asia’s fastest: PH economy grows 6.9%',
            'PHL is Asia’s fastest growing economy in 1st Qtr 2016',
            'PHL shifted into one of the most improved economies in Asia '
            '\x9d\x9d\x9d NEDA',
            "Asia's New Year Economic Resolutions",
            "The Vietnamese Economy: One Of Asia's Strongest",
            'Singapore tops mobile app economy, but for how long?',
            'Asia reforms key for global economic growth: IMF chief',
            "Philippine economy to remain 'best performer' in Asia, says HSBC",
            'India poised for one of highest growth in emerging Asia: OECD']}


Both embedding models return similar results for this query: half of the titles appear in both lists, although in a different order.
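One simple way to quantify how similar the two result lists are is to count the titles they share. The sketch below copies the titles from the two outputs above for the query "What are the best performing economies in Asia"; the variable names are only illustrative.

```python
# Titles returned with the MiniLM embeddings (copied from the output above).
minilm_titles = [
    "Thomson Reuters/INSEAD Q4 Asian Business Sentiment Survey ...",
    "Japan leads Asia to modest gains despite China data; dollar steady",
    "PH to remain Asia’s best performing economy",
    "Philippine economy to remain 'best performer' in Asia, says HSBC",
    "Asia’s fastest: PH economy grows 6.9%",
    "Asia's Strongest Economy Is on Fire",
    "PHL shifted into one of the most improved economies in Asia \x9d\x9d\x9d NEDA",
    "India poised for one of highest growth in emerging Asia: OECD",
    "China doing 'great job' in managing economic shift: CS",
    "The Philippine economy may have overshot its largest companies",
]

# Titles returned with the Cohere embeddings (copied from the output above).
cohere_titles = [
    "PH to remain Asia’s best performing economy",
    "Asia’s fastest: PH economy grows 6.9%",
    "PHL is Asia’s fastest growing economy in 1st Qtr 2016",
    "PHL shifted into one of the most improved economies in Asia \x9d\x9d\x9d NEDA",
    "Asia's New Year Economic Resolutions",
    "The Vietnamese Economy: One Of Asia's Strongest",
    "Singapore tops mobile app economy, but for how long?",
    "Asia reforms key for global economic growth: IMF chief",
    "Philippine economy to remain 'best performer' in Asia, says HSBC",
    "India poised for one of highest growth in emerging Asia: OECD",
]

# Titles present in both lists, in MiniLM's ranking order.
shared = [title for title in minilm_titles if title in cohere_titles]
overlap_ratio = len(shared) / len(minilm_titles)

print(len(shared), overlap_ratio)  # 5 0.5
```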

In [50]:
query = "How is the US adressing the China threat?"
search_qdrant(query, "cohere")

['US', 'China']
{'Extracted Named Entities': ['US', 'China'],
 'Result': ["Obama needs to make Russia, China stop playing 'chicken' with US "
            '...',
            'China just made a move to tackle the biggest threat to its '
            'economy ...',
            "China is crushing the US in 'economic warfare'",
            'China should pay attention to US pessimism of its economy',
            'Why is President Obama threatening to side with China against the '
            '...',
            "Strangle China's Economy: America's Ultimate Trump Card?",
            'China knows what to do to fix economy; the question is how?',
            "'Our rules, not China's': Obama invokes Beijing threat in defense "
            'of ...',
            "Obama's pivot east fuels an Asian Cold War",
            "Obama's Cautious and Calibrated Approach to an Assertive China"]}


In [51]:
query = "How is the US adressing the China threat?"
search_qdrant(query, "minilm")

['US', 'China']
{'Extracted Named Entities': ['US', 'China'],
 'Result': ['Why is President Obama threatening to side with China against the '
            '...',
            'China testing Obama as it expands its influence in Southeast Asia',
            'China should pay attention to US pessimism of its economy',
            "Obama's Cautious and Calibrated Approach to an Assertive China",
            "China is crushing the US in 'economic warfare'",
            'Disruptions in Chinese economy may have global consequences ...',
            "Xi warns Obama against threatening China's sovereignty &amp; "
            'national ...',
            "Who's worried about China now?",
            "POLL-China's yuan to weaken further as US dollar rallies, economy "
            '...',
            'Obama finds common cause with China on North Korea']}


In [52]:
query = "What is the value of Indian economy?"
search_qdrant(query, "cohere")

['Indian']
{'Extracted Named Entities': ['Indian'],
 'Result': ['Indian economy: Interesting mid-year signals',
            'Indian economy: Here’s why it is time to speed up\xa0reforms',
            'Economy improving? 15 questions and a plea',
            'The reality of the Indian economy',
            'Three cheers for Indian economy: Low Inflation, positive IIP and '
            '...',
            'Indian economy on the whole is doing very well: Professor Kaushik '
            '...',
            'India is no saviour for a sagging global economy',
            '#dnaEdit: Mid-Year Economic Review gives unclear view of Indian '
            '...',
            'India is a light in a gloomy world economy',
            'Weaker global growth to hit Indian economy: FM Arun\xa0Jaitley']}


In [53]:
query = "What is the value of Indian economy?"
search_qdrant(query, "minilm")

['Indian']
{'Extracted Named Entities': ['Indian'],
 'Result': ['The reality of the Indian economy',
            'Indian economy resilient',
            'Indian economy to more than double to $5 tn in few years: FM Arun '
            'Jaitley',
            'Indian economy on the whole is doing very well: Professor Kaushik '
            '...',
            'A turnaround economy',
            'Indian economy: Here’s why it is time to speed up\xa0reforms',
            'Indian economy to more than double to $5 tn in few years: Arun '
            'Jaitley',
            'Indian Economy to Grow 7.7% in This Fiscal: Survey',
            'Indian economy doing better than others, says FinMin',
            "India's economy can hit $20 trillion-mark in over two decades, "
            'says ...']}


We get the results we wanted by making the most of Qdrant's filtering capability while searching. Feel free to try more queries of your own.

In [31]:
client.delete_collection(collection_name=news_collection)

True