# Hybrid Search with BM25 Sparse Vectors

## Overview

BM25 is a popular technique for retrieving text. It uses term frequencies to determine how important each term is to a query. It is simple but effective, and it only requires knowing the number of documents in a corpus and the frequency of terms across documents. In this guide we show how to use BM25 with Pinecone's sparse-dense index for hybrid search.
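
To make the scoring idea concrete, here is a minimal sketch of the classic Okapi BM25 formula in plain Python. It is an illustration only and not necessarily what the `pinecone_text` helper used later in this guide computes: the function name `bm25_score`, the defaults `k1=1.2` and `b=0.75`, and the `+ 1` smoothing inside the logarithm are assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query (illustrative BM25)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        if term not in tf:
            continue
        # Terms that appear in fewer documents get a larger weight.
        df = doc_freq.get(term, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Term frequency saturates via k1; b normalizes by document length.
        norm = tf[term] + k1 * (1.0 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score

# Toy corpus (hypothetical data, already tokenized).
docs = [["nyc", "bar"], ["gold", "bar", "cost"], ["nyc", "subway"]]
doc_freq = Counter(t for d in docs for t in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
scores = [bm25_score(["nyc", "bar"], d, doc_freq, len(docs), avgdl) for d in docs]
```

In this toy corpus the first document matches both query terms, so it receives the highest score.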

Skip the embedding creation step by using the [companion guide]().

## Install

In [2]:
!pip install -qU \
          'pinecone-client[grpc]' \
          transformers \
          torch \
          sentence_transformers \
          spacy \
          scikit-learn

In [3]:
from tqdm import tqdm
import os
import pinecone
import requests

In [6]:
# Download the BM25 helper module used in this guide and save it locally
with open('pinecone_text.py', 'w') as fb:
    fb.write(requests.get('https://gist.githubusercontent.com/gdj0nes/8bf4f85df522a8b1454754eec284691d/raw/183b60fced37bcb64b206182cb80bca9fe5fe8e2/pinecone_text.py').text)

## Quora Dataset

Load the popular Quora dataset and sample 10,000 questions from it.

In [7]:
import pandas as pd

sample = 10_000
df = pd.read_parquet('https://storage.googleapis.com/gareth-pinecone-datasets/quora-bm25.parquet', columns=['id', 'text'])
df = df.sample(n=sample)

In [10]:
df.head()

| | id | text |
|---|---|---|
| 163020 | 168111 | What are some good movies based on real life ... |
| 409526 | 421603 | Does the multiverse really exists? |
| 107223 | 110619 | What tie and pant should I wear with maroon s... |
| 25767 | 26565 | What is the role of electrical engineer in me... |
| 429605 | 442225 | Is Jio available for 3G? |


### Fit BM25 with spaCy Tokenizer

We'll fit a BM25 model, using spaCy to tokenize the text.


In [11]:
%%capture
!python -m spacy download en_core_web_sm

In [12]:
import spacy
import pinecone_text

nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])

def tokenizer(text):
    return [token.text for token in nlp(text)]

bm25 = pinecone_text.BM25(tokenizer)

Fitting the model computes how often each token appears across the documents, along with the number of documents and their average length.

In [13]:
%%time
bm25.fit(df['text'])

CPU times: user 3.88 s, sys: 26.5 ms, total: 3.91 s
Wall time: 3.91 s


BM25(avgdl=13.2248,
     doc_freq=[11.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
               0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
               21.0, 1.0, 1.0, 2.0, ...],
     ndocs=10000)
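
The three fitted statistics shown above (`avgdl`, `doc_freq`, `ndocs`) can be sketched in a few lines. This is a hedged illustration, not the helper's actual code: `corpus_stats` is a hypothetical name, and it keys document frequencies by token in a `Counter`, whereas the helper stores them in an array whose layout is not shown here.

```python
from collections import Counter

def corpus_stats(tokenized_docs):
    """Return (ndocs, avgdl, doc_freq) for a list of token lists."""
    ndocs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / ndocs
    doc_freq = Counter()
    for doc in tokenized_docs:
        # Count each token at most once per document.
        doc_freq.update(set(doc))
    return ndocs, avgdl, doc_freq

# Two tiny, hypothetical documents.
ndocs, avgdl, doc_freq = corpus_stats([["is", "jio", "3g"], ["is", "is", "nyc"]])
```

Note that `"is"` has a document frequency of 2 even though it occurs three times in total, because document frequency counts documents rather than occurrences.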

### Dense Model

We use the popular all-MiniLM-L6-v2 model, available on Hugging Face, for dense vectors. It produces 384-dimensional embeddings.

In [14]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

### Compute Dense & Sparse Embeddings

In [15]:
%%time
df['sparse_values'] = df['text'].apply(bm25.transform_doc)

CPU times: user 7.25 s, sys: 353 ms, total: 7.61 s
Wall time: 7.09 s


In [16]:
%%time
df['values'] = df['text'].apply(lambda x: model.encode(x).tolist())

CPU times: user 5min 32s, sys: 6.79 s, total: 5min 39s
Wall time: 2min 52s
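
Encoding one row at a time, as above, calls the model once per text. `SentenceTransformer.encode` also accepts a list of strings together with a `batch_size` argument, which may reduce the wall time; no timing was measured for this variant. The sketch below wraps that call in a hypothetical helper, `encode_texts`, and uses a stub encoder (also hypothetical) so that it runs without downloading a model.

```python
def encode_texts(encoder, texts, batch_size=64):
    """Encode all texts in one call and return plain Python lists of floats."""
    # The encoder is expected to accept a list of strings and a batch_size,
    # as SentenceTransformer.encode does.
    vectors = encoder.encode(list(texts), batch_size=batch_size)
    return [[float(x) for x in vec] for vec in vectors]

class StubEncoder:
    """Hypothetical stand-in for a SentenceTransformer model."""
    def encode(self, sentences, batch_size=32):
        return [[float(len(s)), 1.0] for s in sentences]

vectors = encode_texts(StubEncoder(), ["nyc bar", "jio"])

# With the real model from this guide, the call would look like:
# df['values'] = encode_texts(model, df['text'])
```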


## Upsert to Pinecone

In [17]:
import pinecone

# Init Pinecone
api_key = os.getenv('PINECONE_API_KEY')
environment = None  # set this to your Pinecone environment before running
if (api_key is None) or (environment is None):
    raise ValueError('You must specify an environment and API Key')

pinecone.init(
    api_key=api_key,
    environment=environment
)

In [18]:
index_name = "bm25-embeddings"
batch_size = 300
dimension = 384

In [22]:
pinecone.create_index(
    index_name,
    pod_type='s1',
    metric='dotproduct',
    dimension=dimension,
    metadata_config={"indexed": []}
)

### Upsert

In [23]:
from pinecone import GRPCVector, GRPCSparseValues
from google.protobuf.struct_pb2 import Struct

with pinecone.GRPCIndex(index_name) as index:
    for i in tqdm(range(0, len(df), batch_size)):
        batch = df[i:min(i+batch_size, len(df))].to_dict(orient='records')
        upserts = []
        for row in batch:
            metadata = Struct()
            metadata.update(dict(text=row['text']))
            u = GRPCVector(
                id=str(row['id']),
                values=row['values'],
                metadata=metadata,
                sparse_values=GRPCSparseValues(
                    indices=row['sparse_values']['indices'],
                    values=row['sparse_values']['values']
                )
            )
            upserts.append(u)
        index.upsert(vectors=upserts, async_req=False)

100%|██████████| 34/34 [00:04<00:00,  7.69it/s]
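
The progress bar reports 34 batches because 10,000 rows split into batches of 300 leave a final, shorter batch of 100. The slicing logic from the upsert loop can be factored into a small generator, sketched here with a hypothetical name, `batches`, and plain Python sequences instead of a DataFrame.

```python
def batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        # Slicing past the end is safe, so no explicit min() is needed.
        yield records[start:start + batch_size]

chunks = list(batches(list(range(10_000)), 300))
n_batches = len(chunks)
last_batch_size = len(chunks[-1])
```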


## Hybrid Queries with BM25

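We can retrieve records by scoring them against both the sparse and the dense query vector. The helper below scales the two query vectors by `1 - alpha` and `alpha` respectively, so a single parameter controls the balance between them.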

In [24]:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: list of floats representing the dense query vector
        sparse: a dict of `indices` and `values`
        alpha: weight between 0 and 1, where 1 is dense only and 0 is sparse only
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
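
Scaling only the query works because the dot product is linear. Assuming the index score is the dense dot product plus the sparse dot product, scaling the query vectors by `alpha` and `1 - alpha` gives the same result as weighting the two scores directly. The sketch below checks this on small, made-up vectors; `dot_dense` and `dot_sparse` are hypothetical helpers written for this example.

```python
def dot_dense(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def dot_sparse(a, b):
    """Dot product of two sparse vectors given as dicts of indices and values."""
    b_values = dict(zip(b['indices'], b['values']))
    return sum(v * b_values.get(i, 0.0) for i, v in zip(a['indices'], a['values']))

alpha = 0.25
q_dense, q_sparse = [1.0, 2.0], {'indices': [3, 7], 'values': [4.0, 8.0]}
d_dense, d_sparse = [0.5, 0.5], {'indices': [7, 9], 'values': [2.0, 1.0]}

# Scale the query the same way hybrid_score_norm does.
h_dense = [v * alpha for v in q_dense]
h_sparse = {
    'indices': q_sparse['indices'],
    'values': [v * (1 - alpha) for v in q_sparse['values']],
}

# Score with the scaled query versus weighting the two scores directly.
combined = dot_dense(h_dense, d_dense) + dot_sparse(h_sparse, d_sparse)
expected = alpha * dot_dense(q_dense, d_dense) + (1 - alpha) * dot_sparse(q_sparse, d_sparse)
```

Both expressions evaluate to the same number, which is why the document vectors never need to be rescaled when `alpha` changes.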

In [25]:
from pinecone import SparseValues

index = pinecone.Index(index_name)

In [30]:
text = "nyc bar"
sparse = bm25.transform_query(text)
dense = model.encode(text).tolist()

### Only Sparse

In [31]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.0)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

[{'id': '447471',
  'metadata': {'text': ' How much does a gold bar mill cost?'},
  'score': 0.216695413,
  'values': []},
 {'id': '327836',
  'metadata': {'text': ' What is the meaning of water resistant 10 bar?'},
  'score': 0.208496243,
  'values': []},
 {'id': '124618',
  'metadata': {'text': ' What are some bars in NYC that do not card?'},
  'score': 0.200894922,
  'values': []}]

### Hybrid

In [32]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.25)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

[{'id': '124618',
  'metadata': {'text': ' What are some bars in NYC that do not card?'},
  'score': 0.332611889,
  'values': []},
 {'id': '447471',
  'metadata': {'text': ' How much does a gold bar mill cost?'},
  'score': 0.23921828,
  'values': []},
 {'id': '476354',
  'metadata': {'text': " Why don't people post flyers in the NYC subway?"},
  'score': 0.237853348,
  'values': []}]

### Only Dense

In [33]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 1.0)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

[{'id': '124618',
  'metadata': {'text': ' What are some bars in NYC that do not card?'},
  'score': 0.727762818,
  'values': []},
 {'id': '498343',
  'metadata': {'text': ' What are bars in rap music?'},
  'score': 0.550006151,
  'values': []},
 {'id': '254425',
  'metadata': {'text': ' What is the best colocation center in New York?'},
  'score': 0.504735351,
  'values': []}]