# Hybrid Search with BM25 Sparse Vectors

## Overview

BM25 is a widely used ranking function for text retrieval. It scores a document against a query using how often each query term appears in the document, weighted by how rare that term is across the corpus. It is simple but effective, and only requires knowing the number of documents in the corpus and the frequency of terms across documents. This guide shows how to use BM25 with Pinecone's sparse-dense index for hybrid search.
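
To make the scoring concrete, here is a minimal sketch of the classic Okapi BM25 formula in plain Python. It is not the implementation in `pinecone_text`; the function name `bm25_score`, the toy corpus, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        # Document frequency: number of documents that contain the term.
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Rare terms get a larger inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates and is normalized by document length.
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Toy corpus (made up for this example).
corpus = [
    ["how", "to", "prevent", "flea", "bites"],
    ["places", "to", "eat", "in", "nyc"],
    ["what", "bites", "me", "at", "night"],
]
scores = [bm25_score(["nyc", "bites"], doc, corpus) for doc in corpus]
```

In this toy corpus "nyc" appears in one document and "bites" in two, so the document that matches "nyc" receives the highest score.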

Learn how to create embeddings in the [companion guide]().

## Install

In [1]:
!pip install -qU \
          'pinecone-client[grpc]' \
          torch \
          sentence_transformers \
          spacy \
          scikit-learn

In [2]:
import pandas as pd
from tqdm import tqdm
import os
import requests
import json

In [3]:
# Download the helper module that provides the BM25 class used later in this guide
with open('pinecone_text.py', 'w') as fb:
    fb.write(requests.get('https://gist.githubusercontent.com/gdj0nes/8bf4f85df522a8b1454754eec284691d/raw/183b60fced37bcb64b206182cb80bca9fe5fe8e2/pinecone_text.py').text)

## Quora Dataset

Load the popular Quora dataset, which comes with precomputed embeddings:

* Dense: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
* Sparse: BM25


In [6]:
df = pd.read_parquet('https://storage.googleapis.com/gareth-pinecone-datasets/quora_all-MiniLM-L6-bm25.parquet')

In [7]:
df.head()

| | id | text | values | sparse_values |
|---|---|---|---|---|
| 0 | 1 | What is the step by step guide to invest in s... | [0.4024894, -0.23425448, -0.36006898, 0.044094... | {'indices': [7096, 8508, 13677, 23041, 24734, ... |
| 1 | 2 | What is the step by step guide to invest in s... | [0.5111937, -0.1987632, -0.32637578, 0.1264907... | {'indices': [7096, 8508, 13677, 24734, 26026, ... |
| 2 | 3 | What is the story of Kohinoor (Koh-i-Noor) Di... | [-0.2237151, 0.74151665, -0.18739395, 0.233195... | {'indices': [6065, 13677, 17109, 20780, 24734,... |
| 3 | 4 | What would happen if the Indian government st... | [-0.37123987, 0.7097032, -0.06182622, -0.16823... | {'indices': [2408, 6065, 7582, 12225, 17109, 2... |
| 4 | 5 | How can I increase the speed of my internet c... | [-0.16656642, 0.21881323, -0.0023541958, 0.104... | {'indices': [5388, 12812, 18181, 19960, 20780,... |
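
Each row carries a dense vector in `values` and a sparse vector in `sparse_values`. Before upserting, it can help to sanity-check that shape. The sketch below is one way to do so; `check_record` is a hypothetical helper, the toy row is made up, and the expected dimension of 384 is the one used for the index later in this guide.

```python
def check_record(row, dimension=384):
    """Return a list of problems found in one record; an empty list means it looks fine."""
    problems = []
    if len(row["values"]) != dimension:
        problems.append(
            f"dense vector has {len(row['values'])} values, expected {dimension}"
        )
    sparse = row["sparse_values"]
    if len(sparse["indices"]) != len(sparse["values"]):
        problems.append("sparse indices and values differ in length")
    if len(set(sparse["indices"])) != len(sparse["indices"]):
        problems.append("sparse indices contain duplicates")
    return problems

# Toy record in the same shape as a row of `df` (all numbers are made up).
row = {
    "id": 1,
    "text": "example question",
    "values": [0.0] * 384,
    "sparse_values": {"indices": [7096, 8508], "values": [0.3, 0.7]},
}
problems = check_record(row)
```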


## Load Models

Load BM25 with pre-computed document frequencies

In [None]:
%%capture
!python -m spacy download en_core_web_sm

In [8]:
import pinecone_text
import spacy

nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])

def tokenizer(text):
    return [token.text for token in nlp(text)]

bm25 = pinecone_text.BM25(tokenizer)
bm25.set_params(**json.loads(requests.get('https://storage.googleapis.com/gareth-pinecone-datasets/quora_all-bm25-params.json').text))

In [9]:
bm25.transform_query("hello world")

{'indices': [49925, 64071], 'values': [0.6327617652573566, 0.6327617652573566]}
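
The returned dictionary pairs each token index with a weight. Under a dot-product metric, a sparse query only contributes to a record's score where the two vectors share an index. A minimal sketch of that computation follows; `sparse_dot` is a hypothetical helper and the document vector is made up (only the query comes from the output above).

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as {'indices': [...], 'values': [...]}."""
    b_weights = dict(zip(b["indices"], b["values"]))
    return sum(v * b_weights.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))

query = {"indices": [49925, 64071], "values": [0.6327617652573566, 0.6327617652573566]}
# Made-up document vector that shares only index 64071 with the query.
doc = {"indices": [7096, 64071], "values": [0.5, 0.4]}

score = sparse_dot(query, doc)  # only the shared index 64071 contributes
```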

Load [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

In [10]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

## Index Creation

In [11]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

In [12]:
index_name = "bm25-query"
batch_size = 300
dimension = 384

In [14]:
pinecone.create_index(
    index_name,
    pod_type='s1',
    metric='dotproduct',
    dimension=dimension,
    metadata_config={"indexed": []}
)

## Upsert

Upsert the dense and sparse vectors to the index in batches, storing each question's text as metadata.

In [15]:
from pinecone import GRPCVector, GRPCSparseValues
from google.protobuf.struct_pb2 import Struct

with pinecone.GRPCIndex(index_name) as index:
    for i in tqdm(range(0, len(df), batch_size)):
        batch = df[i:min(i+batch_size, len(df))].to_dict(orient='records')
        upserts = []
        for row in batch:
            metadata = Struct()
            metadata.update(dict(text=row['text']))
            u = GRPCVector(
                id=str(row['id']),
                values=row['values'].tolist(),
                metadata=metadata,
                sparse_values=GRPCSparseValues(
                    indices=row['sparse_values']['indices'].tolist(),
                    values=row['sparse_values']['values'].tolist()
                )
            )
            upserts.append(u)
        index.upsert(vectors=upserts, async_req=False)

100%|██████████| 1744/1744 [03:11<00:00,  9.13it/s]
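
The loop above walks the DataFrame in slices of `batch_size` rows, and the progress bar counts one iteration per batch. The slicing logic on its own can be sketched as follows; `batch_bounds` is a hypothetical helper and the row count of 1000 is made up for the example.

```python
import math

def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) slice bounds that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_rows)

bounds = list(batch_bounds(n_rows=1000, batch_size=300))
n_batches = math.ceil(1000 / 300)
```

Here the final batch covers rows 900 to 999, so no rows are dropped when the row count is not a multiple of the batch size.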


## Hybrid Queries with BM25

We can retrieve records by scoring them against both the sparse and the dense query vectors.

In [16]:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: list of floats representing the dense query vector
        sparse: a dict of `indices` and `values` representing the sparse query vector
        alpha: weight between 0 and 1; 1 uses only dense, 0 uses only sparse
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
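
Because the dot product is linear, scaling the query vectors is equivalent to weighting the two similarity scores. The sketch below checks this on made-up toy vectors, assuming a record is scored as the dense dot product plus the sparse dot product. `dot` and `sparse_dot` are hypothetical helpers, and a compact copy of `hybrid_score_norm` is repeated so the block runs on its own.

```python
def hybrid_score_norm(dense, sparse, alpha):
    # Same scaling as the function above, repeated so this block is self-contained.
    hs = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return [v * alpha for v in dense], hs

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a, b):
    weights = dict(zip(b["indices"], b["values"]))
    return sum(v * weights.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))

# Toy query and document (made-up numbers).
q_dense, q_sparse = [0.5, -0.5, 1.0], {"indices": [3, 9], "values": [0.5, 0.25]}
d_dense, d_sparse = [1.0, 0.5, 0.5], {"indices": [9, 12], "values": [2.0, 1.0]}

alpha = 0.25
h_dense, h_sparse = hybrid_score_norm(q_dense, q_sparse, alpha)

# Score with the scaled query vectors...
combined = dot(h_dense, d_dense) + sparse_dot(h_sparse, d_sparse)
# ...equals the convex combination of the unscaled scores.
expected = alpha * dot(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)
```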

In [17]:
from pinecone import SparseValues

index = pinecone.Index(index_name)

In [18]:
text = "nyc bites"
sparse = bm25.transform_query(text)
dense = model.encode(text).tolist()

#### Only Sparse

In [19]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.0)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

{'matches': [{'id': '287876',
              'metadata': {'text': ' What bites me at night?'},
              'score': 0.264369607,
              'values': []},
             {'id': '295419',
              'metadata': {'text': ' How dangerous are turtle bites?'},
              'score': 0.264369607,
              'values': []},
             {'id': '140484',
              'metadata': {'text': ' How can I prevent flea bites?'},
              'score': 0.253131,
              'values': []}],
 'namespace': ''}

#### Hybrid

In [20]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.25)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

{'matches': [{'id': '45040',
              'metadata': {'text': ' Which is the worst NYC borough?'},
              'score': 1.0034976,
              'values': []},
             {'id': '453598',
              'metadata': {'text': ' What are places to eat in NYC?'},
              'score': 1.00097561,
              'values': []},
             {'id': '80563',
              'metadata': {'text': ' What are some cool things to do in NYC?'},
              'score': 0.889402211,
              'values': []}],
 'namespace': ''}

In [None]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.6)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

#### Only Dense

In [21]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 1.0)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

{'matches': [{'id': '282819',
              'metadata': {'text': ' Do you like New York?'},
              'score': 4.39973736,
              'values': []},
             {'id': '453598',
              'metadata': {'text': ' What are places to eat in NYC?'},
              'score': 4.16753483,
              'values': []},
             {'id': '45040',
              'metadata': {'text': ' Which is the worst NYC borough?'},
              'score': 4.14454842,
              'values': []}],
 'namespace': ''}
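
To compare results across values of `alpha`, it can be easier to reduce each response to a few fields. The sketch below assumes the response can be read like the dictionary printed above; `summarize` is a hypothetical helper, and the sample response is copied from the dense-only output and trimmed to two matches.

```python
def summarize(response):
    """Return (id, score, text) tuples from a query response."""
    return [
        (m["id"], round(m["score"], 3), m["metadata"]["text"].strip())
        for m in response["matches"]
    ]

# Copied from the dense-only output above, trimmed to two matches.
response = {
    "matches": [
        {"id": "282819", "metadata": {"text": " Do you like New York?"},
         "score": 4.39973736, "values": []},
        {"id": "453598", "metadata": {"text": " What are places to eat in NYC?"},
         "score": 4.16753483, "values": []},
    ],
    "namespace": "",
}
rows = summarize(response)
```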