# Hybrid Search with BM25 Sparse Vectors

## Overview

BM25 is a popular ranking function for text retrieval. It uses term frequencies to estimate how important each query term is to a document. It is simple but effective, and requires only corpus-level statistics: the number of documents in the corpus and the frequency of terms across those documents. This guide shows how to use BM25 with Pinecone's sparse-dense index for hybrid search.
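
To make the scoring concrete, here is a minimal sketch of the standard Okapi BM25 formula in plain Python. The toy corpus, the whitespace tokenizer, the smoothed IDF variant, and the `k1 = 1.2` / `b = 0.75` defaults are assumptions chosen for illustration; this is not the implementation in the helper file used later in the guide.

```python
import math
from collections import Counter

# Toy corpus and whitespace tokenizer (both illustrative assumptions).
corpus = [
    "where can I eat in nyc",
    "best pizza in nyc",
    "how do I learn python",
]
docs = [doc.lower().split() for doc in corpus]

n_docs = len(docs)
avg_len = sum(len(d) for d in docs) / n_docs
# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for d in docs for term in set(d))


def idf(term):
    """Smoothed inverse document frequency (a common non-negative variant)."""
    n_t = doc_freq.get(term, 0)
    return math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))


def bm25_score(query, doc, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    tf = Counter(doc)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * len(doc) / avg_len)
    return sum(
        idf(t) * tf[t] * (k1 + 1) / (tf[t] + norm)
        for t in query
        if t in tf
    )


query = "nyc pizza".split()
scores = [bm25_score(query, d) for d in docs]
best = max(range(n_docs), key=scores.__getitem__)
print(corpus[best])  # best pizza in nyc
```

The second document wins because it contains both query terms, and the rarer term ("pizza") contributes more than the common one ("nyc").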

Skip the embedding creation step by using the [companion guide]().

## Install

In [None]:
!pip install -qU \
          git+https://git@github.com/pinecone-io/pinecone-python-client.git#egg=pinecone-client[grpc] \
          torch \
          sentence_transformers \
          spacy \
          scikit-learn

In [None]:
import requests

Download a helper file that implements BM25.

In [None]:
resp = requests.get('https://storage.googleapis.com/gareth-pinecone-datasets/pinecone_text.py')
resp.raise_for_status()  # fail early rather than writing an error page to disk
with open('pinecone_text.py', 'w') as f:
    f.write(resp.text)

## Quora Dataset

Load the popular Quora dataset and take a random sample of 10,000 rows.

In [None]:
import pandas as pd

sample = 10_000
df = pd.read_parquet('https://storage.googleapis.com/gareth-pinecone-datasets/quora-bm25.parquet', columns=['id', 'text'])
df = df.sample(n=sample)

In [None]:
df.head()

### Fit BM25 with spaCy Tokenizer

We'll fit a BM25 model, using spaCy to tokenize the text.


In [None]:
%%capture
!python -m spacy download en_core_web_sm

In [None]:
import spacy
import pinecone_text

nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])

def tokenizer(text):
    return [token.text for token in nlp(text)]

bm25 = pinecone_text.BM25(tokenizer)

Next, fit the model to calculate how often tokens appear across the documents.

In [None]:
%%time
bm25.fit(df['text'])

### Dense Model

We use the popular all-MiniLM-L6-v2 model, available on Hugging Face, for dense vectors. It produces 384-dimensional embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)

### Compute Dense & Sparse Embeddings

In [None]:
%%time
df['sparse_values'] = df['text'].apply(bm25.transform_doc)
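
Each entry in `sparse_values` is a dictionary with parallel `indices` and `values` lists, which is the shape the upsert code reads later. As a rough sketch of how a token list could be mapped to that shape, the snippet below hashes each token to an integer index and uses raw term counts as values. The `to_sparse` helper, the CRC32 hashing, and the count weighting are assumptions for illustration; the helper file's actual index assignment and BM25 weighting may differ.

```python
import zlib
from collections import Counter


def to_sparse(tokens):
    """Map tokens to a {'indices': [...], 'values': [...]} sparse vector.

    Hypothetical helper: indices come from a CRC32 hash of each token and
    values are raw term counts. Tokens that hash to the same index are
    merged by summing their counts.
    """
    counts = Counter(zlib.crc32(tok.encode("utf-8")) for tok in tokens)
    indices = sorted(counts)
    return {
        "indices": indices,
        "values": [float(counts[i]) for i in indices],
    }


sparse_doc = to_sparse(["nyc", "pizza", "nyc"])
print(sparse_doc)
```

Only non-zero entries are stored, so a document with a handful of distinct tokens stays small no matter how large the vocabulary is.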

In [None]:
%%time
df['values'] = df['text'].apply(lambda x: model.encode(x).tolist())

## Upsert to Pinecone

In [None]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

In [None]:
index_name = "bm25-embeddings"
batch_size = 300
dimension = 384

In [None]:
pinecone.create_index(
    index_name,
    pod_type='s1',
    metric='dotproduct',
    dimension=dimension,
    metadata_config={"indexed": []}
)

### Upsert

In [None]:
from pinecone import GRPCVector, GRPCSparseValues
from google.protobuf.struct_pb2 import Struct
from tqdm import tqdm

with pinecone.GRPCIndex(index_name) as index:
    for i in tqdm(range(0, len(df), batch_size)):
        batch = df.iloc[i:i + batch_size].to_dict(orient='records')
        upserts = []
        for row in batch:
            metadata = Struct()
            metadata.update(dict(text=row['text']))
            u = GRPCVector(
                id=str(row['id']),
                values=row['values'],
                metadata=metadata,
                sparse_values=GRPCSparseValues(
                    indices=row['sparse_values']['indices'],
                    values=row['sparse_values']['values']
                )
            )
            upserts.append(u)
        index.upsert(vectors=upserts, async_req=False)
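
The loop above sends the records in fixed-size batches, one upsert call per batch. The same slicing pattern, isolated into a small generator that works on any sequence, might look like the sketch below. The `iter_batches` name is hypothetical and is not part of the Pinecone client.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a sequence is safe in Python,
        # so the final batch is simply shorter.
        yield items[start:start + batch_size]


records = list(range(10))
batches = list(iter_batches(records, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```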

## Hybrid Queries with BM25

We can now query the index with both a sparse and a dense vector, so that records are ranked by a score that combines the two.

In [None]:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: list of floats representing the dense query vector
        sparse: a dict of `indices` and `values` representing the sparse query vector
        alpha: weight between 0 and 1 (0 = sparse only, 1 = dense only)
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
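
Scaling the query vectors is enough to weight the two scores because the index uses the `dotproduct` metric: the dot product is linear, so scaling the dense query by `alpha` and the sparse query by `1 - alpha` gives `alpha * dense_score + (1 - alpha) * sparse_score`. The check below demonstrates this with made-up vectors. The helper functions and numbers are illustrative assumptions, and the sketch presumes the hybrid score is the sum of the dense and sparse dot products.

```python
import math


def dense_dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product of two {'indices', 'values'} sparse vectors."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Made-up query (q_*) and document (d_*) vectors.
q_dense, d_dense = [0.2, 0.4, 0.1], [0.5, 0.1, 0.3]
q_sparse = {"indices": [3, 7], "values": [1.0, 2.0]}
d_sparse = {"indices": [7, 9], "values": [0.5, 4.0]}

alpha = 0.25
# Scale the query side only, as hybrid_score_norm does.
scaled_dense = [v * alpha for v in q_dense]
scaled_sparse = {
    "indices": q_sparse["indices"],
    "values": [v * (1 - alpha) for v in q_sparse["values"]],
}

hybrid = dense_dot(scaled_dense, d_dense) + sparse_dot(scaled_sparse, d_sparse)
expected = (
    alpha * dense_dot(q_dense, d_dense)
    + (1 - alpha) * sparse_dot(q_sparse, d_sparse)
)
print(math.isclose(hybrid, expected))  # True
```

Since only the query is rescaled, the stored vectors never need to be re-indexed when `alpha` changes.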

In [None]:
from pinecone import SparseValues

index = pinecone.Index(index_name)

In [None]:
text = "nyc bites"
sparse = bm25.transform_query(text)
dense = model.encode(text).tolist()

### Only Sparse

In [None]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.0)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

### Hybrid

In [None]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 0.25)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']

### Only Dense

In [None]:
hdense, hsparse = hybrid_score_norm(dense, sparse, 1.0)
index.query(top_k=3, vector=hdense, sparse_vector=SparseValues(**hsparse), include_metadata=True)['matches']