[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/bm25/bm25-quora.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/bm25/bm25-quora.ipynb)

# Hybrid Search with BM25 Sparse Vectors

[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-vector-generation.ipynb)

## Overview

BM25 is a popular ranking function for text retrieval. It scores a document against a query using how often each query term appears in the document, weighted by how rare that term is across the corpus. It is simple but effective, and it only requires knowing the number of documents in a corpus and the frequency of terms across documents. This guide shows how to use BM25 with Pinecone's sparse-dense index for hybrid search.

Learn how to create embeddings in the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-vector-generation.ipynb).
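To make the scoring idea concrete, here is a minimal sketch of the common Okapi BM25 formulation in plain Python. It is an illustration only, not the code used to build the precomputed vectors in this guide: the whitespace tokenizer, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every document in corpus_tokens against query_tokens (Okapi BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms receive a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # k1 saturates the term frequency; b normalizes by document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up for illustration), tokenized by whitespace.
corpus = [
    "how do i teach kids the alphabet".split(),
    "how do i teach kids electronics".split(),
    "what is the capital of france".split(),
]
scores = bm25_scores("teach alphabet".split(), corpus)
```

The first document matches both query terms and ranks highest, the second matches only `teach`, and the third shares no terms with the query, so its score is zero.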

## Prerequisites

We'll install the required libraries: the `pinecone-client` for interacting with Pinecone, the `pinecone-datasets` library that we will use for fast processing of the Quora dataset, and `pandas`.

In [2]:
!pip install --no-color -qU \
          "pinecone-client[grpc]"==2.2.1 \
          pinecone-datasets=='0.5.0rc11' \
          pandas


## Quora Dataset

We'll load the popular Quora dataset with precomputed embeddings. Both dense and sparse embeddings have been precomputed using the following models:

* Dense: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

* Sparse: BM25

In [5]:
from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25-100K")
dataset.documents.head()

| | id | values | sparse_values | metadata | blob |
|---|---|---|---|---|---|
| 0 | 1 | [0.06814987, -0.039664183, -0.06096721, 0.0074... | {'indices': [7096, 8508, 13677, 23041, 24734, ... | | {'text': ' What is the step by step guide to i... |
| 1 | 2 | [0.08983771, -0.03493085, -0.057357617, 0.0222... | {'indices': [7096, 8508, 13677, 24734, 26026, ... | | {'text': ' What is the step by step guide to i... |
| 2 | 3 | [-0.046798065, 0.1551149, -0.03920019, 0.04878... | {'indices': [6065, 13677, 17109, 20780, 24734,... | | {'text': ' What is the story of Kohinoor (Koh-... |
| 3 | 4 | [-0.077349104, 0.14786911, -0.0128817065, -0.0... | {'indices': [2408, 6065, 7582, 12225, 17109, 2... | | {'text': ' What would happen if the Indian gov... |
| 4 | 5 | [-0.028324936, 0.037209604, -0.00040033547, 0.... | {'indices': [5388, 12812, 18181, 19960, 20780,... | | {'text': ' How can I increase the speed of my ... |


As you can see, this data is already loaded with the sparse and dense representations of each document. To learn how these values were generated, see [this walkthrough](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-vector-generation.ipynb).
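Each sparse representation pairs a list of `indices` with a list of `values` of the same length. As a rough sketch of how two such vectors can be compared, the hypothetical helper `sparse_dot` below computes their dot product over shared indices. The indices and weights are made up for illustration and are not taken from the dataset.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors shaped like {'indices': [...], 'values': [...]}."""
    # Map each index of b to its weight so lookups are O(1).
    b_lookup = dict(zip(b["indices"], b["values"]))
    # Only indices present in both vectors contribute to the sum.
    return sum(value * b_lookup.get(index, 0.0)
               for index, value in zip(a["indices"], a["values"]))

# Toy vectors (assumed values, for illustration only).
doc = {"indices": [7096, 8508, 13677], "values": [0.5, 0.25, 2.0]}
query = {"indices": [8508, 13677, 24734], "values": [1.0, 0.5, 4.0]}

overlap_score = sparse_dot(query, doc)  # 1.0 * 0.25 + 0.5 * 2.0
```

Index `24734` appears only in the query and index `7096` only in the document, so neither contributes to the score.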

## Index Creation

We first need to initialize our connection to Pinecone to create our vector index. For this, we need a [free API key](https://app.pinecone.io/). We initialize the connection like so:


In [26]:
import os
import pinecone

api_key = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
# find environment next to your API key in the Pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

# pinecone-client 2.x (pinned above) uses module-level initialization
pinecone.init(api_key=api_key, environment=env)
pinecone.whoami()

WhoAmIResponse(username='load', user_label='label', projectname='load-test')

We create the index like so:

In [None]:
index_name = "bm25-quora"
dimension = 384

In [28]:
# in pinecone-client 2.x, list_indexes() returns a list of index names
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        pod_type='s1',
        metric='dotproduct',
        dimension=dimension
    )

And we connect to the index like so:

In [29]:
index = pinecone.Index(index_name)

## Upsert


Now let's upsert vectors to the index. We are using async upload with batching. For more information on performance boosting, see the Pinecone documentation for [Performance Tuning](https://docs.pinecone.io/docs/performance-tuning).

In [30]:
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)

upserted_count: 522931
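As a rough illustration of what iterating in batches of 100 amounts to, here is a minimal sketch with a hypothetical `batched` helper. It is not the `pinecone-datasets` implementation, and `range(250)` simply stands in for the records to upload.

```python
from itertools import islice

def batched(records, batch_size):
    """Yield successive lists of at most batch_size items from records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 250 stand-in records split into batches of 100; the last batch is smaller.
batch_sizes = [len(batch) for batch in batched(range(250), 100)]
```

Smaller batches keep each request light, at the cost of more round trips.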

In [31]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 522931}},
 'total_vector_count': 522931}

## Query

The dataset comes with a set of prewritten queries that can be used. We view them like so:

In [32]:
dataset.queries.head()

| | vector | sparse_vector | filter | top_k | blob |
|---|---|---|---|---|---|
| 0 | [-0.07095234841108322, 0.0012621647911146283, ... | {'indices': [18989, 23463, 27058, 31925, 38916... | | 5 | {'id': '318', 'text': 'How does Quora look to ... |
| 1 | [0.05170859768986702, -0.024982793256640434, -... | {'indices': [31604, 31925, 36513, 36821, 38049... | | 5 | {'id': '378', 'text': 'How do I refuse to chos... |
| 2 | [0.005764591973274946, 0.004137433134019375, -... | {'indices': [947, 2793, 15453, 15498, 35356, 4... | | 5 | {'id': '379', 'text': 'Did Ben Affleck shine m... |
| 3 | [0.00809027161449194, -0.009231459349393845, -... | {'indices': [8642, 19100, 20780, 24734, 26798,... | | 5 | {'id': '399', 'text': 'What are the effects of... |
| 4 | [0.024374842643737793, 0.07713444530963898, 0.... | {'indices': [1657, 13677, 33956, 43002, 57110]... | | 5 | {'id': '420', 'text': 'Why creativity is impor... |


Here we define a function that merges the query results with the actual texts of the documents and shows them as a dataframe.

In [33]:
import pandas as pd

def merge_with_documents(query_response, documents_df):
    results_df = pd.DataFrame([res.to_dict() for res in query_response["matches"]])
    results_df = results_df.merge(documents_df, on="id", how="inner")
    results_df["text"] = results_df["blob"].apply(lambda b: b["text"])
    return results_df[["text", "score"]].sort_values("score", ascending=False)

We can load a sample query like so:

In [34]:
sample_query = dataset.queries.iloc[14226].to_dict()
sample_query["blob"]["text"]

'How can I teach my kids the alphabet?'

Now we find the similarity scores for the top `5` returned items from the index:

In [35]:
query_response = index.query(**sample_query)
merge_with_documents(query_response, dataset.documents)

| | text | score |
|---|---|---|
| 0 | What is the best way to teach children the al... | 1.171509 |
| 1 | What is the best way to teach kids alphabets? | 1.13852 |
| 2 | What should I teach my children after alphabets? | 1.074607 |
| 3 | What are the best ways to teach kids how to r... | 0.922343 |
| 4 | How do I effectively teach the kids to read? | 0.920467 |


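Because the index stores both dense and sparse vectors and uses the `dotproduct` metric, the `score` above is the sum of the dense and sparse dot products. To control the balance between the two, we scale the query's dense values by `alpha` and its sparse values by `1 - alpha`, which gives:

`alpha * dense_score + (1 - alpha) * sparse_score`

The `alpha` parameter specifies the weighting of the two scores. In the following code, we explore the impact of various alpha values using a sample query.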

In [36]:
from copy import deepcopy
import numpy as np

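def hybrid_weight_query(query, alpha):
    """Scale the dense values by alpha and the sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # copy so the original query can be reused with other alpha values
    query_transformed = deepcopy(query)
    query_transformed["vector"] = (np.array(query_transformed["vector"]) * alpha).tolist()
    sparse_values = np.array(query_transformed["sparse_vector"]["values"])
    query_transformed["sparse_vector"]["values"] = (sparse_values * (1.0 - alpha)).tolist()
    return query_transformed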
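Scaling the query works because the dot product is linear: multiplying a query vector by a constant multiplies its dot product with every document by the same constant. The sketch below checks this on toy numbers in plain Python. The helper names and vectors are hypothetical, and for simplicity the sparse vectors are written as ordinary lists rather than `indices`/`values` pairs.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

def hybrid_score(query_dense, query_sparse, doc_dense, doc_sparse):
    """Sum of the dense and sparse dot products."""
    return dot(query_dense, doc_dense) + dot(query_sparse, doc_sparse)

alpha = 0.25
# Toy vectors (assumed values, for illustration only).
q_dense, q_sparse = [0.5, -1.0, 2.0], [0.0, 4.0, 1.0]
d_dense, d_sparse = [1.0, 0.5, 0.25], [2.0, 0.5, 0.0]

# Option 1: scale the query vectors, then score.
scaled = hybrid_score(
    [alpha * x for x in q_dense],
    [(1 - alpha) * x for x in q_sparse],
    d_dense,
    d_sparse,
)
# Option 2: score each part separately, then weight the two scores.
weighted = alpha * dot(q_dense, d_dense) + (1 - alpha) * dot(q_sparse, d_sparse)

matches = math.isclose(scaled, weighted)
```

Both routes give the same number, so the weighting can be applied to the query alone and the stored document vectors never need to change.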

### Only Sparse (alpha = 0.0)

In [37]:
query_response = index.query(**hybrid_weight_query(sample_query, 0.0))
merge_with_documents(query_response, dataset.documents)

| | text | score |
|---|---|---|
| 0 | How do I teach kids electronics? | 0.252553 |
| 1 | How can I teach kids "not to give up"? | 0.242496 |
| 2 | What can little kids teach adults? | 0.238482 |
| 3 | How do I effectively teach the kids to read? | 0.237153 |
| 4 | How do I teach my twelve-year-old not to bite... | 0.225475 |


### Hybrid (0 < alpha < 1)

In [38]:
# alpha=0.25
query_response = index.query(**hybrid_weight_query(sample_query, 0.25))
merge_with_documents(query_response, dataset.documents)

| | text | score |
|---|---|---|
| 0 | What is the best way to teach children the al... | 0.404796 |
| 1 | What is the best way to teach kids alphabets? | 0.383141 |
| 2 | How do I effectively teach the kids to read? | 0.348693 |
| 3 | What should I teach my children after alphabets? | 0.346379 |
| 4 | How do I teach kids electronics? | 0.339263 |


In [39]:
# alpha=0.6
query_response = index.query(**hybrid_weight_query(sample_query, 0.6))
merge_with_documents(query_response, dataset.documents)

| | text | score |
|---|---|---|
| 0 | What is the best way to teach children the al... | 0.658138 |
| 1 | What is the best way to teach kids alphabets? | 0.643708 |
| 2 | What should I teach my children after alphabets? | 0.613673 |
| 3 | How can an adult re-learn the alphabet? | 0.512711 |
| 4 | What are the best ways to teach kids how to r... | 0.511781 |


### Only Dense (alpha = 1.0)

In [40]:
query_response = index.query(**hybrid_weight_query(sample_query, 1.0))
merge_with_documents(query_response, dataset.documents)

| | text | score |
|---|---|---|
| 0 | What is the best way to teach children the al... | 0.947671 |
| 1 | What is the best way to teach kids alphabets? | 0.941499 |
| 2 | What should I teach my children after alphabets? | 0.919153 |
| 3 | How can an adult re-learn the alphabet? | 0.736157 |
| 4 | What are the best ways to teach kids how to r... | 0.71422 |


Once we're done, delete the index to save resources:

In [None]:
pinecone.delete_index(index_name)

---