[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/bm25-quora.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/pinecone/sparse/bm25-quora.ipynb)

# Hybrid Search with BM25 Sparse Vectors

## Overview

BM25 is a popular technique for retrieving text. It uses term frequencies to determine the relative importance of the term to the query. It is a simple but effective and only requires knowing the number of documents in a corpus and the frequency of terms across documents. In the following guide we will show how to use BM25 with Pinecone's sparse-dense index for use in hybrid search.

Learn how to create embeddings in the [companion guide]().

## Install

In [120]:
!pip install -qU \
          git+https://git@github.com/pinecone-io/pinecone-python-client.git@upsert_dataframe#egg=pinecone-client[grpc] \
          pinecone-datasets \
          numpy

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


## Quora Dataset

Load the popular Quora dataset with embeddings precomputed using

* Dense: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
* Sparse: BM25


In [121]:
from pinecone_datasets import load_dataset

In [122]:
dataset = load_dataset("quora_all-MiniLM-L6-bm25")

In [123]:
dataset.documents.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1,"[0.06814987, -0.039664183, -0.06096721, 0.0074...","{'indices': [7096, 8508, 13677, 23041, 24734, ...",,{'text': ' What is the step by step guide to i...
1,2,"[0.08983771, -0.03493085, -0.057357617, 0.0222...","{'indices': [7096, 8508, 13677, 24734, 26026, ...",,{'text': ' What is the step by step guide to i...
2,3,"[-0.046798065, 0.1551149, -0.03920019, 0.04878...","{'indices': [6065, 13677, 17109, 20780, 24734,...",,{'text': ' What is the story of Kohinoor (Koh-...
3,4,"[-0.077349104, 0.14786911, -0.0128817065, -0.0...","{'indices': [2408, 6065, 7582, 12225, 17109, 2...",,{'text': ' What would happen if the Indian gov...
4,5,"[-0.028324936, 0.037209604, -0.00040033547, 0....","{'indices': [5388, 12812, 18181, 19960, 20780,...",,{'text': ' How can I increase the speed of my ...


As you can see, this data is already loaded with the sparse and dense representaions of each document. To learn about the generation process of this values checkout [TODO: ADD LINK]

## Index Creation

We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key]("https://app.pinecone.io/"), and then we initialize the connection like so:


In [124]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(api_key="YOUR_API_KEY",
              environment="YOUR_ENV")  # find next to API key in console)

In [125]:
index_name = "bm25-qoura"
dimension = 384

In [126]:
if index_name not in pinecone.list_indexes():
  pinecone.create_index(
      index_name,
      pod_type='s1',
      metric='dotproduct',
      dimension=dimension,
      metadata_config={"indexed": ["blah"]}
  )

In [127]:
index = pinecone.GRPCIndex(index_name)

## Upsert


For this demo we are going to store the texts as vector metadata.

In [128]:
dataset.documents["metadata"] = dataset.documents["blob"]

Now let's upsert vectors to the index, we are using async upload with batching. For more imformation on performance boosting visit this [link](https://docs.pinecone.io/docs/performance-tuning)

In [129]:
index.upsert_dataframe(dataset.documents)

  0%|          | 0/1045 [00:00<?, ?it/s]

In [130]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 522931}},
 'total_vector_count': 522931}

In [131]:
dataset.documents.dtypes

id               object
values           object
sparse_values    object
metadata         object
blob             object
dtype: object

## Query

First let's look at our dataset queries

In [132]:
dataset.queries.head()

Unnamed: 0,vector,sparse_vector,filter,top_k,blob
0,"[-0.07095234841108322, 0.0012621647911146283, ...","{'indices': [18989, 23463, 27058, 31925, 38916...",,5,"{'id': '318', 'text': 'How does Quora look to ..."
1,"[0.05170859768986702, -0.024982793256640434, -...","{'indices': [31604, 31925, 36513, 36821, 38049...",,5,"{'id': '378', 'text': 'How do I refuse to chos..."
2,"[0.005764591973274946, 0.004137433134019375, -...","{'indices': [947, 2793, 15453, 15498, 35356, 4...",,5,"{'id': '379', 'text': 'Did Ben Affleck shine m..."
3,"[0.00809027161449194, -0.009231459349393845, -...","{'indices': [8642, 19100, 20780, 24734, 26798,...",,5,"{'id': '399', 'text': 'What are the effects of..."
4,"[0.024374842643737793, 0.07713444530963898, 0....","{'indices': [1657, 13677, 33956, 43002, 57110]...",,5,"{'id': '420', 'text': 'Why creativity is impor..."


A hybrid score is obtained by applying a convex combination of the sparse and dense scores, with the alpha parameter specifying the weighting of the two scores:

`alpha * dense_score + (1 - alpha) * sparse_score`

In the folowing we explore the impact of various alpha values using a sample query.

In [133]:
sample_query = dataset.queries.sample(1).to_dict(orient="records")[0]
sample_query["blob"]["text"]

"What's the best way to travel from cochin to Munnar?"

In [134]:
index.query(**sample_query, include_metadata=True)["matches"]

[{'id': '433906',
  'metadata': {'text': ' Which are the best travel options from Cochin to '
                       'Munaar?'},
  'score': 1.1573259,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '228492',
  'metadata': {'text': ' What is the best way to spend two days in Munnar, '
                       'Kerala?'},
  'score': 0.8347726,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '48475',
  'metadata': {'text': ' What all places can be covered in 5 days in Kerala '
                       'from Cochin?'},
  'score': 0.80995196,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '40318',
  'metadata': {'text': ' What are the best places to visit in Munnar?'},
  'score': 0.7721201,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '410021',
  'metadata': {'text': ' What is the best travel place in Kerala?'},
  'score': 0.76315665,
  'sparse_values': {'indices': [], 'values': []},
  '

In [135]:
import numpy as np

def hybrid_weight_query(query, alpha):
  query = query.copy()
  query["vector"] = list(np.array(query["vector"]) * alpha)
  query["sparse_vector"]["values"] = list(np.array(query["sparse_vector"]["values"]) * (1.0 - alpha))
  return query

### Only Sparse (alpha = 0)

In [136]:
index.query(**hybrid_weight_query(sample_query, 0.0), include_metadata=True)["matches"]

[{'id': '433906',
  'metadata': {'text': ' Which are the best travel options from Cochin to '
                       'Munaar?'},
  'score': 0.23170872,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '297395',
  'metadata': {'text': " What's the best way to travel from Houston to Mexico "
                       'via bus?'},
  'score': 0.19312891,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '45473',
  'metadata': {'text': " What's the best way to travel the world?"},
  'score': 0.19192813,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '384232',
  'metadata': {'text': " What's the cheapest way to travel from Europe to "
                       'India?'},
  'score': 0.18268965,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '299466',
  'metadata': {'text': " What's the best way to travel with your bicycle?"},
  'score': 0.17798156,
  'sparse_values': {'indices': [], 'values': []},

### Hybrid (0 < alpha < 1)

In [137]:
# alpha=0.25
index.query(**hybrid_weight_query(sample_query, 0.25), include_metadata=True)["matches"]

[{'id': '433906',
  'metadata': {'text': ' Which are the best travel options from Cochin to '
                       'Munaar?'},
  'score': 0.40518582,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '228492',
  'metadata': {'text': ' What is the best way to spend two days in Munnar, '
                       'Kerala?'},
  'score': 0.294321,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '40318',
  'metadata': {'text': ' What are the best places to visit in Munnar?'},
  'score': 0.27211666,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '48475',
  'metadata': {'text': ' What all places can be covered in 5 days in Kerala '
                       'from Cochin?'},
  'score': 0.26679716,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '6232',
  'metadata': {'text': ' What are the best places to stay in munnar?'},
  'score': 0.2639953,
  'sparse_values': {'indices': [], 'values': []},
  

In [138]:
# alpha=0.6
index.query(**hybrid_weight_query(sample_query, 0.6), include_metadata=False)["matches"]

[{'id': '433906',
  'metadata': {},
  'score': 0.6248829,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '228492',
  'metadata': {},
  'score': 0.44948688,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '48475',
  'metadata': {},
  'score': 0.4473857,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '410021',
  'metadata': {},
  'score': 0.42628524,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '359065',
  'metadata': {},
  'score': 0.41658056,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}]

### Only Dense (alpha = 1)

In [139]:
index.query(**hybrid_weight_query(sample_query, 1.0), include_metadata=True)["matches"]

[{'id': '433906',
  'metadata': {'text': ' Which are the best travel options from Cochin to '
                       'Munaar?'},
  'score': 0.9256171,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '48475',
  'metadata': {'text': ' What all places can be covered in 5 days in Kerala '
                       'from Cochin?'},
  'score': 0.68133366,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '228492',
  'metadata': {'text': ' What is the best way to spend two days in Munnar, '
                       'Kerala?'},
  'score': 0.66351694,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '410021',
  'metadata': {'text': ' What is the best travel place in Kerala?'},
  'score': 0.65779406,
  'sparse_values': {'indices': [], 'values': []},
  'values': []}, {'id': '359065',
  'metadata': {'text': ' What is the best trip plan to Kerala from bangalore?'},
  'score': 0.65143514,
  'sparse_values': {'indices': [], 'values