[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/splade/splade-quora.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/splade/splade-quora.ipynb)

# Hybrid Search with SPLADE Sparse Vectors

[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-vector-generation.ipynb)

## Overview

SPLADE is a class of models that produce sparse embeddings. Unlike dense embeddings, which can be difficult to interpret, sparse embeddings map directly to vocabulary tokens, which makes them easier to interpret. SPLADE models have also been reported to outperform dense models in some settings, particularly out-of-domain ones.

This guide shows how to use SPLADE embeddings with Pinecone's sparse-dense index. See the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-vector-generation.ipynb) to learn how to generate the embeddings.
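To make the interpretability point concrete, the sketch below decodes a sparse vector into its highest-weighted tokens. The vocabulary, token IDs, and weights are invented for illustration and are not SPLADE output; `top_tokens` is a hypothetical helper, and a real model would use its tokenizer's full vocabulary instead of `toy_vocab`.

```python
# Toy illustration only: the vocabulary and weights are invented, not SPLADE output.
toy_vocab = {10: ".", 20: "to", 30: "teach", 40: "kids", 50: "alphabet"}

# A sparse vector stores only the non-zero entries as parallel lists.
sparse_vector = {
    "indices": [20, 30, 40, 50],
    "values": [0.1, 1.9, 1.4, 2.3],
}

def top_tokens(sparse, vocab, k=3):
    """Return the k highest-weighted (token, weight) pairs of a sparse vector."""
    pairs = zip(sparse["indices"], sparse["values"])
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    # Fall back to a placeholder string for IDs missing from the vocabulary.
    return [(vocab.get(idx, f"<id {idx}>"), weight) for idx, weight in ranked[:k]]

print(top_tokens(sparse_vector, toy_vocab))
# [('alphabet', 2.3), ('teach', 1.9), ('kids', 1.4)]
```

Because every non-zero dimension corresponds to a token, the largest weights give a direct reading of what the model considers important in the text.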


## Prerequisites

We'll install the required libraries: the `pinecone-client` for interacting with Pinecone, the `pinecone-datasets` library that we will use for fast processing of the Quora dataset, and `numpy`.

In [1]:
!pip install --no-color -qU \
          "pinecone-client[grpc]"==2.2.1 \
          pinecone-datasets=='0.5.0rc11' \
          numpy==1.24.3


## Quora Dataset

We'll load the popular Quora dataset with precomputed embeddings. Both dense and sparse embeddings have been precomputed using the following models:

* Dense: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

* Sparse: [naver/splade-cocondenser-ensembledistil](https://huggingface.co/naver/splade-cocondenser-ensembledistil)

In [7]:
from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-v2_Splade-100K")

dataset.documents.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1,"[0.4024894, -0.23425448, -0.36006898, 0.044094...","{'indices': [1012, 2000, 2011, 2017, 2022, 204...",,{'text': ' What is the step by step guide to i...
1,2,"[0.5111937, -0.1987632, -0.32637578, 0.1264907...","{'indices': [1000, 1012, 1999, 2000, 2011, 201...",,{'text': ' What is the step by step guide to i...
2,3,"[-0.2237151, 0.74151665, -0.18739395, 0.233195...","{'indices': [1005, 1006, 1007, 1010, 1011, 101...",,{'text': ' What is the story of Kohinoor (Koh-...
3,4,"[-0.37123987, 0.7097032, -0.06182622, -0.16823...","{'indices': [1005, 1011, 1012, 1045, 1047, 199...",,{'text': ' What would happen if the Indian gov...
4,5,"[-0.16656642, 0.21881323, -0.0023541958, 0.104...","{'indices': [2006, 2017, 2064, 2076, 2078, 209...",,{'text': ' How can I increase the speed of my ...


As you can see, this data is already loaded with the sparse and dense representations of each document. To learn how these values were generated, see [this walkthrough](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-vector-generation.ipynb).
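Each row corresponds to one record in the sparse-dense layout: an `id`, a dense `values` list, and `sparse_values` holding parallel `indices` and `values` lists. The following sketch checks that layout on a hand-written record. `check_record` is a hypothetical helper, and the numbers are shortened placeholders, not real embeddings.

```python
def check_record(record, dimension):
    """Raise ValueError if a record does not follow the sparse-dense layout."""
    if len(record["values"]) != dimension:
        raise ValueError("dense vector has the wrong dimension")
    sparse = record["sparse_values"]
    if len(sparse["indices"]) != len(sparse["values"]):
        raise ValueError("sparse indices and values must have equal length")
    if len(set(sparse["indices"])) != len(sparse["indices"]):
        raise ValueError("sparse indices must be unique")
    return True

# Placeholder record with a 4-dimensional dense vector
# (all-MiniLM-L6-v2 produces 384 dimensions).
record = {
    "id": "1",
    "values": [0.40, -0.23, -0.36, 0.04],
    "sparse_values": {"indices": [1012, 2000, 2011], "values": [0.2, 0.7, 0.1]},
}

print(check_record(record, dimension=4))  # True
```

A check like this can be useful before uploading data you generated yourself, since a length mismatch between `indices` and `values` is an easy mistake to make.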

## Index Creation

We first need to initialize our connection to Pinecone to create our vector index. For this, we need a [free API key](https://app.pinecone.io/). We initialize the connection like so:


In [9]:
import os
import pinecone

api_key = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
# find environment next to your API key in the Pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

# pinecone-client 2.x is configured through a module-level init call
pinecone.init(api_key=api_key, environment=env)
pinecone.whoami()

WhoAmIResponse(username='load', user_label='label', projectname='load-test')

In [10]:
index_name = "splade-quora"
dimension = 384

We create the index like so:

In [11]:
# in pinecone-client 2.x, list_indexes() returns a plain list of index names
if index_name not in pinecone.list_indexes():
  pinecone.create_index(
      index_name,
      pod_type='s1',
      metric='dotproduct',
      dimension=dimension
  )

And we connect to the index like so:

In [12]:
index = pinecone.Index(index_name)

## Upsert


Now let's upsert vectors to the index. We are using async upload with batching. For more information on performance boosting, see the Pinecone documentation for [Performance Tuning](https://docs.pinecone.io/docs/performance-tuning).

In [13]:
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)

sending upsert requests:   0%|          | 0/522931 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/1046 [00:00<?, ?it/s]

upserted_count: 522931
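Batching itself is simple to picture. The sketch below splits any sequence into fixed-size chunks in plain Python. `chunks` is a hypothetical helper shown only for illustration; it is not how `pinecone-datasets` implements `iter_documents`.

```python
from itertools import islice

def chunks(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 250 items in batches of 100 leave a final, smaller batch.
sizes = [len(batch) for batch in chunks(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Smaller batches keep each request small, while larger batches reduce the number of round trips; the right size depends on vector dimension and metadata size.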

In [14]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 522931}},
 'total_vector_count': 522931}

## Query

The dataset comes with a set of prewritten queries that can be used. We view them like so:

In [15]:
dataset.queries.head()

Unnamed: 0,vector,sparse_vector,filter,top_k,blob
0,"[-0.07095234841108322, 0.0012621647911146283, ...","{'indices': [18989, 23463, 27058, 31925, 38916...",,5,"{'id': '318', 'text': 'How does Quora look to ..."
1,"[0.05170859768986702, -0.024982793256640434, -...","{'indices': [31604, 31925, 36513, 36821, 38049...",,5,"{'id': '378', 'text': 'How do I refuse to chos..."
2,"[0.005764591973274946, 0.004137433134019375, -...","{'indices': [947, 2793, 15453, 15498, 35356, 4...",,5,"{'id': '379', 'text': 'Did Ben Affleck shine m..."
3,"[0.00809027161449194, -0.009231459349393845, -...","{'indices': [8642, 19100, 20780, 24734, 26798,...",,5,"{'id': '399', 'text': 'What are the effects of..."
4,"[0.024374842643737793, 0.07713444530963898, 0....","{'indices': [1657, 13677, 33956, 43002, 57110]...",,5,"{'id': '420', 'text': 'Why creativity is impor..."


Here we define a function that merges the query results with the actual texts of the documents and shows them as a dataframe.

In [16]:
import pandas as pd

def merge_with_documents(query_response, documents_df):
    """Join query matches with their source documents and return text and score."""
    results_df = pd.DataFrame([res.to_dict() for res in query_response["matches"]])
    # keep only the documents whose id appears among the matches
    results_df = results_df.merge(documents_df, on="id", how="inner")
    results_df["text"] = results_df["blob"].apply(lambda b: b["text"])
    return results_df[["text", "score"]].sort_values("score", ascending=False)

We can load a sample query like so:

In [35]:
sample_query = dataset.queries.iloc[14226].to_dict()
sample_query["blob"]["text"]

'How can I teach my kids the alphabet?'

Now we find the similarity scores for the top `5` returned items from the index:

In [33]:
query_response = index.query(**sample_query)
merge_with_documents(query_response, dataset.documents)

Unnamed: 0,text,score
0,How do I prepare for software interviews?,5.739118
1,How do I prepare for programming interviews?,5.189726
2,How should I prepare for programming interviews?,5.008953
3,How do I prepare for a software engineering j...,4.843315
4,How do I prepare for interviews?,4.840115


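Because the index holds both dense and sparse vectors, each `score` above is the sum of a dense dot product and a sparse dot product. To control the balance between the two, we can weight them with a parameter `alpha`:

`alpha * dense_score + (1 - alpha) * sparse_score`

With `alpha = 1` only the dense score counts, and with `alpha = 0` only the sparse score counts. In the following code, we apply this weighting by scaling the query vectors before sending them, and explore the impact of various `alpha` values using a sample query.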

In [19]:
from copy import deepcopy
import numpy as np

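def hybrid_weight_query(query, alpha):
  """Scale the dense part by alpha and the sparse part by (1 - alpha)."""
  # copy first so the original query is left untouched
  query_transformed = deepcopy(query)
  # .tolist() converts NumPy scalars back to plain Python floats
  query_transformed["vector"] = (np.array(query_transformed["vector"]) * alpha).tolist()
  query_transformed["sparse_vector"]["values"] = (np.array(query_transformed["sparse_vector"]["values"]) * (1.0 - alpha)).tolist()
  return query_transformed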
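Scaling the query vectors works because the dot product is linear: multiplying the dense query by `alpha` and the sparse query by `1 - alpha` scales each partial score by the same factor. Here is a small self-contained check in plain Python. The query and document are made up and are not taken from the dataset, and `dense_dot` and `sparse_dot` are hypothetical helpers.

```python
def dense_dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a, b):
    """Dot product of two sparse vectors given as {'indices': [...], 'values': [...]}."""
    b_weights = dict(zip(b["indices"], b["values"]))
    return sum(v * b_weights.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))

# Made-up query and document; only sparse index 7 is shared.
query = {"vector": [1.0, 2.0], "sparse_vector": {"indices": [3, 7], "values": [2.0, 4.0]}}
doc = {"vector": [3.0, 1.0], "sparse_vector": {"indices": [7, 9], "values": [0.5, 1.0]}}

alpha = 0.25
dense_score = dense_dot(query["vector"], doc["vector"])                   # 5.0
sparse_score = sparse_dot(query["sparse_vector"], doc["sparse_vector"])   # 2.0

# Scale the query the same way hybrid_weight_query does, then score it.
scaled_dense = [x * alpha for x in query["vector"]]
scaled_sparse = {
    "indices": query["sparse_vector"]["indices"],
    "values": [v * (1 - alpha) for v in query["sparse_vector"]["values"]],
}
hybrid = dense_dot(scaled_dense, doc["vector"]) + sparse_dot(scaled_sparse, doc["sparse_vector"])
expected = alpha * dense_score + (1 - alpha) * sparse_score

print(hybrid, expected)  # 2.75 2.75
```

Only the query needs scaling, so the same indexed vectors can serve any `alpha` without re-upserting.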

### Only Sparse (alpha = 0)

In [34]:
query_response = index.query(**hybrid_weight_query(sample_query, 0.0))
merge_with_documents(query_response, dataset.documents)

Unnamed: 0,text,score
0,What is an example of isolationism?,0.71182
1,"If one lived in isolation, would they know an...",0.701062
2,Is Donald Trump an isolationist?,0.699042
3,What is behavioral isolation?,0.697808
4,What is isolationism? What are some examples?,0.676771


### Hybrid (0 < alpha < 1)

In [21]:
# alpha=0.25
query_response = index.query(**hybrid_weight_query(sample_query, 0.25))
merge_with_documents(query_response, dataset.documents)

Unnamed: 0,text,score
0,What is the best way to teach kids alphabets?,1.354345
1,What is the best way to teach children the al...,1.333429
2,What should I teach my children after alphabets?,1.308289
3,How can an adult re-learn the alphabet?,1.118512
4,How do I teach a child to read and write?,1.102335


In [22]:
# alpha=0.6
query_response = index.query(**hybrid_weight_query(sample_query, 0.6))
merge_with_documents(query_response, dataset.documents)

Unnamed: 0,text,score
0,What is the best way to teach kids alphabets?,3.250428
1,What is the best way to teach children the al...,3.20023
2,What should I teach my children after alphabets?,3.139894
3,How can an adult re-learn the alphabet?,2.684429
4,How do I teach a child to read and write?,2.645604


### Only Dense (alpha = 1)

In [23]:
query_response = index.query(**hybrid_weight_query(sample_query, 1.0))
merge_with_documents(query_response, dataset.documents)

Unnamed: 0,text,score
0,What is the best way to teach kids alphabets?,5.417379
1,What is the best way to teach children the al...,5.333716
2,What should I teach my children after alphabets?,5.233157
3,How can an adult re-learn the alphabet?,4.474048
4,How do I teach a child to read and write?,4.409339


Once we're done, delete the index to save resources:

In [None]:
pinecone.delete_index(index_name)

---