[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb)

# Hybrid Search with SPLADE Sparse Vectors

## Overview

SPLADE is a class of models that produce sparse embeddings. Unlike dense embeddings, which can be difficult to interpret, sparse embeddings map to vocabulary tokens, which makes them easier to interpret. SPLADE models have been reported to outperform dense models, particularly in out-of-domain settings.
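
To make the interpretability point concrete, here is a minimal sketch using a made-up five-token vocabulary and invented weights (a real SPLADE model uses the much larger vocabulary of its tokenizer). A sparse embedding stores only its non-zero positions, and each position corresponds to one vocabulary token. The helper `explain_sparse` is hypothetical and not part of any library.

```python
# Hypothetical toy vocabulary: token id -> token string.
toy_vocab = {0: "what", 1: "is", 2: "hybrid", 3: "search", 4: "retrieval"}

# A sparse embedding keeps only non-zero dimensions, stored as parallel lists.
# The weights below are invented for illustration.
sparse_embedding = {"indices": [2, 3, 4], "values": [1.7, 2.1, 0.6]}


def explain_sparse(embedding, vocab):
    """Return (token, weight) pairs sorted by descending weight."""
    pairs = [
        (vocab[idx], weight)
        for idx, weight in zip(embedding["indices"], embedding["values"])
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


print(explain_sparse(sparse_embedding, toy_vocab))
# [('search', 2.1), ('hybrid', 1.7), ('retrieval', 0.6)]
```

Reading the output, we can tell at a glance which tokens drive the match, something a dense vector does not offer.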

This guide shows you how to use precomputed SPLADE embeddings with Pinecone's sparse-dense index. See the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-vector-generation.ipynb) to learn how to generate the embeddings.


## Prerequisites

We'll install the required libraries: the `pinecone-client` for interacting with Pinecone, the `pinecone-datasets` library that we will use for fast processing of the Quora dataset, and `numpy`.

In [None]:
!pip install --no-color -qU \
          "pinecone-client[grpc]" \
          pinecone-datasets \
          numpy

## Quora Dataset

We'll load the popular Quora dataset with precomputed embeddings. Both dense and sparse embeddings have been precomputed using the following models:

* Dense: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

* Sparse: [naver/splade-cocondenser-ensembledistil](https://huggingface.co/naver/splade-cocondenser-ensembledistil)

In [None]:
from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-v2_Splade")

dataset.documents.head()

As you can see, the dataset already includes the sparse and dense representations of each document. To learn how these values are generated, see [this walkthrough](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/splade-vector-generation.ipynb).
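
For orientation, a sparse-dense record pairs a dense `values` list with a `sparse_values` dictionary of parallel `indices` and `values` lists. The sketch below uses invented numbers and a shortened dense vector (the real ones have 384 dimensions). `check_record` is a hypothetical helper for illustration, not part of `pinecone-datasets` or the Pinecone client.

```python
def check_record(record, dense_dimension):
    """Run basic sanity checks on one sparse-dense record (hypothetical helper)."""
    sparse = record["sparse_values"]
    return (
        len(record["values"]) == dense_dimension
        # Each sparse index needs exactly one matching value.
        and len(sparse["indices"]) == len(sparse["values"])
        # A token id should appear at most once.
        and len(set(sparse["indices"])) == len(sparse["indices"])
    )


# Invented record; the ids and weights are placeholders.
toy_record = {
    "id": "doc-0",
    "values": [0.12, -0.03, 0.45, 0.08],
    "sparse_values": {"indices": [102, 2054, 7592], "values": [0.9, 1.4, 0.3]},
}

print(check_record(toy_record, dense_dimension=4))  # True
```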

## Index Creation

We first need to initialize our connection to Pinecone to create our vector index. For this, we need a [free API key](https://app.pinecone.io/). We initialize the connection like so:


In [None]:
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",  # app.pinecone.io
    environment="YOUR_ENV"  # next to API key in console
)
pinecone.whoami()
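
Rather than pasting the key into the notebook, we could read it from environment variables. This is a minimal sketch; the variable names `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` and the helper `read_pinecone_credentials` are assumptions chosen for illustration.

```python
import os


def read_pinecone_credentials(env=None):
    """Read the API key and environment name from environment variables.

    PINECONE_API_KEY and PINECONE_ENVIRONMENT are assumed variable names.
    """
    env = os.environ if env is None else env
    api_key = env.get("PINECONE_API_KEY")
    environment = env.get("PINECONE_ENVIRONMENT")
    if not api_key or not environment:
        raise RuntimeError(
            "Set PINECONE_API_KEY and PINECONE_ENVIRONMENT before running."
        )
    return {"api_key": api_key, "environment": environment}
```

The returned dictionary can then be unpacked into the call above, as in `pinecone.init(**read_pinecone_credentials())`.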

In [None]:
index_name = "splade-quora"
dimension = 384

We create the index like so:

In [None]:
if index_name not in pinecone.list_indexes():
  pinecone.create_index(
      index_name,
      pod_type='s1',
      metric='dotproduct',
      dimension=dimension
  )

And we connect to the index like so:

In [None]:
index = pinecone.GRPCIndex(index_name)

## Upsert


Now let's upsert the vectors to the index. The `upsert_from_dataframe` method uploads them asynchronously in batches. We drop the `blob` column, which holds the raw text, before upserting. For more information on performance boosting, see the Pinecone documentation for [Performance Tuning](https://docs.pinecone.io/docs/performance-tuning).

In [None]:
index.upsert_from_dataframe(dataset.documents.drop(columns="blob"))
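
The call above handles batching for us. To illustrate the idea, here is a sketch that splits any iterable of records into fixed-size batches. `chunks` is a hypothetical helper, and the batch size of 3 is arbitrary.

```python
from itertools import islice


def chunks(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    batch = list(islice(iterator, batch_size))
    while batch:
        yield batch
        batch = list(islice(iterator, batch_size))


# Placeholder ids standing in for real records.
toy_ids = [f"doc-{i}" for i in range(7)]
for batch in chunks(toy_ids, batch_size=3):
    print(len(batch), batch)
```

Each batch could then be sent with its own `index.upsert(vectors=batch)` call if we wanted manual control over the upload.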

In [None]:
index.describe_index_stats()

## Query

The dataset comes with a set of prewritten queries that can be used. We view them like so:

In [None]:
dataset.queries.head()

Here we define a function that merges the query results with the actual texts of the documents and shows them as a dataframe.

In [None]:
import pandas as pd

def merge_with_documents(query_response, documents_df):
    results_df = pd.DataFrame([res.to_dict() for res in query_response["matches"]])
    results_df = results_df.merge(documents_df, on="id", how="inner")
    results_df["text"] = results_df["blob"].apply(lambda b: b["text"])
    return results_df[["text", "score"]].sort_values("score", ascending=False)

We can load a sample query like so:

In [None]:
sample_query = dataset.queries.iloc[14226].to_dict()
sample_query["blob"]["text"]

Now we find the similarity scores for the top `5` returned items from the index:

In [None]:
query_response = index.query(**sample_query)
merge_with_documents(query_response, dataset.documents)

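Because we have both dense and sparse vectors in the index, the `score` above is the sum of the dense and sparse dot products:

`dense_score + sparse_score`

To control the balance between the two, we can scale the dense query vector by `alpha` and the sparse query values by `1 - alpha` before querying. The resulting score is then:

`alpha * dense_score + (1 - alpha) * sparse_score`

The `alpha` parameter specifies the weighting of the two scores. In the following code, we explore the impact of various alpha values using a sample query.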

In [None]:
from copy import deepcopy
import numpy as np

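def hybrid_weight_query(query, alpha):
    """Scale the dense vector by alpha and the sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    query_transformed = deepcopy(query)  # leave the original query untouched
    query_transformed["vector"] = (np.array(query["vector"]) * alpha).tolist()
    sparse_values = np.array(query["sparse_vector"]["values"]) * (1.0 - alpha)
    query_transformed["sparse_vector"]["values"] = sparse_values.tolist()
    return query_transformed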
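
To see why scaling the query is enough to produce the weighted score, here is a small worked example in plain Python. All vectors are invented, and `dot` and `sparse_dot` are hypothetical helpers that mimic dot-product scoring locally; they do not call Pinecone.

```python
import math


def dot(a, b):
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product of two sparse vectors given as indices/values dicts."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Invented query and document vectors.
query_dense = [0.5, 0.5]
query_sparse = {"indices": [3, 7], "values": [2.0, 1.0]}
doc_dense = [1.0, 0.0]
doc_sparse = {"indices": [7, 9], "values": [4.0, 1.0]}

alpha = 0.25

# Scale the query only, as hybrid_weight_query does.
scaled_dense = [x * alpha for x in query_dense]
scaled_sparse = {
    "indices": query_sparse["indices"],
    "values": [v * (1.0 - alpha) for v in query_sparse["values"]],
}

score_from_scaled_query = dot(scaled_dense, doc_dense) + sparse_dot(scaled_sparse, doc_sparse)
weighted_sum = (
    alpha * dot(query_dense, doc_dense)
    + (1.0 - alpha) * sparse_dot(query_sparse, doc_sparse)
)

print(score_from_scaled_query, weighted_sum)  # 3.125 3.125
```

Because the dot product is linear, multiplying the query by a constant multiplies the score by the same constant, so the documents in the index never need to be modified.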

### Only Sparse (alpha = 0)

In [None]:
query_response = index.query(**hybrid_weight_query(sample_query, 0.0))
merge_with_documents(query_response, dataset.documents)

### Hybrid (0 < alpha < 1)

In [None]:
# alpha=0.25
query_response = index.query(**hybrid_weight_query(sample_query, 0.25))
merge_with_documents(query_response, dataset.documents)

In [None]:
# alpha=0.6
query_response = index.query(**hybrid_weight_query(sample_query, 0.6))
merge_with_documents(query_response, dataset.documents)

### Only Dense (alpha = 1)

In [None]:
query_response = index.query(**hybrid_weight_query(sample_query, 1.0))
merge_with_documents(query_response, dataset.documents)
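
To compare the four settings side by side, it can help to look at how the ranking shifts as `alpha` changes. The sketch below does this locally with two made-up documents and invented dense and sparse scores; `rank` is a hypothetical helper, and real scores from the index will differ in scale.

```python
# Invented per-document scores for illustration only.
toy_scores = {
    "doc-a": {"dense": 0.9, "sparse": 0.2},
    "doc-b": {"dense": 0.4, "sparse": 0.8},
}


def rank(scores, alpha):
    """Order document ids by alpha * dense + (1 - alpha) * sparse, best first."""
    combined = {
        doc: alpha * s["dense"] + (1.0 - alpha) * s["sparse"]
        for doc, s in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


for alpha in (0.0, 0.25, 0.6, 1.0):
    print(alpha, rank(toy_scores, alpha))
```

With these toy numbers, `doc-b` leads at `alpha` values of 0.0 and 0.25, while `doc-a` takes over at 0.6 and 1.0.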

Once we're done, delete the index to save resources:

In [None]:
pinecone.delete_index(index_name)

---