[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-vector-generation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-vector-generation.ipynb)

# SPLADE Sparse-Dense Embedding Generation

## Overview

SPLADE is a class of models that produce sparse embeddings. Dense embeddings are often difficult to interpret, whereas each nonzero dimension of a sparse embedding corresponds to a vocabulary token, so the token overlap between a query and a document is easy to inspect and sparse vector search results are more interpretable. SPLADE models have also been reported to outperform dense models on some retrieval benchmarks, particularly in out-of-domain settings.

This guide shows how to generate SPLADE sparse embeddings, together with dense embeddings, for use in Pinecone's sparse-dense vectors. To skip embedding generation, see the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb).

## Prerequisites

We'll install the required libraries:

In [None]:
!pip install -qU \
          transformers \
          torch \
          sentence_transformers \
          tqdm \
          pandas



## Quora Dataset

We'll load the popular Quora dataset:

In [None]:
import pandas as pd

df = pd.read_parquet("https://storage.googleapis.com/pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade/raw/quora_questions_sample200.parquet")

In [None]:
df.head()

|   | id | text |
|---|----|------|
| 0 | 17248 | If I fall under the Brady law due to PTSD, is... |
| 1 | 240419 | Which question can't be answered with a yes o... |
| 2 | 262372 | How can I write a children's book for older k... |
| 3 | 180057 | What happens when you view a public Instagram... |
| 4 | 456610 | What is the fact about NIBIRU the Planet X? |


### Sparse Embeddings with SPLADE 

In the following example we will use the [naver/splade-cocondenser-ensembledistil](https://huggingface.co/naver/splade-cocondenser-ensembledistil) SPLADE model.


In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

class SPLADE:
    def __init__(self, model):
        # check device
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.model = AutoModelForMaskedLM.from_pretrained(model)
        # move to gpu if available
        self.model.to(self.device)

    def __call__(self, text: str):
        inputs = self.tokenizer(text, return_tensors="pt").to(self.device)

        with torch.no_grad():
            logits = self.model(**inputs).logits

        inter = torch.log1p(torch.relu(logits[0]))
        token_max = torch.max(inter, dim=0)  # max over input tokens, one weight per vocabulary entry
        nz_tokens = torch.where(token_max.values > 0)[0]
        nz_weights = token_max.values[nz_tokens]

        order = torch.sort(nz_weights, descending=True)
        nz_weights = nz_weights[order[1]]
        nz_tokens = nz_tokens[order[1]]
        return {
            'indices': nz_tokens.cpu().numpy().tolist(),
            'values': nz_weights.cpu().numpy().tolist()
        }
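
The pooling step in `__call__` can be hard to picture from the tensor code alone. The sketch below repeats the same arithmetic in plain Python on a tiny, made-up logits matrix (two input tokens, four vocabulary entries): apply `log(1 + ReLU(x))` to every logit, then take the maximum over input tokens for each vocabulary entry. The function name `splade_pool` and the toy numbers are illustrative assumptions, not part of the model.

```python
import math

def splade_pool(logits):
    """Toy version of the pooling above.

    logits: list of rows, one row per input token,
            each row holding one score per vocabulary entry.
    Returns one weight per vocabulary entry.
    """
    vocab_size = len(logits[0])
    weights = []
    for j in range(vocab_size):
        # log(1 + ReLU(x)) for each token, then the max across tokens
        weights.append(max(math.log1p(max(row[j], 0.0)) for row in logits))
    return weights

# Hypothetical logits: 2 input tokens x 4 vocabulary entries
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.0, -3.0, 0.0, 1.5],
]

weights = splade_pool(toy_logits)
# Keep only the nonzero entries, as the class does with torch.where
sparse = {j: w for j, w in enumerate(weights) if w > 0}
print(sparse)  # only vocabulary entries 0 and 3 survive
```

Entries whose logits are never positive end up with weight zero and are dropped, which is what makes the resulting vector sparse.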

In [None]:
splade = SPLADE("naver/splade-cocondenser-ensembledistil")


In [None]:
doc = "what is the capital of france?"
sparse_vector = splade(doc)
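
Because every index in the sparse vector is a vocabulary ID, the vector can be read back as weighted tokens. The following is a minimal sketch of that lookup, assuming the `{'indices': [...], 'values': [...]}` layout returned above; the helper `top_terms`, the toy vocabulary, and the toy weights are hypothetical and only illustrate the idea.

```python
def top_terms(sparse_vector, id_to_token, k=5):
    """Return the k highest-weighted (token, weight) pairs of a sparse vector."""
    pairs = zip(sparse_vector["indices"], sparse_vector["values"])
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
    # Fall back to a placeholder if an ID is missing from the mapping
    return [(id_to_token.get(i, f"<unk:{i}>"), round(w, 2)) for i, w in ranked]

# Hypothetical vocabulary and weights, not real model output
toy_vocab = {101: "capital", 202: "france", 303: "paris"}
toy_vector = {"indices": [202, 101, 303], "values": [2.4, 2.1, 0.7]}

print(top_terms(toy_vector, toy_vocab, k=2))
# [('france', 2.4), ('capital', 2.1)]
```

With the real model, the tokens for a vector could instead be obtained with `splade.tokenizer.convert_ids_to_tokens(sparse_vector['indices'])`.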


### Dense Model

For dense vectors we use the popular `all-MiniLM-L6-v2` model, available on Hugging Face, which produces 384-dimensional embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"running on {device}")

model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)

running on cpu



### Compute Dense & Sparse Embeddings

Create SPLADE sparse embeddings:

In [None]:
from tqdm.notebook import tqdm

tqdm.pandas()
df['sparse_values'] = df['text'].progress_apply(lambda x: splade(x))


And now encode dense vector embeddings:

In [None]:
df['values'] = df['text'].progress_apply(lambda x: model.encode(x))


We organize our dataframe to align to the `pinecone-datasets` format:

In [None]:
df_result = df.copy()
df_result["metadata"] = None
df_result["blob"] = df_result["text"].apply(lambda t: {"text": t})
df_result = df_result.drop(columns="text")
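
Before handing the records to a vector database, it can help to sanity-check each one. The sketch below is one possible check, assuming the column names used above (`sparse_values`, `values`) and a dense dimension of 384 for `all-MiniLM-L6-v2`; the helper `check_record` is hypothetical and not part of `pinecone-datasets`.

```python
def check_record(record, dense_dim=384):
    """Return a list of problems found in one sparse-dense record (empty if none)."""
    problems = []
    sparse = record["sparse_values"]
    # Each sparse index needs exactly one weight
    if len(sparse["indices"]) != len(sparse["values"]):
        problems.append("sparse indices/values length mismatch")
    # A vocabulary ID should appear at most once
    if len(set(sparse["indices"])) != len(sparse["indices"]):
        problems.append("duplicate sparse indices")
    # The dense vector must match the expected model dimension
    if len(record["values"]) != dense_dim:
        problems.append(f"dense vector has {len(record['values'])} dims, expected {dense_dim}")
    return problems

# Hypothetical records for illustration
good = {
    "id": 1,
    "sparse_values": {"indices": [10, 42], "values": [1.2, 0.3]},
    "values": [0.0] * 384,
}
bad = {
    "id": 2,
    "sparse_values": {"indices": [10, 10], "values": [1.2]},
    "values": [0.0] * 3,
}

print(check_record(good))  # []
print(len(check_record(bad)))  # 3
```

On the dataframe built above, the same check could be run row by row, for example with `df_result.apply(check_record, axis=1)`.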

In [None]:
df_result.head()

|   | id | sparse_values | values | metadata | blob |
|---|----|---------------|--------|----------|------|
| 0 | 17248 | {'indices': [10184, 19637, 2104, 5334, 2991, 4... | [0.021123115, 0.043918036, -0.032318894, -0.01... | None | {'text': ' If I fall under the Brady law due t... |
| 1 | 240419 | {'indices': [2748, 2053, 3980, 3160, 3437, 466... | [0.015179832, 0.06904052, -0.023286428, -0.003... | None | {'text': ' Which question can't be answered wi... |
| 2 | 262372 | {'indices': [2338, 3080, 2808, 2336, 2517, 221... | [0.038049288, 0.084497035, 0.008177851, 0.0328... | None | {'text': ' How can I write a children's book f... |
| 3 | 180057 | {'indices': [23091, 3796, 2270, 16021, 2145, 2... | [-0.028053429, -0.04008296, 0.016164199, 0.020... | None | {'text': ' What happens when you view a public... |
| 4 | 456610 | {'indices': [17706, 1060, 9152, 4774, 2755, 22... | [0.005622366, 0.1048758, 0.02587494, 0.0329212... | None | {'text': ' What is the fact about NIBIRU the P... |


We now have everything we need to start using the Pinecone vector database 🚀

For more details on that, check out [this notebook](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb).