[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade_vector_generation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade_vector_generation.ipynb)

# SPLADE Sparse-Dense Embedding Generation

## Overview

SPLADE is a class of models that produce sparse embeddings. Unlike dense embeddings, which can be difficult to interpret, sparse embeddings map to tokens in the model's vocabulary, which makes them easier to interpret. SPLADE models have been shown to consistently outperform dense models, particularly in out-of-domain settings.

This guide shows you how to construct SPLADE embeddings for use with Pinecone's sparse-dense index. To skip the embedding generation step, use the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade_quora.ipynb).

## Prerequisites

We'll install the required libraries: `transformers` and `torch` to run the SPLADE model, `sentence_transformers` for the dense model, `tqdm` for progress bars, and `pandas` for loading and processing the Quora dataset.

In [1]:
!pip install -qU \
          transformers \
          torch \
          sentence_transformers \
          tqdm \
          pandas

### Quora Dataset

We load a sample of 200 questions from the Quora dataset, stored as a Parquet file.

In [3]:
import pandas as pd

df = pd.read_parquet("https://storage.googleapis.com/pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade/raw/quora_questions_sample200.parquet")

In [4]:
df.head()

| | id | text |
|---|---|---|
| 0 | 17248 | If I fall under the Brady law due to PTSD, is... |
| 1 | 240419 | Which question can't be answered with a yes o... |
| 2 | 262372 | How can I write a children's book for older k... |
| 3 | 180057 | What happens when you view a public Instagram... |
| 4 | 456610 | What is the fact about NIBIRU the Planet X? |


### Sparse Embeddings with SPLADE 

In the following example, we use the SPLADE model [naver/splade-cocondenser-ensembledistil](https://huggingface.co/naver/splade-cocondenser-ensembledistil).


In [5]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

class SPLADE:
    def __init__(self, model):
        # check device
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.model = AutoModelForMaskedLM.from_pretrained(model)
        # move to gpu if available
        self.model.to(self.device)

    def __call__(self, text: str):
        inputs = self.tokenizer(text, return_tensors="pt").to(self.device)

        with torch.no_grad():
            logits = self.model(**inputs).logits

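        # SPLADE activation: log(1 + ReLU(logit)) for every (input token, vocab term) pair
        inter = torch.log1p(torch.relu(logits[0]))
        # max-pool over input tokens: one weight per vocabulary term
        token_max = torch.max(inter, dim=0)
        # keep only the vocabulary terms with a nonzero weight
        nz_tokens = torch.where(token_max.values > 0)[0]
        nz_weights = token_max.values[nz_tokens]

        # sort the remaining terms by descending weight
        order = torch.sort(nz_weights, descending=True)
        nz_weights = nz_weights[order[1]]
        nz_tokens = nz_tokens[order[1]]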
        return {
            'indices': nz_tokens.cpu().numpy().tolist(),
            'values': nz_weights.cpu().numpy().tolist()
        }
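
To make the pooling step in `__call__` easier to follow, here is a minimal sketch of the same arithmetic on plain Python lists, with no model involved. The `splade_pool` helper and the toy logits are hypothetical and only illustrate the `log1p(relu(...))` activation followed by a max over input tokens.

```python
import math

def splade_pool(logits):
    """Toy version of the pooling in SPLADE.__call__, on plain Python lists.

    logits: one row per input token, one column per vocabulary term.
    """
    vocab_size = len(logits[0])
    # log(1 + ReLU(x)) for each entry, then max over the input-token axis
    weights = [
        max(math.log1p(max(row[j], 0.0)) for row in logits)
        for j in range(vocab_size)
    ]
    # keep nonzero terms, sorted by descending weight
    nonzero = [(j, w) for j, w in enumerate(weights) if w > 0]
    nonzero.sort(key=lambda pair: pair[1], reverse=True)
    return {
        "indices": [j for j, _ in nonzero],
        "values": [w for _, w in nonzero],
    }

# hypothetical logits: 2 input tokens, vocabulary of 4 terms
toy_logits = [
    [-1.0, 2.0, 0.0, 0.5],
    [-3.0, 1.0, 0.0, 4.0],
]
toy_sparse = splade_pool(toy_logits)
print(toy_sparse["indices"])  # [3, 1]
```

Terms 0 and 2 never receive a positive logit, so they drop out of the sparse vector. Term 3 ranks first because its largest logit (4.0) exceeds that of term 1 (2.0).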

In [6]:
splade = SPLADE("naver/splade-cocondenser-ensembledistil")

In [7]:
doc = "what is the capital of france?"
sparse_vector = splade(doc)
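
Because each index in the sparse vector is a vocabulary id, the vector can be read back as weighted tokens. The sketch below shows one way to do that. The `top_terms` helper, the toy vocabulary, and the toy weights are all hypothetical and are not real model output.

```python
def top_terms(sparse_vector, id_to_token, k=5):
    """Return the k highest-weighted (token, weight) pairs of a sparse vector."""
    pairs = zip(sparse_vector["indices"], sparse_vector["values"])
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [(id_to_token[idx], round(weight, 2)) for idx, weight in ranked[:k]]

# hypothetical ids, tokens and weights, for illustration only
toy_vocab = {0: "capital", 1: "france", 2: "paris"}
toy_vector = {"indices": [1, 0, 2], "values": [2.5, 2.25, 1.0]}
terms = top_terms(toy_vector, toy_vocab, k=2)
print(terms)  # [('france', 2.5), ('capital', 2.25)]
```

With the real model, the id-to-token mapping could be built from the tokenizer, for example with `splade.tokenizer.convert_ids_to_tokens(sparse_vector['indices'])`.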


### Dense Model

We use the popular `all-MiniLM-L6-v2` model, available on Hugging Face, to generate dense vectors.

In [8]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"running on {device}")

model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)

running on cpu


### Compute Dense & Sparse Embeddings

In [9]:
from tqdm.notebook import tqdm

tqdm.pandas()
df['sparse_values'] = df['text'].progress_apply(lambda x: splade(x))

In [10]:
df['values'] = df['text'].progress_apply(lambda x: model.encode(x))

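
Before reshaping the dataframe, it can be worth checking that every row holds a well-formed pair of embeddings. The following is a minimal sketch with a hypothetical `check_embeddings` helper. It assumes sparse vectors are dicts with `indices` and `values`, as returned by `splade`, and that dense vectors have a known dimension (384 for `all-MiniLM-L6-v2`).

```python
def check_embeddings(sparse, dense, expected_dim):
    """Raise ValueError if a sparse/dense embedding pair is malformed."""
    if len(sparse["indices"]) != len(sparse["values"]):
        raise ValueError("sparse indices and values differ in length")
    if len(set(sparse["indices"])) != len(sparse["indices"]):
        raise ValueError("sparse indices contain duplicates")
    if len(dense) != expected_dim:
        raise ValueError(
            f"dense vector has {len(dense)} dims, expected {expected_dim}"
        )
    return True

# toy pair for illustration; real dense vectors would use expected_dim=384
ok = check_embeddings(
    {"indices": [7, 3], "values": [1.5, 0.5]},
    [0.1, 0.2, 0.3],
    expected_dim=3,
)
print(ok)  # True
```

On the dataframe, the check could be applied row by row, for example with `df.apply(lambda r: check_embeddings(r['sparse_values'], r['values'], 384), axis=1)`.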

Organize our dataframe to align with the Pinecone datasets format:

In [11]:
df_result = df.copy()
df_result["metadata"] = None
df_result["blob"] = df_result["text"].apply(lambda t: {"text": t})
df_result = df_result.drop(columns="text")

In [12]:
df_result.head()

| | id | sparse_values | values | metadata | blob |
|---|---|---|---|---|---|
| 0 | 17248 | {'indices': [10184, 19637, 2104, 5334, 2991, 4... | [0.021123115, 0.043918036, -0.032318894, -0.01... | None | {'text': ' If I fall under the Brady law due t... |
| 1 | 240419 | {'indices': [2748, 2053, 3980, 3160, 3437, 466... | [0.015179832, 0.06904052, -0.023286428, -0.003... | None | {'text': ' Which question can't be answered wi... |
| 2 | 262372 | {'indices': [2338, 3080, 2808, 2336, 2517, 221... | [0.038049288, 0.084497035, 0.008177851, 0.0328... | None | {'text': ' How can I write a children's book f... |
| 3 | 180057 | {'indices': [23091, 3796, 2270, 16021, 2145, 2... | [-0.028053429, -0.04008296, 0.016164199, 0.020... | None | {'text': ' What happens when you view a public... |
| 4 | 456610 | {'indices': [17706, 1060, 9152, 4774, 2755, 22... | [0.005622366, 0.1048758, 0.02587494, 0.0329212... | None | {'text': ' What is the fact about NIBIRU the P... |
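
As a rough sketch of how a row in this layout might later be turned into a record for a sparse-dense upsert, the hypothetical `row_to_record` helper below converts one row (as a plain dict) into a dictionary with `id`, `values`, `sparse_values`, and `metadata` keys. Casting the id to a string and storing the question text as metadata are assumptions made for illustration. The companion notebook may do this differently.

```python
def row_to_record(row):
    """Convert one dataframe row (as a dict) to an upsert-style record."""
    return {
        "id": str(row["id"]),  # assumption: ids are sent as strings
        # cast to built-in types so the record is JSON-serializable
        "values": [float(v) for v in row["values"]],
        "sparse_values": {
            "indices": [int(i) for i in row["sparse_values"]["indices"]],
            "values": [float(v) for v in row["sparse_values"]["values"]],
        },
        # assumption: keep the original text as metadata
        "metadata": {"text": row["blob"]["text"]},
    }

# toy row for illustration only (real dense vectors are much longer)
toy_row = {
    "id": 17248,
    "values": [0.1, 0.2],
    "sparse_values": {"indices": [7, 3], "values": [1.5, 0.5]},
    "metadata": None,
    "blob": {"text": "hypothetical question text"},
}
record = row_to_record(toy_row)
print(record["id"])  # 17248
```

For the full dataframe, something like `[row_to_record(r) for r in df_result.to_dict("records")]` would produce one record per question.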


We now have everything we need to start using the Pinecone vector database 🚀. For more details, check out [this notebook](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade_quora.ipynb).