In [None]:
!pip install -qU pinecone sentence-transformers pinecone-notebooks pinecone-datasets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search

In this walkthrough we will see how to use Pinecone for semantic search. To begin, we must install the prerequisite libraries.

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Download

In this notebook we skip the data preparation steps, which can be very time consuming, and jump straight to a prebuilt dataset from *Pinecone Datasets*.

The dataset we are working with represents embeddings of [400K question pairs from Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). The embeddings were created using the `all-MiniLM-L6-v2` model from Hugging Face via the `sentence-transformers` package.

If you'd rather see how it's all done, please refer to [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb).

Let's go ahead and download the dataset.

In [None]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')

# The metadata we need is actually stored in the "blob" column so let's rename it
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

# We don't need sparse_values for this demo either so let's drop those as well
dataset.documents.drop(['sparse_values'], axis=1, inplace=True)

# To speed things up in this demo, we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()

Loading documents parquet files: 100%|██████████| 10/10 [02:15<00:00, 13.52s/it]


| | id | values | metadata |
|---|---|---|---|
| 240000 | 515997 | [-0.00531694, 0.06937869, -0.0092854, 0.003286... | {'text': ' Why is a "law of sciences" importan... |
| 240001 | 515998 | [-0.09243751, 0.065432355, -0.06946959, 0.0669... | {'text': ' Is it possible to format a BitLocke... |
| 240002 | 515999 | [-0.021924071, 0.032280188, -0.020190848, 0.07... | {'text': ' Can formatting a hard drive stress ... |
| 240003 | 516000 | [-0.120020054, 0.024080949, 0.10693012, -0.018... | {'text': ' Are the new Samsung Galaxy J7 and J... |
| 240004 | 516001 | [-0.095293395, -0.048446465, -0.017618902, -0.... | {'text': ' I just watched an add for Indonesia... |


In [None]:
print(f"Rows in dataset: {len(dataset)}")

Rows in dataset: 80000


Let's take a closer look at one of these rows to see what we're dealing with. In the metadata we have stored the original question text.

In [None]:
# Convert the first row to a plain dict; its 'values' key holds the embedding
row1 = dataset.documents.iloc[0:1].to_dict(orient="records")[0]
dimension = len(row1['values'])
print(f"These embeddings have dimension {dimension}")

These embeddings have dimension 384


In [None]:
print("Here are some example questions in the data set:\n")
for r in dataset.documents.iloc[0:10].to_dict(orient="records"):
    print("  -" + r['metadata']['text'])

Here are some example questions in the data set:

  - Why is a "law of sciences" important for our life?
  - Is it possible to format a BitLocker or FileVault protected drive?
  - Can formatting a hard drive stress it out?
  - Are the new Samsung Galaxy J7 and J5 worth their price?
  - I just watched an add for Indonesia 2026 World Cup bid in YouTube, is it viable?
  - I am an 18 year old college student. Is it a viable idea to play poker in order to pay for my college tuition?
  - If the French monarchy had never been abolished, who would be the current king/queen?
  - Who was the best French King?
  - How do I obtain a free United States phone number using the Internet?
  - What is the change in your opinion about PM Narendra Modi after demonetization of 1000 and 500 rupees currency notes?


## Creating an Index

Now that the data is ready, we can set up our index to store it.

We begin by instantiating the Pinecone client. To do this we need a [free API key](https://app.pinecone.io).

In [None]:
import os

if not os.environ.get("PINECONE_API_KEY"):
    from pinecone_notebooks.colab import Authenticate
    Authenticate()

In [None]:
from pinecone import Pinecone

# Initialize client
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
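
The `Authenticate()` helper above comes from `pinecone_notebooks.colab`, so outside Colab the key has to reach `PINECONE_API_KEY` some other way. Below is a minimal sketch of one option, assuming an interactive prompt is acceptable as a fallback. The helper name `resolve_api_key` is hypothetical and not part of any Pinecone package.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the Pinecone API key from `env`, prompting only if it is missing.

    Hypothetical helper for illustration; `prompt` is injectable so the
    function can be exercised without a terminal.
    """
    key = env.get("PINECONE_API_KEY")
    if not key:
        # Fall back to a hidden interactive prompt
        key = prompt("Pinecone API key: ")
    if not key:
        raise ValueError("PINECONE_API_KEY is required")
    return key
```

With this in place, the client could be created as `Pinecone(api_key=resolve_api_key())`.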

Now we create a new index called `semantic-search-fast`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

### Creating a Pinecone Index

When creating the index we need to define several configuration properties.

- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on.
- `metric` specifies the similarity metric that will be used later when you make queries to the index.
- `dimension` should correspond to the dimension of the dense vectors produced by your embedding model. Here that is 384, the size of the `all-MiniLM-L6-v2` embeddings we inspected above.
- `spec` holds a specification which tells Pinecone how you would like to deploy your index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

There are more configurations available, but this minimal set will get us started.
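
On the choice of `metric`: the `all-MiniLM-L6-v2` pipeline used in this notebook ends in a `Normalize()` layer, so its embeddings have unit length, and for unit-length vectors the dot product equals the cosine similarity. The sketch below illustrates that identity with made-up 3-dimensional vectors rather than real embeddings. The helper functions are written for this example only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale `v` to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for embeddings
a = [0.2, -1.5, 0.7]
b = [1.1, -0.4, 0.3]

unit_a, unit_b = normalize(a), normalize(b)

# After normalization, the dot product and the cosine similarity agree
print(math.isclose(dot(unit_a, unit_b), cosine(a, b)))
```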

In [None]:
from pinecone import ServerlessSpec

index_name = 'semantic-search-fast'

# Check if index already exists (it shouldn't if this is first time running the demo)
if not pc.has_index(name=index_name):
    # If it does not exist, create the index
    pc.create_index(
        name=index_name,
        dimension=384, # dimensionality of MiniLM
        metric='dotproduct',
        spec = ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# Initialize index client
index = pc.Index(name=index_name)

# View index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

## Upserting data into the Pinecone index

In [None]:
from tqdm import tqdm

batch_size = 100

for start in tqdm(range(0, len(dataset.documents), batch_size), "Upserting records batch"):
    batch = dataset.documents.iloc[start:start + batch_size].to_dict(orient="records")
    index.upsert(vectors=batch)

Upserting records batch: 100%|██████████| 800/800 [05:45<00:00,  2.32it/s]
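
The loop above sends the 80,000 rows in slices of 100, which is why the progress bar counts 800 steps. The same slicing can be sketched in plain Python, with stand-in records instead of the real DataFrame rows. `iter_batches` is a hypothetical helper, not a Pinecone or pandas function.

```python
def iter_batches(records, batch_size):
    """Yield consecutive slices of `records`, each with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Stand-in records; the real ones carry 'id', 'values' and 'metadata' keys
records = [{"id": str(i)} for i in range(80_000)]
batches = list(iter_batches(records, batch_size=100))

print(len(batches))      # number of upsert calls the loop would make
print(len(batches[-1]))  # size of the final batch
```

If the row count were not a multiple of the batch size, the final slice would simply be shorter, so no rows are dropped.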


## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question. Let's begin.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Now let's use this model to embed our question and find similar questions.

In [None]:
def find_similar_questions(question):
    # Embed the question into a query vector
    xq = model.encode(question).tolist()

    # Now query Pinecone to find similar questions
    return index.query(vector=xq, top_k=5, include_metadata=True)
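
To build intuition for what a `top_k` query returns, here is a brute-force sketch that scores every stored vector against the query with a dot product and keeps the best `k`. It is a conceptual illustration only: the vectors are made up, `top_k_by_dot_product` is a hypothetical helper, and nothing here describes how Pinecone retrieves matches internally.

```python
import heapq


def top_k_by_dot_product(query, vectors, k):
    """Score every stored vector against `query` and return the k best (id, score) pairs."""
    scored = (
        (vec_id, sum(q * v for q, v in zip(query, values)))
        for vec_id, values in vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up 3-dimensional vectors standing in for 384-dimensional embeddings
toy_index = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.6, 0.8, 0.0],
    "c": [0.0, 0.0, 1.0],
}

print(top_k_by_dot_product([1.0, 0.0, 0.0], toy_index, k=2))
# → [('a', 1.0), ('b', 0.6)]
```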


In [None]:
question = "Which city has the highest population in the world?"
xc = find_similar_questions(question)
xc



{'matches': [{'id': '69331',
              'metadata': {'text': " What's the world's largest city?"},
              'score': 0.785789311,
              'values': []},
             {'id': '69332',
              'metadata': {'text': ' What is the biggest city?'},
              'score': 0.727474,
              'values': []},
             {'id': '84749',
              'metadata': {'text': " What are the world's most advanced "
                                   'cities?'},
              'score': 0.709189653,
              'values': []},
             {'id': '109231',
              'metadata': {'text': ' Where is the most beautiful city in the '
                                   'world?'},
              'score': 0.695605934,
              'values': []},
             {'id': '109230',
              'metadata': {'text': ' What is the greatest, most beautiful city '
                                   'in the world?'},
              'score': 0.657157958,
              'values': []}],
 'namespace

In the returned response `xc` we can see the questions most relevant to our query. There are no exact matches, but the returned questions ask about similar topics. We can reformat this response to make it a little easier to read:

In [None]:
def print_query_results(results):
    for result in results['matches']:
        print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

print_query_results(xc)

0.79:  What's the world's largest city?
0.73:  What is the biggest city?
0.71:  What are the world's most advanced cities?
0.7:  Where is the most beautiful city in the world?
0.66:  What is the greatest, most beautiful city in the world?


These are good results. Let's modify the wording of the query to see whether we still surface similar results.

In [None]:
question2 = "Which metropolis has the highest num of people?"

xc2 = find_similar_questions(question2)
print_query_results(xc2)

0.64:  What is the biggest city?
0.6:  What is the most dangerous city in USA?
0.59:  What's the world's largest city?
0.59:  What is the most dangerous city in USA? Why?
0.58:  What are the world's most advanced cities?


Here the query uses different terms from those in the returned documents. We replaced **"city"** with **"metropolis"** and **"population"** with **"num of people"**.

Despite these different terms and the *lack* of term overlap between the query and the returned documents, the results still surface the questions about the biggest city. This is the power of *semantic search*.
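
The lack of term overlap can be made concrete with a simple word-level measure. The sketch below uses Jaccard similarity over lowercase word tokens, which is a choice made for this illustration and is not something the notebook or Pinecone computes. The helper names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase `text` and return its set of word tokens (letters and apostrophes)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


query = "Which metropolis has the highest num of people?"
for match in ["What is the biggest city?", "What's the world's largest city?"]:
    print(f"{jaccard(query, match):.2f}: {match}")
```

For both matches the only shared word is "the", so a purely keyword-based comparison would see these questions as almost unrelated to the query.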

## Demo Cleanup

You can go ahead and ask more questions above. When you're done, delete the index to save resources:

In [None]:
pc.delete_index(name=index_name)

---