[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search

In this walkthrough we will see how to use Pinecone for semantic search. To begin, install the prerequisite libraries:

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install -q \
  numpy==1.26.4 \
  pinecone-client==3.1.0 \
  pinecone-datasets==0.7.0 \
  sentence-transformers==3.3.0 \
  pinecone-notebooks==0.1.1

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---
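
The pinned versions matter here, because compiled extensions built against NumPy 1.x can fail to import under NumPy 2.x. It may be worth confirming what actually got installed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `major` are illustrative and not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major(version_string):
    """Extract the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


numpy_version = installed_version("numpy")
if numpy_version is None:
    print("numpy is not installed")
elif major(numpy_version) >= 2:
    print(f"numpy {numpy_version} found; the pin above expects 1.26.4")
else:
    print(f"numpy {numpy_version} looks compatible with the pin above")
```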

## Data Download

In this notebook we skip the data preparation steps, which can be very time-consuming, and go straight to the prebuilt dataset from *Pinecone Datasets*. If you'd rather see how the preparation is done, please refer to [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb).

Let's go ahead and download the dataset.

Quora is an online question-and-answer platform where users can ask questions and provide answers.
The prebuilt dataset contains vector representations (embeddings) of questions, created using the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.

In [6]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')
# drop the original metadata column; the blob column is used as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# keep 80K rows of the dataset: rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()


In [None]:
print(len(dataset))

80000
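
The order of the two `drop` calls in the data cell matters: positions in `dataset.documents.index` are re-evaluated after each drop, so removing the tail first leaves the head positions unchanged. Here is a toy sketch of the same idea on a plain Python list, with scaled-down sizes rather than the real dataset.

```python
# Toy stand-in for the row positions of a 10-row table.
rows = list(range(10))

# Keep rows 4..7 (the analogue of keeping rows 240K..320K).
del rows[8:]   # drop the tail first; positions 0..7 are unaffected
del rows[:4]   # then drop the head
print(rows)       # [4, 5, 6, 7]
print(len(rows))  # 4

# Dropping the head first shifts every remaining position,
# so the same tail slice no longer removes the intended rows.
wrong = list(range(10))
del wrong[:4]  # remaining values are 4..9 at positions 0..5
del wrong[8:]  # no-op: there is no position 8 any more
print(wrong)      # [4, 5, 6, 7, 8, 9]
```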


## Creating an Index

Now that the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [None]:
import os

# initialize connection to pinecone (get API key at app.pinecone.io)
if not os.environ.get("PINECONE_API_KEY"):
    from pinecone_notebooks.colab import Authenticate
    Authenticate()

In [None]:
from pinecone import Pinecone

api_key = os.environ.get("PINECONE_API_KEY")

# configure client
pc = Pinecone(api_key=api_key)
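
If `PINECONE_API_KEY` is not set, `os.environ.get` returns `None`. A small guard can turn that into an explicit error message before the client is configured. `require_api_key` is a hypothetical helper, not part of the Pinecone client.

```python
import os


def require_api_key(env=os.environ, name="PINECONE_API_KEY"):
    """Return the API key from the environment or fail with a clear message."""
    key = env.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; create a key at app.pinecone.io and export it"
        )
    return key
```

With this in place, `api_key = require_api_key()` could stand in for the plain `os.environ.get` lookup.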

Next we set up our index specification, which defines the cloud provider and region where the index will be deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index called `semantic-search-fast`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [None]:
index_name = 'semantic-search-fast'

In [None]:
import time

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of minilm
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
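
To make the `metric='cosine'` choice concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths. The embedding model used later ends in a `Normalize()` layer, and for unit-length vectors the score reduces to a plain dot product. The following is a pure-Python sketch on toy 3-dimensional vectors (the real ones have 384 dimensions); `cosine_similarity` and `normalize` are illustrative helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]

# For unit-length vectors, cosine similarity equals the dot product.
ua, ub = normalize(a), normalize(b)
dot_of_units = sum(x * y for x, y in zip(ua, ub))
print(math.isclose(cosine_similarity(a, b), dot_of_units))  # True
```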

Upsert the data:

In [None]:
from tqdm.auto import tqdm

for batch in tqdm(dataset.iter_documents(batch_size=500), total=160):
    index.upsert(batch)

100%|██████████| 160/160 [02:46<00:00,  1.04s/it]
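
The `total=160` passed to `tqdm` is simply the number of batches: 80,000 rows at 500 rows per batch. Here is a sketch of that arithmetic and of generic batching, using a hypothetical `batched` helper on stand-in data rather than `iter_documents`.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_rows, batch_size = 80_000, 500
n_batches = math.ceil(n_rows / batch_size)
print(n_batches)  # 160

# The helper produces the same number of batches on stand-in data.
batches = list(batched(list(range(n_rows)), batch_size))
print(len(batches), len(batches[0]), len(batches[-1]))  # 160 500 500
```

Computing the total this way avoids a hard-coded value if the row range or batch size changes.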


## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question. Let's begin.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Now let's query.

In [None]:
query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '69331',
              'metadata': {'text': " What's the world's largest city?"},
              'score': 0.78565526,
              'values': []},
             {'id': '69332',
              'metadata': {'text': ' What is the biggest city?'},
              'score': 0.727139473,
              'values': []},
             {'id': '84749',
              'metadata': {'text': " What are the world's most advanced "
                                   'cities?'},
              'score': 0.709211528,
              'values': []},
             {'id': '109231',
              'metadata': {'text': ' Where is the most beautiful city in the '
                                   'world?'},
              'score': 0.696055,
              'values': []},
             {'id': '109230',
              'metadata': {'text': ' What is the greatest, most beautiful city '
                                   'in the world?'},
              'score': 0.657444596,
              'values': []}],
 'namespace'

In the returned response `xc` we can see the questions most relevant to our query. There are no exact matches, but the returned questions ask about similar topics. We can reformat this response to be a little easier to read:

In [None]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.79:  What's the world's largest city?
0.73:  What is the biggest city?
0.71:  What are the world's most advanced cities?
0.7:  Where is the most beautiful city in the world?
0.66:  What is the greatest, most beautiful city in the world?
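
`round(score, 2)` drops trailing zeros, which is why one line above prints `0.7` rather than `0.70`. If aligned output matters, a fixed-precision format spec is one option. The sketch below uses a hand-written dictionary shaped like the response shown earlier (two matches copied from it), not a live query; `format_matches` is an illustrative helper.

```python
# Stand-in for the query response, using two matches from the output above.
response = {
    "matches": [
        {"id": "69331", "score": 0.78565526,
         "metadata": {"text": " What's the world's largest city?"}},
        {"id": "109231", "score": 0.696055,
         "metadata": {"text": " Where is the most beautiful city in the world?"}},
    ]
}


def format_matches(matches):
    """Render each match as '<score to 2 decimals>: <text without padding>'."""
    return [
        f"{m['score']:.2f}: {m['metadata']['text'].strip()}"
        for m in matches
    ]


for line in format_matches(response["matches"]):
    print(line)
# 0.79: What's the world's largest city?
# 0.70: Where is the most beautiful city in the world?
```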


These are good results. Let's modify the wording of the query to see if we still surface similar results.

In [None]:
query = "which metropolis has the highest number of people?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.64:  What is the biggest city?
0.6:  What is the most dangerous city in USA?
0.59:  What's the world's largest city?
0.59:  What is the most dangerous city in USA? Why?
0.58:  What are the world's most advanced cities?


Here the query uses different terms from those in the returned documents. We replaced **"city"** with **"metropolis"** and **"highest population"** with **"highest number of people"**.

Despite these different terms and the *lack* of term overlap between the query and the returned documents, we still get relevant results. This is the power of *semantic search*.
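
One way to see how little the query and the top hit share at the word level is a token-overlap measure such as Jaccard similarity. This is an illustrative check with a deliberately simple tokenizer, not something the notebook or Pinecone computes; `tokens` and `jaccard` are hypothetical helpers.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of tokens common to both sets, relative to all tokens in either."""
    return len(a & b) / len(a | b)


query = "which metropolis has the highest number of people?"
hit = "What is the biggest city?"

q, h = tokens(query), tokens(hit)
print(sorted(q & h))            # ['the']
print(round(jaccard(q, h), 3))  # 0.083
```

The only shared token is the stop word "the", yet this question was the top match for the query, with a score of 0.64.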

You can go ahead and ask more questions above. When you're done, delete the index to save resources:

In [None]:
pc.delete_index(index_name)
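
Re-running the cleanup cell after the index is already gone would likely raise an error, so a guarded delete can be convenient. The sketch below reuses the same `list_indexes` pattern as the index-creation cell. `delete_index_if_exists` is a hypothetical helper, and it assumes `pc` behaves like the client configured earlier.

```python
def delete_index_if_exists(pc, index_name):
    """Delete `index_name` via `pc` if it is listed; return True if deleted."""
    existing = [index_info["name"] for index_info in pc.list_indexes()]
    if index_name in existing:
        pc.delete_index(index_name)
        return True
    return False
```

Calling `delete_index_if_exists(pc, index_name)` a second time would then return `False` instead of failing.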

---