[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search

In this walkthrough we use Pinecone for semantic search over the English sentences of an English-Spanish translation dataset. To begin, install the required libraries:

In [1]:
!pip install -qU \
  pinecone==6.0.2 \
  pinecone-notebooks==0.1.1 \
  datasets==3.5.1

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Pinecone Setup

In [2]:
import os
from getpass import getpass

def get_pinecone_api_key():
    """
    Get Pinecone API key from environment variable or prompt user for input.
    Returns the API key as a string.
    """
    api_key = os.environ.get("PINECONE_API_KEY")
    
    if not api_key:
        try:
            # Try Colab authentication if available
            from pinecone_notebooks.colab import Authenticate
            Authenticate()
            # If successful, key will now be in environment
            api_key = os.environ.get("PINECONE_API_KEY")
        except ImportError:
            # Colab helper could not be imported (not running in Colab): prompt user
            print("Pinecone API key not found in environment.")
            api_key = getpass("Please enter your Pinecone API key: ")
            # Save to environment for future use in session
            os.environ["PINECONE_API_KEY"] = api_key
    
    return api_key

# Initialize Pinecone client with API key
api_key = get_pinecone_api_key()

Pinecone API key not found in environment.


In [3]:
from pinecone import Pinecone

# Initialize client
pc = Pinecone(api_key=api_key)


### Creating a Pinecone Index

When creating the index we need to define several configuration properties. Because we call `create_index_for_model`, the index uses integrated embedding: Pinecone hosts the embedding model and converts text to vectors for us.

- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on.
- `cloud` and `region` tell Pinecone where to deploy the index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).
- `embed` configures the embedding step. `model` selects the hosted embedding model (here `multilingual-e5-large`), and `field_map` tells Pinecone that the text to embed is stored in each record's `chunk_text` field.

We do not set `dimension` or `metric` ourselves here. As the index stats below show, this index ends up with 1024 dimensions and the `cosine` metric.

There are more configurations available, but this minimal set will get us started.

In [4]:

index_name = "semantic-search"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model":"multilingual-e5-large",
            "field_map":{"text": "chunk_text"}
        }
    )

# Initialize index client
index = pc.Index(name=index_name)

# View index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'english-sentences': {'vector_count': 416}},
 'total_vector_count': 416,
 'vector_type': 'dense'}
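
The `'metric': 'cosine'` entry in the stats above means that query scores are cosine similarities between embedding vectors. As a rough illustration only, here is a minimal pure-Python sketch of that calculation on tiny made-up vectors. Real `multilingual-e5-large` embeddings have 1024 dimensions, and `cosine_similarity` is a hypothetical helper, not part of the Pinecone client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # scaled copy of query_vec
orthogonal = [0.0, 0.0, 5.0]       # shares no component with query_vec

print(cosine_similarity(query_vec, same_direction))  # close to 1: same direction
print(cosine_similarity(query_vec, orthogonal))      # 0: unrelated directions
```

Vectors pointing the same way score near 1 regardless of their length, which is why cosine is a common choice for comparing text embeddings.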

## Creating our dataset

In [5]:
from datasets import load_dataset
# specify that we want the english-spanish translation pairs
tatoeba = load_dataset("Helsinki-NLP/tatoeba", lang1="en", lang2="es", trust_remote_code=True, split="train")

In [6]:
keywords = ["park"]
namespace = "english-sentences"

def simple_keyword_filter(sentence, keywords):
    # return True if the sentence contains any of the keywords
    for keyword in keywords:
        if keyword in sentence:
            return True
    return False

def transform_dataset_for_pinecone(dataset, use_filter=True):
    # Feel free to adjust this code to simulate a larger search!

    if use_filter:
        # filter for a list of keywords by sentence, helpful for building intuition on semantic search
        translation_pairs = dataset.filter(
            lambda x: simple_keyword_filter(sentence=x["translation"]["en"], keywords=keywords)
        )
    else:
        # use the full 200k+ dataset
        translation_pairs = dataset

    # flatten and shuffle for ease of use
    translation_pairs = translation_pairs.flatten()
    translation_pairs = translation_pairs.shuffle(seed=1)

    english_sentences = translation_pairs.rename_column("translation.en", "text").remove_columns("translation.es")

    # add lang column to indicate embedding origin
    english_sentences = english_sentences.add_column("lang", ["en"]*len(english_sentences))


    # convert to record format
    records = []

    for idx, sentence in enumerate(english_sentences):
        records.append(
            {
                "id": str(idx),
                "chunk_text": sentence["text"],
                "lang": sentence["lang"]
            }
        )

    return records


records = transform_dataset_for_pinecone(tatoeba)
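
Note that `simple_keyword_filter` relies on Python's `in` operator, which is a case-sensitive substring test, so the keyword `park` also matches words such as `parking` and misses a capitalised `Park`. The sketch below reproduces that behaviour on a few made-up sentences and contrasts it with a whole-word variant. `whole_word_filter` is a hypothetical alternative for comparison, not something this notebook uses.

```python
import re

def simple_keyword_filter(sentence, keywords):
    # Same logic as the notebook: case-sensitive substring match
    return any(keyword in sentence for keyword in keywords)

def whole_word_filter(sentence, keywords):
    # Hypothetical alternative: case-insensitive, whole words only
    return any(
        re.search(rf"\b{re.escape(keyword)}\b", sentence, flags=re.IGNORECASE)
        for keyword in keywords
    )

# Made-up sentences, for illustration only
samples = [
    "Let's walk in the park.",
    "The parking lot is full.",
    "Park the car here.",
]

for s in samples:
    print(s, simple_keyword_filter(s, ["park"]), whole_word_filter(s, ["park"]))
```

For a demo dataset the substring match is probably good enough; the difference mainly matters if you want the filtered set to contain only sentences about parks.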

## Upserting data into the Pinecone index

In [7]:
from tqdm import tqdm

batch_size = 96

for start in tqdm(range(0, len(records), batch_size), desc="Upserting records batch"):
    index.upsert_records(records=records[start:start+batch_size], namespace=namespace)

Upserting records batch: 100%|██████████| 5/5 [00:04<00:00,  1.08it/s]
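
The loop above slices `records` into consecutive chunks of at most `batch_size` items. A minimal sketch of that slicing on its own, using placeholder records rather than the real dataset and a hypothetical `batched` helper, shows how a record count of 416 (the figure reported in the index stats earlier) yields the 5 batches seen in the progress bar:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder records in the same shape the notebook builds (not real data)
placeholder_records = [
    {"id": str(i), "chunk_text": f"sentence {i}", "lang": "en"} for i in range(416)
]

batches = list(batched(placeholder_records, batch_size=96))
print(len(batches))               # 5
print([len(b) for b in batches])  # [96, 96, 96, 96, 32]
```

The last batch is simply whatever remains, so no padding or special casing is needed.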


## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar sentences*, so we search with another sentence. The index embeds the query text for us, so we pass it as plain text in `inputs`. Let's begin.

In [8]:
search_query = "I want to go to the park and relax"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

for result in results["result"]["hits"]:
    print(f'Sentence: {result["fields"]["chunk_text"]} Semantic Similarity Score: {result["_score"]}\n')

Sentence: I have the afternoon off today, so I plan to go to the park, sit under a tree and read a book. Semantic Similarity Score: 0.8618664741516113

Sentence: Let's go to the park where it's not noisy. Semantic Similarity Score: 0.8589204549789429

Sentence: Let's go to the park where it is not noisy. Semantic Similarity Score: 0.8587626814842224

Sentence: Let's go to the park where it isn't noisy. Semantic Similarity Score: 0.858618438243866

Sentence: I go to the park. Semantic Similarity Score: 0.8583135604858398

Sentence: I'll go to the park. Semantic Similarity Score: 0.8503443598747253

Sentence: I like going for a walk in the park. Semantic Similarity Score: 0.8475707769393921

Sentence: Let's take a walk in the park. Semantic Similarity Score: 0.8399624228477478

Sentence: Who wants to go to the park? Semantic Similarity Score: 0.83930504322052

Sentence: Do you like to walk in the park? Semantic Similarity Score: 0.8344495296478271
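
The response is a nested mapping, so the hits can be post-processed with ordinary Python. As a sketch, the snippet below applies a score cutoff to a hand-built dictionary that mimics only the parts of the response used above (`result`, `hits`, `_score`, and `fields`), with two entries copied from the printed output. `filter_hits` and the `0.85` threshold are illustrative assumptions, not part of the Pinecone client.

```python
def filter_hits(results, min_score):
    """Return (text, score) pairs for hits scoring at least `min_score`."""
    return [
        (hit["fields"]["chunk_text"], hit["_score"])
        for hit in results["result"]["hits"]
        if hit["_score"] >= min_score
    ]

# Hand-built stand-in for a search response, using scores from the output above
mock_results = {
    "result": {
        "hits": [
            {"_score": 0.8583135604858398, "fields": {"chunk_text": "I go to the park."}},
            {"_score": 0.8344495296478271, "fields": {"chunk_text": "Do you like to walk in the park?"}},
        ]
    }
}

for text, score in filter_hits(mock_results, min_score=0.85):
    print(f"{score:.3f}  {text}")
```

A sensible cutoff depends on the embedding model and the data, so treat any fixed threshold as something to tune rather than a recommended value.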



## Demo Cleanup

You can go ahead and try more queries above. When you're done, uncomment the line below and run it to delete the index and save resources:

In [9]:
#pc.delete_index(name=index_name)

---