[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search with the Pinecone Vector Database

In this walkthrough we will see how to use Pinecone for semantic search. The walkthrough covers five steps:

1. Create an index (similar to a table in a relational database)
2. Prepare the data
3. Insert the data into the index created in step 1
4. Perform semantic search
5. Delete the index if it is no longer required

To begin, we install the prerequisite libraries:

In [1]:
!pip install -qU \
pinecone-client==3.1.0  \
pinecone-notebooks==0.1.1


---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## 1. Creating an Index

Before preparing the data, we set up the index that will store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [2]:
import os

In [3]:
# initialize connection to pinecone (get API key at app.pinecone.io)
if not os.environ.get("PINECONE_API_KEY"):
  from pinecone_notebooks.colab import Authenticate
  Authenticate()

In [4]:
from pinecone import Pinecone
from google.colab import userdata

os.environ["PINECONE_API_KEY"]=userdata.get("PINECONE_API_KEY")
api_key = os.environ.get("PINECONE_API_KEY")
#print(api_key)

# configure client
pc = Pinecone(api_key=api_key)

In [5]:
from pinecone import ServerlessSpec

cloud = os.environ.get("PINECONE_CLOUD") or 'aws'
region = os.environ.get("PINECONE_REGION") or 'us-east-1'
spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index using the name stored in `index_name`. It's important that the index `dimension` and `metric` parameters match the embedding model used later in this walkthrough, OpenAI's `text-embedding-ada-002`, which produces 1536-dimensional vectors.
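
As a rough illustration of what the `cosine` metric measures, the sketch below computes cosine similarity in plain Python. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions; Pinecone computes the score on the server side, and the vectors in this walkthrough have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        # mirrors why the index dimension must match the embedding model
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# toy vectors, not real embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.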

In [6]:
index_name = 'semantic-search-obama-text-decemer'


In [7]:
import time

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]
# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
  # if does not exist, create index
  pc.create_index(
      index_name,
      dimension=1536,  # dimensionality of text-embedding-ada-002
      metric="cosine",
      spec=spec
  )
  # wait for index to be initialized
  while not pc.describe_index(index_name).status["ready"]:
    time.sleep(1)

#connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

## 2. Data Prep

In [9]:
!ls -lh /content/

total 40K
drwxr-xr-x 1 root root 4.0K Jan  9 14:24 sample_data
-rw-r--r-- 1 root root  35K Jan 12 05:35 sotu_obama.txt


In [10]:
!pip install -qU langchain langchain_community langchain_openai


In [11]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

In [13]:
# Load the document and split it into overlapping chunks (embedding happens in a later step).
raw_documents = TextLoader("/content/sotu_obama.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=500, separator="\n", chunk_overlap=100)
documents = text_splitter.split_documents(raw_documents)
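
To give an intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of greedy chunking with overlap. It is not LangChain's implementation: the helper name `split_with_overlap` and the packing rules are assumptions made for illustration, and a single piece longer than `chunk_size` simply becomes its own oversized chunk.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks (simplified, hypothetical helper)."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # the current chunk is full, so emit it
            chunks.append(separator.join(current))
            # carry trailing pieces forward while they fit in the overlap budget
            carried = []
            for prev in reversed(current):
                if len(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # drop carried pieces that would push the next chunk over chunk_size
            while carried and len(separator.join(carried + [piece])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# tiny demo: four 4-character lines, chunks of at most 9 characters, 4 characters of overlap
demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
```

In the demo each chunk repeats the last line of the previous one, which is the effect the overlap setting is meant to produce: context that straddles a chunk boundary is not lost.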



In [20]:
import os
from google.colab import userdata
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [21]:
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [18]:
len(documents)

76

In [22]:
for chunk in documents[0:2]:
  print(chunk.page_content)
  values=embeddings.embed_query(chunk.page_content)
  print(values)

Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans:
Tonight marks the eighth year that I’ve come here to report on the State of the Union. And for this final one, I’m going to try to make it a little shorter. (Applause.) I know some of you are antsy to get back to Iowa. (Laughter.) I've been there. I'll be shaking hands afterwards if you want some tips. (Laughter.)
[-0.024996774271130562, -0.005424804519861937, 0.01516876183450222, -0.011554380878806114, 0.012828143313527107, 0.008825814351439476, -0.014017850160598755, -0.006423769984394312, -0.008179234340786934, 0.011554380878806114, 0.029070226475596428, 0.011483256705105305, -0.013668696396052837, 0.0055379560217261314, 0.023393256589770317, -0.001952670980244875, 0.026690814644098282, -0.008974527940154076, 0.019849998876452446, -0.020147426053881645, 0.0070800487883389, -0.01509117241948843, 0.010325878858566284, 0.015349804423749447, 0.0025847028009593487, -0.013358338735997677, 0.03199276700615883, -0.01

In [23]:
docs = []
for doc_id, chunk in enumerate(documents, start=1):
  # embed the chunk text with the OpenAI model
  values = embeddings.embed_query(chunk.page_content)
  # build a record with the id, vector, and metadata for this chunk
  docs.append({
      "id": str(doc_id),
      "values": values,
      "metadata": {
          "text": chunk.page_content,
          "source": chunk.metadata["source"],
          "author": "zion",
          "createdBy": "12-15-2024"
      }
  })

In [24]:
len(docs)

76

In [25]:
len(docs[1]["values"])

1536

In [26]:
docs[2]["id"]

'3'
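
The checks above (record count, vector length, string ids) can also be bundled into a small validation helper that runs before upserting. This is only a sketch: `validate_record` is a hypothetical helper, not part of the Pinecone client, and the expected dimension of 1536 is taken from the index created earlier.

```python
def validate_record(record, expected_dim=1536):
    """Return a list of problems found in one upsert record (hypothetical helper)."""
    problems = []
    record_id = record.get("id")
    if not isinstance(record_id, str) or not record_id:
        problems.append("id must be a non-empty string")
    values = record.get("values")
    if not isinstance(values, list) or len(values) != expected_dim:
        problems.append(f"values must be a list of {expected_dim} floats")
    elif not all(isinstance(v, (int, float)) for v in values):
        problems.append("values must contain only numbers")
    if not isinstance(record.get("metadata", {}), dict):
        problems.append("metadata must be a dict")
    return problems


# hand-written example records, not real embeddings
good = {"id": "1", "values": [0.0] * 1536, "metadata": {"text": "example"}}
bad = {"id": 2, "values": [0.0] * 384}
good_problems = validate_record(good)
bad_problems = validate_record(bad)
```

Running something like `[validate_record(d) for d in docs]` before the upsert step would surface a wrong embedding model or a non-string id locally rather than as an API error.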

## 3. Inserting Data into the Index

In [28]:
# upsert the records in batches of 10
batch_size = 10
for i in range(0, len(docs), batch_size):
  batch = docs[i:i + batch_size]
  print(f"upserting batch {i} to {i + batch_size}")
  # the batch itself is not printed: each record holds 1536 floats
  index.upsert(vectors=batch)

upserting batch 0 to 10
[output truncated]
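
The same batching pattern can be factored into a reusable generator, which also works for iterables that do not support slicing. The helper name `iter_batches` is an illustrative assumption, and the demo uses placeholder integers in place of real records.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items (illustrative helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 76 placeholder items, matching the number of chunks in this walkthrough
sizes = [len(batch) for batch in iter_batches(range(76), 10)]
```

With the records from this walkthrough, a loop such as `for batch in iter_batches(docs, 10): index.upsert(vectors=batch)` would send seven batches of 10 and a final batch of 6.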

## 4. Making Queries - Semantic Search

Now that our index is populated we can begin making queries. We are performing a semantic search over passages of the speech, so we embed the query text with the same model used for the chunks (`text-embedding-ada-002`) and search the index with the resulting vector.

In [29]:
# prompt ='''Summarize what was obama said about schools based on {context}'''.format(context=text))
# print(prompt)

In [30]:
query = "summarize was obama said about schools?"

# create the query vector with the same embedding model used for the chunks
xq = embeddings.embed_query(query)

In [32]:
# now query
xc = index.query(vector=xq, top_k=3, include_metadata=True)
xc

{'matches': [{'id': '17',
              'metadata': {'author': 'zion',
                           'createdBy': '12-15-2024',
                           'source': '/content/sotu_obama.txt',
                           'text': 'We agree that real opportunity requires '
                                   'every American to get the education and '
                                   'training they need to land a good-paying '
                                   'job. The bipartisan reform of No Child '
                                   'Left Behind was an important start, and '
                                   'together, we’ve increased early childhood '
                                   'education, lifted high school graduation '
                                   'rates to new highs, boosted graduates in '
                                   'fields like engineering. In the coming '
                                   'years, we should build on that progress, '
                           

In the returned response `xc` we can see the passages most relevant to our query. None of them matches the query word for word, but the returned passages cover the same topic. We can reformat this response to be a little easier to read:

In [33]:
for result in xc["matches"]:
  print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.82: We agree that real opportunity requires every American to get the education and training they need to land a good-paying job. The bipartisan reform of No Child Left Behind was an important start, and together, we’ve increased early childhood education, lifted high school graduation rates to new highs, boosted graduates in fields like engineering. In the coming years, we should build on that progress, by providing Pre-K for all and -- (applause) -- offering every student the hands-on computer science and math classes that make them job-ready on day one. We should recruit and support more great teachers for our kids. (Applause.)
0.79: And over the past seven years, we’ve nurtured that spirit. We’ve protected an open Internet, and taken bold new steps to get more students and low-income Americans online. (Applause.) We’ve launched next-generation manufacturing hubs, and online tools that give an entrepreneur everything he or she needs to start a business in a single day. But we can 

In [34]:
query = "which metropolis has the highest number of people?"
# create the query vector with the same embedding model used for the chunks
xq = embeddings.embed_query(query)

#now query
xc = index.query(vector=xq, top_k=2, include_metadata=True)
for result in xc["matches"]:
  print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.75: In fact, it turns out many of our best corporate citizens are also our most creative. And this brings me to the second big question we as a country have to answer: How do we reignite that spirit of innovation to meet our biggest challenges?
0.74: Let me start with the economy, and a basic fact: The United States of America, right now, has the strongest, most durable economy in the world. (Applause.) We’re in the middle of the longest streak of private sector job creation in history. (Applause.) More than 14 million new jobs, the strongest two years of job growth since the ‘90s, an unemployment rate cut in half. Our auto industry just had its best year ever. (Applause.) That's just part of a manufacturing surge that's created nearly 900,000 new jobs in the past six years. And we’ve done all this while cutting our deficits by almost three-quarters. (Applause.)
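
The second query has nothing to do with the speech, yet the index still returns its nearest neighbours, here with scores of 0.75 and 0.74. One possible way to handle this is to discard matches below a minimum score. The sketch below assumes a response shaped like the `xc` dictionary above; the helper name `filter_matches` and the 0.78 cutoff are illustrative assumptions, and a suitable cutoff would need tuning for each model and dataset.

```python
def filter_matches(response, min_score):
    """Keep only matches whose similarity score is at least min_score (hypothetical helper)."""
    return [match for match in response["matches"] if match["score"] >= min_score]


# hand-written stand-ins for query responses, using the rounded scores printed above
on_topic = {"matches": [{"score": 0.82}, {"score": 0.79}]}
off_topic = {"matches": [{"score": 0.75}, {"score": 0.74}]}

kept_on_topic = filter_matches(on_topic, min_score=0.78)
kept_off_topic = filter_matches(off_topic, min_score=0.78)
```

With this cutoff both matches for the schools query are kept and both matches for the unrelated query are dropped, although the gap between the two sets of scores is small.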


## 5. Deleting the Index if Not Required

In [35]:
# delete the whole index to free resources (uncomment to run)
# pc.delete_index(index_name)