# Semantic Search with Pinecone and OpenAI

In this notebook we work through a demo of building a state-of-the-art semantic search tool with OpenAI vector embeddings and the Pinecone vector database.

## Setup

We first need to set up our environment and retrieve API keys for OpenAI and Pinecone. Let's start with the environment: we need Hugging Face *Datasets* for our data, plus the OpenAI and Pinecone clients.

In [1]:
!pip install pinecone-client openai datasets

Collecting pinecone-client
  Downloading pinecone_client-2.0.8-py3-none-any.whl (149 kB)

### Creating Embeddings

Then we initialize our connection to OpenAI Embeddings *and* Pinecone vector DB. Sign up for an API key over at [OpenAI](https://beta.openai.com/signup) and [Pinecone](https://app.pinecone.io).
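
Pasting keys directly into a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the keys are stored in environment variables named `OPENAI_API_KEY` and `PINECONE_API_KEY` (these names and the `read_key` helper are illustrative, not from the original notebook), is to read them at runtime:

```python
import os


def read_key(name, env=os.environ):
    """Return the value of an environment variable, or raise if unset or empty."""
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value


# Hypothetical usage (variable names are an assumption, not from the notebook):
# openai.api_key = read_key("OPENAI_API_KEY")
# pinecone.init(api_key=read_key("PINECONE_API_KEY"), environment="us-west1-gcp")
```

The `env` parameter defaults to the real environment and exists only so the helper can be exercised with a plain dictionary.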

In [2]:
import openai
import os

# get this from top-right dropdown on OpenAI under organization > settings
openai.organization = "<<YOUR_ORG_KEY>>"
# get API key from top-right dropdown on OpenAI website
openai.api_key = "<<YOUR_API_KEY>>"

openai.Engine.list()  # check we have authenticated

<OpenAIObject list at 0x7f1d607d6290> JSON: {
  "data": [
    {
      "created": null,
      "id": "ada",
      "max_replicas": null,
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true,
      "replicas": null
    },
    {
      "created": null,
      "id": "ada-code-search-code",
      "max_replicas": null,
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true,
      "replicas": null
    },
    {
      "created": null,
      "id": "ada-code-search-text",
      "max_replicas": null,
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true,
      "replicas": null
    },
    {
      "created": null,
      "id": "ada-instruct-beta",
      "max_replicas": null,
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true,
      "replicas": null
    },
    {
      "created": null,
      "id": "ada-search-document",
      "max

We can now create embeddings with the OpenAI Babbage similarity model like so:

In [3]:
MODEL = "text-similarity-babbage-001"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=MODEL
)
res

<OpenAIObject list at 0x7f1d60564170> JSON: {
  "data": [
    {
      "embedding": [
        -0.002167685655876994,
        -0.0007424376090057194,
        0.00584240909665823,
        -0.030613547191023827,
        0.04518579691648483,
        0.001435273909009993,
        0.004010323900729418,
        0.009472807869315147,
        0.02254224196076393,
        -0.05021769925951958,
        -0.009861175902187824,
        -0.0039997706189751625,
        0.0038499110378324986,
        0.016640733927488327,
        0.002847330179065466,
        -0.005914172623306513,
        0.020076949149370193,
        -0.013947485014796257,
        0.034277718514204025,
        0.01867544651031494,
        0.14899832010269165,
        0.008594757877290249,
        -0.015872441232204437,
        -0.008295038715004921,
        -0.008480779826641083,
        -0.010046917013823986,
        -0.03657415509223938,
        0.003497424768283963,
        -0.0351557657122612,
        -0.007991098798811436,
      

In [4]:
print(f"vector 0: {len(res['data'][0]['embedding'])}\nvector 1: {len(res['data'][1]['embedding'])}")

vector 0: 2048
vector 1: 2048


In [5]:
# we can extract embeddings to a list
embeds = [record['embedding'] for record in res['data']]
len(embeds)

2

Next, we initialize our index to store vector embeddings with Pinecone.

In [6]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="<<YOUR_API_KEY>>",
    environment="us-west1-gcp"
)
# check if 'openai' index already exists (only create index if not)
if 'openai' not in pinecone.list_indexes():
    pinecone.create_index('openai', dimension=len(embeds[0]))
# connect to index
index = pinecone.Index('openai')
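
The index is created with `dimension=len(embeds[0])`, which is 2048 for this model, so every vector upserted later should have that same length. A small guard along these lines (the helper name `check_dimensions` is hypothetical, not part of the Pinecone client) can catch a mismatch before any request is sent:

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has length {len(vector)}, expected {expected_dim}"
            )
    return True


# toy 3-dimensional vectors standing in for the 2048-dimensional embeddings
check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3)
```

In the notebook this could be called as `check_dimensions(embeds, 2048)` before upserting.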

## Populating the Index

Now we take the first 1,000 questions from the TREC dataset:

In [7]:
from datasets import load_dataset

# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
trec

Dataset({
    features: ['label-coarse', 'label-fine', 'text'],
    num_rows: 1000
})

In [8]:
trec[0]

{'label-coarse': 0,
 'label-fine': 0,
 'text': 'How did serfdom develop in and then leave Russia ?'}

Then we create a vector embedding for each phrase using OpenAI, and `upsert` the ID, vector embedding, and original text for each phrase to Pinecone.

In [10]:
from tqdm.auto import tqdm

batch_size = 32  # process everything in batches of 32
for i in tqdm(range(0, len(trec['text']), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(trec['text']))
    # get batch of lines and IDs
    lines_batch = trec['text'][i:i_end]
    # use each row's position in the dataset as its unique ID
    ids_batch = [str(n) for n in range(i, i_end)]
    # create embeddings
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # prep metadata and upsert batch
    meta = [{'text': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))
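
The loop above walks the dataset in slices of `batch_size`, with `min` clamping the final slice so it does not run past the end. The arithmetic can be sketched in isolation (the helper name `batch_ranges` is an illustrative assumption, and it makes no API calls):

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(1000, 32))
print(len(ranges))   # 32
print(ranges[0])     # (0, 32)
print(ranges[-1])    # (992, 1000)
```

With 1,000 rows and a batch size of 32, there are 31 full batches and one final batch of 8 rows.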


---

## Querying

With our data indexed, we're now ready to move on to performing searches. This follows a similar process to indexing. We start with a text `query` that we would like to use to find similar sentences. As before, we encode it with OpenAI's Babbage text similarity model to create a *query vector* `xq`. We then use `xq` to query the Pinecone index.

In [11]:
query = "What caused the 1929 Great Depression?"

xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

Now we query the index:

In [12]:
res = index.query([xq], top_k=5, include_metadata=True)
res

{'results': [{'matches': [{'id': '932',
                           'metadata': {'text': 'Why did the world enter a '
                                                'global depression in 1929 ?'},
                           'score': 0.948046744,
                           'values': []},
                          {'id': '787',
                           'metadata': {'text': 'When was `` the Great '
                                                "Depression '' ?"},
                           'score': 0.865037203,
                           'values': []},
                          {'id': '400',
                           'metadata': {'text': 'What crop failure caused the '
                                                'Irish Famine ?'},
                           'score': 0.86109376,
                           'values': []},
                          {'id': '481',
                           'metadata': {'text': 'What caused the Lynmouth '
                                               

The response from Pinecone includes our original text in the `metadata` field. Let's print the `top_k` most similar questions and their respective similarity scores.

In [13]:
for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.95: Why did the world enter a global depression in 1929 ?
0.87: When was `` the Great Depression '' ?
0.86: What crop failure caused the Irish Famine ?
0.82: What caused the Lynmouth floods ?
0.79: What caused Harry Houdini 's death ?
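
Each score above measures how close a stored question's vector is to the query vector. As a conceptual stand-in only, and assuming the index scores by cosine similarity (the notebook does not state the metric), an exhaustive top-k search over a handful of toy vectors could look like this. The names `cosine` and `top_k` are illustrative, and this says nothing about how Pinecone implements search internally:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=5):
    """Score every stored vector against the query; return the k best (id, score) pairs."""
    scored = [(vec_id, cosine(query_vec, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy 2-dimensional vectors standing in for the 2048-dimensional embeddings
stored = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
for vec_id, score in top_k([1.0, 0.1], stored, k=2):
    print(f"{score:.2f}: {vec_id}")
```

Here the query points almost along the first axis, so `a` ranks first and `c` second, while `b` is dropped.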


Looks good. Let's make it harder by replacing *"depression"* with the incorrect term *"recession"*:

In [16]:
query = "What was the cause of the major recession in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.92: Why did the world enter a global depression in 1929 ?
0.85: What crop failure caused the Irish Famine ?
0.83: When was `` the Great Depression '' ?
0.82: What are some of the significant historical events of the 1990s ?
0.82: What is considered the costliest disaster the insurance industry has ever faced ?


And again, this time avoiding both terms:

In [17]:
query = "Why was there a long-term economic downturn in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.93: Why did the world enter a global depression in 1929 ?
0.83: What crop failure caused the Irish Famine ?
0.82: When was `` the Great Depression '' ?
0.82: How did serfdom develop in and then leave Russia ?
0.80: Why were people recruited for the Vietnam War ?


Looks great. In all three cases the top result is the 1929 depression question, so our semantic search pipeline captures the meaning of each query rather than its exact wording and returns the most semantically similar of the indexed questions.
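
The lower-ranked matches in these outputs are only loosely related to the query. One possible refinement, not part of the original notebook, is to drop matches below a minimum score. The sketch below assumes match dictionaries shaped like the entries of `res['results'][0]['matches']`; the helper name `filter_matches` and the 0.9 cutoff are arbitrary choices for illustration:

```python
def filter_matches(matches, min_score):
    """Keep only matches whose 'score' is at least min_score."""
    return [m for m in matches if m["score"] >= min_score]


# scores are the rounded values printed for the last query above
matches = [
    {"score": 0.93, "metadata": {"text": "Why did the world enter a global depression in 1929 ?"}},
    {"score": 0.83, "metadata": {"text": "What crop failure caused the Irish Famine ?"}},
    {"score": 0.82, "metadata": {"text": "When was `` the Great Depression '' ?"}},
]
for match in filter_matches(matches, min_score=0.9):
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```

A suitable cutoff depends on the model and the data, so it would need tuning rather than being copied from this example.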

---