[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/cohere/semantic_search_trec.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/cohere/semantic_search_trec.ipynb)

# Semantic Search with Cohere and Pinecone

In this notebook we will demonstrate how to perform semantic search for identifying similar or duplicate questions using Cohere and Pinecone.

![Steps in semantic search process](https://raw.githubusercontent.com/pinecone-io/examples/master/integrations/cohere/assets/index_query_pinecone_cohere.png)

## Setup

We first need to set up our environment and retrieve API keys for Cohere and Pinecone. Let's start with the environment: we need Hugging Face *Datasets* for our data, plus the Cohere and Pinecone clients:

In [None]:
!pip install cohere pinecone-client datasets

Collecting cohere
  Downloading cohere-1.3.2.tar.gz (8.0 kB)
Collecting pinecone-client
  Downloading pinecone_client-2.0.8-py3-none-any.whl (149 kB)
Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)
Collecting dnspython>=2.0.0
  Downloading dnspython-2.2.1-py3-none-any.whl (269 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting loguru>=0.5.0
  Downloading loguru-0.6.0-py3-none-any.whl (58 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinu

Next, sign up for an API key at [Cohere](https://os.cohere.ai/) and [Pinecone](https://app.pinecone.io). You can enter the keys directly in the cell below.

In [None]:
COHERE_KEY = '<<YOUR-KEY-HERE>>'
PINECONE_KEY = '<<YOUR-KEY-HERE>>'
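
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative is shown below. It assumes the keys are exported as environment variables named `COHERE_API_KEY` and `PINECONE_API_KEY`; both names, and the `load_key` helper itself, are hypothetical choices for illustration and are not required by either service.

```python
import os
from getpass import getpass


def load_key(env_var: str) -> str:
    """Return the key stored in `env_var`, prompting for it if unset."""
    key = os.environ.get(env_var)
    if not key:
        # fall back to an interactive prompt that does not echo the input
        key = getpass(f"Enter {env_var}: ")
    return key


# hypothetical variable names; adjust them to match your own environment
# COHERE_KEY = load_key("COHERE_API_KEY")
# PINECONE_KEY = load_key("PINECONE_API_KEY")
```

Uncommenting the last two lines would replace the placeholder assignments in the cell above.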

## Create Embeddings

We can create sentence embeddings easily using Cohere. First, we import the Cohere client and initialize our connection using the API key we retrieved earlier.

In [None]:
import cohere

co = cohere.Client(COHERE_KEY)

We will load the **T**ext **RE**trieval **C**onference (TREC) question classification dataset which contains 5.5K labeled questions. We will take the first 1K samples for this demo, but this can be scaled to millions or even billions of samples.

In [None]:
from datasets import load_dataset

# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
trec

Dataset({
    features: ['label-coarse', 'label-fine', 'text'],
    num_rows: 1000
})

In [None]:
trec[0]

{'label-coarse': 0,
 'label-fine': 0,
 'text': 'How did serfdom develop in and then leave Russia ?'}

We can then pass these questions to Cohere to create embeddings.

In [None]:
embeds = co.embed(
    texts=trec['text'],
    model='small',
    truncate='LEFT'
).embeddings

We can check the dimensionality of the returned vectors by converting them from a list of lists to a NumPy array. We will need the embedding dimensionality later, when initializing our Pinecone index.

In [None]:
import numpy as np

shape = np.array(embeds).shape
shape

(1000, 1024)

Here we can see the `1024` embedding dimensionality produced by Cohere's small model, and the `1000` samples we built embeddings for.
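
The single `co.embed` call above sends all 1,000 questions in one request. For larger datasets it may be safer to embed in smaller chunks. The sketch below shows one way to do that; `embed_in_batches` and its `embed_fn` argument are hypothetical helpers introduced here, and the default batch size of 64 is an arbitrary assumption rather than a documented Cohere limit.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed `texts` chunk by chunk and return all vectors in input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # slice out the next chunk and embed it with the supplied function
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors


# usage with the client created earlier (left commented, as it needs a valid key):
# embeds = embed_in_batches(
#     trec['text'],
#     lambda batch: co.embed(texts=batch, model='small', truncate='LEFT').embeddings,
# )
```

Passing the embedding call in as a function keeps the batching logic independent of the Cohere client, so it can be tried out with a stub before spending API calls.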

## Storing the Embeddings

Now that we have our embeddings, we can move on to indexing them in the Pinecone vector database. Again, this is very simple: we initialize our connection to Pinecone and then create a new index for storing the embeddings, specifying the cosine similarity metric to align with Cohere's embeddings.

In [None]:
import pinecone

# initialize the connection using the module-level API of pinecone-client v2
# (newer client releases replace this with a `Pinecone` class)
pinecone.init(
    api_key=PINECONE_KEY,
    environment="YOUR_ENV"  # find next to API key in console
)

index_name = 'cohere-pinecone-trec'

# if the index does not exist, we create it
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=shape[1],
        metric='cosine'
    )

# connect to index
index = pinecone.Index(index_name)
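
To make the `cosine` metric concrete, here is a small NumPy sketch of the cosine similarity between two vectors. It is an illustration of the standard formula only and says nothing about how Pinecone computes scores internally; the function name `cosine_similarity` is our own.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between vectors `a` and `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the two vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# vectors pointing the same way score 1, orthogonal vectors score 0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because the score depends only on direction, scaling either vector leaves it unchanged.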

Now we can begin populating the index with our embeddings. Pinecone expects us to provide a list of tuples in the format *(id, vector, metadata)*, where *metadata* is an optional dictionary in which we can store anything we want. For this example, we will store the original text of each embedding.

While uploading our data, we will batch everything to avoid pushing too much data in one go.

In [None]:
batch_size = 128

ids = [str(i) for i in range(shape[0])]
# create list of metadata dictionaries
meta = [{'text': text} for text in trec['text']]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}}}

Perfect, we can see from `index.describe_index_stats` that we have a *1024-dimensional* index populated with *1000* embeddings. The `index_fullness` metric tells us how full our index is; at the moment it is effectively empty. Using the default of one *p1* pod, we can fit ~750K embeddings before `index_fullness` reaches capacity. The [Usage Estimator](https://www.pinecone.io/pricing) can be used to identify the number of pods required for a given number of *n*-dimensional embeddings.

## Semantic Search

Now that we have our indexed vectors we can perform a few search queries. When searching we will first embed our query using Cohere, and then search using the returned vector in Pinecone.

In [None]:
query = "What caused the 1929 Great Depression?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

print(np.array(xq).shape)

# query, returning the top 10 most similar results
# (xq holds a single embedding, so we pass that one vector)
res = index.query(vector=xq[0], top_k=10, include_metadata=True)
res

(1, 1024)


{'results': [{'matches': [{'id': '932',
                           'metadata': {'text': 'Why did the world enter a '
                                                'global depression in 1929 ?'},
                           'score': 0.832818151,
                           'values': []},
                          {'id': '787',
                           'metadata': {'text': 'When was `` the Great '
                                                "Depression '' ?"},
                           'score': 0.752612948,
                           'values': []},
                          {'id': '400',
                           'metadata': {'text': 'What crop failure caused the '
                                                'Irish Famine ?'},
                           'score': 0.499015927,
                           'values': []},
                          {'id': '160',
                           'metadata': {'text': 'What war did the '
                                                'Wanna

The response from Pinecone includes our original text in the `metadata` field. Let's print out the `top_k` most similar questions and their respective similarity scores.

In [None]:
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.83: Why did the world enter a global depression in 1929 ?
0.75: When was `` the Great Depression '' ?
0.50: What crop failure caused the Irish Famine ?
0.34: What war did the Wanna-Go-Home Riots occur after ?
0.34: What were popular songs and types of songs in the 1920s ?
0.34: What caused the Lynmouth floods ?
0.33: When did the Dow first reach ?
0.32: What is considered the costliest disaster the insurance industry has ever faced ?
0.32: When did World War I start ?
0.31: What caused Harry Houdini 's death ?
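
On a dataset this small, the ranking can be sanity-checked locally with a brute-force search over the embeddings still held in memory. The sketch below is such a reference implementation in NumPy; `top_k_cosine` is a hypothetical helper, and its scores may differ slightly from Pinecone's, so treat it as a rough cross-check rather than a reproduction of the service.

```python
import numpy as np


def top_k_cosine(query, vectors, k=10):
    """Return (row_index, score) pairs for the k rows most similar to `query`."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float).ravel()
    # normalise rows and query so that a dot product equals cosine similarity
    vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = vectors_norm @ query_norm
    # indices of the k highest scores, best first
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# usage with the notebook's variables (row i corresponds to the id str(i)):
# for i, score in top_k_cosine(xq, embeds, k=10):
#     print(f"{score:.2f}: {trec['text'][i]}")
```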


Looks good. Let's make it harder and replace *"depression"* with the incorrect term *"recession"*.

In [None]:
query = "What was the cause of the major recession in the early 20th century?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

# query, returning the top 10 most similar results
res = index.query(vector=xq[0], top_k=10, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.66: Why did the world enter a global depression in 1929 ?
0.61: When was `` the Great Depression '' ?
0.43: What are some of the significant historical events of the 1990s ?
0.43: What crop failure caused the Irish Famine ?
0.37: What were popular songs and types of songs in the 1920s ?
0.36: When did the Dow first reach ?
0.35: What war did the Wanna-Go-Home Riots occur after ?
0.34: What historical event happened in Dogtown in 1899 ?
0.33: What is considered the costliest disaster the insurance industry has ever faced ?
0.31: What was the education system in the 1960 's ?


And again.

In [None]:
query = "Why was there a long-term economic downturn in the early 20th century?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

# query, returning the top 10 most similar results
res = index.query(vector=xq[0], top_k=10, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.71: Why did the world enter a global depression in 1929 ?
0.62: When was `` the Great Depression '' ?
0.40: What crop failure caused the Irish Famine ?
0.38: What are some of the significant historical events of the 1990s ?
0.38: When did the Dow first reach ?
0.35: What were popular songs and types of songs in the 1920s ?
0.33: What was the education system in the 1960 's ?
0.32: Give a reason for American Indians oftentimes dropping out of school .
0.31: What war did the Wanna-Go-Home Riots occur after ?
0.30: What historical event happened in Dogtown in 1899 ?


Looks great. Our semantic search pipeline clearly captures the meaning of each query and returns the most semantically similar questions from those already indexed.
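
To turn these rankings into a duplicate-question check, one simple option is to keep only matches above a score cutoff. The sketch below assumes a threshold of `0.7`, which is a hypothetical value picked by eye from the scores above and would need tuning on real data; `likely_duplicates` is likewise our own helper. The sample matches reuse the first three results of the first query.

```python
def likely_duplicates(matches, threshold=0.7):
    """Keep (score, text) pairs whose similarity score meets the threshold."""
    return [
        (m['score'], m['metadata']['text'])
        for m in matches
        if m['score'] >= threshold
    ]


# first three matches returned for "What caused the 1929 Great Depression?"
sample_matches = [
    {'id': '932', 'score': 0.832818151,
     'metadata': {'text': 'Why did the world enter a global depression in 1929 ?'}},
    {'id': '787', 'score': 0.752612948,
     'metadata': {'text': "When was `` the Great Depression '' ?"}},
    {'id': '400', 'score': 0.499015927,
     'metadata': {'text': 'What crop failure caused the Irish Famine ?'}},
]

for score, text in likely_duplicates(sample_matches):
    print(f"{score:.2f}: {text}")
```

With the assumed cutoff, the two depression questions are kept and the Irish Famine question is dropped. In the notebook, the same function could be applied to `res['matches']`.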

Once we're done with the index we delete it to save resources:

In [None]:
pinecone.delete_index(index_name)

---