[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/cohere/webinar_classification_and_search/01_semantic_search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/cohere/webinar_classification_and_search/01_semantic_search.ipynb)

# Semantic Search with Cohere and Pinecone

In this notebook we will demonstrate how to perform semantic search for identifying similar or duplicate questions using Cohere and Pinecone.

![Steps in semantic search process](https://drive.google.com/uc?id=1hcTdsaJSq4nb6chXyOcmoJ0M_udWMT0e&authuser=1)

## Setup

We first need to set up our environment and retrieve API keys for Cohere and Pinecone. Let's start with the environment: we need HuggingFace *Datasets* for our data, and the Cohere and Pinecone clients:

In [None]:
!pip install cohere pinecone-client datasets

Next, sign up for an API key at [Cohere](https://os.cohere.ai/) and [Pinecone](https://app.pinecone.io). We can enter the keys directly in the cell below.

In [None]:
COHERE_KEY = "<<COHERE_KEY_HERE>>"
PINECONE_KEY = "<<PINECONE_KEY_HERE>>"  # app.pinecone.io
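
Hardcoding keys is fine for a quick demo, but they are easy to leak when a notebook is shared. As a minimal sketch of one alternative, the keys can be read from environment variables, falling back to the placeholders above. The helper `load_key` and the variable names `COHERE_API_KEY` and `PINECONE_API_KEY` are assumptions made for illustration, not names required by either client.

```python
import os


def load_key(env_var: str, placeholder: str) -> str:
    """Return the value of env_var, or the placeholder if it is not set."""
    return os.environ.get(env_var, placeholder)


# Assumed environment variable names; use whatever names you exported.
COHERE_KEY = load_key("COHERE_API_KEY", "<<COHERE_KEY_HERE>>")
PINECONE_KEY = load_key("PINECONE_API_KEY", "<<PINECONE_KEY_HERE>>")
```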

## Create Embeddings

We can create sentence embeddings easily using Cohere. First, we import the Cohere client and initialize our connection using the API key we retrieved earlier.

In [None]:
import cohere

co = cohere.Client(COHERE_KEY)

We will load a set of question-answer pairs scraped from Reddit's QA subreddits. We will use only a small number of vectors here, but this can be scaled to millions or even billions of samples.

In [None]:
import pandas as pd

# load the r/AskScience dataset
qa = pd.read_csv('../data/askscience_2015-2022-4mo.tsv', sep='\t')
qa.head()

| Unnamed: 0.1 | Unnamed: 0 | title | score | url | body | created_utc | id | link_flair_text | time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 899 | Stephen Hawking megathread | 65836 | https://www.reddit.com/r/askscience/comments/8... | | 1521003828 | 84auzr | Physics | 2018-03-14 8:03:48 |
| 1 | 600 | Do giraffes get struck by lightning more often... | 31843 | https://www.reddit.com/r/askscience/comments/6... | | 1490801125 | 627akk | Biology | 2017-03-29 18:25:25 |
| 2 | 799 | What % of my weight am I actually lifting when... | 31671 | https://www.reddit.com/r/askscience/comments/7... | | 1509042346 | 78xinz | Physics | 2017-10-26 21:25:46 |
| 3 | 699 | What is the point of using screws with a Phill... | 30971 | https://www.reddit.com/r/askscience/comments/6... | | 1495915189 | 6dpog2 | Engineering | 2017-05-27 22:59:49 |
| 4 | 800 | If hand sanitizer kills 99.99% of germs, then ... | 28019 | https://www.reddit.com/r/askscience/comments/7... | | 1507730421 | 75p8dn | Biology | 2017-10-11 17:00:21 |


In [None]:
len(qa)

2093

In [None]:
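# replace missing flairs; assigning back avoids in-place modification of a column selection
qa['link_flair_text'] = qa['link_flair_text'].fillna("Unknown")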

We can then pass these questions to Cohere to create embeddings.

In [None]:
embeds = co.embed(
    texts=qa['title'].tolist(),
    model='large',
    truncate='LEFT'
).embeddings

To check the dimensionality of the returned vectors, we convert them from a list of lists to a NumPy array. We will need this embedding dimensionality later, when initializing our Pinecone index.

In [None]:
import numpy as np

shape = np.array(embeds).shape
shape

(2093, 4096)

Here we can see the `4096` embedding dimensionality produced by Cohere's large model, and the `2093` questions we built embeddings for.

## Storing the Embeddings

Now that we have our embeddings, we can move on to indexing them in the Pinecone vector database. Again, this is very simple: we initialize our connection to Pinecone and then create a new index for storing the embeddings, making sure to specify the cosine similarity metric to align with Cohere's embeddings.

In [None]:
from pinecone import Pinecone, PodSpec

# initialize the client with our API key
pc = Pinecone(api_key=PINECONE_KEY)

index_name = 'cohere-pinecone-askscience'

# if the index does not exist, we create it
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=shape[1],
        metric='cosine',
        # pod-based index; find the environment next to the API key in the console
        spec=PodSpec(environment="YOUR_ENV")
    )

# connect to index
index = pc.Index(index_name)

Now we can begin populating the index with our embeddings. Pinecone expects us to provide a list of tuples in the format *(id, vector, metadata)*, where *metadata* is an optional dictionary in which we can store extra information about each vector. For this example, we will store the original question title together with its URL and flair.

While uploading our data, we will batch everything to avoid pushing too much data in one go.
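
The batching pattern itself is independent of Pinecone. As a minimal sketch, the generator below slices any sequence into fixed-size batches; the name `chunks` is a hypothetical helper for illustration and is not part of any client library.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 2093 items in batches of 16 gives 130 full batches plus one batch of 13
sizes = [len(batch) for batch in chunks(list(range(2093)), 16)]
print(len(sizes), sizes[-1])  # prints "131 13"
```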

In [None]:
batch_size = 16

ids = qa['id'].tolist()

# metadata stored alongside each vector
meta = [{
    'url': row['url'],
    'title': row['title'],
    'link_flair_text': row['link_flair_text']
} for _, row in qa.iterrows()]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
index.describe_index_stats()

{'dimension': 4096,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2093}}}

Perfect, we can see from `index.describe_index_stats()` that we have a *4096-dimensional* index populated with *2093* embeddings. The `index_fullness` metric tells us how full our index is; at the moment it is effectively empty. Using the default of one *p1* pod, we can fit ~200K embeddings before `index_fullness` reaches capacity. The [Usage Estimator](https://www.pinecone.io/pricing) can be used to identify the number of pods required for a given number of *n*-dimensional embeddings.
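
For a rough sense of scale, the raw size of the vectors can be estimated with simple arithmetic. This sketch assumes each value is stored as a 4-byte float and ignores metadata and any index overhead, so it says nothing about how Pinecone stores data internally. The function name `raw_vector_bytes` is hypothetical.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of n_vectors dense vectors of length dim (assumes 4-byte floats)."""
    return n_vectors * dim * bytes_per_value


size = raw_vector_bytes(2093, 4096)
print(f"{size} bytes = {size / 1024**2:.1f} MiB")  # 34291712 bytes = 32.7 MiB
```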

## Semantic Search

Now that we have our indexed vectors we can perform a few search queries. When searching we will first embed our query using Cohere, and then search using the returned vector in Pinecone.

In [None]:
query = "Can flowing lava exist underwater?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

print(np.array(xq).shape)

# query, returning the top 10 most similar results
# (xq holds a single embedding, so we pass xq[0])
res = index.query(vector=xq[0], top_k=10, include_metadata=True)
res

(1, 4096)


{'results': [{'matches': [{'id': '7vgwdg',
                           'metadata': {'link_flair_text': 'Earth Sciences',
                                        'title': 'The video game "Subnautica" '
                                                 'depicts an alien planet with '
                                                 'many exotic underwater '
                                                 'ecosystems. One of these is '
                                                 'a "lava zone" where molten '
                                                 'lava stays in liquid form '
                                                 'under the sea. Is this '
                                                 'possible?',
                                        'url': 'https://www.reddit.com/r/askscience/comments/7vgwdg/the_video_game_subnautica_depicts_an_alien_planet/'},
                           'score': 0.776414514,
                           'values': []},
                         

The response from Pinecone includes our original text in the `metadata` field, let's print out the `top_k` most similar questions and their respective similarity scores.

In [None]:
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']}")

0.78: The video game "Subnautica" depicts an alien planet with many exotic underwater ecosystems. One of these is a "lava zone" where molten lava stays in liquid form under the sea. Is this possible?
0.46: How do lava lamps work?
0.43: Can waste dumped in the sea cause underwater landslides or tsunamis?
0.42: How deep can water be before the water at the bottom starts to phase change from liquid to solid?
0.40: How far do you have to go beneath the ocean floor before the earth becomes dry again?
0.40: If there was a body of water that was as deep as the Marianas Trench but perfectly clear and straight down, would you be able to see all the way to the bottom?
0.38: If you throw a waterproof speaker under water, and then dive under water yourself, can you hear the sound?
0.38: From how high up can you dive before water may as well be concrete?
0.37: Does lightning strike the ocean? If so, does it electrocute nearby fish?
0.37: How are underwater tunnels built? (Such as the one from Copen

In [None]:
query = "What is the anthropocene?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# query, returning the top 5 most similar results
res = index.query(vector=xq[0], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']}")

0.49: AskScience AMA Series: We mapped human transformation of Earth over the past 10,000 years and the results will surprise you! Ask us anything!
0.46: So atmospheric CO2 levels just reached 410 ppm, first time in 3 million years it's been that high. What happened 3 million years ago?
0.38: Askscience Megathread: Climate Change
0.38: How different was this world ecologically, about 2000 to 2500 yrs ago?
0.34: Ask Anything Wednesday - Economics, Political Science, Linguistics, Anthropology


Looks great. Our semantic search pipeline clearly captures the meaning of each query and returns the most semantically similar questions from those already indexed.
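
Conceptually, each query above ranks stored vectors by cosine similarity to the query vector and keeps the best `top_k`. The following is a brute-force sketch of that idea on toy 2-dimensional vectors. It is only an illustration: a vector database uses far more efficient search than this exhaustive scan, and the names `cosine` and `top_k_matches` are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(query_vec, vectors, k):
    """Return the k highest-scoring (score, id) pairs, best first.

    vectors maps an id to its vector.
    """
    scored = ((cosine(query_vec, vec), vid) for vid, vec in vectors.items())
    return heapq.nlargest(k, scored)


toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for score, vid in top_k_matches([1.0, 0.1], toy_index, k=2):
    print(f"{score:.2f}: {vid}")
```

Because cosine similarity depends only on direction, scaling a vector does not change its score, which is why the scores printed by the queries above are bounded by 1.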

---

## Adding Filtering

Taking our search one step further, we can add filtering to narrow our search scope while still maintaining fast search times, using Pinecone's single-stage filtering.

For the filters we will use *four* categories, each of which includes many flairs used by users in **r/askscience**.

In [None]:
all_tags = ['Physics', 'Biology', 'Engineering', 'Unknown', 'Earth Sciences',
       'Astronomy', 'Anthropology', 'Human Body', 'Social Science',
       'Medicine', 'Computing', 'Psychology', 'Chemistry', 'Linguistics',
       'Mathematics', 'Planetary Sci.', 'Neuroscience', 'Paleontology',
       'COVID-19', 'Archaeology', 'Earth Sciences and Biology', 'Meta',
       'Economics', 'CERN AMA', 'Dog Cognition AMA',
       'Cancer Treatment AMA', 'Psychology AMA', 'Archaeology AMA',
       'Alzheimer’s disease AMA', 'Oceanography AMA', 'Biology AMA',
       'Biology/Agriculture', 'Neuroscience AMA', 'Climate History AMA',
       'Climate Science AMA', 'Food Safety AMA', 'Ecology and Evolution']

chats = {
    "#general": all_tags,
    "#medical": [
        'Human Body', 'Medicine', 'COVID-19', 'Cancer Treatment AMA', 'Food Safety AMA'
    ],
    "#natural-sciences": [
        'Physics', 'Biology', 'Earth Sciences', 'Astronomy', 'Anthropology',
        'Human Body', 'Chemistry', 'Mathematics', 'Planetary Sci.', 'Neuroscience',
        'Earth Sciences and Biology', 'CERN AMA', 'Oceanography AMA',
        'Biology AMA', 'Biology/Agriculture', 'Neuroscience AMA', 'Climate History AMA',
        'Climate Science AMA', 'Ecology and Evolution'
    ],
    "#social-sciences": [
        'Anthropology', 'Social Science', 'Psychology', 'Linguistics', 'Economics',
        'Psychology AMA'
    ]
}
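
Metadata filters match flair strings exactly, so a misspelled label or a missing comma between two strings silently excludes results rather than raising an error. As a minimal sketch, the hypothetical helper `unknown_tags` below reports any group entries that do not appear in the list of known tags. It is shown on a shortened toy copy of the lists, with two deliberate mistakes.

```python
def unknown_tags(groups, known):
    """Map each group name to its sorted tags that are missing from known."""
    known_set = set(known)
    report = {}
    for name, tags in groups.items():
        missing = sorted(set(tags) - known_set)
        if missing:
            report[name] = missing
    return report


known = ['Physics', 'Biology', 'Planetary Sci.']
groups = {
    "#ok": ['Physics', 'Biology'],
    "#typo": ['Physics', 'Planetry Sci.'],      # deliberately misspelled
    "#missing-comma": ['Biology' 'Physics'],    # concatenates to 'BiologyPhysics'
}
print(unknown_tags(groups, known))
# {'#typo': ['Planetry Sci.'], '#missing-comma': ['BiologyPhysics']}
```

In the notebook, the equivalent call would be `unknown_tags(chats, all_tags)`, which returns an empty dictionary when every label is spelled exactly as in `all_tags`.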

First, let's try querying *without* any filters.

In [None]:
query = "what are the effects of the anthropocene?"

# create embedding with cohere
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# query, returning the top 5 most similar results
res = index.query(vector=xq[0], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']} ({match['metadata']['link_flair_text']})")

0.48: Are there any positive effects of climate change? (Earth Sciences)
0.42: AskScience AMA Series: We mapped human transformation of Earth over the past 10,000 years and the results will surprise you! Ask us anything! (Unknown)
0.36: Has human society and culture fundamentally altered our own biological evolution? (Ecology and Evolution)
0.36: What environmental impacts would a border wall between the United States and Mexico cause? (Earth Sciences)
0.36: How different was this world ecologically, about 2000 to 2500 yrs ago? (Earth Sciences)


Naturally there's some overlap between the topic groups defined above (and the grouping may be fairly rough), but these groups are what we will build our filters from.

Filtering in Pinecone is pretty simple: we pass our conditions to the `filter` parameter using operators such as equal to (`$eq`), in (`$in`), and greater than (`$gt`). So if we want to return only `Paleontology` results, we can do so like this:

In [None]:
query = "what are the effects of the anthropocene?"

# create embedding with cohere
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# then query pinecone w/ a filter
res = index.query(
    vector=xq[0], top_k=5, include_metadata=True,
    filter={
        'link_flair_text': {'$eq': 'Paleontology'}
    })

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']} ({match['metadata']['link_flair_text']})")

0.25: What exactly would the landscape of the British Isles have looked like prior to human cultivation? (Paleontology)
0.17: AskScience AMA Series: I am paleontologist Hans Sues, I study late Paleozoic and Mesozoic vertebrates. Ask Me Anything! (Paleontology)
0.16: Given the way the Indian subcontinent was once a very large island, is it possible to find the fossils of coastal animals in the Himalayas? (Paleontology)
0.15: If I went back to the Cretacious era to go fishing, what would I catch? How big would they be? What eon would be most interesting to fish in? (Paleontology)
0.13: We are paleontologists who study fossils from an incredible site in Texas called the Arlington Archosaur Site. Ask us anything! (Paleontology)


Or as with our demo, we might group flair labels together and use `$in`.

In [None]:
query = "what are the effects of the anthropocene?"

# create embedding with cohere
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# then query pinecone w/ a filter
res = index.query(
    vector=xq[0], top_k=5, include_metadata=True,
    filter={
        'link_flair_text': {'$in': chats['#social-sciences']}
    })

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']} ({match['metadata']['link_flair_text']})")

0.23: Why has Europe's population remained relatively constant whereas other continents have shown clear increase? (Social Science)
0.22: AskScience AMA Series: I’m Stephan Lewandowsky, here with Klaus Oberauer, we will be responding to your questions about the conflict between our brains and our globe: How will we meet the challenges of the 21st century despite our cognitive limitations? AMA! (Psychology)
0.20: Has the growing % of the population avoiding meat consumption had any impact on meat production? (Anthropology)
0.20: What will happen to us if the birth replacement rate keeps falling? (Social Science)
0.19: If modern man came into existence 200k years ago, but modern day societies began about 10k years ago with the discoveries of agriculture and livestock, what the hell where they doing the other 190k years?? (Anthropology)
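
To make the semantics of the two operators used above concrete, here is a small local emulation of `$eq` and `$in` against a metadata dictionary. This is only a sketch for intuition: `matches_filter` is a hypothetical helper, it covers just these two operators on flat fields, and it is not how Pinecone evaluates filters internally.

```python
def matches_filter(metadata, flt):
    """Return True if metadata satisfies every condition in flt.

    Supports only {'field': {'$eq': value}} and {'field': {'$in': [values]}}.
    """
    for field, condition in flt.items():
        value = metadata.get(field)
        for op, operand in condition.items():
            if op == "$eq":
                if value != operand:
                    return False
            elif op == "$in":
                if value not in operand:
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
    return True


record = {'title': 'How do lava lamps work?', 'link_flair_text': 'Physics'}
print(matches_filter(record, {'link_flair_text': {'$eq': 'Physics'}}))             # True
print(matches_filter(record, {'link_flair_text': {'$in': ['Biology', 'Medicine']}}))  # False
```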


Once we're finished with the index we delete it to save resources.

In [None]:
# delete the index through a Pinecone client instance
Pinecone(api_key=PINECONE_KEY).delete_index(index_name)

---