<a href="https://colab.research.google.com/github/pinecone-io/examples/blob/cohere-webinar-2205/integrations/cohere/webinar_classification_and_search/03_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Searching and Filtering

In [1]:
!pip install cohere pinecone-client




Taking our search one step further, we can add filtering to narrow the search scope while keeping search times fast, thanks to Pinecone's single-stage filtering.

We can start by initializing Cohere + Pinecone.

In [2]:
COHERE_KEY = "<<COHERE_KEY_HERE>>"
PINECONE_KEY = "<<PINECONE_KEY_HERE>>"  # app.pinecone.io

In [3]:
import cohere
import pinecone

co = cohere.Client(COHERE_KEY)

pinecone.init(PINECONE_KEY, environment='us-west1-gcp')

index_name = 'cohere-pinecone-askscience'
# connect to index
index = pinecone.Index(index_name)

For the filters we will use *four* categories, each of which groups several of the flairs used in **r/askscience**.

In [4]:
all_tags = ['Physics', 'Biology', 'Engineering', 'Unknown', 'Earth Sciences',
       'Astronomy', 'Anthropology', 'Human Body', 'Social Science',
       'Medicine', 'Computing', 'Psychology', 'Chemistry', 'Linguistics',
       'Mathematics', 'Planetary Sci.', 'Neuroscience', 'Paleontology',
       'COVID-19', 'Archaeology', 'Earth Sciences and Biology', 'Meta',
       'Economics', 'CERN AMA', 'Dog Cognition AMA',
       'Cancer Treatment AMA', 'Psychology AMA', 'Archaeology AMA',
       'Alzheimer’s disease AMA', 'Oceanography AMA', 'Biology AMA',
       'Biology/Agriculture', 'Neuroscience AMA', 'Climate History AMA',
       'Climate Science AMA', 'Food Safety AMA', 'Ecology and Evolution']

chats = {
    "#general": all_tags,
    "#medical": [
        'Human Body', 'Medicine', 'COVID-19', 'Cancer Treatment AMA', 'Food Safety AMA'
    ],
    "#natural-sciences": [
        'Physics', 'Biology', 'Earth Sciences', 'Astronomy', 'Anthropology',
        'Human Body', 'Chemistry', 'Mathematics', 'Planetary Sci.', 'Neuroscience',
        'Earth Sciences and Biology', 'CERN AMA', 'Oceanography AMA',
        'Biology AMA', 'Biology/Agriculture', 'Neuroscience AMA', 'Climate History AMA',
        'Climate Science AMA', 'Ecology and Evolution'
    ],
    "#social-sciences": [
        'Anthropology', 'Social Science', 'Psychology', 'Linguistics', 'Economics',
        'Psychology AMA'
    ]
}
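
Because the category lists are typed by hand, a quick sanity check can catch tags that match no known flair; a misspelled tag in an `$in` filter would simply match nothing. The following is a minimal sketch: `find_unknown_tags` is a hypothetical helper, and the sample data is a small stand-in for `all_tags` and `chats`.

```python
def find_unknown_tags(categories, known_tags):
    """Return {category: [tags missing from known_tags]}, skipping clean categories."""
    known = set(known_tags)
    unknown = {}
    for name, tags in categories.items():
        missing = [tag for tag in tags if tag not in known]
        if missing:
            unknown[name] = missing
    return unknown

# small stand-in data (not the full lists from the notebook)
sample_tags = ['Physics', 'Biology', 'Planetary Sci.', 'Medicine']
sample_chats = {
    '#medical': ['Medicine'],
    '#natural-sciences': ['Physics', 'Biology', 'Planetry Sci.'],  # deliberate misspelling
}

print(find_unknown_tags(sample_chats, sample_tags))
```

Running the same check as `find_unknown_tags(chats, all_tags)` should return an empty dict when every grouped tag is a real flair.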

In [5]:
query = "what are the effects of the anthropocene?"

# create embedding with cohere
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']} ({match['metadata']['link_flair_text']})")

0.48: Are there any positive effects of climate change? (Earth Sciences)
0.42: AskScience AMA Series: We mapped human transformation of Earth over the past 10,000 years and the results will surprise you! Ask us anything! (Unknown)
0.36: Has human society and culture fundamentally altered our own biological evolution? (Ecology and Evolution)
0.36: What environmental impacts would a border wall between the United States and Mexico cause? (Earth Sciences)
0.36: How different was this world ecologically, about 2000 to 2500 yrs ago? (Earth Sciences)
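
The same print loop follows every query in this notebook, so it can be convenient to wrap it in a function. This is a minimal sketch that assumes the response has the `res['results'][0]['matches']` shape used above. `format_matches` is a hypothetical helper, and `sample_res` is hand-written stand-in data, not real output.

```python
def format_matches(res):
    """Return one 'score: title (flair)' line per match in a query response."""
    lines = []
    for match in res['results'][0]['matches']:
        meta = match['metadata']
        lines.append(f"{match['score']:.2f}: {meta['title']} ({meta['link_flair_text']})")
    return lines

# hand-written stand-in for a query response (illustrative values only)
sample_res = {'results': [{'matches': [
    {'score': 0.4812,
     'metadata': {'title': 'Example question A?', 'link_flair_text': 'Earth Sciences'}},
    {'score': 0.3571,
     'metadata': {'title': 'Example question B?', 'link_flair_text': 'Unknown'}},
]}]}

for line in format_matches(sample_res):
    print(line)
```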


Naturally there's some overlap between topics (and the grouping above may be fairly inaccurate), but these categories will form the filters we use.

Filtering in Pinecone is straightforward: we pass our conditions to the `filter` parameter using operators such as `$eq` (equal to), `$in` (in), and `$gt` (greater than). So if we want to return only `Paleontology` results, we can do it like so:

In [7]:
query = "what are the effects of the anthropocene?"

# create embedding with cohere
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# then query pinecone w/ a filter
res = index.query(
    xq, top_k=5, include_metadata=True,
    filter={
        'link_flair_text': {'$eq': 'Paleontology'}
    })

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']} ({match['metadata']['link_flair_text']})")

0.25: What exactly would the landscape of the British Isles have looked like prior to human cultivation? (Paleontology)
0.17: AskScience AMA Series: I am paleontologist Hans Sues, I study late Paleozoic and Mesozoic vertebrates. Ask Me Anything! (Paleontology)
0.16: Given the way the Indian subcontinent was once a very large island, is it possible to find the fossils of coastal animals in the Himalayas? (Paleontology)
0.15: If I went back to the Cretacious era to go fishing, what would I catch? How big would they be? What eon would be most interesting to fish in? (Paleontology)
0.13: We are paleontologists who study fossils from an incredible site in Texas called the Arlington Archosaur Site. Ask us anything! (Paleontology)
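
To make the operator semantics concrete, here is a pure-Python sketch of how a filter like the one above selects records. It is only an illustration, not how Pinecone evaluates filters internally. `matches_filter` and the `upvotes` field are hypothetical, and only the `$eq`, `$in`, and `$gt` operators are covered.

```python
import operator

# the three operators mentioned in the text, mapped to Python comparisons
OPERATORS = {
    '$eq': operator.eq,
    '$gt': operator.gt,
    '$in': lambda value, options: value in options,
}

def matches_filter(metadata, conditions):
    """Return True if `metadata` satisfies every condition in `conditions`."""
    for field, ops in conditions.items():
        if field not in metadata:
            return False  # assumption: a record missing the field never matches
        for op, target in ops.items():
            if not OPERATORS[op](metadata[field], target):
                return False
    return True

# toy records; 'upvotes' is a hypothetical numeric field for the $gt example
records = [
    {'link_flair_text': 'Paleontology', 'upvotes': 120},
    {'link_flair_text': 'Physics', 'upvotes': 900},
    {'link_flair_text': 'Paleontology', 'upvotes': 15},
]

conditions = {
    'link_flair_text': {'$eq': 'Paleontology'},
    'upvotes': {'$gt': 50},
}
kept = [r for r in records if matches_filter(r, conditions)]
print(kept)
```

In this sketch, conditions on several fields must all hold for a record to be kept.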


Or, as in our demo, we might group flair labels together and use `$in`:

In [8]:
query = "what are the effects of the anthropocene?"

# create embedding with cohere
xq = co.embed(
    texts=[query],
    model='large',
    truncate='LEFT'
).embeddings

# then query pinecone w/ a filter
res = index.query(
    xq, top_k=5, include_metadata=True,
    filter={
        'link_flair_text': {'$in': chats['#social-sciences']}
    })

for match in res['results'][0]['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['title']} ({match['metadata']['link_flair_text']})")

0.23: Why has Europe's population remained relatively constant whereas other continents have shown clear increase? (Social Science)
0.22: AskScience AMA Series: I’m Stephan Lewandowsky, here with Klaus Oberauer, we will be responding to your questions about the conflict between our brains and our globe: How will we meet the challenges of the 21st century despite our cognitive limitations? AMA! (Psychology)
0.20: Has the growing % of the population avoiding meat consumption had any impact on meat production? (Anthropology)
0.20: What will happen to us if the birth replacement rate keeps falling? (Social Science)
0.19: If modern man came into existence 200k years ago, but modern day societies began about 10k years ago with the discoveries of agriculture and livestock, what the hell where they doing the other 190k years?? (Anthropology)
