[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/quick-tour/metadata-filtering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/quick-tour/metadata-filtering.ipynb)

# Metadata filtering with Pinecone

Metadata filtering is a Pinecone feature that lets you restrict a vector search to items whose metadata matches a filter.
You attach metadata to the embeddings when you store them in Pinecone, then pass the filter criteria with the query. Pinecone searches for similar vector embeddings only among the items that match the filter.
Filters can combine arbitrary conditions on metadata fields, and the query returns the requested number of nearest neighbors from among the matching items (fewer only when fewer items match). In most cases, search latency is even lower than for unfiltered searches.
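
Conceptually, a filtered query behaves like the brute-force sketch below: drop the items whose metadata fails the filter, then rank what remains by distance. This only illustrates the semantics in plain Python and says nothing about how Pinecone executes filtered search internally. The `items` list and the `filtered_top_k` helper are hypothetical names introduced for this example.

```python
from math import dist

# Hypothetical toy data: (id, vector, metadata)
items = [
    ("a", [1.0, 1.0], {"category": "finance"}),
    ("b", [2.0, 2.0], {"category": "sport"}),
    ("c", [3.0, 3.0], {"category": "finance"}),
]

def filtered_top_k(query, items, predicate, top_k):
    """Rank only the items whose metadata satisfies `predicate`."""
    candidates = [(item_id, vec) for item_id, vec, meta in items if predicate(meta)]
    # Smaller Euclidean distance means more similar
    candidates.sort(key=lambda pair: dist(query, pair[1]))
    return [item_id for item_id, _ in candidates[:top_k]]

result = filtered_top_k(
    [1.0, 1.0], items, lambda meta: meta["category"] == "finance", top_k=2
)
print(result)  # ['a', 'c']
```

Item `b` is closer to the query than `c`, but it never enters the ranking because it fails the filter.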

In this notebook, we will walk through a simple use of filtering while performing vector search on documents.

## Prerequisites

Install dependencies.

In [None]:
!pip install -qU \
  pinecone-client==3.0.0 \
  pandas==2.0.3

Set up Pinecone.

Before getting started, decide whether to use a serverless or a pod-based index.

In [None]:
import os

use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"

## Creating an Index

Before generating any data, we set up an index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [None]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
environment = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

# configure client
pc = Pinecone(api_key=api_key)
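
The cell above falls back to a placeholder string when the environment variable is missing, so a forgotten key only surfaces later as an authentication error. One option, shown here as a minimal sketch, is to fail early instead. The `require_env` helper is a hypothetical name introduced for this example and is not part of the Pinecone client.

```python
import os

def require_env(name, placeholder=None):
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset, empty, or still equal
    to the given placeholder text.
    """
    value = os.environ.get(name)
    if not value or value == placeholder:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value

# Possible usage in place of the fallback above (commented out so the helper stands alone):
# api_key = require_env("PINECONE_API_KEY", placeholder="PINECONE_API_KEY")
```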

Next, we set up the index specification, which defines the cloud provider and region where the index is deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
from pinecone import ServerlessSpec, PodSpec

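if use_serverless:
    # cloud provider and region for a serverless index, overridable via environment
    # (PINECONE_REGION is an assumed variable name; the defaults match the original values)
    cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
    region = os.environ.get('PINECONE_REGION') or 'us-west-2'
    spec = ServerlessSpec(cloud=cloud, region=region)
else:
    # pod-based indexes are placed by environment
    spec = PodSpec(environment=environment)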

In [None]:
index_name = "pinecone-metadata-filtering"

In [None]:
import time

# Delete index if exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# Create an index
pc.create_index(
    name=index_name, 
    dimension=2, 
    metric="euclidean",
    spec=spec
)

# wait for index to be ready before connecting
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
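
The loop above polls indefinitely, so it never returns if the index fails to become ready. A bounded variant is sketched below. `wait_until` is a hypothetical helper written for this example, and the 120-second default is an arbitrary assumption.

```python
import time

def wait_until(is_ready, timeout=120.0, interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Usage with the objects defined above (commented out so the helper stands alone):
# wait_until(lambda: pc.describe_index(index_name).status['ready'])
```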

In [None]:
# Connect to the index
index = pc.Index(index_name)

### Generate sample document data

In [None]:
# Generate some data
import pandas as pd

df = pd.DataFrame()
df["id"] = ["F-1", "F-2", "S-1", "S-2"]
df["vector"] = [[1., 1.], [2., 2.], [3., 3.], [4., 4.]]
df["metadata"] = [
    {"category": "finance", "published": 2015},
    {"category": "finance", "published": 2016},
    {"category": "sport", "published": 2017},
    {"category": "sport", "published": 2018}]
df

|   | id | vector | metadata |
|---|----|--------|----------|
| 0 | F-1 | [1.0, 1.0] | {'category': 'finance', 'published': 2015} |
| 1 | F-2 | [2.0, 2.0] | {'category': 'finance', 'published': 2016} |
| 2 | S-1 | [3.0, 3.0] | {'category': 'sport', 'published': 2017} |
| 3 | S-2 | [4.0, 4.0] | {'category': 'sport', 'published': 2018} |


### Insert vectors

In [None]:
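# Insert vectors as (id, values, metadata) tuples without specifying a namespace
index.upsert(vectors=list(zip(df.id, df.vector, df.metadata)))
# Stats can lag briefly behind an upsert, so the count may not include the new vectors yet
index.describe_index_stats()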

{'dimension': 2,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

### Fetch a vector

In [None]:
index.fetch(ids=["F-1"])

{'namespace': '',
 'vectors': {'F-1': {'id': 'F-1',
                     'metadata': {'category': 'finance', 'published': 2015.0},
                     'values': [1.0, 1.0]}}}

### Query top-3 without filtering

In [None]:
query_results = index.query(vector=df[df.id == "F-1"].vector[0], top_k=3)
query_results

{'matches': [{'id': 'F-1', 'score': 0.0, 'values': []},
             {'id': 'F-2', 'score': 1.99999905, 'values': []},
             {'id': 'S-1', 'score': 7.99999809, 'values': []}],
 'namespace': ''}
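
With the `euclidean` metric, lower scores mean closer vectors. The scores above are consistent with squared Euclidean distance, which the plain-Python check below reproduces for the sample data. The small differences from the returned values (for example 1.99999905 versus 2.0) are presumably floating-point rounding on the service side. `squared_euclidean` is a helper defined only for this illustration.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [1.0, 1.0]  # the vector stored under id "F-1"
vectors = {
    "F-1": [1.0, 1.0],
    "F-2": [2.0, 2.0],
    "S-1": [3.0, 3.0],
    "S-2": [4.0, 4.0],
}

scores = {vec_id: squared_euclidean(query, vec) for vec_id, vec in vectors.items()}
print(scores)  # {'F-1': 0.0, 'F-2': 2.0, 'S-1': 8.0, 'S-2': 18.0}
```

`S-2` has the largest distance, which is why it is missing from the top-3 results.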

### Query results with articles in finance published after 2015

We expect exactly one match, `F-2`. `F-1` is also a finance article, but it was published in 2015, and `$gt` is a strict comparison.

In [None]:
filter_condition = {
    "category" : {"$eq": "finance"},
    "published": {"$gt": 2015 }
}
query_results = index.query(
    vector=df[df.id == "F-1"].vector[0],
    top_k=3,
    filter=filter_condition
)
query_results

{'matches': [{'id': 'F-2', 'score': 1.99999905, 'values': []}], 'namespace': ''}
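
To see why only `F-2` survives, the filter can be evaluated locally against the sample metadata. The matcher below is a rough sketch that handles only a few comparison operators and treats the top-level keys as conditions that must all hold, which is what the result above suggests. It is not Pinecone's implementation, and `OPS` and `matches` are hypothetical names.

```python
import operator

# Comparison operators handled by this sketch (a subset of the filter language)
OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}

def matches(metadata, filter_condition):
    """True if every condition on every listed field holds for `metadata`."""
    for field, conditions in filter_condition.items():
        if field not in metadata:
            return False
        for op, operand in conditions.items():
            if not OPS[op](metadata[field], operand):
                return False
    return True

records = {
    "F-1": {"category": "finance", "published": 2015},
    "F-2": {"category": "finance", "published": 2016},
    "S-1": {"category": "sport", "published": 2017},
    "S-2": {"category": "sport", "published": 2018},
}

filter_condition = {
    "category": {"$eq": "finance"},
    "published": {"$gt": 2015},
}
matching_ids = [rid for rid, meta in records.items() if matches(meta, filter_condition)]
print(matching_ids)  # ['F-2']

# A second filter using $in and $gte
recent = {"category": {"$in": ["finance", "sport"]}, "published": {"$gte": 2017}}
recent_ids = [rid for rid, meta in records.items() if matches(meta, recent)]
print(recent_ids)  # ['S-1', 'S-2']
```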

### Delete the index

In [None]:
# delete the index
pc.delete_index(index_name)