[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/quick-tour/metadata-filtering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/quick-tour/metadata-filtering.ipynb)

# Metadata filtering with Pinecone

Metadata filtering in Pinecone lets you restrict a vector search to items whose metadata matches a filter.
You attach metadata to the embeddings stored in Pinecone, then pass a filter with the query. Pinecone searches for similar vector embeddings only among the items that match the filter.
Filters can be arbitrary conditions on metadata, and the query returns up to `top_k` nearest neighbors, all of which match the filter. In most cases, search latency is even lower than for an unfiltered search.

In this notebook, we will walk through a simple use of filtering while performing vector search on documents.

## Prerequisites

Install dependencies.

In [1]:
!pip install -qU pandas==2.2.3 pinecone==6.0.2

## Initializing the Client

We begin by creating an instance of the Pinecone client. This requires a [free API key](https://app.pinecone.io).

In [2]:
import os
from pinecone import Pinecone

# Initialize client
api_key = os.environ.get("PINECONE_API_KEY") or "PINECONE_API_KEY"
pc = Pinecone(api_key=api_key)


## Creating a Pinecone Index

When creating the index we need to define several configuration properties. 

- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on. 
- `metric` specifies the similarity metric that will be used later when you make queries to the index.
- `dimension` should correspond to the dimension of the dense vectors produced by your embedding model. In this quick start, we are using made-up data so a small value is simplest.
- `spec` holds a specification that tells Pinecone how you would like to deploy your index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/troubleshooting/available-cloud-regions).

There are more configurations available, but this minimal set will get us started.
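
Because `dimension` is fixed when the index is created, every vector upserted later needs to have that same length. The sketch below is a minimal client-side check you could run before upserting. The helper name `check_dimensions` and the sample data are hypothetical and not part of the Pinecone client.

```python
def check_dimensions(vectors, dimension):
    """Return the ids of vectors whose length differs from `dimension`.

    `vectors` is an iterable of (id, values) pairs. This is an illustrative
    helper, not a Pinecone API.
    """
    return [vid for vid, values in vectors if len(values) != dimension]


# Made-up data: "b" has three components, but the index expects two.
candidates = [("a", [0.1, 0.2]), ("b", [0.1, 0.2, 0.3])]
bad_ids = check_dimensions(candidates, dimension=2)
print(bad_ids)  # ['b']
```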

In [3]:
index_name = "pinecone-metadata-filtering"

In [4]:
# Delete the demo index if it already exists
if pc.has_index(name=index_name):
    pc.delete_index(name=index_name)

In [5]:
from pinecone import ServerlessSpec, Metric, CloudProvider, AwsRegion

# Create an index
index_config = pc.create_index(
    name=index_name,
    dimension=2,
    metric=Metric.EUCLIDEAN,
    spec=ServerlessSpec(cloud=CloudProvider.AWS, region=AwsRegion.US_EAST_1),
)

## Working with the Index

Data operations such as `upsert` and `query` are sent directly to the index host instead of `api.pinecone.io`, so we use a different client object for these operations. The `.Index()` helper method constructs this client object, which automatically inherits your API key and any other configuration from the parent `Pinecone` instance.

In [6]:
# Instantiate an index client
index = pc.Index(host=index_config.host)

### Generate sample document data

In [7]:
# Generate some data
import pandas as pd

df = pd.DataFrame()
df["id"] = ["F-1", "F-2", "S-1", "S-2"]
df["vector"] = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
df["metadata"] = [
    {"category": "finance", "published": 2015},
    {"category": "finance", "published": 2016},
    {"category": "sport", "published": 2017},
    {"category": "sport", "published": 2018},
]
df

| | id | vector | metadata |
|---|---|---|---|
| 0 | F-1 | [1.0, 1.0] | {'category': 'finance', 'published': 2015} |
| 1 | F-2 | [2.0, 2.0] | {'category': 'finance', 'published': 2016} |
| 2 | S-1 | [3.0, 3.0] | {'category': 'sport', 'published': 2017} |
| 3 | S-2 | [4.0, 4.0] | {'category': 'sport', 'published': 2018} |


### Insert vectors

Most operations accept an optional parameter called `namespace`. When it is not specified, the operation uses the default namespace.

In [8]:
# Insert vectors without specifying a namespace
index.upsert(vectors=zip(df.id, df.vector, df.metadata))

{'upserted_count': 4}

In [9]:
import time


def is_fresh(index):
    stats = index.describe_index_stats()
    vector_count = stats.total_vector_count
    return vector_count > 0


while not is_fresh(index):
    # It takes a few moments for vectors we just upserted
    # to become available for querying
    time.sleep(5)

# View index stats
index.describe_index_stats()

{'dimension': 2,
 'index_fullness': 0.0,
 'metric': 'euclidean',
 'namespaces': {'': {'vector_count': 4}},
 'total_vector_count': 4,
 'vector_type': 'dense'}
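
The polling loop above keeps waiting for as long as the index reports no vectors. A variant with a timeout may be safer in scripts. The following is a minimal sketch under that assumption: `wait_until` is a hypothetical helper, and `fake_is_fresh` stands in for `lambda: is_fresh(index)` so the example runs without a Pinecone connection.

```python
import time


def wait_until(predicate, timeout=60.0, interval=5.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the timeout was hit.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for `lambda: is_fresh(index)`: becomes True on the third call.
calls = {"n": 0}


def fake_is_fresh():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_is_fresh, timeout=1.0, interval=0.01)
print(ready)  # True
```

With the real index, the call would look like `wait_until(lambda: is_fresh(index))`, followed by a check of the return value before querying.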

### Fetch a vector

Again, without specifying a namespace, the API will return results from the default namespace.

In [10]:
index.fetch(ids=["F-1"])

FetchResponse(namespace='', vectors={'F-1': Vector(id='F-1', values=[1.0, 1.0], metadata={'category': 'finance', 'published': 2015.0}, sparse_values=None)}, usage={'read_units': 1})

### Query top-3 without filtering

The `top_k` param is used to specify how many query results we would like returned.

In [11]:
query_results = index.query(
    vector=df[df.id == "F-1"].vector.iloc[0], top_k=3, include_metadata=True
)
query_results

{'matches': [{'id': 'F-1',
              'metadata': {'category': 'finance', 'published': 2015.0},
              'score': 0.0,
              'values': []},
             {'id': 'F-2',
              'metadata': {'category': 'finance', 'published': 2016.0},
              'score': 1.99999905,
              'values': []},
             {'id': 'S-1',
              'metadata': {'category': 'sport', 'published': 2017.0},
              'score': 7.99999809,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 1}}
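
The scores above appear to be squared Euclidean distances from the query vector, so a lower score means a closer match (the small deviations from 2 and 8 look like floating-point rounding). This is an inference from the printed output, not something the notebook states. A plain-Python sketch that reproduces the ranking from the sample vectors:

```python
def squared_euclidean(a, b):
    """Sum of squared component-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Same ids and vectors as the sample DataFrame.
vectors = {
    "F-1": [1.0, 1.0],
    "F-2": [2.0, 2.0],
    "S-1": [3.0, 3.0],
    "S-2": [4.0, 4.0],
}

query = vectors["F-1"]
scores = {vid: squared_euclidean(query, v) for vid, v in vectors.items()}

# Lower distance means more similar, so sort ascending and keep the top 3.
top_3 = sorted(scores.items(), key=lambda item: item[1])[:3]
print(top_3)  # [('F-1', 0.0), ('F-2', 2.0), ('S-1', 8.0)]
```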

### Query results with articles in finance published after 2015

By passing a `filter` condition, we can limit the matches to items that satisfy specific metadata criteria in addition to vector similarity. See [Understanding Metadata](https://docs.pinecone.io/guides/data/understanding-metadata) for more information about available filter conditions.

Even though we requested up to 3 results with `top_k=3`, we should expect to see only 1 article in the results, because only one item satisfies the metadata filter.

In [12]:
query_results = index.query(
    vector=df[df.id == "F-1"].vector.iloc[0],
    top_k=3,
    filter={"category": {"$eq": "finance"}, "published": {"$gt": 2015}},
    include_metadata=True,
)
query_results

{'matches': [{'id': 'F-2',
              'metadata': {'category': 'finance', 'published': 2016.0},
              'score': 1.99999905,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 1}}
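
To see why only `F-2` comes back, it can help to evaluate the filter by hand. The sketch below is an illustrative model of the filter semantics, not Pinecone's implementation: it treats multiple fields in one filter as an implicit AND and supports only a handful of comparison operators. The `matches` helper is hypothetical.

```python
import operator

# Only the comparison operators needed for this illustration.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, flt):
    """Return True if every field condition in `flt` holds (implicit AND)."""
    for field, conditions in flt.items():
        if field not in metadata:
            return False
        for op, expected in conditions.items():
            if not OPERATORS[op](metadata[field], expected):
                return False
    return True


# Same metadata as the sample DataFrame.
records = {
    "F-1": {"category": "finance", "published": 2015},
    "F-2": {"category": "finance", "published": 2016},
    "S-1": {"category": "sport", "published": 2017},
    "S-2": {"category": "sport", "published": 2018},
}

flt = {"category": {"$eq": "finance"}, "published": {"$gt": 2015}}
matching_ids = [rid for rid, md in records.items() if matches(md, flt)]
print(matching_ids)  # ['F-2']
```

`F-1` is excluded because `$gt` is strict (2015 is not greater than 2015), and both sport articles fail the `category` condition.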

### Delete the index

Once we're done with this demo we don't need the index anymore, so let's delete it.

In [13]:
# Delete the index
pc.delete_index(name=index_name)