[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/quick-tour/metadata-filtering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/quick-tour/metadata-filtering.ipynb)

# Metadata filtering with Pinecone

Metadata filtering is a Pinecone feature that lets you restrict a vector search to items whose metadata matches a filter.
You attach metadata to the embeddings stored in Pinecone, then pass the filter criteria along with the query. Pinecone searches for similar vector embeddings only among the items that match the filter.
Filters can express arbitrary conditions on metadata fields, and a query returns exactly the nearest-neighbor results that match the filter, up to the requested `top_k`. In most cases, search latency is even lower than for unfiltered searches.

In this notebook, we will walk through a simple use of filtering while performing vector search on documents.

## Prerequisites

Install dependencies.

In [1]:
# This notebook uses the legacy `pinecone.init` API, which was removed in pinecone-client 3.0
!pip install -qU "pinecone-client<3" pandas

Set up Pinecone.

In [2]:
import pinecone
import os

# Load Pinecone API key
api_key = os.getenv('PINECONE_API_KEY') or 'YOUR_API_KEY'
pinecone.init(
    api_key=api_key,
    environment="YOUR_ENV"  # find next to API key in console
)

## Creating the Index

In [3]:
index_name = "pinecone-metadata-filtering"

# Delete the index if it already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# Create an index
pinecone.create_index(name=index_name, dimension=2, metric="euclidean", shards=1)

In [4]:
# Connect to the index
index = pinecone.Index(index_name=index_name)

### Generate sample document data

In [5]:
# Generate some data
import pandas as pd

df = pd.DataFrame()
df["id"] = ["F-1", "F-2", "S-1", "S-2"]
df["vector"] = [[1., 1.], [2., 2.], [3., 3.], [4., 4.]]
df["metadata"] = [ 
    {"category": "finance", "published": 2015},
    {"category": "finance", "published": 2016},
    {"category": "sport", "published": 2017},
    {"category": "sport", "published": 2018}]
df

|   | id | vector | metadata |
|---|----|--------|----------|
| 0 | F-1 | [1.0, 1.0] | {'category': 'finance', 'published': 2015} |
| 1 | F-2 | [2.0, 2.0] | {'category': 'finance', 'published': 2016} |
| 2 | S-1 | [3.0, 3.0] | {'category': 'sport', 'published': 2017} |
| 3 | S-2 | [4.0, 4.0] | {'category': 'sport', 'published': 2018} |


### Insert vectors

In [6]:
# Insert vectors without specifying a namespace
index.upsert(vectors=zip(df.id, df.vector, df.metadata))
index.describe_index_stats()

{'dimension': 2, 'namespaces': {'': {'vector_count': 4}}}
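
The `zip` call above hands each record to `upsert` as an `(id, values, metadata)` tuple. As a small sketch of that shape, the snippet below uses plain lists standing in for the DataFrame columns (the `records` name is introduced here for illustration only) and makes no Pinecone calls:

```python
# Plain lists standing in for df.id, df.vector, and df.metadata.
ids = ["F-1", "F-2", "S-1", "S-2"]
vectors = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
metadata = [
    {"category": "finance", "published": 2015},
    {"category": "finance", "published": 2016},
    {"category": "sport", "published": 2017},
    {"category": "sport", "published": 2018},
]

# zip pairs the columns position by position into one tuple per vector.
records = list(zip(ids, vectors, metadata))
print(records[0])  # ('F-1', [1.0, 1.0], {'category': 'finance', 'published': 2015})
```

Materializing the tuples this way can be handy for inspecting a batch before sending it.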

### Fetch a vector

In [7]:
index.fetch(ids=["F-1"])

{'namespace': '',
 'vectors': {'F-1': {'id': 'F-1',
                     'metadata': {'category': 'finance', 'published': 2015.0},
                     'values': [0.99999994, 0.99999994]}}}

### Query top-3 without filtering

In [8]:
query_results = index.query(queries=df[df.id == "F-1"].vector, top_k=3)
query_results

{'results': [{'matches': [{'id': 'F-1', 'score': 2.3567037e-07, 'values': []},
                          {'id': 'F-2', 'score': 2.00000095, 'values': []},
                          {'id': 'S-1', 'score': 7.99999905, 'values': []}],
              'namespace': ''}]}
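
The scores above appear consistent with squared Euclidean distance from the query vector `[1, 1]`, so a lower score means a closer match. As a rough check in plain Python (no Pinecone calls; `squared_euclidean` is a hypothetical helper introduced for illustration), the same ranking can be reproduced, up to the floating-point noise visible in the service's output:

```python
# Sample vectors copied from the DataFrame defined earlier in this notebook.
vectors = {
    "F-1": [1.0, 1.0],
    "F-2": [2.0, 2.0],
    "S-1": [3.0, 3.0],
    "S-2": [4.0, 4.0],
}

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (hypothetical helper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = vectors["F-1"]

# Rank every vector by its distance to the query and keep the three closest.
ranked = sorted(
    ((vec_id, squared_euclidean(query, vec)) for vec_id, vec in vectors.items()),
    key=lambda pair: pair[1],
)
top_3 = ranked[:3]
print(top_3)  # [('F-1', 0.0), ('F-2', 2.0), ('S-1', 8.0)]
```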

### Query results with articles in finance published after 2015

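Only one article should match this filter: `F-2`, a finance article published in 2016. `F-1` is also in finance, but 2015 is not strictly greater than 2015, so it is excluded.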

In [9]:
filter_condition = {
    "category" : {"$eq": "finance"},
    "published": {"$gt": 2015 }
}
query_results = index.query(
    queries=df[df.id == "F-1"].vector, top_k=3, filter=filter_condition
)
query_results

{'results': [{'matches': [{'id': 'F-2', 'score': 2.00000095, 'values': []}],
              'namespace': ''}]}
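
To make the filter semantics concrete, the sketch below evaluates the same condition in plain Python over the sample metadata. It assumes, as the result above suggests, that conditions on separate top-level fields are combined with a logical AND. `matches_filter` is a hypothetical helper that handles only the `$eq` and `$gt` operators used in this notebook; it is not Pinecone's implementation.

```python
import operator

# Metadata copied from the sample DataFrame, keyed by vector id.
metadata = {
    "F-1": {"category": "finance", "published": 2015},
    "F-2": {"category": "finance", "published": 2016},
    "S-1": {"category": "sport", "published": 2017},
    "S-2": {"category": "sport", "published": 2018},
}

# Only the two operators that appear in this notebook's filter.
OPERATORS = {"$eq": operator.eq, "$gt": operator.gt}

def matches_filter(item, filter_condition):
    """Return True if every field condition holds (assumed implicit AND)."""
    for field, conditions in filter_condition.items():
        for op_name, expected in conditions.items():
            if field not in item or not OPERATORS[op_name](item[field], expected):
                return False
    return True

filter_condition = {
    "category": {"$eq": "finance"},
    "published": {"$gt": 2015},
}

matching_ids = [
    vec_id for vec_id, meta in metadata.items()
    if matches_filter(meta, filter_condition)
]
print(matching_ids)  # ['F-2']
```

Because only `F-2` survives the filter, the query returns a single match even though `top_k=3` was requested.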

### Delete the index

In [10]:
# delete the index
pinecone.delete_index(index_name)