[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/metadata-filtered-search/metadata-filtered-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/metadata-filtered-search/metadata-filtered-search.ipynb)

# Semantic AND Keyword Search (Hybrid Search)

In this notebook we use Pinecone to run a semantic search while restricting the results with a traditional keyword filter.

In [1]:
all_sentences = [
    "purple is the best city in the forest",
    "No way chimps go bananas for snacks!",
    "it is not often you find soggy bananas on the street",
    "green should have smelled more tranquil but somehow it just tasted rotten",
    "joyce enjoyed eating pancakes with ketchup",
    "throwing bananas on to the street is not art",
    "as the asteroid hurtled toward earth becky was upset her dentist appointment had been canceled",
    "I'm getting way too old. I don't even buy green bananas anymore.",
    "to get your way you must not bombard the road with yellow fruit",
    "Time flies like an arrow; fruit flies like a banana"
]

We will use the `sentence-transformers` library to build our sentence embeddings. It can be installed with `pip`, together with the Pinecone client:

In [2]:
!pip install sentence-transformers sacremoses
!pip install "pinecone-client<3"

*(The notebook may need to be restarted for the install to take effect)*

In [3]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

We use this pretrained sentence transformer model to encode the sentences.

In [4]:
all_embeddings = model.encode(all_sentences)
all_embeddings.shape

(10, 768)

We have **10** embeddings, each with *768* dimensions. For the keyword search we also need to store *keywords* for each sentence. To get them, we use a word-level tokenizer from the Hugging Face `transformers` library to break our text into words. For this we use the tokenizer of the [`transfo-xl-wt103` model](https://huggingface.co/transformers/model_doc/transformerxl.html).

*(If needed, run `!pip install transformers`, although this package should already have been installed together with `sentence-transformers` above)*

In [5]:
from transformers import AutoTokenizer

# transfo-xl tokenizer uses word-level encodings
tokenizer = AutoTokenizer.from_pretrained('transfo-xl-wt103')

all_tokens = [tokenizer.tokenize(sentence.lower()) for sentence in all_sentences]
all_tokens[0]

['purple', 'is', 'the', 'best', 'city', 'in', 'the', 'forest']
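
To get a feel for what word-level tokens look like without downloading a tokenizer, here is a rough stand-in. It is only a sketch: `simple_word_tokenize` is a hypothetical helper, not the `transfo-xl-wt103` tokenizer, so punctuation and contractions may be split differently.

```python
import re

def simple_word_tokenize(text):
    """Hypothetical stand-in for a word-level tokenizer (not transfo-xl)."""
    # \w+ grabs runs of letters/digits; [^\w\s] keeps each punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_word_tokenize("purple is the best city in the forest")
print(tokens)
```

For this simple sentence the result matches the token list shown above; sentences with apostrophes or other punctuation are where the two approaches are most likely to diverge.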

We now have everything we need: the dense vector representation of each sentence and the list of tokens for each sentence.

Next we connect to Pinecone so that we can upsert our data. You can get a [free API key here](https://app.pinecone.io), and you can find your environment in the [Pinecone console](https://app.pinecone.io) under **API Keys**.

In [6]:
import pinecone

# connect to the Pinecone environment
# (pinecone.init is the module-level interface of pinecone-client 2.x)
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENV"  # find next to API key in console
)

We can check for existing indexes with:

In [7]:
pinecone.list_indexes()

[]

There are none, so let's create a new index with `create_index` and connect with `Index`.

In [8]:
pinecone.create_index(name='keyword-search', dimension=all_embeddings.shape[1])
index = pinecone.Index('keyword-search')

We now merge our data into a list of tuples, each structured as `(id, values, metadata)`.

In [9]:
upserts = []
for i, (embedding, tokens) in enumerate(zip(all_embeddings, all_tokens)):
    upserts.append((str(i), embedding.tolist(), {'tokens': tokens}))

In [10]:
# then we upsert
index.upsert(vectors=upserts)

{'upserted_count': 10}
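
With only 10 vectors a single upsert call is enough. For larger collections it is common to send the data in batches instead. A minimal sketch, assuming a hypothetical `chunked` helper and an arbitrary batch size (neither is part of the notebook or the Pinecone client):

```python
def chunked(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# toy stand-in for the tuples built above
toy_upserts = [(str(i), [0.0, 0.1], {'tokens': ['word']}) for i in range(10)]

batches = list(chunked(toy_upserts, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
# each batch could then be sent with index.upsert(vectors=batch)
```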

### Upsert with CURL

Alternatively, we can upsert using curl. For this we need to reformat our data and save it as a JSON file.

In [None]:
import json

# reformat the data
upserts = {'vectors': []}
for i, (embedding, tokens) in enumerate(zip(all_embeddings, all_tokens)):
    vector = {'id':f'{i}',
              'values': embedding.tolist(),
              'metadata':{'tokens':tokens}}
    upserts['vectors'].append(vector)

# save to JSON
with open('./upsert.json', 'w') as f:
    json.dump(upserts, f, indent=4)

This produces a JSON file containing a list of *10* dictionaries under the `vectors` key. Each dictionary holds the ID, embedding, and metadata for a single sentence in the format:

```json
{
    "id": "0",
    "values": [0.001, 0.002, ...],
    "metadata": {
        "tokens": ["purple", "is", ...]
    }
}
```
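
Before sending the file anywhere, it can help to sanity-check the payload shape. The sketch below uses a hypothetical `validate_payload` helper and a toy 3-dimensional payload; the checks reflect the structure described above, not an official schema.

```python
import json

def validate_payload(payload, dimension):
    """Raise ValueError if a vector does not match the expected shape; return the vector count."""
    for vector in payload['vectors']:
        if not isinstance(vector['id'], str):
            raise ValueError(f"id must be a string: {vector['id']!r}")
        if len(vector['values']) != dimension:
            raise ValueError(
                f"vector {vector['id']} has {len(vector['values'])} values, expected {dimension}"
            )
        if not all(isinstance(token, str) for token in vector['metadata']['tokens']):
            raise ValueError(f"tokens for vector {vector['id']} must all be strings")
    return len(payload['vectors'])

toy_payload = {'vectors': [
    {'id': '0', 'values': [0.001, 0.002, 0.003], 'metadata': {'tokens': ['purple', 'is']}}
]}

# round-trip through JSON to confirm the payload is serializable
restored = json.loads(json.dumps(toy_payload))
print(validate_payload(restored, dimension=3))
```

For the real file, `dimension` would be `all_embeddings.shape[1]` and the payload would be loaded from `./upsert.json`.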

To upsert with curl, we first find the index URL in the [Pinecone dashboard](https://app.pinecone.io). For example, if the upsert endpoint is `https://keyword-search-1234.svc.us-west1-gcp.pinecone.io/vectors/upsert`, we would run:

In [None]:
!curl -X POST \
    https://keyword-search-1234.svc.us-west1-gcp.pinecone.io/vectors/upsert \
    -H 'Content-Type: application/json' \
    -H 'Api-Key: YOUR_API_KEY' \
    -d @./upsert.json
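
The same request can also be assembled in Python with the standard library. This is a sketch only: `build_upsert_request` is a hypothetical helper, the URL is the placeholder from the text, and the request is built but deliberately not sent.

```python
import json
import urllib.request

def build_upsert_request(url, api_key, payload):
    """Build (but do not send) a POST request mirroring the curl command."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json', 'Api-Key': api_key},
        method='POST',
    )

request = build_upsert_request(
    'https://keyword-search-1234.svc.us-west1-gcp.pinecone.io/vectors/upsert',  # placeholder URL
    'YOUR_API_KEY',
    {'vectors': []},  # toy payload; in practice, the contents of upsert.json
)
print(request.get_method(), request.full_url)
# sending is left out of the sketch: urllib.request.urlopen(request)
```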

## Querying

Now that the data is in our index, let's first perform a semantic search with a query sentence, returning the most *semantically* similar sentences.

We define the query, and encode as we did for `all_sentences` before.

In [11]:
query_sentence = "there is an art to getting your way and throwing bananas on to the street is not it"
xq = model.encode(query_sentence).tolist()

When querying with `index.query` we pass the query vector via the `vector` argument; *later*, when filtering for specific keywords, we will add the `filter` parameter.

In [12]:
result = index.query(vector=xq, top_k=10, include_metadata=True)
result

{'matches': [{'id': '5',
              'metadata': {'tokens': ['throwing',
                                      'bananas',
                                      'on',
                                      'to',
                                      'the',
                                      'street',
                                      'is',
                                      'not',
                                      'art']},
              'score': 0.732851923,
              'values': []},
             {'id': '8',
              'metadata': {'tokens': ['to',
                                      'get',
                                      'your',
                                      'way',
                                      'you',
                                      'must',
                                      'not',
                                      'bombard',
                                      'the',
                                      'road',
             

Let's extract just the sentence IDs to see the order of what we have returned.

In [13]:
[x['id'] for x in result['matches']]

['5', '8', '2', '1', '9', '7', '0', '3', '4', '6']
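
Conceptually, this ordering comes from scoring every stored vector against the query vector and sorting by score. A minimal pure-Python sketch with toy 2-dimensional vectors, assuming cosine similarity as the metric (the notebook does not set a metric explicitly, and `rank_by_cosine` is a hypothetical helper):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_cosine(query, vectors):
    """Return (indices sorted from most to least similar, list of scores)."""
    scores = [cosine(query, vector) for vector in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order, scores

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order, scores = rank_by_cosine([1.0, 0.2], toy_vectors)
print(order)  # [0, 2, 1]
```

A vector index avoids this brute-force comparison at scale, but the ranking idea is the same.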

Now let's add a keyword filter. Let's restrict the search to only return sentences that contain the word `bananas`.

In [14]:
result = index.query(vector=xq, top_k=10, filter={'tokens': 'bananas'})
[x['id'] for x in result['matches']]

['5', '2', '1', '7']

Again, let's extract IDs and then use these to see which sentences we're returning in the query above.

In [15]:
ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

throwing bananas on to the street is not art
it is not often you find soggy bananas on the street
No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.


Okay cool, we can see that we're now filtering out all samples that do *not* contain the word 'bananas'. Maybe we'd like to extend this keyword filter further - for example we could filter for any samples that contain the word 'bananas' **OR** 'way' by modifying our filter to `{'$or': [{'tokens': 'bananas'}, {'tokens': 'way'}]}`.

In [16]:
result = index.query(vector=xq, top_k=10, filter={'$or': [
                         {'tokens': 'bananas'},
                         {'tokens': 'way'}
                     ]})

ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

throwing bananas on to the street is not art
to get your way you must not bombard the road with yellow fruit
it is not often you find soggy bananas on the street
No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.


Alternatively, we can use the **in** condition `$in` rather than `$or`; it produces the same results:

In [17]:
result = index.query(vector=xq, top_k=10, filter={
    'tokens': {'$in': ['bananas', 'way']}
})

ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

throwing bananas on to the street is not art
to get your way you must not bombard the road with yellow fruit
it is not often you find soggy bananas on the street
No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.
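
Since the two forms are interchangeable here, one can be derived from the other mechanically. A small sketch, assuming a hypothetical `or_to_in` helper that only handles the simple case shown above (plain equality clauses on a single field):

```python
def or_to_in(filter_dict):
    """Collapse {'$or': [{field: a}, {field: b}]} into {field: {'$in': [a, b]}}."""
    clauses = filter_dict['$or']
    fields = {field for clause in clauses for field in clause}
    if len(fields) != 1 or any(len(clause) != 1 for clause in clauses):
        raise ValueError('expected single-field clauses that all use the same field')
    field = fields.pop()
    values = [clause[field] for clause in clauses]
    if any(isinstance(value, dict) for value in values):
        raise ValueError('only plain equality clauses are supported')
    return {field: {'$in': values}}

or_filter = {'$or': [{'tokens': 'bananas'}, {'tokens': 'way'}]}
in_filter = or_to_in(or_filter)
print(in_filter)  # {'tokens': {'$in': ['bananas', 'way']}}
```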


We could decide we only want to return samples that contain *both* 'bananas' **AND** 'way' by swapping the `$or` modifier for `$and`.

In [18]:
result = index.query(vector=xq, top_k=10, filter={'$and': [
                         {'tokens': 'bananas'},
                         {'tokens': 'way'}
                     ]})

ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.


If we have a lot of keywords, including every single one manually as above quickly gets tiresome, so we can build the list of conditions programmatically instead:

In [19]:
keywords = ['bananas', 'way', 'green']
filter_dict = [{'tokens': word} for word in keywords]
filter_dict

[{'tokens': 'bananas'}, {'tokens': 'way'}, {'tokens': 'green'}]

And add it to our `query`.

In [20]:
result = index.query(vector=xq, top_k=10, filter={'$and': filter_dict})

ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

I'm getting way too old. I don't even buy green bananas anymore.
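
The same idea can be wrapped in a small function so that "all keywords" and "any keyword" filters are built the same way. This is a sketch; `build_keyword_filter` and its `mode` argument are hypothetical names, not part of the Pinecone client.

```python
def build_keyword_filter(keywords, field='tokens', mode='all'):
    """Build a metadata filter from a list of keywords.

    mode='all' requires every keyword ($and of equality clauses);
    mode='any' requires at least one keyword ($in).
    """
    keywords = list(keywords)
    if not keywords:
        raise ValueError('at least one keyword is required')
    if mode == 'all':
        return {'$and': [{field: word} for word in keywords]}
    if mode == 'any':
        return {field: {'$in': keywords}}
    raise ValueError(f"unknown mode: {mode!r}")

all_filter = build_keyword_filter(['bananas', 'way', 'green'])
any_filter = build_keyword_filter(['bananas', 'way'], mode='any')
print(all_filter)
print(any_filter)
```

The returned dictionary can be passed as the `filter` argument of `index.query`, exactly like the hand-written filters above.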


We may also want to exclude sentences that satisfy a condition. For example, we may want all sentences that *do not* contain *'bananas'* but *do* contain *'way'*. To do this we apply **not equals** `$ne` to the `bananas` part of the filter.

In [21]:
result = index.query(vector=xq, top_k=10, filter={'$and': [
                         {'tokens': {'$ne': 'bananas'}},
                         {'tokens': 'way'}
                     ]})

ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

to get your way you must not bombard the road with yellow fruit


We can exclude multiple keywords too using the **not in** `$nin` condition.

In [22]:
result = index.query(vector=xq, top_k=10, filter={'tokens':
    {'$nin': ['bananas', 'way']}
})

ids = [int(x['id']) for x in result['matches']]
for i in ids:
    print(all_sentences[i])

Time flies like an arrow; fruit flies like a banana
purple is the best city in the forest
green should have smelled more tranquil but somehow it just tasted rotten
joyce enjoyed eating pancakes with ketchup
as the asteroid hurtled toward earth becky was upset her dentist appointment had been canceled
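
To check our understanding of the operators, we can emulate them locally on the token lists. The sketch below approximates the semantics observed in the outputs above for a list-valued field; it is not Pinecone's implementation. `matches` and `matching_ids` are hypothetical helpers, and a simple regex split stands in for the tokenizer.

```python
import re

sentences = [
    "purple is the best city in the forest",
    "No way chimps go bananas for snacks!",
    "it is not often you find soggy bananas on the street",
    "green should have smelled more tranquil but somehow it just tasted rotten",
    "joyce enjoyed eating pancakes with ketchup",
    "throwing bananas on to the street is not art",
    "as the asteroid hurtled toward earth becky was upset her dentist appointment had been canceled",
    "I'm getting way too old. I don't even buy green bananas anymore.",
    "to get your way you must not bombard the road with yellow fruit",
    "Time flies like an arrow; fruit flies like a banana",
]
# rough stand-in for the word-level tokenizer
token_lists = [re.findall(r"\w+", s.lower()) for s in sentences]

def _check(op, operand, values):
    """Evaluate one operator against a list-valued metadata field."""
    if op == '$eq':
        return operand in values
    if op == '$ne':
        return operand not in values
    if op == '$in':
        return any(item in values for item in operand)
    if op == '$nin':
        return all(item not in values for item in operand)
    raise ValueError(f"unsupported operator: {op}")

def matches(flt, metadata):
    """Return True if metadata satisfies the filter dictionary."""
    for key, cond in flt.items():
        if key == '$and':
            ok = all(matches(sub, metadata) for sub in cond)
        elif key == '$or':
            ok = any(matches(sub, metadata) for sub in cond)
        else:
            if not isinstance(cond, dict):
                cond = {'$eq': cond}  # a bare value means equality
            ok = all(_check(op, operand, metadata.get(key, [])) for op, operand in cond.items())
        if not ok:
            return False
    return True

def matching_ids(flt):
    """IDs of sentences whose tokens satisfy the filter, in index order."""
    return [i for i, tokens in enumerate(token_lists) if matches(flt, {'tokens': tokens})]

print(matching_ids({'tokens': {'$nin': ['bananas', 'way']}}))
print(matching_ids({'$and': [{'tokens': {'$ne': 'bananas'}}, {'tokens': 'way'}]}))
```

The IDs come back in index order rather than ranked by similarity, since this sketch covers only the filtering half of the query.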
