# Semantic AND Keyword Search (Hybrid Search)

In this notebook we look at how to use Pinecone to run a semantic search while also applying a traditional keyword filter.

In [1]:
all_sentences = [
    "purple is the best city in the forest",
    "No way chimps go bananas for snacks!",
    "it is not often you find soggy bananas on the street",
    "green should have smelled more tranquil but somehow it just tasted rotten",
    "joyce enjoyed eating pancakes with ketchup",
    "throwing bananas on to the street is not art",
    "as the asteroid hurtled toward earth becky was upset her dentist appointment had been canceled",
    "I'm getting way too old. I don't even buy green bananas anymore.",
    "to get your way you must not bombard the road with yellow fruit",
    "Time flies like an arrow; fruit flies like a banana"
]

We will use the `sentence-transformers` library to build our sentence embeddings. It can be installed using `pip` like so:

In [None]:
!pip install sentence-transformers

*(The notebook may need to be restarted for the install to take effect)*

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

We use this pretrained sentence transformer model to encode the sentences.

In [3]:
all_embeddings = model.encode(all_sentences)
all_embeddings.shape

(10, 768)

We have **10** embeddings, each with a dimensionality of *768*. For the keyword search we also need to store our sentences. For keyword matching to be effective, we lowercase each sentence, strip its punctuation, and break it into a list of words, i.e. tokens.

In [4]:
from string import punctuation

def tokenize(sentence):
    return [w.lower().strip(punctuation) for w in sentence.split()]

all_tokens = [tokenize(sentence) for sentence in all_sentences]
all_tokens[0]

['purple', 'is', 'the', 'best', 'city', 'in', 'the', 'forest']
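
One edge case worth knowing about: a word made up only of punctuation (a free-standing dash, say) is stripped down to an empty string and would be stored as an empty token. A minimal sketch of a variant that drops such tokens follows; the name `tokenize_nonempty` is ours and is not used elsewhere in this notebook.

```python
from string import punctuation

def tokenize_nonempty(sentence):
    # same lowercasing and punctuation stripping as `tokenize`,
    # but tokens that end up empty are discarded
    tokens = (w.lower().strip(punctuation) for w in sentence.split())
    return [t for t in tokens if t]

print(tokenize_nonempty("Fruit flies - like a banana!"))
# ['fruit', 'flies', 'like', 'a', 'banana']
```

None of the ten sentences above contains a punctuation-only word, so for this dataset both versions produce the same tokens.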

We now have everything we need: the dense vector representation of each sentence and the stripped list of tokens for each sentence. Let's prepare that data for *upserting* to Pinecone.

In [5]:
data = []
for i, (embedding, tokens) in enumerate(zip(all_embeddings, all_tokens)):
    vector = {'id':f'{i}',
              'values': embedding.tolist(),
              'metadata':{'tokens':tokens}}
    data.append(vector)

This produces a list of *10* dictionaries, each containing the ID, embedding, and metadata for a single sentence. The ID is the sentence's position in `all_sentences` as a string (`'0'` to `'9'`), so each dictionary has the format:

```python
{
    'id': '0',
    'values': [0.001, 0.002, ...],
    'metadata': {
        'tokens': ['purple', 'is', ...]
    }
}
```
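
If you would rather not keep `all_sentences` around to look results up by ID later, one option is to store the original text in the metadata as well. The sketch below assumes a hypothetical `make_record` helper and a `text` metadata key, neither of which is used in the rest of this notebook, and a toy vector stands in for a real embedding.

```python
from string import punctuation

def tokenize(sentence):
    return [w.lower().strip(punctuation) for w in sentence.split()]

def make_record(i, embedding, sentence):
    # hypothetical helper: the same layout as the records built above,
    # plus a 'text' field holding the original sentence
    return {
        'id': str(i),
        'values': list(embedding),
        'metadata': {'tokens': tokenize(sentence), 'text': sentence},
    }

# a toy 3-dimensional vector stands in for a real 768-dimensional embedding
record = make_record(0, [0.1, 0.2, 0.3], "purple is the best city in the forest")
print(record['id'], record['metadata']['tokens'][:3])
```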

Next we need to connect to a Pinecone instance. You can get a [free API key here](https://app.pinecone.io).

In [6]:
import pinecone

# read the API key from a local file, stripping any trailing newline
with open('../../../secret/pinecone', 'r') as fp:
    api_key = fp.read().strip()
pinecone.init(api_key=api_key, environment='us-west1-gcp')
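
If you prefer not to depend on a file path, the key can also come from an environment variable. This is only a sketch: the helper `load_api_key` and the variable name `PINECONE_API_KEY` are assumptions made for illustration, not something the notebook requires.

```python
import os

def load_api_key(env_var='PINECONE_API_KEY', path=None):
    # hypothetical helper: prefer the environment variable,
    # fall back to reading a file if a path is given
    key = os.environ.get(env_var)
    if key is None and path is not None:
        with open(path, 'r') as fp:
            key = fp.read()
    key = (key or '').strip()  # drop stray whitespace and newlines
    if not key:
        raise RuntimeError(f'no API key found in ${env_var} or in {path!r}')
    return key

# usage, once the variable is set in your shell:
# api_key = load_api_key()
```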

We can check for existing indexes with:

In [7]:
pinecone.list_indexes()

[]

There are none, so let's create a new index with `create_index` and connect with `Index`.

In [8]:
pinecone.create_index(name='keyword-search', dimension=all_embeddings.shape[1])
index = pinecone.Index('keyword-search')

We reformat the list of dictionaries in `data` into a list of `(id, values, metadata)` tuples called `upserts`, ready to be upserted.

In [9]:
upserts = [(v['id'], v['values'], v['metadata']) for v in data]
# then we upsert
index.upsert(vectors=upserts)

{'upsertedCount': 10.0}
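
Ten vectors fit comfortably in a single call. For a larger dataset it is common to send the vectors in batches instead. A minimal sketch, assuming a hypothetical `batched` helper and an arbitrary batch size of 100 (toy tuples stand in for `upserts`):

```python
def batched(items, batch_size=100):
    # yield consecutive slices of at most `batch_size` items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# toy (id, values, metadata) tuples standing in for `upserts`
toy_upserts = [(str(i), [0.0, 0.0], {'tokens': []}) for i in range(250)]
sizes = [len(batch) for batch in batched(toy_upserts, batch_size=100)]
print(sizes)  # [100, 100, 50]

# with a live index, each batch would be sent separately:
# for batch in batched(upserts, batch_size=100):
#     index.upsert(vectors=batch)
```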

## Querying

We now have the data in our index. Let's first perform a semantic search using a query sentence, returning the most *semantically* similar sentences.

We define the query, and encode as we did for `all_sentences` before.

In [10]:
query_sentence = "there is an art to getting your way and throwing bananas on to the street is not it"
xq = model.encode([query_sentence]).tolist()

When querying with `index.query` we can pass a list of queries. We will pass the query vector as our first argument, and *later* when filtering for specific keywords we will add the `filter` parameter.

In [11]:
result = index.query(xq, top_k=10, includeMetadata=True)
result

{'results': [{'matches': [{'id': '5',
                           'metadata': {'tokens': ['throwing',
                                                   'bananas',
                                                   'on',
                                                   'to',
                                                   'the',
                                                   'street',
                                                   'is',
                                                   'not',
                                                   'art']},
                           'score': 0.732851863,
                           'values': []},
                          {'id': '8',
                           'metadata': {'tokens': ['to',
                                                   'get',
                                                   'your',
                                                   'way',
                                                   'you',
          

Let's extract just the sentence IDs to see the order of what we have returned.

In [12]:
[x['id'] for x in result['results'][0]['matches']]

['5', '8', '2', '1', '9', '7', '0', '3', '4', '6']
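
The order above follows the `score` of each match. Assuming the index uses cosine similarity (we did not pass a `metric` to `create_index`, so this relies on the default), the ranking step can be sketched locally. The vectors below are toy 2-dimensional values, not real sentence embeddings.

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors keyed by ID
toy_index = {'a': [1.0, 0.0], 'b': [0.7, 0.7], 'c': [0.0, 1.0]}
toy_query = [1.0, 0.1]

# sort IDs from most to least similar to the query
ranked = sorted(toy_index,
                key=lambda k: cosine_similarity(toy_query, toy_index[k]),
                reverse=True)
print(ranked)  # ['a', 'b', 'c']
```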

Now let's add a keyword filter. Let's restrict the search to only return sentences that contain the word `bananas`.

In [13]:
result = index.query(xq, top_k=10, filter={'tokens': 'bananas'})
[x['id'] for x in result['results'][0]['matches']]

['5', '2', '1', '7']

Again, let's extract IDs and then use these to see which sentences we're returning in the query above.

In [14]:
ids = [int(x['id']) for x in result['results'][0]['matches']]
for i in ids:
    print(all_sentences[i])

throwing bananas on to the street is not art
it is not often you find soggy bananas on the street
No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.
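
Notice that the last sentence in `all_sentences`, which ends in "banana", is not returned: the filter compares whole tokens, so the singular form does not match `bananas`. A quick local check of that reading, using plain list membership as a stand-in for the filter:

```python
from string import punctuation

def tokenize(sentence):
    return [w.lower().strip(punctuation) for w in sentence.split()]

tokens = tokenize("Time flies like an arrow; fruit flies like a banana")
print('bananas' in tokens)  # False
print('banana' in tokens)   # True
```

If plural and singular forms should match each other, the tokens would need to be normalized (for example by stemming) before upserting, and the keywords normalized the same way at query time.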


Okay cool, we can see that we're now filtering out all samples that do *not* contain the word 'bananas'. Maybe we'd like to extend this keyword filter further - for example we could filter for any samples that contain the word 'bananas' **OR** 'way' by modifying our filter to `{'$or': [{'tokens': 'bananas'}, {'tokens': 'way'}]}`.

In [15]:
result = index.query(xq, top_k=10, filter={'$or': [
                         {'tokens': 'bananas'},
                         {'tokens': 'way'}
                     ]})

ids = [int(x['id']) for x in result['results'][0]['matches']]
for i in ids:
    print(all_sentences[i])

throwing bananas on to the street is not art
to get your way you must not bombard the road with yellow fruit
it is not often you find soggy bananas on the street
No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.


Or we could decide we only want to return samples that contain *both* 'bananas' **AND** 'way' by swapping the `$or` modifier for `$and`.

In [16]:
result = index.query(xq, top_k=10, filter={'$and': [
                         {'tokens': 'bananas'},
                         {'tokens': 'way'}
                     ]})

ids = [int(x['id']) for x in result['results'][0]['matches']]
for i in ids:
    print(all_sentences[i])

No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.
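
To sanity-check results like these without a live index, the filter shapes used so far can be emulated locally. This is a rough sketch of the semantics as they are used in this notebook, not Pinecone's implementation; `matches` is a hypothetical helper that only handles `$or`, `$and`, and a plain `{'tokens': word}` condition.

```python
def matches(flt, metadata):
    # evaluate one filter dict against one metadata dict
    results = []
    for key, cond in flt.items():
        if key == '$or':
            results.append(any(matches(sub, metadata) for sub in cond))
        elif key == '$and':
            results.append(all(matches(sub, metadata) for sub in cond))
        else:
            # a list-valued field matches when it contains the value
            results.append(cond in metadata.get(key, []))
    return all(results)

# tokens for three of the sentences above, keyed by ID
toy_metadata = {
    '1': {'tokens': ['no', 'way', 'chimps', 'go', 'bananas', 'for', 'snacks']},
    '5': {'tokens': ['throwing', 'bananas', 'on', 'to', 'the', 'street',
                     'is', 'not', 'art']},
    '8': {'tokens': ['to', 'get', 'your', 'way', 'you', 'must', 'not',
                     'bombard', 'the', 'road', 'with', 'yellow', 'fruit']},
}

either = {'$or': [{'tokens': 'bananas'}, {'tokens': 'way'}]}
both = {'$and': [{'tokens': 'bananas'}, {'tokens': 'way'}]}

print(sorted(i for i, m in toy_metadata.items() if matches(either, m)))  # ['1', '5', '8']
print(sorted(i for i, m in toy_metadata.items() if matches(both, m)))    # ['1']
```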


If we have a lot of keywords, writing out every condition manually like above quickly gets tiresome, so we can build the list of conditions programmatically instead:

In [17]:
keywords = ['bananas', 'way', 'purple']
filter_dict = [{'tokens': word} for word in keywords]
filter_dict

[{'tokens': 'bananas'}, {'tokens': 'way'}, {'tokens': 'purple'}]

And add it to our `query`.

In [18]:
result = index.query(xq, top_k=10, filter={'$or': filter_dict})

ids = [int(x['id']) for x in result['results'][0]['matches']]
for i in ids:
    print(all_sentences[i])

throwing bananas on to the street is not art
to get your way you must not bombard the road with yellow fruit
it is not often you find soggy bananas on the street
No way chimps go bananas for snacks!
I'm getting way too old. I don't even buy green bananas anymore.
purple is the best city in the forest
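
The list comprehension and the `$or`/`$and` wrapper can be folded into one small function. A sketch, assuming a hypothetical `build_keyword_filter` helper; it lowercases the keywords because the stored tokens were lowercased by `tokenize`.

```python
def build_keyword_filter(keywords, mode='or'):
    # hypothetical helper: mode is 'or' (any keyword) or 'and' (all keywords)
    if mode not in ('or', 'and'):
        raise ValueError("mode must be 'or' or 'and'")
    conditions = [{'tokens': word.lower()} for word in keywords]
    return {f'${mode}': conditions}

print(build_keyword_filter(['bananas', 'Way'], mode='and'))
# {'$and': [{'tokens': 'bananas'}, {'tokens': 'way'}]}
```

The result can be passed straight to the query, for example `index.query(xq, top_k=10, filter=build_keyword_filter(keywords))`.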
