Install the prerequisite libraries:

In [None]:
!pip install sentence-transformers pinecone-client torch datasets sacremoses

HuggingFace hosts many sentence transformer models that can serve as the sentence encoder. The pretrained models are listed at https://sbert.net/docs/pretrained_models.html. I download and initialize a model instance like so:

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-distilroberta-v1')

I am using the Quora question duplicates dataset, which contains pairs of questions. Duplicate pairs are not syntactically the same but share the same meaning. I use HuggingFace's datasets library to access the dataset.

In [None]:
import datasets

quora = datasets.load_dataset('quora', split='train[:10000]')  
# the full dataset contains 404K pairs
quora

In [None]:
quora[1020]

The full dataset contains more than 404K pairs. Encoding all of these at once in memory is not efficient, so I encode and upsert them to Pinecone in batches. I upsert each sample as a tuple `(id, vectors, metadata)`, where:

* `id` - a str ID

* `vectors` - the sentence vector (in list format)

* `metadata` - a dictionary with the keys `tokens`, `is_duplicate`, and `char_length`
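As an illustration of this layout, a single record could look like the sketch below. The ID, tokens, and character length are taken from the first match in the query results shown later in this notebook. The three-element vector is only a placeholder, since the real vectors have 768 values.

```python
# one (id, vectors, metadata) record; the vector is a shortened placeholder
record = (
    '6849',
    [0.1, 0.2, 0.3],
    {
        'tokens': ['what', 'questions', 'are', 'asked', 'in', 'google', 'interviews', '?'],
        'is_duplicate': 1,
        'char_length': 46,
    },
)

# unpack the tuple into its three parts
record_id, vector, metadata = record
```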

To create `vectors` and `tokens` I use the sentence transformer's `encode` method and a tokenizer. The tokenizer comes from HuggingFace transformers and breaks text into words.

In [None]:
from transformers import AutoTokenizer

# transfo-xl tokenizer uses word-level encodings
tokenizer = AutoTokenizer.from_pretrained('transfo-xl-wt103')

tokenizer.tokenize('TechnicFMC is a big company'.lower())
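If the `transfo-xl-wt103` tokenizer is unavailable, a rough stand-in can be written with the standard library. This is a sketch, not the tokenizer used in this notebook: `simple_word_tokenize` is a hypothetical helper, and its output will likely differ from the HuggingFace tokenizer on contractions, hyphens, and similar edge cases.

```python
import re


def simple_word_tokenize(text):
    """Split lowercased text into word tokens and single punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = simple_word_tokenize('What questions are asked in Google Interviews?')
print(tokens)
```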

To upsert to Pinecone I first need an index to upsert to. I create it via the Pinecone Python client. First I initialize the connection to Pinecone with an API key:

In [1]:
import pinecone
# replace the placeholder with your own API key; never commit a real key
pinecone.init(api_key='YOUR_API_KEY', environment='us-west1-gcp')

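Hard-coding an API key in a notebook makes it easy to leak. One option, sketched below, is to read it from environment variables. The variable names `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` and the helper `load_pinecone_settings` are assumptions chosen for this example, not something the Pinecone client requires.

```python
import os


def load_pinecone_settings(env=os.environ):
    """Return (api_key, environment) read from a mapping of environment variables."""
    api_key = env.get('PINECONE_API_KEY')
    if not api_key:
        raise RuntimeError('Set the PINECONE_API_KEY environment variable first')
    # fall back to the environment name used in this notebook
    environment = env.get('PINECONE_ENVIRONMENT', 'us-west1-gcp')
    return api_key, environment
```

The returned pair can then be passed to `pinecone.init(api_key=..., environment=...)`.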

Create a new index. The dimension must match the embedding size of the model (768 for `all-distilroberta-v1`):

In [None]:
pinecone.create_index(name='semantic-search-demo', dimension=768)

Then connect to the index with:

In [None]:
index = pinecone.Index('semantic-search-demo')

I process the data in batches, creating the *vectors* and *metadata* and upserting them to Pinecone:

In [None]:
from tqdm.auto import tqdm  # progress bar

data = []

# loop through the rows and build (id, vectors, metadata) tuples
for i, row in enumerate(tqdm(quora)):
    # each Quora row contains a pair of sentences, loop through both
    for pair in [0, 1]:
        text = row['questions']['text'][pair]
        # append the (id, vectors, metadata) tuple to our 'data' list
        data.append((
            str(row['questions']['id'][pair]),
            model.encode(text).tolist(),
            {
                'tokens': tokenizer.tokenize(text.lower()),
                'is_duplicate': int(row['is_duplicate']),
                'char_length': len(text)
            }
        ))
    # once we reach 100 samples OR the last row of the dataset, upsert to Pinecone
    if len(data) >= 100 or i == len(quora) - 1:
        index.upsert(vectors=data)
        # and now reset the data list
        data = []

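The batching logic can also be separated from the encoding step. The sketch below uses a hypothetical helper, `batched`, that yields fixed-size chunks from any iterable, so the final partial batch is never dropped. The `upsert` call is left as a comment because it needs a live index.

```python
from itertools import islice


def batched(iterable, batch_size=100):
    """Yield lists of up to batch_size items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# stand-in records; in the notebook these would be (id, vectors, metadata) tuples
records = [(str(n), [0.0], {}) for n in range(250)]

batch_sizes = []
for batch in batched(records, batch_size=100):
    # index.upsert(vectors=batch)  # would send this batch to Pinecone
    batch_sizes.append(len(batch))

print(batch_sizes)
```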

## Querying with Pinecone

The index and data are now ready. First I create a *query vector* `xq`. This is a sentence (or in this case a question) encoded using the same model that encoded the Quora dataset.


In [None]:
query = "What questions are asked in Google Interviews?"
xq = model.encode([query]).tolist()

I can return similar sentences using the `query` method.

In [None]:
result = index.query(xq, top_k=5, includeMetadata=True)
result

{'matches': [{'id': '6849',
              'metadata': {'char_length': 46.0,
                           'is_duplicate': 1.0,
                           'tokens': ['what',
                                      'questions',
                                      'are',
                                      'asked',
                                      'in',
                                      'google',
                                      'interviews',
                                      '?']},
              'score': 1.00000012,
              'sparseValues': {},
              'values': []},
             {'id': '6848',
              'metadata': {'char_length': 66.0,
                           'is_duplicate': 1.0,
                           'tokens': ['what',
                                      'are',
                                      'some',
                                      'questions',
                                      'that',
                                      'i',

I use the ID values to map these matches back to the original sentences. First I create a dictionary mapping IDs to text:

In [None]:
id2text = {}
for row in quora:
    for pair in [0, 1]:
        id2text[str(row['questions']['id'][pair])] = row['questions']['text'][pair]

Now I map the returned IDs to text:

In [None]:
for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

1.0
What questions are asked in Google Interviews?
0.82
What are some questions that I may be asked in a Google interview?
0.73
What are some interesting questions asked in an interview?
0.7
What are the best interview questions ever asked?
0.67
What are good questions for you to ask an interviewer?
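The scores above are consistent with cosine similarity, which I believe is the default metric when `create_index` is called without a `metric` argument, as it is here. A query that exactly matches a stored question scores about 1.0. A minimal pure-Python sketch of the metric, using made-up two-dimensional vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# identical vectors point in the same direction
same = cosine_similarity([3.0, 4.0], [3.0, 4.0])
# perpendicular vectors share no direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```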


Next, I use metadata filtering to return only questions marked as *not* duplicates.

In [None]:
result = index.query(xq, top_k=5, filter={'is_duplicate': {'$eq': 0}})

for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

0.73
What are some interesting questions asked in an interview?
0.67
What are good questions for you to ask an interviewer?
0.66
What are the trickiest questions asked in an interview?
0.66
What questions should a job candidate ask the interviewer?
0.61
What should I expect in a Software Engineer interview at Google and how should I prepare?


Metadata filtering also works as a keyword search. Here I find what appears when excluding questions that contain the word 'quora'.

In [None]:
result = index.query(xq, top_k=5, filter={'tokens': {'$nin': ['quora']}})

for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

1.0
What questions are asked in Google Interviews?
0.82
What are some questions that I may be asked in a Google interview?
0.73
What are some interesting questions asked in an interview?
0.7
What are the best interview questions ever asked?
0.67
What are good questions for you to ask an interviewer?
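To make the filter semantics concrete, here is a small local emulation of the `$eq`, `$in`, and `$nin` operators. It reflects my reading of how these operators apply to scalar and list-valued metadata and should not be taken as Pinecone's implementation. `matches_filter` is a hypothetical helper, and the sample metadata is made up.

```python
def matches_filter(metadata, filter_spec):
    """Return True if metadata satisfies every condition in filter_spec."""
    for field, conditions in filter_spec.items():
        value = metadata.get(field)
        # treat scalars as one-element lists so both cases share the same logic
        values = value if isinstance(value, list) else [value]
        for op, operand in conditions.items():
            # operators other than these three are ignored in this sketch
            if op == '$eq' and value != operand:
                return False
            if op == '$in' and not any(v in operand for v in values):
                return False
            if op == '$nin' and any(v in operand for v in values):
                return False
    return True


meta = {'tokens': ['how', 'do', 'i', 'use', 'quora', '?'], 'is_duplicate': 0}

print(matches_filter(meta, {'is_duplicate': {'$eq': 0}}))
print(matches_filter(meta, {'tokens': {'$nin': ['quora']}}))
print(matches_filter(meta, {'tokens': {'$in': ['google', 'quora']}}))
```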


More generally, conditions can be combined to return questions that contain at least one of several keywords while excluding others.

In [None]:
query = "how to ask a good question?"
xq = model.encode([query]).tolist()

result = index.query(xq, top_k=5, filter={'tokens': {
    '$nin': ['quora', 'quorans'],
    '$in': ['google', 'reddit', 'stackoverflow']
}})

for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])
