[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/light-demo/light-demo.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/light-demo/light-demo.ipynb)

We start by installing prerequisite libraries:

In [1]:
!pip install sentence-transformers pinecone-client torch datasets sacremoses


There are many sentence transformer models covering paraphrasing, Q&A, text-image, and, in our case, semantic similarity. Pretrained models are listed [here](https://sbert.net/docs/pretrained_models.html). We download and initialize a model instance like so:

In [2]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-distilroberta-v1', device=device)


We'll encode some example sentences and work through the process of *upserting* those to Pinecone.

We will use the Quora question duplicates dataset, which contains pairs of questions that are not syntactically the same but may share the same meaning. We use Hugging Face's `datasets` library to access the dataset.

In [3]:
import datasets

quora = datasets.load_dataset('quora', split='train[:300]')  # we only include the first 300 samples in this notebook
# the full dataset contains 404K pairs
quora


Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 300
})

In [4]:
quora[0]

{'is_duplicate': False,
 'questions': {'id': [1, 2],
  'text': ['What is the step by step guide to invest in share market in india?',
   'What is the step by step guide to invest in share market?']}}

The full dataset contains more than 404K pairs. Encoding all of them in memory at once is inefficient, so we will work through the data in batches and upsert them to Pinecone as we go. Each sample is upserted as a tuple `(id, vectors, metadata)`, containing:

* `id` - a str ID

* `vectors` - the sentence vector (in list format)

* `metadata` - a dictionary in the format:

```json
{
    'tokens': <list of tokens for keyword search>,
    'is_duplicate': <True/False whether this is a duplicate question>,
    'char_length': <length of sentence (in text characters)>
}
```
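As a rough illustration of that tuple layout, the sketch below builds one record in plain Python. The `make_record` helper, the regex tokenizer, and `toy_embed` are all hypothetical stand-ins (the notebook itself uses the sentence transformer's `encode` method and a Hugging Face tokenizer), so treat this only as a picture of the data shape.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, List[float], Dict[str, object]]


def make_record(question_id: int, text: str, is_duplicate: bool,
                embed: Callable[[str], List[float]]) -> Record:
    """Build one (id, vectors, metadata) tuple for a single question."""
    # Stand-in tokenizer (assumption): lowercase, then split into
    # words and individual punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    metadata = {
        "tokens": tokens,
        "is_duplicate": int(is_duplicate),  # stored as 0/1, as in the notebook
        "char_length": len(text),
    }
    return (str(question_id), embed(text), metadata)


def toy_embed(text: str) -> List[float]:
    """Hypothetical 2-d 'embedding' so the sketch runs without a model."""
    return [float(len(text)), float(text.count(" "))]


record = make_record(1, "Is it good?", False, toy_embed)
print(record)
```

In the real pipeline, `embed` would be replaced by something like `lambda t: model.encode(t).tolist()`.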

To create `'vectors'` and `'tokens'` we need to use our sentence transformer `encode` method and a tokenizer respectively. The tokenizer will come from HuggingFace transformers and *should* break text into words like so:

In [5]:
from transformers import AutoTokenizer

# transfo-xl tokenizer uses word-level encodings
tokenizer = AutoTokenizer.from_pretrained('transfo-xl-wt103')

tokenizer.tokenize('Purple is the BEST city in the forest'.lower())

['purple', 'is', 'the', 'best', 'city', 'in', 'the', 'forest']

We will process and upsert the data in one pass. Before we can upsert to Pinecone, we need to create an index, which we do via the Pinecone Python client. First we initialize our connection to Pinecone, which requires a [free API key](https://app.pinecone.io). You can find your environment in the [Pinecone console](https://app.pinecone.io) under **API Keys**.

In [6]:
import pinecone  # module-level API (init, create_index, Index) is used below
pinecone.init(
    api_key='YOUR_API_KEY',
    environment="YOUR_ENV"  # find next to API key in console
)

Create a new index...

In [7]:
pinecone.create_index(name='search-webinar', dimension=768)
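
The `dimension=768` above has to match the length of the vectors produced by the embedding model (768 for `all-distilroberta-v1`). As a minimal sketch, assuming records in the `(id, vectors, metadata)` layout described earlier, a pre-upsert check could look like this; `check_dimensions` and `INDEX_DIMENSION` are hypothetical names, not part of the Pinecone client.

```python
from typing import Dict, List, Sequence, Tuple

INDEX_DIMENSION = 768  # assumed to equal the `dimension` passed to create_index

Record = Tuple[str, Sequence[float], Dict[str, object]]


def check_dimensions(records: Sequence[Record],
                     dimension: int = INDEX_DIMENSION) -> List[str]:
    """Return the ids of records whose vector length differs from `dimension`."""
    return [rec_id for rec_id, vector, _ in records if len(vector) != dimension]


records = [
    ("1", [0.0] * 768, {}),  # correct size
    ("2", [0.0] * 384, {}),  # wrong size, e.g. from a smaller model
]
print(check_dimensions(records))  # ['2']
```

With sentence-transformers, the model's output size can be read from `model.get_sentence_embedding_dimension()` rather than hard-coded.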

Then we connect to the index with:

In [8]:
index = pinecone.Index('search-webinar')

Now we can put all of this together, we will process our data in batches creating the *vectors* and *metadata* and upserting them to Pinecone as we go.

In [9]:
from tqdm.auto import tqdm  # progress bar

data = []

# loop through the dataset and build (id, vectors, metadata) tuples
for i, row in enumerate(tqdm(quora)):
    # each Quora row contains a pair of sentences, loop through both
    for pair in [0, 1]:
        text = row['questions']['text'][pair]
        # append the (id, vectors, metadata) tuple to our 'data' list
        data.append((
            str(row['questions']['id'][pair]),
            model.encode(text).tolist(),
            {
                'tokens': tokenizer.tokenize(text.lower()),
                'is_duplicate': int(row['is_duplicate']),
                'char_length': len(text)
            }
        ))
    # once we have 100 samples OR reach the last row, upsert to Pinecone
    if len(data) >= 100 or i == len(quora) - 1:
        index.upsert(vectors=data)
        # and now reset the data list
        data = []

  0%|          | 0/300 [00:00<?, ?it/s]
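
The batching pattern used in the loop can also be written as a small generator that always flushes the final, possibly smaller, batch. This is a sketch rather than the notebook's own code: `batched` and `fake_upsert` are hypothetical names, and `fake_upsert` merely records batch sizes in place of `index.upsert`.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items, including a final partial batch."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole batch
        yield batch


upserted_sizes: List[int] = []


def fake_upsert(vectors: List[int]) -> None:
    """Stand-in for index.upsert(vectors=...); only records the batch size."""
    upserted_sizes.append(len(vectors))


for batch in batched(range(250), 100):
    fake_upsert(batch)

print(upserted_sizes)  # [100, 100, 50]
```

Keeping the flush logic inside the generator avoids off-by-one mistakes in an "is this the last row?" check.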

## Querying with Pinecone

We have our index and data ready to go, so let's move on to querying. First we need to create a *'query vector'* `xq`. This is a sentence (or in this case a question) encoded using the same model that we used to encode the Quora dataset.

*(If you are not running the full dataset, this will not return the same results! You can try the full dataset by removing `[:300]` from the `split` argument of `load_dataset` near the start of the notebook.)*

In [10]:
query = "which Quora queries are good?"
xq = model.encode([query]).tolist()

With this, we can return similar sentences using the `query` method.

In [11]:
result = index.query(vector=xq, top_k=5, includeMetadata=True)
result

{'results': [{'matches': [{'id': '46',
                           'metadata': {'char_length': 37.0,
                                        'is_duplicate': 0.0,
                                        'tokens': ['which',
                                                   'question',
                                                   'should',
                                                   'i',
                                                   'ask',
                                                   'on',
                                                   'quora',
                                                   '?']},
                           'score': 0.546800077,
                           'values': []},
                          {'id': '45',
                           'metadata': {'char_length': 47.0,
                                        'is_duplicate': 0.0,
                                        'tokens': ['what',
                                                   'are

We can use the ID values to map these back to the original sentences. To do so, we create a dictionary mapping IDs to text like so:

In [12]:
id2text = {}
for row in quora:
    for pair in [0, 1]:
        id2text[str(row['questions']['id'][pair])] = row['questions']['text'][pair]

Now we can map IDs to text.

In [13]:
for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

0.55
Which question should I ask on Quora?
0.47
What are the questions should not ask on Quora?
0.39
Why nobody answer my questions in Quora?
0.36
Why do people ask Quora questions which can be answered easily by Google?
0.35
Why is no one answering my questions in Quora?
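
The numbers printed above are similarity scores between the query vector and each stored vector. Assuming the index uses cosine similarity (we believe this is the default metric when `create_index` is called without a `metric` argument, as it is in this notebook), a score could be reproduced by hand along the lines of this sketch. The function name and toy vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Under this metric scores fall between -1 and 1, so values near zero or slightly negative indicate that a match is essentially unrelated to the query.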


Let's try again but this time using metadata filtering to only return questions marked as *not* duplicates.

In [14]:
result = index.query(vector=xq, top_k=5, filter={'is_duplicate': {'$eq': 0}})

for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

0.55
Which question should I ask on Quora?
0.47
What are the questions should not ask on Quora?
0.23
What is bestmytest.com?
0.22
What are the best quotes/lessons of the Assassin's Creed series?
0.2
What are the best YouTube channels to learn medicine?


Let's add a keyword search on top of this and see what appears when we exclude the word 'Quora'.

In [15]:
result = index.query(vector=xq, top_k=5, filter={'tokens': {'$nin': ['quora']}})

for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

0.23
What is bestmytest.com?
0.22
What are the best quotes/lessons of the Assassin's Creed series?
0.2
What are the best YouTube channels to learn medicine?
0.2
Which test series is the best for GATE computer science stream?
0.2
How can I ask a question without getting marked as ‘need to improve’?


Alternatively, we might change our query to be more generic - but then restrict the search to return questions containing one of several keywords.

In [16]:
query = "how to ask a good question?"
xq = model.encode([query]).tolist()

result = index.query(vector=xq, top_k=5, filter={'tokens': {
    '$nin': ['quora', 'quorans'],
    '$in': ['google', 'reddit', 'stackoverflow']
}})

for item in result['matches']:
    print(round(item['score'], 2))
    print(id2text[item['id']])

0.06
How Google helps in spam ranking adjustment of the search results?
-0.01
If I do not monetize YouTube videos & upload copyright content, then are there chances that Google may block my account?
-0.02
What is the distribution of traffic between Google organic search results? e.g. #1 vs. #2 in rankings, first page vs. second page
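
To make the filter behavior in these queries easier to reason about, here is a plain-Python sketch of how `$eq`, `$in`, and `$nin` could be evaluated against a record's metadata. It rests on assumptions inferred from the results above: for a list-valued field such as `tokens`, `$in` matches when any element is in the given list, `$nin` matches when none is, and multiple conditions are combined with AND. The `matches` function is hypothetical and is not Pinecone's implementation.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], flt: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if `metadata` satisfies every condition in `flt` (assumed AND)."""
    for field, conditions in flt.items():
        value = metadata.get(field)
        # Treat scalar fields as one-element lists for $in / $nin.
        values = value if isinstance(value, list) else [value]
        for op, operand in conditions.items():
            if op == "$eq":
                ok = value == operand
            elif op == "$in":
                ok = any(v in operand for v in values)
            elif op == "$nin":
                ok = not any(v in operand for v in values)
            else:
                raise ValueError(f"unsupported operator: {op}")
            if not ok:
                return False
    return True


docs = {
    "a": {"tokens": ["how", "google", "helps"], "is_duplicate": 0},
    "b": {"tokens": ["quora", "google"], "is_duplicate": 0},
    "c": {"tokens": ["best", "youtube"], "is_duplicate": 1},
}
keyword_filter = {"tokens": {"$nin": ["quora", "quorans"],
                             "$in": ["google", "reddit", "stackoverflow"]}}
kept = [doc_id for doc_id, meta in docs.items() if matches(meta, keyword_filter)]
print(kept)  # ['a']
```

Only records passing the filter are candidates for the similarity ranking, which is consistent with the last query returning just three low-scoring matches.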


And there is our demo on semantic search with sentence transformers and Pinecone.